Thursday, November 29, 2012

PHP : Web Automation - [11.11] Converting HTML to ASCII

11.11.1 Problem

You need to convert HTML to readable, formatted ASCII text.

11.11.2 Solution

If you have access to an external program that formats HTML as ASCII, such as lynx, call it like so:
$file = escapeshellarg($file);
$ascii = `lynx -dump $file`;
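
The backtick call returns nothing useful if lynx is not installed or the file cannot be read. A minimal sketch of a wrapper that reports failure instead follows; html_file_to_ascii() and its $formatter parameter are hypothetical names introduced here for illustration, not part of the recipe:

```php
<?php
// Hypothetical helper: run an external formatter on an HTML file and
// return its text output, or false if anything goes wrong.
// $formatter defaults to the lynx call used in the Solution.
function html_file_to_ascii($file, $formatter = 'lynx -dump')
{
    if (!is_readable($file)) {
        return false;                       // nothing to convert
    }
    // Only the filename is escaped; $formatter is trusted, fixed text.
    $output = shell_exec($formatter . ' ' . escapeshellarg($file));
    // shell_exec() gives null (or false) on failure or when there is no output
    return ($output === null || $output === false) ? false : $output;
}
```

Call it as $ascii = html_file_to_ascii('page.html'); and check for false before using the result.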

11.11.3 Discussion

If you can't use an external formatter, the pc_html2ascii( ) function shown in Example 11-4 handles a reasonable subset of HTML (no tables or frames, though).
Example 11-4. pc_html2ascii( )
function pc_html2ascii($s) {
  // convert links
  $s = preg_replace('/<a\s+.*?href="?([^\" >]*)"?[^>]*>(.*?)<\/a>/i',
                    '$2 ($1)', $s);

  // convert <br>, <hr>, <p>, <div> to line breaks
  $s = preg_replace('@<(b|h)r[^>]*>@i',"\n",$s);
  $s = preg_replace('@<p[^>]*>@i',"\n\n",$s);
  $s = preg_replace('@<div[^>]*>(.*?)</div>@i',"\n".'$1'."\n",$s);
  
  // convert bold and italic
  $s = preg_replace('@<b[^>]*>(.*?)</b>@i','*$1*',$s);
  $s = preg_replace('@<i[^>]*>(.*?)</i>@i','/$1/',$s);

  // decode named entities
  $s = strtr($s,array_flip(get_html_translation_table(HTML_ENTITIES)));

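  // decode numbered entities such as &#65; (the /e modifier no longer
  // exists in PHP 7+, so a callback does the conversion)
  $s = preg_replace_callback('/&#(\d+);/',
                             function ($m) { return chr((int) $m[1]); }, $s);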
  
  // remove any remaining tags
  $s = strip_tags($s);
  
  return $s;
}
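
Example 11-4 decodes named and numbered entities in two separate steps, and chr() can only produce single bytes (code points 0-255). If UTF-8 output is acceptable (an assumption; the recipe does not name an encoding), PHP's built-in html_entity_decode() can cover named, decimal, and hexadecimal entities in one call. A minimal sketch, with decode_entities_utf8() as a hypothetical helper name:

```php
<?php
// Hypothetical helper, not part of Example 11-4: decode all HTML
// entities in $s to UTF-8 text using the built-in decoder.
function decode_entities_utf8($s)
{
    // ENT_QUOTES also decodes &quot; and &#039;
    // ENT_HTML5 selects the HTML5 named-entity table
    return html_entity_decode($s, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

echo decode_entities_utf8('caf&eacute; &#8364;5 &#x41;&amp;B'), "\n";
// prints: café €5 A&B
```

Inside pc_html2ascii( ), a call like this could stand in for both entity-decoding steps, as long as it still runs before strip_tags( ).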