Thursday, November 29, 2012

PHP : Web Automation - [11.12] Removing HTML and PHP Tags

11.12.1 Problem

You want to remove HTML and PHP tags from a string or file.

11.12.2 Solution

Use strip_tags( ) to remove HTML and PHP tags from a string:
$html = '<a href="http://www.oreilly.com">I <b>love computer books.</b></a>';
print strip_tags($html);
I love computer books.
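Use fgetss( ) to remove them from a file as you read in lines. Note that fgetss( ) was deprecated in PHP 7.3 and removed in PHP 8.0, so this works only on older versions:
$fh = fopen('test.html','r') or die($php_errormsg);
// Compare against false so a line that evaluates to false (such as "0") doesn't end the loop early
while (($s = fgetss($fh,1024)) !== false) {
    print $s;
}
fclose($fh) or die($php_errormsg);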

11.12.3 Discussion

While fgetss( ) is convenient if you need to strip tags from a file as you read it in, it may get confused if tags span lines or if they span the buffer that fgetss( ) reads from the file. At the price of increased memory usage, reading the entire file into a string provides better results:
$no_tags = strip_tags(file_get_contents('test.html'));
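As a minimal sketch of the whole-file approach, the following writes a small sample file whose <a> tag spans two lines, reads it back in one piece, and strips the tags in a single pass. The temporary file and its sample markup are made up for illustration and are not part of the original recipe.

```php
<?php
// Create a throwaway file so the sketch is self-contained.
// The sample markup is hypothetical; its <a> tag is split across two lines.
$path = tempnam(sys_get_temp_dir(), 'strip');
file_put_contents(
    $path,
    "<p>I <a\n  href=\"http://www.oreilly.com\">love</a> computer books.</p>\n"
);

// Read the entire file into one string, then strip tags in a single pass,
// so strip_tags() sees the complete tag even though it spans a line break.
$contents = file_get_contents($path);
if ($contents === false) {
    die("Could not read $path");
}
$no_tags = strip_tags($contents);

print $no_tags;   // I love computer books.

// Clean up the temporary file.
unlink($path);
```

Because the whole file is held in memory, this is best suited to files of modest size.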
Both strip_tags( ) and fgetss( ) can be told not to remove certain tags by specifying those tags as the last argument. The tag specification is case-insensitive, and for pairs of tags, you only have to specify the opening tag. For example, this removes all but <b></b> tags from $html:
$html = '<a href="http://www.oreilly.com">I <b>love</b> computer books.</a>';
print strip_tags($html,'<b>');
I <b>love</b> computer books.
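To illustrate the case-insensitive matching and the use of more than one allowed tag, here is a short sketch. The sample string is invented for the example; only strip_tags( ) itself comes from the recipe.

```php
<?php
// Hypothetical input mixing uppercase and lowercase tag names.
$html = '<P>I <B>love</B> <i>computer</i> <a href="http://www.oreilly.com">books</a>.</P>';

// Allow <b> and <i>; list only the opening tags, one after another.
// The lowercase <b> in the allow list also matches the uppercase <B> pair.
$kept = strip_tags($html, '<b><i>');

// The <B> and <i> pairs survive; <P> and <a> are removed, their text is kept.
print $kept;
```

Keep in mind that allowed tags are passed through with their attributes intact, so an allow list is not a substitute for proper sanitizing of untrusted HTML.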