13.8.1 Problem
You want to capture text inside HTML
tags. For example, you want to find all the headings in an HTML document.
13.8.2 Solution
Read the HTML file into a string and use nongreedy matching in
your pattern:
$html = join('',file($file));
preg_match_all('#<h([1-6])>(.+?)</h\1>#is', $html, $matches);
In this example, $matches[1] contains an array of heading levels and
$matches[2] contains an array of the captured heading text.
13.8.3 Discussion
Fully parsing HTML with a simple regular expression is difficult.
This is one advantage of using XHTML; because it must be well formed, it's
significantly easier to validate and parse.
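As a small illustration of that difficulty, the following sketch runs the Solution's pattern over a heading that carries an attribute. The sample markup and the class name are hypothetical; the point is only that the pattern expects a bare <h1> through <h6> tag and silently skips anything else.

```php
<?php
// Hypothetical input: the first heading has an attribute, the second does not.
$html = "<h1 class=\"title\">PHP Cookbook</h1>\n<h2>Regular Expressions</h2>";

// Same pattern as in the Solution.
$count = preg_match_all('#<h([1-6])>(.+?)</h\1>#is', $html, $matches);

print $count . "\n";     // number of headings the pattern found
print_r($matches[2]);    // only the attribute-free heading is captured
```

Here the <h1> heading is missed because the pattern requires ">" immediately after the level digit.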
For instance, the pattern in the Solution is smart enough to
find only matching headings, so <h1>Dr.
Strangelove</h1> is okay, because it's wrapped inside
<h1> tags, but not <h2>How I
Learned to Stop Worrying and
Love the Bomb</h3>, because the opening tag is
an <h2> while the closing tag is not.
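The backreference behavior can be checked directly. This is a minimal sketch that applies the Solution's pattern to the two headings above, using preg_match() because each test string holds a single heading; the variable names are chosen only for illustration.

```php
<?php
$pattern = '#<h([1-6])>(.+?)</h\1>#is';

// Opening and closing levels agree, so the backreference \1 is satisfied.
$matched = preg_match($pattern, '<h1>Dr. Strangelove</h1>', $m);

// Opens as <h2> but closes as </h3>, so \1 (which holds "2") never matches.
$mismatched = preg_match(
    $pattern,
    '<h2>How I Learned to Stop Worrying and Love the Bomb</h3>',
    $n
);

print $matched . "\n";      // 1
print $m[2] . "\n";         // Dr. Strangelove
print $mismatched . "\n";   // 0
```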
This technique also works for finding all text inside bold and
italic tags:
$html = join('',file($file));
preg_match_all('#<([bi])>(.+?)</\1>#is', $html, $matches);
However, it breaks on nested elements. Using that regular
expression on:
<b>Dr. Strangelove or: <i>How I Learned to Stop Worrying and Love the Bomb</i></b>
doesn't capture the text inside the <i> tags as
a separate item.
This wasn't a problem earlier; because headings are block-level
elements, it's illegal to nest them. However, as inline elements, nested bold
and italic tags are valid.
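To see the nesting limitation concretely, the sketch below runs the bold/italic pattern over the nested example. After the outer <b> element matches, the search resumes past its closing tag, so the inner <i> element is never examined on its own.

```php
<?php
$html = '<b>Dr. Strangelove or: <i>How I Learned to Stop Worrying and Love the Bomb</i></b>';

$count = preg_match_all('#<([bi])>(.+?)</\1>#is', $html, $matches);

print $count . "\n";            // 1: only the outer element is found
print $matches[1][0] . "\n";    // b
print $matches[2][0] . "\n";    // the <i>...</i> markup is still embedded in the text
```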
Captured text can be processed by looping through the array of
matches. For example, this code parses a document for its headings and
pretty-prints them with indentation according to the heading level:
$html = join('',file($file));
preg_match_all('#<h([1-6])>(.+?)</h\1>#is', $html, $matches);
for ($i = 0, $j = count($matches[0]); $i < $j; $i++) {
    print str_repeat(' ', 2 * ($matches[1][$i] - 1)) . $matches[2][$i] . "\n";
}
So, with one representation of this recipe in HTML:
$html =<<<_END_
<h1>PHP Cookbook</h1>
Other Chapters
<h2>Regular Expressions</h2>
Other Recipes
<h3>Capturing Text Inside of HTML Tags</h3>
<h4>Problem</h4>
<h4>Solution</h4>
<h4>Discussion</h4>
<h4>See Also</h4>
_END_;
preg_match_all('#<h([1-6])>(.+?)</h\1>#is', $html, $matches);
for ($i = 0, $j = count($matches[0]); $i < $j; $i++) {
    print str_repeat(' ', 2 * ($matches[1][$i] - 1)) . $matches[2][$i] . "\n";
}
You get:
PHP Cookbook
  Regular Expressions
    Capturing Text Inside of HTML Tags
      Problem
      Solution
      Discussion
      See Also
By capturing the heading level and heading text separately, you
can directly access the level and treat it as an integer when calculating the
indentation size. To avoid a two-space indent for all lines, subtract 1 from the
level.
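One possible variation, not part of the recipe itself, is to pass the PREG_SET_ORDER flag so that each match arrives as its own array. The sketch below assumes a short hypothetical input string and collects the indented lines in a $lines array (a name introduced here for illustration) before printing them.

```php
<?php
// Hypothetical sample input; any HTML with simple <h1>..<h6> tags works.
$html = "<h1>PHP Cookbook</h1>\n"
      . "<h2>Regular Expressions</h2>\n"
      . "<h3>Capturing Text Inside of HTML Tags</h3>";

// With PREG_SET_ORDER, each $m is one match: $m[0] is the full match,
// $m[1] the heading level, and $m[2] the heading text.
preg_match_all('#<h([1-6])>(.+?)</h\1>#is', $html, $matches, PREG_SET_ORDER);

$lines = [];
foreach ($matches as $m) {
    $level = (int) $m[1];   // treat the captured level as an integer
    $lines[] = str_repeat(' ', 2 * ($level - 1)) . $m[2];
}

print implode("\n", $lines) . "\n";
```

This avoids indexing parallel arrays by position, at the cost of a slightly different result layout from the default ordering used in the recipe.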