13.5.1 Problem
You want your pattern to match the smallest possible string instead of the largest.
13.5.2 Solution
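Place a ? after a quantifier to make that part of the pattern nongreedy:
// find all bolded sections
preg_match_all('#<b>.+?</b>#', $html, $matches);
Or use the U pattern modifier to invert every quantifier in the pattern from greedy to nongreedy:
// find all bolded sections
preg_match_all('#<b>.+</b>#U', $html, $matches);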
13.5.3 Discussion
By default, the quantifiers in PHP regular expressions are what's known as
greedy. This means a quantifier always tries to match as many characters
as possible.
For example, take the pattern p.*, which matches a
p and then 0 or more characters, and match it against the string
php. A greedy regular expression finds one match, because after it
grabs the opening p, it continues on and also matches the hp.
A nongreedy regular expression, on the other hand, finds a pair of matches. It
matches the opening p and then stops, because a nongreedy .* is satisfied with
zero characters. The search resumes at the h, which can't start a match, and a
second match then takes the closing p.
The following code shows that the greedy match finds only one
hit; the nongreedy ones find two:
print preg_match_all('/p.*/', "php") . "\n"; // greedy
print preg_match_all('/p.*?/', "php") . "\n"; // nongreedy
print preg_match_all('/p.*/U', "php") . "\n"; // nongreedy
1
2
2
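The counts alone don't show which characters each pattern consumed. As a minimal sketch, passing a third argument to preg_match_all( ) captures the matched substrings; the variable names $greedy and $nongreedy are chosen here only for illustration:

```php
<?php
// Capture the matched substrings, not just the number of matches.
preg_match_all('/p.*/', 'php', $greedy);
preg_match_all('/p.*?/', 'php', $nongreedy);

print_r($greedy[0]);     // one match: the whole string "php"
print_r($nongreedy[0]);  // two matches, each a lone "p"; the "h" is never matched
```

The nongreedy .*? settles for zero characters, so each match is just the letter p.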
Greedy matching is also known as maximal matching and nongreedy
matching can be called minimal matching, because these options
match either the maximum or minimum number of characters possible.
Initially, all regular expressions were strictly greedy.
Therefore, you can't use this syntax with ereg( ) or ereg_replace(
). Nongreedy matching isn't supported by the older engine that powers these
functions; instead, you must use Perl-compatible functions.
Nongreedy matching is frequently useful when trying to perform simplistic HTML parsing.
Let's say you want to find all text between bold tags. With greedy matching, you
get this:
$html = '<b>I am bold.</b> <i>I am italic.</i> <b>I am also bold.</b>';
preg_match_all('#<b>(.+)</b>#', $html, $bolds);
print_r($bolds[1]);
Array
(
[0] => I am bold.</b> <i>I am italic.</i> <b>I am also bold.
)
Because there's a second set of bold tags, the pattern extends
past the first </b>, which makes it impossible to correctly break
up the HTML. If you use minimal matching, each set of tags is self-contained:
$html = '<b>I am bold.</b> <i>I am italic.</i> <b>I am also bold.</b>';
preg_match_all('#<b>(.+?)</b>#', $html, $bolds);
print_r($bolds[1]);
Array
(
[0] => I am bold.
[1] => I am also bold.
)
Of course, this can break down if your markup isn't 100% valid,
and there are stray bold tags lying around.[2] If your goal is just to remove all (or
some) HTML tags from a block of text, you're better off not using a regular
expression. Instead, use the built-in function strip_tags( ); it's faster and it works correctly. See Section 11.12 for more details.
[2] It's possible to have valid HTML and still get into trouble, for instance, if you have bold tags inside a comment. A true HTML parser ignores the commented-out section, but our pattern won't.
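To make the comment pitfall and the strip_tags( ) alternative concrete, here is a short sketch. The sample markup, including the commented-out "draft" text, is invented for this illustration:

```php
<?php
$html = '<!-- <b>draft</b> --> <b>I am bold.</b> <i>I am italic.</i>';

// The nongreedy pattern can't tell that the first pair of tags sits inside a comment.
preg_match_all('#<b>(.+?)</b>#', $html, $bolds);
print_r($bolds[1]);   // holds both "draft" and "I am bold."

// strip_tags() removes the tags without any pattern matching.
$plain = strip_tags('<b>I am bold.</b> <i>I am italic.</i>');
print $plain . "\n";  // I am bold. I am italic.
```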
Finally, even though the idea of nongreedy matching comes from Perl, the /U modifier
is incompatible with Perl and is unique to PHP's Perl-compatible regular
expressions. It inverts all quantifiers, turning them from greedy to nongreedy
and vice versa. So, to get a greedy quantifier inside of a pattern
operating under a trailing /U, just add a ? to the end, the
same way you would normally turn a greedy quantifier into a nongreedy one.
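The inversion can be seen by running the bold-tag pattern under /U with and without the trailing ?. This is only a sketch; the variable names $minimal and $maximal are picked for this example:

```php
<?php
$html = '<b>I am bold.</b> <i>I am italic.</i> <b>I am also bold.</b>';

// Under /U, a bare quantifier is nongreedy...
preg_match_all('#<b>(.+)</b>#U', $html, $minimal);
// ...and adding ? flips that one quantifier back to greedy.
preg_match_all('#<b>(.+?)</b>#U', $html, $maximal);

print_r($minimal[1]);  // two entries, one per set of bold tags
print_r($maximal[1]);  // one entry running from the first <b> to the last </b>
```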