12.5.1 Problem
You want to parse
an XML document and format it on an event basis, such as when the parser
encounters a new opening or closing element tag. For instance, you want to turn
an RSS feed into HTML.
12.5.2 Solution
$xml = xml_parser_create();
$obj = new Parser_Object; // a class to assist with parsing
xml_set_object($xml,$obj);
xml_set_element_handler($xml, 'start_element', 'end_element');
xml_set_character_data_handler($xml, 'character_data');
xml_parser_set_option($xml, XML_OPTION_CASE_FOLDING, false);
$fp = fopen('data.xml', 'r') or die("Can't read XML data.");
while ($data = fread($fp, 4096)) {
    xml_parse($xml, $data, feof($fp)) or die("Can't parse XML data");
}
fclose($fp);
xml_parser_free($xml);
12.5.3 Discussion
These XML parsing functions require the expat library. However, because Apache 1.3.7 and later is
bundled with expat, this library is already installed on most machines.
Therefore, PHP enables these functions by default, and you don't need to
explicitly configure PHP to support XML.
expat parses XML documents and allows you to configure
the parser to call functions when it encounters different parts of the file,
such as an opening or closing element tag or character data (the text between
tags). Based on the tag name, you can then choose whether to format or ignore
the data. This is known as event-based parsing
and contrasts with DOM XML, which uses a tree-based parser.
A popular API for event-based XML parsing is SAX: Simple API
for XML. Originally developed only for Java, SAX has
spread to other languages. PHP's XML functions follow SAX conventions. For more
on the latest version of SAX — SAX2 — see SAX2 by
David Brownell (O'Reilly).
PHP supports two interfaces to expat: a procedural one
and an object-oriented one. Since the procedural interface practically forces
you to use global variables to accomplish any meaningful task, we prefer the
object-oriented version. With the object-oriented interface, you can bind an
object to the parser and interact with the object while processing XML. This
allows you to use object properties instead of global variables.
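As a minimal sketch of keeping parser state in an object property rather than a global, the following counts opening tags. The Element_Counter class is hypothetical and exists only for this illustration. The handlers are passed as array callables, which the handler-setting functions also accept, and the sketch assumes PHP 5.4 or later for the short array syntax.

```php
<?php
// Hypothetical helper class: the running total lives in a property,
// so no global variable is needed.
class Element_Counter {
    public $count = 0;

    public function start_element($parser, $tag, $attributes) {
        $this->count++;
    }

    public function end_element($parser, $tag) {
        // Nothing to do when a tag closes in this example.
    }
}

$counter = new Element_Counter;
$xml = xml_parser_create();

// Array callables tie both handlers to methods of $counter.
xml_set_element_handler($xml, [$counter, 'start_element'], [$counter, 'end_element']);
xml_parse($xml, '<a><b/><b/></a>', true) or die("Can't parse XML data");

echo $counter->count, "\n"; // prints 3: one <a> and two <b> elements
```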
Here's an example application of expat that shows how to process an RSS feed and transform it into HTML. For more on RSS, see Section 12.12. The script starts with the standard XML processing code, followed by
the objects created to parse RSS specifically:
$xml = xml_parser_create( );
$rss = new pc_RSS_parser;
xml_set_object($xml, $rss);
xml_set_element_handler($xml, 'start_element', 'end_element');
xml_set_character_data_handler($xml, 'character_data');
xml_parser_set_option($xml, XML_OPTION_CASE_FOLDING, false);
$feed = 'http://pear.php.net/rss.php';
$fp = fopen($feed, 'r') or die("Can't read RSS data.");
while ($data = fread($fp, 4096)) {
    xml_parse($xml, $data, feof($fp)) or die("Can't parse RSS data");
}
fclose($fp);
xml_parser_free($xml);
After creating a new XML parser and an instance of the
pc_RSS_parser class, configure the parser.
First, bind the object to the parser; this tells the parser to call the object's
methods instead of global functions. Then call xml_set_element_handler( ) and
xml_set_character_data_handler( ) to specify the method names the
parser should call when it encounters elements and character data. The first
argument to both functions is the parser instance; the other arguments are the
function names. With xml_set_element_handler( ), the middle and last
arguments are the functions to call when a tag opens and closes, respectively.
The xml_set_character_data_handler( ) function takes only one
additional argument — the function to call when it processes character data.
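To make the handler arguments concrete, here is a small sketch that records each event in an array. It uses anonymous functions as handlers instead of method names (an assumption made for brevity; it requires PHP 5.4 or later), and the $events array is introduced only for this illustration.

```php
<?php
$events = [];

$xml = xml_parser_create();
xml_parser_set_option($xml, XML_OPTION_CASE_FOLDING, false);

// The opening-tag handler receives the parser, the tag name, and an
// array of attributes; the closing-tag handler receives the parser
// and the tag name.
xml_set_element_handler(
    $xml,
    function ($parser, $tag, $attributes) use (&$events) {
        $events[] = "start:$tag";
    },
    function ($parser, $tag) use (&$events) {
        $events[] = "end:$tag";
    }
);

// The character data handler receives the parser and the text.
xml_set_character_data_handler(
    $xml,
    function ($parser, $data) use (&$events) {
        $events[] = "text:$data";
    }
);

xml_parse($xml, '<tag>data</tag>', true) or die("Can't parse XML data");
print_r($events); // start:tag, text:data, end:tag
```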
Because an object has been associated with our parser, when
that parser finds the string <tag>data</tag>, it calls
$rss->start_element( ) when it reaches
<tag>; $rss->character_data( ) when it reaches data; and
$rss->end_element( ) when it reaches
</tag>. The parser can't be configured to automatically call
individual methods for each specific tag; instead, you must handle this
yourself. However, the PEAR
package XML_Transform provides an easy way to assign handlers on a
tag-by-tag basis.
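One way to handle per-tag dispatch yourself is to build a method name from the tag and call it if it exists. This is only a sketch: the Tag_Dispatcher class and its open_ method prefix are hypothetical and are not part of PHP or of XML_Transform.

```php
<?php
// Hypothetical dispatcher: routes each opening tag to a method named
// open_<tag>, if the class defines one.
class Tag_Dispatcher {
    public $seen = [];

    public function start_element($parser, $tag, $attributes) {
        $method = 'open_' . $tag;
        if (method_exists($this, $method)) {
            $this->$method($attributes);
        }
        // Tags without a matching method are silently ignored.
    }

    public function end_element($parser, $tag) {
    }

    public function open_channel($attributes) {
        $this->seen[] = 'channel';
    }

    public function open_item($attributes) {
        $this->seen[] = 'item';
    }
}

$dispatcher = new Tag_Dispatcher;
$xml = xml_parser_create();
xml_parser_set_option($xml, XML_OPTION_CASE_FOLDING, false);
xml_set_element_handler($xml, [$dispatcher, 'start_element'], [$dispatcher, 'end_element']);
xml_parse($xml, '<rss><channel><item/><item/></channel></rss>', true)
    or die("Can't parse XML data");

print_r($dispatcher->seen); // channel, item, item
```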
The last XML parser configuration
option tells the parser not to automatically convert all tags to uppercase. By
default, the parser folds tags into capital letters, so <tag> and
<TAG> both become the same element. Since XML is case-sensitive,
and most feeds use lowercase element names, this feature should be disabled.
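The effect of case folding is easy to see by parsing the same document twice. In this sketch, collect_tag_names( ) is a hypothetical helper written for the illustration; it returns the tag names the opening-tag handler receives.

```php
<?php
// Hypothetical helper: parse a tiny document and return the tag names
// passed to the opening-tag handler.
function collect_tag_names($fold) {
    $names = [];
    $xml = xml_parser_create();
    xml_parser_set_option($xml, XML_OPTION_CASE_FOLDING, $fold);
    xml_set_element_handler(
        $xml,
        function ($parser, $tag, $attributes) use (&$names) {
            $names[] = $tag;
        },
        function ($parser, $tag) {
        }
    );
    xml_parse($xml, '<feed><Item/></feed>', true) or die("Can't parse XML data");
    return $names;
}

print_r(collect_tag_names(true));  // FEED, ITEM
print_r(collect_tag_names(false)); // feed, Item
```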
With the parser configured, feed the data to the parser:
$feed = 'http://pear.php.net/rss.php';
$fp = fopen($feed, 'r') or die("Can't read RSS data.");
while ($data = fread($fp, 4096)) {
    xml_parse($xml, $data, feof($fp)) or die("Can't parse RSS data");
}
fclose($fp);
In order to curb memory usage, load the file in 4096-byte
chunks and feed them to the parser one at a time. This means your
handler functions must accommodate text arriving in multiple
calls rather than assuming the entire string comes in all at once.
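The following sketch imitates chunked reading on a small scale by splitting an in-memory document into 8-byte pieces. The chunk size and the $calls counter are assumptions chosen only to make the behavior visible. How many times the handler fires depends on the parser, so the handler appends to $text instead of assigning to it.

```php
<?php
$text  = '';
$calls = 0;

$xml = xml_parser_create();
xml_set_character_data_handler(
    $xml,
    function ($parser, $data) use (&$text, &$calls) {
        $calls++;
        $text .= $data; // append: the text may arrive in several pieces
    }
);

$document = '<title>PHP Announcements</title>';
$chunks   = str_split($document, 8);
$last     = count($chunks) - 1;

foreach ($chunks as $i => $chunk) {
    // The third argument is true only for the final chunk.
    xml_parse($xml, $chunk, $i === $last) or die("Can't parse XML data");
}

echo "$calls call(s): $text\n";
```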
Last, while PHP cleans up any open parsers when the request
ends, you can also manually close the parser by calling xml_parser_free( ).
Now that the generic parsing is properly set up, add the
pc_RSS_item and
pc_RSS_parser classes, as shown in Examples 12-1 and 12-2, to handle an RSS document.
Example 12-1. pc_RSS_item
class pc_RSS_item {
    var $title = '';
    var $description = '';
    var $link = '';

    function display() {
        // escape every value, including the link, before printing it as HTML
        printf('<p><a href="%s">%s</a><br />%s</p>',
            htmlspecialchars($this->link),
            htmlspecialchars($this->title),
            htmlspecialchars($this->description));
    }
}
Example 12-2. pc_RSS_parser
class pc_RSS_parser {
    var $tag;   // name of the element currently being read inside an item
    var $item;  // the pc_RSS_item being built, if any

    function start_element($parser, $tag, $attributes) {
        if ('item' == $tag) {
            $this->item = new pc_RSS_item;
        } elseif (!empty($this->item)) {
            // inside an item: remember which child element this is
            $this->tag = $tag;
        }
    }

    function end_element($parser, $tag) {
        if ('item' == $tag) {
            $this->item->display();
            unset($this->item);
        }
    }

    function character_data($parser, $data) {
        if (!empty($this->item)) {
            // only title, description, and link are properties of pc_RSS_item
            if (isset($this->item->{$this->tag})) {
                $this->item->{$this->tag} .= trim($data);
            }
        }
    }
}
The pc_RSS_item class provides an interface to an
individual feed item. This removes the details of displaying each item from the
general parsing code and makes it easy to reset the data for a new item by
calling unset( ).
The pc_RSS_item::display( )
method prints out an HTML-formatted RSS item. It calls htmlspecialchars( ) to reencode any necessary entities,
because expat decodes them into regular characters while parsing the
document. This reencoding, however, breaks on feeds that place HTML in the title
and description instead of plaintext.
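A short sketch shows both sides of this trade-off. The two sample strings are made up for the illustration; they stand for what the character data handler has collected after the parser has decoded the feed's entities.

```php
<?php
// Plain text: reencoding produces valid HTML.
$plain = 'Tips & tricks';
echo htmlspecialchars($plain), "\n";  // Tips &amp; tricks

// A description that carried escaped HTML in the feed arrives decoded,
// and reencoding makes the tags show up literally on the page.
$markup = '<b>Big</b> news';
echo htmlspecialchars($markup), "\n"; // &lt;b&gt;Big&lt;/b&gt; news
```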
Within pc_RSS_parser, the start_element( ) method takes three parameters: the XML parser,
the name of the tag, and an array of attribute/value pairs (if any) from the
element. PHP automatically supplies these values to the handler as part of the
parsing process.
The start_element( ) method checks the value of
$tag. If it's item, the parser's found a new RSS item, and a
new pc_RSS_item object is instantiated. Otherwise, it checks to see if
$this->item is empty( ); if it isn't, the parser is inside
an item element. It's then necessary to record the tag's name, so that
the character_data( ) method knows which
property to assign its value to. If it is empty, this part of the RSS feed isn't
necessary for our application, and it's ignored.
When the parser finds a closing item tag, the
corresponding end_element( ) method first
prints the RSS item, then cleans up by deleting the object.
Finally, the character_data( ) method is responsible
for assigning the values of title, description, and
link to the RSS item. After making sure it's inside an item
element, it checks that the current tag is one of the properties of
pc_RSS_item. Without this check, if the parser encountered an element
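other than those three, its value would also be assigned to the object. The
curly braces ({ }) are needed to set the object property dereferencing
order. Notice how trim($data) is appended to the property instead of a
direct assignment. This is done to handle cases in which the character data is
split across the 4096-byte chunks retrieved by fread( ); it also
removes the surrounding whitespace found in the RSS feed. Be aware that the
parser may also deliver one element's text in several calls for other reasons
(around an entity such as &amp;, for example), so trimming each piece can drop
a space that falls at such a boundary.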
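Because trim( ) runs on every call, a space that happens to sit at the edge of one piece of character data is lost. One alternative, sketched below, is to append the raw data and trim once when the element closes. The Buffered_Text_Parser class is hypothetical and is not a drop-in replacement for pc_RSS_parser; it only shows the buffering idea.

```php
<?php
// Hypothetical variant: collect raw character data and trim it once,
// when the closing tag arrives.
class Buffered_Text_Parser {
    public $values = [];
    private $buffer = '';

    public function start_element($parser, $tag, $attributes) {
        $this->buffer = ''; // start fresh for each element
    }

    public function character_data($parser, $data) {
        $this->buffer .= $data; // no trimming here
    }

    public function end_element($parser, $tag) {
        $this->values[$tag] = trim($this->buffer);
        $this->buffer = '';
    }
}

$handler = new Buffered_Text_Parser;
$xml = xml_parser_create();
xml_parser_set_option($xml, XML_OPTION_CASE_FOLDING, false);
xml_set_element_handler($xml, [$handler, 'start_element'], [$handler, 'end_element']);
xml_set_character_data_handler($xml, [$handler, 'character_data']);

$document = "<item>\n  <title> AT&amp;T rocks &amp; rolls </title>\n</item>";
xml_parse($xml, $document, true) or die("Can't parse XML data");

echo $handler->values['title'], "\n"; // AT&T rocks & rolls
```

Inner spaces survive however the parser splits the text, and the surrounding whitespace is still removed.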
If you run the code on this sample RSS feed:
<?xml version="1.0"?>
<rss version="0.93">
<channel>
<title>PHP Announcements</title>
<link>http://www.php.net/</link>
<description>All the latest information on PHP.</description>
<item>
<title>PHP 5.0 Released!</title>
<link>http://www.php.net/downloads.php</link>
<description>The newest version of PHP is now available.</description>
</item>
</channel>
</rss>
It produces this HTML:
<p><a href="http://www.php.net/downloads.php">PHP 5.0 Released!</a><br />The newest version of PHP is now available.</p>