Most of the time, PHP is part
of a web server, sending content to browsers. Even when you run it from the command
line, it usually performs a task and then prints some output. PHP can also play
the role of a web browser, however, retrieving URLs and then operating on their
content. Most recipes in this chapter cover retrieving URLs and
processing the results, although there are a few other tasks in here as well,
such as using templates and processing server logs.
There are four ways to retrieve a remote URL in PHP. Choosing
one method over another depends on your needs for simplicity, control, and
portability. The four methods are to use fopen( ), fsockopen( ), the cURL
extension, or the HTTP_Request class from PEAR.
Using fopen( ) is simple and convenient. We discuss it
in Section 11.2. The fopen( ) function automatically follows redirects, so if
you use this function to retrieve the directory http://www.example.com/people and the server redirects you to
http://www.example.com/people/, you'll get the contents of the
directory index page, not a message telling you that the URL has moved. The
fopen( ) function also works with both HTTP and FTP. The downsides to
fopen( ) are that it can handle only HTTP GET requests (not HEAD or
POST), it can't send additional headers or cookies with the request, and it
retrieves only the response body, not the response headers.
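For instance, grabbing a page with fopen( ) can be as short as this sketch, assuming the allow_url_fopen configuration directive is on (www.example.com is just a placeholder, and real code wants more error handling than a bare or die()):

<?php
// Open the URL as if it were a local file; fopen( ) speaks HTTP
// and follows redirects for us
$fh = fopen('http://www.example.com/people/', 'r') or die("Can't open URL");

$page = '';
while (! feof($fh)) {
    $page .= fgets($fh, 1048576);  // accumulate the response body
}
fclose($fh);

print $page;
?>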
Using fsockopen( ) requires
more work but gives you more flexibility. We use fsockopen( ) in Section 11.3. After opening a socket with fsockopen( ), you need to print
the appropriate HTTP request to that socket and then read and parse the
response. This lets you add headers to the request and gives you access to all
the response headers. However, you need additional code to parse the
response properly and take any appropriate action, such as following a
redirect.
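To give a flavor of the extra work, here's a hypothetical GET request made by hand over a socket; unlike fopen( ), nothing below follows redirects or otherwise interprets the response for you:

<?php
// Open a TCP connection to the web server ourselves
$fp = fsockopen('www.example.com', 80, $errno, $errstr, 30);
if (! $fp) { die("Can't connect: $errstr ($errno)"); }

// Write a complete HTTP request, including any extra headers we like
fputs($fp, "GET /people/ HTTP/1.0\r\n");
fputs($fp, "Host: www.example.com\r\n");
fputs($fp, "User-Agent: PHP Example\r\n\r\n");

// Read back the raw response: status line, headers, and body
$response = '';
while (! feof($fp)) {
    $response .= fgets($fp, 1024);
}
fclose($fp);

// The headers and body are separated by the first blank line
list($headers, $body) = explode("\r\n\r\n", $response, 2);
?>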
If you have access to the cURL extension or PEAR's
HTTP_Request class, you should use those rather than fsockopen(
). cURL supports a number of different protocols (including HTTPS,
discussed in Section 11.6) and gives you access to response headers. We use cURL in most of the
recipes in this chapter. To use cURL, you must have the cURL library installed,
available at http://curl.haxx.se. Also, PHP must be built with the
--with-curl configuration option.
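A bare-bones cURL fetch looks something like this (again, the URL is a placeholder):

<?php
$c = curl_init('https://www.example.com/people/');

// Hand the response back as a string instead of printing it
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
// Follow Location: redirects, as fopen( ) would
curl_setopt($c, CURLOPT_FOLLOWLOCATION, true);

$page = curl_exec($c);
if ($page === false) {
    die('cURL error: ' . curl_error($c));
}
curl_close($c);
?>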
PEAR's HTTP_Request class, which we use in Section 11.3, Section 11.4, and Section 11.5, doesn't support HTTPS, but does give you access to headers and can use
any HTTP method. If this PEAR module isn't installed on your system, you can
download it from http://pear.php.net/get/HTTP_Request. As long as the module's
files are in your include_path, you can use it, making it a very
portable solution.
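Using the class follows PEAR's usual object-oriented style; a minimal GET request, based on the class's documented interface, looks like this:

<?php
require 'HTTP/Request.php';

// Build the request; HTTP_REQUEST_METHOD_HEAD and _POST work, too
$req = new HTTP_Request('http://www.example.com/people/');
$req->setMethod(HTTP_REQUEST_METHOD_GET);
$req->addHeader('User-Agent', 'PHP Example');

$req->sendRequest();

// Unlike fopen( ), the response code and headers are available
$code = $req->getResponseCode();
$body = $req->getResponseBody();
?>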
Section 11.7 helps you go behind the scenes of an HTTP request to examine the
headers in a request and response. If a request you're making from a program
isn't giving you the results you're looking for, examining the headers often
provides clues as to what's wrong.
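One quick way to see response headers (used here as an illustration, not necessarily the technique Section 11.7 uses) is to ask cURL to include them in what it returns:

<?php
$c = curl_init('http://www.example.com/people/');
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
// Prepend the response headers to the returned string
curl_setopt($c, CURLOPT_HEADER, true);

$response = curl_exec($c);
curl_close($c);

// Everything before the first blank line is headers
list($headers, $body) = explode("\r\n\r\n", $response, 2);
print $headers;
?>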
Once you've retrieved the contents of a web page into a
program, use Section 11.8 through Section
11.12 to help you manipulate those page contents. Section 11.8 demonstrates how to mark up certain words in a page with blocks of
color. This technique is useful for highlighting search terms, for example. Section 11.9 provides a function to find all the links in a page. This is an
essential building block for a web spider or a link checker. Converting between
plain ASCII and HTML is covered in Section 11.10 and Section 11.11. Section 11.12 shows how to remove all HTML and PHP tags from a web page.
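As a taste of what's involved, a link extractor can be as simple as one regular expression; this sketch (not necessarily the approach Section 11.9 takes) handles only double-quoted href attributes:

<?php
// Find the href values of all <a> tags in a chunk of HTML
function extract_links($html) {
    preg_match_all('/<a\s[^>]*href="([^"]+)"/i', $html, $matches);
    return $matches[1];
}

$html = '<a href="http://www.example.com/">Example</a>';
foreach (extract_links($html) as $link) {
    print "$link\n";
}
?>

Going the other way, PHP's built-in strip_tags( ) function does much of the work of removing markup.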
Another kind of page manipulation is using a templating system.
Discussed in Section 11.13, templates give you the freedom to change the look and feel of your web
pages without changing the PHP plumbing that populates the pages with dynamic
data. Similarly, you can change the code that drives the pages without
affecting the look and feel.
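In its simplest form, a template is just a file of HTML with placeholders in it; this sketch assumes a hypothetical template.html containing markers such as {NAME}:

<?php
// Load the layout, which designers can edit without touching PHP
$template = file_get_contents('template.html');

// Swap each placeholder for its dynamic value
$values = array('{NAME}'  => 'Alice',
                '{TOTAL}' => '42');

print strtr($template, $values);
?>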
Section 11.14 discusses a common server administration task: parsing your web
server's access log files.
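For example, each line of a Common Log Format access log can be picked apart with a single regular expression; the log's location below is an assumption:

<?php
// A Common Log Format line looks like:
// 10.0.0.1 - - [21/Feb/2002:10:17:31 -0800] "GET /people/ HTTP/1.0" 200 4923
$pattern = '/^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d+) (\S+)/';

$fh = fopen('/var/log/httpd/access_log', 'r') or die("Can't open log");
$hits = array();
while (! feof($fh)) {
    $line = fgets($fh, 4096);
    if (preg_match($pattern, $line, $match)) {
        // $match[1] is the client address, $match[6] the status code
        $hits[$match[1]] = isset($hits[$match[1]]) ? $hits[$match[1]] + 1 : 1;
    }
}
fclose($fh);

arsort($hits);  // busiest clients first
print_r($hits);
?>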
Two sample programs use the link extractor from Section 11.9. The program in Section 11.15 scans the links in a page and reports which are still valid, which
have been moved, and which no longer work. The program in Section 11.16 reports on the freshness of links. It tells you when a linked-to page
was last modified and whether it's been moved.
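The heart of such a checker is simple: make a HEAD request for each extracted link and look at the status code. A hypothetical version of that piece, using cURL:

<?php
// Return the HTTP status code for a URL without fetching the body
function check_link($url) {
    $c = curl_init($url);
    curl_setopt($c, CURLOPT_NOBODY, true);          // send HEAD, not GET
    curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
    curl_exec($c);
    $code = curl_getinfo($c, CURLINFO_HTTP_CODE);
    curl_close($c);
    return $code;  // 200 is OK, 301/302 moved, 404 gone
}

print check_link('http://www.example.com/');
?>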