Example 11-6, fresh-links.php, is a modification of the program in Section 11.15
that produces a list of links and their last-modified times. If the server on
which a URL lives doesn't provide a last-modified time, the program reports the
time the URL was requested instead. If the program can't retrieve the URL
successfully, it prints the status code it got when it tried to retrieve the
URL. Run the program by passing it a URL to scan for links:
% fresh-links.php http://www.oreilly.com
http://www.oreilly.com/index.html: Fri Aug 16 16:48:34 2002
http://www.oreillynet.com: Mon Aug 19 10:18:54 2002
http://conferences.oreilly.com: Fri Aug 16 19:41:46 2002
http://international.oreilly.com: Fri Mar 29 18:06:32 2002
http://safari.oreilly.com: 302
http://www.oreilly.com/catalog/search.html: Tue Apr 2 19:05:57 2002
http://www.oreilly.com/oreilly/press/: 302
...
This output is from a run of the program at about 10:20 A.M. EDT on August 19,
2002. The link to http://www.oreillynet.com is very fresh, but the others are of
varying ages. The link to http://www.oreilly.com/oreilly/press/ doesn't have a
last-modified time next to it; instead, it has an HTTP status code (302). This
means it's been moved elsewhere, as reported by the output of stale-links.php in
Section 11.15.
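The rule behind each line of that output can be sketched in isolation. The describe_link( ) helper below is hypothetical (it is not part of fresh-links.php), and it uses gmdate( ) rather than the program's strftime('%c') so that the result doesn't depend on the local time zone:

```php
<?php
// Hypothetical helper (not part of fresh-links.php): build the text that
// follows "URL: " from the status code, the Last-Modified header value
// (or null), and the time the request was made.
function describe_link($status, $last_modified, $requested_at) {
    if ($status != 200) {
        // unsuccessful retrieval: report the status code itself
        return (string) $status;
    }
    // successful retrieval: prefer the server's Last-Modified value...
    $ts = $last_modified ? strtotime($last_modified) : false;
    if ($ts === false) {
        // ...and fall back to the request time if it is missing or unparsable
        $ts = $requested_at;
    }
    return gmdate('D M j H:i:s Y', $ts);
}

$requested_at = gmmktime(14, 20, 0, 8, 19, 2002);
print describe_link(200, 'Fri, 16 Aug 2002 20:48:34 GMT', $requested_at) . "\n";
print describe_link(200, null, $requested_at) . "\n";
print describe_link(302, null, $requested_at) . "\n";
```

The three calls cover the three cases described above: a page with a Last-Modified header, a page without one, and a link that couldn't be retrieved with a 200 status.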
The program to find fresh links is conceptually almost
identical to the program to find stale links. It uses the same
pc_link_extractor( ) function from Section 11.10; however, it uses the HTTP_Request
class instead of cURL to retrieve URLs. The code to get the base URL specified
on the command line is inside a loop so that it can follow any redirects that
are returned.
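The redirect-following idea can be tried without touching the network. This is a minimal sketch, not the program's code: follow_redirects( ), the $fetch callback, and the $max_hops limit are all assumptions introduced here, and the stub array stands in for real HTTP responses.

```php
<?php
// Hypothetical sketch: $fetch is any callable that, given a URL, returns
// array('code' => int, 'location' => string or null).
function follow_redirects($url, $fetch, $max_hops = 10) {
    for ($hop = 0; $hop <= $max_hops; $hop++) {
        $response = call_user_func($fetch, $url);
        $is_redirect = intval($response['code'] / 100) == 3;
        if ($is_redirect && !empty($response['location'])) {
            // 3xx with a Location header: try again at the new address
            $url = $response['location'];
            continue;
        }
        // anything else (200, 404, a 3xx without Location) ends the loop
        return array($url, $response['code']);
    }
    // too many hops: give up rather than loop forever
    return array($url, null);
}

// stub responses standing in for real HTTP requests
$responses = array(
    'http://example.com/old' => array('code' => 301, 'location' => 'http://example.com/new'),
    'http://example.com/new' => array('code' => 200, 'location' => null),
);
$fetch = function ($url) use ($responses) {
    return $responses[$url];
};

list($final_url, $code) = follow_redirects('http://example.com/old', $fetch);
print "$final_url ($code)\n";
```

The hop limit is a safety net added for this sketch; it guards against two URLs that redirect to each other.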
Once a page has been retrieved, the program uses the pc_link_extractor( )
function to get a list of links in the page. Then, after prepending a base URL
to each link if necessary, it calls sendRequest( ) on each link found in the
original page. Since only the headers of these responses are needed, it uses the
HEAD method instead of GET. Instead of printing a new location for moved links,
as stale-links.php does, the program prints a formatted version of the
Last-Modified header if it's available.
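The link-preparation step (compute a base URL, skip https links, prepend the base to relative links, and visit each link only once) can be sketched on its own. The regular expressions are the ones the program uses; the base_of( ) and links_to_check( ) function names and the example.com URLs are assumptions made for illustration.

```php
<?php
// same base-URL rule as the program: keep everything up to the last slash
function base_of($url) {
    return preg_replace('{^(.*/)([^/]*)$}', '\\1', $url);
}

// hypothetical helper: turn extracted hrefs into a unique list of URLs to check
function links_to_check($page_url, $hrefs) {
    $base_url = base_of($page_url);
    $seen = array();
    foreach ($hrefs as $href) {
        if (preg_match('{^https://}', $href)) {
            continue;                   // the program skips https URLs
        }
        if (!preg_match('{^(http|mailto):}', $href)) {
            $href = $base_url . $href;  // relative link: prepend the base
        }
        $seen[$href] = true;            // array keys remove duplicates
    }
    return array_keys($seen);
}

print_r(links_to_check(
    'http://www.example.com/dir/index.html',
    array('a.html', 'http://other.example.com/', 'a.html', 'https://secure.example.com/')
));
```

Because the base is simply prepended, links that begin with / or ../ are not resolved the way a browser would resolve them.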
Example 11-6. fresh-links.php
require 'HTTP/Request.php';

function pc_link_extractor($s) {
    $a = array();
    if (preg_match_all('/<A\s+.*?HREF=[\"\']?([^\"\' >]*)[\"\']?[^>]*>(.*?)<\/A>/i',
                       $s,$matches,PREG_SET_ORDER)) {
        foreach($matches as $match) {
            array_push($a,array($match[1],$match[2]));
        }
    }
    return $a;
}

$url = $_SERVER['argv'][1];

// retrieve URLs in a loop to follow redirects
$done = 0;
while (! $done) {
    $req = new HTTP_Request($url);
    $req->sendRequest();
    if ($response_code = $req->getResponseCode()) {
        if ((intval($response_code/100) == 3) &&
            ($location = $req->getResponseHeader('Location'))) {
            $url = $location;
        } else {
            $done = 1;
        }
    } else {
        // no response code at all: stop with a nonzero exit status
        exit(1);
    }
}

// compute base url from url
// this doesn't pay attention to a <base> tag in the page
$base_url = preg_replace('{^(.*/)([^/]*)$}','\\1',$req->_url->getURL());

// keep track of the links we visit so we don't visit each more than once
$seen_links = array();

if ($body = $req->getResponseBody()) {
    $links = pc_link_extractor($body);
    foreach ($links as $link) {
        // skip https URLs
        if (preg_match('{^https://}',$link[0])) {
            continue;
        }
        // resolve relative links
        if (! (preg_match('{^(http|mailto):}',$link[0]))) {
            $link[0] = $base_url.$link[0];
        }
        // skip this link if we've seen it already
        // (isset() avoids an undefined-index notice on the first visit)
        if (isset($seen_links[$link[0]])) {
            continue;
        }
        // mark this link as seen
        $seen_links[$link[0]] = true;

        // print the link we're visiting
        print $link[0].': ';
        flush();

        // visit the link
        $req2 = new HTTP_Request($link[0],
                                 array('method' => HTTP_REQUEST_METHOD_HEAD));
        $now = time();
        $req2->sendRequest();
        $response_code = $req2->getResponseCode();

        // if the retrieval is successful
        if ($response_code == 200) {
            // get the Last-Modified header
            if ($lm = $req2->getResponseHeader('Last-Modified')) {
                $lm_utc = strtotime($lm);
            } else {
                // or set Last-Modified to now
                $lm_utc = $now;
            }
            print strftime('%c',$lm_utc);
        } else {
            // otherwise, print the response code
            print $response_code;
        }
        print "\n";
    }
}
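To see what the program iterates over, it can help to run pc_link_extractor( ) against a small piece of HTML. The function below is copied from the listing so the snippet runs on its own; the sample markup is made up for illustration.

```php
<?php
// copy of pc_link_extractor() from the listing above
function pc_link_extractor($s) {
    $a = array();
    if (preg_match_all('/<A\s+.*?HREF=[\"\']?([^\"\' >]*)[\"\']?[^>]*>(.*?)<\/A>/i',
                       $s, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $match) {
            // $match[1] is the href value, $match[2] is the anchor text
            array_push($a, array($match[1], $match[2]));
        }
    }
    return $a;
}

// made-up sample page fragment with one absolute and one relative link
$html = '<p><a href="http://www.example.com/">Home</a> | '
      . "<a class=\"nav\" href='about.html'>About <b>us</b></a></p>";

foreach (pc_link_extractor($html) as $link) {
    print $link[0] . ' => ' . $link[1] . "\n";
}
```

Each element of the returned array is a two-element array: the link target in position 0 and the anchor text (including any nested markup) in position 1. The program uses only position 0.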