11.14.1 Problem
You want to process the requests recorded in a web server access log; for example, to count the number of requests for each page.
11.14.2 Solution
Open the file and parse each line with a regular expression
that matches the log file format. This regular expression matches the NCSA
Combined Log Format:

$pattern = '/^([^ ]+) ([^ ]+) ([^ ]+) (\[[^\]]+\]) "(.*) (.*) (.*)" ([0-9\-]+) ([0-9\-]+) "(.*)" "(.*)"$/';
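As a quick sketch of how the pattern is used, here it is applied with preg_match( ) to a single sample log line; each field ends up in a numbered capture group:

```php
<?php
// A minimal sketch: apply the pattern to one sample Combined Log Format
// line and pull individual fields out of the numbered capture groups.
$pattern = '/^([^ ]+) ([^ ]+) ([^ ]+) (\[[^\]]+\]) "(.*) (.*) (.*)" ([0-9\-]+) ([0-9\-]+) "(.*)" "(.*)"$/';

$line = '10.1.1.162 - david [20/Jul/2001:13:05:02 -0400] ' .
        '"GET /sklar.css HTTP/1.0" 200 278 "-" "Mozilla/4.77 [en] (WinNT; U)"';

if (preg_match($pattern, $line, $matches)) {
    echo $matches[1], "\n";  // remote host: 10.1.1.162
    echo $matches[6], "\n";  // requested URI: /sklar.css
    echo $matches[8], "\n";  // status: 200
}
```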
11.14.3 Discussion
This program parses the NCSA Combined
Log Format lines and displays a list of pages sorted by the number of requests
for each page:
$log_file = '/usr/local/apache/logs/access.log';

$pattern = '/^([^ ]+) ([^ ]+) ([^ ]+) (\[[^\]]+\]) "(.*) (.*) (.*)" ([0-9\-]+) ([0-9\-]+) "(.*)" "(.*)"$/';

$fh = fopen($log_file,'r') or die($php_errormsg);

$i = 1;
$requests = array();

while (! feof($fh)) {
    // read each line and trim off leading/trailing whitespace
    if ($s = trim(fgets($fh,16384))) {
        // match the line to the pattern
        if (preg_match($pattern,$s,$matches)) {
            /* put each part of the match in an appropriately named
             * variable */
            list($whole_match,$remote_host,$logname,$user,$time,
                 $method,$request,$protocol,$status,$bytes,$referer,
                 $user_agent) = $matches;
            // keep track of the count of each request
            if (! isset($requests[$request])) { $requests[$request] = 0; }
            $requests[$request]++;
        } else {
            // complain if the line didn't match the pattern
            error_log("Can't parse line $i: $s");
        }
    }
    $i++;
}

fclose($fh) or die($php_errormsg);

// sort the array (in reverse) by number of requests
arsort($requests);

// print formatted results
foreach ($requests as $request => $accesses) {
    printf("%6d %s\n",$accesses,$request);
}
The pattern used in preg_match( ) matches Combined Log
Format lines such as:
10.1.1.162 - david [20/Jul/2001:13:05:02 -0400] "GET /sklar.css HTTP/1.0" 200 278 "-" "Mozilla/4.77 [en] (WinNT; U)"
10.1.1.248 - - [14/Mar/2002:13:31:37 -0500] "GET /php-cookbook/colors.html HTTP/1.1" 200 460 "-" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)"
In the first line, 10.1.1.162 is the IP address that
the request came from. Depending on the server configuration, this could be a
hostname instead. When the $matches array is assigned to the list of
separate variables, the hostname or IP address is stored in $remote_host. The next
hyphen (-) means that the remote host didn't supply a username via identd,[1] so $logname is set to -.
[1] identd, defined in RFC 1413, is supposed to be a good way to identify users remotely. However, it's not very secure or reliable. A good explanation of why is at http://www.clock.org/~fair/opinion/identd.html.
The string david is a username provided by the browser
using HTTP Basic Authentication and is put in $user. The date and time of the request, stored in $time, is
in brackets. This date and time format isn't understood
by strtotime( ), so if you want to do calculations based on request
date and time, you have to do some further processing to extract each piece of
the formatted time string. Next, in quotes, is the first line of the request.
This is composed of the method (GET, POST, HEAD, etc.), which is stored in
$method; the requested URI, which is stored in $request; and
the protocol, which is stored in $protocol. For GET requests, the query
string is part of the URI. For POST requests, the request body that contains the
variables isn't logged.
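One way to do that further processing is with DateTime::createFromFormat( ) (available in PHP 5.3 and later), which can parse the log's day/month/year:hour:minute:second offset layout directly. A sketch, using the timestamp from the first sample line:

```php
<?php
// A sketch of parsing the bracketed log timestamp into a DateTime object;
// assumes PHP 5.3+ for DateTime::createFromFormat().
$time = '[20/Jul/2001:13:05:02 -0400]';  // as captured into $time by the pattern

// strip the brackets, then parse day/month/year:hour:minute:second offset
$dt = DateTime::createFromFormat('d/M/Y:H:i:s O', trim($time, '[]'));

if ($dt !== false) {
    echo $dt->format('Y-m-d H:i:s O'), "\n";  // 2001-07-20 13:05:02 -0400
    echo $dt->getTimestamp(), "\n";           // Unix timestamp, ready for arithmetic
}
```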
After the request comes the request status, stored in
$status. Status 200 means the request was successful. After
the status is the size in bytes of the response, stored in $bytes. The
last two elements of the line, each in quotes, are the referring page, if any,
stored in $referer,[2] and the user agent string identifying the browser that
made the request, stored in $user_agent.
[2] The correct way to spell this word is "referrer." However, since the original HTTP specification (RFC 1945) misspelled it as "referer," the three-R spelling is frequently used in context.
Once the log file line has been parsed into distinct variables,
you can do the needed calculations. In this case, just keep a counter in the
$requests array of how many times each URI is requested. After looping
through all lines in the file, print out a sorted, formatted list of requests
and counts.
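The same parsed fields support other tallies. As a sketch, here is the total response size per remote host, summed from the bytes field; to keep the example self-contained, the two sample log lines from above sit in an array instead of being read from the access log:

```php
<?php
// A variant tally: total response bytes per remote host. The sample
// lines are held in an array here rather than read from a file.
$pattern = '/^([^ ]+) ([^ ]+) ([^ ]+) (\[[^\]]+\]) "(.*) (.*) (.*)" ([0-9\-]+) ([0-9\-]+) "(.*)" "(.*)"$/';

$lines = array(
    '10.1.1.162 - david [20/Jul/2001:13:05:02 -0400] "GET /sklar.css HTTP/1.0" 200 278 "-" "Mozilla/4.77 [en] (WinNT; U)"',
    '10.1.1.248 - - [14/Mar/2002:13:31:37 -0500] "GET /php-cookbook/colors.html HTTP/1.1" 200 460 "-" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)"',
);

$bytes_by_host = array();
foreach ($lines as $s) {
    if (preg_match($pattern, $s, $matches)) {
        $host  = $matches[1];                                    // remote host
        $bytes = ($matches[9] == '-') ? 0 : (int) $matches[9];   // '-' means no body
        if (! isset($bytes_by_host[$host])) { $bytes_by_host[$host] = 0; }
        $bytes_by_host[$host] += $bytes;
    }
}

// sort (in reverse) by byte count and print
arsort($bytes_by_host);
foreach ($bytes_by_host as $host => $total) {
    printf("%8d %s\n", $total, $host);
}
```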
Calculating statistics this way from web server access logs is
easy, but it's not very flexible. The program needs to be modified for different
kinds of reports, restricted date ranges, report formatting, and many other
features. A better solution for comprehensive web site statistics is to use a
program such as analog,
available for free at http://www.analog.cx. It has many types of reports and
configuration options that should satisfy just about every need you may have.