7.2. Access logs

To date the only log file we have met is the error log. There is no logging in our current server configuration when things aren't going wrong. Probably we want to log the requests that are made of our server. These are the access logs.

We need to decide what we want to log and where to log it to. We may want more than one log file for different sets of data.

As (almost) ever, the means to get this functionality is to load a module: log_config_module from mod_log_config.so.

This provides us with one particularly useful command: CustomLog. This allows us to specify what information to record and where to record it for each query/response at the server. This power comes at the price of almost complete syntactic obscurity at first glance. But in all honesty it's not that bad.

Suppose we wanted to record just the following information about each query processed by the server:


LoadModule      log_config_module       modules/mod_log_config.so
CustomLog       logs/request.log        "%t %U %h %s"

Each of the elements beginning with a percentage character is called an escape code and is converted into some piece of logged information. A complete list of the codes is given in Appendix B.

Four escape sequences

%t

The time of the request

%U

The URL requested

%h

The client hostname

%s

Status code of the request

To illustrate what they indicate and what they don't, we will request three URLs and note a number of problems in the logged output.

The requested URLs


[22/Feb/2007:15:51:43 +0000] /index.html 131.111.10.33 304
[22/Feb/2007:15:52:04 +0000] /index.html 131.111.10.33 304
[22/Feb/2007:15:52:19 +0000] /gamma.html 131.111.10.33 304

Notice how requests for the top-level directory are being logged as requests for /index.html. It appears that apache applys at least some of it's internal processing to the value it loggs with %U, in this case using dir_module to locate a suitable index document.

Problems with the output as it stands

The simplest way to address the issue of which website was queried is to move the CustomLog lines into the virtual host sections and to have two different log files. This gives them the flexibility to log different things too.

<VirtualHost *>
ServerName	www.dept.cam.ac.uk
DocumentRoot	/srv/www/WWW
CustomLog	/var/log/apache2/www.log "%t %U %h %s"
</VirtualHost>

If we really wanted a single log file with the virtual host information we could use the %v escape code to record it.

To enable the use of hostnames rather than addresses, we must arrange for the web server to do DNS lookups for the IP addresses on each incoming query. We will do this with the HostnameLookups command. This command is a core Apache command and not part of the logging module. In some circumstances, hostnames will be looked up even without this command. For example if you do any access controls based on host names as we will be in Section 10.2. We will set this on globally. If either website wanted to record IP addresses rather than hostnames then it can do so by using %s rather than %h.

HostnameLookups On

This results in log entries that look like this:


[22/Feb/2007:15:59:06 +0000] /index.html mnementh.csi.cam.ac.uk 200
[22/Feb/2007:15:58:40 +0000] /alpha.html mnementh.csi.cam.ac.uk 200
[22/Feb/2007:15:58:47 +0000] /beta.html mnementh.csi.cam.ac.uk 200

Common Log Format. A standard format for access log files is used by many utilities and replied on by the web analysis programs. It is called "The Common Log Format" (CLF) and is shown below.


CustomLog /var/log/apache2/www.log "%h %l %u %t \"%r\" %>s %b"

This produces log entries that look like this:


mnementh.csi.cam.ac.uk - - [22/Feb/2007:16:01:48 +0000] "GET / HTTP/1.1" 
200 306
mnementh.csi.cam.ac.uk - - [22/Feb/2007:16:01:53 +0000] "GET /nonsuch.html 
HTTP/1.1" 404 210

The escape sequences used in the Common Log Format

%h

The client's hostname

%l

If the web server was doing IDENT (RFC1413) lookups then the returned userid would be here.

%u

If the client had authenticated as a particular user for this request the userid would be recorded here. We discuss authentication in detail in Section 10.3.

%t

The time of the request.

%r

The first line of the query.

%>s

The status code finally returned to the client. There is a subtle difference between %s and %>s. In most cases they are identical. In cases where the URL gets remapped then %s gives the status code of the initial lookup and %>s the code of the final lookup (and the code passed back to the client).

%b

The number of data bytes (i.e. excluding the headers) sent back to the client in the case of successful completion.

Named formats. A common requirement is for all virtual hosts to log in the same format. To assist with this it is possible to name a format definition and to then refer to the format's name in the CustomLog line.


LogFormat     "%h %l %u %t \"%r\" %>s %b"  clf

<VirtualHost *>
ServerName	www.dept.cam.ac.uk
DocumentRoot	/srv/www/WWW
CustomLog	/var/log/apache2/www.log clf
</VirtualHost>

Logging headers. One very useful escape code is %{fubar}i which will log the value of incoming header fubar. We could use this as %{Host}i to record the queried Host header, for example, to check our virtual hosting was working as expected.

The values of incoming headers are used to define a common alternative to "Common Log Format". This is called "Combined Log Format" since it combines the Access Log with the Referer and User Agent logs that were maintained in separate files by early versions of Apache.


LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined 

Additional escape sequences used in the Combined Log Format

%{Referer}i

The value of the 'Referer:' header which most browsers include in requests. This contains the URL of the page that contain the link the user queried to bring them here. This can be particularly useful to find remote pages that have bad links onto your site.

%{User-Agent}i

The value of the 'User-Agent:' header which most browsers include in requests. If present, this normally identifies the browser being used, though it can be trivially changed on most browsers.