Replaying an Apache Logfile with Python

NOTE: This is an old article I wrote in October 2008, but it’s still relevant today. It was originally posted on luckydonkey.com, which I am in the process of retiring.

To test for a memory leak in New Metal Army, I needed to replay my Apache log files against my test server. Using Twill, this was a doddle.

The only slightly icky thing about the script is the regular expression that parses a line from the Apache log file (in Combined Log Format). I got this from RegexBuddy (pretty much the only reason I run Windows nowadays), but I am sure you can get similar expressions from other regular expression libraries.
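To see what the expression actually captures, here is a minimal sketch that runs it against a single line. The sample line and the names `LOG_LINE`, `sample` and `fields` are made up for illustration. The pattern is the same one used in the script, only split across several string literals for readability.

```python
import re

# Same pattern as in the replay script, split over several raw strings.
LOG_LINE = re.compile(
    r'^(?P<client>\S+)\s+(?P<auth>\S+\s+\S+)\s+\[(?P<datetime>[^]]+)\]\s+'
    r'"(?:GET|POST|HEAD) (?P<file>[^ ?"]+)\??(?P<parameters>[^ ?"]+)? HTTP/[0-9.]+"'
    r'\s+(?P<status>[0-9]+)\s+(?P<size>[-0-9]+)\s+'
    r'"(?P<referrer>[^"]*)"\s+"(?P<useragent>[^"]*)"$'
)

# Hypothetical Combined Log Format line, not taken from a real log.
sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif?x=1 HTTP/1.0" 200 2326 '
          '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"')

match = LOG_LINE.search(sample)
fields = match.groupdict() if match else {}

# Print each named group so you can check what was captured.
for name in sorted(fields):
    print("%s: %s" % (name, fields[name]))
```

Lines that do not match (for example, requests using a method other than GET, POST or HEAD) return `None` from `search`, which is why the script simply skips them.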

Anyway, I’m just chucking this out there in case someone else finds it useful.

#!/usr/bin/env python
#Script to replay a file called apache.log against a server.

import re
import twill

test_server="my.test.server.com"

reobj = re.compile(r'^(?P<client>\S+)\s+(?P<auth>\S+\s+\S+)\s+\[(?P<datetime>[^]]+)\]\s+"(?:GET|POST|HEAD) (?P<file>[^ ?"]+)\??(?P<parameters>[^ ?"]+)? HTTP/[0-9.]+"\s+(?P<status>[0-9]+)\s+(?P<size>[-0-9]+)\s+"(?P<referrer>[^"]*)"\s+"(?P<useragent>[^"]*)"$', re.MULTILINE)
browser = twill.get_browser()

def filter_url(url):
    # Return True for any URL that should be skipped; by default nothing is skipped.
    return False


for line in open("apache.log"):
    match = reobj.search(line)
    if match is None:
        continue

    f = match.group("file")
    p = match.group("parameters")
    d = match.group("datetime")
    path = "?".join([f, p]) if p else f

    url = test_server+path

    if filter_url(url):
        continue

    try:
        # Formatted this way so the output is the same on Python 2 and Python 3.
        print("%s %s" % (d, url))
        browser.go(url)
    except ValueError:
        # This comes from twill parsing the HTML and it being malformed.
        # I don't really care about that, as long as I get the page.
        pass
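
The `filter_url` hook in the script returns False for everything, so every request gets replayed. As a rough sketch of how it might be filled in, the version below skips requests for static assets, assuming those are not interesting when hunting a leak in the application itself. The `SKIPPED_EXTENSIONS` tuple is a hypothetical list, not something from the original script.

```python
# Hypothetical list of static-asset extensions to skip; adjust to taste.
SKIPPED_EXTENSIONS = (".css", ".js", ".png", ".gif", ".jpg", ".ico")


def filter_url(url):
    """Return True if the URL should be skipped during the replay."""
    # Only look at the path, so a query string such as ?f=a.css is ignored.
    path = url.split("?", 1)[0]
    return path.lower().endswith(SKIPPED_EXTENSIONS)


print(filter_url("my.test.server.com/static/site.css"))
print(filter_url("my.test.server.com/profile?id=42"))
```

Dropping this in place of the stub leaves the rest of the loop unchanged, since the loop only checks whether the return value is truthy.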

Hope that helps someone.