Scraping Apache Logs with Shell Commands

Recently, I needed to extract some download statistics from Apache log files. I wanted download trends for a handful of files, and their names contained keywords I could search for. I looked at existing analysis tools such as AWStats. While AWStats is excellent, I decided to look for a simpler way to grab the data I needed. Fortunately, the log files were organized per day; for example, one file is called "20101215.log". The lexicographical ordering that this date format provides makes gathering daily summaries very easy. My goal was to generate a tabular text file that I could import into Excel. I wanted to create some graphs quickly, and Excel is a tool I'm familiar with (my LaTeX skills are very rusty!). To start with, I could grep for a keyword and use "wc" to count the number of matching lines.

for f in 20*.log
do
  grep "euca-centos-5.3" "$f" |wc -l
done

This gives me a list of numbers, one per day, showing how many times a file with that string (euca-centos-5.3) in its name was requested. I can make this more useful by including the date on each row, like this:

...
do
  basename $f .log |tr -d "
"
  echo -n " "
...

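This code takes the date from the filename via "basename", then uses "tr" to strip the trailing newline (a line feed, not a carriage return). Notice the quote on that line and on the next: the argument to "tr" is a quoted literal newline, which has the same effect as writing tr -d '\n'.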
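As an aside, the same date-plus-space prefix can be produced without "tr", because command substitution drops the trailing newline on its own. This is only a sketch of an alternative, not what the script in this post does; the path below is an invented example and does not need to exist, since basename only works on the string.

```shell
#!/bin/bash
f="logs/20101215.log"        # example path, assumed for illustration
# $(...) strips the trailing newline that basename prints
day=$(basename "$f" .log)
# print the date followed by a single space, with no newline
printf '%s ' "$day"
```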

Now, I get output that looks like "20090513 11". The next step is to add in more fields for other files. To do that, I added another loop to iterate through a set of "categories" I wanted counts for. Here's an example of this code.

categories="euca-centos-5.3 euca-debian-5.0 euca-fedora-10 euca-fedora-11 euca-ubuntu-9.04"
for i in $categories
do
  grep "$i" "$f" |wc -l |tr -d "
"
  echo -n " "
done
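To see the pieces so far working together, here is a self-contained sketch that builds two small log files and prints one row per day. The log lines, the temporary directory, and the make_table function name are all invented for illustration; the only thing taken from the real logs is that each line starts with the requester IP. It also lets grep do the counting with -c, and uses -F so the "." in a keyword is matched literally, which is a small variation on the grep/wc pipeline above.

```shell
#!/bin/bash
# Invented sample data: two daily logs in a temporary directory.
tmpdir=$(mktemp -d)
cat > "$tmpdir/20101214.log" <<'EOF'
10.0.0.1 - - [14/Dec/2010:09:00:00 +0000] "GET /euca-centos-5.3-i386.tar.gz HTTP/1.1" 200 1024
10.0.0.2 - - [14/Dec/2010:09:30:00 +0000] "GET /euca-debian-5.0-i386.tar.gz HTTP/1.1" 200 2048
EOF
cat > "$tmpdir/20101215.log" <<'EOF'
10.0.0.3 - - [15/Dec/2010:10:00:00 +0000] "GET /euca-centos-5.3-x86_64.tar.gz HTTP/1.1" 200 4096
10.0.0.4 - - [15/Dec/2010:10:05:00 +0000] "GET /euca-centos-5.3-i386.tar.gz HTTP/1.1" 200 1024
EOF

categories="euca-centos-5.3 euca-debian-5.0"

# Print one row per log file: the date, then one count per category.
make_table() {
  for f in "$1"/20*.log
  do
    printf '%s' "$(basename "$f" .log)"
    for i in $categories
    do
      # grep -c counts matching lines; -F matches the keyword literally
      printf ' %s' "$(grep -cF -- "$i" "$f")"
    done
    printf '\n'
  done
}

table=$(make_table "$tmpdir")
echo "$table"
rm -r "$tmpdir"
```

With this sample data the rows come out as "20101214 1 1" and "20101215 2 0".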

There is something else about the access logs that I learned after looking at the data. Just counting references to a file can give oddly inflated results, because some browsers will download a file in many parts. Each part shows up in the log with a 206 (Partial Content) status code. So, where a client may have downloaded the file once, you'll see (say) 50 log entries. My first attempt to filter these simply removed the 206 lines from consideration. This fails because a download made entirely of partial requests isn't counted at all. After looking around, I found information about using "cut" to get fields from the logs. Since each log line starts with the requester's IP address, I can grep for the search term, cut out the IP address, then sort the IPs and count the unique ones. That collapses the multiple hits from a single client into one. The downside is that if one client made several legitimate downloads in a given day, they would be counted only once. I felt that was a smaller problem. Here is the complete script I ran:

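#!/bin/bash
categories="euca-centos-5.3 euca-debian-5.0 euca-fedora-10 euca-fedora-11 euca-ubuntu-9.04 i386 x86_64"
echo "date" $categories
for f in 20*.log
do
  # the file name encodes the date; print it without the trailing newline
  basename "$f" .log |tr -d '\n'
  echo -n " "
  for i in $categories
  do
    # count distinct client IPs (field 1) that requested this category
    grep "$i" "$f" |cut -d " " -f1 |sort -u |wc -l |tr -d ' \n'
    # separate the columns explicitly; GNU wc does not pad its output
    echo -n " "
  done
  echo ""
done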

The first echo puts column headers into the text output. Each iteration of the inner loop adds a count for one category, then the last echo ends the row with a newline. The result can be imported into Excel very easily. Make sure to indicate that the data is space-delimited (rather than comma-delimited), and set the date format for the first column to yyyymmdd.
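To make the effect of the unique-IP step concrete, here is a small sketch that compares the three ways of counting discussed above on a single made-up log. The log lines are invented and assume the common Apache layout in which the status code is the ninth space-separated field; check your own LogFormat before relying on that field number.

```shell
#!/bin/bash
# Invented sample log: one client fetches the file in three ranged
# requests (status 206), a second client fetches it once (status 200).
tmpdir=$(mktemp -d)
log="$tmpdir/20101215.log"
cat > "$log" <<'EOF'
10.0.0.1 - - [15/Dec/2010:10:00:00 +0000] "GET /euca-centos-5.3-i386.tar.gz HTTP/1.1" 206 500
10.0.0.1 - - [15/Dec/2010:10:00:01 +0000] "GET /euca-centos-5.3-i386.tar.gz HTTP/1.1" 206 500
10.0.0.1 - - [15/Dec/2010:10:00:02 +0000] "GET /euca-centos-5.3-i386.tar.gz HTTP/1.1" 206 500
10.0.0.2 - - [15/Dec/2010:11:00:00 +0000] "GET /euca-centos-5.3-i386.tar.gz HTTP/1.1" 200 1500
EOF

# Raw hit count: every log line that mentions the file.
raw=$(grep -cF "euca-centos-5.3" "$log")

# Dropping 206 lines loses the first client's download entirely.
# Field 9 is assumed to be the status code in this log layout.
no206=$(grep -F "euca-centos-5.3" "$log" | awk '$9 != 206' | wc -l | tr -d ' ')

# Unique client IPs: the approach used in the script above.
unique=$(grep -F "euca-centos-5.3" "$log" | cut -d " " -f1 | sort -u | wc -l | tr -d ' ')

echo "raw=$raw no206=$no206 unique=$unique"
rm -r "$tmpdir"
```

Here two clients downloaded the file, and only the unique-IP count reports 2; the raw count gives 4 and the 206 filter gives 1.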
