10.10.14

Parsing Apache log using OpenRefine

Recently I was looking for a quick way to explore some Apache log files. I didn't want to set up any software, and I wanted to analyze the precise path taken by a specific user, or what happened after a specific error. So I thought about OpenRefine and its parsing capabilities.

This recipe doesn't replace an analytics tool for understanding your traffic, but it helps you go behind the curtain and drill down into a specific IP address or user, a type of error code, or recurring patterns.



First download your log file to your machine and create a new project. You will need to parse the file as a Line-based text file.

Once the project has loaded, we need to parse the different pieces of information into their own fields. An Apache log has one entry per line. The Apache documentation will help us identify the different parts of each entry.

All the splits are done using the option Edit column > Split into several columns ...
  1. Split Column 1 on a dash (-), with a maximum of two columns, to separate the IP address
  2. Split Column 1 2 on the right square bracket (]) to separate the username and timestamp from the rest of the line
  3. Split Column 1 2 1 on the left square bracket ([) to separate the username from the timestamp
  4. Split Column 1 2 2 on the dash (-) to generate the six remaining columns
  5. Split Column 1 2 2 3 on a space to separate the response time from the object size
  6. Use the re-order columns option to remove all unnecessary columns
  7. Rename the columns one by one using the option Edit column > Rename this column
  8. Convert the timestamp field into a date. In this case we manually replace each month name with its number, then call toDate() with the matching date format.

    value.replace('Jan','01').replace('Feb','02').replace('Mar','03').replace('Apr','04').replace('May','05').replace('Jun','06').replace('Jul','07').replace('Aug','08').replace('Sep','09').replace('Oct','10').replace('Nov','11').replace('Dec','12').toDate('dd/MM/yyyy:HH:mm:ss')
  9. You can further parse the User-Agent HTTP request header field to extract each piece of information it contains.
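
If you want to double-check the columns produced by the splits, it can help to pull the same fields out of a single line outside OpenRefine. The Python sketch below is only an illustration, not part of the recipe: it assumes the log uses the Common Log Format, the sample line is the example from the Apache documentation, and the names `LOG_PATTERN`, `MONTHS` and `parse_line` are made up for this example. The timestamp handling follows the same idea as step 8 (swap the month name for its number, then parse).

```python
import re
from datetime import datetime

# Sample line in Common Log Format (the example used in the Apache documentation).
SAMPLE = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.1" 200 2326'

# One named group per column the splits are meant to produce.
# Assumption: Common Log Format; a custom LogFormat needs a different pattern.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

MONTHS = {
    'Jan': '01', 'Feb': '02', 'Mar': '03', 'Apr': '04',
    'May': '05', 'Jun': '06', 'Jul': '07', 'Aug': '08',
    'Sep': '09', 'Oct': '10', 'Nov': '11', 'Dec': '12',
}


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Same idea as step 8: replace the month name with its number, then parse.
    day, month, rest = fields['timestamp'].split('/', 2)
    numeric = '{}/{}/{}'.format(day, MONTHS[month], rest)
    fields['timestamp'] = datetime.strptime(numeric, '%d/%m/%Y:%H:%M:%S %z')
    return fields


record = parse_line(SAMPLE)
print(record['ip'], record['user'], record['status'], record['timestamp'].isoformat())
```

Comparing a few lines parsed this way with the corresponding rows in the project is a quick way to spot a split that landed on the wrong delimiter.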

Steps 1 to 8 are summarized in the following recipe. You can copy the code provided below and reapply it to your project to save time (under Undo/Redo > Apply).

Now you can use the text filter function to explore your log. I recommend reading this article: Use Refine to explore and profile your data (facet, filter, flag)
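
For readers who think more easily in code, here is a rough Python sketch of what a facet and a text filter do conceptually. It is not how OpenRefine implements them: the rows are hypothetical sample data, and `facet` and `text_filter` are names invented for this illustration.

```python
from collections import Counter

# Hypothetical parsed rows, standing in for the columns of the project.
rows = [
    {'ip': '192.0.2.10', 'user': 'frank', 'status': '200', 'request': 'GET /index.html HTTP/1.1'},
    {'ip': '192.0.2.10', 'user': 'frank', 'status': '404', 'request': 'GET /missing.png HTTP/1.1'},
    {'ip': '198.51.100.7', 'user': '-', 'status': '500', 'request': 'POST /login HTTP/1.1'},
    {'ip': '192.0.2.10', 'user': 'frank', 'status': '200', 'request': 'GET /about.html HTTP/1.1'},
]


def facet(rows, column):
    """Count rows per distinct value of a column, similar in spirit to a text facet."""
    return Counter(row[column] for row in rows)


def text_filter(rows, column, needle):
    """Keep rows whose column contains the search text, ignoring case."""
    return [row for row in rows if needle.lower() in row[column].lower()]


# How many rows per status code?
status_counts = facet(rows, 'status')

# Drill down: requests from one IP address that did not return 200.
errors_for_ip = [r for r in text_filter(rows, 'ip', '192.0.2.10') if r['status'] != '200']

print(status_counts)
print(errors_for_ip)
```

Combining a facet on the status column with a filter on the IP or username column gives the same kind of drill-down inside the project.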