How to summarize raw access logs from ProxySG with Linux Commands

book

Article ID: 168164

calendar_today

Updated On:

Products

ProxySG Software - SGOS

Issue/Introduction

How can you summarize ProxySG log files when a Blue Coat Reporter is not available?


 

Resolution

You can use the awk command on a Linux machine or Cygwin on a Microsoft Windows machine.

#Preparation1
If the work machine is a Linux, gawk must be installed.
If it is a Windows machine, Cygwin software should be installed. https://www.cygwin.com/
Instead of Cygwin, GNU Awk for Windows could be used, but another Linux general command(sort, uniq, etc..) will be used during the work, so recommendation is Cygwin.

#Preparation2
Download attached simple-csv.awk file. This awk script file is a converter from original SG's .log file to TAB separated CSV file.
Original SG's .log file is separated by " "(space) as CSV separator, but this is not a quite good separator, because CSV file generally has some space characters in its own field.

It had better convert from original .log files to the CSV file with different separator for gawk so that the gawk would not take this spaces for separator,

gawk -f simple-csv.awk 730515131405.log > 730515131405.csv
*For example, 3GB of 
730515131405.log file takes 20 min to convert to *.csv file.
 

Workaround

Once *.csv file is created, it will be able to easily get summary of Accesslog.

- see top 10 accessed URLs.

$ sudo gawk -F '\t' '{print $16$19}' 730515131405.csv | sort | uniq -c | sort -rn | head -10
  95715 www.google-analytics.com/__utm.gif
  87560 m.map.c.yimg.jp/m
  70042 www.google.co.jp/
  69513 www.google.com/
  64718 ads.yahoo.com/imp
  60630 fileserver.glam.com/glamjp/widgets/img/none.gif
  59439 pagead2.googlesyndication.com/pagead/show_ads.js
  49512 yads.yahoo.co.jp/tag
  46737 platform.twitter.com/widgets/tweet_button.1400006231.html
  42494 cas.criteo.com/delivery/ajs.php

- see top 10 client ip's.
$ sudo gawk -F '\t' '{print $3}' 730515131405.csv | sort | uniq -c | sort -rn | head -10
 161174 xxx.yy.20.78
 114061 xxx.yy.48.144
  70124 xxx.yy.174.78
  68975 xxx.yy.166.61
  50255 xxx.yy.166.81
  41553 xxx.yy.161.168
  38073 xxx.yy.18.14
  30303 xxx.yy.149.45
  26758 xxx.yy.173.44
  25343 xxx.yy.161.203

- see top 10 categories of URL filter.
$ sudo gawk -F '\t' '{print $8}' 730515131405.csv | sort | uniq -c | sort -rn | head -10
 801295 "Search Engines/Portals"
 545589 "Web Ads/Analytics"
 378543 "Technology/Internet"
 358854 "Content Servers"
 223359 "Social Networking"
 168370 "Business/Economy"
 151033 "Reference"
 139254 "HTTP_White_URL;Web Ads/Analytics"
 132283 "none"
 118138 "Shopping"

 

Attachments

simple-csv.awk get_app