How to summarize large access logs without using either Symantec Reporter or Microsoft Excel

Article ID: 168164


Updated On:

Products

ProxySG Software - SGOS
Cloud Secure Web Gateway - Cloud SWG

Issue/Introduction

A popular way to analyze access logs is to use Microsoft Excel (MS Excel). However, most access log files are too large to be opened in MS Excel. The legacy Symantec Reporter is another option, but it is quite old, and Cloud Secure Web Gateway (Cloud SWG, formerly WSS) users do not have it. How can we summarize such access log files without using either MS Excel or Reporter?

Resolution

The Linux command "awk" (or "gawk") is useful for summarizing large access logs.

#Prerequisites:

  • OS

    Linux
    or
    WSL (Windows Subsystem for Linux)
    or
    Cygwin on Windows

  • tools and commands

    Python3
    nkf (if multi-byte characters are present in the access log)
    awk or gawk
    sort
    head
    uniq

#Preparation. The commands below differ slightly depending on the access log type.

- If the access log is not in CSV format and contains multi-byte characters, convert it to UTF-8 first, and then convert the file to CSV. For example, Edge SWG outputs the access log in a space-separated format that may contain multi-byte characters such as Japanese kanji.

$nkf -w SG_main__xxxx-.log > SG_main__xxxx-utf8.log
$python3 convert2csv.py SG_main__xxxx-utf8.log SG_main__xxxx-utf8.csv

(SG_main__xxxx.log is the original log file)
convert2csv.py can be downloaded from this article (see Attachments)
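The attached convert2csv.py performs the actual conversion; purely as an illustration, a minimal space-separated-to-CSV converter could look like the sketch below (the function name and quoting behavior are assumptions, not the attachment's actual code):

```python
import csv
import shlex

def convert(src, dst):
    """Convert a space-separated (ELFF-style) log to CSV, quoting every field."""
    with open(src, encoding="utf-8") as fin, \
         open(dst, "w", newline="", encoding="utf-8") as fout:
        writer = csv.writer(fout, quoting=csv.QUOTE_ALL)
        for line in fin:
            if line.startswith("#"):  # skip ELFF header lines such as #Fields:
                continue
            # shlex.split honors double-quoted fields that contain spaces
            writer.writerow(shlex.split(line))
```

Quoting every output field is what lets patterns anchored on `^\"` (used in the gawk examples later in this article) match.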

- If the access log is not in CSV format and does not contain multi-byte characters, simply convert it to CSV. For example, Edge SWG outputs the access log in a space-separated format without multi-byte characters.

Just run

$python3 convert2csv.py SG_main__xxxx.log SG_main__xxxx.csv


- If the access log is already in CSV format and contains multi-byte characters, only the UTF-8 conversion is needed. For example, Cloud SWG outputs the access log as a CSV file that may contain multi-byte characters such as Japanese kanji.

$nkf -w logdownload-xxxx-2025-07-02Txxxxx.csv > logdownload-xxxx-2025-07-02Txxxxx-utf-8.csv


- If the access log is already in CSV format and does not contain multi-byte characters, there is nothing to do.
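If nkf is not available, iconv (preinstalled on most Linux systems) can do the same UTF-8 conversion when the source encoding is known; Shift_JIS below is an assumption, and the sample line stands in for a real log file:

```shell
# Sample file standing in for the real log (assumed to be Shift_JIS-encoded;
# replace SHIFT_JIS with the actual encoding of your log)
printf 'sample ascii line\n' > SG_main__xxxx.log
iconv -f SHIFT_JIS -t UTF-8 SG_main__xxxx.log > SG_main__xxxx-utf8.log
```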

 

#How to summarize

- The file is too large; split it into 30-minute chunks, for example between 8:00 and 10:00.
$ gawk -F "," '$2~"^\"08:[0-2]" {print $0}' accesslog.csv > 0800-0830.csv
$ gawk -F "," '$2~"^\"08:[3-5]" {print $0}' accesslog.csv > 0830-0900.csv
$ gawk -F "," '$2~"^\"09:[0-2]" {print $0}' accesslog.csv > 0900-0930.csv
$ gawk -F "," '$2~"^\"09:[3-5]" {print $0}' accesslog.csv > 0930-1000.csv

*Note: $2 is the position of the "time" field. In Edge SWG access logs, "time" is typically field #2. In Cloud SWG access logs, the "time" field is at a different position, typically around #83:
...rs-icap-service,cs-auth-groups,time,x-client-agent-ip,..

^\" is stands for the strings is beginning of each field + next character is ", such as "08:01:04". If the csv file doesn't have " at the beginning of each field, it doesn't need \", but it needs '^'.

- See the number of accesses rising in the morning.

$ gawk -F "," '$2~"^\"07:[4][0-9]:" {print $0}' accesslog.csv | wc
 390335 4762075 188568263
$ gawk -F "," '$2~"^\"07:[5][0-9]:" {print $0}' accesslog.csv | wc
 578796 6902066 278649103
$ gawk -F "," '$2~"^\"08:[0][0-9]:" {print $0}' accesslog.csv | wc
 580221 6989392 279943462
$ gawk -F "," '$2~"^\"08:[1][0-9]:" {print $0}' accesslog.csv | wc
 550523 6678404 266026841
$ gawk -F "," '$2~"^\"08:[2][0-9]:" {print $0}' accesslog.csv | wc
 540851 6541162 261165697
$ gawk -F "," '$2~"^\"08:[3][0-9]:" {print $0}' accesslog.csv | wc
 483593 5845913 233940979
$ gawk -F "," '$2~"^\"08:[4][0-9]:" {print $0}' accesslog.csv | wc
 469658 5745813 227613990
$ gawk -F "," '$2~"^\"08:[5][0-9]:" {print $0}' accesslog.csv | wc
 465159 5649098 225398550

This shows that peak access is between 7:50 and 8:30 (the first column of the wc output is the line count, i.e. the number of requests).

- See top 10 accessed URLs.
$ sudo gawk -F "," '{print $21$23$25}' accesslog.csv | sort | uniq -c | sort -rn | head -10
  95715 xxxxx
  87560 xxxxx     <-- URLs are masked with dummy values
  70042 xxxxx
  .....

*Note: $21 is the host, $23 the "uri path", and $25 the "uri path extension" in Edge SWG access logs.

- See top 10 client ip's.
$ sudo gawk -F "," '{print $38}' accesslog.csv | sort | uniq -c | sort -rn | head -10
 161174 xxx.yy.zz.tt
 114061 xxx.yy.zz.tt   <-- IP addresses are masked with dummy values
  70124 xxx.yy.zz.tt

*Note: $38 is the client IP in Edge SWG access logs.


- See top 10 categories of URL filter.
$ sudo gawk -F "," '{print $13}' accesslog.csv | sort | uniq -c | sort -rn | head -10
 801295 "Search Engines/Portals"
 545589 "Web Ads/Analytics"
 378543 "Technology/Internet"
 358854 "Content Servers"
 223359 "Social Networking"
 168370 "Business/Economy"
 151033 "Reference"
 139254 "HTTP_White_URL;Web Ads/Analytics"
 132283 "none"
 118138 "Shopping"

*Note: $13 is the Category Name in Edge SWG access logs.


 

Attachments

convert2csv.py