Article summary
Access logs from AWS CloudFront distributions and AWS Elastic Load Balancers can be essential for diagnosing problems with an AWS infrastructure. AWS provides the ability to store these logs in S3 buckets.
However, the logs are usually split across a large number of small files, which need to be combined in order to get a full picture of the traffic they represent. To make this process easier, I wrote a few scripts that quickly download the logs and reformat them into something usable for review.
Dependencies
Both of my approaches rely on the AWS CLI, which must be configured to authenticate as an IAM user that has access to the target S3 bucket. One of the approaches also relies on python or ruby (something with sophisticated regex handling) to do some string substitution.
aws-cli, ~> 1.11.91
python, ~> 2.7.13
or ruby, ~> 2.2.2
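Before running either script, it can be worth confirming that the CLI is authenticating as the expected IAM user and can see the log bucket. A quick sanity check (the bucket name below is a placeholder for your own):
aws sts get-caller-identity
aws s3 ls "s3://cloudfront-log-bucket/" | head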
AWS CloudFront
Access logs from CloudFront distributions can be sent to a specific AWS S3 bucket as detailed in the AWS documentation.
There are different files for each date, hour, and specific edge server that handled the request. In my case, this amounted to over 140 files for a single day. Fortunately, all of the files use tab-separated fields, which are easy to work with on the command line and in other editors.
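If you want to confirm the layout before pulling everything down, you can stream a single object to stdout (the key below is a made-up placeholder following CloudFront's naming scheme). The first two lines are the #Version and #Fields headers, and everything after that is a tab-separated record:
aws s3 cp "s3://cloudfront-log-bucket/EXXXXXXXXXXXXX.2017-05-30-12.abcd1234.gz" - | gzip -dc | head -5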
Steps
Pull down the files for today only:
aws s3 sync --exclude "*" --include "*$(date +%Y-%m-%d)*" "s3://cloudfront-log-bucket/" "./tmp"
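If you would rather preview which objects will be copied before downloading anything, the same command accepts a dry-run flag:
aws s3 sync --dryrun --exclude "*" --include "*$(date +%Y-%m-%d)*" "s3://cloudfront-log-bucket/" "./tmp"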
Decompress the files:
gzip -d ./tmp/*.gz
Remove the first two lines from each file (version and field header information), and then sort everything:
tail -q -n+3 ./tmp/* | sort
Full solution
#!/bin/bash
mkdir ./tmp
aws s3 sync --exclude "*" --include "*$(date +%Y-%m-%d)*" s3://cloudfront-log-bucket ./tmp
gzip -d ./tmp/*.gz
tail -q -n+3 ./tmp/* | sort > "$(date +%Y-%m-%d)_cloudfront_logfile.tsv"
rm -rf ./tmp
Running this will create a TSV (tab-separated value) file for all of today’s CloudFront access log entries, sorted by date and time.
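Once the TSV exists, ordinary command-line tools work well against it. For example, to count requests per HTTP status code (sc-status is the 9th field in the standard CloudFront web distribution log format, but verify against the #Fields header for your own logs):
cut -f9 "$(date +%Y-%m-%d)_cloudfront_logfile.tsv" | sort | uniq -c | sort -rn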
AWS Elastic Load Balancer
Access logs from Elastic Load Balancers can also be sent to a specific AWS S3 bucket, as detailed in the AWS documentation.
There are different files for each specific ELB endpoint and for every five-minute interval (so, at least 288 files per endpoint per day). The files are space-delimited, with quoted strings.
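For reference, a classic ELB access log entry looks roughly like this (an illustrative, made-up line, not copied from a real log). Note the quoted request and user-agent fields: they contain spaces, which is why a naive split on spaces won't work.
2017-05-30T23:59:58.123456Z my-elb 203.0.113.10:54321 10.0.2.15:80 0.000042 0.001337 0.000021 200 200 0 2310 "GET https://example.com:443/ HTTP/1.1" "Mozilla/5.0" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2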
Steps
Pull down the files for today only:
aws s3 sync "s3://elb-log-bucket/AWSLogs/1accountnumber9/elasticloadbalancing/us-west-2/$(date +%Y)/$(date +%m)/$(date +%d)/" ./tmp/
Decompress the files:
gzip -d ./tmp/*.gz
Output all lines, and sort them:
cat ./tmp/* | sort
Split the lines on space delimiters (without breaking up quoted fields), and convert to tab delimiters for easier handling:
Ruby
STDIN.readlines.each{ |line|
puts line.split(/\s(?=(?:[^"]|"[^"]*")*$)/).join("\t")
}
Python
import sys, re
[sys.stdout.write('\t'.join(re.split(r'\s(?=(?:[^\"]|\"[^\"]*\")*$)', line.rstrip('\n'))) + '\n') for line in sys.stdin]
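As a quick sanity check, you can pipe a toy line (hypothetical input, not a real log entry) through either one-liner and confirm that quoted fields survive intact:
echo 'a b "c d" e' | ruby -e 'STDIN.readlines.each{|line| puts line.split(/\s(?=(?:[^"]|"[^"]*")*$)/).join("\t")}'
# prints: a<TAB>b<TAB>"c d"<TAB>e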
Full solution
#!/bin/bash
# Check whether ruby or python is available, keeping only the exit status of each lookup
which ruby > /dev/null 2>&1
ruby=$?
which python > /dev/null 2>&1
python=$?
mkdir ./tmp
aws s3 sync "s3://elb-log-bucket/AWSLogs/1accountnumber9/elasticloadbalancing/us-west-2/$(date +%Y)/$(date +%m)/$(date +%d)/" ./tmp/
gzip -d ./tmp/*.gz
if [ ${ruby} = 0 ]
then cat ./tmp/* | sort | ruby -e 'STDIN.readlines.each{|line| puts line.split(/\s(?=(?:[^"]|"[^"]*")*$)/).join("\t")}' > "$(date +%Y-%m-%d)_elb_logfile.tsv"
elif [ ${python} = 0 ]
then cat ./tmp/* | sort | python -c "import sys, re; [sys.stdout.write('\t'.join(re.split(r'\s(?=(?:[^\"]|\"[^\"]*\")*$)', line.rstrip('\n'))) + '\n') for line in sys.stdin]" > "$(date +%Y-%m-%d)_elb_logfile.tsv"
else
echo "Could not find Ruby or Python"
exit 1
fi
rm -rf ./tmp
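Running this creates a TSV of all of today's ELB log entries, which is just as easy to slice as the CloudFront output. For example, to see the ten slowest requests by backend processing time (field 6 in the classic ELB log format; verify against your own data):
sort -t$'\t' -k6,6 -rn "$(date +%Y-%m-%d)_elb_logfile.tsv" | head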
Conclusion
The final product can be easily tweaked to consistently return a different day or accept a date argument. Once you’re armed with properly combined, sorted, and formatted log files, it is quite easy to track down specific events. The TSV format works well for command-line processing, text editors, and even spreadsheet tools like Microsoft Excel.
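For example, a minimal sketch of the CloudFront script with an optional date argument (defaulting to today; a hypothetical variation, not tested against a live bucket):
#!/bin/bash
# Usage: ./cloudfront_logs.sh [YYYY-MM-DD]
TARGET_DATE="${1:-$(date +%Y-%m-%d)}"
mkdir ./tmp
aws s3 sync --exclude "*" --include "*${TARGET_DATE}*" s3://cloudfront-log-bucket ./tmp
gzip -d ./tmp/*.gz
tail -q -n+3 ./tmp/* | sort > "${TARGET_DATE}_cloudfront_logfile.tsv"
rm -rf ./tmp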