A simple Python script to tally URLs and their sizes from nginx logs

The nginx access log format looks like this:
113.221.56.131 - [05/Feb/2015:18:31:19 +0800] " ab.baidu.com GET /media/game/a_.jpg HTTP/1.1" 200 169334 "http://laoma123.ybgj01.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQWubi 133)"
113.120.80.216, 113.21.213.35 - [05/Feb/2015:18:33:22 +0800] " ab.baidu.net GET /media/game/a_.jpg HTTP/1.1" 200 169334 "http://a155622.ybgj7.net/" "Mozilla/5.0 (Linux; U; Android 4.1.2; zh-cn; GT-P3100 Build/JZO54K) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 MQQBrowser/5.3 Mobile Safari/533.1 V1_AND_SQ_5.0.0_146_YYB2_D QQ/5.0.0.2215"
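Lines in this shape come from a custom log_format rather than nginx's default combined format. A rough reconstruction from the samples (my guess, not from the original post) would be something like:

log_format main '$remote_addr - [$time_local] "$host $request" '
                '$status $body_bytes_sent "$http_referer" "$http_user_agent"';

The stray space after the opening quote suggests an empty variable in the real format, and the comma-separated client IPs in the second line look like an X-Forwarded-For value landing in the $remote_addr slot.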
My first thought was a shell one-liner with awk:
awk '{arr[$8]+=$11} END {for (i in arr) print i "\t" arr[i]}' access.log
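On the first sample line this works: counting whitespace-separated fields (the stray space after the opening quote is field 5), $8 is the request path and $11 the byte count. But the second sample line carries two comma-separated client IPs, which shifts every field one to the right, so there $8 is "GET" and $11 is the status code 200. Over the two sample lines the command would print something like:

/media/game/a_.jpg	169334
GET	200

That field drift is exactly the inconvenience mentioned below.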
But matching by fixed field positions turned out to be unreliable, and I also wanted to practice some Python, so I wrote a small Python script instead. Go easy on me.
#!/usr/bin/env python
#coding=utf-8
# Parse the nginx access log for the requested URL paths and total up
# the bytes served for each path.
import os, re, sys, datetime

reload(sys)
sys.setdefaultencoding('utf-8')

lastday = datetime.date.today() - datetime.timedelta(days=1)
yesterday = lastday.strftime('%Y-%m-%d')
nginx_log_path = "/usr/local/nginx/logs/access.log" + yesterday
pattern_path = re.compile(r'GET\s*(.*)\s*HTTP')
pattern_size = re.compile(r'HTTP/1.1"\s\?*\d{3}\s\?*(\d*)')

def path_size(log_path):
    dic = {}
    f = file(log_path)
    for line in f:
        m_size = pattern_size.search(line)
        m_path = pattern_path.search(line)
        if m_path and m_size:
            size = int(m_size.group(1))
            path = m_path.group(1)
            if path in dic:  # if this path was seen before, start from its accumulated size
                size_init = int(dic[path])
            else:
                size_init = 0
            size = size + size_init
            dic[path] = size
    f.close()
    return dic

def run():
    pa_si = path_size(nginx_log_path)
    sor_l = sorted(pa_si.iteritems(), key=lambda x: x[1], reverse=True)  # sort URLs by total size, descending
    filename = '/tmp/nginx_log_check.log' + yesterday
    f = open(filename, 'a+')
    for k, v in sor_l:
        a = '%s\t\t\t%s' % (k, v)
        print >>f, a
    f.close()

if __name__ == '__main__':
    run()
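Note the script above is Python 2 only (reload(sys)/setdefaultencoding, file(), dict.iteritems(), print >>f). For anyone on Python 3, here is a minimal sketch of the same idea; it assumes the same log path and output filename as the original, and tidies the size regex (the original's \?* matches zero or more literal question marks, which is effectively a no-op):

#!/usr/bin/env python3
# Minimal Python 3 sketch: sum the bytes served per GET path in yesterday's
# nginx access log, then write the totals sorted descending.
import re
import datetime
from collections import defaultdict

yesterday = (datetime.date.today() - datetime.timedelta(days=1)).strftime('%Y-%m-%d')
nginx_log_path = '/usr/local/nginx/logs/access.log' + yesterday

pattern_path = re.compile(r'GET\s*(.*?)\s*HTTP')         # path between GET and HTTP
pattern_size = re.compile(r'HTTP/1.1"\s*\d{3}\s*(\d+)')  # skip the status, capture the byte count

def path_size(log_path):
    sizes = defaultdict(int)  # path -> accumulated bytes; replaces the manual dict bookkeeping
    with open(log_path, encoding='utf-8', errors='replace') as f:
        for line in f:
            m_path = pattern_path.search(line)
            m_size = pattern_size.search(line)
            if m_path and m_size:
                sizes[m_path.group(1)] += int(m_size.group(1))
    return sizes

def run():
    filename = '/tmp/nginx_log_check.log' + yesterday
    with open(filename, 'a') as f:
        for path, total in sorted(path_size(nginx_log_path).items(),
                                  key=lambda kv: kv[1], reverse=True):
            f.write('%s\t\t\t%s\n' % (path, total))

if __name__ == '__main__':
    run()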
This post is from the "往学习的路上前景" blog; please contact the author before reposting.
Original post: http://jonyisme.blog.51cto.com/3690784/1617985