Notes on analyzing NGINX logs with bash + Python

Date: 2015-12-15 19:33:20

Tags: python awk grep linux bash

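Goal:

Count article page views (PV) separately for each article type, and list the results in descending order of PV.

Analysis:

Article-page URLs can be picked out of the NGINX log by their URL pattern, and the article ID can be derived from each URL.

With the ID, the article's type can be looked up in the database.

Steps:

Extract the URL fragment of every article detail page from the log, then deduplicate and count the hits per article, saving the result as m1214.cnt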

cat access.log | grep -o "GET http://www.***.com/content.* HTTP" | grep -Po "\d(/.*\.html)" | sort | uniq -c > m1214.cnt
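For readers who prefer to stay in Python, the extraction and counting step can be sketched roughly as follows. This is only an illustration of the same idea as the grep | sort | uniq -c pipeline, not the method used in this post. The host www.example.com is a placeholder for the masked domain, and the function name count_fragments is made up for this example.

```python
import re
from collections import Counter

# Assumption: the request part of each log line looks like
# "GET http://<host>/content... HTTP". The host is a placeholder.
REQUEST_RE = re.compile(r"GET http://www\.example\.com/content.* HTTP")
# Same idea as the second grep: a digit, a slash, then everything up to ".html".
FRAGMENT_RE = re.compile(r"\d/.*\.html")


def count_fragments(lines):
    """Count how often each article URL fragment appears in the log lines."""
    counts = Counter()
    for line in lines:
        request = REQUEST_RE.search(line)
        if not request:
            continue  # not an article detail page
        fragment = FRAGMENT_RE.search(request.group(0))
        if fragment:
            counts[fragment.group(0)] += 1
    return counts
```

Calling count_fragments(open("access.log")) would return a Counter keyed by URL fragment, which plays the role of m1214.cnt.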

Strip the leading spaces from each line of m1214.cnt and separate the two columns with a tab

Column 1: hit count

Column 2: URL fragment

cat m1214.cnt | awk '{print $1"\t"$2}' > m1214_1.cnt

Sort by the first column in descending order

cat m1214_1.cnt | sort -rn -k 1 > m1214.sort
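The awk and sort steps can also be combined in a few lines of Python. The sketch below assumes the input lines have the shape produced by uniq -c (leading spaces, a count, then the fragment). The function name sort_uniq_counts is invented for this example. Rows with equal counts keep their input order here, which may differ from how sort orders ties.

```python
def sort_uniq_counts(lines):
    """Parse `uniq -c` style lines and return tab-separated rows, highest count first."""
    rows = []
    for line in lines:
        parts = line.split()  # split() also discards the leading spaces
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        rows.append((int(parts[0]), parts[1]))
    # Numeric, descending sort on the count column
    rows.sort(key=lambda row: row[0], reverse=True)
    return ["%d\t%s" % row for row in rows]
```

Writing the returned rows to a file, one per line, would give a file equivalent to m1214.sort.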

Use Python to read m1214.sort, convert the URL fragment in the second column into an article ID, and query the database for its type

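#coding=utf-8
import re
import sys

import MySQLdb

# Matches fragments such as "<digit>/<encoded id>.html" and captures the encoded part
PATTERN = re.compile(r'\d/(\d*\w*[0-9])\.html')


def c_decode(fragment):
    # Decodes the URL fragment into the database ID; the logic is omitted here.
    # `id` stands for the result of that omitted decoding step.
    return int(id)


if __name__ == "__main__":
    try:
        conn = MySQLdb.connect('***', '***', '***', '***')
        cur = conn.cursor()
        cur.execute("set names utf8")
    except MySQLdb.Error as e:
        print("Mysql Error: %s" % e)
        sys.exit(1)

    # Open the input and the output once instead of reopening the output for every row
    with open("./m1214.sort") as src, open('./type_11.sort', 'a') as out:
        for line in src:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue
            match = PATTERN.match(fields[1])
            if not match:
                continue
            article_id = str(c_decode(match.group(1)))
            # Parameterized query: the driver takes care of quoting
            cur.execute("SELECT id, title FROM content WHERE id = %s AND type = 11", (article_id,))
            result = cur.fetchone()
            if result:
                # Output columns: hit count, article URL, title
                out.write(fields[0] + "\t" + "http://www.***.com/content/id/" + article_id + "\t" + result[1] + "\r\n")
    cur.close()
    conn.close()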
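The script above only handles type 11. To cover the stated goal of per-type statistics in one pass, the grouping could be sketched as below. This is an illustration under assumptions, not part of the original script: lookup is a hypothetical callable standing in for the database query, and group_by_type is a name invented for this example. Because the input rows are already sorted by PV, each per-type list stays in descending order.

```python
from collections import defaultdict


def group_by_type(rows, lookup):
    """Group (pv, article_id) rows by article type.

    `lookup` is a hypothetical stand-in for the database query: it returns
    (type, title) for an article ID, or None if the ID is unknown.
    """
    grouped = defaultdict(list)
    for pv, article_id in rows:
        found = lookup(article_id)
        if found is None:
            continue  # the ID is not in the content table
        article_type, title = found
        grouped[article_type].append((pv, article_id, title))
    return dict(grouped)
```

Each key of the returned dict could then be written to its own file, for example type_11.sort for key 11.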

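Conclusion:

The approach is rather clumsy, but it exercises many common Linux commands along with some Python programming, so I am recording it here.

This article comes from the “小兵yuri” blog; please keep this attribution: http://87453343.blog.51cto.com/8606892/1723312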
