
Scraping Wolf Warrior (战狼) reviews from Douban and saving them to CSV

Date: 2017-09-23 20:13:25


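The source code is below. This was my first time saving scraped data to CSV, and I ran into all kinds of pitfalls along the way, but I got it working in the end. When writing a CSV file, pay special attention to the encoding of the output stream. On Windows, Python's open() falls back to the system locale encoding when none is given, which is typically gbk on a Chinese-language system. So if your page data is utf-8, be careful to pass encoding='utf-8' when you open the file for writing. This code was painful to write, sigh...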
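
The encoding point can be checked in isolation. The following is a minimal sketch, not the crawler itself: the sample rows and the temporary file path are made up for illustration, and only the standard library is used. It writes a small CSV with an explicit encoding and reads it back to confirm the Chinese text survives the round trip.

```python
import csv
import os
import tempfile

# Hypothetical sample rows standing in for scraped (nickname, comment) pairs.
rows = [("用户A", "很好看"), ("user_b", "Not bad, worth watching")]

# Write into a temporary directory so the sketch leaves no file behind in the project.
path = os.path.join(tempfile.mkdtemp(), "name.csv")

# encoding="utf-8" overrides the platform default (often gbk on Chinese Windows);
# newline="" stops the csv module from producing blank lines between rows on Windows.
with open(path, "w", encoding="utf-8", newline="") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["name", "text"])
    writer.writerows(rows)

# Read the file back with the same encoding to confirm the text is intact.
with open(path, encoding="utf-8", newline="") as csvfile:
    read_back = [tuple(r) for r in csv.reader(csvfile)]

print(read_back[1:] == rows)
```

If the file is meant to be opened in Excel on Windows, encoding="utf-8-sig" may be the more convenient choice, since it writes a byte order mark that Excel uses to detect UTF-8.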




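# -*- coding: utf-8 -*-
import csv
import random
import time

import requests
from lxml import etree

# Local helper modules written by the author (not shown in this post):
# url.start_url is the list of pages to fetch, log.kk() records a message.
import log
import url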

class douban(object):

    # def __init__(self, *args):
    #     self.args = args

    def save_csv(self, a):
        # newline='' keeps the csv module from writing blank lines between rows on Windows
        with open('name.csv', 'w', encoding='utf-8', newline='') as csvfile:
            fieldnames = ['name', 'text']
            writer = csv.writer(csvfile)
            writer.writerow(fieldnames)
            for ii in a:
                r1 = list(ii)
                # print(r1)
                writer.writerow(r1)

    def data(self):
        headers = {}
        # Pool of browser User-Agent strings; one is picked at random for the whole run
        user_agent_list = [
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
            "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        ]
        data1 = []
        headers['User-Agent'] = random.choice(user_agent_list)
        headers['Cookie'] = 'bid=zckzLSu-4yo; ct=y; ll="118281"; ps=y; _ga=GA1.2.674505911.1497101804; __yadk_uid=YpeEdxJogOjKSTGvgtroL3njlJTyFnY0; _pk_ref.100001.4cf6=%5B%22%22%2C%22%22%2C1506143923%2C%22https%3A%2F%2Fwww.douban.com%2Fsearch%3Fcat%3D1002%26q%3D%25E6%2588%2598%25E7%258B%25BC%22%5D; dbcl2="152833433:M8pNlsM5zpM"; ck=zZoL; _pk_id.100001.4cf6=c541707976c90957.1505826348.8.1506146949.1506136834.; _pk_ses.100001.4cf6=*; __utma=30149280.674505911.1497101804.1506136834.1506143935.20; __utmb=30149280.1.10.1506143935; __utmc=30149280; __utmz=30149280.1505632166.12.12.utmcsr=baidu|utmccn=(organic)|utmcmd=organic; __utmv=30149280.15283; __utma=223695111.674505911.1497101804.1506136834.1506143935.8; __utmb=223695111.0.10.1506143935; __utmc=223695111; __utmz=223695111.1505826348.1.1.utmcsr=douban.com|utmccn=(referral)|utmcmd=referral|utmcct=/search; push_noty_num=0; push_doumail_num=0; ap=1'
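        # print(headers)

        for i in url.start_url:
            try:
                while True:
                    time.sleep(3)  # pause between requests
                    # Without a timeout, requests never raises ReadTimeout
                    html = requests.get(i, headers=headers, timeout=10)
                    html.encoding = 'utf-8'
                    reqs = str(html.text)
                    if html.status_code == 200:
                        log.kk('Page downloaded successfully')
                        break
                    else:
                        log.kk('Failed to download page')
            except requests.exceptions.ReadTimeout as e:
                log.kk(e)
                continue  # skip this page; there is no HTML to parse

            selector = etree.HTML(reqs)
            nickname = selector.xpath('//span[@class="comment-info"]/a/text()')
            pinglun = selector.xpath('//div[@class="comment"]/p/text()')
            data1.extend(zip(nickname, pinglun))

        # Write once after the loop: save_csv opens the file in 'w' mode,
        # so calling it for every page would keep only the last page.
        self.save_csv(data1)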


if __name__ == '__main__':
    cu = douban()
    cu.data()
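
The download step in the script retries in an unbounded loop, so a page that keeps returning a non-200 status would block the whole run. Below is a minimal sketch of a bounded retry, written so it can be tried without network access. The names `fetch_with_retry`, `max_attempts`, `delay`, and `fake_fetch` are hypothetical and not part of the original script, and the `fetch` callable is an assumed wrapper that returns a `(status_code, text)` pair.

```python
import time


def fetch_with_retry(fetch, page_url, max_attempts=3, delay=3.0, sleep=time.sleep):
    """Call fetch(page_url) until it returns status 200 or attempts run out.

    fetch is any callable returning a (status_code, text) pair.
    Returns the page text, or None if every attempt failed.
    """
    for _ in range(max_attempts):
        sleep(delay)  # wait before each request, as the original loop does
        try:
            status_code, text = fetch(page_url)
        except TimeoutError:
            continue  # count a timeout as a failed attempt and try again
        if status_code == 200:
            return text
    return None


# Usage with a fake fetcher that fails twice and then succeeds.
calls = []


def fake_fetch(page_url):
    calls.append(page_url)
    return (200, "<html>ok</html>") if len(calls) == 3 else (503, "")


# sleep is replaced with a no-op so the demonstration runs instantly.
result = fetch_with_retry(fake_fetch, "https://example.com/page", sleep=lambda s: None)
gave_up = fetch_with_retry(lambda u: (500, ""), "https://example.com/page",
                           max_attempts=2, sleep=lambda s: None)
print(result, len(calls), gave_up)
```

In the real crawler, `fetch` would wrap `requests.get(page_url, headers=headers, timeout=10)`. The sketch catches the built-in `TimeoutError` only to stay dependency-free; with requests, the exception to catch would be `requests.exceptions.Timeout`, which is the base class of `ReadTimeout`.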


Original article: http://www.cnblogs.com/Huangsh2017Come-on/p/7582106.html
