码迷,mamicode.com

Scraping the Douban Top 250 Movie List

Posted: 2018-11-26 00:13:33


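# Page to scrape: https://movie.douban.com/top250?start=25&filter=
import re
from urllib.request import urlopen

# Named groups: id (rank), move_name (title), move_d (director),
# move_t (year), content (genre), infor (one-line quote).
# re.S lets "." match newlines, because each item spans several lines.
com = re.compile(
    '<div class="item">(?:.*?)<em class="">(?P<id>.*?)</em>'
    '(?:.*?)alt=(?P<move_name>.*?)src(?:.*?)导演:'
    '(?P<move_d>.*?)&nbsp;(?:.*?)<br>(?P<move_t>.*?)&nbsp'
    '(?:.*?)&nbsp;/&nbsp;(?P<content>.*?)</p>'
    '(?:.*?)<span class="inq">(?P<infor>.*?)</span>',
    re.S,
)


def getPage(url):
    """Download the page and decode it as UTF-8."""
    response = urlopen(url)
    return response.read().decode('utf-8')


def parsePage(s):
    """Yield one dict per movie matched by the pattern."""
    for i in com.finditer(s):
        yield {
            'id': i.group('id'),
            'move_name': i.group('move_name'),
            'move_d': i.group('move_d'),
            'move_t': i.group('move_t'),
            'content': i.group('content'),
            'infor': i.group('infor'),
        }


def main(num):
    """Fetch the page at offset num and append its movies to the file 'move'."""
    url = 'https://movie.douban.com/top250?start=%s&filter=' % num
    res = getPage(url)
    with open('move', mode='a+', encoding='utf-8') as f:
        for obj in parsePage(res):
            print(obj)
            # str(obj) renders newlines as the two characters "\n";
            # strip those and all spaces before writing one line per movie.
            data1 = str(obj).replace('\\n', '')
            data2 = data1.replace(' ', '')
            f.write(data2 + '\n')


# 10 pages of 25 movies each: start = 0, 25, ..., 225
count = 0
for i in range(10):
    main(count)
    count += 25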
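The named-group approach can be tried offline before sending any requests. The following is a minimal sketch, not the original author's code: `SAMPLE_HTML` is a hypothetical, hand-written fragment that only imitates the structure the pattern expects (it is not copied from the real page), the pattern is reduced to three fields, and `ITEM_RE`, `parse_items` and `page_offsets` are illustrative names.

```python
import re

# Hypothetical fragment imitating the structure the pattern expects;
# it is NOT copied from the real page.
SAMPLE_HTML = """
<div class="item">
  <em class="">1</em>
  <img alt="Movie A" src="a.jpg">
  <span class="inq">Quote A</span>
</div>
<div class="item">
  <em class="">2</em>
  <img alt="Movie B" src="b.jpg">
  <span class="inq">Quote B</span>
</div>
"""

# Reduced pattern (an assumption for illustration): rank, title, quote.
ITEM_RE = re.compile(
    r'<div class="item">.*?<em class="">(?P<id>.*?)</em>'
    r'.*?alt="(?P<name>.*?)"'
    r'.*?<span class="inq">(?P<quote>.*?)</span>',
    re.S,  # let "." match newlines, since each item spans several lines
)


def parse_items(html):
    """Yield one dict per match, keyed by the pattern's group names."""
    for match in ITEM_RE.finditer(html):
        yield match.groupdict()


def page_offsets(total=250, page_size=25):
    """Return the start= value of every page: 0, 25, ..., 225."""
    return list(range(0, total, page_size))


items = list(parse_items(SAMPLE_HTML))
print(items)
print(page_offsets())
```

Here `match.groupdict()` stands in for building the dictionary field by field, and `range(0, total, page_size)` stands in for the manual counter. One caveat of lazy `.*?` combined with `re.S`: if an item lacks one of the expected tags, a match can run on into the next item, so results are worth spot-checking against the page.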


Original source: http://blog.51cto.com/13747953/2321800
