python requests抓取猫眼电影

时间：2017-10-12 15:38:00 阅读：234 评论：0 收藏：0 [点我收藏+]

标签：map import utf-8 dump images 字典 open file compile

1. 网址：http://maoyan.com/board/4?

技术分享

2. 代码：

 1 import json
 2 from multiprocessing import Pool
 3 import requests
 4 from requests.exceptions import RequestException
 5 import re
 6 
 7 
 8 def get_one_page_html(url):
 9     try:
10         response = requests.get(url)
11         if response.status_code == 200:
12             return response.text
13         return None
14     except RequestException:
15         return None
16 
17 def parse_one_page(html):
18     pattern = re.compile(‘<dd>.*?board-index.*?>(\d+)</i>.*?alt.*?src="(.*?)".*?name"><a‘
19                +‘.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>‘
20                +‘.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>‘, re.S)# .可以匹配任意的换行符
21 
22     items = re.findall(pattern,html)
23     #(‘1‘, ‘http://p1.meituan.net/movie/20803f59291c47e1e116c11963ce019e68711.jpg@160w_220h_1e_1c‘, ‘霸王别姬‘, ‘\n                主演：张国荣,张丰毅,巩俐\n        ‘, ‘上映时间：1993-01-01(中国香港)‘, ‘9.‘, ‘6‘),
24     for item in items:
25         yield {
26             ‘index‘ : item[0],
27             ‘image‘ : item[1],
28             ‘title‘:item[2],
29             ‘actor‘ : item[3].strip()[3:],
30             ‘time‘: item[4].strip()[5:],
31             ‘score‘ : item[5] + item[6]
32         }
33 
34 def write_to_file(content):
35     with open(‘result.txt‘, ‘a‘, encoding=‘utf-8‘)as f:
36         f.write(json.dumps(content, ensure_ascii=False) + ‘\n‘)#导入快捷见alt+enter,content内容是个字典，我们要把它变成字符串写入文件,加入换行符，每行一个
37         f.close()
38 
39 def main(offset):
40     url = ‘http://maoyan.com/board/4?offset=‘ + str(offset)
41     html = get_one_page_html(url)
42     for item in parse_one_page(html):
43         print(item)
44         write_to_file(item)  #会变成unicode编码，若想result.txt里面是中文,需要修改write_to_file函数，加上encoding=‘utf-8’和ensure_ascii=False
45 
46 if __name__ == ‘__main__‘:
47     # for i in range(10):
48     #     main(i*10)
49 
50     pool = Pool()
51     pool.map(main, [i*10 for i in range(10)])

View Code

3. 结果：

注意：

1.正则匹配要好好看看

2.将输出的内容格式化，变成一个生成器字典

3.写到文件的时候把unicode编码变成中文显示

4.进程池Pool。实现秒抓

python requests抓取猫眼电影

标签：map import utf-8 dump images 字典 open file compile

原文地址：http://www.cnblogs.com/rrl92/p/7656128.html

踩

(0)

评论一句话评论（0）

分享档案

更多>

2021年07月29日 (22)
2021年07月28日 (40)
2021年07月27日 (32)
2021年07月26日 (79)
2021年07月23日 (29)
2021年07月22日 (30)
2021年07月21日 (42)
2021年07月20日 (16)
2021年07月19日 (90)
2021年07月16日 (35)

周排行