Python 3 crawler: scraping East China Jiaotong University campus news

Posted: 2019-03-30 10:26:39

Tags: web page, with open, new, campus, utf-8, wow, coding, with, user

If you crawl a large number of pages, it is best to sleep for a moment between requests.
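The advice above can be sketched as a small helper that pauses between consecutive requests. This is only an illustration: `crawl_politely`, `fetch`, and the one-second default delay are hypothetical names and values chosen here, not part of the original script.

```python
import time


def crawl_politely(pages, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(page) for each page, pausing between consecutive requests.

    `fetch` is any callable that downloads one page (hypothetical here);
    `sleep` is injectable so the delay can be replaced in tests.
    """
    results = []
    for index, page in enumerate(pages):
        if index > 0:
            sleep(delay_seconds)  # pause before every request except the first
        results.append(fetch(page))
    return results
```

With the script below, something like `crawl_politely(range(10), main)` could stand in for the plain loop at the bottom, assuming a one-second pause is acceptable for the target site.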

import json
import re

import requests

# Request headers: send a browser-like User-Agent so that pages with
# anti-crawler checks do not reject the request
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"
}


# Fetch the content of one page; return None if the request fails
def get_one_page(url):
    try:
        # timeout keeps a stalled connection from hanging the crawler
        res = requests.get(url, headers=headers, timeout=10)
        if res.status_code == 200:
            return res.text
        return None
    except requests.RequestException:
        return None


# Extract the news titles from the page content
def parse_one_page(html):
    pattern = re.compile(r'<td align="left".*?<a href.*?>(.*?)</a>.*?</td>', re.S)
    items = re.findall(pattern, html)
    return items
    # for item in items:
    #     yield {
    #          "title":item.split()
    #     }


# Append one title to the output file as a JSON string, one per line
def write_to_file(content):
    # the with statement closes the file, so no explicit close() is needed
    with open('news_ecjtu.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')


def main(page):
    # page 0 is the first list page (list.htm); page n >= 1 maps to list<n+1>.htm
    if page:
        page += 1
        url = 'http://xw.ecjtu.jx.cn/1083/list' + str(page) + '.htm'
    else:
        url = 'http://xw.ecjtu.jx.cn/1083/list.htm'
    html = get_one_page(url)
    if html is None:
        # the request failed; skip this page instead of crashing in re.findall
        return

    for item in parse_one_page(html):
        write_to_file(item)


if __name__ == '__main__':
    for i in range(10):  # 582
        main(i)
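To see what the regular expression in parse_one_page captures, here is a minimal sketch that runs the same pattern against a small HTML fragment. The fragment is made up for illustration and only assumes that each title sits in an `<a>` tag inside a `<td align="left">` cell; the markup of the real news page may differ.

```python
import re

# Same pattern as parse_one_page; re.S lets "." match newlines inside a cell
TITLE_PATTERN = re.compile(r'<td align="left".*?<a href.*?>(.*?)</a>.*?</td>', re.S)

# Hypothetical fragment, invented for this example
SAMPLE_HTML = """
<tr>
  <td align="left">
    <a href="/a/1.htm" target="_blank">First headline</a>
  </td>
  <td align="right">2019-03-29</td>
</tr>
<tr>
  <td align="left">
    <a href="/a/2.htm" target="_blank">Second headline</a>
  </td>
  <td align="right">2019-03-28</td>
</tr>
"""

titles = TITLE_PATTERN.findall(SAMPLE_HTML)
print(titles)  # ['First headline', 'Second headline']
```

The lazy `.*?` quantifiers keep each match inside a single cell, which is why the right-aligned date cells are skipped. A pattern like this is sensitive to markup changes, so an HTML parser may be a sturdier choice if the page layout is revised.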

Original article: https://www.cnblogs.com/z-712/p/10625255.html
