A Jianshu crawler: set the number of pages, then scrape article titles, abstracts, and links

Date: 2018-11-03 10:27:09


# coding=utf-8
import requests
from bs4 import BeautifulSoup

# Request headers copied from a browser session. X-INFINITESCROLL and
# X-Requested-With presumably mimic the home page's AJAX "load more" request.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0',
    'Accept': 'text/html, */*; q=0.01',
    'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
    'Accept-Encoding': 'gzip, deflate',
    'Referer': 'https://www.jianshu.com/',
    'X-INFINITESCROLL': 'true',
    'X-Requested-With': 'XMLHttpRequest',
    'Connection': 'close',
}

m = input("Enter the number of pages to scrape: ")
# range() excludes its end value, so add 1 to fetch pages 1..m.
for i in range(1, int(m) + 1):
    url = "https://www.jianshu.com/?page=" + str(i)
    html = requests.get(url=url, headers=headers)
    # Decode the body as UTF-8 directly instead of re-encoding html.text,
    # which fails when requests cannot determine html.encoding.
    html.encoding = 'utf-8'
    soup = BeautifulSoup(html.text, 'html.parser')
    # Print the HTML in formatted form (for debugging)
    # print(soup.prettify())
    titles = soup.find_all('a', 'title')
    titlesp = soup.find_all('p', 'abstract')
    with open(r"./文章简介.txt", "a", encoding='utf-8') as file:
        for (title, titlep) in zip(titles, titlesp):
            # .string is None when a tag holds more than one child node,
            # so use get_text() and trim the surrounding whitespace.
            file.write(title.get_text().strip() + '\n')
            file.write(titlep.get_text().strip() + '\n')
            file.write("https://www.jianshu.com" + title.get('href') + '\n\n')

print("Done. Results saved to: ./文章简介.txt")
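
The parsing step can be tried without touching the network. The following is a minimal sketch, assuming the page markup matches what the script looks for (`a` tags with class `title` and `p` tags with class `abstract`). The function name `parse_articles` and the sample HTML are made up for illustration and are not part of the original script.

```python
from bs4 import BeautifulSoup

BASE_URL = "https://www.jianshu.com"

# Hypothetical sample markup, shaped like what the script expects.
SAMPLE_HTML = """
<ul>
  <li>
    <a class="title" href="/p/abc123">First post</a>
    <p class="abstract">
      A short summary.
    </p>
  </li>
  <li>
    <a class="title" href="/p/def456">Second <em>post</em></a>
    <p class="abstract">Another summary.</p>
  </li>
</ul>
"""


def parse_articles(html_text):
    """Return a list of (title, abstract, link) tuples found in html_text."""
    soup = BeautifulSoup(html_text, "html.parser")
    titles = soup.find_all("a", "title")        # <a class="title">
    abstracts = soup.find_all("p", "abstract")  # <p class="abstract">
    articles = []
    # zip() pairs titles and abstracts by position and stops at the shorter list.
    for title, abstract in zip(titles, abstracts):
        articles.append((
            title.get_text().strip(),
            abstract.get_text().strip(),
            BASE_URL + title.get("href"),
        ))
    return articles


if __name__ == "__main__":
    for title, abstract, link in parse_articles(SAMPLE_HTML):
        print(title, abstract, link, sep="\n", end="\n\n")
```

Because titles and abstracts are paired by position, an article without an abstract would shift every later pairing; looking up both tags inside each list item would be more robust, if the real markup allows it.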

Environment: Python 3

Modules: requests, bs4
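
The script builds each page URL and each article link by string concatenation. A minimal sketch of the same two steps using only the standard library's `urllib.parse` might look like this. The helper names `page_urls` and `absolute_link` are hypothetical, and the `?page=N` paging scheme is simply carried over from the script rather than verified against the current site.

```python
from urllib.parse import urlencode, urljoin

BASE_URL = "https://www.jianshu.com/"


def page_urls(page_count):
    """Return the URLs for pages 1..page_count, inclusive."""
    return [
        BASE_URL + "?" + urlencode({"page": i})
        for i in range(1, page_count + 1)
    ]


def absolute_link(href):
    """Resolve an href taken from the page against the site root."""
    return urljoin(BASE_URL, href)


if __name__ == "__main__":
    for url in page_urls(3):
        print(url)
    print(absolute_link("/p/abc123"))
```

Unlike plain concatenation, `urljoin` also leaves an already absolute `href` unchanged.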


Original article: https://www.cnblogs.com/0day-li/p/9899842.html
