import random
import time

import requests
from bs4 import BeautifulSoup

book_name = 'jieqishu'  # novel slug, used in the URL and as the output filename
book_url = 'http://www.jieqishu.com' + '/' + book_name + '/'  # build the novel's index URL
response = requests.get(url=book_url)
response.encoding = response.apparent_encoding  # fix the character encoding
soup = BeautifulSoup(response.text, features='html.parser')
a = soup.find(id='list')  # container holding the chapter list
dd_all = a.find_all('dd')  # one <dd> per chapter link
http_all = []
for i in dd_all:
    http_all.append(book_url + i.find('a').attrs.get('href'))
http_all = http_all[8:]  # skip the duplicate "latest chapters" entries at the top of the list
m = 5  # cap the number of chapters fetched while testing
with open(book_name + '.txt', 'w') as f:
    n = 0  # chapter counter
    for i in http_all:
        if m == n:
            break
        h = requests.get(url=i)
        h.encoding = h.apparent_encoding
        hb = BeautifulSoup(h.text, features='html.parser')
        tar_t = hb.find(id='content')  # chapter body
        tar_h = hb.find('h1').text     # chapter title
        f.write(tar_h + '\n')
        for j in tar_t:
            if str(j) != '<br/>':
                f.write(str(j).lstrip() + '\n')
        time.sleep(random.randint(3, 6))  # random delay between requests to avoid an IP ban
        n += 1
        f.write('\n\n')
        print('Chapter %d written!' % n)
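
The random sleep above is the script's only anti-ban measure. As a complementary sketch (not part of the original script), a small fetch helper can also send a browser-like User-Agent and retry failed requests with a backoff; the header string, retry count, and backoff values below are illustrative assumptions.

import time
import requests

def fetch(url, retries=3, backoff=5):
    # Browser-like User-Agent; the exact string is an illustrative assumption.
    headers = {'User-Agent': 'Mozilla/5.0'}
    for attempt in range(retries):
        try:
            r = requests.get(url, headers=headers, timeout=10)
            r.raise_for_status()  # treat HTTP errors (403/503, ...) as failures worth retrying
            r.encoding = r.apparent_encoding
            return r
        except requests.RequestException:
            time.sleep(backoff * (attempt + 1))  # linear backoff before the next try
    raise RuntimeError('giving up on ' + url)

With such a helper, the two requests.get(...) calls above could be swapped for fetch(book_url) and fetch(i).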
Original post: https://blog.51cto.com/12070874/2543132