Scraping a Novel from 笔趣阁 (Biquge)

Date: 2019-11-30 19:24:52


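《修罗武神》 (Xiuluo Wushen) is a web novel serialized on the 17K novel site and written by 善良的蜜蜂. It tells the story of a boy who rises from an outer-court disciple of a second-rate sect in the lower realm to a leading figure in the upper realm. The book was selected among the top 100 works of the 3rd 橙瓜 (Chenggua) Online Literature Awards.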


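Programming is only a tool for achieving a goal.

So the key is to analyze what we actually need.

Fetching the novel's table-of-contents page is the starting point. It holds the link and title of every chapter, which is exactly what we need.

With the chapter links in hand, we then visit each one to get the chapter's content.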
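As a rough illustration of that first step, the sketch below pulls chapter titles and links out of a table-of-contents page. It is only a sketch under stated assumptions: the `<div id="list">` wrapper, the sample HTML, and the second chapter's URL are hypothetical and not taken from the real site, and the names `ChapterLinkParser` and `list_chapters` are made up for this example. It uses the standard library's `HTMLParser` instead of BeautifulSoup so that it runs without extra dependencies.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ChapterLinkParser(HTMLParser):
    """Collect (title, url) pairs from <a> tags inside <div id="list">."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.chapters = []    # collected (title, absolute_url) pairs
        self._div_depth = 0   # > 0 while inside the chapter-list <div>
        self._href = None     # href of the <a> currently being read
        self._text = []       # text fragments of the current <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            # Start counting at the list <div>, and track nested <div>s inside it
            if self._div_depth or attrs.get("id") == "list":
                self._div_depth += 1
        elif tag == "a" and self._div_depth and attrs.get("href"):
            self._href = attrs["href"]
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self._div_depth:
            self._div_depth -= 1
        elif tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            self.chapters.append((title, urljoin(self.base_url, self._href)))
            self._href = None


def list_chapters(html, base_url):
    """Return every chapter as a (title, absolute_url) pair, in page order."""
    parser = ChapterLinkParser(base_url)
    parser.feed(html)
    return parser.chapters


# Hypothetical sample of a table-of-contents page, not the real site markup.
SAMPLE_INDEX = """
<div id="list"><dl>
  <dd><a href="/0_638/1124120.html">Chapter 1</a></dd>
  <dd><a href="/0_638/1124121.html">Chapter 2</a></dd>
</dl></div>
<div id="footer"><a href="/about.html">About</a></div>
"""

chapters = list_chapters(SAMPLE_INDEX, "https://www.xbiquge6.com")
for title, url in chapters:
    print(title, url)
```

The link in the footer is ignored because it sits outside the list container. On a real page the container's tag and `id` would have to be checked in the browser first.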

1. First, fetch the page content

import requests


def get_content(url):
    """Fetch a page and return its HTML as text, or None if the request fails."""
    headers = {
        # Present a browser User-Agent to the server
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36',
    }
    try:
        r = requests.get(url=url, headers=headers, timeout=10)
        r.raise_for_status()  # treat HTTP error statuses as failures
        r.encoding = 'utf-8'
        return r.text
    except requests.RequestException as e:
        print("Error '%s' happened while fetching %s" % (e, url))
        return None
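`get_content` gives up after a single failed request, which ends the crawl because the next link is never found. One possible refinement, shown here only as a sketch, is to retry a few times with a growing pause. The helper `fetch_with_retry` and its parameters are hypothetical; it assumes `fetch` is any callable that returns the page text, or `None` on failure, so the check would need adapting for a function that returns an error string instead.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url) until it returns something other than None.

    Gives up and returns None after `retries` attempts.
    """
    for attempt in range(1, retries + 1):
        content = fetch(url)
        if content is not None:
            return content
        if attempt < retries:
            time.sleep(delay * attempt)  # wait a little longer after each failure
    return None


# Stand-in for a real fetch function: fails twice, then succeeds.
calls = []

def flaky_fetch(url):
    calls.append(url)
    return None if len(calls) < 3 else "<html>ok</html>"

result = fetch_with_retry(flaky_fetch, "https://example.com/page", delay=0)
print(result, len(calls))
```

Passing the fetch function in as an argument keeps the retry logic independent of `requests`, so it can be exercised without network access.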

2. Parse the content

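# Relies on q, base_url and save() from the complete code below
def praseContent(content):
    """Save one chapter, then enqueue the link to the next chapter."""
    soup = BeautifulSoup(content, 'html.parser')
    # Chapter title and chapter body
    chapter = soup.find(name='div', class_="bookname").h1.text
    text = soup.find(name='div', id="content").text
    save(chapter, text)
    # The third link in the navigation bar is taken as the "next chapter" link
    next1 = soup.find(name='div', class_="bottem1").find_all('a')[2].get('href')
    # '/0_638/' is the novel's index page; any other target is the next chapter, so enqueue it
    if next1 != '/0_638/':
        q.put(base_url + next1)
    print(next1)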
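Building the next URL with `base_url + next1` works only while the `href` is root-relative (starts with `/`). A slightly more forgiving variant, sketched below, resolves the link against the current page with `urllib.parse.urljoin`. The function name `next_chapter_url` and the second chapter's file name are hypothetical; the stop condition (the link pointing back at `/0_638/`) is the one used in the code above.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.xbiquge6.com"
INDEX_PATH = "/0_638/"  # the novel's index page, as in the code above


def next_chapter_url(current_url, next_href):
    """Return the absolute URL of the next chapter, or None when there is none."""
    # A missing link, or one that points back at the index, means we are done
    if not next_href or next_href == INDEX_PATH:
        return None
    # urljoin handles both root-relative ("/0_638/x.html") and relative ("x.html") links
    return urljoin(current_url, next_href)


current = BASE_URL + "/0_638/1124120.html"
print(next_chapter_url(current, "/0_638/1124121.html"))
print(next_chapter_url(current, "1124121.html"))
print(next_chapter_url(current, "/0_638/"))
```

Returning `None` at the last chapter gives the caller a single, explicit signal for when to stop enqueuing.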

Here is the complete code:

import queue
import time

import requests
from bs4 import BeautifulSoup

# Site root and the first chapter to fetch
base_url = 'https://www.xbiquge6.com'
first_url = 'https://www.xbiquge6.com/0_638/1124120.html'

# Queue of chapter URLs still to be fetched
q = queue.Queue()


# Fetch a page
def get_content(url):
    """Fetch a page and return its HTML as text, or None if the request fails."""
    headers = {
        # Present a browser User-Agent to the server
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36',
    }
    try:
        r = requests.get(url=url, headers=headers, timeout=10)
        r.raise_for_status()  # treat HTTP error statuses as failures
        r.encoding = 'utf-8'
        return r.text
    except requests.RequestException as e:
        print("Error '%s' happened while fetching %s" % (e, url))
        return None


# Parse a chapter page
def praseContent(content):
    """Save one chapter, then enqueue the link to the next chapter."""
    soup = BeautifulSoup(content, 'html.parser')
    # Chapter title and chapter body
    chapter = soup.find(name='div', class_="bookname").h1.text
    text = soup.find(name='div', id="content").text
    save(chapter, text)
    # The third link in the navigation bar is taken as the "next chapter" link
    next1 = soup.find(name='div', class_="bottem1").find_all('a')[2].get('href')
    # '/0_638/' is the novel's index page; any other target is the next chapter, so enqueue it
    if next1 != '/0_638/':
        q.put(base_url + next1)
    print(next1)


# Append a chapter to the txt file
def save(chapter, content):
    filename = "修罗武神.txt"
    # The with block closes the file automatically
    with open(filename, "a", encoding='utf-8') as f:
        f.write(chapter + '\n')
        # Remove all whitespace from the chapter body before writing it
        f.write("".join(content.split()) + '\n')


# Main program
def main():
    start_time = time.time()
    q.put(first_url)
    # Keep going while the queue is not empty
    while not q.empty():
        content = get_content(q.get())
        if content is None:
            break  # without this page there is no next link to follow
        praseContent(content)
    end_time = time.time()
    project_time = end_time - start_time
    print('Elapsed time (s):', project_time)


if __name__ == '__main__':
    main()
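The main loop above is a simple queue-driven traversal: take a URL, process the page, enqueue whatever it links to next. The sketch below replays that pattern on a hypothetical in-memory "site" so it can be run offline. The `FAKE_SITE` dictionary and the `crawl` function are made up for illustration, and the `seen` set is an addition that the original code does not have; it guards against a page whose next link loops back to an earlier chapter.

```python
import queue

# Hypothetical in-memory "site": page URL -> (chapter title, next link or None).
FAKE_SITE = {
    "/c1.html": ("Chapter 1", "/c2.html"),
    "/c2.html": ("Chapter 2", "/c3.html"),
    "/c3.html": ("Chapter 3", "/c1.html"),  # deliberately loops back to the start
}


def crawl(first_url, site):
    """Follow next links from first_url, visiting each page at most once."""
    pending = queue.Queue()
    pending.put(first_url)
    seen = set()
    titles = []
    while not pending.empty():
        url = pending.get()
        # Skip pages already processed and links that lead nowhere
        if url in seen or url not in site:
            continue
        seen.add(url)
        title, next_url = site[url]
        titles.append(title)
        if next_url is not None:
            pending.put(next_url)
    return titles


titles = crawl("/c1.html", FAKE_SITE)
print(titles)
```

Because each chapter links to at most one successor, the queue never holds more than one URL here; it becomes more useful if the crawl is seeded with many chapter links at once, for example from the table-of-contents page.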

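Learning to scrape a novel was quite hard, but succeeding made it well worth the effort.

As I solved problems along the way, I also came to understand some of the basic operations. The best way to learn these things is to use them, picking up the knowledge by solving real problems. Learning what you need, when you need it, keeps the effort focused.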


Original post: https://www.cnblogs.com/wt714/p/11963497.html
