A Simple Novel Scraper

Date: 2019-05-25 22:48:06


import urllib.request
import re

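# Scraping a novel is the most basic kind of crawler. Once you understand the approach
# you can build more advanced crawlers: the idea is the same, and only the libraries
# involved or issues such as JavaScript rendering and asynchronous loading differ.

url = "https://www.qb5200.tw/xiaoshuo/36/36143/"  # index page of the novel to scrape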

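with urllib.request.urlopen(url) as doc:
    html = doc.read()  # read the index page as raw bytes
html = html.decode("gbk")  # decode the bytes; this site serves GBK
title = re.findall(r'<meta property="og:title" content="(.*?)"/>', html)[0]  # novel title
urls = re.findall(r'<dd><a href ="(.*?)">(.*?)</a></dd>', html)  # (relative URL, chapter name) pairs
with open('%s.txt' % title, 'w', encoding='gbk') as fb:  # the file is closed automatically after the loop
    for i in urls:
        chapter_url = i[0]  # relative path of the chapter; not a complete URL
        chapter_name = i[1]  # chapter title
        chapter_url = "https://www.qb5200.tw%s" % chapter_url  # prepend the site root to get the full URL
        with urllib.request.urlopen(chapter_url) as chapter_doc:
            chapter_html = chapter_doc.read()  # download the chapter page
        chapter_html = chapter_html.decode("gbk")  # use "gbk" or "utf-8" depending on the page's encoding
        # re.S lets "." match newlines in case the chapter text spans several lines
        chapter_content = re.findall(r'<div id="content" class="showtxt">(.*?)</div>', chapter_html, re.S)[0]
        chapter_content = chapter_content.replace("&nbsp;", "")  # strip non-breaking-space entities
        chapter_content = chapter_content.replace("<br /><br />", "")  # strip the <br /> line-break tags
        fb.write(chapter_name)  # write the chapter title to the text file
        fb.write(chapter_content)
        fb.write('\n')  # end each chapter with a newline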
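
The parsing steps in the script can also be pulled out into small functions that run on plain strings, which makes them easy to try without touching the network. The following is only a sketch under a few assumptions: the names BASE_URL, parse_chapter_links and clean_chapter are made up for illustration, the two patterns are copied from the script and depend on this site's markup, and unlike the script it turns each <br /> into a newline instead of deleting it.

```python
import re
from html import unescape
from urllib.parse import urljoin

# Assumed base address: the index page used in the script above.
BASE_URL = "https://www.qb5200.tw/xiaoshuo/36/36143/"

# Patterns copied from the script; they only fit this site's markup.
CHAPTER_LINK = re.compile(r'<dd><a href ="(.*?)">(.*?)</a></dd>')
CHAPTER_BODY = re.compile(r'<div id="content" class="showtxt">(.*?)</div>', re.S)


def parse_chapter_links(index_html, base_url=BASE_URL):
    """Return (absolute_url, chapter_name) pairs found in the index page."""
    return [(urljoin(base_url, href), name)
            for href, name in CHAPTER_LINK.findall(index_html)]


def clean_chapter(chapter_html):
    """Extract the chapter body and reduce its markup to plain text."""
    match = CHAPTER_BODY.search(chapter_html)
    if match is None:
        return ""  # the layout changed or the page did not load as expected
    text = re.sub(r"<br\s*/?>", "\n", match.group(1))  # line-break tags become newlines
    text = unescape(text).replace("\xa0", "")  # decode entities, drop non-breaking spaces
    return text.strip()
```

urljoin handles both root-relative and page-relative links, so it can stand in for the manual "%s" concatenation if the site ever changes how it writes its href values.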

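As the comment in the script notes, whether to decode with "gbk" or "utf-8" depends on the page. A rough way to avoid hard-coding the choice is to try whatever charset the page declares first and fall back to GBK only when that fails. This is a minimal sketch, not a complete detector: decode_page and its parameters are hypothetical names, and it only looks at an optional header value and a <meta> charset declaration near the top of the page.

```python
import re


def decode_page(raw, header_charset=None, fallback="gbk"):
    """Decode raw page bytes, trying declared charsets before the fallback."""
    candidates = []
    if header_charset:
        candidates.append(header_charset)
    # Look for charset=... inside a <meta> tag in the first 2 KB of the page.
    meta = re.search(rb'<meta[^>]+charset=["\']?([\w-]+)', raw[:2048], re.I)
    if meta:
        candidates.append(meta.group(1).decode("ascii"))
    candidates.append(fallback)
    for name in candidates:
        try:
            return raw.decode(name)
        except (LookupError, UnicodeDecodeError):
            continue  # unknown codec name or wrong guess; try the next one
    # Last resort: keep going with replacement characters instead of crashing.
    return raw.decode(fallback, errors="replace")


# Possible use with urllib (requires network access, so it is left commented out):
# with urllib.request.urlopen(url) as doc:
#     html = decode_page(doc.read(), doc.headers.get_content_charset())
```

A declared charset can still be wrong, so the decoded text is worth a quick look the first time a new site is scraped.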

Original article: https://www.cnblogs.com/persistence-ok/p/10924300.html
