Scraping a Free Novel

Date: 2020-02-13 12:36:31

Tags: main htm format request orm color novel alt content

　　Today I learned how to scrape a novel site with XPath, purely as a practice exercise.

　　XPath is a path language for selecting nodes (elements, attributes, text) in an XML or HTML document.

　　URL of the (free) novel: http://book.zongheng.com/showchapter/896071.html

　　First, I opened a novel at random.

[Image]

　　As you can see, each chapter's title and URL sit inside a ul (unordered list) element, so iterating over it with XPath is all we need.
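The chapter-list lookup can be tried offline before touching the real site. The sketch below is only an illustration: the markup in SAMPLE and the example.com URLs are made up, and it uses the standard library's xml.etree.ElementTree, which supports just a subset of XPath, instead of lxml. Because that subset selects elements only, the href attribute and the link text are read from each a element rather than with /@href and /text().

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample modeled on the chapter list described above.
SAMPLE = """
<html><body>
  <ul class="chapter-list clearfix">
    <li><a href="http://example.com/chapter/1.html">Chapter 1</a></li>
    <li><a href="http://example.com/chapter/2.html">Chapter 2</a></li>
  </ul>
</body></html>
""".strip()


def list_chapters(markup):
    """Return (url, title) pairs for every link in the chapter list."""
    root = ET.fromstring(markup)
    # Select every <a> that is a child of an <li> in the chapter-list <ul>.
    links = root.findall(".//ul[@class='chapter-list clearfix']/li/a")
    return [(a.get("href"), (a.text or "").strip()) for a in links]


chapters = list_chapters(SAMPLE)
for url, title in chapters:
    print(title, url)
```

Taking the URL and the title from the same a element keeps each pair aligned, whereas two separate XPath queries combined with zip would drift if any link had no text.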

　　Next, open one of the chapter URLs and find where the chapter text sits in the page.

[Image]

　　As you can see, the text of the first chapter is inside a div element whose class attribute is content.
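The content lookup can be sketched the same way on a small inline sample. The markup below is hypothetical, and the example again uses the standard library's xml.etree.ElementTree (a limited XPath subset) rather than lxml, so the text is gathered with itertext() instead of //text().

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample modeled on the chapter page described above.
SAMPLE = """
<html><body>
  <div class="content">
    <p>First paragraph.</p>
    <p>Second <em>paragraph</em>.</p>
  </div>
  <div class="footer"><p>Not part of the chapter.</p></div>
</body></html>
""".strip()


def extract_text(markup):
    """Return the text of every <p> inside div.content, one paragraph per line."""
    root = ET.fromstring(markup)
    paragraphs = root.findall(".//div[@class='content']//p")
    # itertext() walks all nested text nodes, so inline tags such as <em>
    # stay on the same line as the rest of their paragraph.
    return "\n".join("".join(p.itertext()).strip() for p in paragraphs)


print(extract_text(SAMPLE))
```

Joining per paragraph differs slightly from joining every text node with a newline: a paragraph containing inline tags is not split across several lines.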

import os

import requests
from lxml import etree


def get_chapter_name(url):
    # Fetch the chapter index page and parse it
    html = requests.get(url).text
    page_source = etree.HTML(html)
    # Each chapter is an <a> inside the chapter-list <ul>
    chapters_url = page_source.xpath('//ul[@class="chapter-list clearfix"]/li/a/@href')
    chapters_name = page_source.xpath('//ul[@class="chapter-list clearfix"]/li/a/text()')
    # Make sure the output directory exists before writing chapter files
    os.makedirs('破天传人', exist_ok=True)
    for chapter_url, chapter_name in zip(chapters_url, chapters_name):
        get_text(chapter_url, chapter_name)
    print('Done!')


def get_text(chapter_url, chapter_name):
    # Fetch the chapter page
    html = requests.get(chapter_url).text
    page = etree.HTML(html)
    # Collect every text node inside the <p> tags of the content div
    text_tag = page.xpath('//div[@class="content"]//p//text()')
    text = '\n'.join(text_tag)
    path = '破天传人/{}.txt'.format(chapter_name)
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)
        print(path + '   written!')


if __name__ == '__main__':
    url = 'http://book.zongheng.com/showchapter/896071.html'
    get_chapter_name(url)
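The script uses each chapter title directly as a file name, so a title containing a character such as / or ? could make open() fail on some systems. A minimal sketch of one way to guard against that follows. The helper name safe_filename and the "untitled" fallback are hypothetical, not part of the original script.

```python
import os
import re


def safe_filename(name, replacement="_"):
    """Replace characters that are invalid in file names on common systems."""
    cleaned = re.sub(r'[\\/:*?"<>|]', replacement, name).strip()
    # Fall back to a placeholder if nothing usable is left.
    return cleaned or "untitled"


# Build the output path the same way the script does, but with a cleaned title.
path = os.path.join("破天传人", safe_filename("Chapter 1: What/Why?") + ".txt")
print(path)
```

In get_text, the path line could then use safe_filename(chapter_name) in place of chapter_name.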



Original source: https://www.cnblogs.com/a-runner/p/12302914.html
