Scraping a whole novel with a super-simplified 18-line Python script

Date: 2018-08-26 18:21:30


import re
import urllib.request


def getnvvel():
    # Download the book's index page; the site serves GBK-encoded HTML
    html = urllib.request.urlopen("http://www.quanshuwang.com/book/44/44683").read().decode('gbk')
    # Each match is a (chapter_url, chapter_title) tuple
    urls = re.findall(r'<li><a href="(.*?)" title=".*?">(.*?)</a></li>', html)
    title = "douluo"  # hard-coded here; normally you would fetch it with urllib.request.urlopen
    # Create ../novel/douluo.txt (the ../novel directory must already exist)
    with open('../novel/%s.txt' % title, 'w', encoding='utf-8') as f:
        for url in urls:
            chapter_url = url[0]
            chapter_title = url[1]
            chapter_html = urllib.request.urlopen(chapter_url).read().decode("gbk")
            # re.S lets "." match newlines, so the chapter body may span several lines
            chapter_content_list = re.findall(r'</script> .*?<br />(.*?)<script type="text/javascript">', chapter_html, re.S)
            for chapter_content in chapter_content_list:
                # Strip <br /> before spaces; the other order turns "<br />" into "<br/>" and leaves the tag behind
                chapter_content = chapter_content.replace("<br />", "")
                chapter_content = chapter_content.replace(" ", "")
                f.write(chapter_title)    # write the chapter title to douluo.txt
                f.write(chapter_content)  # write the chapter text to douluo.txt
                f.write('\n')             # newline so chapters are clearly separated


getnvvel()
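
The two re.findall calls are the heart of the script, so it can help to try them without touching the network. Below is a minimal sketch run against made-up HTML. The sample markup and the helper names parse_index and clean_chapter are assumptions for illustration only; the real site's markup may differ.

```python
import re

# Hypothetical snippets that imitate the markup the patterns expect.
SAMPLE_INDEX = (
    '<li><a href="http://example.com/1.html" title="Chapter 1, 3000 words">Chapter 1</a></li>\n'
    '<li><a href="http://example.com/2.html" title="Chapter 2, 2800 words">Chapter 2</a></li>\n'
)
SAMPLE_CHAPTER = (
    '<script>style();</script> <br />  Once.<br />\n'
    '  Twice.<br /><script type="text/javascript">ad();</script>'
)


def parse_index(html):
    """Return a list of (chapter_url, chapter_title) tuples."""
    return re.findall(r'<li><a href="(.*?)" title=".*?">(.*?)</a></li>', html)


def clean_chapter(html):
    """Extract the chapter body, then strip <br /> tags and spaces."""
    parts = re.findall(
        r'</script> .*?<br />(.*?)<script type="text/javascript">', html, re.S
    )
    # Tags are removed before spaces: stripping spaces first would turn
    # "<br />" into "<br/>", which the second replace would no longer match.
    return [p.replace("<br />", "").replace(" ", "") for p in parts]


if __name__ == "__main__":
    print(parse_index(SAMPLE_INDEX))
    print(clean_chapter(SAMPLE_CHAPTER))
```

Because each pattern has capture groups, re.findall returns only the captured parts: tuples for the index page (two groups) and plain strings for the chapter page (one group).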

If you want your script to be less easily spotted as a crawler, you can add a header, for example:

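headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.89 Safari/537.36'}

# urlopen() has no headers parameter, so attach the headers to a Request object
req = urllib.request.Request(url, headers=headers)
html = urllib.request.urlopen(req)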
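
Since urllib.request.urlopen accepts a Request object in place of a URL string, the header can be wrapped up once and reused for every page. A minimal sketch, assuming the helper names build_request and fetch (both hypothetical) and GBK as the default page encoding, as in the script above:

```python
import urllib.request

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/62.0.3202.89 Safari/537.36"
)


def build_request(url):
    """Wrap url in a Request that carries a browser-like User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch(url, encoding="gbk"):
    """Download url with the custom header and return the decoded body."""
    with urllib.request.urlopen(build_request(url)) as resp:
        return resp.read().decode(encoding)
```

With this in place, each urllib.request.urlopen(...).read().decode("gbk") call in the script could be swapped for fetch(...).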

Of course, to be polite to the server, you can also add

import time

and then, somewhere after the download step, add a

time.sleep(1)
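
One natural place for the pause is right after each chapter download, so the site is not hit in a tight loop. A minimal sketch: download_all, its fetch parameter, and the delay default are hypothetical names for illustration, and fetch stands for whatever function downloads one page.

```python
import time


def download_all(urls, fetch, delay=1.0):
    """Fetch each URL in order, sleeping `delay` seconds after each download."""
    pages = []
    for url in urls:
        pages.append(fetch(url))
        time.sleep(delay)  # pause before the next request
    return pages
```

Passing the download function in as an argument keeps the pacing logic separate from the networking code, so it can be tried with a stand-in function first.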

Of course, if you want to get around other anti-crawler measures, you will have to study further on your own.


Original article: http://blog.51cto.com/13603552/2164532
