Scraping Web Page Content with Python

Date: 2016-07-08 10:12:45

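When I used Baidu's cloud bookmarking feature, I found it rather magical: whatever page I opened, it always extracted the main text accurately. Recently I came across web content scraping in Python, and it turns out to be fairly easy to implement.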

Here is the code:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# modified 2016-07-04

import random
import time

import bs4
import requests

# ========================= Global variables =========================

fileName = "8480.txt"

headLink = "http://www.81zw.com/book/8480/"
next_href = "http://www.81zw.com/book/8480/655310.html"


# ========================= test =========================
# Uncomment this block to check the selectors against a single page
# before running the full loop.

# # set test flag
# test_flag = True

# # get contents
# response = requests.get(next_href)
# if response.status_code == requests.codes.ok:
#     soup = bs4.BeautifulSoup(response.content, "html.parser")
# else:
#     test_flag = False
# if test_flag:
#     # test for next link
#     link_div = soup.find_all('div', class_='bottem1')
#     next_link = link_div[0].find_all('a')[2]
#     print("---------- Next Link : ----------")
#     print(next_link.get('href'))

#     # find contents
#     contents = soup.find_all('div', id='content')
#     print("---------- Contents: ----------")
#     print(contents[0].text.replace('\xa0', ''))

#     # find title
#     h1_title = soup.find_all('h1')
#     print(" ---------- Title: ------------- ")
#     print(h1_title[0].text)


# ========================= Get contents =========================
MaxLoop = 2600        # upper bound on the number of pages to fetch
error_flag = 0
MaxRetryTimes = 20    # give up on a page after this many failed requests

# create an empty output file (truncates the result of any previous run)
f = open(fileName, 'w', encoding='utf-8')
f.close()

while error_flag == 0 and MaxLoop > 0:
    MaxLoop = MaxLoop - 1

    # get web content by url link address, retrying on non-200 responses
    RetryTimes = 0
    soup = None
    while RetryTimes < MaxRetryTimes:
        response = requests.get(next_href)

        if response.status_code == requests.codes.ok:
            soup = bs4.BeautifulSoup(response.content, "html.parser")
            break
        else:
            # wait a random 0-5 seconds before trying again
            r = random.random() * 5
            time.sleep(r)
            RetryTimes = RetryTimes + 1
            print("Retry attempt %d" % RetryTimes)

    if soup is None:
        # every retry failed: stop instead of looping forever
        error_flag = 1
        break

    # get next link (the third <a> inside the bottom navigation div)
    link_div = soup.find_all('div', class_='bottem1')
    next_link = link_div[0].find_all('a')[2]

    contents = soup.find_all('div', id='content')

    h1_title = soup.find_all('h1')
    # chapter title, then the body with non-breaking spaces turned into plain spaces
    chapter_contents = "\n\n" + h1_title[0].text + "\n\n" + contents[0].text.replace('\xa0', ' ')

    # append the chapter to the output file
    with open(fileName, 'a', encoding='utf-8') as f:
        f.write(chapter_contents)

    # get next link address
    next_href = next_link.get('href')
    nPos = next_href.find("http://")
    if nPos == -1:
        # relative link: prepend the book's base URL
        next_href = headLink + next_href
    elif nPos == 0:
        # already an absolute URL
        pass
    else:
        # "http://" found in the middle of the href: unexpected, stop
        error_flag = 1
    print(next_href)
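
The last step of the loop turns the href of the next-page link into a full URL. As a rough alternative sketch, the standard library's urljoin can do the same job. The helper name resolve_next is hypothetical and not part of the original script. One difference to be aware of: urljoin also handles root-relative hrefs such as /book/8480/1.html, which plain string concatenation with headLink would get wrong.

```python
from urllib.parse import urljoin

headLink = "http://www.81zw.com/book/8480/"


def resolve_next(base, href):
    """Resolve a next-page href against the book's base URL.

    Relative hrefs are joined to the base; absolute hrefs are returned unchanged.
    (Hypothetical helper, shown only to illustrate urljoin.)
    """
    return urljoin(base, href)


# Relative link, as the site appears to use between chapters
print(resolve_next(headLink, "655311.html"))
# Absolute link is left as it is
print(resolve_next(headLink, "http://www.81zw.com/book/8480/655312.html"))
```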

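Using a novel as the test case, the goal is to grab three things from each page: the chapter title, the body text, and the link to the next page.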
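
To make the three targets concrete, here is a minimal sketch that runs the same bs4 lookups on an inline HTML string instead of a live page. SAMPLE_HTML and parse_chapter are made up for illustration; the sample only mimics the structure the script assumes (an h1 title, a div with id content, and a div with class bottem1 whose third link points to the next page).

```python
import bs4

# Hypothetical sample page that mimics the structure the script expects.
SAMPLE_HTML = """
<html><body>
<h1>Chapter 1</h1>
<div id="content">&nbsp;&nbsp;First paragraph.</div>
<div class="bottem1">
  <a href="655309.html">Previous</a>
  <a href="./">Contents</a>
  <a href="655311.html">Next</a>
</div>
</body></html>
"""


def parse_chapter(html):
    """Return (title, body, next_href) for one chapter page."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    title = soup.find_all("h1")[0].text
    # &nbsp; is decoded to \xa0 by the parser; replace it with a plain space
    body = soup.find_all("div", id="content")[0].text.replace("\xa0", " ")
    # the third <a> in the bottom navigation div is taken as the next-page link
    next_href = soup.find_all("div", class_="bottem1")[0].find_all("a")[2].get("href")
    return title, body, next_href


title, body, next_href = parse_chapter(SAMPLE_HTML)
print(title)
print(body)
print(next_href)
```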

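The commented-out section in the middle is for testing: it checks whether the page content can be extracted correctly. The section below it fetches each page's content and saves it to a .txt text file.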
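
The saving step can be sketched on its own, without any network access. The helper append_chapter and the temporary output path are assumptions for illustration; the layout (two newlines, title, two newlines, body) follows the script.

```python
import os
import tempfile


def append_chapter(path, title, body):
    """Append one chapter to the text file, using the same layout as the script."""
    with open(path, "a", encoding="utf-8") as f:
        f.write("\n\n" + title + "\n\n" + body)


# Write into a temporary directory so the example does not touch real files.
out_path = os.path.join(tempfile.mkdtemp(), "8480.txt")

# Create an empty file first, as the script does before its main loop.
open(out_path, "w", encoding="utf-8").close()

append_chapter(out_path, "Chapter 1", "Body one")
append_chapter(out_path, "Chapter 2", "Body two")

with open(out_path, encoding="utf-8") as f:
    saved = f.read()
print(saved)
```

Opening the file with an explicit utf-8 encoding keeps Chinese text intact regardless of the platform's default encoding.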

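This approach is not very smart: for every new article you have to analyze the page yourself to make sure the title, body and next-page link are picked up correctly. Overall, though, it is simple to use and quite handy for scraping long, multi-page texts.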
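
One way to keep that per-site analysis in a single place is to store the selectors as data rather than in the loop. The sketch below is only an idea, not something the original script does: SITE_PROFILES, its keys and the extract function are all hypothetical names, and the only profile shown is the one derived from the selectors used above.

```python
import bs4

# Hypothetical per-site profiles. Only this entry reflects the selectors
# used in the script; profiles for other sites would have to be worked out by hand.
SITE_PROFILES = {
    "www.81zw.com": {
        "title": "h1",
        "body": "div#content",
        "nav_links": "div.bottem1 a",
        "next_index": 2,  # the third navigation link is the next page
    },
}


def extract(html, profile):
    """Return (title, body, next_href) using the CSS selectors in a profile."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    title = soup.select_one(profile["title"]).get_text()
    body = soup.select_one(profile["body"]).get_text().replace("\xa0", " ")
    links = soup.select(profile["nav_links"])
    return title, body, links[profile["next_index"]].get("href")
```

Supporting another site would then mean adding one dictionary entry after inspecting its pages, while the fetching loop stays unchanged.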

The current code saves only the text of the article. I don't yet know how to handle images; I'll try that later.
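
A possible starting point for images, offered as an untested idea rather than a worked-out solution: collect the src of every img tag inside the content div and resolve it against the page URL. The function name image_urls is hypothetical, and the sketch assumes that images live inside the same div with id content as the text.

```python
from urllib.parse import urljoin

import bs4


def image_urls(html, page_url):
    """Return absolute URLs of the images found inside the content div."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    content = soup.find("div", id="content")
    if content is None:
        return []
    # src=True skips <img> tags that have no src attribute
    return [urljoin(page_url, img["src"]) for img in content.find_all("img", src=True)]
```

Each URL could then be fetched with requests.get and its content attribute written to a file opened in binary mode.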

The code uses two libraries, Requests and bs4, to fetch the HTML document and to parse it, respectively. Both are very convenient to use.
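
Fetching can fail now and then, so the request is usually wrapped in a retry. The sketch below shows a bounded retry with a random pause between attempts. The helper fetch_with_retry is hypothetical; it takes the fetch function as a parameter so that it can be tried without a network connection, and in real use requests.get would be passed in.

```python
import random
import time


def fetch_with_retry(url, get, max_retries=20, max_delay=5.0, sleep=time.sleep):
    """Call get(url) until the response has status 200 or the retries run out.

    Returns the successful response, or None if every attempt failed.
    """
    for attempt in range(1, max_retries + 1):
        response = get(url)
        if response.status_code == 200:
            return response
        print("Attempt %d failed with status %d" % (attempt, response.status_code))
        # wait a random time between 0 and max_delay seconds before retrying
        sleep(random.random() * max_delay)
    return None
```

With Requests installed, a call would look like fetch_with_retry(next_href, requests.get), and the returned response's content can be handed to bs4.BeautifulSoup as in the script.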

Original article: http://www.cnblogs.com/wujbclzw/p/5652355.html
