Python Web Crawler (Part 2)

Date: 2017-04-20 17:08:02

Crawl every post written by the original poster (楼主) in a single thread on Baidu Tieba's movie forum (电影吧):

# python2
# -*- coding: utf-8 -*-

import urllib2
import string
import re

class Baidu_Spider:
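    # Capture the body of each post: the text inside elements whose id starts with "post_content"
    feature_pattern = re.compile(r'id="post_content.*?>\s+(.*?)</div>', re.S)
    # HTML entities to convert back to plain characters
    replaceList = [('&#39;', '\''), ('&quot;', '\"')]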

    def __init__(self, url):
        # see_lz=1 asks Tieba to show only the original poster's replies
        self.url = url + '?see_lz=1'

    def crawl_tieba_lz(self):
        begin_page = urllib2.urlopen(self.url).read()
        self.print_page_title(begin_page)
        count = self.get_page_count(begin_page)
        self.handle_data(count)

    def handle_data(self, count):
        f = open('tieba_lz.txt', 'w+')
        for i in range(count):
            # Pages are numbered from 1 via the pn query parameter
            url = self.url + '&pn=' + str(i+1)
            hint = 'page ' + str(i+1)

            print 'Downloading %s: %s' % (hint, url)
            page = urllib2.urlopen(url).read()
            features = re.findall(self.feature_pattern, page)
            print hint + ' downloaded'
            print '%d records found' % len(features)

            f.write(hint + ':\n')
            for feature in features:
                feature = self.handle_record(feature)
                print feature
                f.write(feature + '\n\n')
        f.close()
        print 'done'

    def handle_record(self, record):
        # Turn <br> tags into newlines, then strip every remaining tag
        record = re.sub(r'(<|</)br>', '\n', record)
        record = re.sub(r'<.*?>', '', record)
        for item in self.replaceList:
            record = record.replace(item[0], item[1])
        return record

    def get_page_count(self, page):
        result = re.search(r'class="red">(\d+?)</span>', page, re.S)
        if result:
            count = int(result.group(1))
            print '%d pages in total' % count
        else:
            count = 0
            print 'Could not determine the page count'
        return count

    def print_page_title(self, page):
        result = re.search(r'<h1.*?>(.*?)</h1>', page, re.S)
        if result:
            title = result.group(1)
            print 'Title: %s' % title
        else:
            print 'Could not get the title'

spider = Baidu_Spider('http://tieba.baidu.com/p/4082863285')
spider.crawl_tieba_lz()
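
To see what the URL building and the regular expressions do without touching the network, here is a minimal offline sketch of the same steps. The helper names (build_page_urls, clean_record, extract_posts) and SAMPLE_HTML are made up for this example, and the sample markup is hand-written on the assumption that it roughly resembles what Tieba returns; it is not copied from a real page.

```python
import re

# Same pattern as Baidu_Spider.feature_pattern: capture the text inside each
# element whose id starts with "post_content".
FEATURE_PATTERN = re.compile(r'id="post_content.*?>\s+(.*?)</div>', re.S)

# HTML entities to decode, mirroring Baidu_Spider.replaceList.
REPLACE_LIST = [('&#39;', "'"), ('&quot;', '"')]


def build_page_urls(thread_url, count):
    """Return the original-poster-only URL for each of `count` pages."""
    base = thread_url + '?see_lz=1'
    return [base + '&pn=' + str(i + 1) for i in range(count)]


def clean_record(record):
    """Turn <br> into newlines, drop remaining tags, decode two entities."""
    record = re.sub(r'(<|</)br>', '\n', record)
    record = re.sub(r'<.*?>', '', record)
    for old, new in REPLACE_LIST:
        record = record.replace(old, new)
    return record


def extract_posts(html):
    """Return the cleaned text of every post body found in `html`."""
    return [clean_record(m) for m in FEATURE_PATTERN.findall(html)]


# Hypothetical sample markup, invented for this example only.
SAMPLE_HTML = (
    '<div id="post_content_1" class="d_post_content">   '
    'First line<br>It&#39;s <a href="#">a link</a></div>'
    '<div id="post_content_2" class="d_post_content"> '
    '&quot;quoted&quot;</div>'
)

for url in build_page_urls('http://tieba.baidu.com/p/4082863285', 2):
    print(url)
for post in extract_posts(SAMPLE_HTML):
    print(repr(post))
```

Note that the pattern requires at least one whitespace character right after the opening tag (the \s+ part), so a post body that starts immediately after the > would be skipped. If the real markup differs from this assumption, the pattern needs adjusting.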

 


Original article: http://www.cnblogs.com/gattaca/p/6739436.html
