A Crawler Based on Regular-Expression Matching

Time: 2018-08-12 17:21:41

import re

import requests


class Anjuke(object):
    def __init__(self):
        self.url = "https://beijing.anjuke.com/sale/huairou/o5/"
        self.headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3"}
        # Capture everything inside the listing <ul>; re.S lets "." match newlines.
        self.pattern = re.compile(r'<ul id="houselist-mod-new" class="houselist-mod houselist-mod-new">(.*?)</ul>', re.S)
        # Strip HTML tags, character entities such as &nbsp;, and all whitespace.
        self.second_pattern = re.compile(r'<(.*?)>|&(.*?);|\s')

    def send_request(self):
        # Fetch the page and decode the response body (UTF-8 by default).
        response = requests.get(self.url, headers=self.headers)
        data = response.content.decode()
        print(data)
        return data

    def save_data(self, result_data):
        # Append the cleaned text of each matched block, separated by a blank line.
        with open('anjuke.text', 'a', encoding='utf-8') as f:
            for data in result_data:
                second_content = self.second_pattern.sub('', data) + '\n\n'
                f.write(second_content)

    def analysis_data(self, data):
        # Return the inner HTML of every listing block found in the page.
        result_list = self.pattern.findall(data)
        return result_list

    def run(self):
        data = self.send_request()
        result_list = self.analysis_data(data)
        print(result_list)
        self.save_data(result_list)


if __name__ == '__main__':
    Anjuke().run()
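
To see what the two patterns in the Anjuke class do without sending a request, the sketch below applies them to a small hand-written HTML fragment. The fragment, its listing text, and the names list_pattern, cleanup_pattern, and sample_html are made up for illustration; the real markup on anjuke.com may differ or may have changed since the post was written.

```python
import re

# The same two regular expressions used by the Anjuke class.
list_pattern = re.compile(
    r'<ul id="houselist-mod-new" class="houselist-mod houselist-mod-new">(.*?)</ul>',
    re.S,  # let "." match newlines so the capture can span several lines
)
cleanup_pattern = re.compile(r'<(.*?)>|&(.*?);|\s')

# Hypothetical fragment, invented for this example only.
sample_html = """
<div>
<ul id="houselist-mod-new" class="houselist-mod houselist-mod-new">
  <li><a href="#">Two-bedroom&nbsp;flat</a> <span>89 m2</span></li>
</ul>
</div>
"""

# Step 1: pull out the inner HTML of the listing block.
blocks = list_pattern.findall(sample_html)

# Step 2: delete tags, entities, and whitespace from each block.
cleaned = [cleanup_pattern.sub('', block) for block in blocks]

print(cleaned)  # ['Two-bedroomflat89m2']
```

Because the cleanup pattern includes `\s`, it removes the spaces between words as well as the line breaks, so the fields of one listing run together in the saved file. If separate fields are needed, one option would be to replace tags with a delimiter instead of an empty string.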
Original article: https://www.cnblogs.com/hanjian200ok/p/9463165.html
