Web Scraping 2.3 - Scrapy framework: POST, shell, CAPTCHAs

Posted: 2018-12-31 17:25:17

Scrapy framework: POST requests and the shell

1. POST requests

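When a spider starts, Scrapy first calls start_requests(self). By default that method issues GET requests for the URLs in start_urls, so to send a POST request you override start_requests, skip the URLs in start_urls, and pass a callback that does the actual data parsing.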

import scrapy


class RenrenSpider(scrapy.Spider):
    name = 'renren'
    allowed_domains = ['renren.com']
    start_urls = ['http://renren.com/']  # not used, because start_requests is overridden

    def start_requests(self):
        url = "http://www.renren.com/PLogin.do"
        data = {'email': '970138074@qq.com', 'password': 'pythonspider'}
        # For POST requests it is best to use FormRequest,
        # which sends formdata as the form-encoded request body.
        request = scrapy.FormRequest(url, formdata=data, callback=self.parse_page)
        yield request

    def parse_page(self, response):
        # Save the page as an HTML file; opening it in a browser
        # shows whether the POST request succeeded.
        with open('renren.html', 'w', encoding='utf-8') as fp:
            fp.write(response.text)
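
To see roughly what kind of request a form POST is, here is a minimal sketch using only the Python standard library. It is not Scrapy's implementation; build_form_post is a hypothetical helper and the credentials are placeholders. It builds the request without sending it, so it runs offline.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_form_post(url, formdata):
    """Build (but do not send) a form-encoded POST request."""
    # Form fields are URL-encoded and joined with '&' to form the body.
    body = urlencode(formdata).encode("utf-8")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return Request(url, data=body, headers=headers, method="POST")


req = build_form_post(
    "http://www.renren.com/PLogin.do",
    {"email": "user@example.com", "password": "secret"},  # placeholder values
)
print(req.get_method())  # POST
print(req.data)          # b'email=user%40example.com&password=secret'
```

The body is just the key/value pairs from the dict, which is why passing a plain dict as formdata is enough in the spider.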

2. scrapy shell

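Relaunching the whole project every time you want to check what an XPath expression returns is cumbersome. scrapy shell lets you test data-parsing expressions interactively instead.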

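cmd
>> cd [projectname]
>> scrapy shell url
>> (returns a number of ready-to-use objects; I did not look into them in depth and only used response)
>> title = response.xpath(r"//h1[@class='ph']/text()").get()
>> title
>> (the title text...)
>> contents = response.xpath(r"//td[@id='article_content']//text()").getall()  # all text nodes under the td element, so getall() is used; it returns a list
>> content = ''.join(contents).strip()  # join all fragments in contents and strip leading and trailing whitespace
>> content
>> (prints a large block of text)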
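
The join-and-strip step can also be tried outside the shell. The sketch below uses a made-up list in place of a real getall() result. It shows that strip() only trims the two ends of the joined string, and one possible way to drop whitespace-only fragments when that is not enough.

```python
# Made-up sample standing in for the list returned by getall().
contents = ["\n  ", "First paragraph.", "\n", "Second paragraph.", "\n  "]

# Same expression as in the shell session: inner newlines are kept.
content = "".join(contents).strip()
print(repr(content))  # 'First paragraph.\nSecond paragraph.'

# Alternative: strip each fragment and skip the ones that are empty.
compact = " ".join(part.strip() for part in contents if part.strip())
print(repr(compact))  # 'First paragraph. Second paragraph.'
```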

3. CAPTCHA recognition

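Approach:

Find the login URL, the form field names for the username and password, and the CAPTCHA image URL. Then download the CAPTCHA image to a local file. At that point there are two ways to recognize it:

1 Display the CAPTCHA on screen, read it yourself, and type it in by hand.

2 Use the Alibaba Cloud CAPTCHA recognition service: after downloading the image, send the data in the format the service requires, wait for the result, and extract the CAPTCHA text from the returned JSON.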
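
The first (manual) option can be sketched with the standard library alone. This is an illustration under assumptions, not the author's code: fetch_bytes and solve_captcha_manually are hypothetical names, and the real CAPTCHA URL and file format depend on the target site. The download and prompt functions are passed in as parameters so the logic can be tried without network access.

```python
import urllib.request
from pathlib import Path


def fetch_bytes(url):
    """Download the raw bytes at url (needs network access)."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()


def solve_captcha_manually(captcha_url, path="captcha.png",
                           fetch=fetch_bytes, prompt=input):
    """Save the CAPTCHA image locally and ask a human to type its text."""
    Path(path).write_bytes(fetch(captcha_url))
    # The user opens the saved file in any image viewer and types what it shows.
    return prompt(f"Open {path} and type the CAPTCHA text: ").strip()
```

The returned string would then be added to the login form data alongside the username and password, under whatever field name the site's form uses.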

Original article: https://www.cnblogs.com/bitterzZ/p/10202161.html
