02.Scrapy-Demo

Date: 2020-06-24 11:50:31

Tags: get, selector, source code, project, --, href, yield, dom, hands-on introduction

Scrapy: A Hands-On Introduction

Goal: scrape the proxy list on xicidaili.com, collecting each proxy's IP and port.

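1. Create a project

scrapy startproject xicidailiSpider
# scrapy  create-project  project-name

2. Create a spider

scrapy genspider xicidaili xicidaili.com
# scrapy  generate-spider  spider-name  site-domain
# Note: the spider name must not be the same as the project name!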


After running the command, a spider file appears in the project's spiders directory.

The spider file explained

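import scrapy  # import Scrapy

# Define the spider class, inheriting from scrapy.Spider, the most basic spider class.
# Scrapy's other spider classes all inherit from it.
class XicidailiSpider(scrapy.Spider):
    # Spider name; must be unique within the project
    name = 'xicidaili'
    # Domains the spider is allowed to crawl
    allowed_domains = ['xicidaili.com']
    # URLs where crawling starts
    start_urls = ['http://xicidaili.com/']

    # Parse the response: extract data, follow-up URLs, and so on.
    # response holds the downloaded page (its HTML source).
    def parse(self, response):
        pass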
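
To make the effect of allowed_domains concrete, here is a minimal standard-library sketch of the kind of check it implies: a URL passes if its host is a listed domain or a subdomain of one. The helper name is_allowed is hypothetical, and this is a simplification for illustration, not Scrapy's actual filtering code.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


allowed = ["xicidaili.com"]
print(is_allowed("https://www.xicidaili.com/nn/", allowed))  # subdomain of an allowed domain
print(is_allowed("https://example.com/nn/", allowed))        # unrelated site
```

This is why allowed_domains lists the bare domain (xicidaili.com) while start_urls can use www.xicidaili.com.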

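3. Analyze the site

Extracting data

Regular expressions (a fundamental you need to know, but hard to master)
XPath --> a syntax for extracting data from HTML
CSS --> a syntax for extracting data from HTML

response.xpath("XPath expression").get()

get() returns the first matching result as a string, or None if nothing matches.

getall() returns all matching results as a list of strings.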
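
The first-match versus all-matches distinction can be tried without installing Scrapy. The sketch below uses the standard library's xml.etree.ElementTree, which is not Scrapy's selector API and supports only a subset of XPath; findtext() plays the role of get() and findall() the role of getall(). The table contents are invented sample data.

```python
import xml.etree.ElementTree as ET

# Invented sample table: one header row (<th> cells) and two data rows.
SAMPLE = """<table>
  <tr><th>Country</th><th>IP</th><th>Port</th></tr>
  <tr><td>CN</td><td>10.0.0.1</td><td>8080</td></tr>
  <tr><td>CN</td><td>10.0.0.2</td><td>3128</td></tr>
</table>"""

table = ET.fromstring(SAMPLE)

# All matches, analogous to getall(): one list entry per matching node.
all_ips = [td.text for td in table.findall("./tr/td[2]")]

# First match only, analogous to get(): a single string.
first_ip = table.findtext("./tr/td[2]")

# No match, analogous to get() on an empty result: None, not an exception.
header_row = table.find("./tr")  # the first <tr>, which has <th> cells only
header_ip = header_row.findtext("./td[2]")

print(all_ips, first_ip, header_ip)
```

The last case matters when looping over every row of a table: a header row has no td cells, so a per-row lookup returns None for it.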

import scrapy


class XicidailiSpider(scrapy.Spider):
    name = 'xicidaili'
    allowed_domains = ['xicidaili.com']
    start_urls = ['https://www.xicidaili.com/nn/']
    # Alternative: list every page up front
    # start_urls = [f'https://www.xicidaili.com/nn/{page}' for page in range(1, 3685)]

    def parse(self, response):
        # Extract data
        # response.xpath("//tr/td[2]/text()") would select every IP cell at once
        selectors = response.xpath("//tr")
        for selector in selectors:
            # The leading . keeps the search relative to the current row
            ip = selector.xpath("./td[2]/text()").get()
            port = selector.xpath("./td[3]/text()").get()

            # extract_first() is equivalent to get()
            # ip = selector.xpath("./td[2]/text()").extract_first()
            # port = selector.xpath("./td[3]/text()").extract_first()
            print(ip, port)
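
The commented-out start_urls line builds one URL per page with a list comprehension. A small sketch of that pattern follows; the helper name page_urls is hypothetical. Because range() excludes its upper bound, range(1, 3685) covers pages 1 through 3684.

```python
BASE = "https://www.xicidaili.com/nn/"


def page_urls(first_page, last_page):
    """Build one listing URL per page number, both ends inclusive."""
    return [f"{BASE}{page}" for page in range(first_page, last_page + 1)]


urls = page_urls(1, 3)
print(urls)
```

Listing pages up front only works if the page count is known in advance; following the "next page" link adapts when the count changes.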

4. Run the spider

scrapy crawl xicidaili
# scrapy crawl <spider name>

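        # Pagination: this goes inside parse(), after the for loop
        next_page = response.xpath('//a[@class="next_page"]/@href').get()
        if next_page:
            print(next_page)
            # Build the absolute URL from the relative href
            next_url = response.urljoin(next_page)
            yield scrapy.Request(next_url, callback=self.parse)  # yield makes parse() a generator

# Request() creates a request for Scrapy to send, similar in spirit to requests.get()
# callback is the callback function: the response to this request is handed back to this same method (self.parse)
# Note: do not write () after the callback; pass only the method name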
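
To see why the callback is passed without parentheses and why parse() yields, here is a toy, standard-library simulation of the loop. Everything in it is a hypothetical stand-in: FAKE_SITE replaces the real website, the Request class replaces scrapy.Request, and crawl() replaces Scrapy's scheduler. Real callbacks receive a response object, not a URL string, and the "/nn/2"-style hrefs are assumed for illustration.

```python
from collections import deque
from urllib.parse import urljoin

# Hypothetical stand-in for the site: page URL -> href of its "next page" link
# (None on the last page).
FAKE_SITE = {
    "https://www.xicidaili.com/nn/": "/nn/2",
    "https://www.xicidaili.com/nn/2": "/nn/3",
    "https://www.xicidaili.com/nn/3": None,
}


class Request:
    """Minimal stand-in for scrapy.Request: a URL plus the function to call later."""

    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


visited = []


def parse(url):
    """Record the page, then yield a follow-up request if a next page exists."""
    visited.append(url)
    next_page = FAKE_SITE[url]
    if next_page:
        next_url = urljoin(url, next_page)       # resolve the relative href
        yield Request(next_url, callback=parse)  # pass the function itself, no ()


def crawl(start_url):
    """Toy scheduler: run each request's callback and queue whatever it yields."""
    queue = deque([Request(start_url, callback=parse)])
    while queue:
        request = queue.popleft()
        for new_request in request.callback(request.url):
            queue.append(new_request)


crawl("https://www.xicidaili.com/nn/")
print(visited)
```

Writing callback=parse() would run parse immediately and pass its result; callback=parse hands over the function so the scheduler can call it once the page is available.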


Original article: https://www.cnblogs.com/yanadoude/p/13186446.html
