pip3.7 install Scrapy
J-pro:myproject will$ scrapy
Scrapy 2.1.0 - project: myproject

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  check         Check spider contracts
  crawl         Run a spider
  edit          Edit spider
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  list          List available spiders
  parse         Parse URL (using its spider) and print the results
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

Use "scrapy <command> -h" to see more info about a command
scrapy startproject myproject
J-pro:myproject will$ ls -al
total 8
drwxr-xr-x   4 will  staff  128  6 11 23:47 .
drwxr-xr-x   3 will  staff   96  6 11 23:47 ..
drwxr-xr-x  10 will  staff  320  6 11 23:47 myproject    // project directory
-rw-r--r--   1 will  staff  261  6 11 23:18 scrapy.cfg   // project configuration file
J-pro:myproject will$ cd myproject/
J-pro:myproject will$ ls -al
total 56
drwxr-xr-x  10 will  staff   320  6 11 23:47 .
drwxr-xr-x   4 will  staff   128  6 11 23:47 ..
-rw-r--r--   1 will  staff     0  6 11 23:03 __init__.py
drwxr-xr-x   5 will  staff   160  6 11 23:42 __pycache__
-rw-r--r--   1 will  staff  8407  6 11 23:47 items.json      // JSON data scraped by the spider
-rw-r--r--   1 will  staff   369  6 11 23:42 items.py        // defines the structure of the data to extract
-rw-r--r--   1 will  staff  3587  6 11 23:18 middlewares.py  // middlewares that hook into Scrapy's request/response processing
-rw-r--r--   1 will  staff   283  6 11 23:18 pipelines.py    // further processing of the data extracted into items, e.g. saving it
-rw-r--r--   1 will  staff  3115  6 11 23:18 settings.py     // settings file
drwxr-xr-x   6 will  staff   192  6 11 23:47 spiders         // directory that holds the spider code
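For orientation, the "Overridden settings" entry in the crawl log at the end of this post suggests that the generated settings.py contains at least the four values below. This is a sketch reconstructed from that log entry, not a copy of the author's file; the file that startproject generates also carries many commented-out options that are not shown here.

```python
# settings.py (fragment), reconstructed from the "Overridden settings" log entry.
# It is not a copy of the author's file.
BOT_NAME = "myproject"

# Packages in which Scrapy looks for spiders, and where genspider creates new ones.
SPIDER_MODULES = ["myproject.spiders"]
NEWSPIDER_MODULE = "myproject.spiders"

# Check each site's robots.txt before fetching pages from it.
ROBOTSTXT_OBEY = True
```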
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class DetailItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()
    reply = scrapy.Field()
import scrapy

from myproject.items import DetailItem


class MySpider(scrapy.Spider):
    """
    name: the attribute Scrapy uses to identify this spider; it must be unique
    allowed_domains: list of domains the spider may crawl; if unset, all domains are allowed
    start_urls: list of URLs the crawl starts from
    start_requests: reads the links in start_urls and uses make_requests_from_url
        to generate a Request for each one, which means we can use the
        start_requests method to feed in links that follow our own custom pattern
    parse: callback that processes the response and returns the extracted data
        and the URLs to follow
    log: prints log messages
    closed: called when the spider is closed
    """

    # Set the spider name
    name = "spidertieba"
    # Set the allowed domain
    allowed_domains = ["baidu.com"]
    # URLs to crawl
    start_urls = [
        "http://tieba.baidu.com/f?kw=%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB&ie=utf-8",
    ]

    # Crawl callback
    def parse(self, response):
        for line in response.xpath('//li[@class=" j_thread_list clearfix"]'):
            # Create an item object to hold the scraped information
            item = DetailItem()
            # Extraction step: select the data with XPath; the exact expressions depend on the page structure
            item['title'] = line.xpath('.//div[contains(@class,"threadlist_title pull_left j_th_tit ")]/a/text()').extract()
            item['author'] = line.xpath('.//div[contains(@class,"threadlist_author pull_right")]//span[contains(@class,"frs-author-name-wrap")]/a/text()').extract()
            item['reply'] = line.xpath('.//div[contains(@class,"col2_left j_threadlist_li_left")]/span/text()').extract()
            yield item
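The kw value in start_urls is the forum name percent-encoded as UTF-8 (it decodes to 网络爬虫, "web crawler"). To point the spider at a different forum, the URL could be built with the standard library instead of being encoded by hand. The sketch below assumes the same endpoint and query parameters as the URL above; build_tieba_url and forum_name are hypothetical helper names, not part of Scrapy or of the original project.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Same endpoint as the entry in start_urls.
TIEBA_LIST_URL = "http://tieba.baidu.com/f"


def build_tieba_url(keyword: str) -> str:
    """Return the thread-list URL for a forum name (hypothetical helper)."""
    # urlencode percent-encodes non-ASCII text as UTF-8 by default.
    query = urlencode({"kw": keyword, "ie": "utf-8"})
    return f"{TIEBA_LIST_URL}?{query}"


def forum_name(url: str) -> str:
    """Decode the kw parameter of a thread-list URL back into readable text."""
    return parse_qs(urlsplit(url).query)["kw"][0]


if __name__ == "__main__":
    url = build_tieba_url("网络爬虫")
    print(url)
    print(forum_name(url))
```

With such a helper, start_urls could be written as a list comprehension over forum names rather than a list of pre-encoded strings.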
scrapy crawl spidertieba -o items.json
J-pro:myproject will$ scrapy crawl spidertieba -o items.json
2020-06-12 23:05:12 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: myproject)
2020-06-12 23:05:13 [scrapy.utils.log] INFO: Versions: lxml 4.5.1.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.5 (default, Nov 1 2019, 02:16:32) - [Clang 11.0.0 (clang-1100.0.33.8)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 2.9.2, Platform Darwin-18.7.0-x86_64-i386-64bit
2020-06-12 23:05:13 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-06-12 23:05:13 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'myproject',
 'NEWSPIDER_MODULE': 'myproject.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['myproject.spiders']}
2020-06-12 23:05:13 [scrapy.extensions.telnet] INFO: Telnet Password: b20d9ac1dc58b0eb
2020-06-12 23:05:13 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2020-06-12 23:05:13 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
................................................................................................
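Because the spider stores the return value of .extract(), which is a list of strings, every field in items.json is a list rather than a single string. The following is a minimal sketch of loading the feed file and flattening each field to its first non-empty value. The function names and the sample record are made up for illustration and are not taken from the author's crawl.

```python
import json
from pathlib import Path

FIELDS = ("title", "author", "reply")  # the fields declared on DetailItem


def first_text(values):
    """Return the first non-empty string in an extracted list, stripped, or ''."""
    for value in values or []:
        text = value.strip()
        if text:
            return text
    return ""


def flatten(records):
    """Turn records whose fields are lists into records of plain strings."""
    return [
        {key: first_text(record.get(key)) for key in FIELDS}
        for record in records
    ]


def load_threads(path="items.json"):
    """Read the feed written by `scrapy crawl spidertieba -o items.json`."""
    return flatten(json.loads(Path(path).read_text(encoding="utf-8")))


if __name__ == "__main__":
    # Made-up record in the shape the spider yields; not real crawl data.
    sample = [{"title": ["Hello Scrapy"], "author": ["someone"], "reply": [" 3 "]}]
    print(flatten(sample))
```

One caveat, as far as I know: in this Scrapy version -o appends to an existing file, so running the crawl twice can leave items.json in a state that json.loads cannot parse. Deleting the file before a re-run avoids that.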
Original article: https://www.cnblogs.com/will-xz/p/13111048.html