Requirement: crawl the complaint title, status, poster (screen name), and time from the Sunshine Hotline governance platform (阳光热线问政平台) and store them permanently in a database.
url: http://wz.sun0769.com/index.php/question/questionType?type=4&page=
Analysis: each page of the listing has its own URL. Scrapy sends a request for one URL and parses it; once it is parsed, it sends a request for the next URL and parses that, page after page.
Implementation options:
1. Put the URL of every page into the spider's start_urls list. (Not recommended; far too tedious.)
2. Manually send the follow-up requests with scrapy.Request. (Recommended.)
spiders/sun.py
# -*- coding: utf-8 -*-
import scrapy
from Sunline.items import SunlineItem
from selenium import webdriver


class SunSpider(scrapy.Spider):
    name = 'sun'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://wz.sun0769.com/index.php/question/questionType?type=4&page=']
    # page 2 -> 30, page 3 -> 60, then 90, 120, ...
    std_url = "http://wz.sun0769.com/index.php/question/questionType?type=4&page=%d"
    pages = 1
    # Instantiate the browser here (it is used by the downloader middleware).
    # If this is unclear for now, skip it and read the code below simply as
    # recursive crawling of multiple pages.
    bro = webdriver.Chrome("./chromedriver.exe")
    print("Spider started!!!")

    def parse(self, response):
        page_list = response.xpath('//*[@id="morelist"]/div/table[2]/tbody/tr/td/table/tbody/tr')
        for page in page_list:
            title = page.xpath('./td[2]/a[2]/text()').extract_first()
            state = page.xpath('./td[3]/span/text()').extract_first()
            screen_name = page.xpath('./td[4]/text()').extract_first()
            time = page.xpath('./td[5]/text()').extract_first()
            item = SunlineItem()
            item["title"] = title
            item["state"] = state
            item["screen_name"] = screen_name
            item["time"] = time
            # print(title, state, screen_name, time)
            # Hand the item over to the pipeline (make sure the robots setting
            # and the UA are configured in settings.py first).
            yield item
        # Crawl the first 100 pages and store them in the database.
        if self.pages <= 100:
            # Observation: the page number times 30 gives the offset of that page.
            pg = self.pages * 30
            self.pages += 1
            new_url = format(self.std_url % pg)
            # Parse recursively by calling parse again. Recursion needs two things:
            # 1) a termination condition (first 100 pages), 2) the function calling itself.
            yield scrapy.Request(url=new_url, callback=self.parse)

    def close(self, spider):
        self.bro.quit()
        print("Spider finished")
items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class SunlineItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    state = scrapy.Field()
    screen_name = scrapy.Field()
    time = scrapy.Field()
pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

import pymysql


class SunlinePipeline(object):
    # The pipeline has two lifecycle methods that run automatically:
    # open_spider and close_spider.
    def __init__(self):
        self.conn = None
        self.cursor = None

    def open_spider(self, spider):
        # Open the database connection once, when the spider starts.
        self.conn = pymysql.Connect(host="127.0.0.1", port=3306, user='root', password='', db='test')

    def process_item(self, item, spider):
        # print(item)  # print the item first to check the data; remember to
        # enable the pipeline in settings.py
        title = item["title"]
        state = item["state"]
        screen_name = item["screen_name"]
        time = item["time"]
        # Build and execute the SQL statement.
        sql = "insert into sun(title,state,screen_name,time) values('%s','%s','%s','%s');" % (title, state, screen_name, time)
        self.cursor = self.conn.cursor()
        try:
            self.cursor.execute(sql)
            self.conn.commit()
        except Exception as e:
            print(e)
            # Roll back the transaction on failure.
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
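The pipeline above assumes a MySQL database named test that already contains a sun table. As a minimal sketch (the column types are assumptions, not taken from the post), the table could be created once with pymysql before running the spider:

import pymysql

# One-off setup: create the table that SunlinePipeline writes into.
# The column types are assumptions; adjust them to the real data.
conn = pymysql.Connect(host="127.0.0.1", port=3306, user='root', password='', db='test')
cursor = conn.cursor()
cursor.execute(
    "CREATE TABLE IF NOT EXISTS sun ("
    "title VARCHAR(255), state VARCHAR(32), screen_name VARCHAR(64), time VARCHAR(32))"
)
conn.commit()
cursor.close()
conn.close()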
middlewares.py
# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals
from time import sleep
from scrapy.http import HtmlResponse


class SunlineDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest

        # Issue the request from the browser instead.
        spider.bro.get(url=request.url)
        sleep(1)
        spider.bro.execute_script("window.scrollTo(0,document.body.scrollHeight)")
        page_text = spider.bro.page_source
        # Replace the original response with one built from the browser's page source:
        # url=spider.bro.current_url, body=page_text, encoding='utf-8', request=request
        new_response = HtmlResponse(url=request.url, body=page_text, encoding="utf-8", request=request)
        return new_response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass
settings.py
# -*- coding: utf-8 -*-

# Scrapy settings for Sunline project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'Sunline'

SPIDER_MODULES = ['Sunline.spiders']
NEWSPIDER_MODULE = 'Sunline.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'  # set the UA

# Obey robots.txt rules
ROBOTSTXT_OBEY = False  # ignore robots.txt

LOG_LEVEL = "ERROR"  # only print error messages

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'Sunline.middlewares.SunlineSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
   'Sunline.middlewares.SunlineDownloaderMiddleware': 543,
}  # enable the downloader middleware

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'Sunline.pipelines.SunlinePipeline': 300,
}  # enable the pipeline

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
The result, viewed in Navicat:
In the earlier code we never manually sent a request for the start URLs stored in start_urls, yet those URLs were clearly requested. How is that done?
It is because the spider class inherits the start_requests(self) method from the Spider base class, and that method sends a request for every URL in start_urls:
def start_requests(self):
    for u in self.start_urls:
        yield scrapy.Request(url=u, callback=self.parse)
[Note] The default implementation of this method sends GET requests to the start URLs. To send POST requests instead, the subclass must override the method.
How: override start_requests so that it sends a POST request:
def start_requests(self):
    # URL for the POST request
    post_url = 'http://fanyi.baidu.com/sug'
    # POST request parameters
    formdata = {
        'kw': 'wolf',
    }
    # Send the POST request
    yield scrapy.FormRequest(url=post_url, formdata=formdata, callback=self.parse)
When running the program with scrapy crawl spiderFileName, what gets printed in the terminal is Scrapy's log output.
Log levels:
ERROR: general errors
WARNING: warnings
INFO: general informational messages
DEBUG: debugging messages
Directing log output:
In the settings.py configuration file, add
LOG_LEVEL = '<level>' to choose which log messages are printed. Setting LOG_FILE = 'log.txt' writes the log output into the specified file.
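For example, to keep only errors and save them to a file (the level and file name here are just example choices):

# In settings.py
LOG_LEVEL = 'ERROR'   # only errors are logged
LOG_FILE = 'log.txt'  # log output is written to this file instead of the terminal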
In some cases the data we want does not all live on the same page. For example, when crawling a news site such as NetEase News, the headline is on the listing page while the article body is on the detail page. This is where passing data between requests (request meta) comes in.
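The mechanism is to attach the partially filled item to the follow-up request through its meta dict and read it back in the next callback. A minimal sketch, with placeholder XPath expressions and a hypothetical MyItem class (the real case follows below):

def parse(self, response):
    item = MyItem()  # hypothetical item with title and content fields
    item["title"] = response.xpath('//h1/text()').extract_first()
    detail_url = response.xpath('//a/@href').extract_first()
    # pass the item along with the request for the detail page
    yield scrapy.Request(url=detail_url, callback=self.parse_detail, meta={"item": item})

def parse_detail(self, response):
    item = response.meta["item"]  # take the item back out of meta
    item["content"] = "".join(response.xpath('//p/text()').extract())
    yield item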
Example: crawl the news site https://news.163.com/, extracting the news titles from the first-level page and the article content from each detail page.
Spider file (spiders/wangyi.py):
# -*- coding: utf-8 -*-
import scrapy
from Wangyixinwen.items import WangyixinwenItem
from selenium import webdriver


class WangyiSpider(scrapy.Spider):
    name = 'wangyi'
    # allowed_domains = ['www.xxx.com']
    # Requests to start_urls are sent automatically; configure the User-Agent
    # and the robots setting in settings.py first.
    start_urls = ['https://news.163.com/']
    """
    The first requests come from start_urls. UA spoofing can be done in the downloader
    middleware's process_request, and proxy IPs can be set in process_exception (switch
    to a proxy only once your own IP runs into trouble). Responses with missing data
    can be intercepted and replaced in process_response.
    """
    bro = webdriver.Chrome("./chromedriver.exe")
    urls = []

    def parse(self, response):
        # Get all the section entries from the response.
        target_list = response.xpath('//*[@id="js_festival_wrap"]/div[3]/div[2]/div[2]/div[2]/div/ul/li')
        # Pick the sections we want and extract their URLs.
        for index in [3, 6, 7, 8]:
            target = target_list[index]
            target_url = target.xpath('./a/@href').extract_first()
            self.urls.append(target_url)
            # Request each section URL.
            yield scrapy.Request(url=target_url, callback=self.parse_target)

    def parse_target(self, response):
        # Parse the responses of the second round of requests. These section pages load
        # their data dynamically, so selenium is used to render them (the plain response
        # is missing data, which is why the downloader middleware replaces it). Here we
        # extract the news title and the URL of the detail page.
        detail_list = response.xpath('/html/body/div[1]/div[3]/div[4]/div[1]/div/div/ul/li/div/div')
        for detail in detail_list:
            title = detail.xpath('./div/div[1]/h3/a/text()').extract_first()
            detail_url = detail.xpath('./div/div[1]/h3/a/@href').extract_first()
            # Instantiate the item and store the title.
            item = WangyixinwenItem()
            item["title"] = title
            # Pass the item to the next callback through meta.
            yield scrapy.Request(url=detail_url, callback=self.parse_detail, meta={"item": item})

    def parse_detail(self, response):
        item = response.meta['item']
        content = response.xpath('//*[@id="endText"]/p/text()').extract()
        # Join the paragraphs into a single string.
        content = "".join(content)
        item["content"] = content
        # Hand the item over to the pipeline.
        yield item

    def close(self, spider):
        # When the spider finishes, close the browser too.
        print("Spider finished!!!")
        self.bro.quit()
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class WangyixinwenItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    content = scrapy.Field()
# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals
from scrapy.http import HtmlResponse
import random
from time import sleep


class WangyixinwenDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
    ]
    # Proxy pools
    PROXY_http = [
        '153.180.102.104:80',
        '195.208.131.189:56055',
    ]
    PROXY_https = [
        '120.83.49.90:9000',
        '95.189.112.214:35508',
    ]

    def process_request(self, request, spider):
        """
        UA spoofing can be set here. Switch to a proxy IP only once your own IP runs
        into trouble, in process_exception. If the request is modified, make sure to
        return the request object so it is re-sent.
        :param request:
        :param spider:
        :return:
        """
        # ua = random.choice(self.user_agent_list)  # pick a random UA (anti-crawling
        # countermeasure: pretend to be different browsers and devices)
        # request.headers["User-Agent"] = ua
        # return request  # re-send the modified request object
        return None

    def process_response(self, request, response, spider):
        # The plain responses for the section pages are missing data, so a new response
        # body is built here. spider.urls holds the section URLs collected by the spider;
        # only requests whose URL is in that list suffer from missing data and need a
        # replacement response.
        if request.url in spider.urls:
            # The missing data is loaded dynamically, so render the page with selenium.
            # Instantiating selenium here would create a browser for every response;
            # it only needs to be instantiated once, so it lives on the spider (created
            # when the spider starts) and is accessed through the spider argument.
            sleep(2)
            spider.bro.get(url=request.url)  # load the page in the browser
            sleep(1)
            spider.bro.execute_script("window.scrollTo(0,document.body.scrollHeight)")
            sleep(1.5)
            spider.bro.execute_script("window.scrollTo(0,document.body.scrollHeight)")
            sleep(0.7)
            spider.bro.execute_script("window.scrollTo(0,document.body.scrollHeight)")
            sleep(1)
            spider.bro.execute_script("window.scrollTo(0,document.body.scrollHeight)")
            # After scrolling, grab the browser's current page source.
            page_text = spider.bro.page_source
            # Build a replacement response from it:
            # url=spider.bro.current_url, body=page_text, encoding='utf-8', request=request
            new_response = HtmlResponse(url=request.url, body=page_text, encoding="utf-8", request=request)
            return new_response
        else:
            return response
        # After returning the new response, enable this downloader middleware in settings.py.

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain

        # If an exception is raised, switch to a proxy IP.
        # if request.url.split(':')[0] == 'http':
        #     request.meta['proxy'] = random.choice(self.PROXY_http)
        # else:
        #     request.meta['proxy'] = random.choice(self.PROXY_https)
        # # Re-send the corrected request object.
        # return request
        pass
# -*- coding: utf-8 -*-

# Scrapy settings for Wangyixinwen project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'Wangyixinwen'

SPIDER_MODULES = ['Wangyixinwen.spiders']
NEWSPIDER_MODULE = 'Wangyixinwen.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

LOG_LEVEL = "ERROR"  # only print error logs

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'Wangyixinwen.middlewares.WangyixinwenSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
   'Wangyixinwen.middlewares.WangyixinwenDownloaderMiddleware': 543,
}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'Wangyixinwen.pipelines.WangyixinwenPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
The storage pipeline for this project is not configured here; choose a storage backend according to your actual needs.
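As one possible sketch (the file name is an assumption; the original post leaves the storage choice open), a pipeline that simply writes the items to a local text file could look like this:

# pipelines.py -- minimal file-based storage; swap in MySQL, Redis, etc. as needed
class WangyixinwenPipeline(object):
    def open_spider(self, spider):
        # hypothetical output file
        self.fp = open("./wangyi_news.txt", "w", encoding="utf-8")

    def process_item(self, item, spider):
        self.fp.write(item["title"] + "\n" + item["content"] + "\n\n")
        return item

    def close_spider(self, spider):
        self.fp.close()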
Increase concurrency:
By default Scrapy performs 16 concurrent requests (CONCURRENT_REQUESTS = 16); this can be raised when appropriate. Setting CONCURRENT_REQUESTS = 100 in the settings.py configuration file raises the concurrency to 100.
Reduce the log level:
Running Scrapy produces a large amount of log output. To cut down CPU usage, set the log level to INFO or ERROR. In the configuration file: LOG_LEVEL = 'INFO'
Disable cookies:
If cookies are not actually needed, disabling them during the crawl reduces CPU usage and improves crawling efficiency. In the configuration file: COOKIES_ENABLED = False
Disable retries:
Re-requesting (retrying) failed HTTP requests slows the crawl down, so retries can be disabled. In the configuration file: RETRY_ENABLED = False
Reduce the download timeout:
When crawling very slow links, lowering the download timeout lets stuck requests be dropped quickly, which improves efficiency. In the configuration file: DOWNLOAD_TIMEOUT = 10 (a 10-second timeout).
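Putting the tuning options above together, the relevant part of settings.py would look roughly like this (the values are the ones suggested above; adjust them to your own site and hardware):

# settings.py -- performance tuning
CONCURRENT_REQUESTS = 100   # raise the concurrency
LOG_LEVEL = 'ERROR'         # less log output, less CPU
COOKIES_ENABLED = False     # skip cookies when they are not needed
RETRY_ENABLED = False       # do not retry failed requests
DOWNLOAD_TIMEOUT = 10       # give up on a download after 10 seconds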
11-scrapy (recursive parsing, POST requests, log levels, passing data between requests)
Original article: https://www.cnblogs.com/lishuntao/p/11629712.html