Foreword: this post is a record of an engineering process and references other articles; if anything is unclear, consult the originals. Since the focus is on recording, steps the author is already familiar with may be performed without explanation. If you have questions, leave a comment or use a search engine.
1. Install Scrapy
Open PowerShell in administrator mode and run:
pip install scrapy
PS: pip must already be installed before this step; search for instructions if needed.
2. Create a Scrapy project in a directory of your choice
scrapy startproject boss
3. Enter the project directory
cd boss
4. Generate a spider
scrapy genspider bosszhipin www.zhipin.com
5. Import the project into PyCharm and edit settings.py
Change ROBOTSTXT_OBEY = True
to ROBOTSTXT_OBEY = False
6. Write bosszhipin.py and run.py
# -*- coding: utf-8 -*-
import scrapy


class BosszhipinSpider(scrapy.Spider):
    name = 'bosszhipin'
    allowed_domains = ['www.zhipin.com']
    start_urls = ['https://www.zhipin.com/c101270100-p100101/?page=1&ka=page-1']

    def parse(self, response):
        print(response.text)
Place run.py in the project root:
from scrapy.cmdline import execute

execute(['scrapy', 'crawl', 'bosszhipin'])
Running it produces an error:
2018-11-04 13:03:36 [scrapy.core.engine] INFO: Spider opened
2018-11-04 13:03:36 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-11-04 13:03:36 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-11-04 13:03:37 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.zhipin.com/c101270100-p100101/?page=1&ka=page-1> (referer: None)
2018-11-04 13:03:37 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.zhipin.com/c101270100-p100101/?page=1&ka=page-1>: HTTP status code is not handled or not allowed
2018-11-04 13:03:37 [scrapy.core.engine] INFO: Closing spider (finished)
The request is rejected with a 403, so the site is most likely blocking the crawler. Modify the request headers through a downloader middleware.
Add to middlewares.py:
import random


class UserAgentMiddleware(object):
    def __init__(self, user_agent_list):
        self.user_agent = user_agent_list

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # Read the MY_USER_AGENT field from the settings file
        middleware = cls(crawler.settings.get('MY_USER_AGENT'))
        return middleware

    def process_request(self, request, spider):
        # Pick a random user-agent for each request
        request.headers['user-agent'] = random.choice(self.user_agent)
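The core of the middleware is nothing more than a uniform random pick from a configured list. As a standalone sketch (the user-agent strings below are just sample desktop browser values, not anything the site requires):

```python
import random

USER_AGENT_LIST = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0",
]


def pick_user_agent(agents=USER_AGENT_LIST):
    """Return one user-agent string chosen uniformly at random."""
    return random.choice(agents)
```

Rotating the user-agent per request makes the traffic look like it comes from several different browsers, which is often enough to get past a simple 403 block.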
Enable the middleware and set the MY_USER_AGENT value in settings.py:
USER_AGENT = 'boss (+http://www.yourdomain.com)'
...
DOWNLOADER_MIDDLEWARES = {
    'boss.middlewares.UserAgentMiddleware': 543,
}
MY_USER_AGENT = [
    # example desktop browser user-agent strings; any recent ones will do
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Version/12.0 Safari/605.1.15',
]
(The DOWNLOADER_MIDDLEWARES block already exists in the generated settings.py, only commented out; try simply uncommenting and adapting it first, and look for another solution only if that doesn't work.)
Run run.py again; the page HTML can now be retrieved.
Full code for the first stage. Later I plan to add MongoDB storage, since just printing the scraped text isn't much use by itself.
# -*- coding: utf-8 -*-
import scrapy


class BosszhipinSpider(scrapy.Spider):
    name = 'bosszhipin'
    allowed_domains = ['www.zhipin.com']
    start_urls = ['https://www.zhipin.com/c101270100-p100101/?page=1&ka=page-1']

    def parse(self, response):
        # print(response.text)
        job_node_table = response.xpath("//*[@id=\"main\"]/div/div[2]/ul")
        job_node_list = job_node_table.xpath("./li")
        for job_node in job_node_list:
            enterprise_node = job_node.xpath("./div/div[2]/div/h3/a")
            salary_node = job_node.xpath("./div/div[1]/h3/a/span")
            requirement_node = job_node.xpath("./div/div[1]/p")
            time_node = job_node.xpath("./div/div[3]/p")
            enterprise = enterprise_node.xpath('string(.)')
            salary = salary_node.xpath('string(.)')
            requirement = requirement_node.xpath('string(.)')
            time = time_node.xpath('string(.)')
            print("Company", enterprise.extract_first().strip())
            print("Salary", salary.extract_first().strip())
            print("Requirements", requirement.extract_first().strip())
            print("Updated", time.extract_first().strip())
            print()
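The start URL only fetches page 1. One way to extend parse() to follow later pages is to rewrite the page/ka query parameters, which in start_urls follow a `page=N&ka=page-N` pattern; this is an assumption about the site's URL scheme, not verified behaviour:

```python
from urllib.parse import urlencode, urlparse, parse_qs, urlunparse


def next_page_url(url):
    """Return the URL of the following results page, bumping page and ka."""
    parts = urlparse(url)
    # parse_qs returns lists; keep the first value of each parameter
    query = {k: v[0] for k, v in parse_qs(parts.query).items()}
    page = int(query.get("page", 1)) + 1
    query["page"] = str(page)
    query["ka"] = "page-%d" % page
    return urlunparse(parts._replace(query=urlencode(query)))
```

Inside parse() this could feed `scrapy.Request(next_page_url(response.url), callback=self.parse)`, ideally with a stop condition such as an empty job list.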
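The MongoDB step mentioned above could be sketched as a Scrapy item pipeline using pymongo; the database and collection names (boss_db, jobs) and the local URI are made-up placeholders, and pymongo plus a running MongoDB instance are assumed:

```python
class MongoPipeline:
    """Store each scraped job item in a MongoDB collection."""

    def __init__(self, mongo_uri="mongodb://localhost:27017", db_name="boss_db"):
        self.mongo_uri = mongo_uri
        self.db_name = db_name
        self.client = None
        self.collection = None

    def open_spider(self, spider):
        # Import here so the module still loads when pymongo isn't installed
        import pymongo
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.collection = self.client[self.db_name]["jobs"]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.collection.insert_one(dict(item))
        return item
```

To activate it, the pipeline would be registered in settings.py under ITEM_PIPELINES, e.g. `{'boss.pipelines.MongoPipeline': 300}`, and the print() calls in parse() replaced by yielded dicts.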
Original article: https://www.cnblogs.com/huzhongyu/p/9903622.html