
LinkExtractor

Posted: 2018-06-21 22:33:28



 

Sending a request with scrapy shell:

wljdeMacBook-Pro:~ wlj$ scrapy shell "http://www.bjrbj.gov.cn/mzhd/detail_29974.htm"

Inspecting the response

response.body    # raw response body, as bytes
response.text    # decoded response body, as a string
response.url     # the URL the response came from
>>> response.url
'http://www.bjrbj.gov.cn/mzhd/detail_29974.htm'

Import LinkExtractor to match links anywhere in the HTML document:

>>> from scrapy.linkextractors import LinkExtractor
>>> response.xpath('//div[@class="xx_neirong"]/h1/text()').extract()[0]
'北京社保开户流程是怎么个流程'

 



Demo: extracting pagination links
wljdeMacBook-Pro:Desktop wlj$ scrapy shell "http://hr.tencent.com/position.php?"
2018-06-21 21:12:40 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapybot)
2018-06-21 21:12:40 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 18.4.0, Python 3.6.5 (default, Apr 25 2018, 14:23:58) - [GCC 4.2.1 Compatible Apple LLVM 9.1.0 (clang-902.0.39.1)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0h  27 Mar 2018), cryptography 2.2.2, Platform Darwin-17.5.0-x86_64-i386-64bit
2018-06-21 21:12:40 [scrapy.crawler] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0}
2018-06-21 21:12:40 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage']
2018-06-21 21:12:40 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-06-21 21:12:40 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-06-21 21:12:40 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-06-21 21:12:40 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-06-21 21:12:40 [scrapy.core.engine] INFO: Spider opened
2018-06-21 21:12:41 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://hr.tencent.com/position.php> from <GET http://hr.tencent.com/position.php>
2018-06-21 21:12:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://hr.tencent.com/position.php> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x107617c18>
[s]   item       {}
[s]   request    <GET http://hr.tencent.com/position.php>
[s]   response   <200 https://hr.tencent.com/position.php>
[s]   settings   <scrapy.settings.Settings object at 0x10840e748>
[s]   spider     <DefaultSpider 'default' at 0x1086c6ba8>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
>>> response.url
'https://hr.tencent.com/position.php'
>>> from scrapy.linkextractors import LinkExtractor
>>> link_list = LinkExtractor(allow=(r"start=\d+",))
>>> link_list.extract_links(response)
[Link(url='https://hr.tencent.com/position.php?&start=10#a', text='2', fragment='', nofollow=False), Link(url='https://hr.tencent.com/position.php?&start=20#a', text='3', fragment='', nofollow=False), Link(url='https://hr.tencent.com/position.php?&start=30#a', text='4', fragment='', nofollow=False), Link(url='https://hr.tencent.com/position.php?&start=40#a', text='5', fragment='', nofollow=False), Link(url='https://hr.tencent.com/position.php?&start=50#a', text='6', fragment='', nofollow=False), Link(url='https://hr.tencent.com/position.php?&start=60#a', text='7', fragment='', nofollow=False), Link(url='https://hr.tencent.com/position.php?&start=70#a', text='...', fragment='', nofollow=False), Link(url='https://hr.tencent.com/position.php?&start=3800#a', text='381', fragment='', nofollow=False)]
>>>

 



 


Original post: https://www.cnblogs.com/wanglinjie/p/9211013.html
