Pull and run the Splash Docker image:
docker pull scrapinghub/splash
docker run -p 8050:8050 scrapinghub/splash
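Once the container is running, you can verify that Splash is reachable through its HTTP API. A minimal sketch using the requests library, assuming Splash is listening on localhost:8050 (render.html is Splash's built-in endpoint that returns the rendered page):

import requests

resp = requests.get(
    'http://localhost:8050/render.html',
    params={'url': 'http://www.baidu.com', 'wait': 1},
)
print(resp.status_code)  # 200 means Splash rendered the page
print(resp.text[:200])   # start of the rendered HTML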
Add the following to settings.py (this assumes the scrapy-splash package is installed, e.g. via pip install scrapy-splash):
SPLASH_URL = 'http://<splash-host>:8050/'  # the address where Splash is deployed
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
Use SplashRequest in place of scrapy.Request:
yield SplashRequest(
    url, self.parse_result,  # first argument is the URL to request, second is the callback
    args={  # args is commonly used to pass a Lua script
        "lua_source": """
            splash:set_user_agent("...")
            assert(splash:go(args.url))
            assert(splash:wait(args.time))  -- note: the script can read external parameters via args.*
            return {html = splash:html()}   -- return the rendered HTML
        """,
        "time": time,  # Lua script parameters are passed in here
    },
    endpoint='run',  # defaults to render.html; 'execute' and 'run' are the usual choices for running a script ('run' is the common one; the difference is that a 'run' script only needs the body of the main function, as in the code above)
)
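For comparison, a script sent to the 'execute' endpoint has to define the whole main function itself. A sketch of the equivalent request (url, self.parse_result, and the time value are placeholders carried over from the example above):

yield SplashRequest(
    url, self.parse_result,
    endpoint='execute',  # unlike 'run', 'execute' expects a complete script
    args={
        "lua_source": """
            function main(splash, args)
                assert(splash:go(args.url))
                assert(splash:wait(args.time))
                return {html = splash:html()}
            end
        """,
        "time": 3,
    },
)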
A complete Spider example:
import scrapy
from scrapy_splash import SplashRequest

class ExampleSpider(scrapy.Spider):
    name = 'connect_splash'

    def start_requests(self):
        url = 'http://www.baidu.com'
        # the script contains only the body of main, since endpoint='run'
        script = """
        assert(splash:go(args.url))
        assert(splash:wait(args.wait))
        return {html = splash:html()}
        """
        yield SplashRequest(url, self.parse, endpoint='run',
                            args={'lua_source': script, 'wait': 3})

    def parse(self, response):
        # open an interactive shell on the rendered response for debugging
        from scrapy.shell import inspect_response
        inspect_response(response, self)
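Since the response the callback receives is the HTML that Splash rendered, the usual Scrapy selectors work on it directly. A minimal sketch of a parse that extracts the page title instead of dropping into the shell (the selector is only an example):

def parse(self, response):
    # response.body is the rendered HTML returned by the Lua script
    title = response.css('title::text').get()
    yield {'url': response.url, 'title': title}

Run the spider as usual with scrapy crawl connect_splash.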
Original article: https://www.cnblogs.com/lokvahkoor/p/10800932.html