CrawTaskBuilder is the builder for CrawlTask in GuozhongCrawler. It makes creating a CrawlTask crawler task considerably more convenient.

public CrawTaskBuilder useThread(int threadNum)
    threadNum

public CrawTaskBuilder usePipeline(java.lang.Class<? extends Pipeline> pipelineCls)
    pipelineCls - the persistence handler class

public CrawTaskBuilder usePageRetryCount(int retryCount)
    retryCount

public CrawTaskBuilder usePageEncoding(PageRequest.PageEncoding defaultEncoding)

public CrawTaskBuilder injectStartUrl(java.lang.String url, java.lang.Class<? extends PageProcessor> processorCls, java.util.Map<java.lang.String,java.lang.Object> contextAttribute, PageRequest.PageEncoding pageEncoding)
    url
    contextAttribute
    pageEncoding

public CrawTaskBuilder injectStartUrl(java.lang.String url, java.lang.Class<? extends PageProcessor> processorCls, java.util.Map<java.lang.String,java.lang.Object> contextAttribute)
    url
    contextAttribute

public CrawTaskBuilder injectStartUrl(java.lang.String url, java.lang.Class<? extends PageProcessor> processorCls)
    url

public CrawTaskBuilder useDynamicEntrance(java.lang.Class<? extends DynamicEntrance> dynamicEntranceCls)
    dynamicEntranceCls - an implementation class that extends DynamicEntrance

public CrawTaskBuilder useQueuePriorityRequest()

public CrawTaskBuilder useQueueDelayedPriorityRequest(int delayInMilliseconds)
    delayInMilliseconds - each Request is taken at least delayInMilliseconds milliseconds after the previous one

public CrawTaskBuilder useTaskLifeListener(TaskLifeListener listener)
    listener
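
The delay described for useQueueDelayedPriorityRequest can be pictured with a small stand-alone sketch. This is not GuozhongCrawler's implementation: the class DelayedRequestGate and its methods are hypothetical names chosen for illustration, and the caller passes in the current time so the behaviour is easy to follow. It only shows the idea that a Request may be taken once delayInMilliseconds have passed since the previous take.

```java
// Hypothetical stand-in, not part of GuozhongCrawler.
// Enforces a minimum gap between two consecutive "takes".
class DelayedRequestGate {
    private final long delayInMilliseconds;
    private boolean hasTaken = false; // no take has happened yet
    private long lastTakeAt;          // time of the most recent take, in ms

    DelayedRequestGate(long delayInMilliseconds) {
        if (delayInMilliseconds < 0) {
            throw new IllegalArgumentException("delay must not be negative");
        }
        this.delayInMilliseconds = delayInMilliseconds;
    }

    // Milliseconds that still have to pass at time nowMillis
    // before the next take is allowed (0 if it is allowed now).
    synchronized long remainingWait(long nowMillis) {
        if (!hasTaken) {
            return 0;
        }
        long elapsed = nowMillis - lastTakeAt;
        return Math.max(0, delayInMilliseconds - elapsed);
    }

    // Records a take at nowMillis and returns true,
    // or returns false if the delay has not elapsed yet.
    synchronized boolean tryTake(long nowMillis) {
        if (remainingWait(nowMillis) > 0) {
            return false;
        }
        hasTaken = true;
        lastTakeAt = nowMillis;
        return true;
    }
}

class DelayedRequestGateDemo {
    public static void main(String[] args) {
        DelayedRequestGate gate = new DelayedRequestGate(500);
        System.out.println(gate.tryTake(1_000));       // first take is always allowed
        System.out.println(gate.tryTake(1_200));       // only 200 ms later: refused
        System.out.println(gate.remainingWait(1_200)); // 300 ms still to wait
        System.out.println(gate.tryTake(1_500));       // 500 ms after the first take: allowed
    }
}
```

A real crawler queue would block or sleep for the remaining wait rather than refuse; the sketch leaves that choice to the caller.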

public CrawTaskBuilder useCookie(java.util.Set<Cookie> cookies)
    cookies

public void addChromeDriverLifeListener(ChromeDriverLifeListener chromeDriverLifeListener)
    chromeDriverLifeListener

public void addWebDriverLifeListener(WebDriverLifeListener webDriverLifeListener)
    webDriverLifeListener

public void addHttpClientLifeListener(HttpClientLifeListener httpClientLifeListener)
    httpClientLifeListener
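
public CrawTaskBuilder useProxyIpPool(java.lang.Class<? extends ProxyIpPool> proxyIpPoolCls, int initSize, long pastTime, int max_use_count)
    proxyIpPoolCls
    initSize - number of IPs loaded each time the proxy IP buffer pool runs short of IPs; the recommended formula is initSize = thread * 5
    pastTime - expiry time of each individual IP; a proxy IP is removed once its expiry time is reached. Choose this value according to the quality of the proxy IPs
    max_use_count - maximum number of times each proxy IP may be used; the recommended formula is max_use_count = (number of consecutive requests after which the target site bans an IP) minus 2 to 3

public CrawTaskBuilder useProxyIpPoolInstance(ProxyIpPool proxyIpPool)
    proxyIpPool
    Throws: java.lang.SecurityException, java.lang.NoSuchMethodException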
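
The two sizing formulas recommended for useProxyIpPool are simple arithmetic, and a short sketch makes them concrete. The helper class ProxyPoolSizing and its method names are hypothetical and not part of GuozhongCrawler; the thread count and ban threshold in main are invented example values. Clamping the result to at least 1 is also an assumption of this sketch.

```java
// Hypothetical helper, not part of GuozhongCrawler.
// Applies the sizing formulas recommended for useProxyIpPool.
class ProxyPoolSizing {

    // initSize = thread * 5
    static int initSize(int threadNum) {
        if (threadNum < 1) {
            throw new IllegalArgumentException("threadNum must be >= 1");
        }
        return Math.multiplyExact(threadNum, 5);
    }

    // max_use_count = (consecutive requests before the site bans an IP) minus 2 to 3.
    // safetyMargin is the "2 to 3" part of the formula.
    static int maxUseCount(int requestsBeforeBan, int safetyMargin) {
        if (safetyMargin < 2 || safetyMargin > 3) {
            throw new IllegalArgumentException("safetyMargin should be 2 or 3");
        }
        // Assumption of this sketch: never return less than one use per IP.
        return Math.max(1, requestsBeforeBan - safetyMargin);
    }
}

class ProxyPoolSizingDemo {
    public static void main(String[] args) {
        int threads = 10;           // example value
        int requestsBeforeBan = 20; // example value; must be measured per target site
        System.out.println("initSize = " + ProxyPoolSizing.initSize(threads));
        System.out.println("max_use_count = " + ProxyPoolSizing.maxUseCount(requestsBeforeBan, 3));
    }
}
```

With 10 threads and a site that bans after 20 consecutive requests, this gives initSize = 50 and max_use_count = 17.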
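
public final CrawTaskBuilder useTimer(int hour, long period, int endHour)
    hour - hour of day at which to start; if the current time is earlier than this hour, the task waits until that hour to start
    period - interval between crawls, in milliseconds
    endHour - hour of day at which to stop

public CrawTaskBuilder useDownloadFileThread(int thread)
    thread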
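
The hour and endHour parameters of useTimer describe a daily time window, which can be sketched independently of the library. The class TimerWindow below is a hypothetical illustration, not GuozhongCrawler code. It assumes hour < endHour within the same day and treats endHour as exclusive; the source does not state either point.

```java
import java.time.Duration;
import java.time.LocalTime;

// Hypothetical illustration of the useTimer window, not GuozhongCrawler code.
class TimerWindow {

    // How long to wait before the first crawl: zero if `now` is already
    // at or past `hour`:00, otherwise the gap until `hour`:00.
    static Duration initialDelay(LocalTime now, int hour) {
        long startSecond = hour * 3600L;
        long nowSecond = now.toSecondOfDay();
        return Duration.ofSeconds(Math.max(0, startSecond - nowSecond));
    }

    // Whether a crawl may run at `now`.
    // Assumption: the window is [hour, endHour), i.e. endHour is exclusive.
    static boolean withinWindow(LocalTime now, int hour, int endHour) {
        int currentHour = now.getHour();
        return currentHour >= hour && currentHour < endHour;
    }
}

class TimerWindowDemo {
    public static void main(String[] args) {
        // Waiting from 07:30 until a 9 o'clock start
        System.out.println(TimerWindow.initialDelay(LocalTime.of(7, 30), 9));
        // 10:15 lies inside a 9-to-18 window
        System.out.println(TimerWindow.withinWindow(LocalTime.of(10, 15), 9, 18));
        // 18:00 is outside, because endHour is treated as exclusive here
        System.out.println(TimerWindow.withinWindow(LocalTime.of(18, 0), 9, 18));
    }
}
```

Inside the window, a scheduler would then repeat the crawl every period milliseconds.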

public CrawTaskBuilder useDownloadFileDelayTime(int millisecond)
    millisecond

public CrawlTask build()

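
Because most of the configuration methods return CrawTaskBuilder and build() returns the finished task, calls can be chained. The sketch below shows that pattern with stand-in classes so that it compiles without the library. SketchTask and SketchTaskBuilder are hypothetical; only the method names useThread, usePageRetryCount, useDownloadFileDelayTime and build are taken from the listing, and the defaults and validation are assumptions.

```java
// Stand-in for the task produced by build(); the real CrawlTask is not reproduced here.
final class SketchTask {
    final int threadNum;
    final int retryCount;
    final int downloadDelayMillis;

    SketchTask(int threadNum, int retryCount, int downloadDelayMillis) {
        this.threadNum = threadNum;
        this.retryCount = retryCount;
        this.downloadDelayMillis = downloadDelayMillis;
    }
}

// Stand-in builder that mirrors a few method names from the listing.
final class SketchTaskBuilder {
    private int threadNum = 1;           // assumed default
    private int retryCount = 0;          // assumed default
    private int downloadDelayMillis = 0; // assumed default

    SketchTaskBuilder useThread(int threadNum) {
        if (threadNum < 1) {
            throw new IllegalArgumentException("threadNum must be >= 1");
        }
        this.threadNum = threadNum;
        return this; // returning the builder is what enables chaining
    }

    SketchTaskBuilder usePageRetryCount(int retryCount) {
        if (retryCount < 0) {
            throw new IllegalArgumentException("retryCount must be >= 0");
        }
        this.retryCount = retryCount;
        return this;
    }

    SketchTaskBuilder useDownloadFileDelayTime(int millisecond) {
        if (millisecond < 0) {
            throw new IllegalArgumentException("millisecond must be >= 0");
        }
        this.downloadDelayMillis = millisecond;
        return this;
    }

    // Collects the configured values into an immutable task object.
    SketchTask build() {
        return new SketchTask(threadNum, retryCount, downloadDelayMillis);
    }
}

class SketchTaskBuilderDemo {
    public static void main(String[] args) {
        SketchTask task = new SketchTaskBuilder()
                .useThread(5)
                .usePageRetryCount(3)
                .useDownloadFileDelayTime(200)
                .build();
        System.out.println(task.threadNum + " threads, " + task.retryCount
                + " retries, " + task.downloadDelayMillis + " ms download delay");
    }
}
```

How a CrawTaskBuilder instance is obtained in GuozhongCrawler is not shown in this listing, so the sketch simply uses a constructor.
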
GuozhongCrawler Tutorial Series (2): CrawTaskBuilder Explained
Original article: http://blog.csdn.net/u012572945/article/details/46415703