码迷,mamicode.com
首页 > 其他好文 > 详细

Nutch2.2.1配置文件nutch-site.xml

时间:2014-08-18 18:45:02      阅读:299      评论:0      收藏:0      [点我收藏+]

标签:des   style   http   color   os   io   strong   文件   


在nutch2.2.1中,有两份配置文件:nutch-default.xml与nutch-site.xml。

其中前者是nutch自带的默认属性,一般情况下不要修改。

如果需要修改默认属性,可以在nutch-site.xml中增加一个同名的属性,并修改其值。nutch-site.xml中的属性值会覆盖nutch-default.xml中的值。


1、db.ignore.external.links

若为true,则只抓取本域名内的网页,忽略外部链接。

可以在 regex-urlfilter.txt中增加过滤器达到同样效果,但如果过滤器过多,如几千个,则会大大影响nutch的性能。

<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  </description>
</property>

2、fetcher.parse

能否在抓取的同时进行解释:可以,但不 建议这样做。

<property>
  <name>fetcher.parse</name>
  <value>false</value>
  <description>If true, fetcher will parse content. NOTE: previous releases would
  default to true. Since 2.0 this is set to false as a safer default.</description>
</property>

官方解释

N.B. In a parsing fetcher, outlinks are processed in the reduce phase (at least when outlinks are followed). If a fetcher‘s reducer stalls you may run out of memory or disk space, usually after a very long reduce job. Behaviour typical to this is usually observed in this situation.

In summary, if it is possible, users are advised not to use a parsing fetcher as it is heavy on IO and often leads to the above outcome.


Nutch2.2.1配置文件nutch-site.xml,布布扣,bubuko.com

Nutch2.2.1配置文件nutch-site.xml

标签:des   style   http   color   os   io   strong   文件   

原文地址:http://blog.csdn.net/jediael_lu/article/details/38662441

(0)
(0)
   
举报
评论 一句话评论(0
登录后才能评论!
© 2014 mamicode.com 版权所有  联系我们:gaon5@hotmail.com
迷上了代码!