This tutorial is a paragraph-by-paragraph translation of the official Nutch tutorial, with the translator's own explanations added.
This article is provided by 精简导航.
The original was published on the CSDN blog and on 精简导航, and it is still being revised and updated. Copies appearing on other sites are reposts and may be incomplete; please read the original page.
Although this is a tutorial for Nutch 1.x, the official Nutch 2.x tutorial only explains how to configure the new features; the basics for Nutch 2.x are still covered here.
Apache Nutch is an open source Web crawler written in Java. By using it, we can find Web page hyperlinks in an automated manner, reduce lots of maintenance work, for example checking broken links, and create a copy of all the visited pages for searching over. That’s where Apache Solr comes in. Solr is an open source full text search framework, with Solr we can search the visited pages from Nutch. Luckily, integration between Nutch and Solr is pretty straightforward as explained below.
Apache Nutch supports Solr out-the-box, greatly simplifying Nutch-Solr integration. It also removes the legacy dependence upon both Apache Tomcat for running the old Nutch Web Application and upon Apache Lucene for indexing. Just download a binary release from here.
Introduction
Apache Nutch is an open-source web crawler written in Java. Nutch manages hyperlink information for us automatically, greatly reducing maintenance work such as checking for broken links, and keeps copies of the visited pages that can be handed to a search engine.
Solr is an open-source full-text search framework. With Solr we can search the pages crawled by Nutch. Fortunately, integrating Nutch with Solr is very straightforward.
Apache Nutch supports Solr out of the box, which greatly simplifies the Nutch-Solr integration. The current version also removes the legacy modules that used Tomcat and Lucene for indexing.
Unofficial notes:
1. Nutch is a web crawler. In a search engine it is responsible for fetching pages and automatically maintaining URL information, for example deduplicating identical pages, refreshing pages on a schedule, and handling redirects.
2. Current versions of Nutch have no search capability of their own, but they can automatically submit crawled pages to a search server. The search server, such as Solr, is a separate open-source project that you have to download yourself.
3. Nutch's built-in commands let you control whether crawled pages are submitted to the index server.
4. Although Nutch is an excellent distributed crawler framework, its whole design serves search engines. It is built on Hadoop's MapReduce framework and is not well suited to data-extraction work. If your task is (fine-grained) data extraction rather than a search engine, Nutch is not necessarily the right choice.
Unix environment, or Windows-Cygwin environment
Java Runtime/Development Environment (1.5+): http://www.oracle.com/technetwork/java/javase/downloads/index-jsp-138363.html
(Source build only) Apache Ant: http://ant.apache.org/
Runtime requirements:
Unix (Linux), or Windows with Cygwin installed
JDK 1.5 or later
Apache Ant
Unofficial notes:
1. Developing Nutch on Linux/Unix is strongly recommended. If you do not have a Linux machine, run a Linux virtual machine on Windows.
2. Apache Ant is essential. Nutch's whole build process is driven by a configuration file named build.xml, which needs Ant to run. The official Nutch source does not ship Eclipse project files, so Eclipse cannot compile Nutch directly. Ant can convert the source into an Eclipse project, but that is not the best approach.
3. To follow the rest of this tutorial you must first install Linux (or Unix/Cygwin), the JDK, and Apache Ant; otherwise the steps below will not work. Installing them may take a few hours, but it is necessary.
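Before moving on, it is worth sanity-checking the environment from a terminal. A minimal sketch (any recent JDK/Ant version in the output is fine):
# Confirm the JDK is installed (1.5 or later)
java -version
# Confirm Apache Ant is installed
ant -version
# Confirm JAVA_HOME points at the JDK
echo $JAVA_HOME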
Installing Nutch
Download a binary package (apache-nutch-1.X-bin.zip) from here.
Unzip your binary Nutch package. There should be a folder apache-nutch-1.X.
cd apache-nutch-1.X/
From now on, we are going to use ${NUTCH_RUNTIME_HOME} to refer to the current directory (apache-nutch-1.X/).
Method 1: Install Nutch from the binary distribution
1. Download the Nutch 1.x binary package.
2. Unzip the downloaded package. It should contain a folder named apache-nutch-1.X.
3. cd into the apache-nutch-1.X folder on the command line.
To simplify the description, the rest of this article uses ${NUTCH_RUNTIME_HOME} to refer to this apache-nutch-1.X folder.
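On the command line, the three steps above look roughly like this (a sketch; the version number 1.9 and the archive URL are only examples, pick a current release from the official download page):
# Download and unpack a binary release, then enter the folder
wget https://archive.apache.org/dist/nutch/1.9/apache-nutch-1.9-bin.zip
unzip apache-nutch-1.9-bin.zip
cd apache-nutch-1.9/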
Advanced users may also use the source distribution:
Download a source package (apache-nutch-1.X-src.zip)
cd apache-nutch-1.X/
Run ant in this folder (cf. RunNutchInEclipse)
Now there is a directory runtime/local which contains a ready to use Nutch installation.
When the source distribution is used ${NUTCH_RUNTIME_HOME} refers to apache-nutch-1.X/runtime/local/. Note that
config files should be modified in apache-nutch-1.X/runtime/local/conf/
ant clean will remove this directory (keep copies of modified config files)
Method 2: Build Nutch from source
Advanced users can also build Nutch from the source distribution:
1. Download the Nutch 1.x source package.
2. Unzip the downloaded package.
3. cd into the apache-nutch-1.X folder on the command line.
4. In the apache-nutch-1.X folder, run the command: ant
5. When the command finishes, a runtime folder appears inside apache-nutch-1.X; runtime/local contains the compiled, ready-to-use Nutch installation.
1. To change Nutch's configuration, edit the files in apache-nutch-1.X/runtime/local/conf/.
2. Running ant clean removes the runtime/local directory, so back up any modified configuration files before running ant clean.
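A compact sketch of the source build (assuming you are already inside the unpacked source folder):
cd apache-nutch-1.X/
ant
# After the build, runtime/local holds the ready-to-use installation
ls runtime/local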
run "bin/nutch" - You can confirm a correct installation if you see something similar to the following:
Usage: nutch COMMAND where command is one of:
  crawl             one-step crawler for intranets (DEPRECATED)
  readdb            read / dump crawl db
  mergedb           merge crawldb-s, with optional filtering
  readlinkdb        read / dump link db
  inject            inject new urls into the database
  generate          generate new segments to fetch from crawl db
  freegen           generate new segments to fetch from text files
  fetch             fetch a segment's pages
  ...
Some troubleshooting tips:
Run the following command if you are seeing "Permission denied":
chmod +x bin/nutch
Setup JAVA_HOME if you are seeing JAVA_HOME not set. On Mac, you can run the following command or add it to ~/.bashrc:
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home
On Debian or Ubuntu, you can run the following command or add it to ~/.bashrc:
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
Run "bin/nutch"; the installation is correct if you see output similar to the usage listing above.
If you see a "Permission denied" error, run the following command:
chmod +x bin/nutch
If you see "JAVA_HOME not set", set the JAVA_HOME environment variable. On a Mac, run the following command or add it to ~/.bashrc:
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home
On Debian or Ubuntu, run the following command or add it to ~/.bashrc:
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
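On newer macOS versions the framework path shown above may no longer exist. macOS ships a helper, /usr/libexec/java_home, that prints the location of the installed JDK, so an alternative is:
# Let macOS report the JDK location
export JAVA_HOME=$(/usr/libexec/java_home)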
Nutch requires two configuration changes before a website can be crawled:
Two configuration changes are required before crawling:
1. Customize the crawl properties; at a minimum, you must give the crawler a name.
2. Provide a seed list of URLs for the crawler.
Default crawl properties can be viewed and edited within conf/nutch-default.xml - where most of these can be used without modification
The file conf/nutch-site.xml serves as a place to add your own custom crawl properties that overwrite conf/nutch-default.xml. The only required modification for this file is to override the value field of the http.agent.name
i.e. Add your agent name in the value field of the http.agent.name property in conf/nutch-site.xml, for example:
<property>
  <name>http.agent.name</name>
  <value>My Nutch Spider</value>
</property>
1. The default crawl properties live in conf/nutch-default.xml. Most of them do not need to be changed.
2. conf/nutch-site.xml is the place for your custom settings; a property set in conf/nutch-site.xml overrides the same property in conf/nutch-default.xml. Only one property must be set: http.agent.name.
For example, add http.agent.name to conf/nutch-site.xml:
<property>
  <name>http.agent.name</name>
  <value>My Nutch Spider</value>
</property>
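If conf/nutch-site.xml is still empty, the whole file looks roughly like the sketch below (the agent name "MyNutchSpider" is only an example, replace it with your own):
# Write a minimal conf/nutch-site.xml containing only http.agent.name
cat > conf/nutch-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MyNutchSpider</value>
  </property>
</configuration>
EOF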
The file conf/regex-urlfilter.txt will provide Regular Expressions that allow nutch to filter and narrow the types of web resources to crawl and download
mkdir -p urls
cd urls
touch seed.txt to create a text file seed.txt under urls/ with the following content (one URL per line for each site you want Nutch to crawl).
http://nutch.apache.org/
Edit the file conf/regex-urlfilter.txt and replace
# accept anything else
+.
with a regular expression matching the domain you wish to crawl. For example, if you wished to limit the crawl to the nutch.apache.org domain, the line should read:
+^http://([a-z0-9]*\.)*nutch.apache.org/
This will include any URL in the domain nutch.apache.org.
NOTE: Not specifying any domains to include within regex-urlfilter.txt will lead to all domains linking to your seed URLs file being crawled as well.
2. conf/regex-urlfilter.txt uses regular expressions to restrict the range of web resources that will be crawled and downloaded.
mkdir -p urls
cd urls
touch seed.txt
Edit seed.txt and set its content to:
http://nutch.apache.org/
Edit conf/regex-urlfilter.txt and replace
# accept anything else
+.
with a regular expression matching the domain you want to crawl. For example, to restrict the crawler to the nutch.apache.org domain, replace it with:
+^http://([a-z0-9]*\.)*nutch.apache.org/
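Put together, the seed list and the URL filter change look roughly like this (a sketch; the sed line assumes GNU sed and that the file still contains the default "+." catch-all rule, otherwise edit the file by hand):
# Create the seed list
mkdir -p urls
echo "http://nutch.apache.org/" > urls/seed.txt
# Replace the catch-all rule "+." with a rule limited to nutch.apache.org
sed -i 's@^+\.$@+^http://([a-z0-9]*\\.)*nutch.apache.org/@' conf/regex-urlfilter.txt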
The crawl command is deprecated. Please see section 3.5 on how to use the crawl script that is intended to replace the crawl command.
Now we are ready to initiate a crawl, use the following parameters:
-dir dir names the directory to put the crawl in.
-threads threads determines the number of threads that will fetch in parallel.
-depth depth indicates the link depth from the root page that should be crawled.
-topN N determines the maximum number of pages that will be retrieved at each level up to the depth.
bin/nutch crawl urls -dir crawl -depth 3 -topN 5
Once the crawl completes, the following directories are created: crawl/crawldb crawl/linkdb crawl/segments
NOTE: If you have a Solr core already set up and wish to index to it, you are required to add the -solr <solrUrl> parameter to your crawl command e.g.
bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5
If not then please skip to here for how to set up your Solr instance and index your crawl data.
Typically one starts testing one's configuration by crawling at shallow depths, sharply limiting the number of pages fetched at each level (-topN), and watching the output to check that desired pages are fetched and undesirable pages are not. Once one is confident of the configuration, then an appropriate depth for a full crawl is around 10. The number of pages per level (-topN) for a full crawl can be from tens of thousands to millions, depending on your resources.
The crawl command is deprecated. See the later section on how to use the crawl script that is intended to replace it.
We run the crawl command with the following parameters:
-dir dir: the directory in which the crawl data is stored
-threads threads: the number of fetch threads running in parallel
-depth depth: the link depth to crawl from the root pages
-topN N: the maximum number of pages fetched at each level
bin/nutch crawl urls -dir crawl -depth 3 -topN 5
After the crawl finishes, the following directories are created: crawl/crawldb crawl/linkdb crawl/segments
NOTE: If you have already set up a Solr server and want it to index Nutch's crawl results, add the -solr <solrUrl> parameter to the crawl command:
bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5
If you do not have a Solr server yet, click here to learn how to set up a Solr instance and index your crawled data.
During testing, set -depth and -topN to small values. Once you are satisfied with the results, set -depth to around 10 for a full crawl.
For a full crawl, -topN may be on the order of tens of thousands or more, depending on the sites you crawl and your resources.
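After a test crawl finishes, the readdb command from the usage listing above gives a quick summary of how many URLs were fetched (a minimal sketch, assuming the crawl directory from the example):
# Print statistics about the crawl database
bin/nutch readdb crawl/crawldb -stats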
Nutch Tutorial, Chinese Translation 1 (official tutorial, Chinese-English) — compiling, installing, and a first run of Nutch
Original article: http://blog.csdn.net/ajaxhu/article/details/41645647