
Nutch Analysis, Part 1

Date: 2014-07-08 18:20:36

Tags: nutch, hadoop


1. Building Nutch

tar -zxvf apache-nutch-2.2.1-src.tar.gz 

cd apache-nutch-2.2.1

ant runtime


2. Once the ant build finishes, it generates a runtime folder. That folder contains two subfolders, deploy and local, which correspond to Nutch's two run modes:

deploy: the data being processed must live in Hadoop's HDFS.

local: runs against directories on the local filesystem.

(1) The two directories are laid out as follows:

[jediael@jediael44 runtime]$ ls deploy/ local/

deploy/:

apache-nutch-2.2.1.job  bin 

local/:

bin  conf  lib logs  plugins  test

In deploy, everything is packaged into a single .job file, which is submitted to Hadoop and run as a Hadoop job.
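The difference between the two layouts can be checked with a few lines of shell. This is a minimal sketch, assuming that the presence of a .job file is what marks a deploy directory (this matches the listing above, but how the real nutch script decides is not shown in this article). The function name detect_mode is hypothetical and is not part of Nutch.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which run mode a Nutch runtime directory represents.
# Assumption: a directory that contains a *.job file is a deploy directory.
detect_mode() {
  local dir="$1"
  local f
  for f in "$dir"/*.job; do
    # If the glob matched nothing, $f is the literal pattern and -f fails.
    if [ -f "$f" ]; then
      echo "deploy"
      return 0
    fi
  done
  echo "local"
}
```

Run from the runtime folder, detect_mode deploy should print deploy and detect_mode local should print local.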

(2) Both directories contain a bin directory holding the same two executables, crawl and nutch.

Let's look at the last few lines of the nutch script:

if $local; then
 # fix for the external Xerces lib issue with SAXParserFactory
 NUTCH_OPTS="-Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl $NUTCH_OPTS"
 EXEC_CALL="$JAVA $JAVA_HEAP_MAX $NUTCH_OPTS -classpath $CLASSPATH"
else
 # check that hadoop can be found on the path
 if [ $(which hadoop | wc -l ) -eq 0 ]; then
    echo "Can't find Hadoop executable. Add HADOOP_HOME/bin to the path or run in local mode."
    exit -1;
 fi
 # distributed mode
 EXEC_CALL="hadoop jar $NUTCH_JOB"
fi

# run it
exec $EXEC_CALL $CLASS "$@"

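In other words, the default is EXEC_CALL="hadoop jar $NUTCH_JOB". In local mode it is EXEC_CALL="$JAVA $JAVA_HEAP_MAX $NUTCH_OPTS -classpath $CLASSPATH". If the mode is not local and the hadoop executable cannot be found on the PATH, the script prints an error and exits.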
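To see which launcher the branch above would pick without actually starting a job, the same logic can be wrapped in a function that prints the command prefix instead of exec'ing it. This is a dry-run sketch only. The function name build_exec_call and every variable value below are placeholders chosen for illustration, not values taken from the nutch script.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print the launcher prefix instead of exec'ing it.
# All values here are illustrative placeholders, not Nutch defaults.
JAVA="java"
JAVA_HEAP_MAX="-Xmx1000m"
NUTCH_OPTS="-Dexample=1"
CLASSPATH="conf:lib/*"
NUTCH_JOB="apache-nutch-2.2.1.job"

# build_exec_call is a hypothetical name; pass "true" for local mode.
build_exec_call() {
  local is_local="$1"
  if [ "$is_local" = "true" ]; then
    # local mode: plain JVM launch with an explicit classpath
    echo "$JAVA $JAVA_HEAP_MAX $NUTCH_OPTS -classpath $CLASSPATH"
  elif ! command -v hadoop >/dev/null 2>&1; then
    # distributed mode requested, but hadoop is not on the PATH
    echo "Can't find Hadoop executable." >&2
    return 1
  else
    # distributed mode: hand the .job file to hadoop
    echo "hadoop jar $NUTCH_JOB"
  fi
}
```

Calling build_exec_call true prints the java command line. Calling build_exec_call false prints the hadoop jar command line when hadoop is installed, and otherwise returns a non-zero status.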

(3) The command argument determines which class is run

if [ "$COMMAND" = "crawl" ] ; then
  CLASS=org.apache.nutch.crawl.Crawler
elif [ "$COMMAND" = "inject" ] ; then
  CLASS=org.apache.nutch.crawl.InjectorJob
elif [ "$COMMAND" = "hostinject" ] ; then
  CLASS=org.apache.nutch.host.HostInjectorJob
elif [ "$COMMAND" = "generate" ] ; then
  CLASS=org.apache.nutch.crawl.GeneratorJob
elif [ "$COMMAND" = "fetch" ] ; then
  CLASS=org.apache.nutch.fetcher.FetcherJob
elif [ "$COMMAND" = "parse" ] ; then
  CLASS=org.apache.nutch.parse.ParserJob
elif [ "$COMMAND" = "updatedb" ] ; then
  CLASS=org.apache.nutch.crawl.DbUpdaterJob
elif [ "$COMMAND" = "updatehostdb" ] ; then
  CLASS=org.apache.nutch.host.HostDbUpdateJob
elif [ "$COMMAND" = "readdb" ] ; then
  CLASS=org.apache.nutch.crawl.WebTableReader
elif [ "$COMMAND" = "readhostdb" ] ; then
  CLASS=org.apache.nutch.host.HostDbReader
elif [ "$COMMAND" = "elasticindex" ] ; then
  CLASS=org.apache.nutch.indexer.elastic.ElasticIndexerJob
elif [ "$COMMAND" = "solrindex" ] ; then
  CLASS=org.apache.nutch.indexer.solr.SolrIndexerJob
elif [ "$COMMAND" = "solrdedup" ] ; then
  CLASS=org.apache.nutch.indexer.solr.SolrDeleteDuplicates
elif [ "$COMMAND" = "parsechecker" ] ; then
  CLASS=org.apache.nutch.parse.ParserChecker
elif [ "$COMMAND" = "indexchecker" ] ; then
  CLASS=org.apache.nutch.indexer.IndexingFiltersChecker
elif [ "$COMMAND" = "plugin" ] ; then
  CLASS=org.apache.nutch.plugin.PluginRepository

For example, the nutch fetch command maps to the class org.apache.nutch.fetcher.FetcherJob:

[jediael@jediael44 java]$ cat org/apache/nutch/fetcher/FetcherJob.java 

This prints the source of that class. The same approach locates the source file behind any nutch shell command.
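The class-to-file step follows the usual Java convention that a package name mirrors the directory path. A minimal sketch of that conversion, assuming a top-level class and assuming you are in the Java source root as in the prompt above (class_to_source is a hypothetical helper name, not part of Nutch):

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn a fully qualified class name into a relative source path.
# Assumption: the class is a top-level class, so its file name matches its simple name.
class_to_source() {
  # Replace every "." in the qualified name with "/" and append ".java".
  printf '%s.java\n' "$(printf '%s' "$1" | tr . /)"
}
```

For instance, cat "$(class_to_source org.apache.nutch.fetcher.FetcherJob)" is equivalent to the cat command shown above.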



Original article: http://blog.csdn.net/jediael_lu/article/details/37356035
