Tags: nutch
References
http://nlp.solutions.asia/?p=362
http://blog.csdn.net/fby98710/article/details/10367175
http://blog.csdn.net/itufo/article/details/21519593
Requires a JDK 1.7 environment.
1. MySQL database configuration
• my.ini settings
Add "default-character-set=utf8" under both [client] and [mysql];
Add under [mysqld]: character-set-server=utf8
• Granting privileges
mysql -u root -pxxxx
GRANT ALL PRIVILEGES ON *.* TO root@"%" IDENTIFIED BY "xxxx";
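• Creating the database and table
Manually create the database nutch:
CREATE DATABASE nutch DEFAULT CHARACTER SET utf8mb4 DEFAULT COLLATE utf8mb4_unicode_ci;
and the table webpage (if you do not want the default database and table names, you can change them in the relevant configuration files after installing Nutch; see below). The structure of webpage is as follows: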
CREATE TABLE `webpage` (
`id` varchar(255) NOT NULL, -- if set to 767, the following occurs: …
`headers` blob,
`text` longtext DEFAULT NULL,
`status` int(11) DEFAULT NULL,
`markers` blob,
`parseStatus` blob,
`modifiedTime` bigint(20) DEFAULT NULL,
`prevModifiedTime` bigint(20) DEFAULT NULL,
`score` float DEFAULT NULL,
`typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
`batchId` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
`baseUrl` varchar(767) DEFAULT NULL,
`content` longblob,
`title` varchar(2048) DEFAULT NULL,
`reprUrl` varchar(767) DEFAULT NULL,
`fetchInterval` int(11) DEFAULT NULL,
`prevFetchTime` bigint(20) DEFAULT NULL,
`inlinks` mediumblob,
`prevSignature` blob,
`outlinks` mediumblob,
`fetchTime` bigint(20) DEFAULT NULL,
`retriesSinceFetch` int(11) DEFAULT NULL,
`protocolStatus` blob,
`signature` blob,
`metadata` blob,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
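Note: the columns of the table follow Nutch's conf file gora-sql-mapping.xml. The database and table can also be generated automatically: once gora-sql-mapping.xml, gora.properties and the other files are configured, the first run of "bin/nutch inject urls" creates them. The automatic route may fail, but that is not serious: hadoop.log will show that most failures come from data types or column lengths that MySQL does not support. Adjust the settings as the log suggests (a tool such as Navicat makes the database as convenient to work with as SQL Server), then rerun the automatic generation until it succeeds.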
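As background for the length-related errors: with the older COMPACT and REDUNDANT row formats, InnoDB limits an index key prefix to 767 bytes, and MySQL's utf8 needs up to 3 bytes per character while utf8mb4 needs up to 4. The sketch below only does that arithmetic. The function name is hypothetical, and whether the 767-byte limit applies at all depends on your MySQL version and row format.

```python
# Assumption: InnoDB index key prefix limit of 767 bytes
# (COMPACT/REDUNDANT row formats); newer setups may allow more.
INNODB_KEY_PREFIX_BYTES = 767

# Maximum bytes MySQL reserves per character for each character set.
MAX_BYTES_PER_CHAR = {"latin1": 1, "utf8": 3, "utf8mb4": 4}


def max_indexable_chars(charset, limit_bytes=INNODB_KEY_PREFIX_BYTES):
    """Longest VARCHAR length (in characters) that fits in the key limit."""
    return limit_bytes // MAX_BYTES_PER_CHAR[charset]


if __name__ == "__main__":
    for charset in ("latin1", "utf8", "utf8mb4"):
        print(charset, max_indexable_chars(charset))
```

Under these assumptions a utf8 primary key fits up to 255 characters and a utf8mb4 one up to 191, which is worth keeping in mind when choosing the length of the id column.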
2. Installing and configuring Nutch
1) Get Nutch 2.2.x: download it from the official site, http://www.apache.org/dyn/closer.cgi/nutch/, and unpack it into a local installation directory, referred to below as ${NUTCH_HOME} (also written ${APACHE_NUTCH_HOME});
2) Configure Nutch's MySQL support by editing ${APACHE_NUTCH_HOME}/ivy/ivy.xml as follows:
• Uncomment the following line:
<dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>
• Change the following line:
From the default <dependency org="org.apache.gora" name="gora-core" rev="0.3" conf="*->default"/> to <dependency org="org.apache.gora" name="gora-core" rev="0.2.1" conf="*->default"/>
• Uncomment the following line:
<dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default" />
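Since a dependency that is still commented out is easy to miss, here is a small sketch for checking which revisions are active in ivy.xml. It relies on Python's ElementTree skipping XML comments by default. The function name and the trimmed-down sample document are assumptions made for illustration, not part of Nutch.

```python
import xml.etree.ElementTree as ET


def find_dependency_rev(ivy_xml_text, org, name):
    """Return the rev of the first active (uncommented) dependency, or None."""
    root = ET.fromstring(ivy_xml_text)
    # Commented-out <dependency> elements never show up here because the
    # default parser drops XML comments.
    for dep in root.iter("dependency"):
        if dep.get("org") == org and dep.get("name") == name:
            return dep.get("rev")
    return None


# Hypothetical, heavily shortened ivy.xml used only for this example.
SAMPLE = """<ivy-module version="1.0">
  <dependencies>
    <dependency org="org.apache.gora" name="gora-core" rev="0.2.1" conf="*->default"/>
    <dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default"/>
    <!-- <dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/> -->
  </dependencies>
</ivy-module>"""

if __name__ == "__main__":
    print(find_dependency_rev(SAMPLE, "org.apache.gora", "gora-core"))
    print(find_dependency_rev(SAMPLE, "mysql", "mysql-connector-java"))
```

In the sample the MySQL connector is deliberately left commented out, so the lookup returns None for it. Against a real file, pass the contents of ivy/ivy.xml instead of SAMPLE.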
3) Database connection configuration
Edit ${NUTCH_HOME}/conf/gora.properties: comment out the default database connection settings and add the following:
###############################
#Default MySQL properties #
###############################
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true
# MySQL user name
gora.sqlstore.jdbc.user=xxxx
# MySQL password
gora.sqlstore.jdbc.password=xxxx
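To confirm that all four connection settings are present before building, a rough check like the following can help. It is a simplified reader for key=value lines, not a full implementation of the Java .properties format, and the function names are made up for this example.

```python
# The four settings listed above.
REQUIRED_KEYS = (
    "gora.sqlstore.jdbc.driver",
    "gora.sqlstore.jdbc.url",
    "gora.sqlstore.jdbc.user",
    "gora.sqlstore.jdbc.password",
)


def parse_properties(text):
    """Read simple key=value lines, skipping blanks and # or ! comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        # Split on the first "=" only, so "=" inside a JDBC URL is preserved.
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def missing_keys(props):
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not props.get(key)]


SAMPLE = """
#Default MySQL properties #
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true
gora.sqlstore.jdbc.user=xxxx
gora.sqlstore.jdbc.password=xxxx
"""

if __name__ == "__main__":
    print(missing_keys(parse_properties(SAMPLE)))
```

For a real check, read conf/gora.properties and pass its text to parse_properties.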
4) Edit the ${APACHE_NUTCH_HOME}/conf/gora-sql-mapping.xml file, changing the length of the primarykey from 512 to 767 in both places.
<primarykey column="id" length="767"/>
5) Configure ${APACHE_NUTCH_HOME}/conf/nutch-site.xml: put a name in the value field under http.agent.name. It can be anything, but it cannot be left blank. Add additional languages if you want (I have added Japanese, ja-jp, below) and set utf-8 as the default encoding as well. You must also set storage.data.store.class to SqlStore.
<property>
<name>http.agent.name</name>
<value>YourNutchSpider</value>
</property>
<property>
<name>http.accept.language</name>
<value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
<description>Value of the "Accept-Language" request header field.
This allows selecting non-English language as default one to retrieve.
It is a useful setting for search engines build for certain national group.
</description>
</property>
<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
<description>The character encoding to fall back to when no other information
is available</description>
</property>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.sql.store.SqlStore</value>
<description>The Gora DataStore class for storing and retrieving data.
Currently the following stores are available: ….
</description>
</property>
• In particular, add the following:
<property>
<name>generate.batch.id</name>
<value>*</value>
</property>
Without this property, crawling with "bin/nutch crawl urls -threads n -depth n" produces the following error in the log:
java.lang.NullPointerException
at org.apache.avro.util.Utf8.<init>(Utf8.java:37)
Also, the nutch-site.xml file must be saved as UTF-8; otherwise nutch commands fail with the following error:
Exception in thread "main" java.lang.RuntimeException: com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
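Two of the failure modes for nutch-site.xml, a file that is not UTF-8 and a blank http.agent.name, can be caught before running any nutch command. A minimal sketch, assuming the usual property/name/value layout of the file; the function names are invented for this example.

```python
import xml.etree.ElementTree as ET


def load_site_properties(raw):
    """Parse nutch-site.xml content (bytes) into a {name: value} dict.

    Raises UnicodeDecodeError if the bytes are not valid UTF-8 and
    xml.etree.ElementTree.ParseError if the XML is malformed.
    """
    raw.decode("utf-8")        # strict encoding check; result is discarded
    root = ET.fromstring(raw)
    props = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        if name:
            props[name] = (prop.findtext("value") or "").strip()
    return props


def agent_name_is_set(props):
    """True when http.agent.name exists and is not blank."""
    return bool(props.get("http.agent.name"))


# Shortened sample used only for this example.
SAMPLE = b"""<configuration>
  <property>
    <name>http.agent.name</name>
    <value>YourNutchSpider</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    print(agent_name_is_set(load_site_properties(SAMPLE)))
```

To check a real file, read it in binary mode, for example open("conf/nutch-site.xml", "rb").read(), and pass the bytes to load_site_properties.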
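6) Build Nutch 2.2
Make sure ant is installed (if not, search online for installation instructions), return to the Nutch root directory, and build with ant: ant runtime. This can take up to several hours. If every step above was followed, the build should complete without trouble. That finishes the Nutch 2.2 installation; next, set up what to crawl and start crawling.
cd ${APACHE_NUTCH_HOME}/runtime/local
mkdir -p urls
echo 'http://nutch.apache.org/' > urls/seed.txt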
Start crawling. You will want to write your own script later, but to see what is happening, type the following commands manually:
bin/nutch inject urls
bin/nutch generate -topN 20
bin/nutch fetch -all
bin/nutch parse -all
bin/nutch updatedb
Repeat the last four commands (generate, fetch, parse and updatedb) again.
For the generate command, topN is the maximum number of links you want to actually parse in each round. The first time there is only one URL (the one injected from seed.txt), but after that there are many more. Note, however, that Nutch keeps track of every link it encounters in the webpage table and only limits the number it actually parses to topN, so don't be surprised to see many more rows in the webpage table than the topN limit suggests.
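The inject step followed by repeated generate/fetch/parse/updatedb rounds can be wrapped in a small script. The following is only a sketch of such a wrapper, using the same commands listed above. The function names, the run parameter, and the default of two rounds are assumptions for illustration, and it expects to be started from runtime/local.

```python
import subprocess


def round_commands(top_n, nutch="bin/nutch"):
    """Commands for one crawl round, in the order used above."""
    return [
        [nutch, "generate", "-topN", str(top_n)],
        [nutch, "fetch", "-all"],
        [nutch, "parse", "-all"],
        [nutch, "updatedb"],
    ]


def crawl(rounds, top_n, seed_dir="urls", nutch="bin/nutch",
          run=subprocess.check_call):
    """Inject the seed list once, then repeat the four-step round."""
    run([nutch, "inject", seed_dir])
    for _ in range(rounds):
        for command in round_commands(top_n, nutch):
            run(command)


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    # Remove run=print to actually invoke bin/nutch.
    crawl(rounds=2, top_n=20, run=print)
```

Passing run as a parameter keeps the dry run and the real run on the same code path; with the default subprocess.check_call the script stops at the first command that exits with an error.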
Check your crawl results by looking at the webpage table in the nutch database.
mysql -u xxxxx -p
use nutch;
SELECT * FROM nutch.webpage;
// Only mysql-connector-java-5.1.18.jar works; other versions of the jar cause errors.
// gora-sql-0.2.1.jar needs to be loaded
http://search.maven.org/remotecontent?filepath=org/apache/gora/gora-sql/
// gora-sql-0.1.1-incubating.jar
Original article: http://itsart.blog.51cto.com/1005243/1559008