
Initial Nutch Setup (IDEA)

Date: 2017-09-15 13:45:41


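1. Environment setup (Ant): download apache-ant-1.9.9-bin.zip from http://ant.apache.org/, extract it to a directory of your choice, and configure the environment variables. Set ANT_HOME to the extracted directory (in my case F:\life\rainofsky\apache-ant-1.9.9) and add %ANT_HOME%\bin to Path.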

2. Download the Nutch source code: http://nutch.apache.org/downloads.html

After extracting the archive, edit ivy/ivy.xml and enable the following two dependencies:

<dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default" />
  
<dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>

3. In the Nutch root directory, open a command prompt and run: ant eclipse -verbose


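This kicks off a long run of JAR downloads that takes close to half an hour. The download location can presumably be configured as well: by default Ivy caches everything under the user home directory, which on my machine is on the C: drive. That is a bit hard on my computer, though it does not matter if you have plenty of disk space.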


In my case it took 29 minutes, which is acceptable.

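4. Import Nutch into IDEA. The ant eclipse step generates Eclipse project files, which IDEA can import as an existing Eclipse project.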

5. Edit conf/nutch-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <!-- Crawler name -->
    <property>
        <name>http.agent.name</name>
        <value>mySplider</value>
    </property>
    <!-- Languages the crawler accepts -->
    <property>
        <name>http.accept.language</name>
        <value>ja-jp, en-us,en-gb,en,zh-cn,zh-tw;q=0.7,*;q=0.3</value>
        <description>Value of the "Accept-Language" request header field.
            This allows selecting non-English language as default one to retrieve.
            It is a useful setting for search engines build for certain national group.</description>
    </property>
    <!-- Default character encoding for crawled text -->
    <property>
        <name>parser.character.encoding.default</name>
        <value>utf-8</value>
        <description>The character encoding to fall back to when no other information
            is available</description>
    </property>
    <!-- Directory containing the crawler plugins -->
    <property>
        <name>plugin.folders</name>
        <value>src/plugin</value>
        <description>Directories where nutch plugins are located. Each
            element may be a relative or absolute path. If absolute, it is used
            as is. If relative, it is searched for on the classpath.</description>
    </property>
    <!-- Use SQL as the crawler's storage backend -->
    <property>
        <name>storage.data.store.class</name>
        <value>org.apache.gora.sql.store.SqlStore</value>
        <description>The Gora DataStore class for storing and retrieving data.
            Currently the following stores are available: ….</description>
    </property>
    <!-- Generated batch ID -->
    <property>
        <name>generate.batch.id</name>
        <value>*</value>
    </property>
</configuration>
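
Since a stray character in a hand-edited nutch-site.xml is easy to miss, here is a minimal sketch for checking that the file is well-formed and listing the overrides it contains. It uses only the JDK's built-in XML parser; Nutch itself loads this file through its own configuration machinery, so this is a stand-in for a quick sanity check, not how Nutch reads it. The class name SiteConfigReader and its methods are hypothetical names introduced for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper (not part of Nutch): lists the <property> overrides in a config file.
class SiteConfigReader {

    // Parses the XML text and returns name -> value pairs in document order.
    static Map<String, String> readProperties(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList properties = doc.getElementsByTagName("property");
            Map<String, String> result = new LinkedHashMap<>();
            for (int i = 0; i < properties.getLength(); i++) {
                Element property = (Element) properties.item(i);
                String name = childText(property, "name");
                if (name != null) {
                    result.put(name, childText(property, "value"));
                }
            }
            return result;
        } catch (Exception e) {
            // Any parse failure means the file is not well-formed XML.
            throw new IllegalArgumentException("Not a well-formed configuration file", e);
        }
    }

    // Returns the trimmed text of the first child element with the given tag, or null.
    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws IOException {
        // Default path assumes the program is run from the Nutch root directory.
        String path = args.length > 0 ? args[0] : "conf/nutch-site.xml";
        String xml = new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
        readProperties(xml).forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

Run from the Nutch root, it prints one "name = value" line per property, which makes a missing or misspelled override easy to spot.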

6. Configure conf/gora.properties:

gora.datastore.default=org.apache.gora.sql.store.SqlStore
gora.datastore.autocreateschema=true
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true&useUnicode=true&characterEncoding=utf8&autoReconnect=true&zeroDateTimeBehavior=convertToNull
gora.sqlstore.jdbc.user=root
gora.sqlstore.jdbc.password=password
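
A quick way to catch a missing or empty entry before starting a crawl is to load the file with java.util.Properties and check the keys. The sketch below assumes that the keys shown in the listing above are the ones that matter; that set is taken from this post, not verified against the Gora documentation. GoraPropertiesCheck and its methods are hypothetical names introduced for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper (not part of Gora): reports expected keys that are missing or empty.
class GoraPropertiesCheck {

    // Assumption: these keys, copied from the listing above, are the ones the SQL store needs.
    static final String[] EXPECTED_KEYS = {
        "gora.datastore.default",
        "gora.sqlstore.jdbc.driver",
        "gora.sqlstore.jdbc.url",
        "gora.sqlstore.jdbc.user",
        "gora.sqlstore.jdbc.password"
    };

    // Loads key=value pairs from a reader in the standard .properties format.
    static Properties parse(Reader reader) {
        Properties props = new Properties();
        try {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Convenience overload for in-memory text.
    static Properties parseText(String text) {
        return parse(new StringReader(text));
    }

    // Returns the expected keys that are absent or blank.
    static List<String> missingKeys(Properties props) {
        List<String> missing = new ArrayList<>();
        for (String key : EXPECTED_KEYS) {
            String value = props.getProperty(key);
            if (value == null || value.trim().isEmpty()) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        // Default path assumes the program is run from the Nutch root directory.
        String path = args.length > 0 ? args[0] : "conf/gora.properties";
        try (Reader reader = Files.newBufferedReader(Paths.get(path), StandardCharsets.ISO_8859_1)) {
            List<String> missing = missingKeys(parse(reader));
            System.out.println(missing.isEmpty()
                    ? "All expected keys are set."
                    : "Missing or empty: " + missing);
        }
    }
}
```

This only checks that the entries exist; it does not open a database connection, so a wrong password or URL will still surface only when Nutch runs.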

7. Create the MySQL database and table schema:

CREATE TABLE webpage (
    id varchar(256) NOT NULL,
    headers blob,
    text longtext DEFAULT NULL,
    status int(11) DEFAULT NULL,
    markers blob,
    parseStatus blob,
    modifiedTime bigint(20) DEFAULT NULL,
    prevModifiedTime bigint(20) DEFAULT NULL,
    score float DEFAULT NULL,
    typ varchar(32) CHARACTER SET latin1 DEFAULT NULL,
    batchId varchar(32) CHARACTER SET latin1 DEFAULT NULL,
    baseUrl varchar(256) DEFAULT NULL,
    content longblob,
    title text DEFAULT NULL,
    reprUrl varchar(256) DEFAULT NULL,
    fetchInterval int(11) DEFAULT NULL,
    prevFetchTime bigint(20) DEFAULT NULL,
    inlinks mediumblob,
    prevSignature blob,
    outlinks mediumblob,
    fetchTime bigint(20) DEFAULT NULL,
    retriesSinceFetch int(11) DEFAULT NULL,
    protocolStatus blob,
    signature blob,
    metadata blob,
    PRIMARY KEY (id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

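Executing this SQL statement failed with an error.

After a lot of searching, I found that this MySQL version allows at most 255 characters here, so changing 256 to 255 fixed it. The likely cause is the index key length limit: with the utf8 character set each character can take up to 3 bytes, and InnoDB in older MySQL versions caps an index key at 767 bytes, so a varchar(256) primary key (768 bytes) is one character too long.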
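
The arithmetic behind the 255-character ceiling can be sketched as follows. This assumes the error was MySQL's index key length limit, with 767 bytes per key and up to 3 bytes per character for the utf8 character set; the original error screenshot is not available, so treat both numbers as assumptions about my setup. IndexKeyLimit is a hypothetical name introduced for illustration.

```java
// Hypothetical helper: worst-case byte size of an indexed VARCHAR column versus a key limit.
class IndexKeyLimit {

    // Largest VARCHAR length (in characters) whose worst-case size fits within the limit.
    static int maxIndexableChars(int keyLimitBytes, int maxBytesPerChar) {
        return keyLimitBytes / maxBytesPerChar;
    }

    // True if a column of the given character length fits within the key limit.
    static boolean fits(int chars, int keyLimitBytes, int maxBytesPerChar) {
        return (long) chars * maxBytesPerChar <= keyLimitBytes;
    }

    public static void main(String[] args) {
        int keyLimitBytes = 767; // assumed InnoDB index key limit on older MySQL versions
        int utf8BytesPerChar = 3; // assumed worst case for MySQL's utf8 character set

        System.out.println(maxIndexableChars(keyLimitBytes, utf8BytesPerChar)); // prints 255
        System.out.println(fits(256, keyLimitBytes, utf8BytesPerChar));         // prints false (768 bytes)
        System.out.println(fits(255, keyLimitBytes, utf8BytesPerChar));         // prints true (765 bytes)
    }
}
```

If this explanation is right, only the indexed column (the id primary key) actually needs to shrink to 255; baseUrl and reprUrl are not indexed in this schema.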

With that, the environment is configured and ready to run, although it still needs to be tested. I will post the test results in a follow-up.

Original post: http://www.cnblogs.com/xuyd1108/p/7525620.html
