The previous article covered installing Nutch.
This article walks through a simple crawl of the site http://www.6vhao.com
1. Change into the directory nutch-2.3/runtime/local
2. mkdir urls
nano urls/url: add the link to crawl
3. Run the following command from the local directory
./bin/nutch prints all of the available commands:
inject         inject new urls into the database
hostinject     creates or updates an existing host table from a text file
generate       generate new batches to fetch from crawl db
fetch          fetch URLs marked during generate
parse          parse URLs marked during fetch
updatedb       update web table after parsing
updatehostdb   update host table after parsing
readdb         read/dump records from page database
readhostdb     display entries from the hostDB
index          run the plugin-based indexer on parsed batches
elasticindex   run the elasticsearch indexer - DEPRECATED use the index command instead
solrindex      run the solr indexer on parsed batches - DEPRECATED use the index command instead
solrdedup      remove duplicates from solr
solrclean      remove HTTP 301 and 404 documents from solr - DEPRECATED use the clean command instead
clean          remove HTTP 301 and 404 documents and duplicates from indexing backends configured via plugins
parsechecker   check the parser for a given url
indexchecker   check the indexing filters for a given url
plugin         load a plugin and run one of its classes main()
nutchserver    run a (local) Nutch server on a user defined port
webapp         run a local Nutch web application
junit          runs the given JUnit test
 or
CLASSNAME      run the class named CLASSNAME
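Of the commands listed above, inject, generate, fetch, parse, and updatedb make up one crawl round. The sketch below only builds and prints those command lines; it does not run them. The helper name crawl_cycle, the crawl ID "data", and the topN value are assumptions for illustration, and the flags (-crawlId, -topN, -all) follow the Nutch 2.x usage text as I recall it, so check them against the usage each subcommand prints before relying on them.

```python
import shlex


def crawl_cycle(seed_dir, crawl_id, top_n=50):
    """Return the nutch invocations for one crawl round (hypothetical helper)."""
    nutch = "./bin/nutch"
    return [
        # Load the seed URLs from seed_dir into the web table.
        [nutch, "inject", seed_dir, "-crawlId", crawl_id],
        # Select up to top_n URLs for the next fetch batch.
        [nutch, "generate", "-topN", str(top_n), "-crawlId", crawl_id],
        # Fetch, parse, and write back every pending batch.
        [nutch, "fetch", "-all", "-crawlId", crawl_id],
        [nutch, "parse", "-all", "-crawlId", crawl_id],
        [nutch, "updatedb", "-all", "-crawlId", crawl_id],
    ]


# Print the commands so they can be pasted into a shell in runtime/local.
for argv in crawl_cycle("urls", "data"):
    print(shlex.join(argv))
```

Repeating generate, fetch, parse, and updatedb takes the crawl one link level deeper each round.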
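4. We start with the ./bin/crawl command, which crawls the pages in one step
5. When the crawl finishes, change into the hbase directory
Run ./bin/hbase shell to enter the HBase shell. The list command shows the current table, data_webpage; Nutch added the suffix to its name
6. In the HBase shell, run scan 'data_webpage' to view its contents. Some sample rows copied from the output: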
tv.66ys.www:http/zy/ column=f:ts, timestamp=1446050113914, value=\x00\x00\x01P\xAFM\xA9s
tv.66ys.www:http/zy/ column=il:http://www.66ys.tv/, timestamp=1446050113914, value=\xE7\xBB\xBC\xE8\x89\xBA
tv.66ys.www:http/zy/ column=mk:dist, timestamp=1446050113914, value=2
tv.66ys.www:http/zy/ column=mtdt:_csh_, timestamp=1446050113914, value=\x00\x00\x00\x00
tv.66ys.www:http/zy/ column=s:s, timestamp=1446050113914, value=\x00\x00\x00\x00
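The sample rows are easier to read once decoded. The row key is the page URL with its host name reversed, and the escaped values are raw bytes. Below is a minimal Python sketch with three hypothetical helpers (reverse_url, unescape_hbase, decode_long_ms). Reading f:ts as an 8-byte big-endian millisecond timestamp and the il: value as UTF-8 text are assumptions that happen to fit this sample; the sketch is not taken from the Nutch source.

```python
import struct
from datetime import datetime, timedelta, timezone
from urllib.parse import urlsplit


def reverse_url(url):
    """Build a row key in the style of the sample: reversed host, ':', scheme, path."""
    parts = urlsplit(url)
    key = ".".join(reversed(parts.hostname.split("."))) + ":" + parts.scheme
    if parts.port is not None:
        key += ":" + str(parts.port)
    path = parts.path
    if parts.query:
        path += "?" + parts.query
    return key + path


def unescape_hbase(text):
    """Turn an HBase shell value such as r'\\x00\\x01P' back into raw bytes."""
    # The shell shows printable ASCII as-is and every other byte as \xNN.
    return text.encode("ascii").decode("unicode_escape").encode("latin-1")


def decode_long_ms(text):
    """Read an escaped 8-byte value as a big-endian signed long (assumed milliseconds)."""
    (millis,) = struct.unpack(">q", unescape_hbase(text))
    return millis


print(reverse_url("http://www.66ys.tv/zy/"))

millis = decode_long_ms(r"\x00\x00\x01P\xAFM\xA9s")
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
print(millis, (epoch + timedelta(milliseconds=millis)).isoformat())

print(unescape_hbase(r"\xE7\xBB\xBC\xE8\x89\xBA").decode("utf-8"))
```

Decoded this way, the key maps back to http://www.66ys.tv/zy/, the f:ts value falls a few milliseconds before the cell timestamp 1446050113914, and the il: value reads as the Chinese text 综艺 (variety shows), apparently the anchor text of the link from http://www.66ys.tv/.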
More next time.
Original article: http://my.oschina.net/u/2494265/blog/523358