Lucene-based case development: crawling Zongheng novel reading pages

Date: 2015-04-08 10:56:51

Please credit the source when reposting: http://blog.csdn.net/xiaojimanman/article/details/44937073

http://www.llwjy.com/blogdetail/29bd8de30e8d17871c707b76ec3212b0.html

My personal blog is now online at www.llwjy.com. Comments and criticism are welcome.
-------------------------------------------------------------------------------------------------

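The previous three posts covered collecting information from Zongheng's update list page, novel introduction page, and chapter list page. This post focuses on the most important one, the reading page. As before, a single URL serves as the example: http://book.zongheng.com/chapter/362857/6001264.html .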

Page analysis

The page at the URL above looks like this:

[Screenshot: the reading page as rendered in the browser]

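As with the chapter list page, the source of the reading page cannot be obtained by simply right-clicking and choosing View Page Source, so we again use F12 --> Network --> Ctrl+F5 to locate the page source. A screenshot of the result: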

[Screenshot: the page source located in the browser's Network panel]

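A quick search of the page source shows that the title, word count, and chapter content sit on lines 47, 141, and 145 respectively (line numbers may differ slightly from page to page, so check them against the page you are actually looking at).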

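The regular expressions for these three fields are much the same as those in the earlier posts, and the method has already been explained there, so only the final results are given here: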

// Regex for the chapter content
private static final String CONTENT = "<div id=\"chapterContent\" class=\"content\" itemprop=\"acticleBody\">(.*?)</div>";
// Regex for the title
private static final String TITLE = "chapterName=\"(.*?)\"";
// Regex for the word count
private static final String WORDCOUNT = "itemprop=\"wordCount\">(\\d*)</span>";
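
To make the three patterns concrete, here is a minimal sketch that applies them with the standard java.util.regex classes. It is a stand-in under stated assumptions, not the author's code: the post's DoRegex.getFirstString and ParseUtil.parseStringToInt are not shown here, so the helpers firstMatch and parseIntOrDefault below are hypothetical equivalents, and the HTML fragment is invented for illustration. The sketch enables DOTALL on the assumption that chapter text contains line breaks.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReadPageRegexSketch {
    // The same three patterns as in the post.
    static final String CONTENT = "<div id=\"chapterContent\" class=\"content\" itemprop=\"acticleBody\">(.*?)</div>";
    static final String TITLE = "chapterName=\"(.*?)\"";
    static final String WORDCOUNT = "itemprop=\"wordCount\">(\\d*)</span>";

    /** Hypothetical helper: returns the given group of the first match, or "" if nothing matches. */
    static String firstMatch(String source, String regex, int group) {
        // DOTALL lets (.*?) span line breaks inside the chapter text.
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(source);
        return m.find() ? m.group(group) : "";
    }

    /** Hypothetical helper: parses an int, falling back to a default when the text is not a number. */
    static int parseIntOrDefault(String text, int defaultValue) {
        try {
            return Integer.parseInt(text.trim());
        } catch (NumberFormatException e) {
            return defaultValue;
        }
    }

    public static void main(String[] args) {
        // Invented, heavily simplified page fragment; the real page is much larger.
        String html = "<div chapterName=\"Chapter One\"></div>\n"
                + "<span itemprop=\"wordCount\">2048</span>\n"
                + "<div id=\"chapterContent\" class=\"content\" itemprop=\"acticleBody\">"
                + "<p>First line</p>\n<p>Second line</p></div>";
        System.out.println(firstMatch(html, TITLE, 1));
        System.out.println(parseIntOrDefault(firstMatch(html, WORDCOUNT, 1), 0));
        System.out.println(firstMatch(html, CONTENT, 1));
    }
}
```

Because the content pattern is non-greedy and stops at the first closing div, it only works if the chapter body itself contains no nested div elements.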

Run result

[Screenshot: program output]

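Looking at the screenshot of the run result, you may notice that the chapter content still contains some HTML tags. Because this case ultimately displays the content on a web page, we cut a corner here and leave them in. If you need to remove the tags, you can replace them with String's replaceAll method.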
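
As a rough illustration of the replaceAll approach, the sketch below turns paragraph ends and line breaks into newlines and drops every other tag. The class and method names are hypothetical, and the regex is deliberately crude: it does not decode HTML entities or handle malformed markup, so treat it as a starting point rather than a robust cleaner.

```java
class StripTagsSketch {
    /** Converts chapter HTML to plain text: </p> and <br> become newlines, other tags are removed. */
    static String stripTags(String html) {
        return html
                .replaceAll("(?i)<br\\s*/?>|</p>", "\n") // keep paragraph and line boundaries
                .replaceAll("<[^>]+>", "")               // drop every remaining tag
                .trim();
    }

    public static void main(String[] args) {
        // Invented sample content for illustration.
        System.out.println(stripTags("<p>First line</p><p>Second <b>line</b></p>"));
    }
}
```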

Source code

For the latest source code, see: http://www.llwjy.com/source/com.lulei.crawl.novel.zongheng.ReadPage.html

 /**  
 *@Description:   Reading page
 */ 
package com.lulei.crawl.novel.zongheng;  

import java.io.IOException;
import java.util.HashMap;

import com.lulei.crawl.CrawlBase;
import com.lulei.util.DoRegex;
import com.lulei.util.ParseUtil;
  
public class ReadPage extends CrawlBase {
	// Regex for the chapter content
	private static final String CONTENT = "<div id=\"chapterContent\" class=\"content\" itemprop=\"acticleBody\">(.*?)</div>";
	// Regex for the title
	private static final String TITLE = "chapterName=\"(.*?)\"";
	// Regex for the word count
	private static final String WORDCOUNT = "itemprop=\"wordCount\">(\\d*)</span>";
	private String pageUrl;
	private static HashMap<String, String> params;
	/**
	 * Add request headers so the request looks like it comes from a browser
	 */
	static {
		params = new HashMap<String, String>();
		params.put("Referer", "http://book.zongheng.com");
		params.put("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36");
	}
	
	public ReadPage(String url) throws IOException {
		readPageByGet(url, "utf-8", params);
		this.pageUrl = url;
	}
	
	/**
	 * @return
	 * @Author:lulei  
	 * @Description: Chapter title
	 */
	private String getTitle() {
		return DoRegex.getFirstString(getPageSourceCode(), TITLE, 1);
	}
	
	/**
	 * @return
	 * @Author:lulei  
	 * @Description: Word count
	 */
	private int getWordCount() {
		String wordCount = DoRegex.getFirstString(getPageSourceCode(), WORDCOUNT, 1);
		return ParseUtil.parseStringToInt(wordCount, 0);
	}
	
	/**
	 * @return
	 * @Author:lulei  
	 * @Description: Chapter content
	 */
	private String getContent() {
		return DoRegex.getFirstString(getPageSourceCode(), CONTENT, 1);
	}

	public static void main(String[] args) throws IOException {
		// Crawl the example chapter and print each extracted field
		ReadPage readPage = new ReadPage("http://book.zongheng.com/chapter/362857/6001264.html");
		System.out.println(readPage.pageUrl);
		System.out.println(readPage.getTitle());
		System.out.println(readPage.getWordCount());
		System.out.println(readPage.getContent());
	}

}
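
The class above relies on CrawlBase.readPageByGet to send the Referer and User-Agent headers, and that method is not shown in this post. As a hedged sketch of how such headers can be attached using only the standard library, the example below prepares a GET request with java.net.HttpURLConnection. The class and method names are hypothetical, and this is not necessarily how CrawlBase is implemented.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

class HeaderRequestSketch {
    /** Hypothetical helper: builds a GET connection carrying the given headers, without connecting yet. */
    static HttpURLConnection prepareGet(String url, Map<String, String> headers) {
        try {
            // openConnection() only creates the connection object; no network traffic happens here.
            HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestMethod("GET");
            for (Map.Entry<String, String> header : headers.entrySet()) {
                conn.setRequestProperty(header.getKey(), header.getValue());
            }
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Referer", "http://book.zongheng.com");
        headers.put("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36");
        HttpURLConnection conn = prepareGet("http://book.zongheng.com/chapter/362857/6001264.html", headers);
        // Print the headers that would be sent; reading the response would call conn.getInputStream().
        System.out.println(conn.getRequestProperty("Referer"));
        System.out.println(conn.getRequestProperty("User-Agent"));
    }
}
```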

----------------------------------------------------------------------------------------------------
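PS: I recently noticed that other sites may repost this blog without linking back to the source. For more posts in the Lucene-based case development series, visit http://blog.csdn.net/xiaojimanman/article/category/2841877 or http://www.llwjy.com/blogtype/lucene.html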

Tags: lucene, java, Zongheng novels, web crawler, crawler

Original article: http://blog.csdn.net/xiaojimanman/article/details/44937073
