
A web crawler example using jsoup


I read about web crawlers online and found them quite practical, so I wrote an example of my own.

The jsoup jar can be downloaded from: http://jsoup.org/download

This example uses jsoup-1.8.3.jar, a tool for parsing HTML source code that makes it very convenient to locate a specific node quickly.
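
As a quick illustration (the HTML fragment, ids, and class names below are made up for this sketch, not taken from the crawled site), parsing a string and picking out a node looks like this:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupParseDemo {
    public static void main(String[] args) {
        // A made-up HTML fragment just to illustrate parsing and node lookup.
        String html = "<div id=\"topnews\"><div class=\"article\"><h1>Hello jsoup</h1></div></div>";
        Document doc = Jsoup.parse(html);
        // Find a node by id, then drill down to the headline inside it.
        Element news = doc.getElementById("topnews");
        Element h1 = news.getElementsByTag("h1").first();
        System.out.println(h1.text()); // prints: Hello jsoup
    }
}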

What the code does:

The last match between RNG and SKT at the 2016 MSI had just finished, but the page did not yet link to the match video. After looking at the page source I worked out the pattern of the article URLs, so I tried scanning through the candidate addresses to find it.

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.commons.lang3.StringUtils;
import org.apache.http.HttpEntity;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class LolGameTest {

    public static void main(String[] args) {
        String encoding = "utf-8";
        try {
            HttpPost post = new HttpPost();
            CloseableHttpClient httpclient = HttpClients.custom().build();
            // Scan the candidate article IDs; the target page should be somewhere in this range.
            for (int i = 6105436; i < 6106000; i++) {
                String url = "http://wangyou.pcgames.com.cn/610/" + i + ".html";
                // url = "http://wangyou.pcgames.com.cn/610/6102693.html";
                post.setURI(new URI(url));
                CloseableHttpResponse resp = httpclient.execute(post);
                if (resp != null) {
                    // Consume and close the response; the body itself is not used for
                    // parsing, since jsoup re-fetches the page below.
                    HttpEntity entity = resp.getEntity();
                    String str = EntityUtils.toString(entity, encoding);
                    resp.close();
                    // Fetch and parse the page with jsoup; skip URLs that fail (e.g. 404).
                    Document doc = null;
                    try {
                        doc = Jsoup.connect(url).get();
                    } catch (Exception e) {
                        // System.out.println("exceptions url:" + url);
                        continue;
                    }
                    if (doc == null) continue;
                    // Locate the news container, then the article blocks inside it.
                    Element newsElement = doc.getElementById("topnews");
                    if (newsElement == null) continue;
                    Elements eles = newsElement.getElementsByAttributeValue("class", "article");
                    for (Element ele : eles) {
                        Elements h1Eles = ele.getElementsByTag("h1");
                        if (h1Eles != null && h1Eles.size() > 0) {
                            String text = h1Eles.get(0).text();
                            // The match video page has both team names in its headline.
                            if (StringUtils.isNotEmpty(text) && text.contains("RNG") && text.contains("SKT")) {
                                /**
                                 * console result:
                                 * H1 text content: 2016MSI小组赛第5轮比赛视频SKT vs RNG
                                 * Url: http://wangyou.pcgames.com.cn/610/6105595.html
                                 */
                                System.out.println("H1 text content: " + text);
                                System.out.println("Url: " + url);
                                return;
                            }
                        }
                    }
                }
            }
        } catch (URISyntaxException e) {
            e.printStackTrace();
        } catch (ClientProtocolException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
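
Incidentally, since Jsoup.connect(url).get() downloads the page by itself, the Apache HttpClient request above is not strictly necessary. A trimmed-down sketch of the same scan using only jsoup could look like the following (the timeout and user agent are arbitrary values I picked, and the div.article h1 selector is a close but not identical substitute for the exact class-attribute lookup used above):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LolGameTestJsoupOnly {
    public static void main(String[] args) {
        for (int i = 6105436; i < 6106000; i++) {
            String url = "http://wangyou.pcgames.com.cn/610/" + i + ".html";
            Document doc;
            try {
                // jsoup fetches and parses the page in one step; skip URLs that fail.
                doc = Jsoup.connect(url).timeout(5000).userAgent("Mozilla/5.0").get();
            } catch (Exception e) {
                continue;
            }
            Element news = doc.getElementById("topnews");
            if (news == null) continue;
            // Approximates the original getElementsByAttributeValue("class", "article") lookup.
            for (Element h1 : news.select("div.article h1")) {
                String text = h1.text();
                if (text.contains("RNG") && text.contains("SKT")) {
                    System.out.println("H1 text content: " + text);
                    System.out.println("Url: " + url);
                    return;
                }
            }
        }
    }
}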


jsoup's select(queryString) is very handy and works much like jQuery selectors.

Example:

Element taskListElement = doc.select("[class=\"h_hotTaskBox mb10 clearfix\"]").first();
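
As a rough sketch (the markup and class names here are invented just to show the selector syntax), a few select queries in action:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupSelectDemo {
    public static void main(String[] args) {
        // Invented markup to demonstrate jQuery-like queries.
        String html = "<div class=\"h_hotTaskBox mb10 clearfix\"><ul>"
                + "<li><a href=\"/task/1\">Task one</a></li>"
                + "<li><a href=\"/task/2\">Task two</a></li></ul></div>";
        Document doc = Jsoup.parse(html);

        // Exact attribute match, echoing the example line above.
        Element taskListElement = doc.select("[class=\"h_hotTaskBox mb10 clearfix\"]").first();
        if (taskListElement != null) {
            System.out.println("Found the task box via an exact class-attribute match");
        }

        // The same container via a class selector, then all links inside it.
        Elements links = doc.select("div.h_hotTaskBox a[href]");
        for (Element link : links) {
            System.out.println(link.attr("href") + " -> " + link.text());
        }
    }
}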


When learning to program, we should put it to use in everyday needs; only then will we have the firm confidence to keep going further.


Original article: http://blog.csdn.net/nyhyn/article/details/51347781
