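I came across web crawlers while browsing online, thought they looked quite practical, and wrote a small example of my own.
jsoup jar download: http://jsoup.org/download
The example uses jsoup-1.8.3.jar. jsoup is a tool for parsing HTML source; it can quickly locate a specific node, which makes it very convenient.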
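To make "quickly locate a specific node" concrete, here is a minimal sketch that parses an HTML string and reads a heading from inside an element found by id. The class name `JsoupParseSketch`, the helper `headingInside`, and the sample markup are hypothetical and exist only for illustration; the markup is loosely modeled on the `topnews` id used later in this post, not copied from the real page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class JsoupParseSketch {
    // Returns the text of the first <h1> inside the element with the given id,
    // or null when either the element or the heading is missing.
    static String headingInside(String html, String id) {
        Document doc = Jsoup.parse(html);
        Element container = doc.getElementById(id);
        if (container == null) {
            return null;
        }
        // Elements.first() returns null when nothing matched.
        Element h1 = container.getElementsByTag("h1").first();
        return h1 == null ? null : h1.text();
    }

    public static void main(String[] args) {
        // Hypothetical markup, assumed for this sketch only.
        String html = "<div id=\"topnews\"><div class=\"article\">"
                + "<h1>SKT vs RNG</h1></div></div>";
        System.out.println(headingInside(html, "topnews")); // SKT vs RNG
        System.out.println(headingInside(html, "missing")); // null
    }
}
```

Checking for null after `getElementById` matters in a crawler, because not every page in a scanned range is guaranteed to contain the expected container.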
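What the code does:
The last match between RNG and SKT at the 2016 MSI had just finished, and the site did not yet link to the match video. After looking at the page source and working out the pattern in the article URLs, I tried to find the video page by scanning candidate URLs.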
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class LolGameTest {
    public static void main(String[] args) {
        // Article URLs follow a numeric pattern, so scan a range of candidate IDs.
        for (int i = 6105436; i < 6106000; i++) {
            String url = "http://wangyou.pcgames.com.cn/610/" + i + ".html";
            Document doc;
            try {
                // jsoup fetches and parses the page itself, so the separate
                // HttpClient POST request (whose body was never used) is dropped.
                doc = Jsoup.connect(url).get();
            } catch (IOException e) {
                // Page does not exist or could not be loaded: try the next ID.
                continue;
            }
            Element newsElement = doc.getElementById("topnews");
            if (newsElement == null) {
                continue; // not an article page; avoids a NullPointerException
            }
            Elements eles = newsElement.getElementsByAttributeValue("class", "article");
            for (Element ele : eles) {
                Elements h1Eles = ele.getElementsByTag("h1");
                if (h1Eles.isEmpty()) {
                    continue;
                }
                String text = h1Eles.get(0).text();
                if (text.contains("RNG") && text.contains("SKT")) {
                    /*
                     * console result:
                     * H1 text content: 2016MSI小组赛第5轮比赛视频SKT vs RNG
                     * Url: http://wangyou.pcgames.com.cn/610/6105595.html
                     */
                    System.out.println("H1 text content: " + text);
                    System.out.println("Url: " + url);
                    return;
                }
            }
        }
    }
}
jsoup's select(queryString) method is also very handy; its selector syntax is much like jQuery's.
Example:
Element taskListElement = doc.select("[class=\"h_hotTaskBox mb10 clearfix\"]").first();
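One detail worth seeing in action: an attribute selector such as `[class="..."]` compares the whole attribute string, while chained class selectors match each class individually, regardless of order. The sketch below compares the two on hypothetical markup; the class name `JsoupSelectSketch`, the helper `count`, and the sample HTML are assumptions made for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class JsoupSelectSketch {
    // Number of elements in the document matched by a CSS query.
    static int count(Document doc, String cssQuery) {
        return doc.select(cssQuery).size();
    }

    public static void main(String[] args) {
        // Hypothetical markup: the same three classes, written in two different orders.
        String html = "<div class=\"h_hotTaskBox mb10 clearfix\">A</div>"
                + "<div class=\"clearfix mb10 h_hotTaskBox\">B</div>";
        Document doc = Jsoup.parse(html);

        // Attribute selector: the class attribute must equal this exact string.
        System.out.println(count(doc, "[class=\"h_hotTaskBox mb10 clearfix\"]")); // 1
        // Chained class selectors: every class must be present, in any order.
        System.out.println(count(doc, ".h_hotTaskBox.mb10.clearfix")); // 2
    }
}
```

Along the same lines, the nested lookups in the crawler above could probably be condensed into a single descendant query such as `doc.select("#topnews .article h1")`. That is only roughly equivalent, since `.article` also matches elements that carry additional classes.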
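When learning to program, we should put what we learn to work on everyday needs; that is what builds the confidence to keep going further.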
Original article: http://blog.csdn.net/nyhyn/article/details/51347781