A Super-Simple Java Crawler

Date: 2014-07-09 20:49:07

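This is the simplest possible crawler: no proxy server, no cookies, no HTTP connection pool. It sends a single HTTP GET request, and its only goal is to fetch the HTML source of a page.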

A crawler that meets these requirements is about as basic as a crawler gets. It is also the foundation for building more complex ones.

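It uses the HttpClient 4 API. Don't ask me how to stay compatible with the HttpClient 3 code that is all over the web; the two are not very different, but we should pick the newer interface that actually works!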

There are still plenty of details worth paying attention to, such as character encoding (I usually just force UTF-8).
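To make the encoding point concrete, here is a minimal sketch using only the JDK. It assumes the page bytes are UTF-8; the class name `EncodingDemo` and its two helper methods are hypothetical names chosen for illustration, not part of HttpClient. It shows that decoding with the wrong charset garbles Chinese text, and that text wrongly decoded as ISO-8859-1 can be repaired by converting it back to bytes and decoding again.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names here are illustrative assumptions.
class EncodingDemo {

    // Decode raw bytes with an explicitly chosen charset (always UTF-8 here),
    // instead of relying on the platform default.
    static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Repair text that was decoded as ISO-8859-1 although the bytes were UTF-8.
    // ISO-8859-1 maps every byte to exactly one char, so the bytes can be recovered.
    static String repairLatin1Mojibake(String garbled) {
        byte[] original = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "\u722c\u866b" is the word for "crawler"; escapes keep this source file ASCII-only.
        String expected = "\u722c\u866b";
        byte[] raw = expected.getBytes(StandardCharsets.UTF_8);

        // Simulate the bug: UTF-8 bytes decoded with the wrong charset.
        String garbled = new String(raw, StandardCharsets.ISO_8859_1);

        System.out.println(decodeUtf8(raw).equals(expected));              // true
        System.out.println(garbled.equals(expected));                      // false
        System.out.println(repairLatin1Mojibake(garbled).equals(expected)); // true
    }
}
```

The repair only works because ISO-8859-1 is lossless for arbitrary bytes; if the bytes were decoded with some other wrong charset, the original data may already be lost, so decoding correctly in the first place is the safer habit.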

Here is the code:

package chris;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class Easy {

    // Read an input stream to the end and decode the bytes as UTF-8
    public static String inputStream2String(InputStream is) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = is.read(buffer)) != -1) {
            baos.write(buffer, 0, n);
        }
        // Decode explicitly as UTF-8 instead of relying on the platform default charset
        return baos.toString("UTF-8");
    }

    // Core function that fetches the web page
    public static void doGrab() throws Exception {
        // The HttpClient can be thought of as a simulated browser
        CloseableHttpClient httpclient = HttpClients.createDefault();
        try {
            // URL of the target page
            String targetUrl = "http://chriszz.sinaapp.com";
            // Request the page with GET. More complex cases could use POST instead
            HttpGet httpGet = new HttpGet(targetUrl);
            CloseableHttpResponse response1 = httpclient.execute(httpGet);

            try {
                // Use the status code to check whether the request went well. 200 means success
                int statusCode = response1.getStatusLine().getStatusCode();
                if (statusCode == 200) {
                    System.out.println("The page was fetched successfully!");
                } else {
                    System.out.println(response1.getStatusLine());
                }
                HttpEntity entity1 = response1.getEntity();
                // do something useful with the response body
                // and ensure it is fully consumed
                if (entity1 != null) {
                    InputStream input = entity1.getContent();

                    String rawHtml = inputStream2String(input);
                    System.out.println(rawHtml);

                    // Chinese text sometimes comes out garbled. This depends on the encoding configured for
                    // your Eclipse Java project, the encoding of the current Java file, and the encoding of the fetched page.
                    // For example, if the text was decoded as ISO-8859-1, you can convert it with String's getBytes():
                    // String html = new String(rawHtml.getBytes("ISO-8859-1"), "UTF-8"); // the converted result

                    EntityUtils.consume(entity1);
                }
            } finally {
                response1.close(); // remember to close the response
            }
        } finally {
            httpclient.close(); // the client has to be closed too!
        }
    }

    /*
     * The simplest Java crawler -- fetching the Baidu home page
     * Notes:
     * 0. What gets fetched is the Baidu home page, which corresponds to a single HTML page.
     *    (As for why we request http://www.baidu.com rather than http://www.baidu.com/xxx.html:
     *    that is configured on Baidu's side. Either way, we end up at the page that contains the HTML.)
     * 1. The HTTP GET method is all we need. (Later, more complex cases can use POST, set cookies, or even
     *    use an HTTP connection pool; fetching JSON data or images works in much the same way.)
     * 2. Written with the HttpClient packages (version 4). You need to download the corresponding JAR files
     *    and add them to the build path.
     * 3. The code mainly follows the tutorial PDF that ships with HttpClient (http://hc.apache.org/).
     */
    public static void main(String[] args) throws Exception {
        // For simplicity doGrab() is a static method, so it can be called directly as Easy.doGrab()
        Easy.doGrab();
    }

}
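The crawler above treats status 200 as a successful fetch. When only the status line text is at hand, matching it against one full literal is fragile, because the protocol version and the reason phrase can vary between servers. The following is a minimal JDK-only sketch of pulling out just the numeric code; `StatusLineDemo`, `statusCodeOf`, and `isSuccess` are hypothetical names for illustration and are not part of HttpClient.

```java
// Hypothetical helper class; not part of HttpClient or the original post.
class StatusLineDemo {

    // Extract the numeric status code from a status line such as "HTTP/1.1 200 OK".
    // Returns -1 if the line does not have the expected shape.
    static int statusCodeOf(String statusLine) {
        if (statusLine == null) {
            return -1;
        }
        // Expected shape: <version> <code> <reason phrase>
        String[] parts = statusLine.trim().split("\\s+", 3);
        if (parts.length < 2) {
            return -1;
        }
        try {
            return Integer.parseInt(parts[1]);
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    // Success here means exactly 200, matching the check used by the crawler.
    static boolean isSuccess(String statusLine) {
        return statusCodeOf(statusLine) == 200;
    }

    public static void main(String[] args) {
        System.out.println(isSuccess("HTTP/1.1 200 OK"));        // true
        System.out.println(isSuccess("HTTP/1.0 200 OK"));        // true
        System.out.println(isSuccess("HTTP/1.1 404 Not Found")); // false
    }
}
```

With HttpClient 4 itself there is no need to parse anything: `response.getStatusLine().getStatusCode()` returns the number directly.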


Original post: http://www.cnblogs.com/zjutzz/p/3830140.html
