I had heard before about using a web crawler to fetch pages, which sounded pretty interesting — if you could add preference-based searching on top, it would certainly satisfy some curiosity.
Later I saw a book that used HttpClient to crawl pages. It did include source code, but it never said which version of HttpClient it used, and different HttpClient versions have quite different classes. So I downloaded the latest HttpClient, and following the tutorial and documentation found online, tried to write a simple page-fetching example. In the end it worked, although I ran into quite a few problems along the way, and the example is a very simple one.
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.UnknownHostException;

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class Simplest {
    private void get() {
        CloseableHttpClient httpclient = HttpClients.createDefault();
        try {
            String hostName = "http://www.baidu.com";
            HttpGet httpget = new HttpGet(hostName);
            System.out.println(httpget.getURI());
            //HttpGet httpget = new HttpGet("http://www.lietu.com");
            CloseableHttpResponse response = httpclient.execute(httpget);
            System.out.println("Successful!");
            System.out.println(response.getProtocolVersion());             // protocol version
            System.out.println(response.getStatusLine().getStatusCode());  // status code
            System.out.println(response.getStatusLine().getReasonPhrase());
            System.out.println(response.getStatusLine().toString());
            // Get the entity and write the response body to a local file.
            HttpEntity entity = response.getEntity();
            if (entity != null) {
                InputStream input = entity.getContent();
                String filename = hostName.substring(hostName.lastIndexOf('/') + 1);
                System.out.println("The filename is: " + filename);
                OutputStream output = new FileOutputStream(filename);
                int tempByte;
                // read() returns -1 at end of stream; comparing with > 0
                // would stop early on any zero byte in the page.
                while ((tempByte = input.read()) != -1) {
                    output.write(tempByte);
                }
                input.close();
                output.close();
            }
            response.close();
        } catch (UnknownHostException e) {
            System.out.println("No such a host!");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        Simplest a = new Simplest();
        a.get();
        System.out.println("This is a test");
    }
}
The code is not long. I knew nothing about this when I started, but in the end I did fetch the page, which was quite fun.
Compiling the code requires two jar files, httpclient-4.5.2.jar and httpcore-4.4.4.jar; put them in the same directory as the source file Simplest.java.
To compile: javac -cp httpcore-4.4.4.jar:httpclient-4.5.2.jar Simplest.java
To run: java -cp .:httpclient-4.5.2.jar:httpcore-4.4.4.jar:commons-logging-1.2.jar Simplest
(Note that commons-logging-1.2.jar is needed at runtime, and the classpath entries are the jars plus the current directory — the class name Simplest itself does not belong in the classpath.)
Earlier I also spent a long time searching for how to put jars on the classpath from the command line — this is what a shaky foundation gets you.
The next step is to go from simple to complex, gradually extending the crawler's capabilities and features, including extracting information from the fetched pages.
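As a first taste of that page-information extraction step, a minimal sketch of pulling the title out of fetched HTML with a regular expression (the class name TitleExtract and the sample HTML string are my own illustration, not from the original example; for real pages an HTML parser would be more robust than a regex):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleExtract {
    // Extract the text inside the first <title>...</title> pair, or null if absent.
    static String extractTitle(String html) {
        Matcher m = Pattern.compile("<title>(.*?)</title>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL).matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        // A stand-in for the HTML the crawler above would have saved to a file.
        String html = "<html><head><title>Example Domain</title></head><body></body></html>";
        System.out.println(extractTitle(html)); // prints "Example Domain"
    }
}

The same idea extends to pulling out links (<a href="...">) or other tags, which is the natural next feature for the crawler.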
Original post: http://www.cnblogs.com/tuhooo/p/5435782.html