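This article is a partial translation of the Tutorial that ships with HttpClient 4.3.6. It covers only what is needed to fetch a web page and measure its size, along with the sizes of its second- and third-level pages.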
The Hyper-Text Transfer Protocol (HTTP) is perhaps the most significant protocol used on the Internet
today. Web services, network-enabled appliances and the growth of network computing continue to
expand the role of the HTTP protocol beyond user-driven web browsers, while increasing the number
of applications that require HTTP support.
Although the java.net package provides basic functionality for accessing resources via HTTP, it doesn't
provide the full flexibility or functionality needed by many applications. HttpClient seeks to fill this
void by providing an efficient, up-to-date, and feature-rich package implementing the client side of
the most recent HTTP standards and recommendations.
Designed for extension while providing robust support for the base HTTP protocol, HttpClient may
be of interest to anyone building HTTP-aware client applications such as web browsers, web service
clients, or systems that leverage or extend the HTTP protocol for distributed communication.
HttpClient is NOT a browser. It is a client side HTTP transport library. HttpClient's purpose is
to transmit and receive HTTP messages. HttpClient will not attempt to process content, execute
JavaScript embedded in HTML pages, guess the content type if it is not explicitly set, reformat
request or redirect location URIs, or perform other functions unrelated to HTTP transport.
The most essential function of HttpClient is to execute HTTP methods. Execution of an HTTP method
involves one or several HTTP request / HTTP response exchanges, usually handled internally by
HttpClient. The user is expected to provide a request object to execute, and HttpClient is expected to
transmit the request to the target server and return a corresponding response object, or throw an
exception if execution was unsuccessful.
Quite naturally, the main entry point of the HttpClient API is the HttpClient interface, which defines the
contract described above.
Here is an example of request execution process in its simplest form:
CloseableHttpClient httpclient = HttpClients.createDefault();
HttpGet httpget = new HttpGet("http://localhost/");
CloseableHttpResponse response = httpclient.execute(httpget);
try {
    <...>
} finally {
    response.close();
}
All HTTP requests have a request line consisting of a method name, a request URI and an HTTP protocol version.
HttpClient supports out of the box all HTTP methods defined in the HTTP/1.1 specification: GET,
HEAD, POST, PUT, DELETE, TRACE and OPTIONS. There is a specific class for each method type: HttpGet,
HttpHead, HttpPost, HttpPut, HttpDelete, HttpTrace, and HttpOptions.
The Request-URI is a Uniform Resource Identifier that identifies the resource upon which to apply
the request. HTTP request URIs consist of a protocol scheme, host name, optional port, resource path,
optional query, and optional fragment.
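HttpGet httpget = new HttpGet(
        "http://www.google.com/search?hl=en&q=httpclient&btnG=Google+Search&aq=f&oq=");
The URIBuilder utility class simplifies the creation and modification of request URIs:
URI uri = new URIBuilder()
        .setScheme("http")
        .setHost("www.google.com")
        .setPath("/search")
        .setParameter("q", "httpclient")
        .setParameter("btnG", "Google Search")
        .setParameter("aq", "f")
        .setParameter("oq", "")
        .build();
HttpGet httpget = new HttpGet(uri);
System.out.println(httpget.getURI());
stdout >
http://www.google.com/search?q=httpclient&btnG=Google+Search&aq=f&oq=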
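The URI components listed above (scheme, host, optional port, path, optional query, optional fragment) can also be inspected with the JDK's own java.net.URI, without any HttpClient dependency. The following is only a small sketch; the class UriParts and its describe method are hypothetical names introduced for this illustration.

```java
import java.net.URI;

// Hypothetical helper (not part of HttpClient): lists the components of a request URI.
class UriParts {
    static String describe(String text) {
        URI uri = URI.create(text);
        return "scheme=" + uri.getScheme()
                + ", host=" + uri.getHost()
                + ", port=" + uri.getPort()          // -1 when the URI names no port
                + ", path=" + uri.getPath()
                + ", query=" + uri.getQuery()        // null when there is no query
                + ", fragment=" + uri.getFragment(); // null when there is no fragment
    }

    public static void main(String[] args) {
        System.out.println(describe("http://www.google.com/search?hl=en&q=httpclient"));
        System.out.println(describe("http://localhost:8080/docs/index.html#top"));
    }
}
```

The optional parts show up as -1 (port) or null (query, fragment) when the URI does not contain them.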
HttpResponse response = new BasicHttpResponse(HttpVersion.HTTP_1_1, HttpStatus.SC_OK, "OK");
System.out.println(response.getProtocolVersion());
System.out.println(response.getStatusLine().getStatusCode());
System.out.println(response.getStatusLine().getReasonPhrase());
System.out.println(response.getStatusLine().toString());
stdout >
HTTP/1.1
200
OK
HTTP/1.1 200 OK
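The status line printed above is made up of three fields: the protocol version, the numeric status code, and the reason phrase. As a rough illustration of that layout, the sketch below splits a raw status line by hand using only the JDK. StatusLineParts is a hypothetical helper written for this article, not an HttpClient class, and it does far less validation than a real parser would.

```java
// Hypothetical helper (not part of HttpClient): splits a raw status line into
// protocol version, status code and reason phrase.
class StatusLineParts {
    static String[] split(String statusLine) {
        // A limit of 3 keeps reason phrases that contain spaces ("Not Found") intact.
        String[] parts = statusLine.trim().split(" ", 3);
        if (parts.length < 2) {
            throw new IllegalArgumentException("Not a status line: " + statusLine);
        }
        Integer.parseInt(parts[1]); // throws NumberFormatException if the code is not numeric
        String reason = parts.length == 3 ? parts[2] : "";
        return new String[] { parts[0], parts[1], reason };
    }

    public static void main(String[] args) {
        for (String field : split("HTTP/1.1 404 Not Found")) {
            System.out.println(field);
        }
    }
}
```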
HttpResponse response = new BasicHttpResponse(HttpVersion.HTTP_1_1, HttpStatus.SC_OK, "OK");
response.addHeader("Set-Cookie", "c1=a; path=/; domain=localhost");
response.addHeader("Set-Cookie", "c2=b; path=\"/\", c3=c; domain=\"localhost\"");
Header h1 = response.getFirstHeader("Set-Cookie");
System.out.println(h1);
Header h2 = response.getLastHeader("Set-Cookie");
System.out.println(h2);
Header[] hs = response.getHeaders("Set-Cookie");
System.out.println(hs.length);
stdout >
Set-Cookie: c1=a; path=/; domain=localhost
Set-Cookie: c2=b; path="/", c3=c; domain="localhost"
2
HttpResponse response = new BasicHttpResponse(HttpVersion.HTTP_1_1, HttpStatus.SC_OK, "OK");
response.addHeader("Set-Cookie", "c1=a; path=/; domain=localhost");
response.addHeader("Set-Cookie", "c2=b; path=\"/\", c3=c; domain=\"localhost\"");
HeaderIterator it = response.headerIterator("Set-Cookie");
while (it.hasNext()) {
    System.out.println(it.next());
}
stdout >
Set-Cookie: c1=a; path=/; domain=localhost
Set-Cookie: c2=b; path="/", c3=c; domain="localhost"
HttpResponse response = new BasicHttpResponse(HttpVersion.HTTP_1_1, HttpStatus.SC_OK, "OK");
response.addHeader("Set-Cookie", "c1=a; path=/; domain=localhost");
response.addHeader("Set-Cookie", "c2=b; path=\"/\", c3=c; domain=\"localhost\"");
HeaderElementIterator it = new BasicHeaderElementIterator(
        response.headerIterator("Set-Cookie"));
while (it.hasNext()) {
    HeaderElement elem = it.nextElement();
    System.out.println(elem.getName() + " = " + elem.getValue());
    NameValuePair[] params = elem.getParameters();
    for (int i = 0; i < params.length; i++) {
        System.out.println(" " + params[i]);
    }
}
stdout >
c1 = a
 path=/
 domain=localhost
c2 = b
 path=/
c3 = c
 domain=localhost
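To make the element and parameter structure in that output easier to follow, here is a deliberately naive, JDK-only sketch of the same decomposition: a header value is split on commas into elements, and each element is split on semicolons into a leading name/value pair followed by its parameters. NaiveHeaderElements is a hypothetical class written for this article. It is not how HttpClient parses headers, and it assumes that quoted values contain no commas or semicolons.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for header element parsing.
// Assumption: quoted values never contain ',' or ';'.
class NaiveHeaderElements {
    static List<String> describe(String headerValue) {
        List<String> lines = new ArrayList<>();
        for (String element : headerValue.split(",")) {
            String[] pieces = element.trim().split(";");
            lines.add(pair(pieces[0], " = "));          // element name and value
            for (int i = 1; i < pieces.length; i++) {
                lines.add(" " + pair(pieces[i], "="));  // element parameters
            }
        }
        return lines;
    }

    private static String pair(String raw, String separator) {
        String[] nameValue = raw.trim().split("=", 2);
        String value = nameValue.length > 1 ? unquote(nameValue[1].trim()) : "";
        return nameValue[0].trim() + separator + value;
    }

    private static String unquote(String s) {
        boolean quoted = s.length() >= 2 && s.startsWith("\"") && s.endsWith("\"");
        return quoted ? s.substring(1, s.length() - 1) : s;
    }

    public static void main(String[] args) {
        for (String line : describe("c2=b; path=\"/\", c3=c; domain=\"localhost\"")) {
            System.out.println(line);
        }
    }
}
```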
HTTP entity
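HTTP messages can carry a content entity associated with the request or response. Entities are optional, so they are found in some requests and some responses but not all. Requests that use entities are referred to as entity enclosing requests. The HTTP specification defines two entity enclosing request methods: POST and PUT. Responses are usually expected to enclose a content entity. There are exceptions to this rule, such as responses to the HEAD method and 204 No Content, 304 Not Modified, and 205 Reset Content responses.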
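HttpClient distinguishes three kinds of entities, depending on where their content originates:
streamed: the content is received from a stream or generated on the fly. This category includes entities received from HTTP responses. Streamed entities are generally not repeatable.
self-contained: the content is in memory or obtained by means that are independent of a connection or another entity. Self-contained entities are generally repeatable.
wrapping: the content is obtained from another entity.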
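This distinction is important for connection management when content is streamed out of an HTTP response. For request entities that are created by an application and only sent using HttpClient, the difference between streamed and self-contained is of little importance. In that case, consider non-repeatable entities as streamed and repeatable ones as self-contained.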
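The practical difference between streamed and self-contained content can be illustrated with plain JDK streams. The sketch below uses no HttpClient types: a single InputStream stands in for streamed content, and a byte array stands in for self-contained content. The class name StreamedVsSelfContained and the drain helper are hypothetical names chosen for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration: a stream can be consumed once, in-memory bytes can be re-read.
class StreamedVsSelfContained {
    /** Reads the stream to its end and returns the number of bytes consumed. */
    static int drain(InputStream in) {
        try {
            byte[] buffer = new byte[1024];
            int total = 0;
            int n;
            while ((n = in.read(buffer)) != -1) {
                total += n;
            }
            return total;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] content = "hello".getBytes(StandardCharsets.US_ASCII);

        // "Streamed": one stream, like bytes arriving over a connection.
        InputStream stream = new ByteArrayInputStream(content);
        System.out.println(drain(stream)); // first pass consumes all 5 bytes
        System.out.println(drain(stream)); // second pass finds nothing left: 0

        // "Self-contained": the bytes are in memory, so a fresh stream can be
        // opened for every pass.
        System.out.println(drain(new ByteArrayInputStream(content))); // 5
        System.out.println(drain(new ByteArrayInputStream(content))); // 5
    }
}
```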
Repeatable entities
Using HTTP entities
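When receiving an incoming entity, the methods HttpEntity#getContentType() and HttpEntity#getContentLength() can be used to read common metadata such as the Content-Type and Content-Length headers (if they are available). The Content-Type header can also carry a character encoding for text MIME types such as text/plain or text/html. Note that HttpEntity#getContentEncoding() returns the Content-Encoding header (for example gzip), not the character set, which is part of the Content-Type value. If the headers are not available, a length of -1 is returned, and null for the content type. If the Content-Type header is available, a Header object is returned.
When creating an entity for an outgoing message, this metadata has to be supplied by the creator of the entity.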
StringEntity myEntity = new StringEntity("important message",
        ContentType.create("text/plain", "UTF-8"));
System.out.println(myEntity.getContentType());
System.out.println(myEntity.getContentLength());
System.out.println(EntityUtils.toString(myEntity));
System.out.println(EntityUtils.toByteArray(myEntity).length);
stdout >
Content-Type: text/plain; charset=utf-8
17
important message
17
Ensuring release of low level resources
To ensure proper release of system resources, one must close either the content stream associated with the entity or the response itself.
CloseableHttpClient httpclient = HttpClients.createDefault();
HttpGet httpget = new HttpGet("http://localhost/");
CloseableHttpResponse response = httpclient.execute(httpget);
try {
    HttpEntity entity = response.getEntity();
    if (entity != null) {
        InputStream instream = entity.getContent();
        try {
            // do something useful
        } finally {
            instream.close();
        }
    }
} finally {
    response.close();
}
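The difference between closing the content stream and closing the response is that the former attempts to keep the underlying connection alive by consuming the entity content, while the latter immediately shuts down and discards the connection.
There can be situations, however, when only a small portion of the response content needs to be retrieved, and the cost of consuming the remaining content to make the connection reusable is too high. In that case the content stream can be terminated by closing the response.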
CloseableHttpClient httpclient = HttpClients.createDefault();
HttpGet httpget = new HttpGet("http://localhost/");
CloseableHttpResponse response = httpclient.execute(httpget);
try {
    HttpEntity entity = response.getEntity();
    if (entity != null) {
        InputStream instream = entity.getContent();
        int byteOne = instream.read();
        int byteTwo = instream.read();
        // Do not need the rest
    }
} finally {
    response.close();
}
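On Java 7 and later, the nested try/finally pattern used in this section can also be written with try-with-resources, since both CloseableHttpResponse and InputStream implement java.io.Closeable. The JDK-only sketch below shows the relevant behaviour: resources are closed in the reverse order of their declaration, so the stream is closed before the response. The two Closeable lambdas are hypothetical stand-ins for the real response and content stream, and CloseOrder is a name invented for this example.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of try-with-resources close ordering.
class CloseOrder {
    static List<String> run() {
        List<String> log = new ArrayList<>();
        // Declared first, closed last: stands in for CloseableHttpResponse.
        // Declared second, closed first: stands in for the entity's InputStream.
        try (Closeable response = () -> log.add("response closed");
             Closeable stream = () -> log.add("stream closed")) {
            log.add("reading content");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String entry : run()) {
            System.out.println(entry);
        }
    }
}
```

The log shows the content being read first, then the stream closing, then the response closing, which matches the order of the finally blocks in the hand-written version.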
Consuming entity content
CloseableHttpClient httpclient = HttpClients.createDefault();
HttpGet httpget = new HttpGet("http://localhost/");
CloseableHttpResponse response = httpclient.execute(httpget);
try {
    HttpEntity entity = response.getEntity();
    if (entity != null) {
        long len = entity.getContentLength();
        if (len != -1 && len < 2048) {
            System.out.println(EntityUtils.toString(entity));
        } else {
            // Stream content out
        }
    }
} finally {
    response.close();
}
To read entity content more than once, wrap the original entity in a BufferedHttpEntity, which buffers the content in memory:
CloseableHttpResponse response = <...>
HttpEntity entity = response.getEntity();
if (entity != null) {
    entity = new BufferedHttpEntity(entity);
}
Producing entity content
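HttpClient provides several classes that can be used to efficiently stream out content through HTTP connections. Instances of these classes can be associated with entity enclosing requests such as POST and PUT in order to enclose entity content in outgoing HTTP requests. HttpClient provides classes for the most common data containers, namely string, byte array, input stream, and file: StringEntity, ByteArrayEntity, InputStreamEntity, and FileEntity.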
File file = new File("somefile.txt");
FileEntity entity = new FileEntity(file,
        ContentType.create("text/plain", "UTF-8"));
HttpPost httppost = new HttpPost("http://localhost/action.do");
httppost.setEntity(entity);
Response handlers