
RFC 2616 HTTP Protocol Analysis

Date: 2015-12-20 17:32:28


catalog

1. Introduction
2. Protocol Parameters
3. HTTP Message
4. Request
5. Response

 

1. Introduction

The Hypertext Transfer Protocol (HTTP) is an application-level protocol for distributed, collaborative, hypermedia information systems. HTTP has been in use by the World-Wide Web global information initiative since 1990.

1. The first version of HTTP, referred to as HTTP/0.9, was a simple protocol for raw data transfer across the Internet. 
2. HTTP/1.0, as defined by RFC 1945 [6], improved the protocol by allowing messages to be in the format of MIME-like messages, containing metainformation about the data transferred and modifiers on the request/response semantics. 
However, HTTP/1.0 does not sufficiently take into consideration
    1) the effects of hierarchical proxies
    2) caching
    3) the need for persistent connections
    4) virtual hosts. 
In addition, the proliferation of incompletely-implemented applications calling themselves "HTTP/1.0" has necessitated a protocol version change in order for two communicating applications to determine each other's true capabilities.

3. HTTP/1.1, as defined by RFC 2616.
This protocol includes more stringent requirements than HTTP/1.0 in order to ensure reliable implementation of its features.
Practical information systems require more functionality than simple retrieval, including
    1) search
    2) front-end update
    3) annotation.
HTTP allows an open-ended set of methods and headers that indicate the purpose of a request. It builds on the discipline of reference provided by the Uniform Resource Identifier (URI), as a location (URL) or name (URN), for indicating the resource to which a method is to be applied. Messages are passed in a format similar to that used by Internet mail as defined by the Multipurpose Internet Mail Extensions (MIME).
HTTP is also used as a generic protocol for communication between user agents and proxies/gateways to other Internet systems, including those supported by the SMTP, NNTP, FTP, Gopher, and WAIS protocols. In this way, HTTP allows basic hypermedia access to resources available from diverse applications.

0x1: Overall Operation

The HTTP protocol is a request/response protocol.

//Client-Server
1. A client sends a request to the server in the form of 
    1) a request method
    2) URI
    3) protocol version
2. followed by a MIME-like message containing 
    1) request modifiers
    2) client information
    3) and possible body content 
over a connection with a server. 

//Server-Client
The server responds with
    1) a status line, including the message's protocol version and a success or error code
    2) followed by a MIME-like message containing
        2.1) server information
        2.2) entity metainformation
        2.3) and possible entity-body content.

Most HTTP communication is initiated by a user agent and consists of a request to be applied to a resource on some origin server. In the simplest case, this may be accomplished via a single connection between the user agent (UA) and the origin server (O).
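
The request/response exchange described above can be sketched without any network access. This is a minimal illustration, not a full client: the helper names build_request and split_response, the host www.example.com, and the canned response bytes are all assumptions made for the example.

```python
CRLF = "\r\n"


def build_request(method, uri, headers, body=b"", version="HTTP/1.1"):
    """Serialize a request: request line, header fields, empty line, optional body."""
    lines = [f"{method} {uri} {version}"]
    lines += [f"{name}: {value}" for name, value in headers.items()]
    head = CRLF.join(lines) + CRLF + CRLF
    return head.encode("iso-8859-1") + body


def split_response(raw):
    """Split raw response bytes into (status line, header lines, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split(CRLF)
    return status_line, header_lines, body


# Hypothetical request, and a hand-written response standing in for a server.
request = build_request("GET", "/index.html", {"Host": "www.example.com"})
response = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\nContent-Length: 2\r\n\r\nhi"
status_line, header_lines, body = split_response(response)
```

The sketch ignores Content-Length and transfer-codings when locating the end of the body; a real client has to honor both.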

Relevant Link:

https://www.ietf.org/rfc/rfc2616.txt

 

2. Protocol Parameters

0x1: HTTP Version

HTTP uses a "<major>.<minor>" numbering scheme to indicate versions of the protocol. The protocol versioning policy is intended to allow the sender to indicate the format of a message and its capacity for understanding further HTTP communication, rather than the features obtained via that communication.  

Proxy and gateway applications need to be careful when forwarding messages in protocol versions different from that of the application.
Since the protocol version indicates the protocol capability of the sender, a proxy/gateway MUST NOT send a message with a version indicator which is greater than its actual version. If a higher version request is received, the proxy/gateway MUST either downgrade the request version, or respond with an error, or switch to tunnel behavior. 
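
A small sketch of how the version rule might be applied. RFC 2616 treats the major and minor numbers as separate integers, so HTTP/2.13 is higher than HTTP/2.4. The function names are hypothetical, and forwarded_version models only the "downgrade" option; responding with an error or tunneling are the other permitted behaviors.

```python
import re

_VERSION = re.compile(r"HTTP/(\d+)\.(\d+)")


def parse_http_version(text):
    """Return (major, minor) as integers, e.g. 'HTTP/1.1' -> (1, 1)."""
    match = _VERSION.fullmatch(text)
    if match is None:
        raise ValueError(f"not an HTTP-Version: {text!r}")
    return int(match.group(1)), int(match.group(2))


def forwarded_version(received, own):
    """Version a proxy could put on a forwarded request: never above its own."""
    return min(parse_http_version(received), parse_http_version(own))
```

Comparing the tuples rather than the raw strings avoids treating "2.13" as lower than "2.4".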

0x2: Uniform Resource Identifiers

URIs have been known by many names:

1. WWW addresses
2. Universal Document Identifiers
3. Universal Resource Identifiers
4. Uniform Resource Locators (URL)
5. Uniform Resource Names (URN)

As far as HTTP is concerned, Uniform Resource Identifiers are simply formatted strings which identify--via name, location, or any other characteristic--a resource.
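In other words, the HTTP URL format is a loose specification; how a URL is interpreted depends largely on the implementation logic of the back-end web container.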

1. http URL
The "http" scheme is used to locate network resources via the HTTP protocol.

http_URL = "http:" "//" host [ ":" port ] [ abs_path [ "?" query ]]

2. URI Comparison

When comparing two URIs to decide if they match or not, a client SHOULD use a case-sensitive octet-by-octet comparison of the entire URIs, with these exceptions:

1. A port that is empty or not given is equivalent to the default port (80) for that URI-reference;
2. Comparisons of host names MUST be case-insensitive;
3. Comparisons of scheme names MUST be case-insensitive;
4. An empty abs_path is equivalent to an abs_path of "/".
5. Characters other than those in the "reserved" and "unsafe" sets are equivalent to their ""%" HEX HEX" encoding. For example, the following three URIs are equivalent:
    1) http://abc.com:80/~smith/home.html
    2) http://ABC.com/%7Esmith/home.html
    3) http://ABC.com:/%7esmith/home.html
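
The comparison rules above can be sketched as a normalization step followed by a plain equality test. This is an illustration under stated assumptions: the function names are hypothetical, the set of characters that may be safely unescaped is a conservative guess (letters, digits, and "-._~"), and userinfo and IPv6 literals in the authority are not handled.

```python
import re
import string
from urllib.parse import urlsplit

# Assumption: a conservative set of characters outside "reserved" and "unsafe".
_UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def _normalize_escapes(text):
    """Decode %XX for unreserved characters; uppercase the hex of the rest."""
    def replace(match):
        char = chr(int(match.group(1), 16))
        return char if char in _UNRESERVED else "%" + match.group(1).upper()
    return _ESCAPE.sub(replace, text)


def normalize_http_url(url):
    """Reduce an http URL to a tuple that can be compared octet by octet."""
    parts = urlsplit(url)
    host, _, port = parts.netloc.partition(":")
    return (
        parts.scheme.lower(),                    # scheme is case-insensitive
        host.lower(),                            # host is case-insensitive
        int(port) if port else 80,               # empty or missing port means 80
        _normalize_escapes(parts.path) or "/",   # empty abs_path means "/"
        _normalize_escapes(parts.query),
    )


def http_urls_equal(a, b):
    return normalize_http_url(a) == normalize_http_url(b)
```

With this sketch the three example URIs above compare as equal, while paths that differ only in letter case do not.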

0x3: Date/Time Formats

HTTP applications have historically allowed three different formats for the representation of date/time stamps:

1. Sun, 06 Nov 1994 08:49:37 GMT  ; RFC 822, updated by RFC 1123
2. Sunday, 06-Nov-94 08:49:37 GMT ; RFC 850, obsoleted by RFC 1036
3. Sun Nov  6 08:49:37 1994       ; ANSI C's asctime() format
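
A tolerant parser for the three formats might look like the sketch below. The function name parse_http_date is hypothetical, and the code assumes the default C locale so that the English day and month names match. All HTTP dates are in GMT, so the result is tagged as UTC.

```python
from datetime import datetime, timezone

_FORMATS = (
    "%a, %d %b %Y %H:%M:%S GMT",   # RFC 822, updated by RFC 1123
    "%A, %d-%b-%y %H:%M:%S GMT",   # RFC 850, obsoleted by RFC 1036
    "%a %b %d %H:%M:%S %Y",        # ANSI C asctime()
)


def parse_http_date(text):
    """Try each historical format in turn and return an aware UTC datetime."""
    for fmt in _FORMATS:
        try:
            return datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    raise ValueError(f"unrecognized HTTP date: {text!r}")
```

The two-digit year in the RFC 850 form is resolved by strptime's own pivot rule, which is one reason that format is best avoided when generating dates.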

0x4: Character Sets

HTTP uses the same definition of the term "character set" as that described for MIME: The term "character set" is used in this document to refer to a method used with one or more tables to convert a sequence of octets into a sequence of characters.

Note: This use of the term "character set" is more commonly referred to as a "character encoding." However, since HTTP and MIME share the same registry, it is important that the terminology also be shared.
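In other words, within the MIME framework an HTTP request can undergo arbitrary "generalized encoding/format transformations".
//One major cause of WAF bypasses is that HTTP requires the sender and receiver to understand the various MIME transformation encodings. An attacker can craft specially encoded HTTP requests that the web container still understands; if the WAF cannot parse them, or parses them incorrectly, a bypass results.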

HTTP character sets are identified by case-insensitive tokens. The complete set of tokens is defined by the IANA Character Set registry.

Although HTTP allows an arbitrary token to be used as a charset value, any token that has a predefined value within the IANA Character Set registry MUST represent the character set defined by that registry.
Applications SHOULD limit their use of character sets to those defined by the IANA registry. Implementors should be aware of IETF character set requirements.

1. Missing Charset

Some HTTP/1.0 software has interpreted a Content-Type header without charset parameter incorrectly to mean "recipient should guess." Senders wishing to defeat this behavior MAY include a charset parameter even when the charset is ISO-8859-1 and SHOULD do so when it is known that it will not confuse the recipient.
Unfortunately, some older HTTP/1.0 clients did not deal properly with an explicit charset parameter. HTTP/1.1 recipients MUST respect the charset label provided by the sender; and those user agents that have a provision to "guess" a charset MUST use the charset from the content-type field if they support that charset, rather than the recipient's preference, when initially displaying a document.

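Another major cause of WAF bypasses is the need to remain compatible with older HTTP versions (0.9/1.0). Attackers can craft "specially encoded" "HTTP packets" that the web container has to tolerate; if the WAF cannot parse them, or parses them incorrectly, a bypass results.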
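
The charset handling described in this subsection can be sketched with the standard library's MIME parser. The function name charset_of is hypothetical; the fallback to ISO-8859-1 reflects the RFC 2616 default for text media received without a charset parameter, instead of guessing.

```python
from email.message import Message

# RFC 2616 default for "text" media types when no charset parameter is present.
DEFAULT_CHARSET = "iso-8859-1"


def charset_of(content_type):
    """Return the lowercase charset label of a Content-Type field value."""
    message = Message()
    message["Content-Type"] = content_type
    return message.get_content_charset() or DEFAULT_CHARSET
```

Because charset tokens are case-insensitive, returning the lowercase form makes later comparisons straightforward.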

Content-Type reference table

1. .* (binary stream, file type unknown): application/octet-stream
2. .htm: text/html
3. .html: text/html
4. .gif: image/gif
..

Relevant Link:

http://www.iana.org/assignments/character-sets/character-sets.xhtml
http://tool.oschina.net/commons

0x5: Content Codings (client: Accept-Encoding, server: Content-Encoding)

Content coding values indicate an encoding transformation that has been or can be applied to an entity. Content codings are primarily used to allow a document to be compressed or otherwise usefully transformed without losing the identity of its underlying media type and without loss of information.
In practice, the sender and receiver agree on a lossless coding (typically compression, not encryption) that is used to transfer the data returned by the server.

1. gzip: An encoding format produced by the file compression program "gzip" (GNU zip) as described in RFC 1952. This format is a Lempel-Ziv coding (LZ77) with a 32 bit CRC.
2. compress: The encoding format produced by the common UNIX file compression program "compress". This format is an adaptive Lempel-Ziv-Welch coding (LZW).
3. deflate: The "zlib" format defined in RFC 1950 in combination with the "deflate" compression mechanism described in RFC 1951.
4. identity: The default (identity) encoding; the use of no transformation whatsoever. This content-coding is used only in the Accept-Encoding header, and SHOULD NOT be used in the Content-Encoding header.
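
As an illustration of these codings, the sketch below applies and reverses gzip, deflate, and identity with Python's standard library. The function names are hypothetical. The "compress" (LZW) coding is left out because the standard library has no codec for it.

```python
import gzip
import zlib


def encode_body(body, coding):
    """Apply a content-coding to an entity-body (bytes in, bytes out)."""
    if coding == "gzip":
        return gzip.compress(body)      # RFC 1952 container
    if coding == "deflate":
        return zlib.compress(body)      # RFC 1950 zlib wrapper around RFC 1951
    if coding == "identity":
        return body
    raise ValueError(f"unsupported content-coding: {coding!r}")


def decode_body(data, coding):
    """Reverse encode_body for the same coding name."""
    if coding == "gzip":
        return gzip.decompress(data)
    if coding == "deflate":
        return zlib.decompress(data)
    if coding == "identity":
        return data
    raise ValueError(f"unsupported content-coding: {coding!r}")
```

Each coding is lossless, so decoding what was encoded returns the original octets and the media type of the entity is unchanged.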

0x6: Transfer Codings

0x7: Media Types

HTTP uses Internet Media Types in the Content-Type and Accept header fields in order to provide open and extensible data typing and type negotiation.

media-type     = type "/" subtype *( ";" parameter )
type           = token
subtype        = token

1. Canonicalization and Text Defaults

When in canonical form, media subtypes of the "text" type use CRLF as the text line break. HTTP relaxes this requirement and allows the transport of text media with plain CR or LF alone representing a line break when it is done consistently for an entire entity-body.
HTTP applications MUST accept CRLF, bare CR, and bare LF as being representative of a line break in text media received via HTTP. In addition, if the text is represented in a character set that does not use octets 13 and 10 for CR and LF respectively, as is the case for some multi-byte character sets, HTTP allows the use of whatever octet sequences are defined by that character set to represent the equivalent of CR and LF for line breaks.
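
A receiver that accepts all three line-break forms in text media could split lines as in this sketch. The function name is hypothetical, and the input is assumed to be already decoded text in a character set where CR and LF are octets 13 and 10.

```python
import re

# CRLF is listed first so that it is consumed as one break, not two.
_LINE_BREAK = re.compile(r"\r\n|\r|\n")


def split_text_lines(text):
    """Split text media on CRLF, bare CR, or bare LF."""
    return _LINE_BREAK.split(text)
```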

2. Multipart Types

MIME provides for a number of "multipart" types -- encapsulations of one or more entities within a single message-body. All multipart types share a common syntax, as defined in RFC 2046, and MUST include a boundary parameter as part of the media type value.
The message body is itself a protocol element and MUST therefore use only CRLF to represent line breaks between body-parts.
Unlike in RFC 2046, the epilogue of any multipart message MUST be empty; HTTP applications MUST NOT transmit the epilogue (even if the original multipart contains an epilogue). These restrictions exist in order to preserve the self-delimiting nature of a multipart message-body, wherein the "end" of the message-body is indicated by the ending multipart boundary.

When handling multipart/form-data, the web container can determine the end of the message only by detecting the closing multipart boundary in the HTTP packet.

In general, HTTP treats a multipart message-body no differently than any other media type: strictly as payload. The one exception is the "multipart/byteranges" type when it appears in a 206 (Partial Content) response, which will be interpreted by some HTTP caching mechanisms. In all other cases, an HTTP user agent SHOULD follow the same or similar behavior as a MIME user agent would upon receipt of a multipart type.
The MIME header fields within each body-part of a multipart message-body do not have any significance to HTTP beyond that defined by their MIME semantics.
If an application receives an unrecognized multipart subtype, the application MUST treat it as being equivalent to "multipart/mixed".

Note: The "multipart/form-data" type has been specifically defined for carrying form data suitable for processing via the POST request method, as described in RFC 1867.
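
The self-delimiting layout of a multipart body can be sketched as follows. This is a simplified illustration, not a production parser: the function names and the boundary value are hypothetical, each part carries only a Content-Disposition header, and the code assumes the boundary string never occurs inside a field value.

```python
import re

CRLF = b"\r\n"
_NAME = re.compile(rb'name="([^"]*)"')


def build_form_data(fields, boundary):
    """Serialize {name: bytes} as a multipart/form-data body using CRLF only."""
    lines = []
    for name, value in fields.items():
        lines += [
            b"--" + boundary,
            b'Content-Disposition: form-data; name="' + name.encode("ascii") + b'"',
            b"",
            value,
        ]
    lines.append(b"--" + boundary + b"--")  # closing boundary marks the end
    return CRLF.join(lines) + CRLF


def parse_form_data(body, boundary):
    """Recover {name: bytes}; fail if the closing boundary is missing."""
    delimiter = b"--" + boundary
    end = body.find(delimiter + b"--")
    if end == -1:
        raise ValueError("closing multipart boundary not found")
    fields = {}
    for chunk in body[:end].split(delimiter)[1:]:
        part = chunk[len(CRLF):-len(CRLF)]          # drop the framing CRLFs
        headers, _, value = part.partition(CRLF + CRLF)
        match = _NAME.search(headers)
        if match is None:
            raise ValueError("body-part has no name parameter")
        fields[match.group(1).decode("ascii")] = value
    return fields
```

The parser refuses a body whose closing boundary is missing, which mirrors the point above that the ending boundary is the only end-of-message marker.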

0x8: Product Tokens (client: User-Agent, server: Server)

Product tokens are used to allow communicating applications to identify themselves by software name and version.

User-Agent: CERN-LineMode/2.15 libwww/2.17b3
Server: Apache/0.8.4

0x9: Quality Values

0x10: Language Tags (client: Accept-Language, server: Content-Language)

0x11: Entity Tags

Entity tags are used for comparing two or more entities from the same requested resource. HTTP/1.1 uses entity tags in the following header fields:

1. ETag
2. If-Match
3. If-None-Match
4. If-Range

The definition of how they are used and compared as cache validators is in Section 13.3.3 of RFC 2616.

0x12: Range Units

HTTP/1.1 allows a client to request that only part (a range of) the response entity be included within the response. HTTP/1.1 uses range units in the Range and Content-Range header fields. An entity can be broken down into subranges according to various structural units.

range-unit       = bytes-unit | other-range-unit
bytes-unit       = "bytes"
other-range-unit = token

The only range unit defined by HTTP/1.1 is "bytes". HTTP/1.1 implementations MAY ignore ranges specified using other units.

Relevant Link:

https://www.ietf.org/rfc/rfc2616.txt

 

3. HTTP Message

0x1: Message Types

HTTP messages consist of requests from client to server and responses from server to client.

HTTP-message   = Request | Response     ; HTTP/1.1 messages

Request and Response messages use the generic message format of RFC 822 for transferring entities (the payload of the message). Both types of message consist of

1. a start-line: Request-Line | Status-Line
2. zero or more header fields (also known as "headers")
3. an empty line (i.e., a line with nothing preceding the CRLF) indicating the end of the header fields
4. possibly a message-body.

In the interest of robustness, servers SHOULD ignore any empty line(s) received where a Request-Line is expected. In other words, if the server is reading the protocol stream at the beginning of a message and receives a CRLF first, it should ignore the CRLF.
The web server ignores empty lines at the start of a message until it reads the HTTP Request-Line.

0x2: Message Headers

HTTP header fields fall into four groups:

1. general-header
2. request-header
3. response-header
4. entity-header 

Each header field consists of a name followed by a colon (":") and the field value.

1. Field names are case-insensitive. 
2. The field value MAY be preceded by any amount of LWS, though a single SP is preferred.

Header fields can be extended over multiple lines by preceding each extra line with at least one SP or HT. Applications ought to follow "common form", where one is known or indicated, when generating HTTP constructs, since there might exist some implementations that fail to accept anything beyond the common forms.
It is precisely because HTTP header fields (key: value) allow the value to span multiple lines that the PHP multipart/form-data remote DoS vulnerability was possible.

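
The header rules in this subsection (case-insensitive names, optional leading whitespace, continuation lines) can be sketched as a small parser. The function name is hypothetical, and the sketch keeps only the last value when a field name repeats, which a real implementation would not do.

```python
def parse_headers(lines):
    """Parse header lines (without CRLF) into {lowercase name: value}."""
    unfolded = []
    for line in lines:
        # A line starting with SP or HT continues the previous field value.
        if line[:1] in (" ", "\t") and unfolded:
            unfolded[-1] += " " + line.strip()
        else:
            unfolded.append(line)

    headers = {}
    for line in unfolded:
        name, separator, value = line.partition(":")
        if not separator:
            raise ValueError(f"header line without a colon: {line!r}")
        headers[name.lower()] = value.strip()   # names are case-insensitive
    return headers
```

Each continuation line here costs one string concatenation onto the growing value, so a message with a very large number of folded lines becomes expensive to process; that is the general shape of the denial-of-service problem mentioned above.
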
0x3: Message Body

The message-body (if any) of an HTTP message is used to carry the entity-body associated with the request or response. The message-body differs from the entity-body only when a transfer-coding has been applied, as indicated by the Transfer-Encoding header field.

message-body = entity-body | <entity-body encoded as per Transfer-Encoding>

Transfer-Encoding MUST be used to indicate any transfer-codings applied by an application to ensure safe and proper transfer of the message. Transfer-Encoding is a property of the message, not of the entity, and thus MAY be added or removed by any application along the request/response chain.

0x4: Message Length

The transfer-length of a message is the length of the message-body as it appears in the message; that is, after any transfer-codings have been applied.

0x5: General Header Fields

There are a few header fields which have general applicability for both request and response messages, but which do not apply to the entity being transferred.

general-header = Cache-Control           
          | Connection             
          | Date                     
          | Pragma                   
          | Trailer                   
          | Transfer-Encoding         
          | Upgrade                   
          | Via                      
          | Warning    

General-header field names can be extended reliably only in combination with a change in the protocol version. However, new or experimental header fields may be given the semantics of general header fields if all parties in the communication recognize them to be general-header fields. Unrecognized header fields are treated as entity-header fields.  

Relevant Link:

http://www.cnblogs.com/LittleHann/p/5044140.html

 

4. Request

0x1: Request-Line  

The Request-Line begins with a method token, followed by the Request-URI and the protocol version, and ending with CRLF. The elements are separated by SP characters. No CR or LF is allowed except in the final CRLF sequence.   

Request-Line   = Method SP Request-URI SP HTTP-Version CRLF

1. Method

The Method token indicates the method to be performed on the resource identified by the Request-URI. The method is case-sensitive.

Method = "OPTIONS"               
      | "GET"                    
      | "HEAD"                   
      | "POST"                   
      | "PUT"                    
      | "DELETE"                  
      | "TRACE"                 
      | "CONNECT"                 
      | extension-method  

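When handling non-standard methods such as extension-method, web containers tend to be more tolerant, which in turn opens the door to bypasses.
Some Apache versions extract the GET parameters of a request regardless of the value of the method token. If a WAF extracts data strictly according to methods such as GET and POST, Apache's lenient request handling leads to a bypass.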


2. Request-URI

The Request-URI is a Uniform Resource Identifier and identifies the resource upon which to apply the request.

Request-URI    = "*" | absoluteURI | abs_path | authority

The four options for Request-URI are dependent on the nature of the request.

1. The asterisk "*" means that the request does not apply to a particular resource, but to the server itself, and is only allowed when the method used does not necessarily apply to a resource.
//OPTIONS * HTTP/1.1

2. The absoluteURI form is REQUIRED when the request is being made to a proxy. The proxy is requested to forward the request or service it from a valid cache, and return the response. Note that the proxy MAY forward the request on to another proxy or directly to the server specified by the absoluteURI.
//GET http://www.w3.org/pub/WWW/TheProject.html HTTP/1.1

3. The authority form is only used by the CONNECT method.

4. The most common form of Request-URI is that used to identify a resource on an origin server or gateway. In this case the absolute path of the URI MUST be transmitted as the Request-URI, and the network location of the URI (authority) MUST be transmitted in a Host header field. 
For example, a client wishing to retrieve the resource above directly from the origin server would create a TCP connection to port 80 of the host "www.w3.org" and send the lines:
/*
GET /pub/WWW/TheProject.html HTTP/1.1
Host: www.w3.org
*/

If the Request-URI is encoded using the "% HEX HEX" encoding, the origin server MUST decode the Request-URI in order to properly interpret the request. Servers SHOULD respond to invalid Request-URIs with an appropriate status code.
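
The Request-Line grammar and the four Request-URI forms can be sketched as below. The function names are hypothetical, and the classification is a rough heuristic (for example, it treats anything containing "://" as an absoluteURI), not a full URI parser.

```python
def parse_request_line(line):
    """Split 'Method SP Request-URI SP HTTP-Version CRLF' into its three parts."""
    if not line.endswith("\r\n") or "\r" in line[:-2] or "\n" in line[:-2]:
        raise ValueError("Request-Line must contain exactly one final CRLF")
    parts = line[:-2].split(" ")
    if len(parts) != 3 or "" in parts:
        raise ValueError("Request-Line needs three elements separated by single SP")
    method, request_uri, version = parts
    return method, request_uri, version


def request_uri_form(method, request_uri):
    """Classify the Request-URI as one of the four forms listed above."""
    if request_uri == "*":
        return "asterisk"
    if method == "CONNECT":            # method tokens are case-sensitive
        return "authority"
    if request_uri.startswith("/"):
        return "abs_path"
    if "://" in request_uri:
        return "absoluteURI"
    raise ValueError(f"unrecognized Request-URI: {request_uri!r}")
```

A strict parser like this rejects input that a lenient server might still accept, and that gap between strict and lenient parsing is exactly where the bypasses noted earlier in this section come from.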

0x2: The Resource Identified by a Request

0x3: Request Header Fields

The request-header fields allow the client to pass additional information about the request, and about the client itself, to the server. These fields act as request modifiers, with semantics equivalent to the parameters on a programming language method invocation.

request-header = Accept                   
          | Accept-Charset           
          | Accept-Encoding          
          | Accept-Language         
          | Authorization            
          | Expect                    
          | From                     
          | Host                     
          | If-Match                  
          | If-Modified-Since         
          | If-None-Match             
          | If-Range                  
          | If-Unmodified-Since      
          | Max-Forwards              
          | Proxy-Authorization      
          | Range                     
          | Referer                 
          | TE                       
          | User-Agent      

Relevant Link:

https://www.ietf.org/rfc/rfc2616.txt

 

5. Response

After receiving and interpreting a request message, a server responds with an HTTP response message. 

Response      = Status-Line
                *(( general-header
                 | response-header
                 | entity-header ) CRLF)
                CRLF
                [ message-body ]

0x1: Status-Line  

The first line of a Response message is the Status-Line, consisting of the protocol version followed by a numeric status code and its associated textual phrase, with each element separated by SP characters. No CR or LF is allowed except in the final CRLF sequence.  

Status-Line = HTTP-Version SP Status-Code SP Reason-Phrase CRLF

1. Status Code and Reason Phrase  

The Status-Code element is a 3-digit integer result code of the attempt to understand and satisfy the request. The Reason-Phrase is intended to give a short textual description of the Status-Code. The Status-Code is intended for use by automata and the Reason-Phrase is intended for the human user. The client is not required to examine or display the Reason-Phrase.

The first digit of the Status-Code defines the class of response. The last two digits do not have any categorization role. There are 5 values for the first digit:

- 1xx: Informational - Request received, continuing process
- 2xx: Success - The action was successfully received, understood, and accepted
- 3xx: Redirection - Further action must be taken in order to complete the request
- 4xx: Client Error - The request contains bad syntax or cannot be fulfilled
- 5xx: Server Error - The server failed to fulfill an apparently valid request

The individual values of the numeric status codes defined for HTTP/1.1, and an example set of corresponding Reason-Phrases, are presented below. The reason phrases listed here are only recommendations -- they MAY be replaced by local equivalents without affecting the protocol.

Status-Code    =
            "100"  : Continue
          | "101"  : Switching Protocols
          | "200"  : OK
          | "201"  : Created
          | "202"  : Accepted
          | "203"  : Non-Authoritative Information
          | "204"  : No Content
          | "205"  : Reset Content
          | "206"  : Partial Content
          | "300"  : Multiple Choices
          | "301"  : Moved Permanently
          | "302"  : Found
          | "303"  : See Other
          | "304"  : Not Modified
          | "305"  : Use Proxy
          | "307"  : Temporary Redirect
          | "400"  : Bad Request
          | "401"  : Unauthorized
          | "402"  : Payment Required
          | "403"  : Forbidden
          | "404"  : Not Found
          | "405"  : Method Not Allowed
          | "406"  : Not Acceptable
          | "407"  : Proxy Authentication Required
          | "408"  : Request Time-out
          | "409"  : Conflict
          | "410"  : Gone
          | "411"  : Length Required
          | "412"  : Precondition Failed
          | "413"  : Request Entity Too Large
          | "414"  : Request-URI Too Large
          | "415"  : Unsupported Media Type
          | "416"  : Requested range not satisfiable
          | "417"  : Expectation Failed
          | "500"  : Internal Server Error
          | "501"  : Not Implemented
          | "502"  : Bad Gateway
          | "503"  : Service Unavailable
          | "504"  : Gateway Time-out
          | "505"  : HTTP Version not supported
          | extension-code

HTTP status codes are extensible. HTTP applications are not required to understand the meaning of all registered status codes, though such understanding is obviously desirable. However, applications MUST understand the class of any status code, as indicated by the first digit, and treat any unrecognized response as being equivalent to the x00 status code of that class, with the exception that an unrecognized response MUST NOT be cached.
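
The fallback rule for unrecognized status codes can be sketched in a few lines. The function names and the small set of "known" codes passed in are assumptions for the example; the class names come from the list of first digits above.

```python
STATUS_CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}


def status_class(code):
    """Name the class of a 3-digit status code from its first digit."""
    if not 100 <= code <= 599:
        raise ValueError(f"status code out of range: {code}")
    return STATUS_CLASSES[code // 100]


def effective_status(code, known_codes):
    """Treat an unrecognized code as the x00 code of its class."""
    status_class(code)  # validates the range
    return code if code in known_codes else (code // 100) * 100
```

For example, a client that does not recognize 431 can safely handle it as 400, although the response itself must not be cached.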

0x2: Response Header Fields

The response-header fields allow the server to pass additional information about the response which cannot be placed in the Status-Line. These header fields give information about the server and about further access to the resource identified by the Request-URI.

response-header = Accept-Ranges           
           | Age                     
           | ETag                   
           | Location                
           | Proxy-Authenticate      
           | Retry-After             
           | Server                   
           | Vary                     
           | WWW-Authenticate 

Response-header field names can be extended reliably only in combination with a change in the protocol version. However, new or experimental header fields MAY be given the semantics of response-header fields if all parties in the communication recognize them to be response-header fields. Unrecognized header fields are treated as entity-header fields.

Relevant Link:

https://www.ietf.org/rfc/rfc2616.txt

To be continued.

Copyright (c) 2015 LittleHann All rights reserved


Original article: http://www.cnblogs.com/LittleHann/p/5057295.html
