Email-Format Regular Expressions and RFC 5322, Internet Message Format

Date: 2015-07-28 16:12:30

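Searching Baidu for an email-format regular expression turns up all sorts of patterns. Using one of them without careful vetting may well mean that some unusual but valid email addresses are not recognized. This article therefore walks through the mail-related RFC standard; see RFC 5322, Internet Message Format [2-RFC5322] for the details. Before that, it is necessary to learn the core ABNF rules in [1-RFC5234].

Appendix B, Core ABNF of ABNF, in [1-RFC5234]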

B.1. Core Rules

         ALPHA          =  %x41-5A / %x61-7A   ; A-Z / a-z
         BIT            =  "0" / "1"
         CHAR           =  %x01-7F         ; any 7-bit US-ASCII character, excluding NUL
         CR             =  %x0D            ; carriage return
         CRLF           =  CR LF           ; Internet standard newline
         CTL            =  %x00-1F / %x7F  ; controls
         DIGIT          =  %x30-39         ; 0-9
         DQUOTE         =  %x22            ; " (Double Quote)
         HEXDIG         =  DIGIT / "A" / "B" / "C" / "D" / "E" / "F"
         HTAB           =  %x09            ; horizontal tab
         LF             =  %x0A            ; linefeed
         LWSP           =  *(WSP / CRLF WSP)
                                ; Use of this linear-white-space rule
                                ;  permits lines containing only white
                                ;  space that are no longer legal in
                                ;  mail headers and have caused
                                ;  interoperability problems in other
                                ;  contexts.
                                ; Do not use when defining mail
                                ;  headers and use with caution in
                                ;  other contexts.
         OCTET          =  %x00-FF       ; 8 bits of data
         SP             =  %x20          ; space 
         VCHAR          =  %x21-7E       ; visible (printing) characters
         WSP            =  SP / HTAB     ; white space
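
To make the core rules concrete, here is a minimal Python sketch that writes a few of them as regular-expression character classes. The `CORE` table and the `matches_rule` helper are illustrative names chosen for this example and do not come from either RFC.

```python
import re

# A few RFC 5234 core rules written as regex character classes.
# Quoted strings in ABNF are case-insensitive, so HEXDIG also accepts a-f.
CORE = {
    "ALPHA": r"[A-Za-z]",
    "DIGIT": r"[0-9]",
    "HEXDIG": r"[0-9A-Fa-f]",
    "WSP": r"[ \t]",
    "VCHAR": r"[\x21-\x7E]",
    "CTL": r"[\x00-\x1F\x7F]",
}


def matches_rule(rule: str, ch: str) -> bool:
    """Return True if the single character ch matches the named core rule."""
    return re.fullmatch(CORE[rule], ch) is not None
```

For instance, `matches_rule("VCHAR", "~")` is true while `matches_rule("VCHAR", " ")` is false, because the space (SP) is white space rather than a visible character.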

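The following is taken from [2-RFC5322]; the section numbers match the original.

2. Lexical Analysis of Messages

Characters with values in the range 1 through 127 are called US-ASCII characters.

Messages are divided into lines of characters. A line is delimited by the two characters carriage-return (CR, ASCII value 13) and line-feed (LF, ASCII value 10) in immediate succession, usually written as CRLF.

A message consists of header fields, optionally followed by a body.

The header section is a sequence of lines with a specific syntax. It is followed by an empty line, and then by the body.

This specification uses "header field" for a single field and "header section" for the entire collection of header fields.

Each line MUST be no more than 998 characters, and SHOULD be no more than 78 characters, excluding the CRLF.

998 characters plus the CRLF make 1000 characters, and many receiving implementations limit a line to 1000 characters.

78 characters plus the CRLF make 80 characters, and many display interfaces truncate or wrap lines longer than 80 characters.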
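
The two limits can be checked mechanically. The following is a minimal sketch, assuming the message is already available as a string that uses CRLF line endings; `check_line_lengths` is a hypothetical helper name.

```python
def check_line_lengths(message: str):
    """Return (errors, warnings) as lists of 1-based line numbers.

    errors:   lines longer than 998 characters (the MUST limit)
    warnings: lines longer than 78 characters (the SHOULD limit)
    Lengths exclude the CRLF, because the text is split on it.
    """
    errors, warnings = [], []
    for number, line in enumerate(message.split("\r\n"), start=1):
        if len(line) > 998:
            errors.append(number)
        elif len(line) > 78:
            warnings.append(number)
    return errors, warnings
```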

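A header field consists of a field name, followed by a colon, followed by a field body, and is terminated by CRLF.

A field name MUST be composed of printable US-ASCII characters, that is, characters with values in [33, 126], except the colon.

A field body MUST be composed of printable US-ASCII characters plus the space (SP, ASCII value 32) and horizontal tab (HTAB, ASCII value 9) characters. SP and HTAB together are known as WSP (white space characters).

A field body MUST NOT include CRLF except when it is used for "folding" and "unfolding".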
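
As an illustration of these character rules, the sketch below splits one header line into its field name and field body. It assumes the line has already been unfolded and its trailing CRLF removed; `HEADER_RE` and `split_header` are names invented for this example.

```python
import re

# Field name: printable US-ASCII (33-126) except ":" (58).
# Field body: printable US-ASCII plus SP and HTAB.
HEADER_RE = re.compile(r"([\x21-\x39\x3B-\x7E]+):([\x20-\x7E\t]*)")


def split_header(line: str):
    """Split 'Name: body' into (name, body); raise ValueError if malformed."""
    match = HEADER_RE.fullmatch(line)
    if match is None:
        raise ValueError(f"not a valid header field: {line!r}")
    # Surrounding white space in the body is trimmed for convenience only.
    return match.group(1), match.group(2).strip()
```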

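Some field bodies are "unstructured": they are treated as a single line of characters with no further processing.

Some field bodies are "structured": they consist of specific tokens, which may be preceded or followed by comments and white space characters.

Each header field is logically a single line of characters comprising the field name, the colon, and the field body.

To cope with the 998/78-character line limits, the field body of a header field can be split into a multi-line representation; this is called "folding".

For example, the header field:

Subject: This is a test

can be represented as:

Subject: This
 is a test

The process of turning a folded multi-line representation back into a single line is called "unfolding": every CRLF that is immediately followed by WSP is removed.

Each header field should be treated in its unfolded form for further syntactic and semantic evaluation.

An unfolded header field has no length restriction.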
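
Unfolding as described above fits in a single regular-expression substitution. This is a minimal sketch; `unfold` is a hypothetical helper name.

```python
import re


def unfold(header: str) -> str:
    """Remove every CRLF that is immediately followed by WSP (SP or HTAB)."""
    # The lookahead keeps the WSP itself in the result.
    return re.sub(r"\r\n(?=[ \t])", "", header)
```

Applied to the folded example, `unfold("Subject: This\r\n is a test")` gives back `"Subject: This is a test"`. A CRLF that is not followed by WSP marks the end of the header field and is left alone.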

3. Syntax

3.2. Lexical Tokens

3.2.1. Quoted characters

quoted-pair     =   ("\" (VCHAR / WSP)) / obs-qp

3.2.2. Folding White Space and Comments

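In header field bodies, white space characters may appear between many elements; these include the white space used for folding.

Strings of characters enclosed in parentheses are treated as comments, as long as they do not appear within a quoted-string. Comments may nest.

The folding white space (FWS) and comment constructs are defined below.

There are several places where comments and FWS may be freely inserted; a new token, "CFWS", is defined for them.

However, no line of a folded header field may consist entirely of WSP characters.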

   FWS             =   ([*WSP CRLF] 1*WSP) /  obs-FWS
                                          ; Folding white space

   ctext           =   %d33-39 /          ; Printable US-ASCII characters
                       %d42-91 /          ;  in the range [33-126], not including
                       %d93-126 /         ;  40="(", 41=")", or 92="\"
                       obs-ctext

                       %d33-39:
                       33: "!"
                       34: """
                       35: "#"
                       36: "$"
                       37: "%"
                       38: "&"
                       39: "'"

                       %d42-91:
                       42: "*"
                       43: "+"
                       44: ","
                       45: "-"
                       46: "."
                       47: "/"
                       48-57: "0"-"9"
                       58: ":"
                       59: ";"
                       60: "<"
                       61: "="
                       62: ">"
                       63: "?"
                       64: "@"
                       65-90: "A"-"Z"
                       91: "["

                       %d93-126:
                       93: "]"
                       94: "^"
                       95: "_"
                       96: "`"
                       97-122: "a"-"z"
                       123: "{"
                       124: "|"
                       125: "}"
                       126: "~"

   ccontent        =   ctext / quoted-pair / comment
   comment         =   "(" *([FWS] ccontent) [FWS] ")"
   CFWS            =   (1*([FWS] comment) [FWS]) / FWS
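
Because comments nest, a single regular expression cannot remove them; a small scanner with a depth counter can. The sketch below is one possible approach under simplifying assumptions: the input is already unfolded, and obsolete syntax is ignored. `strip_comments` is a hypothetical helper name.

```python
def strip_comments(value: str) -> str:
    """Drop parenthesised comments (nesting allowed) outside quoted strings."""
    out = []
    depth = 0          # current comment nesting level
    in_quotes = False  # True while inside a DQUOTE-delimited string
    i = 0
    while i < len(value):
        ch = value[i]
        if ch == "\\" and i + 1 < len(value):
            # quoted-pair: the backslash and the next character belong together
            if depth == 0:
                out.append(value[i:i + 2])
            i += 2
            continue
        if in_quotes:
            if ch == '"':
                in_quotes = False
            out.append(ch)
        elif depth > 0:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
        elif ch == "(":
            depth = 1
        else:
            if ch == '"':
                in_quotes = True
            out.append(ch)
        i += 1
    if depth or in_quotes:
        raise ValueError("unterminated comment or quoted string")
    return "".join(out)
```

Parentheses inside a quoted string are kept as ordinary characters, and an escaped parenthesis inside a comment does not change the nesting depth.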

3.2.3. Atom

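In structured header field bodies, some productions are simple strings made up of basic characters; these productions are called atoms.

Some productions additionally allow the period character (".", ASCII value 46), so the "dot-atom" token is introduced for that case.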

   atext           =   ALPHA / DIGIT /    ; Printable US-ASCII characters
                       "!" / "#" /        ;  not including specials.
                       "$" / "%" /        ;  Used for atoms.
                       "&" / "'" /
                       "*" / "+" /
                       "-" / "/" /
                       "=" / "?" /
                       "^" / "_" /
                       "`" / "{" /
                       "|" / "}" /
                       "~"

   atom            =   [CFWS] 1*atext [CFWS]

   dot-atom-text   =   1*atext *("." 1*atext)

   dot-atom        =   [CFWS] dot-atom-text [CFWS]
   
   specials        =   "(" / ")" /        ; Special characters not in atext
                       "<" / ">" /        
                       "[" / "]" /
                       ":" / ";" /
                       "@" / "\" /
                       "," / "." /
                       DQUOTE
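
The atext and dot-atom-text rules translate almost directly into a regular expression. The following is a minimal sketch; `ATEXT`, `DOT_ATOM_TEXT`, and `is_dot_atom_text` are names chosen for this example.

```python
import re

# atext: letters, digits, and the printable characters that are not specials.
ATEXT = r"[A-Za-z0-9!#$%&'*+\-/=?^_`{|}~]"

# dot-atom-text = 1*atext *("." 1*atext)
DOT_ATOM_TEXT = re.compile(rf"{ATEXT}+(?:\.{ATEXT}+)*")


def is_dot_atom_text(s: str) -> bool:
    """Return True if s is a valid dot-atom-text."""
    return DOT_ATOM_TEXT.fullmatch(s) is not None
```

Note that a period may appear only between runs of atext, so a leading, trailing, or doubled period is rejected.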

3.2.4. Quoted Strings

A string of characters enclosed in double quote characters (DQUOTE, ASCII value 34).

   qtext           =   %d33 /             ; Printable characters, excluding DQUOTE and "\"
                       %d35-91 /          
                       %d93-126 /         
                       obs-qtext

   qcontent        =   qtext / quoted-pair

   quoted-string   =   [CFWS]
                       DQUOTE *([FWS] qcontent) [FWS] DQUOTE
                       [CFWS]

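A quoted-string is treated as a unit; that is, a quoted-string is syntactically identical to an atom.

Because a quoted-string may contain FWS, folding is permitted inside it.

Note also that quoted-pair is allowed inside a quoted-string, which lets the string contain the double quote and backslash characters.

Semantically, the "\" of any quoted-pair and the CRLF of any FWS/CFWS within a quoted-string are invisible, and so are not part of the quoted-string.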
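
The difference between the syntax of a quoted-string and its semantic content can be shown with a short sketch. It is deliberately simplified: FWS is reduced to runs of SP/HTAB (no CRLF), and the surrounding CFWS and the obsolete forms are left out. `QCONTENT`, `QUOTED_STRING`, and `unquote` are names invented for this example.

```python
import re

# qtext:       %d33 / %d35-91 / %d93-126
# quoted-pair: "\" followed by VCHAR or WSP
QCONTENT = r"(?:[\x21\x23-\x5B\x5D-\x7E]|\\[\x21-\x7E \t])"
QUOTED_STRING = re.compile(rf'"(?:[ \t]*{QCONTENT})*[ \t]*"')


def unquote(s: str) -> str:
    """Return the content of a quoted-string without quotes and escapes."""
    if QUOTED_STRING.fullmatch(s) is None:
        raise ValueError(f"not a quoted-string: {s!r}")
    # Drop the outer DQUOTEs, then the backslash of each quoted-pair.
    return re.sub(r"\\(.)", r"\1", s[1:-1])
```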

3.2.5. Miscellaneous Tokens

Three tokens are defined: word and phrase for combinations of atoms and/or quoted-strings, and unstructured for unstructured header fields and for some places within structured header fields.

   word            =   atom / quoted-string
   phrase          =   1*word / obs-phrase
   unstructured    =   (*([FWS] VCHAR) *WSP) / obs-unstruct

3.3. Date and Time Specification

   date-time       =   [ day-of-week "," ] date time [CFWS]

   day-of-week     =   ([FWS] day-name) / obs-day-of-week

   day-name        =   "Mon" / "Tue" / "Wed" / "Thu" / "Fri" / "Sat" / "Sun"

   date            =   day month year                 ; SHOULD express local time

   day             =   ([FWS] 1*2DIGIT FWS) / obs-day ; day of the month

   month           =   "Jan" / "Feb" / "Mar" / "Apr" /
                       "May" / "Jun" / "Jul" / "Aug" /
                       "Sep" / "Oct" / "Nov" / "Dec"

   year            =   (FWS 4*DIGIT FWS) / obs-year  ; four or more digits

   time            =   time-of-day zone

   time-of-day     =   hour ":" minute [ ":" second ] ; SHOULD express local time, as hour:minute[:second] of the day,
                                                      ; range 00:00:00 - 23:59:60

   hour            =   2DIGIT / obs-hour

   minute          =   2DIGIT / obs-minute

   second          =   2DIGIT / obs-second

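   zone            =   (FWS ( "+" / "-" ) 4DIGIT) / obs-zone
                       ; offset of the date and time-of-day from UTC (GMT)
                       ; "+" means ahead of (i.e., east of) UTC, "-" means behind (i.e., west of) UTC
                       ; the first two digits are the hours offset, the last two digits the minutes offset
                       ; +hhmm means +(hh * 60 + mm) minutes, -hhmm means -(hh * 60 + mm) minutes
                       ; "+0000" indicates the UTC time zone
                       ; "-0000" indicates a time generated in a local time zone, where the date-time
                       ;  carries no information about that local time zone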
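
The +hhmm/-hhmm arithmetic in the comments above can be written out as follows. This is a minimal sketch that handles only the non-obsolete numeric form, without the leading FWS; `ZONE` and `zone_offset_minutes` are hypothetical names.

```python
import re

ZONE = re.compile(r"([+-])([0-9]{2})([0-9]{2})")


def zone_offset_minutes(zone: str) -> int:
    """Convert '+hhmm' or '-hhmm' into a signed offset from UTC in minutes."""
    match = ZONE.fullmatch(zone)
    if match is None:
        raise ValueError(f"not a +hhmm/-hhmm zone: {zone!r}")
    sign, hh, mm = match.groups()
    if int(mm) > 59:
        raise ValueError(f"minutes out of range in zone: {zone!r}")
    minutes = int(hh) * 60 + int(mm)
    return -minutes if sign == "-" else minutes
```

Both "+0000" and "-0000" come out as an offset of 0, so the semantic difference between them is lost in this conversion. For complete date-time parsing, Python's standard library provides `email.utils.parsedate_to_datetime`.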

3.4. Address Specification

Addresses identify the recipients and senders of messages.

An address is either a single mailbox or a group of mailboxes.

   address         =   mailbox / group

   mailbox         =   name-addr / addr-spec

   name-addr       =   [display-name] angle-addr
   
   angle-addr      =   [CFWS] "<" addr-spec ">" [CFWS] /
                       obs-angle-addr

   group           =   display-name ":" [group-list] ";" [CFWS]

   display-name    =   phrase

   mailbox-list    =   (mailbox *("," mailbox)) / obs-mbox-list

   address-list    =   (address *("," address)) / obs-addr-list

   group-list      =   mailbox-list / CFWS / obs-group-list

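A mailbox normally has two parts: (1) an optional display-name, and (2) an addr-spec address enclosed in angle brackets ("<" and ">").

A simplified form of a mailbox consists of the addr-spec address alone, with no recipient name and no angle brackets.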
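
Both mailbox forms can be taken apart with Python's standard library. The sketch below wraps `email.utils.parseaddr` in a hypothetical `split_mailbox` helper; note that `parseaddr` only separates the display-name from the addr-spec and should not be relied on as a strict validator.

```python
from email.utils import parseaddr


def split_mailbox(mailbox: str):
    """Return (display_name, addr_spec) for a name-addr or a bare addr-spec."""
    # parseaddr returns an empty display name when only an addr-spec is given.
    return parseaddr(mailbox)
```

With this helper, `split_mailbox("John Doe <jdoe@example.com>")` yields the display-name and the address as separate strings, and a bare `jdoe@example.com` yields an empty display-name.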

3.4.1. Addr-Spec Specification

   addr-spec       =   local-part "@" domain

   local-part      =   dot-atom / quoted-string / obs-local-part

   domain          =   dot-atom / domain-literal / obs-domain

   domain-literal  =   [CFWS] "[" *([FWS] dtext) [FWS] "]" [CFWS]

   dtext           =   %d33-90 /          ; Printable US-ASCII characters
                       %d94-126 /         ;  not including "[", "]", or "\"
                       obs-dtext
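
Putting the addr-spec rules together gives a regular expression along the following lines. This is a sketch under stated assumptions rather than a complete RFC 5322 validator: CFWS (comments and folding) and all obs- forms are left out, FWS inside quoted strings and domain literals is reduced to SP/HTAB, and neither length limits nor the existence of the domain are checked. All names below are chosen for this example.

```python
import re

# atext and dot-atom-text, as in section 3.2.3
ATEXT = r"[A-Za-z0-9!#$%&'*+\-/=?^_`{|}~]"
DOT_ATOM = rf"{ATEXT}+(?:\.{ATEXT}+)*"

# quoted-string without CFWS: qtext, WSP, or a quoted-pair between DQUOTEs
QUOTED = r'"(?:[\x21\x23-\x5B\x5D-\x7E \t]|\\[\x21-\x7E \t])*"'

# domain-literal without CFWS: dtext (%d33-90 / %d94-126) or WSP in brackets
DOMAIN_LITERAL = r"\[[\x21-\x5A\x5E-\x7E \t]*\]"

# addr-spec = local-part "@" domain
ADDR_SPEC = re.compile(
    rf"(?:{DOT_ATOM}|{QUOTED})@(?:{DOT_ATOM}|{DOMAIN_LITERAL})"
)


def is_addr_spec(s: str) -> bool:
    """Return True if s matches the simplified addr-spec grammar above."""
    return ADDR_SPEC.fullmatch(s) is not None
```

This accepts forms that many hand-written patterns reject, such as `"john doe"@example.com` or `user@[192.168.0.1]`, while still rejecting `john..doe@example.com`.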

3.5. Overall Message Syntax

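A message consists of header fields, optionally followed by a message body.

A line in a message is at most 998 characters, and 78 characters or fewer is recommended; neither figure includes the CRLF.

In the message body, although all of the characters listed in the text rule may be used, the US-ASCII control characters (values 1 through 8, 11, 12, and 14 through 31) are discouraged, since there is no guarantee of how a receiver will display them.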

   message         =   (fields / obs-fields)
                       [CRLF body]

   body            =   (*(*998text CRLF) *998text) / obs-body

   text            =   %d1-9 /            ; Characters excluding CR
                       %d11 /             ;  and LF
                       %d12 /
                       %d14-127
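
The discouraged control characters are easy to detect. The following is a small sketch; `DISCOURAGED` and `discouraged_controls` are hypothetical names.

```python
# Control characters discouraged in a message body: 1-8, 11, 12, 14-31.
DISCOURAGED = set(range(1, 9)) | {11, 12} | set(range(14, 32))


def discouraged_controls(body: str):
    """Return the sorted code points of discouraged controls found in body."""
    return sorted({ord(ch) for ch in body if ord(ch) in DISCOURAGED})
```

HTAB (9), LF (10), and CR (13) are not in the set, so ordinary tabs and line endings are not reported.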

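The remaining sections are omitted.

Email address syntax is mainly covered by section 3.4.1, Addr-Spec Specification, above. It is advisable to write the regular expression against that specification; anyone who is not able to write it themselves is better off using the ready-made libraries available in PHP, Java, C#, C++, C, and other languages to do the check.

References: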

[1-RFC5234] RFC 5234, Augmented BNF for Syntax Specifications: ABNF, Standards Track, January 2008, http://www.rfc-editor.org/rfc/rfc5234.txt

[2-RFC5322] RFC 5322, Internet Message Format, Standards Track, October 2008, http://www.rfc-editor.org/rfc/rfc5322.txt

Original article: http://my.oschina.net/1pei/blog/484675
