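This article briefly introduces the characteristics of the common character encodings and shows how to do battle with encoding problems in Python 2.x :) Note that the Python content here applies only to 2.x; str and unicode changed radically in 3.x, so consult other documentation for that. Please respect the author's work: if you repost this, credit the author and link to the original.
ASCII (American Standard Code for Information Interchange) is a single-byte encoding. In the beginning the computing world only had English, and a single byte can represent 256 distinct values, enough for every English character plus many control codes. ASCII, however, uses only half of them (the values below \x80), and this is what makes MBCS possible.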
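Other languages soon arrived in the computing world, and single-byte ASCII could no longer cope. Each language then defined its own encoding. Because a single byte can represent too few characters, and because compatibility with ASCII had to be preserved, these encodings all use multiple bytes per character: GBxxx, BIGxxx and so on. Their rule is: if the first byte is below \x80, it is still an ASCII character; if it is \x80 or above, it forms one character together with the next byte (two bytes in total), after which decoding skips that next byte and continues.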
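The decoding rule above can be sketched in a few lines. This is only an illustration of the "below \x80 is one byte, otherwise two bytes" rule as the article states it, not a real GBK decoder; `split_dbcs` is a name made up for this example, and the sample character is U+6C49, the character used throughout this article.

```python
def split_dbcs(data):
    """Split a byte string into per-character chunks using the rule above."""
    raw = bytearray(data)  # iterating a bytearray yields ints on Python 2 and 3
    chunks = []
    i = 0
    while i < len(raw):
        # Below \x80: a single ASCII byte. Otherwise: this byte plus the next one.
        width = 1 if raw[i] < 0x80 else 2
        chunks.append(bytes(raw[i:i + width]))
        i += width
    return chunks

sample = u'a\u6c49b'.encode('gbk')  # ASCII, one double-byte character, ASCII
chunks = split_dbcs(sample)
print([len(c) for c in chunks])
```

The sample splits into chunks of one, two and one bytes, which is exactly how a double-byte encoding stays compatible with ASCII text.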
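IBM invented a concept called the Code Page, gathered all these encodings under it and assigned each a page number. GBK is page 936, that is, CP936, so CP936 can be used to mean GBK.
MBCS (Multi-Byte Character Set) is the collective name for these encodings. So far they have all used two bytes, so the term DBCS (Double-Byte Character Set) is also used. It must be stressed that MBCS is not one specific encoding: on Windows, MBCS refers to a different encoding depending on the region you have configured, and on Linux MBCS cannot be used as an encoding at all. You will not see the letters MBCS anywhere in Windows, because Microsoft, to sound fancier, uses the name ANSI instead: the ANSI entry in the encoding list of Notepad's Save As dialog is MBCS. With the default region settings of Simplified Chinese Windows, it refers to GBK.
Later, people began to feel that so many encodings made the world too complicated and gave everyone a headache, so they sat down together and came up with an idea: represent the characters of every language with a single character set. That is Unicode.
The original Unicode standard, UCS-2, uses two bytes per character, which is why you often hear that Unicode uses two bytes per character. Before long, though, 256*256 code points were judged too few, so UCS-4 appeared, which uses 4 bytes per character; UCS-2 is still what we use most.
UCS (Universal Character Set) is only a table mapping characters to code points; for example, the code point of "汉" is 6C49. How characters are actually transmitted and stored is the job of UTF (UCS Transformation Format).
At first this was simple: store the UCS code point directly, which is UTF-16. For example, "汉" is stored as \x6C\x49 (UTF-16-BE) or, byte-swapped, as \x49\x6C (UTF-16-LE). But after a while Americans felt short-changed: an English letter used to take one byte, and under the new communal arrangement it took two, doubling the storage cost... and so UTF-8 burst onto the scene.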
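UTF-8 is an awkward encoding: it is variable-length and ASCII-compatible, with ASCII characters taking 1 byte. But whatever is saved here has to be paid for somewhere else. You have surely heard that Chinese characters take 3 bytes in UTF-8, and the characters that need 4 bytes are worse off still... (For how UCS-2 is turned into UTF-8, please search for the details yourself.)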
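The byte counts mentioned above are easy to check. The following minimal sketch encodes U+6C49 (the "汉" of the examples) with the three encodings discussed so far; it is written to run on both Python 2.7 and Python 3, which is an assumption of this example rather than something the article requires.

```python
han = u'\u6c49'  # code point U+6C49

utf16_be = han.encode('utf-16-be')    # code point bytes in big-endian order
utf16_le = han.encode('utf-16-le')    # the same two bytes, swapped
utf8 = han.encode('utf-8')            # three bytes for this character
ascii_in_utf8 = u'a'.encode('utf-8')  # ASCII characters stay one byte

for name, data in [('UTF-16-BE', utf16_be), ('UTF-16-LE', utf16_le), ('UTF-8', utf8)]:
    print('%s: %d bytes' % (name, len(data)))
```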
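Also worth mentioning is the BOM (Byte Order Mark). When we save a file, the encoding it uses is not stored, so when opening it we have to remember the encoding used at save time and open it with that encoding, which causes a lot of trouble. (You may object that Notepad does not ask for an encoding when opening a file. Try starting Notepad first and then using File -> Open.) UTF introduced the BOM to identify its own encoding: if the first few bytes read are one of the following, the text that follows uses the corresponding encoding:
BOM_UTF8      '\xef\xbb\xbf'
BOM_UTF16_LE  '\xff\xfe'
BOM_UTF16_BE  '\xfe\xff'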
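Not every editor writes a BOM, but Unicode text can still be read without one; just as with the MBCS encodings, the specific encoding then has to be given separately, otherwise decoding will fail.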
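A reader that honours the three BOMs listed above might look like the following sketch. The constants come from the standard `codecs` module; `sniff_bom` and its return convention are made up for this example, and UTF-32 BOMs are deliberately ignored.

```python
import codecs

# The three BOMs listed above, as exposed by the standard codecs module.
BOMS = [
    (codecs.BOM_UTF8, 'utf-8'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
]

def sniff_bom(data):
    """Return (encoding, payload) if data starts with a known BOM, else (None, data)."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return None, data

encoding, payload = sniff_bom(codecs.BOM_UTF16_LE + b'\x49\x6c')
text = payload.decode(encoding)
```

When no BOM is found the function returns `None` for the encoding, which is the case where the caller has to supply the encoding some other way.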
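You may have heard that UTF-8 does not need a BOM. That is not the whole story: UTF-8 has no byte-order ambiguity, but without a BOM nothing in the file identifies its encoding; it is just that the vast majority of editors fall back to UTF-8 as the default when no BOM is present. Even Notepad, which saves as ANSI (MBCS) by default, first tries UTF-8 when reading a file and uses it if the bytes decode successfully. This awkward behaviour causes a bug: create a new text file, type "姹塧", save it as ANSI (MBCS), and when you reopen it you will see "汉a". Try it :)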
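The Notepad behaviour can be reproduced without Notepad. The sketch below assumes that ANSI means GBK (as on Simplified Chinese Windows) and works backwards from the text that ends up on screen, so that no non-ASCII literals are needed in the code.

```python
intended = u'\u6c49a'           # the text Notepad ends up showing
raw = intended.encode('utf-8')  # b'\xe6\xb1\x89a'
typed = raw.decode('gbk')       # the two characters GBK assigns to those bytes

# Saving `typed` as ANSI (GBK here) writes exactly `raw` to disk ...
saved = typed.encode('gbk')
# ... and because those bytes happen to be valid UTF-8, a reader that
# tries UTF-8 first decodes them as different text.
reopened = saved.decode('utf-8')
```

The same four bytes are valid in both encodings, which is why trial decoding cannot tell them apart.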
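str and unicode are both subclasses of basestring. Strictly speaking, str is a byte string: the sequence of bytes obtained by encoding a unicode. Calling len() on the UTF-8 encoded str '汉' gives 3, because the UTF-8 encoded '汉' == '\xE6\xB1\x89'.
unicode is the true character string, obtained by decoding a byte string str with the correct character encoding, and len(u'汉') == 1.
Now look at encode() and decode(), two instance methods of basestring. Once the difference between str and unicode is understood, the two methods are no longer confusing: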
# coding: UTF-8
u = u'汉'
print repr(u)  # u'\u6c49'
s = u.encode('UTF-8')
print repr(s)  # '\xe6\xb1\x89'
u2 = s.decode('UTF-8')
print repr(u2)  # u'\u6c49'

# Decoding a unicode is a mistake
# s2 = u.decode('UTF-8')
# Likewise, encoding a str is a mistake
# u2 = s.encode('UTF-8')
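Note that although calling encode() on a str is a mistake, Python does not reject the call itself: it first implicitly decodes the str to unicode with the default encoding (ASCII unless it has been changed) and then encodes the result. For pure-ASCII content, or when the default encoding happens to match the data, this quietly returns another str with the same content but a different id; with the stock ASCII default and non-ASCII content, the implicit step raises UnicodeDecodeError. Calling decode() on a unicode behaves the same way, with an implicit encode that can raise UnicodeEncodeError. I do not understand why encode() and decode() were not placed on unicode and str respectively instead of both on basestring, but since that is how things are, we just have to be careful.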
If a source file contains non-ASCII characters, the character encoding must be declared at the top of the file, like this:
#-*- coding: UTF-8 -*-
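In fact Python only checks for the #, the word coding and the encoding name; the other characters are there for looks. Python also supports a large number of encodings, many with aliases, and the names are case-insensitive; UTF-8 can be written as u8, for example. See http://docs.python.org/library/codecs.html#standard-encodings.
Also note that the declared encoding must match the encoding the file is actually saved in, otherwise the source will very likely fail to parse. Modern IDEs usually handle this automatically, re-saving the file in the declared encoding when the declaration changes, but plain text editor users need to be careful :)
When a file is opened with the built-in open(), read() returns a str, which must then be decode()d with the correct encoding. When calling write(), a unicode argument should be encode()d with the encoding you want in the file; a str in some other encoding should first be decode()d with its own encoding into unicode and then encode()d with the target encoding. If you pass a unicode directly to write(), Python first encodes it with the default encoding (sys.getdefaultencoding(), ASCII unless it has been changed), not with the encoding declared in the source file, and then writes it.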
# coding: UTF-8
f = open('test.txt')
s = f.read()
f.close()
print type(s)  # <type 'str'>
# Known to be GBK encoded; decode to unicode
u = s.decode('GBK')

f = open('test.txt', 'w')
# Encode to a UTF-8 encoded str
s = u.encode('UTF-8')
f.write(s)
f.close()
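The read, decode, encode, write sequence above generalises to a small helper. This is a sketch under assumptions: `transcode` is a made-up name, the file is opened in binary mode so the code behaves the same on Python 2.7 and Python 3, and the demo file lives in a temporary directory instead of overwriting a real test.txt.

```python
import os
import tempfile

def transcode(path, src_encoding, dst_encoding):
    """Rewrite the file at path from src_encoding to dst_encoding."""
    with open(path, 'rb') as f:
        raw = f.read()                      # bytes, exactly as stored on disk
    text = raw.decode(src_encoding)         # bytes -> unicode
    with open(path, 'wb') as f:
        f.write(text.encode(dst_encoding))  # unicode -> bytes

# Hypothetical demo file, created in a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), 'test.txt')
with open(demo_path, 'wb') as f:
    f.write(u'\u6c49'.encode('gbk'))

transcode(demo_path, 'gbk', 'utf-8')
with open(demo_path, 'rb') as f:
    result = f.read()
```

The whole file is read into memory, which is fine for a sketch but not for very large files.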
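The codecs module also provides an open() function that takes an encoding; a file opened this way returns unicode when read. On write, a unicode argument is encoded with the encoding given to open() and written; a str argument is first implicitly decoded to unicode with the default encoding (sys.getdefaultencoding(), not the encoding declared in the source file) and then handled the same way. Compared with the built-in open(), this function is less prone to encoding problems.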
# coding: GBK
import codecs

f = codecs.open('test.txt', encoding='UTF-8')
u = f.read()
f.close()
print type(u)  # <type 'unicode'>

f = codecs.open('test.txt', 'a', encoding='UTF-8')
# Write a unicode
f.write(u)

# Write a str; decoding and encoding happen automatically
# A GBK-encoded str
s = '汉'
print repr(s)  # '\xba\xba'
# The GBK-encoded str is decoded to unicode, then encoded as UTF-8 and written.
# The implicit decode uses the default encoding, so this only works where the
# default encoding is GBK; with the stock ASCII default it raises UnicodeDecodeError.
f.write(s)
f.close()
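For comparison, here is a minimal round trip through codecs.open() that avoids the implicit str conversion altogether by only ever writing unicode. The temporary file path is an assumption of this example, and the code is written to run on both Python 2.7 and Python 3.

```python
import codecs
import os
import tempfile

demo_path = os.path.join(tempfile.mkdtemp(), 'test.txt')  # hypothetical demo file

# Writing: unicode in, UTF-8 bytes on disk.
f = codecs.open(demo_path, 'w', encoding='utf-8')
f.write(u'\u6c49')
f.close()

# Reading: UTF-8 bytes on disk, unicode out.
f = codecs.open(demo_path, encoding='utf-8')
text = f.read()
f.close()

# Look at the raw bytes to confirm what was actually stored.
with open(demo_path, 'rb') as raw_file:
    raw = raw_file.read()
```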
The sys and locale modules provide functions for finding the default encodings of the current environment.
# coding:gbk
import sys
import locale

def p(f):
    print '%s.%s(): %s' % (f.__module__, f.__name__, f())

# Return the default character encoding currently used by the system
p(sys.getdefaultencoding)

# Return the encoding used to convert Unicode file names to system file names
p(sys.getfilesystemencoding)

# Get the default locale settings as a tuple (language, encoding)
p(locale.getdefaultlocale)

# Return the encoding the user has configured for text data
# The documentation notes that "this function only returns a guess"
p(locale.getpreferredencoding)

# \xba\xba is the GBK encoding of '汉'
# mbcs is a discouraged encoding; it is used here only to show why
print r"'\xba\xba'.decode('mbcs'):", repr('\xba\xba'.decode('mbcs'))

# Results on the author's Windows (region set to Chinese (Simplified, China))
# sys.getdefaultencoding(): gbk
# sys.getfilesystemencoding(): mbcs
# locale.getdefaultlocale(): ('zh_CN', 'cp936')
# locale.getpreferredencoding(): cp936
# '\xba\xba'.decode('mbcs'): u'\u6c49'
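This one is a must.
Typing a u before every opening quote feels unnatural at first, and you will often forget and have to go back and add it, but doing so eliminates 90% of encoding problems. If encoding issues are not causing you much trouble, you can skip this recommendation.
If encoding issues are not causing you much trouble, you can skip this recommendation.
Avoiding MBCS here does not mean GBK and the like are off limits; it means not using the Python encoding that is literally named 'MBCS', unless the program will never be ported.
In Python the encodings 'MBCS' and 'DBCS' are synonyms, meaning whatever encoding MBCS refers to in the current Windows environment. The Linux builds of Python have no such encoding, so porting to Linux is guaranteed to raise an exception! Moreover, MBCS refers to a different encoding whenever the Windows system region differs. Here are the results of running the sys/locale code above under several region settings: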
# Chinese (Simplified, China)
# sys.getdefaultencoding(): gbk
# sys.getfilesystemencoding(): mbcs
# locale.getdefaultlocale(): ('zh_CN', 'cp936')
# locale.getpreferredencoding(): cp936
# '\xba\xba'.decode('mbcs'): u'\u6c49'

# English (United States)
# sys.getdefaultencoding(): UTF-8
# sys.getfilesystemencoding(): mbcs
# locale.getdefaultlocale(): ('zh_CN', 'cp1252')
# locale.getpreferredencoding(): cp1252
# '\xba\xba'.decode('mbcs'): u'\xba\xba'

# German (Germany)
# sys.getdefaultencoding(): gbk
# sys.getfilesystemencoding(): mbcs
# locale.getdefaultlocale(): ('zh_CN', 'cp1252')
# locale.getpreferredencoding(): cp1252
# '\xba\xba'.decode('mbcs'): u'\xba\xba'

# Japanese (Japan)
# sys.getdefaultencoding(): gbk
# sys.getfilesystemencoding(): mbcs
# locale.getdefaultlocale(): ('zh_CN', 'cp932')
# locale.getpreferredencoding(): cp932
# '\xba\xba'.decode('mbcs'): u'\uff7a\uff7a'
As you can see, after the region is changed, decoding with mbcs gives the wrong result. So when we need 'GBK' we should write 'GBK' directly, not 'MBCS'.
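The 'mbcs' codec only exists on Windows, but the effect shown in the results above can be reproduced anywhere by naming the code pages explicitly. The sketch below decodes the same two bytes with the three code pages that appear in those results; it assumes nothing beyond the standard codecs shipped with Python.

```python
raw = b'\xba\xba'  # the GBK bytes of U+6C49

as_cp936 = raw.decode('cp936')    # Simplified Chinese code page (GBK)
as_cp1252 = raw.decode('cp1252')  # Western European code page
as_cp932 = raw.decode('cp932')    # Japanese code page

for name, text in [('cp936', as_cp936), ('cp1252', as_cp1252), ('cp932', as_cp932)]:
    print('%s -> %r' % (name, text))
```

Only the first decode recovers the intended character; the other two succeed silently with different text, which is what makes a region-dependent codec name dangerous.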
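The same goes for UTF-16. On the vast majority of (little-endian) systems 'UTF-16' produces the same byte order as 'UTF-16-LE', with a BOM written in front, but writing 'UTF-16-LE' directly costs only three more characters, and if some system ever makes 'UTF-16' behave like 'UTF-16-BE', you would get wrong results. UTF-16 is actually used quite rarely, but when it is, this deserves attention.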
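The difference between the generic and the explicit UTF-16 codecs can be seen directly. This minimal sketch avoids asserting which byte order the generic codec picks, since that depends on the machine; it only checks that a BOM is present.

```python
import codecs

han = u'\u6c49'
generic = han.encode('utf-16')         # BOM followed by native byte order
explicit_le = han.encode('utf-16-le')  # no BOM, always little-endian
explicit_be = han.encode('utf-16-be')  # no BOM, always big-endian

# The generic codec announces its byte order with one of the two BOMs.
has_bom = generic[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
round_trip = generic.decode('utf-16')  # the BOM is consumed while decoding
```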
--END--
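Chinese encoding problems are a constant headache for programmers who work in Chinese, and Python is no exception. So how should we understand and solve Python's encoding problems?
We need to know that Python uses unicode internally, while the outside world presents all sorts of encodings, such as the gbk, gb2312 and utf8 that Chinese programmers deal with all the time. How do these encodings get converted to the internal unicode?
First let us look at strings in source files. A source file is a text file and is therefore necessarily stored in some encoding. By default Python assumes the source file is ASCII encoded. Suppose the code contains a variable assignment:
s1='a'
print s1
Python treats this 'a' as an ASCII encoded character. Everything works as long as only English characters are used, but if Chinese is used, for example:
s1='哈'
print s1
then running this file fails, and the cause is the encoding. Python treats the file content as ASCII by default, but ASCII contains no Chinese characters, so an exception is raised.
The solution is to tell Python which encoding the file uses. For Chinese, the common choices are utf-8, gbk and gb2312. Just add the following at the very top of the source file: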
# -*- coding: utf-8 -*-
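This tells Python that the text in this file is encoded in utf-8, so Python reads the characters according to utf-8 and converts them to unicode for internal processing.
However, if you run this code in a Windows console, the program executes but what is printed on screen is not the character 哈. This is caused by a mismatch between the Python encoding and the console encoding. The Windows console uses gbk, while the code uses utf-8; printing utf-8 encoded bytes to a gbk console naturally mismatches and cannot display the correct Chinese character.
One solution is to change the encoding of the source to gbk as well, that is, change the first line of the code to: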
# -*- coding: gbk -*-
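The other approach is to keep the source file in utf-8 and add a u in front of '哈', that is:
s1=u'哈'
print s1
Now the character '哈' is printed correctly.
The u here means that the string that follows is stored as unicode. Python recognises the Chinese character '哈' according to the utf-8 encoding declared on the first line and converts it to a unicode object. If we check the data type with type, type('哈') gives <type 'str'>, while type(u'哈') gives <type 'unicode'>. In other words, a u prefix marks a unicode object, which lives in memory in unicode form; without the u it is just a string in some encoding, determined by how Python identified the source file's encoding, which here is utf-8.
When Python writes a unicode object to the console it automatically converts it to the encoding of the output environment, but if the output is an ordinary string rather than a unicode object, the string's bytes are written as they are, which produces the behaviour described above.
Besides the u prefix, unicode objects can also be created with the unicode class and with the encode and decode methods of strings.
The unicode constructor takes a string argument and an encoding argument and wraps the string as a unicode. Here, since we are using utf-8, we pass 'utf-8' as the encoding argument to wrap the characters as a unicode object, which is then printed correctly to the console:
s1=unicode('哈', 'utf-8')
print s1
The decode function can also convert an ordinary string to a unicode object. Many people are confused about what the decode and encode functions of Python strings mean, so here is a brief explanation.
decode parses an ordinary string according to the encoding given as its argument and produces the corresponding unicode object. Since our code uses utf-8, converting a string to unicode looks like this:
s2='哈'.decode('utf-8')
s2 is now a unicode object holding the character '哈', the same as unicode('哈', 'utf-8') and u'哈'.
encode does exactly the opposite: it converts a unicode object to an ordinary string in the encoding given as its argument, as in the following code:
s3=unicode('哈', 'utf-8').encode('utf-8')
s3 is now back to being the utf-8 '哈'.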
About encodings
(First get familiar with how ascii, gb2312, gbk, utf-8 and unicode relate to each other: http://www.cnblogs.com/skynet/archive/2011/05/03/2035105.html#_3.4.UTF-8)
a. Encoding on the command line
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> a='nihao中国'
>>> a
'nihao\xd6\xd0\xb9\xfa'   # 'nihao' is ASCII; the Chinese is stored in the console's encoding (these bytes are GBK, not UTF-8). Showing the raw bytes is normal and is meant for debugging
>>> print a   # print is what produces output meant for the user
nihao中国
>>> b='\xd6\xd0\xb9\xfa'
>>> isinstance(b,unicode)   # check whether it is a unicode object
False
>>> b1=unicode(b,'gb2312')   # convert to a unicode string
>>> isinstance(b1,unicode)
True
>>> print b1
中国
>>> b1
u'\u4e2d\u56fd'
>>> b2=b1.encode('utf-8')   # convert to utf-8 encoded bytes
>>> isinstance(b2,unicode)
False
>>> print b2
中国
>>> b2
'\xe4\xb8\xad\xe5\x9b\xbd'
>>> b='hi'+u'you'   # concatenating with a Unicode string produces a Unicode string
>>> isinstance(b,unicode)
True
>>> b
u'hiyou'
>>> str(b)   # the built-in str() converts a Unicode string to an ASCII string
'hiyou'
>>> a=["你好","abnd"]
>>> print a
['\xc4\xe3\xba\xc3', 'abnd']
>>> print a[0]
你好
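The conversion chain in the session above (GBK bytes to unicode to UTF-8 bytes) can be condensed into one helper. `transcode_bytes` is a name made up for this sketch, and the byte values are the ones shown in the session.

```python
gbk_bytes = b'\xd6\xd0\xb9\xfa'    # as typed in the GBK console session above
text = gbk_bytes.decode('gb2312')  # bytes -> unicode
utf8_bytes = text.encode('utf-8')  # unicode -> bytes in the target encoding

def transcode_bytes(data, src, dst):
    """Convert encoded bytes from one encoding to another via unicode."""
    return data.decode(src).encode(dst)

back_to_gbk = transcode_bytes(utf8_bytes, 'utf-8', 'gbk')
```

Going through unicode in the middle is the point: there is no direct conversion between two byte encodings.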
b. Encoding inside programs
The system encoding is as shown in [a].
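When Python handles non-ASCII data along the way, the following error often appears: UnicodeDecodeError: 'ascii' codec can't decode byte 0x?? in position 1: ordinal not in range(128)
0x?? is a value of 128 or above. By default Python assumes that byte strings are ASCII encoded whenever it has to convert them implicitly, so it cannot handle other encodings. Either decode the data explicitly with the right encoding, or set Python's default encoding to the one you need.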
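The error is easy to trigger on purpose, which also shows what the message is reporting. In this sketch the explicit `decode('ascii')` stands in for the implicit conversion Python 2 performs; the sample bytes are the ones from the console session in [a].

```python
data = b'nihao\xd6\xd0\xb9\xfa'  # ASCII letters followed by GBK bytes

try:
    data.decode('ascii')         # what an implicit conversion attempts
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    bad_position = exc.start     # index of the first byte outside ASCII

text = data.decode('gbk')        # naming the real encoding succeeds
```

The position in the message is the index of the first offending byte, here the byte right after the five ASCII letters.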
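The first fix is to add the following to the code:
(This had no effect on my system and program.)
The second:
In Python's Lib\site-packages folder, create a new file named sitecustomize.py (sitecustomize.py is a special script; Python will try to import it on startup, so any code in it will be run automatically.) and enter:
This sets the encoding automatically.
You can test it using the first method.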
This module defines base classes for standard Python codecs (encoders and decoders) and provides access to the internal Python codec registry which manages the codec and error handling lookup process.
It defines the following functions:
Register a codec search function. Search functions are expected to take one argument, the encoding name in all lower case letters, and return a CodecInfo object having the following attributes:
The various functions or classes take the following arguments:
encode and decode: These must be functions or methods which have the same interface as the encode()/decode() methods of Codec instances (see Codec Interface). The functions/methods are expected to work in a stateless mode.
incrementalencoder and incrementaldecoder: These have to be factory functions providing the following interface:
factory(errors='strict')
The factory functions must return objects providing the interfaces defined by the base classes IncrementalEncoder and IncrementalDecoder, respectively. Incremental codecs can maintain state.
streamreader and streamwriter: These have to be factory functions providing the following interface:
factory(stream, errors='strict')
The factory functions must return objects providing the interfaces defined by the base classes StreamWriter and StreamReader, respectively. Stream codecs can maintain state.
Possible values for errors are
as well as any other error handling name defined via register_error().
In case a search function cannot find a given encoding, it should return None.
Looks up the codec info in the Python codec registry and returns a CodecInfo object as defined above.
Encodings are first looked up in the registry's cache. If not found, the list of registered search functions is scanned. If no CodecInfo object is found, a LookupError is raised. Otherwise, the CodecInfo object is stored in the cache and returned to the caller.
To simplify access to the various codecs, the module provides these additional functions which use lookup() for the codec lookup:
Look up the codec for the given encoding and return its encoder function.
Raises a LookupError in case the encoding cannot be found.
Look up the codec for the given encoding and return its decoder function.
Raises a LookupError in case the encoding cannot be found.
Look up the codec for the given encoding and return its incremental encoder class or factory function.
Raises a LookupError in case the encoding cannot be found or the codec doesn’t support an incremental encoder.
New in version 2.5.
Look up the codec for the given encoding and return its incremental decoder class or factory function.
Raises a LookupError in case the encoding cannot be found or the codec doesn’t support an incremental decoder.
New in version 2.5.
Look up the codec for the given encoding and return its StreamReader class or factory function.
Raises a LookupError in case the encoding cannot be found.
Look up the codec for the given encoding and return its StreamWriter class or factory function.
Raises a LookupError in case the encoding cannot be found.
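As an illustration of the lookup functions described above, the following sketch fetches the stateless encoder and decoder for UTF-8 and calls them directly. Both return a tuple of (output, length consumed), as the Codec interface specifies; the unknown encoding name is deliberately made up.

```python
import codecs

info = codecs.lookup('utf-8')        # CodecInfo for the codec
encode = codecs.getencoder('utf-8')  # stateless encoder function
decode = codecs.getdecoder('utf-8')  # stateless decoder function

encoded, consumed_chars = encode(u'\u6c49')
decoded, consumed_bytes = decode(encoded)

try:
    codecs.lookup('no-such-encoding')  # deliberately unknown name
    found = True
except LookupError:
    found = False
```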
Register the error handling function error_handler under the name name. error_handler will be called during encoding and decoding in case of an error, when name is specified as the errors parameter.
For encoding error_handler will be called with a UnicodeEncodeError instance, which contains information about the location of the error. The error handler must either raise this or a different exception or return a tuple with a replacement for the unencodable part of the input and a position where encoding should continue. The encoder will encode the replacement and continue encoding the original input at the specified position. Negative position values will be treated as being relative to the end of the input string. If the resulting position is out of bound an IndexError will be raised.
Decoding and translating works similar, except UnicodeDecodeError or UnicodeTranslateError will be passed to the handler and that the replacement from the error handler will be put into the output directly.
Return the error handler previously registered under the name name.
Raises a LookupError in case the handler cannot be found.
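A custom error handler following the protocol above might look like this sketch. The handler name 'mark_unencodable' and the '[?]' replacement are choices made for this example; the handler returns the replacement together with the position at which encoding should resume.

```python
import codecs

def mark_unencodable(exc):
    """Replace each unencodable run with '[?]' and resume right after it."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc  # this sketch only handles encoding errors
    return (u'[?]', exc.end)

# 'mark_unencodable' is a hypothetical handler name chosen for this example.
codecs.register_error('mark_unencodable', mark_unencodable)

result = u'a\u6c49b'.encode('ascii', 'mark_unencodable')
handler = codecs.lookup_error('mark_unencodable')
```

Once registered, the name can be passed as the errors argument anywhere a built-in name such as 'replace' is accepted.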
To simplify working with encoded files or stream, the module also defines these utility functions:
Open an encoded file using the given mode and return a wrapped version providing transparent encoding/decoding. The default file mode is 'r' meaning to open the file in read mode.
Note
The wrapped version will only accept the object format defined by the codecs, i.e. Unicode objects for most built-in codecs. Output is also codec-dependent and will usually be Unicode as well.
Note
Files are always opened in binary mode, even if no binary mode was specified. This is done to avoid data loss due to encodings using 8-bit values. This means that no automatic conversion of '\n' is done on reading and writing.
encoding specifies the encoding which is to be used for the file.
errors may be given to define the error handling. It defaults to 'strict' which causes a ValueError to be raised in case an encoding error occurs.
buffering has the same meaning as for the built-in open() function. It defaults to line buffered.
Return a wrapped version of file which provides transparent encoding translation.
Strings written to the wrapped file are interpreted according to the given input encoding and then written to the original file as strings using the output encoding. The intermediate encoding will usually be Unicode but depends on the specified codecs.
If output is not given, it defaults to input.
errors may be given to define the error handling. It defaults to 'strict', which causes ValueError to be raised in case an encoding error occurs.
Uses an incremental encoder to iteratively encode the input provided by iterable. This function is a generator. errors (as well as any other keyword argument) is passed through to the incremental encoder.
New in version 2.5.
Uses an incremental decoder to iteratively decode the input provided by iterable. This function is a generator. errors (as well as any other keyword argument) is passed through to the incremental decoder.
New in version 2.5.
The module also provides the following constants which are useful for reading and writing to platform dependent files:
The codecs module defines a set of base classes which define the interface and can also be used to easily write your own codecs for use in Python.
Each codec has to define four interfaces to make it usable as codec in Python: stateless encoder, stateless decoder, stream reader and stream writer. The stream reader and writers typically reuse the stateless encoder/decoder to implement the file protocols.
The Codec class defines the interface for stateless encoders/decoders.
To simplify and standardize error handling, the encode() and decode() methods may implement different error handling schemes by providing the errors string argument. The following string values are defined and implemented by all standard Python codecs:
Value | Meaning |
---|---|
'strict' | Raise UnicodeError (or a subclass); this is the default. |
'ignore' | Ignore the character and continue with the next. |
'replace' | Replace with a suitable replacement character; Python will use the official U+FFFD REPLACEMENT CHARACTER for the built-in Unicode codecs on decoding and '?' on encoding. |
'xmlcharrefreplace' | Replace with the appropriate XML character reference (only for encoding). |
'backslashreplace' | Replace with backslashed escape sequences (only for encoding). |
The set of allowed values can be extended via register_error().
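The effect of each value in the table is easiest to see on a character that the target encoding cannot represent. This minimal sketch encodes U+6C49 to ASCII under each scheme and decodes one invalid UTF-8 byte with 'replace'; it is written to run on both Python 2.7 and Python 3.

```python
han = u'\u6c49'  # not representable in ASCII

strict_failed = False
try:
    han.encode('ascii')  # 'strict' is the default
except UnicodeError:
    strict_failed = True

ignored = (u'a' + han).encode('ascii', 'ignore')    # the character is dropped
replaced = han.encode('ascii', 'replace')           # '?' on encoding
xml_ref = han.encode('ascii', 'xmlcharrefreplace')  # decimal character reference
escaped = han.encode('ascii', 'backslashreplace')   # backslash escape sequence
decoded_bad = b'\xff'.decode('utf-8', 'replace')    # U+FFFD on decoding
```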
The Codec class defines these methods which also define the function interfaces of the stateless encoder and decoder:
Encodes the object input and returns a tuple (output object, length consumed). While codecs are not restricted to use with Unicode, in a Unicode context, encoding converts a Unicode object to a plain string using a particular character set encoding (e.g., cp1252 or iso-8859-1).
errors defines the error handling to apply. It defaults to 'strict' handling.
The method may not store state in the Codec instance. Use StreamCodec for codecs which have to keep state in order to make encoding/decoding efficient.
The encoder must be able to handle zero length input and return an empty object of the output object type in this situation.
Decodes the object input and returns a tuple (output object, length consumed). In a Unicode context, decoding converts a plain string encoded using a particular character set encoding to a Unicode object.
input must be an object which provides the bf_getreadbuf buffer slot. Python strings, buffer objects and memory mapped files are examples of objects providing this slot.
errors defines the error handling to apply. It defaults to 'strict' handling.
The method may not store state in the Codec instance. Use StreamCodec for codecs which have to keep state in order to make encoding/decoding efficient.
The decoder must be able to handle zero length input and return an empty object of the output object type in this situation.
The IncrementalEncoder and IncrementalDecoder classes provide the basic interface for incremental encoding and decoding. Encoding/decoding the input isn't done with one call to the stateless encoder/decoder function, but with multiple calls to the encode()/decode() method of the incremental encoder/decoder. The incremental encoder/decoder keeps track of the encoding/decoding process during method calls.
The joined output of calls to the encode()/decode() method is the same as if all the single inputs were joined into one, and this input was encoded/decoded with the stateless encoder/decoder.
New in version 2.5.
The IncrementalEncoder class is used for encoding an input in multiple steps. It defines the following methods which every incremental encoder must define in order to be compatible with the Python codec registry.
Constructor for an IncrementalEncoder instance.
All incremental encoders must provide this constructor interface. They are free to add additional keyword arguments, but only the ones defined here are used by the Python codec registry.
The IncrementalEncoder may implement different error handling schemes by providing the errors keyword argument. These parameters are predefined:
The errors argument will be assigned to an attribute of the same name. Assigning to this attribute makes it possible to switch between different error handling strategies during the lifetime of the IncrementalEncoder object.
The set of allowed values for the errors argument can be extended with register_error().
The IncrementalDecoder class is used for decoding an input in multiple steps. It defines the following methods which every incremental decoder must define in order to be compatible with the Python codec registry.
Constructor for an IncrementalDecoder instance.
All incremental decoders must provide this constructor interface. They are free to add additional keyword arguments, but only the ones defined here are used by the Python codec registry.
The IncrementalDecoder may implement different error handling schemes by providing the errors keyword argument. These parameters are predefined:
The errors argument will be assigned to an attribute of the same name. Assigning to this attribute makes it possible to switch between different error handling strategies during the lifetime of the IncrementalDecoder object.
The set of allowed values for the errors argument can be extended with register_error().
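To show why an incremental decoder is useful, the sketch below feeds it a UTF-8 sequence that has been split in the middle of a character, as can happen when reading from a socket or in fixed-size chunks. The chunk boundaries are chosen for this example.

```python
import codecs

decoder = codecs.getincrementaldecoder('utf-8')()  # errors defaults to 'strict'

first = decoder.decode(b'\xe6')       # incomplete sequence: nothing emitted yet
second = decoder.decode(b'\xb1\x89')  # sequence completed
tail = decoder.decode(b'', True)      # final=True flushes; nothing is pending
```

A stateless decode of the first chunk alone would raise an error; the incremental decoder instead keeps the partial sequence until the rest arrives.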
The StreamWriter and StreamReader classes provide generic working interfaces which can be used to implement new encoding submodules very easily. See encodings.utf_8 for an example of how this is done.
The StreamWriter class is a subclass of Codec and defines the following methods which every stream writer must define in order to be compatible with the Python codec registry.
Constructor for a StreamWriter instance.
All stream writers must provide this constructor interface. They are free to add additional keyword arguments, but only the ones defined here are used by the Python codec registry.
stream must be a file-like object open for writing binary data.
The StreamWriter may implement different error handling schemes by providing the errors keyword argument. These parameters are predefined:
The errors argument will be assigned to an attribute of the same name. Assigning to this attribute makes it possible to switch between different error handling strategies during the lifetime of the StreamWriter object.
The set of allowed values for the errors argument can be extended with register_error().
Flushes and resets the codec buffers used for keeping state.
Calling this method should ensure that the data on the output is put into a clean state that allows appending of new fresh data without having to rescan the whole stream to recover state.
In addition to the above methods, the StreamWriter must also inherit all other methods and attributes from the underlying stream.
The StreamReader class is a subclass of Codec and defines the following methods which every stream reader must define in order to be compatible with the Python codec registry.
Constructor for a StreamReader instance.
All stream readers must provide this constructor interface. They are free to add additional keyword arguments, but only the ones defined here are used by the Python codec registry.
stream must be a file-like object open for reading (binary) data.
The StreamReader may implement different error handling schemes by providing the errors keyword argument. These parameters are defined:
The errors argument will be assigned to an attribute of the same name. Assigning to this attribute makes it possible to switch between different error handling strategies during the lifetime of the StreamReader object.
The set of allowed values for the errors argument can be extended with register_error().
Decodes data from the stream and returns the resulting object.
chars indicates the number of characters to read from the stream. read() will never return more than chars characters, but it might return less, if there are not enough characters available.
size indicates the approximate maximum number of bytes to read from the stream for decoding purposes. The decoder can modify this setting as appropriate. The default value -1 indicates to read and decode as much as possible. size is intended to prevent having to decode huge files in one step.
firstline indicates that it would be sufficient to only return the first line, if there are decoding errors on later lines.
The method should use a greedy read strategy meaning that it should read as much data as is allowed within the definition of the encoding and the given size, e.g. if optional encoding endings or state markers are available on the stream, these should be read too.
Changed in version 2.4: chars argument added.
Changed in version 2.4.2: firstline argument added.
Read one line from the input stream and return the decoded data.
size, if given, is passed as size argument to the stream’s readline() method.
If keepends is false line-endings will be stripped from the lines returned.
Changed in version 2.4: keepends argument added.
Read all lines available on the input stream and return them as a list of lines.
Line-endings are implemented using the codec's decoder method and are included in the list entries if keepends is true.
sizehint, if given, is passed as the size argument to the stream's read() method.
Resets the codec buffers used for keeping state.
Note that no stream repositioning should take place. This method is primarily intended to be able to recover from decoding errors.
In addition to the above methods, the StreamReader must also inherit all other methods and attributes from the underlying stream.
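A StreamReader is normally obtained through the registry and wrapped around a binary stream. The sketch below uses an in-memory `io.BytesIO` object as a stand-in for a file opened in binary mode, which is an assumption of this example.

```python
import codecs
import io

# An in-memory binary stream holding two UTF-8 encoded lines.
stream = io.BytesIO(u'one\n\u6c49\n'.encode('utf-8'))
reader = codecs.getreader('utf-8')(stream)  # StreamReader factory from the registry

first_line = reader.readline()  # keepends defaults to true
rest = reader.readlines()       # remaining lines, decoded
```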
The next two base classes are included for convenience. They are not needed by the codec registry, but may prove useful in practice.
The StreamReaderWriter allows wrapping streams which work in both read and write modes.
The design is such that one can use the factory functions returned by the lookup() function to construct the instance.
StreamReaderWriter instances define the combined interfaces of StreamReader and StreamWriter classes. They inherit all other methods and attributes from the underlying stream.
The StreamRecoder provides a frontend - backend view of encoding data which is sometimes useful when dealing with different encoding environments.
The design is such that one can use the factory functions returned by the lookup() function to construct the instance.
Creates a StreamRecoder instance which implements a two-way conversion: encode and decode work on the frontend (the input to read() and output of write()) while Reader and Writer work on the backend (reading and writing to the stream).
You can use these objects to do transparent direct recodings from e.g. Latin-1 to UTF-8 and back.
stream must be a file-like object.
encode, decode must adhere to the Codec interface. Reader, Writer must be factory functions or classes providing objects of the StreamReader and StreamWriter interface respectively.
encode and decode are needed for the frontend translation, Reader and Writer for the backend translation. The intermediate format used is determined by the two sets of codecs, e.g. the Unicode codecs will use Unicode as the intermediate encoding.
Error handling is done in the same way as defined for the stream readers and writers.
StreamRecoder instances define the combined interfaces of StreamReader and StreamWriter classes. They inherit all other methods and attributes from the underlying stream.
Unicode strings are stored internally as sequences of codepoints (to be precise as Py_UNICODE arrays). Depending on the way Python is compiled (either via --enable-unicode=ucs2 or --enable-unicode=ucs4, with the former being the default) Py_UNICODE is either a 16-bit or 32-bit data type. Once a Unicode object is used outside of CPU and memory, CPU endianness and how these arrays are stored as bytes become an issue. Transforming a unicode object into a sequence of bytes is called encoding and recreating the unicode object from the sequence of bytes is known as decoding. There are many different methods for how this transformation can be done (these methods are also called encodings). The simplest method is to map the codepoints 0-255 to the bytes 0x0-0xff. This means that a unicode object that contains codepoints above U+00FF can't be encoded with this method (which is called 'latin-1' or 'iso-8859-1'). unicode.encode() will raise a UnicodeEncodeError that looks like this: UnicodeEncodeError: 'latin-1' codec can't encode character u'\u1234' in position 3: ordinal not in range(256).
There's another group of encodings (the so called charmap encodings) that choose a different subset of all unicode code points and how these codepoints are mapped to the bytes 0x0-0xff. To see how this is done simply open e.g. encodings/cp1252.py (which is an encoding that is used primarily on Windows). There's a string constant with 256 characters that shows you which character is mapped to which byte value.
All of these encodings can only encode 256 of the 65536 (or 1114111) codepoints defined in unicode. A simple and straightforward way that can store each Unicode code point, is to store each codepoint as two consecutive bytes. There are two possibilities: Store the bytes in big endian or in little endian order. These two encodings are called UTF-16-BE and UTF-16-LE respectively. Their disadvantage is that if e.g. you use UTF-16-BE on a little endian machine you will always have to swap bytes on encoding and decoding. UTF-16 avoids this problem: Bytes will always be in natural endianness. When these bytes are read by a CPU with a different endianness, then bytes have to be swapped though. To be able to detect the endianness of a UTF-16 byte sequence, there's the so called BOM (the "Byte Order Mark"). This is the Unicode character U+FEFF. This character will be prepended to every UTF-16 byte sequence. The byte swapped version of this character (0xFFFE) is an illegal character that may not appear in a Unicode text. So when the first character in an UTF-16 byte sequence appears to be a U+FFFE the bytes have to be swapped on decoding. Unfortunately up to Unicode 4.0 the character U+FEFF had a second purpose as a ZERO WIDTH NO-BREAK SPACE: A character that has no width and doesn't allow a word to be split. It can e.g. be used to give hints to a ligature algorithm. With Unicode 4.0 using U+FEFF as a ZERO WIDTH NO-BREAK SPACE has been deprecated (with U+2060 (WORD JOINER) assuming this role). Nevertheless Unicode software still must be able to handle U+FEFF in both roles: As a BOM it's a device to determine the storage layout of the encoded bytes, and vanishes once the byte sequence has been decoded into a Unicode string; as a ZERO WIDTH NO-BREAK SPACE it's a normal character that will be decoded like any other.
There's another encoding that is able to encode the full range of Unicode characters: UTF-8. UTF-8 is an 8-bit encoding, which means there are no issues with byte order in UTF-8. Each byte in a UTF-8 byte sequence consists of two parts: Marker bits (the most significant bits) and payload bits. The marker bits are a sequence of zero to six 1 bits followed by a 0 bit. Unicode characters are encoded like this (with x being payload bits, which when concatenated give the Unicode character):
Range | Encoding |
---|---|
U-00000000 ... U-0000007F | 0xxxxxxx |
U-00000080 ... U-000007FF | 110xxxxx 10xxxxxx |
U-00000800 ... U-0000FFFF | 1110xxxx 10xxxxxx 10xxxxxx |
U-00010000 ... U-001FFFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
U-00200000 ... U-03FFFFFF | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
U-04000000 ... U-7FFFFFFF | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
The least significant bit of the Unicode character is the rightmost x bit.
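The bit patterns in the table can be filled in by hand. The following sketch does so for the one- to four-byte forms only (the five- and six-byte rows are left out on purpose); `utf8_encode` is a name made up for this illustration, and real code should simply call `encode('utf-8')`.

```python
def utf8_encode(code_point):
    """Encode one code point by filling the bit patterns in the table above."""
    if code_point < 0x80:
        return bytearray([code_point])      # 0xxxxxxx
    if code_point < 0x800:
        lead, continuation_count = 0xC0, 1  # 110xxxxx
    elif code_point < 0x10000:
        lead, continuation_count = 0xE0, 2  # 1110xxxx
    elif code_point < 0x200000:
        lead, continuation_count = 0xF0, 3  # 11110xxx
    else:
        raise ValueError('5- and 6-byte forms are left out of this sketch')
    out = []
    for _ in range(continuation_count):
        out.append(0x80 | (code_point & 0x3F))  # 10xxxxxx, lowest 6 bits first
        code_point >>= 6
    out.append(lead | code_point)               # remaining high bits
    return bytearray(reversed(out))

encoded = utf8_encode(0x6C49)
```

The continuation bytes are produced from the least significant bits upward and then reversed, so the rightmost x in the table ends up holding the least significant bit.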
As UTF-8 is an 8-bit encoding no BOM is required and any U+FEFF character in the decoded Unicode string (even if it's the first character) is treated as a ZERO WIDTH NO-BREAK SPACE.
Without external information it’s impossible to reliably determine which encoding was used for encoding a Unicode string. Each charmap encoding can decode any random byte sequence. However that’s not possible with UTF-8, as UTF-8 byte sequences have a structure that doesn’t allow arbitrary byte sequences. To increase the reliability with which a UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls"utf-8-sig") for its Notepad program: Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence:0xef,0xbb,0xbf) is written. As it’s rather improbable that any charmap encoded file starts with these byte values (which would e.g. map to
LATIN SMALL LETTER I WITH DIAERESISRIGHT-POINTING DOUBLE ANGLE QUOTATION MARKINVERTED QUESTION MARK
in iso-8859-1), this increases the probability that a utf-8-sig encoding can be correctly guessed from the byte sequence. So here the BOM is not used to be able to determine the byte order used for generating the byte sequence, but as a signature that helps in guessing the encoding. On encoding the utf-8-sig codec will write 0xef, 0xbb, 0xbf as the first three bytes to the file. On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file.
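The difference between the plain UTF-8 codec and utf-8-sig, and the point about charmap encodings accepting arbitrary bytes, can be seen in a short session. This is a sketch with illustrative variable names; iso-8859-1 (`latin-1`) is used as the example charmap encoding.

```python
import codecs

text = u"\u6c49a"  # "汉a"

plain = text.encode("utf-8")       # no signature
signed = text.encode("utf-8-sig")  # signature 0xef 0xbb 0xbf comes first
signature_added = (signed == codecs.BOM_UTF8 + plain)

# utf-8-sig skips the signature when decoding, and also accepts data without it.
decoded_signed = signed.decode("utf-8-sig")
decoded_plain = plain.decode("utf-8-sig")

# The plain utf-8 codec keeps the signature as a leading U+FEFF character.
decoded_with_feff = signed.decode("utf-8")

# A charmap codec such as iso-8859-1 decodes any byte sequence,
# including the three signature bytes.
signature_as_latin1 = codecs.BOM_UTF8.decode("iso-8859-1")

# UTF-8 has structure: a lone continuation byte is rejected.
try:
    b"\xba\xba".decode("utf-8")
    utf8_accepts_arbitrary_bytes = True
except UnicodeDecodeError:
    utf8_accepts_arbitrary_bytes = False
```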
Python comes with a number of codecs built-in, either implemented as C functions or with dictionaries as mapping tables. The following table lists the codecs by name, together with a few common aliases, and the languages for which the encoding is likely used. Neither the list of aliases nor the list of languages is meant to be exhaustive. Notice that spelling alternatives that only differ in case or use a hyphen instead of an underscore are also valid aliases; therefore, e.g., 'utf-8' is a valid alias for the 'utf_8' codec.
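Many of the character sets support the same languages. They vary in individual characters (e.g. whether the EURO SIGN is supported or not), and in the assignment of characters to code positions. For the European languages in particular, the following variants typically exist: an ISO 8859 codeset; a Microsoft Windows code page, which is typically derived from an 8859 codeset but replaces control characters with additional graphic characters; an IBM EBCDIC code page; and an IBM PC code page, which is ASCII compatible.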
Codec | Aliases | Languages |
---|---|---|
ascii | 646, us-ascii | English |
big5 | big5-tw, csbig5 | Traditional Chinese |
big5hkscs | big5-hkscs, hkscs | Traditional Chinese |
cp037 | IBM037, IBM039 | English |
cp424 | EBCDIC-CP-HE, IBM424 | Hebrew |
cp437 | 437, IBM437 | English |
cp500 | EBCDIC-CP-BE, EBCDIC-CP-CH, IBM500 | Western Europe |
cp720 | | Arabic |
cp737 | | Greek |
cp775 | IBM775 | Baltic languages |
cp850 | 850, IBM850 | Western Europe |
cp852 | 852, IBM852 | Central and Eastern Europe |
cp855 | 855, IBM855 | Bulgarian, Byelorussian, Macedonian, Russian, Serbian |
cp856 | | Hebrew |
cp857 | 857, IBM857 | Turkish |
cp858 | 858, IBM858 | Western Europe |
cp860 | 860, IBM860 | Portuguese |
cp861 | 861, CP-IS, IBM861 | Icelandic |
cp862 | 862, IBM862 | Hebrew |
cp863 | 863, IBM863 | Canadian |
cp864 | IBM864 | Arabic |
cp865 | 865, IBM865 | Danish, Norwegian |
cp866 | 866, IBM866 | Russian |
cp869 | 869, CP-GR, IBM869 | Greek |
cp874 | | Thai |
cp875 | | Greek |
cp932 | 932, ms932, mskanji, ms-kanji | Japanese |
cp949 | 949, ms949, uhc | Korean |
cp950 | 950, ms950 | Traditional Chinese |
cp1006 | | Urdu |
cp1026 | ibm1026 | Turkish |
cp1140 | ibm1140 | Western Europe |
cp1250 | windows-1250 | Central and Eastern Europe |
cp1251 | windows-1251 | Bulgarian, Byelorussian, Macedonian, Russian, Serbian |
cp1252 | windows-1252 | Western Europe |
cp1253 | windows-1253 | Greek |
cp1254 | windows-1254 | Turkish |
cp1255 | windows-1255 | Hebrew |
cp1256 | windows-1256 | Arabic |
cp1257 | windows-1257 | Baltic languages |
cp1258 | windows-1258 | Vietnamese |
euc_jp | eucjp, ujis, u-jis | Japanese |
euc_jis_2004 | jisx0213, eucjis2004 | Japanese |
euc_jisx0213 | eucjisx0213 | Japanese |
euc_kr | euckr, korean, ksc5601, ks_c-5601, ks_c-5601-1987, ksx1001, ks_x-1001 | Korean |
gb2312 | chinese, csiso58gb231280, euc-cn, euccn, eucgb2312-cn, gb2312-1980, gb2312-80, iso-ir-58 | Simplified Chinese |
gbk | 936, cp936, ms936 | Unified Chinese |
gb18030 | gb18030-2000 | Unified Chinese |
hz | hzgb, hz-gb, hz-gb-2312 | Simplified Chinese |
iso2022_jp | csiso2022jp, iso2022jp, iso-2022-jp | Japanese |
iso2022_jp_1 | iso2022jp-1, iso-2022-jp-1 | Japanese |
iso2022_jp_2 | iso2022jp-2, iso-2022-jp-2 | Japanese, Korean, Simplified Chinese, Western Europe, Greek |
iso2022_jp_2004 | iso2022jp-2004, iso-2022-jp-2004 | Japanese |
iso2022_jp_3 | iso2022jp-3, iso-2022-jp-3 | Japanese |
iso2022_jp_ext | iso2022jp-ext, iso-2022-jp-ext | Japanese |
iso2022_kr | csiso2022kr, iso2022kr, iso-2022-kr | Korean |
latin_1 | iso-8859-1, iso8859-1, 8859, cp819, latin, latin1, L1 | West Europe |
iso8859_2 | iso-8859-2, latin2, L2 | Central and Eastern Europe |
iso8859_3 | iso-8859-3, latin3, L3 | Esperanto, Maltese |
iso8859_4 | iso-8859-4, latin4, L4 | Baltic languages |
iso8859_5 | iso-8859-5, cyrillic | Bulgarian, Byelorussian, Macedonian, Russian, Serbian |
iso8859_6 | iso-8859-6, arabic | Arabic |
iso8859_7 | iso-8859-7, greek, greek8 | Greek |
iso8859_8 | iso-8859-8, hebrew | Hebrew |
iso8859_9 | iso-8859-9, latin5, L5 | Turkish |
iso8859_10 | iso-8859-10, latin6, L6 | Nordic languages |
iso8859_13 | iso-8859-13, latin7, L7 | Baltic languages |
iso8859_14 | iso-8859-14, latin8, L8 | Celtic languages |
iso8859_15 | iso-8859-15, latin9, L9 | Western Europe |
iso8859_16 | iso-8859-16, latin10, L10 | South-Eastern Europe |
johab | cp1361, ms1361 | Korean |
koi8_r | | Russian |
koi8_u | | Ukrainian |
mac_cyrillic | maccyrillic | Bulgarian, Byelorussian, Macedonian, Russian, Serbian |
mac_greek | macgreek | Greek |
mac_iceland | maciceland | Icelandic |
mac_latin2 | maclatin2, maccentraleurope | Central and Eastern Europe |
mac_roman | macroman | Western Europe |
mac_turkish | macturkish | Turkish |
ptcp154 | csptcp154, pt154, cp154, cyrillic-asian | Kazakh |
shift_jis | csshiftjis, shiftjis, sjis, s_jis | Japanese |
shift_jis_2004 | shiftjis2004, sjis_2004, sjis2004 | Japanese |
shift_jisx0213 | shiftjisx0213, sjisx0213, s_jisx0213 | Japanese |
utf_32 | U32, utf32 | all languages |
utf_32_be | UTF-32BE | all languages |
utf_32_le | UTF-32LE | all languages |
utf_16 | U16, utf16 | all languages |
utf_16_be | UTF-16BE | all languages (BMP only) |
utf_16_le | UTF-16LE | all languages (BMP only) |
utf_7 | U7, unicode-1-1-utf-7 | all languages |
utf_8 | U8, UTF, utf8 | all languages |
utf_8_sig | | all languages |
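The alias rules for the table above can be checked with `codecs.lookup()`, which resolves a name or alias to a codec. The sketch below compares the resolved names with each other rather than against a fixed string, since the exact canonical spelling returned is an implementation detail; the variable names are illustrative.

```python
import codecs

# Spellings that differ only in case, hyphen or underscore resolve to one codec.
utf8_names = set(codecs.lookup(n).name
                 for n in ("utf-8", "utf_8", "UTF8", "Utf-8"))

# "cp936" and "936" are listed as aliases of gbk in the table.
gbk_names = set(codecs.lookup(n).name for n in ("gbk", "cp936", "936"))

# The same character gets different bytes under different codecs.
han_gbk = u"\u6c49".encode("gbk")      # two bytes in GBK
han_utf8 = u"\u6c49".encode("utf-8")   # three bytes in UTF-8

# An unknown name raises LookupError.
try:
    codecs.lookup("no-such-codec")
    unknown_found = True
except LookupError:
    unknown_found = False
```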
A number of codecs are specific to Python, so their codec names have no meaning outside Python. Some of them don’t convert from Unicode strings to byte strings, but instead use the property of the Python codecs machinery that any bijective function with one argument can be considered as an encoding.
For the codecs listed below, the result in the “encoding” direction is always a byte string. The result of the “decoding” direction is listed as operand type in the table.
Codec | Aliases | Operand type | Purpose |
---|---|---|---|
base64_codec | base64, base-64 | byte string | Convert operand to MIME base64 |
bz2_codec | bz2 | byte string | Compress the operand using bz2 |
hex_codec | hex | byte string | Convert operand to hexadecimal representation, with two digits per byte |
idna | | Unicode string | Implements RFC 3490, see also encodings.idna |
mbcs | dbcs | Unicode string | Windows only: Encode operand according to the ANSI codepage (CP_ACP) |
palmos | | Unicode string | Encoding of PalmOS 3.5 |
punycode | | Unicode string | Implements RFC 3492 |
quopri_codec | quopri, quoted-printable, quotedprintable | byte string | Convert operand to MIME quoted printable |
raw_unicode_escape | | Unicode string | Produce a string that is suitable as raw Unicode literal in Python source code |
rot_13 | rot13 | Unicode string | Returns the Caesar-cypher encryption of the operand |
string_escape | | byte string | Produce a string that is suitable as string literal in Python source code |
undefined | | any | Raise an exception for all conversions. Can be used as the system encoding if no automatic coercion between byte and Unicode strings is desired. |
unicode_escape | | Unicode string | Produce a string that is suitable as Unicode literal in Python source code |
unicode_internal | | Unicode string | Return the internal representation of the operand |
uu_codec | uu | byte string | Convert the operand using uuencode |
zlib_codec | zip, zlib | byte string | Compress the operand using gzip |
New in version 2.3: The idna and punycode encodings.
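A few of the Python-specific codecs from the table can be tried as follows. This is a sketch under one assumption worth stating: it goes through `codecs.encode()` and `codecs.decode()` with the full codec names instead of `str.encode()`, because that form should also work on Python 3, where byte-string codecs are no longer reachable through `str.encode()`. Variable names are illustrative.

```python
import codecs

data = b"hello"

# Byte string to byte string codecs.
as_base64 = codecs.encode(data, "base64_codec")  # MIME base64, ends with a newline
as_hex = codecs.encode(data, "hex_codec")        # two hex digits per byte
compressed = codecs.encode(data * 100, "zlib_codec")
restored = codecs.decode(compressed, "zlib_codec")

# rot_13 shifts each ASCII letter by 13 places.
rotated = codecs.encode(u"Hello", "rot_13")

# unicode_escape produces the body of a Unicode literal.
escaped = u"\u6c49".encode("unicode_escape")
```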
New in version 2.3.
This module implements RFC 3490 (Internationalized Domain Names in Applications) and RFC 3491 (Nameprep: A Stringprep Profile for Internationalized Domain Names (IDN)). It builds upon the punycode encoding and stringprep.
These RFCs together define a protocol to support non-ASCII characters in domain names. A domain name containing non-ASCII characters (such as www.Alliancefrançaise.nu) is converted into an ASCII-compatible encoding (ACE, such as www.xn--alliancefranaise-npb.nu). The ACE form of the domain name is then used in all places where arbitrary characters are not allowed by the protocol, such as DNS queries, HTTP Host fields, and so on. This conversion is carried out in the application, if possible invisibly to the user: the application should transparently convert Unicode domain labels to IDNA on the wire, and convert ACE labels back to Unicode before presenting them to the user.
Python supports this conversion in several ways: the idna codec converts between Unicode and the ACE. Furthermore, the socket module transparently converts Unicode host names to ACE, so that applications need not be concerned about converting host names themselves when they pass them to the socket module. On top of that, modules that have host names as function parameters, such as httplib and ftplib, accept Unicode host names (httplib then also transparently sends an IDNA hostname in the Host field if it sends that field at all).
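The conversion performed by the idna codec can be sketched with the domain name used as the example above. The non-ASCII letter is written as an escape (`\xe7` for "ç") so that the snippet does not depend on the source file encoding; variable names are illustrative.

```python
# Unicode host name, with "ç" written as \xe7.
name = u"www.Alliancefran\xe7aise.nu"

# Encoding applies nameprep (which lower-cases the label) and punycode
# to every label that contains non-ASCII characters.
ace = name.encode("idna")

# Decoding turns ACE labels back into Unicode. The result is the
# normalized (lower-case) form, not the original spelling.
back = ace.decode("idna")
```

Note that the round trip does not restore the capital "A", because nameprep normalizes case before the label is encoded.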
When receiving host names from the wire (such as in reverse name lookup), no automatic conversion to Unicode is performed: Applications wishing to present such host names to the user should decode them to Unicode.
The module encodings.idna also implements the nameprep procedure, which performs certain normalizations on host names, to achieve case-insensitivity of international domain names, and to unify similar characters. The nameprep functions can be used directly if desired.
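Used directly, the functions in `encodings.idna` work on a single label rather than on a whole host name. The sketch below uses `nameprep()`, `ToASCII()` and `ToUnicode()` from that module; the sample labels and variable names are chosen only for illustration.

```python
import encodings.idna

label = u"Alliancefran\xe7aise"  # one label, "ç" written as \xe7

# nameprep folds case and unifies compatibility characters.
prepared = encodings.idna.nameprep(label)
fullwidth_a = encodings.idna.nameprep(u"\uff41")  # FULLWIDTH LATIN SMALL LETTER A

# ToASCII produces the ACE form of one label, ToUnicode reverses it.
ace_label = encodings.idna.ToASCII(label)
unicode_label = encodings.idna.ToUnicode(ace_label)
```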
New in version 2.5.
This module implements a variant of the UTF-8 codec: On encoding a UTF-8 encoded BOM will be prepended to the UTF-8 encoded bytes. For the stateful encoder this is only done once (on the first write to the byte stream). For decoding an optional UTF-8 encoded BOM at the start of the data will be skipped.
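The "only once" behaviour of the stateful encoder can be observed with an incremental encoder obtained from `codecs.getincrementalencoder()`. This is a minimal sketch with illustrative variable names.

```python
import codecs

encoder = codecs.getincrementalencoder("utf-8-sig")()

# The first call emits the BOM followed by the encoded text.
first = encoder.encode(u"\u6c49")
# Later calls on the same encoder emit only the encoded text.
second = encoder.encode(u"a")

stream = first + second
decoded = stream.decode("utf-8-sig")  # the leading BOM is skipped
```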