标签:out win 结束 arch 最简 utf-16 other rac notify
爬虫往往会遇到乱码问题。最简单的方法是根据http的响应信息来获取编码信息。但如果对方网站的响应信息不包含编码信息或编码信息错误,那么爬虫取下来的信息就很可能是乱码。
好的解决办法是直接根据页面内容来自动判断页面的编码。如Mozilla公司的firefox使用的universalchardet编码自动检测工具。
juniversalchardet是universalchardet的Java版本。源码开源于 https://github.com/thkoch2001/juniversalchardet
自动编码主要是根据统计学的方法来判断。具体原理,可以看http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html
现在以Java爬虫常用的httpclient来讲解如何使用。看以下关键代码:
http://mvnrepository.com/artifact/com.googlecode.juniversalchardet/juniversalchardet/1.0.3
<!-- https://mvnrepository.com/artifact/com.googlecode.juniversalchardet/juniversalchardet --> <dependency> <groupId>com.googlecode.juniversalchardet</groupId> <artifactId>juniversalchardet</artifactId> <version>1.0.3</version> </dependency>
https://code.google.com/archive/p/juniversalchardet/
juniversalchardet is a Java port of ‘universalchardet‘, that is the encoding detector library of Mozilla.
The original code of universalchardet is available athttp://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/
Techniques used by universalchardet are described athttp://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
Chinese
Cyrillic
Greek
Hebrew
Japanese
Korean
Unicode
Others
1 Currently not supported by Java
Download ``` import org.mozilla.universalchardet.UniversalDetector;
public class TestDetector { public static void main(String[] args) throws java.io.IOException { byte[] buf = new byte[4096]; String fileName = args[0]; java.io.FileInputStream fis = new java.io.FileInputStream(fileName);
// (1)
UniversalDetector detector = new UniversalDetector(null);
// (2)
int nread;
while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
detector.handleData(buf, 0, nread);
}
// (3)
detector.dataEnd();
// (4)
String encoding = detector.getDetectedCharset();
if (encoding != null) {
System.out.println("Detected encoding = " + encoding);
} else {
System.out.println("No encoding detected.");
}
// (5)
detector.reset();
} } ```
The library is subject to the Mozilla Public License Version 1.1. Alternatively, the library may be used under the terms of either the GNU General Public License Version 2 or later, or the GNU Lesser General Public License 2.1 or later.
标签:out win 结束 arch 最简 utf-16 other rac notify
原文地址:http://www.cnblogs.com/lhp2012/p/6888318.html