java charset detector

时间：2015-03-05 09:10:39 阅读：191 评论：0 收藏：0 [点我收藏+]

标签：

https://code.google.com/p/juniversalchardet/downloads/list

java移植mozilla的编码自动检测库（源码为c++）,准确率高。

通过svn签出只读版本的代码：

# Non-members may check out a read-only working copy anonymously over HTTP.
svn checkout http://juniversalchardet.googlecode.com/svn/trunk/ juniversalchardet-read-only

package myjava;

import java.io.File;
import java.io.IOException;

import org.mozilla.universalchardet.UniversalDetector;

public class TestDetector {
    public static void main(String[] args) throws java.io.IOException {
        String folder = "/home/hadoop/test/charset/";
        File file = new File(folder);
        for (File _file : file.listFiles())
            detectCharset(_file.getAbsolutePath());
    }

    static void detectCharset(String fileName) throws IOException {
        byte[] buf = new byte[4096];
        java.io.FileInputStream fis = new java.io.FileInputStream(fileName);

        // (1)
        UniversalDetector detector = new UniversalDetector(null);

        // (2)
        int nread;
        while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
            detector.handleData(buf, 0, nread);
        }
        // (3)
        detector.dataEnd();

        // (4)
        String encoding = detector.getDetectedCharset();
        if (encoding != null) {
            System.out.println("Detected encoding = " + encoding);
        } else {
            System.out.println("No encoding detected.");
        }

        // (5)
        detector.reset();
    }
}

可以结合另外一个java的字符集检测库来保证更好的结果，因为对于短文来说,上面的检测方法可能无法得出结论。

同时因为这个算法来自于mozilla,它应该能更好地作用于html等标签文件的检测。

http://cpdetector.sourceforge.net/usage.shtml

java charset detector

标签：

原文地址：http://www.cnblogs.com/huaxiaoyao/p/4314809.html

踩

(0)

评论一句话评论（0）

分享档案

更多>

2021年07月29日 (22)
2021年07月28日 (40)
2021年07月27日 (32)
2021年07月26日 (79)
2021年07月23日 (29)
2021年07月22日 (30)
2021年07月21日 (42)
2021年07月20日 (16)
2021年07月19日 (90)
2021年07月16日 (35)

周排行