stanford corenlp自定义切词类

时间：2016-12-09 16:42:04 阅读：746 评论：0 收藏：0 [点我收藏+]

标签：readline rbo 中文 initial nts nan ati rpo new

stanford corenlp的中文切词有时不尽如意，那我们就需要实现一个自定义切词类，来完全满足我们的私人定制（加各种词典干预）。上篇文章《IKAnalyzer》介绍了IKAnalyzer的自由度，本篇文章就说下怎么把IKAnalyzer作为corenlp的切词工具。

《stanford corenlp的TokensRegex》提到了corenlp的配置CoreNLP-chinese.properties，其中customAnnotatorClass.segment就是用于指定切词类的，在这里我们只需要模仿ChineseSegmenterAnnotator来实现一个自己的Annotator，并设置在配置文件中即可。

customAnnotatorClass.segment = edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator

下面是我的实现：

public class IKSegmenterAnnotator extends ChineseSegmenterAnnotator {
    public IKSegmenterAnnotator() {
        super();
    }

    public IKSegmenterAnnotator(boolean verbose) {
        super(verbose);
    }

    public IKSegmenterAnnotator(String segLoc, boolean verbose) {
        super(segLoc, verbose);
    }

    public IKSegmenterAnnotator(String segLoc, boolean verbose, String serDictionary, String sighanCorporaDict) {
        super(segLoc, verbose, serDictionary, sighanCorporaDict);
    }

    public IKSegmenterAnnotator(String name, Properties props) {
        super(name, props);
    }

    private List<String> splitWords(String str) {
        try {
            List<String> words = new ArrayList<String>();
            IKSegmenter ik = new IKSegmenter(new StringReader(str), true);
            Lexeme lex = null;
            while ((lex = ik.next()) != null) {
                words.add(lex.getLexemeText());
            }
            return words;
        } catch (IOException e) {
            //LOGGER.error(e.getMessage(), e);
            System.out.println(e);
            List<String> words = new ArrayList<String>();
            words.add(str);
            return words;
        }
    }

    @Override
    public void runSegmentation(CoreMap annotation) {
        //0 2
        // A BC D E
        // 1 10 1 1
        // 0 12 3 4
        // 0, 0+1 ,

        String text = annotation.get(CoreAnnotations.TextAnnotation.class);
        List<CoreLabel> sentChars = annotation.get(ChineseCoreAnnotations.CharactersAnnotation.class);
        List<CoreLabel> tokens = new ArrayList<CoreLabel>();
        annotation.set(CoreAnnotations.TokensAnnotation.class, tokens);

        //List<String> words = segmenter.segmentString(text);
        List<String> words = splitWords(text);
        System.err.println(text);
        System.err.println("--->");
        System.err.println(words);

        int pos = 0;
        for (String w : words) {
            CoreLabel fl = sentChars.get(pos);
            fl.set(CoreAnnotations.ChineseSegAnnotation.class, "1");
            if (w.length() == 0) {
                continue;
            }
            CoreLabel token = new CoreLabel();
            token.setWord(w);
            token.set(CoreAnnotations.CharacterOffsetBeginAnnotation.class, fl.get(CoreAnnotations.CharacterOffsetBeginAnnotation.class));
            pos += w.length();
            fl = sentChars.get(pos - 1);
            token.set(CoreAnnotations.CharacterOffsetEndAnnotation.class, fl.get(CoreAnnotations.CharacterOffsetEndAnnotation.class));
            tokens.add(token);
        }
    }
}

在外面为IKAnalyzer初始化词典，指定扩展词典和删除词典

        //为ik初始化词典，删除干扰词
        Dictionary.initial(DefaultConfig.getInstance());
        String delDic = System.getProperty(READ_IK_DEL_DIC, null);
        BufferedReader reader = new BufferedReader(new FileReader(delDic));
        String line = null;
        List<String> delWords = new ArrayList<>();
        while ((line = reader.readLine()) != null) {
            delWords.add(line);
        }
        Dictionary.getSingleton().disableWords(delWords);

stanford corenlp自定义切词类

标签：readline rbo 中文 initial nts nan ati rpo new

原文地址：http://www.cnblogs.com/whuqin/p/6149742.html

踩

(0)

评论一句话评论（0）

分享档案

更多>

2021年07月29日 (22)
2021年07月28日 (40)
2021年07月27日 (32)
2021年07月26日 (79)
2021年07月23日 (29)
2021年07月22日 (30)
2021年07月21日 (42)
2021年07月20日 (16)
2021年07月19日 (90)
2021年07月16日 (35)

周排行