ICTCLAS (Chinese Academy of Sciences segmenter): why is the segmentation result the same after importing a user dictionary?

Posted: 2016-04-18 20:43:54


package ICTCLAS.I3S.Test;

import java.io.UnsupportedEncodingException;

import ICTCLAS.I3S.AC.ICTCLAS50;

public class Test_UserDic {

    /**
     * Segments the same sentence before and after importing a user dictionary.
     *
     * @param args unused
     * @throws UnsupportedEncodingException if UTF-8 is not supported
     */
    public static void main(String[] args) throws UnsupportedEncodingException {
        ICTCLAS50 ictclas = new ICTCLAS50();
        // Initialize
        String argu = ".";    // current directory
        if (!ictclas.ICTCLAS_Init(argu.getBytes("UTF-8"))) {
            System.err.println("Initial fail!");
            return;
        }
        System.out.println("Initial success!");

        String input = "中国科学院计算技术研究所在多年研究工作积累的基础上，研制出了汉语词法分析系统ICTCLAS。千万科学家";

        // Segment before adding the user dictionary
        System.out.println(input);
        ictclas.ICTCLAS_SetPOSmap(ictclas.PKU_POS_MAP_FIRST);
        byte[] nativeBytes = ictclas.ICTCLAS_ParagraphProcess(input.getBytes("UTF-8"), 0, 1);
        String result = new String(nativeBytes, 0, nativeBytes.length, "UTF-8");
        System.out.println("Segmentation before importing the user dictionary:\t" + result);

        // Import the user dictionary
        String userDir = "userDict.txt"; // path to the user dictionary
        byte[] userDirb = userDir.getBytes("UTF-8");
        int count = ictclas.ICTCLAS_ImportUserDictFile(userDirb, 3);
        System.out.println("\nNumber of user words imported:\t" + count);

        // Segment again after importing the user dictionary
        byte[] nativeBytes1 = ictclas.ICTCLAS_ParagraphProcess(input.getBytes("UTF-8"), 0, 1);
        String result1 = new String(nativeBytes1, 0, nativeBytes1.length, "UTF-8");
        System.out.println("Segmentation after importing the user dictionary:\t" + result1);

        // Exit and release the segmenter's resources
        ictclas.ICTCLAS_Exit();
    }
}
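One thing that might be worth ruling out is the encoding of userDict.txt itself. The call above passes 3 as the encoding argument of ICTCLAS_ImportUserDictFile; assuming that value means UTF-8 (I have not confirmed this), a file saved in GBK, or one that starts with a byte order mark, could be imported without the words matching anything. The sketch below is a standalone check and does not call ICTCLAS at all; the class and method names are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of ICTCLAS.
class DictEncodingCheck {

    /** Returns true if the bytes decode as strict UTF-8 (malformed input is reported, not replaced). */
    static boolean isValidUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Returns true if the data starts with the UTF-8 byte order mark EF BB BF. */
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    public static void main(String[] args) throws Exception {
        // Defaults to the dictionary path used in the test program.
        String path = args.length > 0 ? args[0] : "userDict.txt";
        byte[] data = Files.readAllBytes(Paths.get(path));
        System.out.println("valid UTF-8: " + isValidUtf8(data));
        System.out.println("has BOM:     " + hasUtf8Bom(data));
    }
}
```

If the file turns out not to be valid UTF-8, re-saving it in the encoding that matches the import argument would be the first thing to try.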

The user dictionary is as follows:
舟曲县城@@ZQXC
连夜@@LY
中国科学院@@v
工作@@t
研究@@nb
国科@t
万科@y
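Looking at the entries above, the first five use "@@" between the word and the tag, while the last two use a single "@". Assuming the intended format is word@@tag (which is only what the first five lines suggest, not something I have verified against the ICTCLAS documentation), a small standalone check like the following sketch would flag the inconsistent lines. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of ICTCLAS.
class UserDictLint {

    /**
     * Returns the non-empty lines that are NOT of the form word@@tag,
     * i.e. exactly one "@@" separator with text on both sides and no other '@'.
     */
    static List<String> suspiciousLines(List<String> lines) {
        List<String> bad = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // ignore blank lines
            }
            int sep = line.indexOf("@@");
            boolean ok = sep > 0                          // word part is non-empty
                    && sep + 2 < line.length()            // tag part is non-empty
                    && line.indexOf('@') == sep           // no '@' before the separator
                    && line.indexOf('@', sep + 2) < 0;    // no '@' after the separator
            if (!ok) {
                bad.add(line);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        List<String> dict = Arrays.asList(
                "舟曲县城@@ZQXC", "连夜@@LY", "中国科学院@@v",
                "工作@@t", "研究@@nb", "国科@t", "万科@y");
        System.out.println(suspiciousLines(dict)); // prints [国科@t, 万科@y]
    }
}
```

Note that this would only explain the behavior of "国科" and "万科"; the other entries already use "@@" and their tags did not change either.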

The output is:
Initial success!
中国科学院计算技术研究所在多年研究工作积累的基础上，研制出了汉语词法分析系统ICTCLAS。千万科学家
Segmentation before importing the user dictionary: 中国科学院/n 计算技术/n 研究/v 所/u 在/v 多年/m 研究/v 工作/v 积累/v 的/u 基础/n 上/f ，/w 研制/v 出/v 了/u 汉语/n 词法分析/n 系统/n ICTCLAS/x 。/w 千/m 万/m 科学家/n

Number of user words imported: 7
Segmentation after importing the user dictionary: 中国科学院/n 计算技术/n 研究/v 所/u 在/v 多年/m 研究/v 工作/v 积累/v 的/u 基础/n 上/f ，/w 研制/v 出/v 了/u 汉语/n 词法分析/n 系统/n ICTCLAS/x 。/w 千/m 万/m 科学家/n

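Nothing changed!
I have seen posts online saying that user-dictionary words take priority. One of them put it this way: "2. The priority of user-dictionary words seems too high. I added the word “万科” to the user dictionary, and as a result the test sentence “千万科学家” was also split into “千/ 万科/ 学/ 家”."
So why is my segmentation result exactly the same, even though all 7 words were reported as imported?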
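To make the "no change" observation concrete, here is a rough sketch that looks up each dictionary word in the tagged output and prints the tag I put in the dictionary next to the tag that actually came back. It assumes the output tokens are whitespace-separated and shaped like word/tag, as in the output above; the class and method names are hypothetical, and for "国科" and "万科" I am treating the text after the single "@" as the intended tag.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for inspecting segmenter output; not part of ICTCLAS.
class SegInspect {

    /**
     * Returns the tag of the first token whose word equals the given word,
     * or null if the word never appears as a whole token.
     */
    static String observedTag(String tagged, String word) {
        for (String token : tagged.trim().split("\\s+")) {
            int slash = token.lastIndexOf('/');
            if (slash <= 0) {
                continue; // token has no word part or no tag, skip it
            }
            if (token.substring(0, slash).equals(word)) {
                return token.substring(slash + 1);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Tagged output copied from the run after importing the dictionary.
        String result = "中国科学院/n 计算技术/n 研究/v 所/u 在/v 多年/m 研究/v 工作/v "
                + "积累/v 的/u 基础/n 上/f ，/w 研制/v 出/v 了/u 汉语/n 词法分析/n "
                + "系统/n ICTCLAS/x 。/w 千/m 万/m 科学家/n";

        // Dictionary words that occur in the input sentence, with the tags from userDict.txt.
        Map<String, String> dict = new LinkedHashMap<>();
        dict.put("中国科学院", "v");
        dict.put("工作", "t");
        dict.put("研究", "nb");
        dict.put("国科", "t");
        dict.put("万科", "y");

        for (Map.Entry<String, String> e : dict.entrySet()) {
            String seen = observedTag(result, e.getKey());
            System.out.println(e.getKey() + ": dictionary tag /" + e.getValue()
                    + ", observed " + (seen == null ? "(not a token)" : "/" + seen));
        }
    }
}
```

Comparing tags as well as token boundaries matters here because three of the dictionary words already appear as tokens in the output, so a changed tag would be the only visible sign that the dictionary had any effect on them.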


Original post: http://www.cnblogs.com/liuchaogege/p/5405562.html
