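Both the training and test data come from the 1998 People's Daily annotated corpus, split at a training-to-test ratio of about 10:1. Tagging parts of speech directly with CRF++ gives an accuracy of 0.933882. There are more than ten million features, so training takes a long time: on a machine with a 48-core CPU, with the CRF++ thread count -p set to 40, training took roughly seven hours.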
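The corpus, the Python script that generates the training data, the training log, the model, and the accuracy script have all been uploaded to a network drive and can be downloaded directly: click here to download CRF++ POS tagging. The programs ran successfully on CentOS 6.5 with Python 2.7. On Windows or Ubuntu you may hit errors, usually small issues such as encoding or path conventions, which should be easy to track down by stepping through the scripts line by line. Also make sure that CRF++ itself compiles without problems on your machine. Each step is described below.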
The script that generates the training and test data is get_post_train_test_data.py. It prints some debugging information while it runs.
```python
#coding=utf8
# Python 2.7 script: split the People's Daily corpus into CRF++ train/test files.

#home_dir = "D:/source/NLP/people_daily//"
home_dir = "./"

def saveDataFile(trainobj, testobj, isTest, word, handle):
    # Route the token to the test file or to the training file.
    if isTest:
        saveTrainFile(testobj, word, handle)
    else:
        saveTrainFile(trainobj, word, handle)

def saveTrainFile(fiobj, word, handle):
    # One "word<TAB>tag" row per token; an empty line ends a sequence.
    # "。" and "," act as sequence boundaries and are not written as tokens
    # (use the full-width comma here if that is what your corpus contains).
    if len(word) > 0 and word != "。" and word != ",":
        fiobj.write(word + '\t' + handle + '\n')
    else:
        fiobj.write('\n')

def convertTag():
    fiobj = open(home_dir + 'people-daily.txt', 'r')
    trainobj = open(home_dir + 'train.data', 'w')
    testobj = open(home_dir + 'test.data', 'w')

    i = 0
    # Read the corpus file itself (the original looped over sys.stdin
    # even though it had already opened people-daily.txt).
    for a in fiobj:
        i += 1
        a = a.strip('\r\n\t ')
        if a == "":
            continue
        words = a.split(" ")
        # Every tenth line goes to the test set.
        test = (i % 10 == 0)
        # words[0] is the line identifier, so skip it.
        for word in words[1:]:
            print "---->", word
            word = word.strip('\t ')
            if len(word) == 0:
                continue
            # Strip the brackets of compound entities such as "[a/n b/n]nt".
            i1 = word.find('[')
            if i1 >= 0:
                word = word[i1+1:]
            i2 = word.find(']')
            if i2 > 0:
                word = word[:i2]
            word_hand = word.split('/')
            print "----", word
            w, h = word_hand
            if h == 'nr':  # person name
                if w.find('·') >= 0:
                    # Split transliterated names on the middle dot.
                    for tmp in w.split('·'):
                        saveDataFile(trainobj, testobj, test, tmp, h)
                    continue
            saveDataFile(trainobj, testobj, test, w, h)
        # Blank line after each corpus line.
        saveDataFile(trainobj, testobj, test, "", "")

    fiobj.close()
    trainobj.close()
    testobj.close()

if __name__ == '__main__':
    convertTag()
```
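To make the token handling in convertTag concrete, here is a minimal sketch of the same parsing applied to a single line. It is written for Python 3 rather than the Python 2.7 used in this post, and the function name parse_corpus_line and the sample line are made up for illustration; neither is part of the downloadable scripts. Unlike the script, it returns the pairs instead of writing them, and it does not turn "。" into a blank line.

```python
def parse_corpus_line(line):
    """Return (word, tag) pairs for one corpus line, mirroring convertTag."""
    pairs = []
    tokens = line.strip('\r\n\t ').split(' ')
    for token in tokens[1:]:              # tokens[0] is the line identifier
        token = token.strip('\t ')
        if not token:
            continue
        start = token.find('[')           # opening bracket of a compound entity
        if start >= 0:
            token = token[start + 1:]
        end = token.find(']')             # closing bracket plus entity tag
        if end > 0:
            token = token[:end]
        word, tag = token.split('/')
        if tag == 'nr' and '·' in word:
            # Transliterated person names are split on the middle dot.
            pairs.extend((part, tag) for part in word.split('·'))
        else:
            pairs.append((word, tag))
    return pairs


# Hypothetical sample line in the corpus's "word/tag" layout.
sample = "19980101-01-001-001/m  [中央/n  人民/n  广播/vn  电台/n]nt  播出/v  。/w"
for word, tag in parse_corpus_line(sample):
    print(word + '\t' + tag)
```

Each printed row has the "word TAB tag" shape that train.data and test.data use.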
The feature template is set to:
```
# Unigram
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
U05:%x[-1,0]/%x[0,0]
U06:%x[0,0]/%x[1,0]
```
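Each %x[row,col] macro picks the value at a row relative to the current token and at an absolute column. The data here has a single feature column (the word), so col is always 0 and the template covers a window of two words on either side plus two word bigrams. The Python 3 sketch below imitates that expansion for one position. The function expand_template is hypothetical, and the "_B-1" / "_B+1" strings used for positions outside the sentence are my approximation of how CRF++ pads boundaries, so treat the exact output strings as illustrative.

```python
import re

TEMPLATE = [
    "U00:%x[-2,0]",
    "U01:%x[-1,0]",
    "U02:%x[0,0]",
    "U03:%x[1,0]",
    "U04:%x[2,0]",
    "U05:%x[-1,0]/%x[0,0]",
    "U06:%x[0,0]/%x[1,0]",
]

MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")


def expand_template(rows, pos, template=TEMPLATE):
    """Expand every template line for the token at index pos.

    rows is a list of feature rows, one per token, e.g. [["中央"], ["人民"]].
    """
    def lookup(match):
        row = pos + int(match.group(1))
        col = int(match.group(2))
        if row < 0:                       # before the start of the sentence
            return "_B%d" % row
        if row >= len(rows):              # past the end of the sentence
            return "_B+%d" % (row - len(rows) + 1)
        return rows[row][col]

    return [MACRO.sub(lookup, line) for line in template]


rows = [["中央"], ["人民"], ["广播"], ["电台"]]
for feature in expand_template(rows, 0):
    print(feature)
```

With millions of distinct words in the window positions, this kind of expansion is what produces the very large feature count mentioned at the top of the post.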
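When training, set the -p (thread count) parameter to suit your own machine. In the commands below, -f 3 discards features that occur fewer than 3 times and -c 4.0 sets the cost hyperparameter.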
```
crf_learn -f 3 -p 4 -c 4.0 template train.data model > train.rst
crf_test -m model test.data > test.rst
```
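If you prefer not to hard-code the thread count, one option is to derive -p from the number of CPUs the machine reports. The Python 3 sketch below only builds the crf_learn argument list and does not run it; the helper name crf_learn_command and its defaults are assumptions for illustration, taken from the command shown above.

```python
import os


def crf_learn_command(threads=None, freq=3, cost=4.0,
                      template="template", train="train.data", model="model"):
    """Build the crf_learn argument list; threads defaults to the CPU count."""
    if threads is None:
        threads = os.cpu_count() or 1     # os.cpu_count() may return None
    return ["crf_learn", "-f", str(freq), "-p", str(threads),
            "-c", str(cost), template, train, model]


print(" ".join(crf_learn_command(threads=4)))
```

The resulting list can be handed to subprocess.run, with stdout redirected to train.rst, once CRF++ is installed and on the PATH.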
Compute the accuracy by running the Python script with the command: python clc_f.py test.rst. The code in clc_f.py is:
```python
#!/usr/bin/python
# -*- coding: utf-8 -*-
# Python 2.7 script: token-level accuracy of a crf_test result file.

import sys

if __name__ == "__main__":
    try:
        result_file = open(sys.argv[1], "r")
    except (IndexError, IOError):
        print "result file is not specified, or open failed!"
        sys.exit()

    wc = 0             # number of tagged tokens
    wc_of_correct = 0  # tokens whose predicted tag equals the gold tag

    for l in result_file:
        if not l.strip():
            continue   # blank lines separate sequences
        # Columns: word, gold tag, predicted tag.
        _, g, r = l.strip().split()
        wc += 1
        if r == g:
            wc_of_correct += 1

    result_file.close()

    print "WordCount from result:", wc
    print "WordCount of correct POS tags:", wc_of_correct

    # accuracy
    P = wc_of_correct / float(wc)
    print "Accuracy: %f" % (P)
```
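crf_test appends the predicted tag as the last column of each input row, which is why the script compares the final two columns. The following Python 3 sketch shows the same computation on a few made-up rows; the function name tag_accuracy and the sample data are illustrative and are not taken from the real test.rst.

```python
def tag_accuracy(lines):
    """Return (total, correct) over rows of 'word gold predicted'."""
    total = 0
    correct = 0
    for line in lines:
        fields = line.split()
        if len(fields) < 3:               # blank separator or malformed row
            continue
        gold, predicted = fields[-2], fields[-1]
        total += 1
        if gold == predicted:
            correct += 1
    return total, correct


# Hypothetical crf_test output: one wrong tag out of four tokens.
sample_output = [
    "中央\tn\tn",
    "人民\tn\tn",
    "广播\tvn\tv",
    "",
    "播出\tv\tv",
]
total, correct = tag_accuracy(sample_output)
print("tokens: %d, correct: %d, accuracy: %f" % (total, correct, correct / total))
```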
Original article: http://www.cnblogs.com/DjangoBlog/p/7448232.html