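The previous article briefly covered installing the tesseract OCR engine and its basic usage. It also noted that restricting the language data with the -l eng parameter improves both recognition accuracy and recognition speed.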
1. Tool 1: Java virtual machine, version 1.8.0_91, 64-bit (from the Oracle website)
2. Tool 2: jTessBoxEditor, version 1.5 (from the jTessBoxEditor website)
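Open jTessBoxEditor and click Tools -> Merge TIFF. Hold down Shift to select the 101 tif files mentioned in the previous article, then save the merged tif in the new directory d:\python\lnypcg\new under the name langyp.fontyp.exp0.tif.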
Note: langyp is the language name I chose and fontyp is the font name I chose. Both are used in the later steps; you can change them to any names you like.
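The file name above follows the pattern [language].[font].exp[number].tif. As an illustration only, the Python sketch below builds and checks names of that shape. The helper names training_name and parse_training_name are hypothetical and are not part of tesseract or jTessBoxEditor.

```python
import re

# <lang>.<font>.exp<N>.<ext>, e.g. langyp.fontyp.exp0.tif
_NAME_RE = re.compile(
    r"^(?P<lang>[^.]+)\.(?P<font>[^.]+)\.exp(?P<num>\d+)\.(?P<ext>tif|box|tr)$"
)


def training_name(lang, font, num=0, ext="tif"):
    """Build a training file name such as langyp.fontyp.exp0.tif."""
    return f"{lang}.{font}.exp{num}.{ext}"


def parse_training_name(name):
    """Split a training file name into (lang, font, num, ext)."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a <lang>.<font>.exp<N>.<ext> name: {name!r}")
    return (match.group("lang"), match.group("font"),
            int(match.group("num")), match.group("ext"))
```

Checking names this way before training can catch a misspelled language or font name early, since the later commands depend on these names matching.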
tesseract langyp.fontyp.exp0.tif langyp.fontyp.exp0 -l eng -psm 7 batch.nochop makebox
D:\python\lnypcg\new>tesseract langyp.fontyp.exp0.tif langyp.fontyp.exp0 -l eng -psm 7 batch.nochop makebox
Tesseract Open Source OCR Engine v3.02 with Leptonica
Page 1 of 101
Page 2 of 101
Page 3 of 101
……
Page 101 of 101

D:\python\lnypcg\new>dir
 Volume in drive D has no label.
 Volume Serial Number is 36D9-CDC7

 Directory of D:\python\lnypcg\new

2016-06-03  14:37    <DIR>          .
2016-06-03  14:37    <DIR>          ..
2016-06-03  14:30             6,327 langyp.fontyp.exp0.box
2016-06-03  13:07           126,056 langyp.fontyp.exp0.tif
               2 File(s)        132,383 bytes
               2 Dir(s)  24,869,994,496 bytes free
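Switch to the Box Editor tab in jTessBoxEditor, click Open, and open the tiff file created earlier, langyp.fontyp.exp0.tif. The tool automatically loads the matching box file. Correct the box file here and save it.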
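For a quick check outside the GUI, a box file can also be inspected with a script. In tesseract 3.x each box line has the form "symbol left bottom right top page". The Python sketch below parses such lines and counts the symbols, which makes unexpected characters easy to spot. The names Box, parse_box_line and symbol_counts are hypothetical helpers written for this illustration.

```python
from collections import Counter, namedtuple

# One entry of a tesseract 3.x box file: symbol left bottom right top page
Box = namedtuple("Box", "symbol left bottom right top page")


def parse_box_line(line):
    """Parse one box line; the five numeric fields are taken from the right."""
    parts = line.rstrip("\r\n").rsplit(" ", 5)
    if len(parts) != 6:
        raise ValueError(f"malformed box line: {line!r}")
    left, bottom, right, top, page = (int(p) for p in parts[1:])
    return Box(parts[0], left, bottom, right, top, page)


def symbol_counts(lines):
    """Count how often each symbol occurs, skipping blank lines."""
    return Counter(parse_box_line(l).symbol for l in lines if l.strip())
```

With a real file, passing open("langyp.fontyp.exp0.box", encoding="utf-8") to symbol_counts would give a tally of the labelled characters; a symbol that should not occur in the samples points to a box that still needs correcting.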
echo fontyp 0 0 0 0 0 >font_properties
D:\python\lnypcg\new>echo fontyp 0 0 0 0 0 >font_properties

D:\python\lnypcg\new>type font_properties
fontyp 0 0 0 0 0
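Each line of font_properties has the form "fontname italic bold fixed serif fraktur", where every flag is 0 or 1. The Python sketch below writes the same file as the echo command above and is meant only as an illustration. The function names are hypothetical, and writing all flags as 0 mirrors the choice made in this article.

```python
def font_properties_line(font, italic=False, bold=False, fixed=False,
                         serif=False, fraktur=False):
    """Return one font_properties entry: <font> <italic> <bold> <fixed> <serif> <fraktur>."""
    flags = (italic, bold, fixed, serif, fraktur)
    return font + " " + " ".join("1" if flag else "0" for flag in flags)


def write_font_properties(path, fonts):
    """Write one all-zero entry per font name to the given file."""
    with open(path, "w", newline="\n") as handle:
        for font in fonts:
            handle.write(font_properties_line(font) + "\n")
```

Calling write_font_properties("font_properties", ["fontyp"]) produces a file whose only line is "fontyp 0 0 0 0 0".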
tesseract langyp.fontyp.exp0.tif langyp.fontyp.exp0 -l eng -psm 7 nobatch box.train
D:\python\lnypcg\new>tesseract langyp.fontyp.exp0.tif langyp.fontyp.exp0 -l eng -psm 7 nobatch box.train
Tesseract Open Source OCR Engine v3.02 with Leptonica
Page 1 of 101
row xheight=8.66667, but median xheight = 10
APPLY_BOXES:
   Boxes read from boxfile: 4
   Found 4 good blobs.
Generated training data for 1 words
……
……
……
Page 101 of 101
row xheight=8.66667, but median xheight = 10
APPLY_BOXES:
   Boxes read from boxfile: 4
   Found 4 good blobs.
Generated training data for 1 words

 Directory of D:\python\lnypcg\new

2016-06-03  16:34    <DIR>          .
2016-06-03  16:34    <DIR>          ..
2016-06-03  16:05                16 font_properties
2016-06-03  14:30             6,327 langyp.fontyp.exp0.box
2016-06-03  13:07           126,056 langyp.fontyp.exp0.tif
2016-06-03  16:20           618,844 langyp.fontyp.exp0.tr
2016-06-03  16:20               202 langyp.fontyp.exp0.txt
               5 File(s)        751,445 bytes
               2 Dir(s)  24,869,101,568 bytes free
unicharset_extractor langyp.fontyp.exp0.box
D:\python\lnypcg\new>unicharset_extractor langyp.fontyp.exp0.box
Extracting unicharset from langyp.fontyp.exp0.box
Wrote unicharset file ./unicharset.

D:\python\lnypcg\new>dir
 Volume in drive D has no label.
 Volume Serial Number is 36D9-CDC7

 Directory of D:\python\lnypcg\new

2016-06-03  16:41    <DIR>          .
2016-06-03  16:41    <DIR>          ..
2016-06-03  16:05                16 font_properties
2016-06-03  14:30             6,327 langyp.fontyp.exp0.box
2016-06-03  13:07           126,056 langyp.fontyp.exp0.tif
2016-06-03  16:20           618,844 langyp.fontyp.exp0.tr
2016-06-03  16:20               202 langyp.fontyp.exp0.txt
2016-06-03  16:41               712 unicharset
               6 File(s)        752,157 bytes
               2 Dir(s)  24,869,171,200 bytes free
shapeclustering -F font_properties -U unicharset -O langyp.unicharset langyp.fontyp.exp0.tr
D:\python\lnypcg\new>shapeclustering -F font_properties -U unicharset -O langyp.unicharset langyp.fontyp.exp0.tr
Reading langyp.fontyp.exp0.tr ...
Building master shape table
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0 1 2 3 4 5 6 7 8 9 10
Stopped with 0 merged, min dist 0.057803
Master shape_table:Number of shapes = 11 max unichars = 1 number with multiple unichars = 0

D:\python\lnypcg\new>dir
 Volume in drive D has no label.
 Volume Serial Number is 36D9-CDC7

 Directory of D:\python\lnypcg\new

2016-06-03  17:24    <DIR>          .
2016-06-03  17:24    <DIR>          ..
2016-06-03  17:20                19 font_properties
2016-06-03  14:30             6,327 langyp.fontyp.exp0.box
2016-06-03  13:07           126,056 langyp.fontyp.exp0.tif
2016-06-03  17:23           618,844 langyp.fontyp.exp0.tr
2016-06-03  17:23               202 langyp.fontyp.exp0.txt
2016-06-03  17:24               723 langyp.unicharset
2016-06-03  17:24               202 shapetable
2016-06-03  17:24               712 unicharset
               8 File(s)        753,085 bytes
               2 Dir(s)  24,868,278,272 bytes free
mftraining -F font_properties -U unicharset -O langyp.unicharset langyp.fontyp.exp0.tr
D:\python\lnypcg\new>mftraining -F font_properties -U unicharset -O langyp.unicharset langyp.fontyp.exp0.tr
Read shape table shapetable of 11 shapes
Reading langyp.fontyp.exp0.tr ...
Done!
cntraining langyp.fontyp.exp0.tr
D:\python\lnypcg\new>cntraining langyp.fontyp.exp0.tr
Reading langyp.fontyp.exp0.tr ...
Clustering ...
rename normproto fontyp.normproto
rename inttemp fontyp.inttemp
rename pffmtable fontyp.pffmtable
rename unicharset fontyp.unicharset
rename shapetable fontyp.shapetable
D:\python\lnypcg\new>rename normproto fontyp.normproto

D:\python\lnypcg\new>rename inttemp fontyp.inttemp

D:\python\lnypcg\new>rename pffmtable fontyp.pffmtable

D:\python\lnypcg\new>rename unicharset fontyp.unicharset

D:\python\lnypcg\new>rename shapetable fontyp.shapetable
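The five renames all add the same prefix, so they can be expressed as one loop. The Python sketch below is an illustration under that assumption. TRAINING_OUTPUTS and add_prefix are hypothetical names, and the file list is simply the five files renamed above.

```python
import os

# The five training outputs renamed in this step.
TRAINING_OUTPUTS = ("normproto", "inttemp", "pffmtable", "unicharset", "shapetable")


def add_prefix(directory, prefix, names=TRAINING_OUTPUTS):
    """Rename each <name> in directory to <prefix>.<name>; return the new paths."""
    missing = [n for n in names if not os.path.exists(os.path.join(directory, n))]
    if missing:
        # Check everything first so a missing file never leaves a half-renamed set.
        raise FileNotFoundError(f"missing training outputs: {missing}")
    renamed = []
    for name in names:
        target = os.path.join(directory, f"{prefix}.{name}")
        # os.replace overwrites an existing target on every platform.
        os.replace(os.path.join(directory, name), target)
        renamed.append(target)
    return renamed
```

For the directory used in this article the call would be add_prefix(r"D:\python\lnypcg\new", "fontyp").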
combine_tessdata fontyp.
D:\python\lnypcg\new>combine_tessdata fontyp.
Combining tessdata files
TessdataManager combined tesseract data files.
Offset for type 0 is -1
Offset for type 1 is 140
Offset for type 2 is -1
Offset for type 3 is 852
Offset for type 4 is 137760
Offset for type 5 is 137850
Offset for type 6 is -1
Offset for type 7 is -1
Offset for type 8 is -1
Offset for type 9 is -1
Offset for type 10 is -1
Offset for type 11 is -1
Offset for type 12 is -1
Offset for type 13 is 139352
Offset for type 14 is -1
Offset for type 15 is -1
Offset for type 16 is -1
D:\python\lnypcg>tesseract 28.tif output -l eng -psm 7
Tesseract Open Source OCR Engine v3.02 with Leptonica

D:\python\lnypcg>type output.txt
S094

#1 With the default eng language, the 8 is recognized as S.

D:\python\lnypcg>tesseract 28.tif output -l fontyp -psm 7
Error opening data file C:\Program Files (x86)\Tesseract-OCR\tessdata/fontyp.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.
Failed loading language 'fontyp'
Tesseract couldn't load any languages!
Could not initialize tesseract.

#2 With the new fontyp language, tesseract cannot find the fontyp language data.

D:\python\lnypcg>copy .\new\fontyp.traineddata "C:\Program Files (x86)\Tesseract-OCR\tessdata"
        1 file(s) copied.

#3 Copy fontyp.traineddata into the tessdata subdirectory of the tesseract installation directory.
D:\python\lnypcg>tesseract 28.tif output -l fontyp -psm 7
Tesseract Open Source OCR Engine v3.02 with Leptonica

D:\python\lnypcg>type output.txt
8094
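The copy and the recognition call can also be scripted. The Python sketch below is a minimal illustration: install_traineddata and recognize_command are hypothetical helper names, the tessdata location must be supplied by the caller, and the single-dash -psm option is the tesseract 3.x spelling used throughout this article.

```python
import os
import shutil


def install_traineddata(traineddata_path, tessdata_dir):
    """Copy a .traineddata file into an existing tessdata directory."""
    if not os.path.isdir(tessdata_dir):
        raise NotADirectoryError(tessdata_dir)
    # shutil.copy2 returns the path of the copied file inside tessdata_dir.
    return shutil.copy2(traineddata_path, tessdata_dir)


def recognize_command(image, output_base, lang, psm=7):
    """Build the argument list for a tesseract 3.x run with the given language."""
    return ["tesseract", image, output_base, "-l", lang, "-psm", str(psm)]
```

The returned list can be passed to subprocess.run on a machine where tesseract is on the PATH; the result is then written to output_base + ".txt", as in the transcript above.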
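In fact, jTessBoxEditor is itself a reasonably complete third-party training tool whose job is to run the commands above automatically. In practice it still has rough edges: for example, it cannot pass the psm parameter, and it often crashes while generating the shape table. For that reason this article works mainly from the command line.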
1. Merge the images
2. Generate the box file
tesseract langyp.fontyp.exp0.tif langyp.fontyp.exp0 -l eng -psm 7 batch.nochop makebox
3. Correct the box file
4. Generate font_properties
echo fontyp 0 0 0 0 0 >font_properties
5. Generate the training file
tesseract langyp.fontyp.exp0.tif langyp.fontyp.exp0 -l eng -psm 7 nobatch box.train
6. Generate the character set file
unicharset_extractor langyp.fontyp.exp0.box
7. Generate the shape file
shapeclustering -F font_properties -U unicharset -O langyp.unicharset langyp.fontyp.exp0.tr
8. Generate the clustered character feature file
mftraining -F font_properties -U unicharset -O langyp.unicharset langyp.fontyp.exp0.tr
9. Generate the character normalization feature file
cntraining langyp.fontyp.exp0.tr
10. Rename the files
rename normproto fontyp.normproto
rename inttemp fontyp.inttemp
rename pffmtable fontyp.pffmtable
rename unicharset fontyp.unicharset
rename shapetable fontyp.shapetable
11. Combine the training files to produce fontyp.traineddata
combine_tessdata fontyp.
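The command-line steps of this summary differ only in the language and font names, so they can be generated from those two values. The Python sketch below builds the argument lists for steps 2, 5, 6, 7, 8, 9 and 11 exactly as they appear above. It is an illustration, not a complete training script: training_commands and run_step are hypothetical names, and the manual steps (merging images, correcting the box file, writing font_properties, renaming) still have to happen in between.

```python
import subprocess


def training_commands(lang, font, num=0, bootstrap_lang="eng", psm=7):
    """Return (step label, argv) pairs for the command-line training steps."""
    base = f"{lang}.{font}.exp{num}"
    tif, box, tr = base + ".tif", base + ".box", base + ".tr"
    page_args = ["-l", bootstrap_lang, "-psm", str(psm)]
    cluster_args = ["-F", "font_properties", "-U", "unicharset",
                    "-O", f"{lang}.unicharset", tr]
    return [
        ("2 makebox", ["tesseract", tif, base] + page_args + ["batch.nochop", "makebox"]),
        ("5 box.train", ["tesseract", tif, base] + page_args + ["nobatch", "box.train"]),
        ("6 unicharset", ["unicharset_extractor", box]),
        ("7 shapeclustering", ["shapeclustering"] + cluster_args),
        ("8 mftraining", ["mftraining"] + cluster_args),
        ("9 cntraining", ["cntraining", tr]),
        # The article uses the font name as the prefix of the combined data.
        ("11 combine", ["combine_tessdata", font + "."]),
    ]


def run_step(argv, cwd):
    """Run one command in the training directory; raise if it exits non-zero."""
    subprocess.run(argv, cwd=cwd, check=True)
```

Printing " ".join(argv) for each pair from training_commands("langyp", "fontyp") reproduces the commands listed in the summary, which is a simple way to check the names before running anything.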