
Official methods for improving Tesseract's recognition accuracy

Posted: 2014-11-29 20:11:25


Improving the quality of the output

There are a variety of reasons you might not get good quality output from Tesseract. It's important to note that unless you're using a very unusual font or a new language, retraining Tesseract is unlikely to help.

DPI

Tesseract works best on images with a resolution of at least 300 dpi, so it may be beneficial to resize images. For more information, see the FAQ.
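As a rough illustration of the resizing arithmetic, the sketch below computes the pixel dimensions an image would need to reach 300 dpi. The function name and the idea of reading the current DPI from the scan metadata are assumptions for this example; the actual resampling would be done in whatever image tool you use.

```python
def size_for_target_dpi(width_px, height_px, current_dpi, target_dpi=300):
    """Return (new_width, new_height) that brings an image up to target_dpi.

    Images already at or above the target are returned unchanged.
    """
    if current_dpi <= 0:
        raise ValueError("current_dpi must be positive")
    if current_dpi >= target_dpi:
        return width_px, height_px
    factor = target_dpi / current_dpi
    return round(width_px * factor), round(height_px * factor)


# A 150 dpi scan needs to be doubled in each dimension.
print(size_for_target_dpi(800, 600, current_dpi=150))
```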

Image processing [that is, preprocessing; OpenCV is not mentioned, so apparently OpenCV is not that well known]

Tesseract does various image processing operations internally (using the Leptonica library) before doing the actual OCR. It generally does a very good job of this, but there will inevitably be cases where it isn't good enough, which can result in a significant reduction in accuracy.

You can see how Tesseract has processed the image by setting the configuration variable tessedit_write_images to true when running Tesseract. If the resulting tessinput.tif file looks problematic, try some of these image processing operations before passing the image to Tesseract, whether with a dedicated scan post-processing tool like Scan Tailor or unpaper, a graphics editor like ImageJ or GIMP, a batch image editor like ImageMagick, or in code using an image processing library like Leptonica.
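One way to set that variable is on the command line. The sketch below only builds the argument list; it assumes a Tesseract build that accepts `-c name=value`, and the file names are placeholders. If your version does not support `-c`, the same variable can go in a config file instead.

```python
def debug_command(image_path, output_base):
    """Build a tesseract command that also writes out the preprocessed image."""
    return [
        "tesseract", image_path, output_base,
        "-c", "tessedit_write_images=true",
    ]


# Placeholder file names; pass the list to subprocess.run() to execute it.
print(" ".join(debug_command("page.png", "page")))
```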

Binarisation [if something like this can be recognised, then business cards and the like are trivial]


Binarisation converts an image to black and white. Tesseract does this internally, but it can make mistakes, particularly if the page background is of uneven darkness.
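To make the idea concrete, here is a minimal pure-Python sketch of Otsu's method, a common global thresholding technique. It is not necessarily what Tesseract does internally, and it assumes the image is available as rows of 0-255 grey values. A single global threshold like this is exactly what tends to fail on unevenly lit pages, where a local (adaptive) threshold is usually a better fit.

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold for an iterable of 0-255 grey values."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0    # number of pixels with value <= t
    sum_dark = 0  # sum of the grey values of those pixels
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        w_light = total - w_dark
        if w_dark == 0:
            continue
        if w_light == 0:
            break
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        # Between-class variance (up to a constant factor).
        var_between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarise(gray, threshold):
    """Map a 2D grey image to pure black (0) and white (255)."""
    return [[0 if p <= threshold else 255 for p in row] for row in gray]


gray = [[10, 12, 200], [205, 10, 210]]
t = otsu_threshold(p for row in gray for p in row)
print(binarise(gray, t))
```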

Noise


Noise is random variation of brightness or colour in an image, which can make the text of the image more difficult to read. Certain types of noise cannot be removed by Tesseract in the binarisation step, which can cause accuracy rates to drop.
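As one illustrative way of removing isolated specks before OCR, the sketch below applies a 3x3 median filter to a grey image held as a list of rows. This is a generic denoising technique chosen for the example, not something the original text prescribes, and real pages are better handled with an image library.

```python
from statistics import median_low


def median_filter_3x3(gray):
    """Return a copy of a 2D grey image with a 3x3 median filter applied.

    Border pixels use only the neighbours that exist.
    """
    h = len(gray)
    w = len(gray[0]) if h else 0
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [
                gray[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
            ]
            # median_low always returns one of the original pixel values.
            row.append(median_low(window))
        out.append(row)
    return out


# A single dark speck on a white background is removed.
speck = [[255, 255, 255], [255, 0, 255], [255, 255, 255]]
print(median_filter_3x3(speck))
```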

Orientation / Skew


This is when a page has been scanned when not straight. The quality of Tesseract's line segmentation reduces significantly if a page is too skewed, which severely impacts the quality of the OCR. To address this, rotate the page image so that the text lines are horizontal.
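If you can pick two points on a text baseline (by hand or from a line detector), the skew angle follows from basic trigonometry. The helper below is a hypothetical sketch of that calculation only; the rotation itself, and the sign convention it expects, depend on the image tool you use.

```python
import math


def skew_angle_degrees(p_left, p_right):
    """Angle of a text baseline relative to horizontal, in degrees.

    Points are (x, y) pixel coordinates with y growing downwards, so a
    positive result means the line slopes down towards the right.
    """
    dx = p_right[0] - p_left[0]
    dy = p_right[1] - p_left[1]
    return math.degrees(math.atan2(dy, dx))


# A baseline that drops 35 px over 1000 px of width.
angle = skew_angle_degrees((0, 100), (1000, 135))
print(f"rotate the page by about {abs(angle):.2f} degrees to level the text")
```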

Borders


Scanned pages often have dark borders around them. These can be erroneously picked up as extra characters, especially if they vary in shape and gradation.
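A simple way to get rid of such borders is to crop them before OCR. The sketch below trims edge rows and columns whose average grey value is dark. The function name and the cut-off of 64 are assumptions for this example, and it will not cope with borders that are broken up or only partly dark.

```python
def crop_dark_borders(gray, dark_below=64):
    """Trim rows and columns from the edges whose mean grey value is dark."""
    def is_dark(values):
        values = list(values)
        return sum(values) / len(values) < dark_below

    # Trim dark rows from the top and bottom.
    top, bottom = 0, len(gray)
    while top < bottom and is_dark(gray[top]):
        top += 1
    while bottom > top and is_dark(gray[bottom - 1]):
        bottom -= 1
    rows = gray[top:bottom]
    if not rows:
        return []

    # Trim dark columns from the left and right of what remains.
    left, right = 0, len(rows[0])
    while left < right and is_dark(r[left] for r in rows):
        left += 1
    while right > left and is_dark(r[right - 1] for r in rows):
        right -= 1
    return [r[left:right] for r in rows]


page = [
    [0, 0, 0, 0, 0],
    [0, 255, 255, 255, 0],
    [0, 255, 0, 255, 0],
    [0, 255, 255, 255, 0],
    [0, 0, 0, 0, 0],
]
print(crop_dark_borders(page))
```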

Segmentation method

By default Tesseract expects a page of text when it segments an image. If you're just seeking to OCR a small region, try a different segmentation mode, using the -psm argument. Note that adding a border to the text may also help; see issue 398. [this mentions a new method for recognising an ROI]
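For example, a cropped region containing a single line of text could be run with page segmentation mode 7. The sketch below only builds the command; the constant names and file names are made up for this example. Newer Tesseract releases spell the flag `--psm`, so the flag is left as a parameter.

```python
# A few commonly used page segmentation modes.
PSM_SINGLE_BLOCK = 6   # a single uniform block of text
PSM_SINGLE_LINE = 7    # a single text line
PSM_SINGLE_WORD = 8    # a single word
PSM_SINGLE_CHAR = 10   # a single character


def ocr_command(image_path, output_base, psm, flag="-psm"):
    """Build a tesseract command using the given page segmentation mode."""
    return ["tesseract", image_path, output_base, flag, str(psm)]


print(" ".join(ocr_command("total_line.png", "total_line", PSM_SINGLE_LINE)))
```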

Dictionaries, word lists, and patterns

By default Tesseract is optimised to recognise sentences of words. If you're trying to recognise something else, like receipts, price lists, or codes, there are a few things you can do to improve the accuracy of your results, as well as double-checking that the appropriate segmentation method is selected.

Disabling the dictionaries Tesseract uses should increase recognition if most of your text isn't dictionary words. They can be disabled by setting both of the configuration variables load_system_dawg and load_freq_dawg to false.

It is also possible to add words to the word list Tesseract uses to help recognition, or to add common character patterns, which can further help to improve accuracy if you have a good idea of the sort of input you expect. This is explained in more detail in the Tesseract manual. [there is a manual; I found it here]

If you know you will only encounter a subset of the characters available in the language, such as only digits, you can use the tessedit_char_whitelist configuration variable. See the FAQ for an example.
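The configuration variables named in this section can be collected in a small config file, one `name value` pair per line. The generator below is a sketch: the function name and the file name `digits_only` are made up for this example, and it assumes a whitelist without spaces.

```python
def build_config(whitelist=None, use_dictionaries=True):
    """Return the text of a Tesseract config file for non-prose input."""
    lines = []
    if not use_dictionaries:
        lines.append("load_system_dawg false")
        lines.append("load_freq_dawg false")
    if whitelist:
        lines.append("tessedit_char_whitelist " + whitelist)
    return "\n".join(lines) + "\n"


# Digits only, with both dictionaries switched off.
config_text = build_config(whitelist="0123456789", use_dictionaries=False)
print(config_text, end="")
```

Saved to a file, the config is passed as the last command-line argument, for example `tesseract receipt.png out digits_only`. Where Tesseract looks for config files varies between versions and installations, so check this against your setup.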

Still having problems?

If you've tried the above and are still getting low accuracy results, ask on the forum for help, ideally posting an example image.


Original article: http://www.cnblogs.com/jsxyhelu/p/4131737.html
