image processing - OCR on antialiased text
I need to OCR a table from a PDF document. I wrote a simple Python + OpenCV script to extract the individual cells. After that, a new problem arose: the text is antialiased and of poor quality, so Tesseract's recognition rate is very low. I tried preprocessing the images with adaptive thresholding, but the results were not much better. I also tried the trial version of ABBYY FineReader, and it does give good output, but I do not want to use non-free software. I wonder whether some preprocessing would solve the problem, or whether I need to write or learn another OCR system.
If you look closely at your antialiased text samples, you will see that the edges contain a lot of red and blue:
This suggests that the antialiasing is happening inside your computer, which has used subpixel rendering to optimize the results for your LCD monitor.
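One rough way to check for such colored fringes programmatically is to count pixels whose red and blue channels differ strongly, since plain grayscale antialiasing keeps them equal. The sketch below is only an illustration: the function name `color_fringe_ratio`, the tolerance of 40, and the hand-made sample are assumptions, and it works on plain nested lists rather than on a real image.

```python
def color_fringe_ratio(pixels, tolerance=40):
    """Return the fraction of pixels whose red and blue channels differ strongly.

    `pixels` is a list of rows, each row a list of (r, g, b) tuples in 0..255.
    Grayscale antialiasing gives r == g == b, so a high ratio hints at
    subpixel rendering.
    """
    total = 0
    fringed = 0
    for row in pixels:
        for r, g, b in row:
            total += 1
            if abs(r - b) > tolerance:
                fringed += 1
    return fringed / total if total else 0.0


# Hand-made sample: a black stroke on white whose left edge is reddish and
# whose right edge is bluish (assumed RGB-striped panel).
white, black = (255, 255, 255), (0, 0, 0)
reddish, bluish = (255, 120, 60), (60, 120, 255)
sample = [[white, reddish, black, bluish, white]] * 3

print(color_fringe_ratio(sample))
```

On a real cell image you would compute the same ratio over the loaded pixel data; a value near zero would suggest ordinary grayscale antialiasing instead.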
If so, extracting the text at a higher resolution should be fairly easy. For example, you can use ImageMagick to render the pages of a PDF file at 300 dpi with a command line like this:
convert -density 300 source.pdf output.png
You could also try loading the PDF in your favorite viewer and copying the text directly to the clipboard.
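Since the question already uses a Python script, the same ImageMagick call could be driven from Python. This is a minimal sketch, assuming ImageMagick is installed and its `convert` program is on the PATH (ImageMagick 7 installations use `magick` as the main command); the helper names `build_convert_command` and `rasterize_pdf` are made up for illustration.

```python
import subprocess


def build_convert_command(pdf_path, out_path, dpi=300):
    """Build the ImageMagick command line that rasterizes a PDF at `dpi`."""
    # -density goes before the input file so it applies while the PDF is read.
    return ["convert", "-density", str(dpi), pdf_path, out_path]


def rasterize_pdf(pdf_path, out_path, dpi=300):
    """Run ImageMagick; raises CalledProcessError if the conversion fails."""
    subprocess.run(build_convert_command(pdf_path, out_path, dpi), check=True)


if __name__ == "__main__":
    # Only prints the command; call rasterize_pdf() to actually run it.
    print(" ".join(build_convert_command("source.pdf", "output.png")))
```

Keeping the command construction separate from the call makes it easy to log or adjust the density before running the conversion.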
Addendum: I tried converting your sample text back into its original pixels and applying the scaling technique described in the comments. Here are the results:
Original image:
After applying 300% scaling and simple threshold:
After smart scaling and thresholding:
As you can see, some letters are still slightly malformed, but I think there is a better chance of reading this with Tesseract.
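The scaling technique from the comments is not reproduced on this page, so the following is only a guess at what the "smart scaling" step might look like: each red, green and blue subpixel is treated as its own gray pixel, which triples the horizontal resolution, and the rows are repeated three times to keep the aspect ratio before a simple global threshold is applied. The function names, the assumed RGB stripe order and the threshold of 128 are all assumptions, and NumPy is used because OpenCV images are NumPy arrays.

```python
import numpy as np


def subpixels_to_gray(bgr, stripe_order="rgb"):
    """Treat each color subpixel as a separate gray pixel (3x width, 3x height)."""
    h, w, _ = bgr.shape
    # OpenCV stores channels as B, G, R; reverse them for an RGB-striped panel.
    ordered = bgr[:, :, ::-1] if stripe_order == "rgb" else bgr
    # Flatten the channel axis into the column axis: (h, w, 3) -> (h, 3 * w).
    wide = ordered.reshape(h, w * 3)
    # Repeat every row three times so the aspect ratio is preserved.
    return np.repeat(wide, 3, axis=0)


def binarize(gray, threshold=128):
    """Simple global threshold: values below `threshold` become 0, others 255."""
    return np.where(gray < threshold, 0, 255).astype(np.uint8)


# One-row example in BGR order: a reddish edge pixel followed by a black pixel.
cell = np.array([[[60, 120, 255], [0, 0, 0]]], dtype=np.uint8)
result = binarize(subpixels_to_gray(cell))
print(result.shape)
print(result[0])
```

With a real image the input would come from `cv2.imread`, and the binarized result could be written out and passed to Tesseract. Some smoothing before the threshold may help, because subpixel renderers usually filter across neighboring subpixels.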