image processing - OCR on antialiased text -

May 15, 2010

I have to OCR a table from a PDF document. I wrote a simple Python + OpenCV script to extract the individual cells. After that a new problem arose: the text is antialiased and of poor quality, so the Tesseract recognition rate is very low. I have tried to preprocess the images with adaptive thresholding, but the results were not much better. I also tried the trial version of ABBYY FineReader, and it does produce good output, but I do not want to use non-free software. I wonder whether some preprocessing would solve the problem, or whether it is necessary to write and train another OCR system.

If you look closely at your antialiased text samples, you will see that the edges contain a lot of red and blue:

Enlarged view of antialiased text

This suggests that the antialiasing is happening inside your computer, which has used subpixel rendering to optimize the result for your LCD monitor.

If so, it should be fairly easy to extract the text at a higher resolution. For example, you can use ImageMagick to extract images from PDF files at 300 dpi with a command line like this:

  convert -density 300 source.pdf output.png

You could also try opening the PDF in your favorite viewer and copying the text directly to the clipboard.
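Since the cell extraction is already a Python script, the same ImageMagick call could be driven from Python. This is only a minimal sketch: the function names are hypothetical, the file names are the placeholders from the command above, and it assumes the `convert` executable is on the PATH (ImageMagick 7 installs may expose it as `magick` instead).

```python
import subprocess


def build_convert_command(pdf_path, png_path, dpi=300):
    """Return the ImageMagick argument list for rasterizing a PDF at `dpi`."""
    # Assumption: the executable is called "convert"; adjust if your install differs.
    return ["convert", "-density", str(dpi), pdf_path, png_path]


def rasterize_pdf(pdf_path, png_path, dpi=300):
    """Run ImageMagick and raise CalledProcessError if the conversion fails."""
    subprocess.run(build_convert_command(pdf_path, png_path, dpi), check=True)


# Building the command does not touch the file system, so it is safe to inspect.
command = build_convert_command("source.pdf", "output.png")
print(" ".join(command))
```

Calling `rasterize_pdf("source.pdf", "output.png")` would then produce the high-resolution image to feed into the cell extraction and OCR steps.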
Addendum:

I tried converting your sample text back into its original pixels and applying the scaling technique described in the comments. Here are the results:

Original image:

After applying 300% scaling and a simple threshold:

After smart scaling and thresholding:

As you can see, some of the letters are still slightly off, but I think there is a better chance of reading this with Tesseract.
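The idea of turning subpixel-rendered text back into its original pixels can be sketched roughly as follows. This is not the exact procedure used for the images above. It assumes the subpixel order is R, G, B from left to right, uses plain nested lists in place of OpenCV arrays to stay self-contained, and uses nearest-neighbour repetition as a stand-in for ordinary 300% scaling. All function names and the sample row are hypothetical.

```python
def unpack_subpixels(rgb_rows):
    """Treat each (r, g, b) pixel as three grayscale pixels, tripling the width.

    Assumes the LCD subpixel order is R, G, B from left to right.
    """
    return [[value for pixel in row for value in pixel] for row in rgb_rows]


def naive_scale(rgb_rows, factor=3):
    """Average the channels to gray, then repeat each pixel `factor` times."""
    return [
        [sum(pixel) // 3 for pixel in row for _ in range(factor)]
        for row in rgb_rows
    ]


def threshold(gray_rows, cutoff=128):
    """Map every grayscale value to 0 (ink) or 255 (paper)."""
    return [[0 if value < cutoff else 255 for value in row] for row in gray_rows]


# Hypothetical sample: one row crossing the left edge of a dark stroke.
# The orange fringe pixel (255, 96, 0) is bright on its left and dark on its right.
sample = [[(255, 255, 255), (255, 96, 0), (0, 0, 0)]]

naive = threshold(naive_scale(sample))
smart = threshold(unpack_subpixels(sample))

print(naive)  # the whole fringe pixel turns black
print(smart)  # only the dark subpixels of the fringe turn black
```

In this sample the subpixel version places the stroke edge one subpixel further right than the naive version, which is the extra horizontal detail that the colored fringes carry. Because only the width triples, the height would need to be scaled by the same factor before passing the image to Tesseract.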



















