[aklug] Re: Anyone interested in a job importing 24k printed emails in Juneau/Anchorage into a database?

From: Mark Neyhart <Mark_Neyhart@legis.state.ak.us>
Date: Wed Jun 08 2011 - 16:23:19 AKDT

Arthur Corliss wrote:
> On Wed, 8 Jun 2011, Mark Neyhart wrote:
>
>> This was my first question as well...
>>
>> Does anybody know of a Linux OCR tool which can convert images to
>> text? I've found references to Tesseract, but am not sure if it is
>> actively maintained. I've got a bunch of pages which have been scanned
>> to PDF, and would like to be able to make them searchable.
>
> I've personally used gocr and ocrad with varying levels of success. They're
> not perfect, but with some image preprocessing in ImageMagick you can
> automate all kinds of text extraction from random images.
>
> In your case, however, PDFs are trivial to extract text out of -- assuming
> the PDFs aren't actually just storing images. Even so, I've been able to
> rasterize image PDFs and extract text out of them as well.
>
>
My PDFs are just storing images (came directly from a scanner).
Thanks for the suggestions. I'll look into them.

Were you using ImageMagick just to convert image formats? Or did you
use it to do things like noise removal and conversion from color to
black and white?
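
For reference, here is a rough sketch of the kind of pipeline being
discussed: rasterize a scanned PDF with ImageMagick, convert it to
grayscale, remove some noise, threshold to black and white, then hand
each page to Tesseract. It is not Arthur's actual setup. The function
names, the 300 DPI setting, the 50% threshold and the page-%03d.png
naming are all assumptions for illustration, and it assumes the
`convert` and `tesseract` commands are on the PATH (newer ImageMagick
releases call the binary `magick`).

```python
import subprocess
from pathlib import Path


def preprocess_command(pdf_path, out_pattern, dpi=300, threshold="50%"):
    """Build an ImageMagick command that rasterizes and cleans a scanned PDF.

    The dpi and threshold defaults are illustrative guesses, not tuned values.
    """
    return [
        "convert",
        "-density", str(dpi),      # rasterize the PDF pages at this resolution
        str(pdf_path),
        "-colorspace", "Gray",     # color to grayscale
        "-despeckle",              # light noise removal
        "-threshold", threshold,   # grayscale to pure black and white
        str(out_pattern),          # e.g. page-%03d.png, one file per page
    ]


def ocr_command(image_path, lang="eng"):
    """Build a Tesseract command; Tesseract writes <output base>.txt."""
    image_path = Path(image_path)
    output_base = image_path.with_suffix("")
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_pdf(pdf_path, work_dir):
    """Run both steps and return the recognized text of all pages, in order."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        preprocess_command(pdf_path, work_dir / "page-%03d.png"), check=True
    )
    pages = []
    for image in sorted(work_dir.glob("page-*.png")):
        subprocess.run(ocr_command(image), check=True)
        pages.append(image.with_suffix(".txt").read_text())
    return "\n".join(pages)
```

Calling ocr_pdf("scan.pdf", "ocr-work") would leave one PNG and one
text file per page in ocr-work. The threshold in particular usually
needs adjusting per batch of scans, so treat the defaults as a
starting point only.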