[aklug] Re: Anyone interested in a job importing 24k printed emails in Juneau/Anchorage into a database?

From: Arthur Corliss <acorliss@nevaeh-linux.org>
Date: Wed Jun 08 2011 - 16:01:20 AKDT

On Wed, 8 Jun 2011, Mark Neyhart wrote:

> This was my first question as well...
>
> Does anybody know of a linux OCR tool which can convert images to
> text? I've found references to Tesseract, but am not sure if it is
> active. I've got a bunch of pages which have been scanned to PDF, and
> would like to be able to make them searchable.

I've personally used gocr and ocrad with varying levels of success. Neither
is perfect, but with some image preprocessing w/ImageMagick you can
automate all kinds of text extraction from random images.
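
A minimal sketch of that kind of pipeline, not necessarily the exact steps
used here. It assumes ImageMagick's `convert` and either `ocrad` or `gocr` are
on the PATH; the grayscale-plus-threshold step, the 50% value, and the helper
names (`preprocess_cmd`, `ocr_cmd`, `ocr_image`) are illustrative choices,
not anything from the thread.

```python
import subprocess
from pathlib import Path


def preprocess_cmd(src, dst):
    """Build an ImageMagick command that flattens a scan to black and white.

    Assumption: a plain 50% threshold is good enough; real scans usually
    need this tuned (or deskewing/despeckling added) per batch.
    """
    return ["convert", str(src), "-colorspace", "Gray",
            "-threshold", "50%", str(dst)]


def ocr_cmd(pnm, engine="ocrad"):
    """Build the OCR command; both tools print recognized text to stdout."""
    if engine == "ocrad":
        return ["ocrad", str(pnm)]
    if engine == "gocr":
        return ["gocr", "-i", str(pnm)]
    raise ValueError(f"unknown OCR engine: {engine}")


def ocr_image(src, engine="ocrad"):
    """Preprocess one image, run OCR on it, and return the text."""
    src = Path(src)
    # Write the cleaned-up copy next to the original as a PNM file,
    # which both ocrad and gocr read directly.
    pnm = src.with_name(src.stem + "_bw.pnm")
    subprocess.run(preprocess_cmd(src, pnm), check=True)
    result = subprocess.run(ocr_cmd(pnm, engine), check=True,
                            capture_output=True, text=True)
    return result.stdout
```

Calling `ocr_image("scan.png")` would leave `scan_bw.pnm` beside the input and
return whatever text the engine recognized. On ImageMagick 7 the `convert`
name may need to be swapped for `magick`.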

In your case, however, PDFs are trivial to extract text out of -- assuming
the PDFs aren't actually just storing images. Even so, I've been able to
rasterize image PDFs and extract text out of them as well.
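
One way to sketch both cases, under stated assumptions: the poppler-utils
tools `pdftotext` and `pdftoppm` are installed, `ocrad` handles the image-only
case, and the 20-character cutoff for "this PDF has a real text layer" is an
arbitrary guess. The function names are hypothetical; the thread does not say
which tools were actually used for this.

```python
import subprocess
import tempfile
from pathlib import Path


def pdftotext_cmd(pdf):
    # "-" sends the extracted text to stdout instead of a .txt file.
    return ["pdftotext", "-layout", str(pdf), "-"]


def rasterize_cmd(pdf, prefix, dpi=300):
    # Writes one grayscale PGM per page, named <prefix>-<page>.pgm.
    return ["pdftoppm", "-r", str(dpi), "-gray", str(pdf), str(prefix)]


def has_text_layer(text, min_chars=20):
    # Image-only PDFs tend to yield nothing but whitespace and form feeds.
    return len("".join(text.split())) >= min_chars


def pdf_to_text(pdf):
    """Try direct extraction first; fall back to rasterize-and-OCR."""
    text = subprocess.run(pdftotext_cmd(pdf), check=True,
                          capture_output=True, text=True).stdout
    if has_text_layer(text):
        return text
    with tempfile.TemporaryDirectory() as tmp:
        subprocess.run(rasterize_cmd(pdf, Path(tmp) / "page"), check=True)
        # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
        pages = sorted(Path(tmp).glob("page-*.pgm"))
        return "\n".join(
            subprocess.run(["ocrad", str(p)], check=True,
                           capture_output=True, text=True).stdout
            for p in pages
        )
```

Trying the cheap path first matters for a large batch: PDFs with an embedded
text layer come out exact, and only the scanned ones pay the OCR cost and its
error rate.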

--Arthur Corliss
Live Free or Die
Received on Wed Jun 8 16:01:29 2011