[aklug] Anyone interested in a job importing 24k printed emails in Juneau/Anchorage into a database?

From: Jason McEachen <jason@brightshinyobject.com>
Date: Wed Jun 08 2011 - 14:55:39 AKDT

This Friday at 9am the State of Alaska is going to have a couple of
boxes of printed emails ready for me in Juneau, along with a hand truck
to help carry them. We could also pick them up at the Anchorage Airport
at 3pm.

What I'd like to do is somehow import them into a database and set up a
quick and easy web-based interface to allow searches.

The problem is my first child is coming into this world that morning at
Providence. My wife doesn't like the idea of me either being in Juneau
to receive and scan/process these docs or sitting at a machine that
morning to write up a script to pull scans, parse them, and populate
some tables.

So we (AlaskaDispatch.com) are possibly interested in hiring someone to
help us with this project.

If you think this is a neat intellectual exercise, please respond to the
group with your ideas or suggestions.

If you're interested in doing this professionally (or can recommend
someone), please contact me directly and let me know how you'd propose
to do it and what you'd bill.

My first thought is to find someone in Juneau (FedEx/Kinko's) with a big
copier/scanner who can convert paper to PDF really quickly (there are,
after all, ~24 thousand pages) and ftp/sftp them up to a server that
already has nice pdf->text tools (maybe pdftohtml?) and "your favorite
script language interpreter" to parse them into a table (probably only
need fields like index, datetime, from, to, cc, bcc, subject, body,
attachments), with a little web front end waiting for search/display.
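
To make the parse-and-load step above a bit more concrete, here is a
rough sketch in Python (only one possible choice of script language).
It assumes each email has already been turned into plain text. A
scanned PDF is normally just page images, so unless the scanner adds a
text layer an OCR pass would be needed before pdftohtml or similar has
anything to extract. The table layout, the column names (sender and
recipients stand in for from and to, which are SQL keywords), and the
helper names parse_email_text, create_table, store, and search are all
made up for illustration. It also assumes the printouts start with
"From:", "To:", "Sent:" or "Date:", "Subject:" style lines followed by
a blank line, which real OCR output may not honor.

```python
import re
import sqlite3

# Assumed mapping from printed header labels to table columns.
HEADER_TO_COLUMN = {
    "date": "sent_at", "sent": "sent_at",
    "from": "sender", "to": "recipients",
    "cc": "cc", "bcc": "bcc",
    "subject": "subject", "attachments": "attachments",
}
COLUMNS = ("sent_at", "sender", "recipients", "cc", "bcc",
           "subject", "attachments", "body")
HEADER_RE = re.compile(r"^(%s):\s*(.*)$" % "|".join(HEADER_TO_COLUMN),
                       re.IGNORECASE)


def parse_email_text(text):
    """Split the text of one email into header fields and a body.

    Header lines are read until the first blank line that follows at
    least one recognized header; everything after that is the body.
    Unrecognized lines inside the header block are ignored.
    """
    record = dict.fromkeys(COLUMNS, "")
    lines = text.splitlines()
    body_start = len(lines)
    seen_header = False
    for i, line in enumerate(lines):
        match = HEADER_RE.match(line.strip())
        if match:
            seen_header = True
            column = HEADER_TO_COLUMN[match.group(1).lower()]
            record[column] = match.group(2).strip()
        elif seen_header and not line.strip():
            body_start = i + 1
            break
    record["body"] = "\n".join(lines[body_start:]).strip()
    return record


def create_table(conn):
    """Create the (hypothetical) emails table if it does not exist."""
    columns_sql = ", ".join(c + " TEXT" for c in COLUMNS)
    conn.execute("CREATE TABLE IF NOT EXISTS emails "
                 "(id INTEGER PRIMARY KEY, %s)" % columns_sql)


def store(conn, record):
    """Insert one parsed email and return its row id (the index)."""
    placeholders = ", ".join("?" for _ in COLUMNS)
    cursor = conn.execute(
        "INSERT INTO emails (%s) VALUES (%s)"
        % (", ".join(COLUMNS), placeholders),
        [record[c] for c in COLUMNS])
    conn.commit()
    return cursor.lastrowid


def search(conn, term):
    """Return (id, sent_at, sender, subject) rows matching the term.

    Uses a plain substring match on subject and body; a full-text
    index would likely be worth adding for ~24 thousand pages.
    """
    like = "%" + term + "%"
    return conn.execute(
        "SELECT id, sent_at, sender, subject FROM emails "
        "WHERE subject LIKE ? OR body LIKE ? ORDER BY id",
        (like, like)).fetchall()
```

A web front end would then only need to call something like
search(conn, term) and render the rows. The date is kept as the raw
printed string here, since the format on the printouts is unknown.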

Has anyone on the list handled a similar task and can share what worked
and what didn't?

Thanks for your help,

--Jason
---------
Received on Wed Jun 8 14:56:21 2011