Or you’ve been presented with a set of scanned PDF documents, where the text is selectable, or worse, a set of PNG images of text documents. If the documents were HTML web pages, you might consider writing a scraper, using the structure of the HTML to help you identify the meaningful elements within each page, and so try to recreate the database that was used to generate the pages. Rather than trying to recreate a database, how about we settle for just getting the text (the sort of thing a search engine might extract from a set of documents so that it can index and search over them, for example)?

So you’ve got a dozen or so crappy Word documents collected over the years in a variety of formats, .docx among them, and perhaps even a PDF or two, listing the biographies of speakers at this or that event, or the members of this or that group (a set of company directors, for example).
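
For the .docx files at least, "just getting the text" can be sketched with nothing more than the Python standard library, because a .docx file is a zip archive with the main body stored as XML in `word/document.xml`. This is a minimal sketch rather than a full solution: the function name `docx_to_text` is made up for illustration, and it assumes you only care about the main document body, not headers, footers, footnotes or comments.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the body text of a .docx file, one line per paragraph.

    Hypothetical helper for illustration; it reads only the main body.
    """
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Each <w:t> element holds one run of text within the paragraph
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

Calling `docx_to_text("speaker_bios.docx")` (a made-up filename) should give back a plain string you could then pass to an indexer. Older binary Word formats and scanned images will not open this way and need different tooling.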

Billing is per hour with a monthly cap, and rates differ by machine spec.

