How I turned a book into a website

An experiment

This site is an experiment in bringing old paper books online. I was curious as to whether some technical books would be better published for free as websites than languishing out of print. It's a long-term experiment as I know it takes a while for SEO and Adsense to settle down.

A family friend held the copyright to a book about cars that went out of print about 15 years ago. It was produced mainly as promotional material for a large brand in the UK, and given away free with vouchers from petrol stations.

He agreed to let me publish the book as a website, but all I had to work from was an original copy of the book. I bought the domains howacarworks.com and FixingACar.com (which I have yet to use). Here's the process I followed:

Scanning

I cut the pages out of the book along the spine, and then loaded them all into a Fujitsu ScanSnap S1500 which I'd already bought to digitise my documents following a post by Ryan Waggoner.

Scanning was a breeze. The ScanSnap is a beast and devours paper at a phenomenal speed. I'd say the entire book was scanned within half an hour.

OCR

Now I had 250+ numbered JPG images on my hard drive and needed to convert them into HTML pages. This was quite a stumbling block.

Each page contained a single article on a specific part of the car, which made it far easier to split into separate articles than a book where sections run across multiple pages.

I tried quite a few OCR apps but in the end OmniPage seemed to produce the best results - it made very, very few mistakes in recognising words and was able to spit out the HTML I needed (albeit as DISGUSTING HTML 1, complete with FONT tags).

Recognising the text was much easier than I thought, despite it being in multiple columns and interspersed with image captions. OmniPage automatically zones the various areas of the page, and I then manually tweaked it to only scan the title and body of the page, and to save the rest as separate images.

Outputting the text and images

OmniPage produces absolutely shocking HTML - it's like something I'd have written in the Geocities era. Here's a snippet:

<P style=" background-color: #FFFFFF; text-align: left; text-indent: 0px; line-height: normal ; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px;" ><SPAN style="font-family: Times New Roman,serif; font-style: normal; font-weight: normal; font-size: 12pt; letter-spacing: 0.370000pt;" >To unfasten a metal clip, use a screwdriver to ease it from its slot, or hook it back with a bent split pin. </SPAN></P><LEFT>
<IMG src="45_Picture3.jpg" alt="Picture" align="top" height="432" width="149"></LEFT>

It would have been easier to export the OCRed text as plain text, but then I wouldn't have been able to get the images exported and linked to the correct page. HTML provided me with text, images, and, critically, the means to link the two together.

Creating usable content

I wrote a ruby script that used Nokogiri to parse the HTML files into markdown, which is how I have stored all the article text in the database. I chose markdown because it's much cleaner to work with. So my process was Hardcopy -> Horrible HTML -> Markdown -> Nice HTML.
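Here's a minimal sketch of the Horrible HTML -> Markdown step. It is not the original script: to keep it free of dependencies it uses plain regexes from the Ruby standard library instead of Nokogiri, and the method name omnipage_html_to_markdown is made up for illustration. It assumes every paragraph of body text sits inside a P tag, as in the snippet above.

```ruby
require "cgi"

# Hypothetical helper: turn OmniPage-style HTML into plain markdown paragraphs.
# Regexes stand in for Nokogiri here purely to keep the sketch dependency-free.
def omnipage_html_to_markdown(html)
  paragraphs = html.scan(/<P\b[^>]*>(.*?)<\/P>/mi).flatten.map do |inner|
    text = inner.gsub(/<[^>]+>/, "") # drop SPAN and any other inline tags
    text = CGI.unescapeHTML(text)    # decode entities such as &amp;
    text.gsub(/\s+/, " ").strip      # collapse runs of whitespace
  end
  # Markdown paragraphs are separated by a blank line.
  paragraphs.reject(&:empty?).join("\n\n")
end
```

All the inline styling is thrown away on purpose: the text is the only part worth keeping, and the final HTML gets its look from the site's own stylesheet.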

Images

At the same time as parsing the article HTML into markdown, I also used Nokogiri to identify the images and copy them into a Picture model (using Carrierwave) which belongs to the Article model.
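As a rough illustration of the image step, this sketch pulls the image file names out of a page's HTML in the order they appear. Again it uses a standard-library regex rather than Nokogiri, and image_sources is a hypothetical name. Creating the Picture records with Carrierwave is left out, since that depends on the app.

```ruby
# Hypothetical helper: list the src of every IMG tag, in document order.
def image_sources(html)
  html.scan(/<IMG\b[^>]*\bsrc\s*=\s*"([^"]+)"/i).flatten
end
```

Keeping the order matters because the sidebar relies on it to show the pictures in the same sequence as the printed page.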

At first I hoped to recreate the layout of each page, because they vary dramatically. But this proved to be a nightmare. I settled for having all the images in a sidebar, in the correct order. I think it works well enough - clicking on an image opens it up nice and big.

Linking things together

So by this point I had a table of contents, and could view an article, but there was no way to link between one page and another - I believe the kids call this a hyperlink!

The original printed pages contained references like this:

The engine is a big lump of metal (See SHEET 14).

These references had OCRed nicely (except for a handful which were recognised as 'See SHEEP 4'), so I wrote a regex which replaced them with a custom Wikipedia-style link markup, which looks like this:

  The engine is a big lump of metal (See [[Article:9]]).

I used the Article ID in this link markup so that it can adapt to changed titles later. The Article ID doesn't match up with the page numbers from the book - I didn't bother making pages for the contents and other needless pages, so I had to store the original_page_number in the Article table.
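The two halves of this can be sketched roughly as below. This is a sketch under assumptions, not the site's actual code: the method names, the page-number-to-ID hash (standing in for a lookup on original_page_number) and the /articles/ID URL scheme are all invented for the example.

```ruby
require "cgi"

# Replace "See SHEET 14" (and the OCR misreading "See SHEEP 4") with the
# custom markup. page_to_id is a hypothetical hash mapping the book's page
# number to the Article ID.
def link_sheet_references(text, page_to_id)
  text.gsub(/See SHEE[TP] (\d+)/) do |match|
    article_id = page_to_id[Regexp.last_match(1).to_i]
    # Leave the reference untouched when that page has no article.
    article_id ? "See [[Article:#{article_id}]]" : match
  end
end

# Expand the markup into an HTML link at render time. titles is a
# hypothetical hash of Article ID => current title; the URL is assumed.
def render_article_links(text, titles)
  text.gsub(/\[\[Article:(\d+)\]\]/) do |match|
    id = Regexp.last_match(1).to_i
    title = titles[id]
    title ? %(<a href="/articles/#{id}">#{CGI.escapeHTML(title)}</a>) : match
  end
end
```

Because the title is only looked up when the page is rendered, renaming an article never means touching the stored text.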

I later wrote a few regexes which link words within each article to key areas. For example, the regex /brakes?/ adds a link to the page about how brakes work. This adds some more structure to the pages, and also (probably) helps with SEO.
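Something along these lines would do the keyword linking. It is a sketch with a few assumptions of my choosing: the rules hash, the /articles/ID URL scheme, the word boundaries around the pattern, and linking only the first match per article are all illustrative rather than what the site necessarily does.

```ruby
# Hypothetical keyword linker. rules maps a regex to the Article ID of the
# page that explains that topic.
def link_keywords(markdown, rules)
  rules.reduce(markdown) do |text, (pattern, article_id)|
    # Only the first occurrence is linked, to avoid littering the page.
    text.sub(pattern) { |word| "[#{word}](/articles/#{article_id})" }
  end
end

# Example rule set: the Article ID 12 is made up.
rules = { /\bbrakes?\b/i => 12 }
link_keywords("Check the brakes before the brake pads.", rules)
```

A naive pass like this will happily link a word that is already inside another link, so it should run before the other link markup is added, or be given a guard for that case.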

Monetising

I've just put Google Adsense alongside the content and it seems to earn around $15 per week at the moment. I can only imagine that going up as Google starts to recognise this as quality content and not just content farm rubbish.

I'm Alex Muir.