This is the second post in a series dedicated to digitised historic newspapers. It explains, in non-technical terms, what OCR is: the technology that operates behind large quantities of scanned text documents.

OCR (Optical Character Recognition) is the use of computer methods to read scanned text; it was initially developed to translate written text into computer speech for the benefit of blind people [1].

With materials such as books, scanning a page is a simple task: there is text and the occasional illustration, the text flows from left to right and from top to bottom, and it continues on the next page; at most, the text is distributed in columns. But newspapers are structured differently. Below, we see the front page of a newspaper published on 29 June 1919. On this day, Uusi Suomi, a Finnish newspaper that ran until the 1990s, summarised its main contents: the recently signed Treaty of Versailles; the situation of the Red and White armies in the civil war in neighbouring Russia; the Bolsheviks' plans for possible moves around the Finnish border; and new rationing regulations for sugar supplies. Our eyes are used to the columns and headlines that have structured newspapers for centuries, so it is easy to distinguish the stories at first glance. But how can a computer recognise things like headlines, or an article spread irregularly across multiple columns?

Front page of 1919 Uusi Suomi with a summary of its most prominent news
Uusi Suomi, 29.06.1919, nro 145, s. 1 Kansalliskirjaston Digitoidut aineistot.

In other words, how does OCR work? At first glance, scanning consists of taking a picture of the page placed on the scanner glass. In the background, OCR performs a series of operations before the text of the scanned document is recognised. After correcting the contrast and sharpness of the page and turning it into black and white (known as binarisation), the main function of OCR is to transform the text in the “photo” of the page into machine-readable text. To do this, the layout of the page is analysed: text is distinguished and extracted from strokes used as separators, images and other non-textual decoration; image captions are also recognised as text [1]. Beyond recognition, there are improvements to this basic OCR functionality: if a reference list or vocabulary is provided, there are methods to identify the language of the text or named entities (e.g. people, places), and newer projects are advancing methods to recognise types of text, such as poetry or music [2].
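For readers curious about what the black-and-white step looks like in practice, here is a minimal sketch in Python. It is an illustration only, not how any particular OCR product works: the tiny "page" is made-up data, the function names binarize and ink_per_row are hypothetical, and using the average brightness as the cut-off is a simplifying assumption (real systems choose the threshold more carefully).

```python
def binarize(gray, threshold=None):
    """Turn a greyscale page (rows of 0-255 values) into ink (1) and paper (0)."""
    pixels = [p for row in gray for p in row]
    if threshold is None:
        # Assumption: the mean brightness serves as a crude global threshold.
        threshold = sum(pixels) / len(pixels)
    # Dark pixels (below the threshold) count as ink.
    return [[1 if p < threshold else 0 for p in row] for row in gray]


def ink_per_row(binary):
    """Count ink pixels in each row; empty rows hint at gaps between text lines."""
    return [sum(row) for row in binary]


# Made-up 3x4 greyscale "scan": high numbers are light paper, low numbers are ink.
page = [
    [250, 245, 30, 240],
    [248, 20, 25, 243],
    [251, 247, 35, 239],
]

binary = binarize(page)
for row in binary:
    print("".join("#" if p else "." for p in row))
print(ink_per_row(binary))
```

Once every pixel is either ink or paper, counting the ink in each row is one simple cue that layout analysis can use to tell text lines apart from the blank space or separators between them.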

Aye Fechtin' Awa'. Australian Town and Country Journal (Sydney, NSW : 1870 - 1907), Saturday 6 December 1884, page 30

This fragment of a poem, published in the Australian Town and Country Journal in 1884, was automatically identified as poetry by an algorithm that recognises rhyming lines. The algorithm analyses the endings of lines, looking for matches within three lines. It proved to be the most successful of a series of algorithms applied to texts to determine whether they are poems. Others looked for irregular lengths of consecutive lines or for a larger indentation of the text in a column.
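The idea of matching line endings within three lines can be sketched in a few lines of Python. This is not the algorithm from [2], only a rough illustration under stated assumptions: two lines are treated as rhyming when the last three letters of their final words are identical (a crude, spelling-based stand-in for real rhyme), a text counts as poetry when at least half of its lines rhyme with a neighbour, and the function names and sample texts are invented for the example.

```python
import re


def line_ending(line, size=3):
    """Return the last `size` letters of the final word in the line."""
    words = re.findall(r"[a-z']+", line.lower())
    if not words:
        return ""
    return words[-1].replace("'", "")[-size:]


def rhyming_pairs(lines, window=3, size=3):
    """Find pairs of lines, at most `window` lines apart, with the same ending."""
    endings = [line_ending(line, size) for line in lines]
    pairs = []
    for i, ending in enumerate(endings):
        if len(ending) < 2:
            continue  # too short to say anything about rhyme
        for j in range(i + 1, min(i + 1 + window, len(lines))):
            if endings[j] == ending:
                pairs.append((i, j))
    return pairs


def looks_like_poetry(lines, min_share=0.5):
    """Assumption: poetry if at least `min_share` of the lines rhyme with another."""
    if not lines:
        return False
    rhymed = {i for pair in rhyming_pairs(lines) for i in pair}
    return len(rhymed) / len(lines) >= min_share


# Made-up sample texts, written only for this example.
verse = [
    "The morning paper brings to light,",
    "The news that travelled through the night,",
    "Of distant wars and treaties signed,",
    "And sugar rations redefined.",
]
prose = [
    "Uusi Suomi summarised its main contents on the front page.",
    "The layout of the page is analysed before recognition.",
    "Image captions are also recognised as text.",
]

print(rhyming_pairs(verse))
print(looks_like_poetry(verse), looks_like_poetry(prose))
```

A spelling-based comparison like this misses rhymes that are spelled differently (such as "won" and "sun") and is easily confused by OCR errors at the end of a line.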

Headaches of OCR

If one examines the image at the top, and if one understands Finnish, one can find many errors in the automated transcription of the text. OCR-scanned pages can contain serious spelling errors, especially if the scan quality is poor or if the text uses special fonts, such as, in this case, “Fraktuura” (Fraktur), a popular but not very legible gothic typeface that had been in use since the 16th century and remained common in Germany until the mid-20th century.

Despite this common shortcoming of OCR, visitors to digital libraries in search of these documents (historic newspapers, but also old books, manuscripts or letters) have at least three tools at their disposal:

  1. keyword search, with matches highlighted on the page display;
  2. intuitive image/article clipping and segmentation;
  3. side-by-side display of the OCR text with the digitised page (see top image).

Now that we have seen how the content of a newspaper page is digitised, the next post explains how to make the best use of digital libraries and search their content.

Images: Kansalliskirjaston digitoidut aineistot (digi.kansalliskirjasto.fi) / National Library of Australia (trove.nla.gov.au/)

Further reading:

[1] ‘Optical Character Recognition’. 2017. Wikipedia. https://en.wikipedia.org/wiki/Optical_character_recognition#Techniques

[2] Kerry Kilner and Kent Fitch. 2017. ‘Searching for My Lady’s Bonnet: Discovering Poetry in the National Library of Australia’s Newspapers Database’. Digital Scholarship in the Humanities, January, i69–83. doi:10.1093/llc/fqw062.