OCR: digitization projects and optical character recognition


Markus Brantl
Tommaso Garosci

Abstract

Programmes that convert a bitmapped image file (typically a .jpg or .tiff) into machine-readable text (OCR, Optical Character Recognition) are currently considered to offer a more than satisfactory degree of reliability. They are therefore regarded as a mature technology, and assessing their performance is not a priority in the broader research field of information retrieval and text manipulation. Yet, particularly for librarians, assessing their performance is neither a useless nor an easy exercise. The purpose of such an evaluation depends on the scope and scale of the digitization project concerned. There is also a more compelling reason why evaluation matters: OCR programmes are unpredictable. Their output cannot be foreseen a priori and has to be measured empirically. Accurate empirical evaluations on English texts have been carried out by the University of Michigan, Harvard and Medline.
For a few years now, a set of programmes for testing "page readers" (OCR) has been available online free of charge from the Information Science Research Institute (ISRI) of the University of Nevada, Las Vegas. The so-called "frontiers toolkit" (ftk) contains the control programmes together with the bitmapped image files and ground-truth texts needed to carry out the test.
After reviewing the available literature on the subject, the article reports the results of an assessment performed with the University of Nevada toolkit. For this purpose, a sample of 300 dpi bitmapped images was created from pages randomly selected from a dataset of about 15,000 pages in Italian, published between 1950 and 2000 and digitized in Turin between 2006 and 2008.
To carry out the test, a ground-truth text version of the sampled pages was prepared manually. The programmes tested, ABBYY FineReader (ver. 9.0), ScanSoft OmniPage (ver. Pro 14; ver. 16 is the current version) and IRIS Readiris, were chosen because they are the default choice in Italy and were in fact used by the digitization project carried out in Turin. Only character and word accuracy were tested. The results show that FineReader and OmniPage perform at essentially the same level, whereas Readiris falls well below an acceptable threshold (which is why it is not included in the report).
The article concludes with some general remarks on possible future developments in free OCR software and a checklist of considerations to bear in mind when choosing an OCR programme.
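As an illustration of what character and word accuracy mean when OCR output is compared with a ground-truth text, the following minimal Python sketch may help. It assumes accuracy is defined as (n - errors) / n, where n is the length of the ground truth and errors is the edit distance (insertions, deletions and substitutions) between the two texts. The function names are hypothetical, and the word-level measure is a simplification over whitespace-separated tokens; it is not claimed to reproduce the exact definitions used by the University of Nevada toolkit.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into ref
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def accuracy(ref, hyp):
    """Assumed definition: (n - errors) / n, with n the ground-truth length."""
    if len(ref) == 0:
        raise ValueError("ground truth must not be empty")
    return (len(ref) - edit_distance(ref, hyp)) / len(ref)


def char_accuracy(ground_truth, ocr_text):
    """Accuracy measured over individual characters."""
    return accuracy(ground_truth, ocr_text)


def word_accuracy(ground_truth, ocr_text):
    """Accuracy measured over whitespace-separated tokens (a simplification)."""
    return accuracy(ground_truth.split(), ocr_text.split())


if __name__ == "__main__":
    # Invented example strings, not data from the study.
    truth = "la città di Torino"
    ocr = "la citta di Tormo"
    print(f"character accuracy: {char_accuracy(truth, ocr):.3f}")
    print(f"word accuracy:      {word_accuracy(truth, ocr):.3f}")
```

With this definition a single misread character costs little at the character level but invalidates the whole word at the word level, which is why the two figures can differ markedly for the same page.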

Article details

Section
Articles