Skanowane teksty jako korpusy
Scanned Texts as Corpora

Author(s): Janusz S. Bień
Subject(s): Language and Literature Studies
Published by: Wydział Polonistyki Uniwersytetu Warszawskiego
Keywords: skanowanie; tekst; korpus; wyszukiwarka; kodowanie; scanning; text; corpus; search tool; coding

Summary/Abstract: A modification of the Poliqarp corpus search tool is described, which is oriented towards searching scanned texts with dirty OCR (i.e. fully automatic Optical Character Recognition without any proofreading). The search tool has been in operation since December 2009 and is available at http://wbl.klf.uw.edu.pl/. The two-level regular expressions that can be used in queries make it possible, at least in principle, to work around OCR errors. The crucial property of the search engine is its ability to highlight the hits on the original scans stored in the DjVu format. Although the feature is not original, as it was first used for the Century Dictionary and later for Jamieson’s Etymological Dictionary of the Scottish Language, it is substantially augmented by support for so-called graphical concordances and a convenient way to bookmark hits. Our system now handles four dictionaries, with a total size of over 40,000 pages. Other texts are expected to be added to the system in the near future.

  • Issue Year: 2012
  • Issue No: 63
  • Page Range: 25-36
  • Page Count: 12
  • Language: Polish