Skanowane teksty jako korpusy

Janusz S. Bień

Scanned Texts as Corpora

Author(s): Janusz S. Bień
Subject(s): Language and Literature Studies
Published by: Wydział Polonistyki Uniwersytetu Warszawskiego
Keywords: skanowanie; tekst; korpus; wyszukiwarka; kodowanie; scanning; text; corpus; search tool; coding

Summary/Abstract: This paper describes a modification of the Poliqarp corpus search tool oriented towards searching scanned texts with dirty OCR (i.e. fully automatic Optical Character Recognition without any proofreading). The search tool has been in operation since December 2009 and is available at http://wbl.klf.uw.edu.pl/. The two-level regular expressions that can be used in queries make it possible, at least in principle, to circumvent OCR errors. The crucial property of the search engine is its ability to highlight hits on the original scans stored in the DjVu format. The feature is not original: it was first used for the Century Dictionary and later for Jamieson’s Etymological Dictionary of the Scottish Language. Here it is substantially augmented by so-called graphical concordances and a convenient way to bookmark hits. Our system now handles four dictionaries with a total size of over 40,000 pages, and other texts are expected to be added in the near future.

Journal: Prace Filologiczne

Issue Year: 2012
Issue No: 63
Page Range: 25-36
Page Count: 12
Language: Polish
