Hurdles

10/3/2016

Like any other project, our task with the MJP is vulnerable to difficulties. The past two weeks have been consumed by troubleshooting our OCR software, ABBYY FineReader Pro.

At the end of my last post, my progress was gaining steam. I had successfully converted the image scans of The Dilettante issues 1.1 and 1.2 into edited, compiled PDFs ready for OCRing. OCR stands for optical character recognition. It is the process that turns a scanned image of a page into machine-readable text, which ultimately allows the reader to search through a document for key words and phrases and navigate it easily. The process requires OCR software such as ABBYY FineReader Pro.

Two weeks ago, I began the process by loading my TIFF files into ABBYY. I had the program zone the areas of text I wanted it to read, and then I let it analyze and read the text for me. The expected output from ABBYY was a PDF with an editable transcript right next to it. This is where I ran into trouble. ABBYY could create the PDF, but there was no way to edit the transcript it produced. Any mistakes it made while OCRing the TIFF images could not be corrected. Mark, my professor, only has knowledge of the Windows version of ABBYY, so I turned to other resources in an attempt to figure it out.

Two weeks later, I was basically in the same rut I had been in at the beginning. I had learned that ABBYY FineReader for Mac does not have all of the capabilities the Windows version has, the most important one being the ability to edit the text transcript after reading. Mark and I made the decision to continue our work and not spend any longer on this issue. The error rate in ABBYY FineReader Pro's OCR'd text is very low, and a majority of pages did not have any errors. Our compromise for the inability to edit the PDF text was to create two files through ABBYY (which is normal procedure), but only edit the ".txt" file, leaving the PDF and its unfixable errors alone.

The process of copy-editing the OCR'd text (in the .txt file) was more interesting than the copy-editing I am used to. I was generally looking for words that were copied incorrectly or misinterpreted by the program. Although many were simple cases where a letter was added or omitted, some called for major changes. Sometimes three full words in a row would be jumbled, with strange punctuation marks attempting to form letters. A common error in the program was the combination of the letter "i" and an apostrophe to create the letter "r".
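A short script could help narrow down where to look in a file like this. The following is a minimal Python sketch, not part of the workflow described in this post. It assumes that the "i" plus apostrophe error appears in the .txt file as a lowercase "i'" inside a word (for example "fi'om" where the page reads "from"), and that jumbled words contain a symbol wedged between two letters. The function name and both patterns are hypothetical, and the script only lists candidates to check against the scan; it does not change anything.

```python
import re

# Assumption: a misread "r" shows up as a lowercase i followed by an
# apostrophe and another letter. Possessives such as "Mimi's" are skipped.
I_APOSTROPHE = re.compile(r"i'(?!s\b)[a-z]")

# Assumption: jumbled words have a symbol between two letters. Apostrophes
# and hyphens are allowed so that "don't" and "well-known" are not flagged.
ODD_PUNCT = re.compile(r"[A-Za-z][^A-Za-z0-9\s'\-]+[A-Za-z]")


def flag_suspect_words(text):
    """Return (line_number, word) pairs worth checking against the scan."""
    flagged = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for word in line.split():
            if I_APOSTROPHE.search(word) or ODD_PUNCT.search(word):
                flagged.append((line_number, word))
    return flagged


if __name__ == "__main__":
    sample = (
        "He came fi'om the city.\n"
        "It's Mimi's book, don't worry.\n"
        "The w^rd was jumbled."
    )
    for line_number, word in flag_suspect_words(sample):
        print(line_number, word)
```

Patterns this loose will also flag some legitimate text, such as abbreviations like "e.g.", so the output is only a reading list for manual review. Every flagged word still has to be compared with the scanned page before it is changed.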

While editing the OCR'd text, I also noticed some mistakes within the literary magazine itself: spelling mistakes that also appear in the scanned PDF. Of course, it was much more difficult to copy-edit when Microsoft Word wasn't around. I also noticed some apparently intentional spellings that differ from today's, such as perseverance being spelled "perserverence" and harass being spelled "harrass" consistently throughout the magazine. I did not change spellings like these, because they are found within the actual text and may have something to say about the era and location in which this magazine was created.

After this two-week lull, I can now begin moving forward to issues 1.3 and 1.4.
