Using regular expressions to clean and process OCR data

This is a write-up of a script I wrote for my RA work. It demonstrates how regular expressions in Python can be used to clean and process error-ridden OCR text and produce a workable dataset. The goal in this example is to turn US Senate testimony into a dataset with the speaker in one column and their testimony in the next. I also show how to categorize comments by the section of testimony they appear in and how to assign an index to those sections. The script is available on GitHub.

The script as written requires an input file called "V1". In this case, the file is OCR'd text of Senate testimony from 1913. The text names the speaker at the start of each comment (e.g. "Senator Gallinger") and then gives the comment. There are also various section breaks (e.g. "TESTIMONY OF TRUMAN G. PALMER—Continued."). These comments and section breaks are the text data we are interested in...
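To make the target shape concrete, here is a minimal sketch of the general idea, not the script itself. The sample lines are invented for illustration (only the speaker name and the section header come from the examples above), and the names SECTION_RE, SPEAKER_RE and parse_testimony are hypothetical. The patterns also assume clean text; real OCR output would need more tolerant expressions.

```python
import re

# Invented sample in the layout described above; not real testimony.
SAMPLE = """TESTIMONY OF TRUMAN G. PALMER—Continued.
Senator Gallinger. What does the factory pay for beets?
Mr. Palmer. That depends on the
sugar content of the crop.
TESTIMONY OF TRUMAN G. PALMER—Continued.
Senator Gallinger. Please proceed.
"""

# Assumed pattern: a section break is a line starting with "TESTIMONY OF"
# and ending in a period.
SECTION_RE = re.compile(r"^TESTIMONY OF .+\.$")

# Assumed pattern: a comment starts with a title and surname followed by a
# period, e.g. "Senator Gallinger." or "Mr. Palmer."
SPEAKER_RE = re.compile(r"^((?:Senator|Mr\.) [A-Z][A-Za-z]+)\.\s+(.*)$")


def parse_testimony(text):
    """Return one dict per comment: section index, section, speaker, comment."""
    rows = []
    section = None
    section_index = 0
    current = None  # the comment that continuation lines belong to

    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue

        if SECTION_RE.match(line):
            # A new section starts: bump the index and close the open comment.
            section_index += 1
            section = line
            current = None
            continue

        match = SPEAKER_RE.match(line)
        if match:
            current = {
                "section_index": section_index,
                "section": section,
                "speaker": match.group(1),
                "comment": match.group(2),
            }
            rows.append(current)
        elif current is not None:
            # No speaker prefix: treat the line as a wrapped continuation.
            current["comment"] += " " + line

    return rows


if __name__ == "__main__":
    for row in parse_testimony(SAMPLE):
        print(row["section_index"], row["speaker"], row["comment"], sep=" | ")
```

Because the two section headers in the sample are identical, the running index is what tells the sections apart, which is one reason to keep an index alongside the section label.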
