Using regular expressions to clean and process OCR data

This is a write-up of a script I wrote for my RA work. It demonstrates how regular expressions in Python can clean and process error-ridden OCR text to produce a workable dataset. The goal in this example is to turn US Senate testimony into a dataset that lists each speaker in one column and their testimony in the next. I also show how to categorize the comments by the section of testimony they fall in and how to index those sections. The script is available on GitHub.

The script as written requires an input file called "V1". In this case, the file is OCR'd text of Senate testimony from 1913. The text names the speaker at the start of each comment (e.g. "Senator Gallinger") and then gives the comment. There are also various section breaks (e.g. "TESTIMONY OF TRUMAN G. PALMER—Continued."). These comments and section breaks are the text data we are interested in...
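To make the idea concrete, here is a minimal sketch of the kind of parsing described above. It is not the script from GitHub. The sample text is invented, and the two patterns are assumptions about the layout: a section break is a line starting with "TESTIMONY OF", and a comment opens with "Senator" or "Mr." plus a surname and a period. Real OCR output will need looser patterns. The names SAMPLE, SECTION_RE, SPEAKER_RE and parse_testimony are hypothetical.

```python
import re

# Invented sample standing in for the OCR'd input; the real "V1" file will differ.
SAMPLE = """TESTIMONY OF TRUMAN G. PALMER.
Senator Gallinger. What is the cost of production?
Mr. Palmer. It varies a great deal
from factory to factory.
TESTIMONY OF JOHN DOE.
Senator Gallinger. Please proceed.
"""

# Assumption: a section break is a line that starts with "TESTIMONY OF".
SECTION_RE = re.compile(r"^TESTIMONY OF .+$")
# Assumption: a comment opens with a title, a capitalized surname, and a period.
SPEAKER_RE = re.compile(
    r"^(?P<speaker>(?:Senator|Mr\.) [A-Z][A-Za-z]+)\. (?P<text>.*)$"
)


def parse_testimony(raw):
    """Return rows of [section_index, section, speaker, comment]."""
    rows = []
    section_index = 0
    section = ""
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        if SECTION_RE.match(line):
            # A new section starts: bump the index and remember the heading.
            section_index += 1
            section = line
            continue
        match = SPEAKER_RE.match(line)
        if match:
            rows.append(
                [section_index, section, match.group("speaker"), match.group("text")]
            )
        elif rows and rows[-1][0] == section_index:
            # No speaker label: treat the line as a continuation of the
            # previous comment in the same section.
            rows[-1][3] += " " + line
        # Any other unlabeled line is dropped in this sketch.
    return rows


if __name__ == "__main__":
    for row in parse_testimony(SAMPLE):
        print(row)
```

Each row carries the section index and heading alongside the speaker and comment, so the result can be written out with one speaker column and one testimony column.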


Wikipedia search for historical firm founding date (Python)

For my RA work this summer, I developed a Python script that links a historical firm to its suggested Wikipedia information in order to predict the firm's founding date. Here is a quick write-up on the script, which is available on GitHub.

The script as written requires an input file called "CompanyList". This should be a UTF-16 encoded .txt file containing the list of companies to be searched, with one company on each line.

The output file will be a UTF-8 encoded .txt file called "WikipediaFoundingDates". You can initialize this file beforehand by creating a blank .txt file with this name. After the script runs, the updated file will contain a semicolon-separated list that can be easily imported into other software for analysis...
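As a rough illustration of the file handling described above, here is a minimal sketch of reading a UTF-16 company list and writing semicolon-separated UTF-8 output. It is not the script from GitHub, and it leaves out the Wikipedia lookup entirely. The function names, the ".txt" file names used as defaults, and the two-column layout (company, founding date) are all assumptions.

```python
import csv


def read_companies(path="CompanyList.txt"):
    """Read one company per line from a UTF-16 encoded text file."""
    with open(path, encoding="utf-16") as f:
        # Skip blank lines and trim surrounding whitespace.
        return [line.strip() for line in f if line.strip()]


def write_results(rows, path="WikipediaFoundingDates.txt"):
    """Write rows as semicolon-separated values in a UTF-8 encoded file."""
    # Assumed layout: each row is (company, founding_date).
    with open(path, "w", encoding="utf-8", newline="") as f:
        csv.writer(f, delimiter=";").writerows(rows)
```

Using the csv module for the output means that a company name containing a semicolon gets quoted rather than silently splitting into an extra column.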

 


Make your own bibliography style in LaTeX

If the many, many LaTeX BibTeX bibliography styles don't suit you, never fear! It's easy and exciting to make your own bibliography style (.bst) in just a couple of minutes, or even hours if you really get into it! Here's how I did it:

WHAT YOU'LL NEED

  • A Unix computer (e.g. a Mac)
  • With MacTeX installed (MacTeX is the free LaTeX distribution for Macs)
  • A passion for procrastination and tedium...