Project: Extracting & Transforming Data from a PDF

I’ve been working on a project that involves extracting data from a PDF document and transforming and loading it into a more useful format. It involves the use of the PyPDF2 library and extensive use of regular expressions.
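For readers who have not opened the notebook, here is a rough sketch of the general extract-then-parse approach. It is not the notebook's actual code. The function names, the regular expression, and the "Last, First (Party-ST)" record layout are all hypothetical, and it assumes a PyPDF2 release that provides `PdfReader` (older releases used `PdfFileReader`).

```python
import re


def extract_pdf_text(path):
    """Return the text of every page in the PDF at `path`, joined by newlines."""
    # Imported here so the regex helper below can be used without PyPDF2 installed.
    # PdfReader is the name in newer PyPDF2 releases (older ones used PdfFileReader).
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    # Pages without a text layer may yield an empty result, hence the `or ""`.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


# Hypothetical record layout, one per line: "Last, First (P-ST)"
RECORD_RE = re.compile(
    r"^(?P<last>[A-Za-z' -]+), (?P<first>[A-Za-z. ]+) "
    r"\((?P<party>[A-Z])-(?P<state>[A-Z]{2})\)$",
    re.MULTILINE,
)


def parse_records(text):
    """Turn every line matching RECORD_RE into a dict of its named groups."""
    return [match.groupdict() for match in RECORD_RE.finditer(text)]
```

With a real file, something like `parse_records(extract_pdf_text("some_file.pdf"))` would give a list of dictionaries that can be loaded into a DataFrame or written to CSV. Lines that do not match the pattern are simply skipped.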

The project is essentially complete, but with plenty of room for improvement. I would welcome any comments, suggestions, or questions.

I’m posting the Jupyter Notebook here. I’ve also posted all the files for the project in a GitHub repository.


senate_project.ipynb (27.7 KB)



Hey @ed9

First of all, sincere apologies for the delayed response. I came across your project the same week you uploaded it, but somehow I didn’t respond then and afterwards lost track of it. I searched for it using “PDF” as a keyword but only found other posts :frowning:

Thank you for sharing the project :ok_hand: I hope to break down the regex patterns you have devised so that I can understand them better.


Thanks, Rucha! The regex Takeaways sheets from the two Dataquest courses on regex proved to be invaluable for this project.


This looks great. I will hopefully be able to learn a lot from this, since I am just starting to work with PDFs.

I am so new that I am having trouble finding the PyPDF2 module I imported. Any tips?

Thanks for your interest. I use Anaconda and installed it that way. See here for information:

Once installed, you can import it by name: ‘import PyPDF2’. You can see this in my code.
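If the import fails, a quick check like the one below can tell you whether Python can see the package at all. This is only a sketch: `is_installed` is a helper name made up for this example, and the install commands in the message assume the package is published as `pypdf2` on conda-forge and as `PyPDF2` on PyPI.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("PyPDF2"):
    import PyPDF2

    # getattr guards against releases that do not expose a version string.
    print("PyPDF2 version:", getattr(PyPDF2, "__version__", "unknown"))
else:
    # Assumed package names; check your own channel or index if these fail.
    print("PyPDF2 not found. Try: conda install -c conda-forge pypdf2")
    print("or: pip install PyPDF2")
```

Run it in the same environment (or notebook kernel) where you want to use the library, since a package installed in one environment is not visible from another.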

Good luck!

