I really enjoy working with data retrieved from an API. The problem I keep running into is that the data is frequently dirty and unstructured and must be reworked before it is usable. After reading this guide, you will understand the steps I took to clean my data.
I have been working with the ProPublica API to retrieve political data about legislators and the bills they have sponsored. After reading the ProPublica documentation, I noticed I could get specific information about a legislator by passing a member_id as a URL parameter. According to the documentation, the response contains a legislator's roles, committees, and subcommittees, which is exactly what I need.
Please check out: https://projects.propublica.org/api-docs/congress-api/members/#get-a-specific-member
To extract the roles, committees and subcommittees from the response object, I need the following:
1. A model file to represent the API's JSON structure

2. A client file to make the REST call and map the JSON to the model fields
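
A minimal sketch of what these two files could look like is shown here. It is an illustration under assumptions, not the exact code from this project: the class names, the field names, the base URL, and the `X-API-Key` header are assumptions drawn from the API documentation linked above and should be checked against it. The client only returns the raw JSON as a `String`; mapping that string onto the model is left to whichever JSON library you prefer.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;

// Model: mirrors only the parts of the response used in this article.
// All field names here are assumptions; verify them against a real response.
class Committee {
    String name;
    String code;
}

class Role {
    String congress;
    String chamber;
    List<Committee> committees;     // may be empty or missing
    List<Committee> subcommittees;  // may be empty or missing
}

class Member {
    String memberId;
    String twitterAccount;
    List<Role> roles;
}

// Client: builds and sends the REST call for a single member.
class MemberClient {
    // Assumed base URL; check the documentation before relying on it.
    static final String BASE_URL = "https://api.propublica.org/congress/v1/members/";

    // Builds a GET request with the member_id in the URL path.
    static HttpRequest buildRequest(String memberId, String apiKey) {
        return HttpRequest.newBuilder()
                .uri(URI.create(BASE_URL + memberId + ".json"))
                .header("X-API-Key", apiKey) // assumed authentication header
                .GET()
                .build();
    }

    // Sends the request and returns the raw JSON body as a String.
    static String fetchRaw(String memberId, String apiKey) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(buildRequest(memberId, apiKey), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IllegalStateException("Unexpected status: " + response.statusCode());
        }
        return response.body();
    }
}
```

Keeping the request building separate from the sending makes it easy to check the URL and headers without calling the API, and printing the raw `String` before mapping it is a quick way to see which fields are really there.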

When I tried to parse the subcommittees from the response, I kept getting a null pointer exception, which made me question the accuracy of my data, so I decided to investigate further.
- I looked at the response object (the String result in the photo above). This let me see exactly which fields were actually present, and I noticed that the subcommittees were empty.
- Next, I wondered whether the social media names were valid and up to date. After searching Twitter for the account names, I noticed that a few politicians have multiple Twitter accounts. This is a new problem that I still need to solve.
- Lastly, I looked for null values and handled them accordingly.
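
One way to handle missing values like these is sketched below. It is an assumption about what "handled accordingly" could mean, not the exact approach used here: `CleaningUtils`, `orEmpty`, and `cleanHandle` are hypothetical names. The idea is to treat a missing list as an empty one, so looping over subcommittees never throws a null pointer exception, and to wrap an optional text field such as a Twitter account name in an `Optional`.

```java
import java.util.Collections;
import java.util.List;
import java.util.Optional;

// Hypothetical helpers for null-safe access to fields that may be absent.
class CleaningUtils {

    // Treats a missing list the same as an empty one so callers can loop safely.
    static <T> List<T> orEmpty(List<T> values) {
        return values == null ? Collections.emptyList() : values;
    }

    // Returns the trimmed handle without a leading '@',
    // or Optional.empty() when the value is null or blank.
    static Optional<String> cleanHandle(String raw) {
        if (raw == null) {
            return Optional.empty();
        }
        String handle = raw.trim();
        if (handle.startsWith("@")) {
            handle = handle.substring(1);
        }
        return handle.isEmpty() ? Optional.empty() : Optional.of(handle);
    }
}
```

With helpers like these, a loop such as `for (String name : CleaningUtils.orEmpty(subcommitteeNames))` simply runs zero times when the field is missing. They do not solve the problem of one politician having several accounts, which still needs a manual check.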
After dealing with the dirty data, I was able to get all the information I needed from ProPublica. I now understand how important the data cleaning process is, and how unstructured and inconsistent data can and will lead to misleading results.
Thanks again for reading my article. The next article will cover how I used the fec_id from ProPublica to retrieve financial information from the MapLight API.
Please check out the links below:
Resume website: https://tommarler.org
LinkedIn: https://www.linkedin.com/in/tom-m-bb4857112/
If you like data, check out the Programming Historian.
Programming Historian: https://medium.com/@tommarler/osint-the-programming-historian-1d9129439898