HW 5 – ETL with real data
The goal of this homework is similar to that of HW4: you’re given semi-structured, non-normalized data and you do a full ETL on it, ending with a cleaned, normalized set of tables that you push to your Postgres database on AWS.
In this scenario, though, the data you’re given is more complex and messier, and it needs a lot more of the E and T of ETL than the HW4 district court decisions data did. You also have a lot less guidance on how to approach it. I’ll give you some information on the data and some guidance on what your RDB schema might look like, but you’ll have to make a lot of the decisions yourself. We’re simulating a scenario where I, a client, give you data and some general guidelines and ask you to do a full ETL on it.
This is a slightly more structured version of what you’ll need to do for your final projects. The data you’ll be using here actually comes from a former student’s final project, so it should give you an idea of the level of complexity that’s expected and a bigger-picture sense of what your final project will involve.
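As a quick reminder of what "cleaned and normalized" means here, below is a minimal sketch of the transform step. The records, field names, and helper functions are all made up for illustration (this is not the HW5 data or a suggested schema), and it uses plain Python rather than any particular library. It only shows the general idea: clean a messy value, then pull a repeated entity out into its own table with a key.

```python
# Hypothetical raw records: made-up fields, NOT the homework data.
raw_records = [
    {"case_id": "A-1", "judge": " Jane Doe ", "year": 2019},
    {"case_id": "A-2", "judge": "jane  doe", "year": 2020},
    {"case_id": "B-7", "judge": "John Roe", "year": 2020},
]


def clean_name(name):
    """Transform step: collapse extra whitespace and standardize capitalization."""
    return " ".join(name.split()).title()


def normalize(records):
    """Split flat records into a judges table and a cases table.

    Each distinct (cleaned) judge name gets a surrogate integer key;
    cases then refer to judges by that key instead of repeating the name.
    """
    judge_ids = {}  # cleaned judge name -> judge_id
    cases = []
    for rec in records:
        name = clean_name(rec["judge"])
        # Reuse the existing id if we've seen this judge, else assign the next one.
        judge_id = judge_ids.setdefault(name, len(judge_ids) + 1)
        cases.append(
            {"case_id": rec["case_id"], "year": rec["year"], "judge_id": judge_id}
        )
    judges = [{"judge_id": jid, "name": name} for name, jid in judge_ids.items()]
    return judges, cases


judges, cases = normalize(raw_records)
```

In this toy example the two differently formatted spellings of the same judge collapse into one row of `judges`, and both of that judge's cases point to it through `judge_id`. Real data will need more careful cleaning rules than `clean_name`, and deciding which entities deserve their own tables is part of the assignment.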
Submission
1) Run all cells.
2) Create a directory with your name.
3) Create a PDF copy of your notebook.
4) Download the .py and .ipynb versions of the notebook.
5) Put all three files in the directory.
6) Zip and submit.