Removing ~50% of the PROGRAM TYPE column

In step 2 of the ‘Data Cleaning Walkthrough: Combining the Data’, we’re told to drop almost 50% of the rows of the class_size dataframe by keeping only those where PROGRAM TYPE == GEN ED. Surely this isn’t a good idea? Or is it just done for the sake of speed?


Hi @burhaan.quinn,

Good question. Honestly, I don’t know the exact answer :joy:, but I can guess from this screen:
https://app.dataquest.io/m/137/data-cleaning-walkthrough%3A-combining-the-data/2/condensing-the-class-size-data-set

in particular, from this sentence at the end of that screen:

Each school can have multiple program types. Because GEN ED is the largest category by far, let’s only select rows where PROGRAM TYPE is GEN ED.

My guess is that GEN ED is the most universal program type, present in all the schools, while the other programs are school-specific. Presumably, that is why only this program type is kept.
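One way to test that guess rather than take it on faith is to look at the share of rows per program type and check whether any school disappears after the filter. The snippet below is only a rough sketch: the tiny dataframe is invented for illustration, and the SCHOOL and AVERAGE CLASS SIZE column names are assumptions, so swap in the real class_size dataframe and its actual columns.

```python
import pandas as pd

# Toy stand-in for class_size. All values are invented for illustration;
# "SCHOOL" and "AVERAGE CLASS SIZE" are assumed column names.
class_size = pd.DataFrame({
    "SCHOOL": ["A", "A", "A", "B", "B", "C"],
    "PROGRAM TYPE": ["GEN ED", "CTT", "SPEC ED", "GEN ED", "CTT", "GEN ED"],
    "AVERAGE CLASS SIZE": [25.0, 22.0, 10.0, 28.0, 24.0, 30.0],
})

# Fraction of rows belonging to each program type.
row_share = class_size["PROGRAM TYPE"].value_counts(normalize=True)

# The filter from the walkthrough: keep only GEN ED rows.
gen_ed = class_size[class_size["PROGRAM TYPE"] == "GEN ED"]

# Schools that would vanish entirely because they have no GEN ED rows.
dropped_schools = set(class_size["SCHOOL"]) - set(gen_ed["SCHOOL"])

# Compare a summary statistic before and after the filter.
mean_all = class_size["AVERAGE CLASS SIZE"].mean()
mean_gen_ed = gen_ed["AVERAGE CLASS SIZE"].mean()

print(row_share)
print("Schools lost by filtering:", dropped_schools)
print("Mean class size, all rows vs GEN ED only:", mean_all, mean_gen_ed)
```

If dropped_schools comes back empty on the real data, the filter removes rows but no schools, which would support the idea that GEN ED is present everywhere. A gap between the two means shows how much the filter shifts the numbers, which is the skew worth keeping in mind.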

Thanks, @Elena_Kosourova.

Just seemed like it would skew the data, but maybe I’m overthinking things.
