sat_score column has 0 values and messes up the plot

Screens: Data Cleaning Walkthrough: Analyzing and Visualizing the Data, Screen 5 to Screen 7.

The sat_score column in combined has the value 0 (zero) in some of the rows, so when plotting the scatter plot there are some points at the bottom that are messing up the plot and, consequently, the rest of the exercises.

[screenshot: scatter plot with several points sitting at sat_score = 0]

How can I remove the 0 values so that I can get a clean scatter plot?

Everything stemmed from this code here, where we converted the values to numeric using pd.to_numeric() with the argument errors='coerce':

sat_columns = ["SAT Math Avg. Score", "SAT Critical Reading Avg. Score", "SAT Writing Avg. Score"]

# convert the values to numeric using pd.to_numeric(); invalid entries become NaN
for col in sat_columns:
    data["sat_results"][col] = pd.to_numeric(data["sat_results"][col], errors='coerce')

# add up the three component columns into a total SAT score
data["sat_results"]["sat_score"] = data["sat_results"][sat_columns].sum(axis=1)
data["sat_results"]["sat_score"]

Is the dataset wrong? Can anyone give me a way to filter it?

Also, when selecting the names of the schools that had a total_enrollment and sat_score lower than 1000, I got a lot of schools with a sat_score of 0. Is this correct?

I proceeded to filter them out using low_enrollment = low_enrollment[low_enrollment['sat_score'] > 0] so I can get the schools that actually have an SAT score. Is this method of filtering out the schools with a score of 0 correct if this were a real-life situation?
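
For clarity, here is roughly the filtering in one place; this is just a sketch of what I did, so treat the selection condition and the exact column names (total_enrollment, sat_score in combined) as my assumptions:

# select schools with low enrollment and a low total SAT score
low_enrollment = combined[(combined["total_enrollment"] < 1000) & (combined["sat_score"] < 1000)]

# drop the rows whose sat_score is 0, i.e. schools with no usable SAT data
low_enrollment = low_enrollment[low_enrollment["sat_score"] > 0]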

I know there are a lot of questions, and I would highly appreciate your help and your taking the time to read and answer them.

Thank you in advance!


Hi Vallentin,

The issue is with the last line of your code (well, the second to the last). To calculate a total SAT score in a column, you should use the following code instead:

data['sat_results']['sat_score'] = data['sat_results'][sat_columns[0]] + data['sat_results'][sat_columns[1]] + data['sat_results'][sat_columns[2]]

Then you’ll find that the minimum SAT score in the dataset is 887.0.
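
A quick way to verify this (a sketch, assuming the same data dictionary as in your code above):

# check the smallest total SAT score after the fix
data['sat_results']['sat_score'].min()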


Thanks for the answer, Elena!

What is the logic behind this? Shouldn’t the sum() function sum all the elements in the sat_columns list anyway?


Hi Vallentin,

The issue with the sum() method is not so obvious: by default, it excludes NaN values from the calculation. In other words, when you have NaN in all the columns that you want to sum, this method calculates NaN + NaN = 0. So if you want to use the sum() method anyway, you have to pass in the skipna parameter, which is True by default. Since it seems that your dataset sometimes has NaN values in all the columns to be summed, you should set it to False instead.
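
For example, the summation line could be changed like this (a sketch based on the code you posted):

# with skipna=False, a row where every component is NaN sums to NaN instead of 0
data["sat_results"]["sat_score"] = data["sat_results"][sat_columns].sum(axis=1, skipna=False)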


YES YES YES!
Oh my, you have no idea how happy I am that you mentioned the skipna parameter. Wow, thanks a lot.

Now it is interesting that when you sum the columns manually, the NaN values are not skipped, so the result stays NaN.


That’s great that my suggestion was useful! I probably should have advised it from the very beginning instead of my clumsy code :sweat_smile:

I think we don’t have this issue when summing up manually precisely because manual addition has no such parameter with an inconvenient default. In “normal” Python, NaN + NaN = NaN.
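
A tiny illustration of that behaviour (a sketch using numpy's NaN):

import numpy as np

print(np.nan + np.nan)  # nan: plain addition propagates NaN
print(np.nan + 5)       # nan: any NaN operand makes the result NaN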


Thank you for your answers and for taking the time, Elena!


Vallentin, sorry that I’m still “disturbing” you with your own question, but I found an even better solution and would like to share it with you! :grinning: Look what I’ve discovered; I’ll demonstrate it with a small example DataFrame:

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [np.nan, 4, np.nan], 
                   'B': [9, 8, np.nan],
                   'C': [np.nan, 6, np.nan]})
print(df)

     A    B    C
0  NaN  9.0  NaN
1  4.0  8.0  6.0
2  NaN  NaN  NaN

Following your initial approach:

df['sum'] = df.sum(axis=1)
print(df)

     A    B    C   sum
0  NaN  9.0  NaN   9.0
1  4.0  8.0  6.0  18.0
2  NaN  NaN  NaN   0.0

Not good, we have 0 instead of NaN for the last row.

However, following my first approach:

df['sum'] = df['A'] + df['B'] + df['C'] 
print(df)

     A    B    C   sum
0  NaN  9.0  NaN   NaN
1  4.0  8.0  6.0  18.0
2  NaN  NaN  NaN   NaN

or even my second approach:

df['sum'] = df.sum(axis=1, skipna=False)
print(df)

     A    B    C   sum
0  NaN  9.0  NaN   NaN
1  4.0  8.0  6.0  18.0
2  NaN  NaN  NaN   NaN

we get another issue: the sum is NaN even in the first row, where there is actually 1 valid value! :exploding_head:

(Side note: in our particular case with the NYC schools dataset, we didn’t encounter this problem because, for the 3 components of the SAT, either all of them are NaN or all of them are valid.)

Sometimes this can be exactly what we need: when we want to ignore the rows with at least one NaN value. However, that’s not always the case in real life. Hence, to get NaN only for the rows where all values are NaN, and normal sums for the rows with at least 1 valid value, we’d better use another parameter: min_count=1. Look:

df['sum'] = df.sum(axis=1, min_count=1)
print(df)

     A    B    C   sum
0  NaN  9.0  NaN   9.0
1  4.0  8.0  6.0  18.0
2  NaN  NaN  NaN   NaN

Of course, we can assign any other threshold value to the min_count parameter if it’s necessary for a particular task. Anyway, the last approach looks much more universal!
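
For instance, with the same example DataFrame, min_count=2 would require at least two valid values per row before a sum is computed (a small sketch):

# require at least 2 non-NaN values per row, otherwise the sum is NaN
df['sum'] = df.sum(axis=1, min_count=2)
print(df)

     A    B    C   sum
0  NaN  9.0  NaN   NaN
1  4.0  8.0  6.0  18.0
2  NaN  NaN  NaN   NaN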