10 - analyzing-missing-data Wrong code

Screen Link:
https://app.dataquest.io/m/347/working-with-missing-and-duplicate-data/10/analyzing-missing-data

My Code:

sorted = combined.set_index('REGION').sort_values(['REGION', 'HAPPINESS SCORE'])
sns.heatmap(sorted.isnull(), cbar=False)


What I expected to happen:
The code above, which is given on https://app.dataquest.io/m/347/working-with-missing-and-duplicate-data/10/analyzing-missing-data, is wrong: it raises an error because REGION is no longer a column after set_index.

What actually happened:

KeyError Traceback (most recent call last)
/dataquest/system/env/python3/lib/python3.4/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2524 try:
-> 2525 return self._engine.get_loc(key)
2526 except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'REGION'

During handling of the above exception, another exception occurred:

KeyError Traceback (most recent call last)
in ()
8 missing = combined.isnull().sum()
9
---> 10 sorted = combined.set_index('REGION').sort_values(['REGION', 'HAPPINESS SCORE'])
11 sns.heatmap(sorted.isnull(), cbar=False)
12

/dataquest/system/env/python3/lib/python3.4/site-packages/pandas/core/frame.py in sort_values(self, by, axis, ascending, inplace, kind, na_position)
3605 keys =
3606 for x in by:
-> 3607 k = self.xs(x, axis=other_axis).values
3608 if k.ndim == 2:
3609 raise ValueError('Cannot sort by duplicate column s'

/dataquest/system/env/python3/lib/python3.4/site-packages/pandas/core/generic.py in xs(self, key, axis, level, drop_level)
2333
2334 if axis == 1:
-> 2335 return self[key]
2336
2337 self._consolidate_inplace()

/dataquest/system/env/python3/lib/python3.4/site-packages/pandas/core/frame.py in __getitem__(self, key)
2137 return self._getitem_multilevel(key)
2138 else:
-> 2139 return self._getitem_column(key)
2140
2141 def _getitem_column(self, key):

/dataquest/system/env/python3/lib/python3.4/site-packages/pandas/core/frame.py in _getitem_column(self, key)
2144 # get column
2145 if self.columns.is_unique:
-> 2146 return self._get_item_cache(key)
2147
2148 # duplicate columns & possible reduce dimensionality

/dataquest/system/env/python3/lib/python3.4/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
1840 res = cache.get(item)
1841 if res is None:
-> 1842 values = self._data.get(item)
1843 res = self._box_item_values(item, values)
1844 cache[item] = res

/dataquest/system/env/python3/lib/python3.4/site-packages/pandas/core/internals.py in get(self, item, fastpath)
3841
3842 if not isna(item):
-> 3843 loc = self.items.get_loc(item)
3844 else:
3845 indexer = np.arange(len(self.items))[isna(self.items)]

/dataquest/system/env/python3/lib/python3.4/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2525 return self._engine.get_loc(key)
2526 except KeyError:
-> 2527 return self._engine.get_loc(self._maybe_cast_indexer(key))
2528
2529 indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'REGION'

I used below code for analysis:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


combined = combined.dropna(thresh=159,axis='columns')
missing = combined.isnull().sum()

# sorted = combined.set_index('REGION')
# sorted = sorted.sort_index(ascending=False).sort_values(['HAPPINESS SCORE'])
# # sort_values(['REGION', 'HAPPINESS SCORE'])
# sns.heatmap(sorted.isnull(), cbar=False)

# combined_updated = combined.set_index('YEAR')
# sns.heatmap(combined_updated.isnull(), cbar=True,cmap="Blues_r")


sorted =  combined.set_index('REGION').sort_index(ascending=True)
plot_ = sns.heatmap(sorted.isnull(), cbar=True,cmap="Blues_r")

for ind, label in enumerate(plot_.get_yticklabels()):
    if ind % 10 == 0:  # every 10th label is kept
        label.set_visible(True)
    else:
        label.set_visible(False)
        
        
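# Rows where FAMILY is missing
zz = combined[combined["FAMILY"].isnull()]
zz["REGION"].unique()

zz[zz["REGION"] == "Southeastern Asia"]

# Row counts per region: overall, and among the rows with missing FAMILY
c_values = combined["REGION"].value_counts()
zz_values = zz["REGION"].value_counts()
# c_values[c_values.index == zz_values.index]

# Keep only the regions that have at least one missing FAMILY value
qw = c_values.index.isin(zz_values.index)

c_values_missing = c_values[qw]

# Percentage of each region's rows where FAMILY is missing
(zz_values / c_values_missing) * 100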



Hi @eashwary

I have following questions for you:

  • sorted is a built-in function. Why are you assigning a dataframe to it? What is the idea behind this step?

  • Are you also using a Jupyter notebook in parallel with the IDE to get a better understanding of the workflow? You can do that by downloading the datasets to your machine and opening them in Jupyter Notebook. (Just a thought.)

  • How are we going to interpret this generated plot?
    image

The REGION error is probably caused by REGION no longer being a column: after set_index, it is the index of the dataframe.
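One way to avoid this KeyError is to sort while REGION is still a column and set the index afterwards. Below is a minimal sketch on a small made-up dataframe (the values are hypothetical, not the course dataset, and sorted_df is just an illustrative name that avoids shadowing the built-in sorted). As far as I know, pandas 0.23 and later also accept index level names in sort_values, so the original line may work unchanged on a newer version.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the course's `combined` dataframe.
combined = pd.DataFrame({
    "REGION": ["Western Europe", "Southeastern Asia",
               "Western Europe", "Southeastern Asia"],
    "HAPPINESS SCORE": [7.5, np.nan, 6.9, 5.2],
    "FAMILY": [1.3, np.nan, np.nan, 1.0],
})

# Sort first (REGION is still a column here), then move REGION into the index.
sorted_df = combined.sort_values(["REGION", "HAPPINESS SCORE"]).set_index("REGION")

# Boolean mask of missing values; this is what gets passed to the heatmap.
null_mask = sorted_df.isnull()
print(null_mask)
```

The mask can then be plotted as in the original code, for example with sns.heatmap(null_mask, cbar=False).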

The code given on the Dataquest page https://app.dataquest.io/m/347/working-with-missing-and-duplicate-data/10/analyzing-missing-data is wrong. I tried running it, and yes, they have set REGION as the index; I wanted to share that.

I also tried to recreate the heat map shown on https://app.dataquest.io/m/347/working-with-missing-and-duplicate-data/10/analyzing-missing-data. My visualisation has some drawbacks: if I keep only every 10th label, some region labels are missed; if I show all the labels, they overlap and cannot be told apart.

I wanted a clean heat map for null value analysis, like the one given on that page.

Hi @eashwary
I completely forgot that the author themselves used the name sorted in the instructions.

Okay let’s try this again.

In your code, consider only the part below. It generates the heatmap.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


combined = combined.dropna(thresh=159,axis='columns')
missing = combined.isnull().sum()

# sorted = combined.set_index('REGION')
# sorted = sorted.sort_index(ascending=False).sort_values(['HAPPINESS SCORE'])
# # sort_values(['REGION', 'HAPPINESS SCORE'])
# sns.heatmap(sorted.isnull(), cbar=False)

# combined_updated = combined.set_index('YEAR')
# sns.heatmap(combined_updated.isnull(), cbar=True,cmap="Blues_r")


sorted =  combined.set_index('REGION').sort_index(ascending=True)
plot_ = sns.heatmap(sorted.isnull(), cbar=True,cmap="Blues_r")

plt.show()

If you add the for-loop below before plt.show(), only every 10th label stays visible.

for ind, label in enumerate(plot_.get_yticklabels()):
    if ind % 10 == 0:  # every 10th label is kept
        label.set_visible(True)
    else:
        label.set_visible(False)
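As an alternative to keeping every 10th label, one label per region could be shown at the row where that region starts, so no region is skipped. The helper below is a hypothetical sketch (region_tick_positions is not part of seaborn or pandas); it assumes the rows are already sorted by REGION and that the heatmap centres row i at y = i + 0.5.

```python
def region_tick_positions(index_labels):
    """Return (positions, labels) with one tick per run of equal labels.

    Each position is the centre (row + 0.5) of the first row of a run.
    """
    positions, labels = [], []
    previous = object()  # sentinel that never equals a real label
    for row, label in enumerate(index_labels):
        if label != previous:
            positions.append(row + 0.5)
            labels.append(label)
            previous = label
    return positions, labels


# Example with made-up region names:
positions, labels = region_tick_positions(["A", "A", "B", "C", "C"])
print(positions, labels)

# Possible use with the heatmap axes from the code above (not run here):
# positions, labels = region_tick_positions(sorted.index)
# plot_.set_yticks(positions)
# plot_.set_yticklabels(labels, rotation=0)
```

If a fixed spacing is enough, sns.heatmap also takes a yticklabels argument; passing an integer such as yticklabels=10 should plot only every 10th label without the loop.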

I can't find instructions for the last part of the code, though. Can you elaborate on the thought process behind it?
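For reference, the last part appears to compute, for each region that has missing FAMILY values, the percentage of that region's rows where FAMILY is missing. That is my reading of the code, not something stated in the post. A shorter sketch of the same idea on made-up data (regions "A" and "B" are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: region A has 1 of 3 FAMILY values missing, B has none.
combined = pd.DataFrame({
    "REGION": ["A", "A", "A", "B", "B"],
    "FAMILY": [1.0, np.nan, 0.8, 1.2, 0.9],
})

# The mean of a boolean mask is the fraction of True values, computed per region.
missing_pct = combined["FAMILY"].isnull().groupby(combined["REGION"]).mean() * 100

# Keep only regions with at least one missing value, as the original code does.
missing_pct = missing_pct[missing_pct > 0]
print(missing_pct)
```

This avoids building the two value_counts series and aligning them by hand.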

I took the liberty of appending your code in my own practice book to add comments to the last part of the code. I hope you don’t mind that.

Missing&DuplicateData_Appended.zip (125.0 KB)