283-7 Simplifying Stratum Code

In section 7/14 of the ‘Sampling’ Missions, the following code is presented in the answer:

# Stratifying the data in five strata
stratum_G = wnba[wnba.Pos == 'G']
stratum_F = wnba[wnba.Pos == 'F']
stratum_C = wnba[wnba.Pos == 'C']
stratum_GF = wnba[wnba.Pos == 'G/F']
stratum_FC = wnba[wnba.Pos == 'F/C'] 

Is the above the same as stratum_G = wnba[wnba['Pos'] == 'G']?
The latter is closer to what was taught in the earlier Python missions. Using .Pos seems like a slightly more concise way to write it, if that is indeed what it is.

Since sampling came right after APIs and web scraping (at least in the analyst path), is this referencing the .class and #type lessons on HTML?

Thanks!


Hey, Chris.

Nope, this has nothing to do with the .class and #type selectors from the HTML lessons.

Yes, wnba.Pos is the same as wnba['Pos'].

Agreed, and I use it often, but it needs to be handled with care. Dot notation doesn’t always work. For instance, when the column name has a space or other special characters, you can’t use it and have to fall back to brackets.
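As a quick check of the equivalence, here is a minimal sketch on a made-up frame. The name toy and its contents are hypothetical stand-ins, not the real wnba data.

```python
import pandas as pd

# Hypothetical stand-in for the wnba data; only the 'Pos' column matters here.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Pos": ["G", "F", "G", "C"],
})

# Attribute access and bracket access return the same column.
same_column = toy.Pos.equals(toy["Pos"])

# So the two boolean masks select the same rows.
stratum_G_dot = toy[toy.Pos == "G"]
stratum_G_brackets = toy[toy["Pos"] == "G"]
same_rows = stratum_G_dot.equals(stratum_G_brackets)

print(same_column, same_rows)
```

Both checks should come out true, since the dot form is just a shortcut for the bracket form when the column name allows it.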

To exemplify this and other problems, I’ll create a simple dataset (an empty one, actually).

>>> import pandas as pd

>>> just_a_variable = 1337
>>> cols = [1, "name with spaces", int, just_a_variable, "shape"]
>>> df = pd.DataFrame(columns = cols)
>>> print(df)
Empty DataFrame
Columns: [1, name with spaces, <class 'int'>, 1337, shape]
Index: []

Now let’s see what happens when we try to use the dot notation of these columns:

  • Column names must be strings that are valid Python identifiers, and they can’t clash with already existing entities such as DataFrame attributes:

      >>> df.1
        File "<stdin>", line 1
          df.1
            ^
      SyntaxError: invalid syntax
      >>> 
      >>> df.int
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/home/bruno/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py", line 5179, in __getattr__
          return object.__getattribute__(self, name)
      AttributeError: 'DataFrame' object has no attribute 'int'
      >>> 
      >>> df.just_a_variable
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/home/bruno/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py", line 5179, in __getattr__
          return object.__getattribute__(self, name)
      AttributeError: 'DataFrame' object has no attribute 'just_a_variable'
      >>> 
      >>> df.shape
      (0, 5)
    

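    df.int and df.just_a_variable fail because the actual column labels are the int type and the number 1337, not strings with those names. Notice, in addition, that shape was interpreted as the DataFrame attribute, not the column name.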

  • You can’t create new columns with dot notation

      >>> df.new_column = []
      __main__:1: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
    

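    The assignment only sets a plain Python attribute on df; no column is added. For your convenience, here is the link given in the warning above: https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access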

  • You can’t use dot notation with “complex” names:

    >>> df.name with spaces
    File "<stdin>", line 1
        df.name with spaces
                ^
    SyntaxError: invalid syntax
    >>> 
    >>> df."name with spaces"
    File "<stdin>", line 1
        df."name with spaces"
                            ^
    SyntaxError: invalid syntax
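
For contrast, here is a rough sketch of the same kind of frame accessed with brackets instead. It reuses the labels from the example above except int, which is left out because pandas treats a callable passed inside brackets as a selection function rather than a label. Everything else is as in the session above.

```python
import pandas as pd

just_a_variable = 1337
# Same labels as above, minus the int type (see the note before this block).
cols = [1, "name with spaces", just_a_variable, "shape"]
df = pd.DataFrame(columns=cols)

# Bracket notation takes the label itself, so none of these raise.
for label in cols:
    column = df[label]
    print(repr(label), "->", type(column).__name__)

# Here "shape" is the column, while df.shape is still the attribute.
shape_column = df["shape"]

# New columns can be created with brackets (an empty list fits an empty frame).
df["new_column"] = []
print(list(df.columns))
```

So whenever dot notation fails or is ambiguous, brackets are the safe fallback.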
    

Thanks for the response, Bruno!

I noticed you edited the title of this thread to “283-7”. What does this mean?

No problem!

Regarding the title edit, check out the second bullet point here.

Here is a Pythonic solution:

# Assumes wnba is the DataFrame already loaded in the mission
wnba['Pts_per_game'] = wnba['PTS'] / wnba['Games Played']

# Mean points per game of a 10-player sample from each position (stratum)
pts_pos = {
    p: wnba[wnba['Pos'] == p]['Pts_per_game'].sample(10, random_state=0).mean()
    for p in wnba['Pos'].unique()
}

# Position whose sample has the highest mean
position_most_points = max(pts_pos, key=pts_pos.get)
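
Another way to write the same idea is with groupby, which forms the strata without filtering by hand. This is only a sketch on synthetic data: wnba_demo and its random values are made up for illustration, and only the column names are taken from the code above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the wnba data (values are made up for illustration).
rng = np.random.default_rng(0)
positions = ["G", "F", "C", "G/F", "F/C"]
wnba_demo = pd.DataFrame({
    "Pos": np.repeat(positions, 20),
    "PTS": rng.integers(50, 600, size=100),
    "Games Played": rng.integers(5, 35, size=100),
})
wnba_demo["Pts_per_game"] = wnba_demo["PTS"] / wnba_demo["Games Played"]

# One stratum per position: sample 10 players from each and take the mean.
strata_means = (
    wnba_demo.groupby("Pos")["Pts_per_game"]
    .apply(lambda s: s.sample(10, random_state=0).mean())
)

# Label of the stratum with the highest sample mean.
position_most_points = strata_means.idxmax()
print(strata_means)
print(position_most_points)
```

The result is a Series indexed by position, so idxmax gives the winning position directly and no intermediate dictionary is needed.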