In section 7/14 of the ‘Sampling’ Missions, the following code is presented in the answer:
# Stratifying the data in five strata
stratum_G = wnba[wnba.Pos == 'G']
stratum_F = wnba[wnba.Pos == 'F']
stratum_C = wnba[wnba.Pos == 'C']
stratum_GF = wnba[wnba.Pos == 'G/F']
stratum_FC = wnba[wnba.Pos == 'F/C']
In the above, is it the same as stratum_G = wnba[wnba['Pos'] == 'G']?
The latter is closer to what was learned in the earlier Python missions. Using .Pos seems like a slightly more efficient way to write it, if that is indeed what it is.
Since sampling came right after APIs and web scraping (at least in the analyst path), is this referencing the .class and #type lessons with HTML?
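A quick way to check the question above is to run both filters and compare the results. The sketch below assumes a tiny, made-up stand-in for the wnba dataset (the names and rows are hypothetical; only the Pos column matters). The dot here is ordinary Python attribute access on the DataFrame, unrelated to the .class and #id selectors used with HTML.

```python
import pandas as pd

# Hypothetical stand-in for the wnba dataset; only the Pos column matters here.
wnba = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C", "Player D"],
    "Pos": ["G", "F", "G", "G/F"],
})

by_attribute = wnba[wnba.Pos == "G"]     # dot (attribute) notation
by_brackets = wnba[wnba["Pos"] == "G"]   # bracket notation

# Both expressions build the same boolean mask, so they select the same rows.
print(by_attribute.equals(by_brackets))
```

If the two results are equal, the two spellings are interchangeable for this column.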
Agreed, I use it often, but it needs to be handled with care. This notation doesn’t always work. For instance, when the column name has a space or other special characters, you can’t use this notation anymore.
To exemplify this and other problems, I’ll create a simple dataset — an empty one, actually.
>>> import pandas as pd
>>> just_a_variable = 1337
>>> cols = [1, "name with spaces", int, just_a_variable, "shape"]
>>> df = pd.DataFrame(columns = cols)
>>> print(df)
Empty DataFrame
Columns: [1, name with spaces, <class 'int'>, 1337, shape]
Index: []
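Now let’s see what happens when we try to use dot notation on these columns:
Dot notation only works when the column label is a string that is a valid Python identifier and does not clash with an existing DataFrame attribute. The integer 1, the int class, and the value stored in just_a_variable (1337) are not such strings, and shape clashes with an attribute: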
>>> df.1
File "<stdin>", line 1
df.1
^
SyntaxError: invalid syntax
>>>
>>> df.int
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/bruno/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py", line 5179, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'int'
>>>
>>> df.just_a_variable
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/bruno/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py", line 5179, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'just_a_variable'
>>>
>>> df.shape
(0, 5)
Notice, in addition, that shape was interpreted as the DataFrame attribute, not as the column name.
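Bracket notation avoids this ambiguity. The following is a minimal sketch that rebuilds the same empty DataFrame and compares the two lookups; the variable name shape_column is introduced here only for illustration.

```python
import pandas as pd

just_a_variable = 1337
cols = [1, "name with spaces", int, just_a_variable, "shape"]
df = pd.DataFrame(columns=cols)

# Attribute access returns the DataFrame's own shape attribute.
print(df.shape)

# Bracket access returns the column labelled "shape" as a Series.
shape_column = df["shape"]
print(type(shape_column))

# Non-string labels are reachable through the label value itself.
print(df[1].name, df[just_a_variable].name)
```

One caveat with this particular DataFrame: the int column is best left out of such a lookup, since pandas treats a callable passed to df[...] as a function to apply rather than as a label.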
You can’t create new columns with dot notation:
>>> df.new_column = []
__main__:1: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
For your convenience, here is the link given in the warning message above: https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
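To see the difference between the two kinds of assignment, here is a small sketch on a made-up three-row DataFrame. The column names via_attribute and via_brackets are hypothetical, and the warning is silenced only to keep the output short.

```python
import warnings
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# Attribute assignment only sets a plain Python attribute on the object.
# pandas emits a UserWarning and no column is created.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", UserWarning)
    df.via_attribute = [10, 20, 30]

# Bracket assignment creates a real column.
df["via_brackets"] = [10, 20, 30]

print(list(df.columns))
```

Only the column assigned with brackets should appear in the printed list.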
You can’t use dot notation with “complex” names:
>>> df.name with spaces
File "<stdin>", line 1
df.name with spaces
^
SyntaxError: invalid syntax
>>>
>>> df."name with spaces"
File "<stdin>", line 1
df."name with spaces"
^
SyntaxError: invalid syntax
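For names like this, bracket notation is the way in. The sketch below uses a small made-up DataFrame (the rows are hypothetical) and also shows one possible workaround, renaming the columns so that dot notation becomes available. Note that str.isidentifier() only checks the syntax of a label; it does not detect clashes with existing attributes such as shape.

```python
import pandas as pd

df = pd.DataFrame({"name with spaces": ["Ann", "Bob"], "Pos": ["G", "F"]})

# Bracket notation accepts any label, including one with spaces.
names = df["name with spaces"]
print(names.tolist())

# A label can only follow a dot if it is a valid Python identifier.
for col in df.columns:
    print(repr(col), col.isidentifier())

# Optional workaround: replace spaces so the label becomes an identifier.
renamed = df.rename(columns=lambda c: c.replace(" ", "_"))
print(renamed.name_with_spaces.tolist())
```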