What is .str[]?

In Data cleaning basics, lesson 8. Extracting Values from Strings, you use a function .str[0] that returns item 0 from the list that was just created. However, you haven’t explained that method, nor could I find python documentation on how to use it. Can you link to documentation for that usage of .str?

Thanks,
Aaron

Hi @Aaronask,

It’s good practice to include both a link to the mission, as well as a snippet of the code you’re talking about. This provides more context on the question and makes it easier for others to help you!

Mission link: https://app.dataquest.io/m/293/data-cleaning-basics/8/extracting-values-from-strings

We’ll start by examining what precisely it is we’re working with. This is an example of a series that’s generated by using the Series.str.split() method on a column:

img
The above series is what we’re trying to work with. Each element in that series is a list, and what we’re after is the very first object inside said lists.

If we just tried indexing the 0th element of that Series, we’d get only a single item:

img2

Instead of getting the first element of each list, we simply get the entirety of the very first list in that series. Using the Series.str accessor again right after the split fixes this because, to put it somewhat crudely, it tells pandas to apply the next operation (in this case the [0] indexing) to each item in the series individually.

That’s why the following works:

img3
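
In case the screenshots don’t render, here is a minimal sketch of the same three steps on a made-up Series. The names s and parts and the sample strings are illustrative, not taken from the mission. It also uses Series.str.get, which the pandas documentation describes as extracting an element from lists, tuples, dicts, or strings in each element; as far as I can tell, .str[0] is shorthand for .str.get(0).

```python
import pandas as pd

# Illustrative data, not the mission's dataset.
s = pd.Series(["Radial Velocity", "Transit", "Imaging"])

# Each element of `parts` is a Python list of words.
parts = s.str.split()

# Plain indexing selects by label: the whole list stored at label 0.
first_row = parts[0]

# The .str accessor indexes inside every element instead.
first_words = parts.str[0]

# Series.str.get performs the same element-wise lookup.
same_words = parts.str.get(0)

print(first_row)
print(first_words.tolist())
print(same_words.tolist())
```

If that reading is right, the documentation for Series.str.get is the place to look for this usage.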


Hey, Aaron.

BBP already explained what it is doing. I’d like to explain how it is doing what it is doing, but I got lost investigating the source code, and couldn’t find any documentation. Nevertheless, I still have some insights to share.

Before I move on to the concrete stuff I found, I’ll say that I suspect the fact that this works at all is a coincidence, possibly a bug.

To make it easier for people (including myself) to test this locally, I’ll be using a universally accessible data set.

>>> import pandas as pd
>>> import seaborn as sns
>>> planets = sns.load_dataset("planets")
>>> planets.head()
            method  number  orbital_period   mass  distance  year
0  Radial Velocity       1         269.300   7.10     77.40  2006
1  Radial Velocity       1         874.774   2.21     56.95  2008
2  Radial Velocity       1         763.000   2.60     19.84  2011
3  Radial Velocity       1         326.030  19.40    110.62  2007
4  Radial Velocity       1         516.220  10.50    119.47  2009

The types are what we would expect:

>>> planets.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1035 entries, 0 to 1034
Data columns (total 6 columns):
method            1035 non-null object
number            1035 non-null int64
orbital_period    992 non-null float64
mass              513 non-null float64
distance          808 non-null float64
year              1035 non-null int64
dtypes: float64(3), int64(2), object(1)
memory usage: 48.6+ KB

String accessors are supposed to be used with strings:

image

What we’re seeing here is that they work with objects other than strings as well. Let’s reproduce this behavior with the planets dataframe.

>>> series_list = planets.method.str.split()
>>> series_list = series_list.sample(5)
>>> series_list
322    [Radial, Velocity]
583    [Radial, Velocity]
702             [Transit]
145    [Radial, Velocity]
35              [Imaging]
Name: method, dtype: object
>>> type(series_list.loc[322])
<class 'list'>

We confirm that the values in series_list are actually lists, as we expected. As was mentioned above, using a string accessor on a series that isn’t built of strings shouldn’t work (as per the documentation and common sense). However…

>>> series_list.str[0]
322     Radial
583     Radial
702    Transit
145     Radial
35     Imaging
Name: method, dtype: object

It worked. What makes me believe that this isn’t supposed to happen is that it doesn’t work with other data types (at least not all of them).

Here it is failing on an integer column:
>>> planets.year.str[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/bruno/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py", line 5175, in __getattr__
    return object.__getattribute__(self, name)
  File "/home/bruno/anaconda3/lib/python3.7/site-packages/pandas/core/accessor.py", line 175, in __get__
    accessor_obj = self._accessor(obj)
  File "/home/bruno/anaconda3/lib/python3.7/site-packages/pandas/core/strings.py", line 1917, in __init__
    self._inferred_dtype = self._validate(data)
  File "/home/bruno/anaconda3/lib/python3.7/site-packages/pandas/core/strings.py", line 1967, in _validate
    raise AttributeError("Can only use .str accessor with string " "values!")
AttributeError: Can only use .str accessor with string values!
>>> planets.year.str
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/bruno/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py", line 5175, in __getattr__
    return object.__getattribute__(self, name)
  File "/home/bruno/anaconda3/lib/python3.7/site-packages/pandas/core/accessor.py", line 175, in __get__
    accessor_obj = self._accessor(obj)
  File "/home/bruno/anaconda3/lib/python3.7/site-packages/pandas/core/strings.py", line 1917, in __init__
    self._inferred_dtype = self._validate(data)
  File "/home/bruno/anaconda3/lib/python3.7/site-packages/pandas/core/strings.py", line 1967, in _validate
    raise AttributeError("Can only use .str accessor with string " "values!")
AttributeError: Can only use .str accessor with string values!

When using it with numeric dtypes, we get the message Can only use .str accessor with string values! Given that the accessor just worked on lists, that message is evidently false, so we are positive that there is at least one bug: either an incorrect error message or incorrect behavior when the elements of a series are lists.
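
Note that the second traceback comes from planets.year.str alone, so the check runs when the accessor is created, before any indexing. A small sketch of that, using made-up data and a hypothetical helper has_str_accessor that is not part of pandas:

```python
import pandas as pd

# Made-up stand-ins for planets.year and the split method column.
years = pd.Series([2006, 2008, 2011])
words = pd.Series([["Radial", "Velocity"], ["Transit"]])


def has_str_accessor(series):
    """Hypothetical helper: True if creating the .str accessor succeeds."""
    try:
        series.str
        return True
    except AttributeError:
        return False


print(years.dtype, has_str_accessor(years))
print(words.dtype, has_str_accessor(words))
```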

Let’s explore series_list.str without indexing.

>>> obj = series_list.str
>>> type(obj)
<class 'pandas.core.strings.StringMethods'>
>>> print(obj)
<pandas.core.strings.StringMethods object at 0x7fcd86c48cd0>

Moving on, we already saw that obj supports indexing, so we can also loop over it. Let’s take a look at its elements:

>>> for x in obj:
...     print(type(x), x, sep="\n")
... 
<class 'pandas.core.series.Series'>
322     Radial
583     Radial
702    Transit
145     Radial
35     Imaging
Name: method, dtype: object
<class 'pandas.core.series.Series'>
322    Velocity
583    Velocity
702         NaN
145    Velocity
35          NaN
Name: method, dtype: object

We see that each element in obj is a series. Now it’s easy to understand that series_list.str[0] (which is the same as obj[0]) is simply retrieving the first element of obj, which is a series consisting of the first element of each list in series_list.

Another way to think about series_list.str is as the result of breaking the lists up so that each position becomes its own series. Another way to do that is the following.

>>> series_list.apply(pd.Series)
           0         1
322   Radial  Velocity
583   Radial  Velocity
702  Transit       NaN
145   Radial  Velocity
35   Imaging       NaN

So, basically, obj[0] is fetching the series that comes at the integer location 0.
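
Looping over .str worked in the pandas version used here, but it may not be available in every version, so I wouldn’t rely on it. A sketch of building the same table by indexing each position explicitly, with made-up data shaped like series_list (the names n_positions and table are mine):

```python
import pandas as pd

# Made-up data with the same shape as series_list above.
series_list = pd.Series(
    [["Radial", "Velocity"], ["Transit"], ["Imaging"]],
    index=[322, 702, 35],
)

# The longest list decides how many positions there are.
n_positions = int(series_list.str.len().max())

# One column per position; shorter lists get NaN in the later columns.
table = pd.DataFrame({i: series_list.str[i] for i in range(n_positions)})
print(table)
```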

My investigation of the source code:

While investigating the source code, I landed on the definition of the class StringMethods, which defines a method that acts as the first line of defense against data types with which string accessors shouldn’t work:

image

We can see that string accessors should only work with types (as returned by the lib.infer_dtype function) that are in the list ["string", "empty", "bytes", "mixed", "mixed-integer"].

I also investigated the function lib.infer_dtype and concluded that lists are classified as mixed by default, i.e., they aren’t any of the other types.
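
This classification can be checked without reading the source, assuming pandas.api.types.infer_dtype is the public entry point for the same inference. The data below is made up:

```python
import pandas as pd
from pandas.api.types import infer_dtype

strings = pd.Series(["Radial Velocity", "Transit"], dtype=object)
lists = pd.Series([["Radial", "Velocity"], ["Transit"]])
integers = pd.Series([2006, 2008])

# Label pandas infers for each kind of content.
string_kind = infer_dtype(strings)
list_kind = infer_dtype(lists)
integer_kind = infer_dtype(integers)

print(string_kind, list_kind, integer_kind)
```

Only the first two labels appear in the allowed list quoted above, which matches what we saw: lists pass the check and integers don’t.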

All of this explains why it works, but not how it works. That, I couldn’t figure out, but I suspect that this ultimately working might be a fluke.

I hope this helps.


Thanks for the thorough testing!

I had to go try it out on my own, and I also saw that we run into a similar issue trying to achieve the same task using a different approach, like the apply method.

Trying the same with float or int columns results in the expected 'float'/'int' object is not subscriptable error, so that makes me wonder if subscriptability has some part to play in it.
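
For reference, a minimal sketch of that apply comparison on made-up data (first_item is a hypothetical helper, not something from the mission):

```python
import pandas as pd


def first_item(x):
    # Plain Python subscripting; raises TypeError for ints and floats.
    return x[0]


lists = pd.Series([["Radial", "Velocity"], ["Transit"]])
years = pd.Series([2006, 2008])

first_words = lists.apply(first_item).tolist()
print(first_words)

error_message = None
try:
    years.apply(first_item)
except TypeError as err:
    error_message = str(err)
print(error_message)
```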

I also realized another thing. After invoking the Series.str method, you don’t get the expected “out of bounds” index error.

img2
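
In case the screenshot doesn’t render, a sketch of the same observation on made-up data: position 5 doesn’t exist in any of the lists, yet the accessor returns missing values, while plain subscripting through apply raises.

```python
import pandas as pd

series_list = pd.Series([["Radial", "Velocity"], ["Transit"]])

# No IndexError here: every result is a missing value.
out_of_range = series_list.str[5]
print(out_of_range.isna().tolist())

# The plain-Python equivalent does raise.
raised_index_error = False
try:
    series_list.apply(lambda words: words[5])
except IndexError:
    raised_index_error = True
print(raised_index_error)
```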

I expect that subscriptability does have some part to play in this, but it looks like it’s not in any obvious way:

>>> series_set = series_list.apply(set)
>>> series_set
322    {Radial, Velocity}
583    {Radial, Velocity}
702             {Transit}
145    {Radial, Velocity}
35              {Imaging}
Name: method, dtype: object
>>> series_set.str[0]
322   NaN
583   NaN
702   NaN
145   NaN
35    NaN
Name: method, dtype: float64
>>> for x in series_set.str: print(x)
... 
>>> # It didn't print anything

Note how it didn’t yield an error, contrary to what happens with integers.
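
One way to picture the behavior seen so far is a lookup that swallows failures and returns NaN instead. The function below is a toy model in plain Python, named get_or_nan by me; it is a guess at the observable behavior and not pandas’ actual code. It also doesn’t cover integers, since those are rejected earlier, when the accessor is created.

```python
import math


def get_or_nan(element, position):
    """Toy model (an assumption): index into element, or give NaN on failure."""
    try:
        return element[position]
    except (TypeError, IndexError, KeyError):
        return math.nan


samples = [["Radial", "Velocity"], ["Transit"], {"Radial", "Velocity"}]

# Lists give their item, short lists and sets give NaN.
print([get_or_nan(x, 0) for x in samples])
print([get_or_nan(x, 1) for x in samples])
```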