Inserting a column - Cross Validation

Screen Link: https://app.dataquest.io/m/154/cross-validation/3/k-fold-cross-validation

```
dc_listings["fold"].iloc[0:745] = 1
dc_listings["fold"].iloc[745:1490] = 2
dc_listings["fold"].iloc[1490:2234] = 3
dc_listings["fold"].iloc[2234:2978] = 4
dc_listings["fold"].iloc[2978:3723] = 5
```

This is the code I'm using to insert a column with my fold variable, though the answer isn't accepted. It looks like the issue is that I'm generating an int instead of a float?

How is the above code worse than the example code below?

```
dc_listings.loc[dc_listings.index[0:745], "fold"] = 1
dc_listings.loc[dc_listings.index[745:1490], "fold"] = 2
dc_listings.loc[dc_listings.index[1490:2234], "fold"] = 3
dc_listings.loc[dc_listings.index[2234:2978], "fold"] = 4
dc_listings.loc[dc_listings.index[2978:3723], "fold"] = 5
```

Hi @andreas.varotsis

Welcome to our Dataquest Community.

Let’s go through them one by one.

First one:

In short, this code replaces existing values in the fold column. But according to the instructions on the screen, you have to create a new column and then assign the values to it.

Your code first selects the fold column and then assigns a value to particular positions. Notice that there is no column named fold in the DataFrame yet, so it raises a KeyError, i.e., the column is not found.
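Here is a minimal sketch of that failure. It uses a small made-up DataFrame, so the name `listings` and the `price` values are assumptions standing in for dc_listings:

```python
import pandas as pd

# Hypothetical stand-in for dc_listings; it has no "fold" column yet.
listings = pd.DataFrame({"price": [100, 150, 80, 120]})

try:
    # The column lookup runs first, so this fails before any value is assigned.
    listings["fold"].iloc[0:2] = 1
    error_name = None
except KeyError as exc:
    error_name = type(exc).__name__

print(error_name)
```

The lookup `listings["fold"]` is what raises the error, so the `.iloc` assignment is never reached.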

Second one:

This assigns values in the fold column according to the index, using the pattern:

dataframe.loc[index, column] = value

In this code, you first select particular rows by their index labels, and then assign the value to the given column for those rows.
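As a rough illustration of that pattern on a small made-up DataFrame (the name `listings` and the `price` values are assumptions, not the real dc_listings data):

```python
import pandas as pd

# Hypothetical stand-in for dc_listings with six rows.
listings = pd.DataFrame({"price": [100, 150, 80, 120, 90, 60]})

# The first assignment creates the "fold" column; rows that are not
# selected are left as NaN until a later assignment fills them.
listings.loc[listings.index[0:3], "fold"] = 1
after_first = listings["fold"].tolist()

listings.loc[listings.index[3:6], "fold"] = 2

print(after_first)
print(listings)
```

Because `listings.index[0:3]` is sliced by position, it selects the first three rows whatever their labels are.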

I hope this helps :slightly_smiling_face: .

Hi there,
I have another question about this. I used a similar method and wrote the following code:

dc_listings['fold'] = 0
dc_listings['fold'].iloc[0:745] = 1
dc_listings['fold'].iloc[745:1490] = 2
dc_listings['fold'].iloc[1490:2234] = 3
dc_listings['fold'].iloc[2234:2978] = 4
dc_listings['fold'].iloc[2978:3723] = 5

I understand that the mechanism here is different: first I create the column and then I replace the values. The column that I create is initially assigned an int equal to 0, so when I change specific values my output is also int.
Had I used the assignment dc_listings['fold'] = 0. (a float literal), the Series ‘fold’ would be a float too.

But I just don’t get why, when using the solution code, which is:

dc_listings.loc[dc_listings.index[0:745], "fold"] = 1
dc_listings.loc[dc_listings.index[745:1490], "fold"] = 2
dc_listings.loc[dc_listings.index[1490:2234], "fold"] = 3
dc_listings.loc[dc_listings.index[2234:2978], "fold"] = 4
dc_listings.loc[dc_listings.index[2978:3723], "fold"] = 5

the values automatically become floats and not ints.

Hi @k.wiktorski.it,

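If we only fill values for specific rows, then the rest of the rows must be assigned NaN. In the solution code, the first .loc assignment creates the fold column and every row outside the slice gets NaN. This is why pandas converts the column to float: NaN is a float value.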

print(type(numpy.nan))
<class 'float'>

On the other hand, dc_listings['fold'] = 0 is assigning an integer value to all rows. Therefore, the column is set to integer.
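A small sketch comparing the two cases side by side. The DataFrame and the column names `fold_int` and `fold_float` are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for dc_listings.
listings = pd.DataFrame({"price": [100, 150, 80, 120]})

# Every row receives an integer, so the column gets an integer dtype.
listings["fold_int"] = 0

# Only two rows receive a value; the others become NaN, forcing a float dtype.
listings.loc[listings.index[0:2], "fold_float"] = 1

nan_type = type(np.nan)
int_dtype = str(listings["fold_int"].dtype)
float_dtype = str(listings["fold_float"].dtype)

print(nan_type, int_dtype, float_dtype)
```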

Best,
Sahil

Hi Sahil,

I tried to use this code for the assignment:
dc_listings['fold'] = 0
dc_listings[0:745]['fold'] = 1
dc_listings[745:1490]['fold'] = 2
dc_listings[1490:2234]['fold'] = 3
dc_listings[2234:2978]['fold'] = 4
dc_listings[2978:3723]['fold'] = 5

but it is not working correctly, and I am not able to figure out what I am doing wrong here. Could you please help? Thanks.

Edit:
Whereas this code I tried works correctly:
dc_listings.loc[0:745, 'fold'] = 1
dc_listings.loc[745:1490, 'fold'] = 2
dc_listings.loc[1490:2234, 'fold'] = 3
dc_listings.loc[2234:2978, 'fold'] = 4
dc_listings.loc[2978:3723, 'fold'] = 5

I understand that I am missing some basic point, but I am unable to figure it out.

Hi @sohamdey.tcs,

You are trying to assign a value using chained indexing. Chained indexing creates a temporary object that we don’t have access to, and the values are assigned to that temporary object. That is why you are unable to see any change in your dc_listings dataframe.

For example, let’s split this line to mimic what is going on with your assignment.

dc_listings[0:745]['fold'] = 1

temp = dc_listings[0:745]
temp['fold'] = 1

1 was assigned to this inaccessible temporary object instead of the original dataframe.
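One way to avoid the intermediate object is to select the rows and the column in a single indexing call. The sketch below assumes a small made-up DataFrame named `listings` and uses `.iloc` with the column's position, which is only one possible approach:

```python
import pandas as pd

# Hypothetical stand-in for dc_listings.
listings = pd.DataFrame({"price": [100, 150, 80, 120]})
listings["fold"] = 0

# A single .iloc call picks rows by position and the column by position,
# so the assignment targets listings itself rather than a temporary object.
fold_col = listings.columns.get_loc("fold")
listings.iloc[0:2, fold_col] = 1
listings.iloc[2:4, fold_col] = 2

print(listings["fold"].tolist())
```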

You can read more about it in the Assigning a new value to a list with chained indexing section of this article:

Best,
Sahil

This is the code I'm using for inserting a new column with specific values. It runs, but it is not accepted by the system:
dc_listings['fold'] = 5
dc_listings.loc[0:745, 'fold'] = 1
dc_listings.loc[745:1490, 'fold'] = 2
dc_listings.loc[1490:2234, 'fold'] = 3
dc_listings.loc[2234:2978, 'fold'] = 4
print(dc_listings['fold'].value_counts())
print(dc_listings['fold'].isnull().sum())

As you see, first I populate the whole of the new column with value 5 and then I change it bit by bit. What have I done wrong?

There are a couple of things going on here:

  1. Since you are populating the fold column entirely with 5 upon initialization, the column is initialized as datatype int whereas the answer is expecting a float as per Sahil’s response above.

  2. Slicing with .loc is a bit different from slicing in most other places in Python. As per the official documentation:

Warning

Note that contrary to usual python slices, both the start and the stop are included

To get your code to be accepted by the system you will need to cast the fold column as a float as well as modify your indices knowing that both start and stop are included.

For example, as it stands, your code causes dc_listings.loc[2978, 'fold'] to be set to 4 where it should be 5.
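A small-scale sketch of both points, using a made-up ten-row DataFrame with default integer labels (the name `listings` and the sizes are assumptions for illustration):

```python
import pandas as pd

# Hypothetical stand-in for dc_listings with index labels 0 through 9.
listings = pd.DataFrame({"price": range(10)})
listings["fold"] = 5

# .loc slices by label and includes the stop label: labels 0..5, six rows.
listings.loc[0:5, "fold"] = 1

n_loc = len(listings.loc[0:5])    # stop label included
n_iloc = len(listings.iloc[0:5])  # stop position excluded, like a Python slice
fold_at_5 = int(listings.loc[5, "fold"])

# Casting the finished column to float, as the answer checker expects.
listings["fold"] = listings["fold"].astype(float)
final_dtype = str(listings["fold"].dtype)

print(n_loc, n_iloc, fold_at_5, final_dtype)
```

Here the row labeled 5 ends up in fold 1, which is the same off-by-one effect as label 2978 ending up in fold 4.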

I see. Thank you for reminding me that .loc includes both the start and the end. I had totally forgotten it! So, if I change the ends and starts of my slices it should work, but it still will not conform to the float requirement.
On the other hand, it seems counterintuitive to me to slice a dataframe using “index[…]” for the rows. It’s the first time I’ve seen an indexing technique like this.
So, I did it like this and it was accepted!

dc_listings['fold'] = 5.0

dc_listings.loc[0:744, 'fold'] = 1
dc_listings.loc[745:1489, 'fold'] = 2
dc_listings.loc[1490:2233, 'fold'] = 3
dc_listings.loc[2234:2977, 'fold'] = 4
print(dc_listings['fold'].value_counts())
print(dc_listings['fold'].isnull().sum())