Inconsistent variables for 135-7 (Machine Learning Project Walkthrough: Making Predictions)

Screen Link: https://app.dataquest.io/m/135/machine-learning-project-walkthrough%3A-making-predictions/7/cross-validation

My Code:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
lr = LogisticRegression()

predictions = cross_val_predict(lr, features, target, cv=3)
predictions = pd.Series(predictions)

# Check the length of the original DataFrame
print(len(loans))

# Check the lengths of the pre-assigned DataFrames and the predictions
print(len(features))
print(len(predictions))
print(len(target))

What I expected to happen:
37675
37675
37675
37675

What actually happened:

37675
38708
38708
38708

I ran into a problem with 135-7 and was curious how the provided solution produced the official answer.

Level 1

I then discovered that the features and target DataFrames are longer than the original loans DataFrame.

How did the solution arrive at its answer

It seems that comparing two pd.Series of different lengths with == raises a ValueError:
Code:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
lr = LogisticRegression()

predictions = cross_val_predict(lr, features, target, cv=3)
predictions = pd.Series(predictions)


long_short = (predictions == predictions.head(-200))

Output:

ValueError: Can only compare identically-labeled Series objects
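
The same error can be reproduced without the mission data. This is a minimal sketch with a made-up toy Series standing in for predictions (the values and the names shorter and error_message are my own, not from the mission):

```python
import pandas as pd

# Toy stand-in for the predictions Series (values are made up).
predictions = pd.Series([1, 0, 1, 1, 0])
shorter = predictions.head(-2)  # drops the last two rows

# == refuses to compare Series whose index labels differ.
try:
    predictions == shorter
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```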

The boolean operators & and |, by contrast, seem to align the two Series and treat the missing values (NaN, I would guess) as False:
Code:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
lr = LogisticRegression()

predictions = cross_val_predict(lr, features, target, cv=3)
predictions = pd.Series(predictions)


print((predictions == 1).tail(100).value_counts())
 
long_short1 = (predictions == 1) & (predictions.head(-200) == 1)
print(long_short1.tail(100).value_counts())

long_short2 = (predictions == 1) | (predictions.head(-200) == 1)
print(long_short2.tail(100).value_counts())

Output:

True    100
dtype: int64
False    100
dtype: int64
True    100
dtype: int64
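
A smaller version of the same experiment, as a rough sketch: the toy Series full and short are hypothetical stand-ins for (predictions == 1) and its truncated copy, and the behaviour may depend on the pandas version in use.

```python
import pandas as pd

# Toy boolean Series standing in for (predictions == 1); values are made up.
full = pd.Series([True, True, True, True])
short = full.head(-2)  # keeps labels 0 and 1 only; labels 2 and 3 are missing

# & and | align on the index first; labels missing from `short`
# appear to be treated as False rather than raising an error.
both = full & short
either = full | short

print(both.tolist())
print(either.tolist())
```

If the missing labels count as False, both ends in False values and either stays True throughout, which matches the value_counts output above.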

Level 2

I then investigated what difference in data preparation would generate this longer features DataFrame.

Locating the source of the deviation

The first row that is only in features has index 162.
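
One way to find such rows is to compare the index labels of the two frames. This is only a sketch with toy frames, and it assumes that loans and features keep the row labels of the same source file, which I have not verified for the mission data:

```python
import pandas as pd

# Toy frames; the real loans and features come from the mission environment.
loans = pd.DataFrame({"emp_length": [10, 3, 5]}, index=[0, 1, 3])
features = pd.DataFrame({"emp_length": [10, 3, 0, 5]}, index=[0, 1, 2, 3])

# Index labels present in features but absent from loans.
extra_rows = features.index.difference(loans.index)

print(len(extra_rows))   # how many rows exist only in features
print(extra_rows.min())  # the first such label
```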

It seems that this features DataFrame in mission 135-7 is generated using the 'clean_loans_2017.csv' from the previous mission 134. The main deviation in feature preparation that I can identify is this:
features appears to have been built without dropping the rows with missing values in the columns emp_length, title, revol_util and last_credit_pull_d, as mission 134-2 does. Instead, the missing emp_length values are filled with loans['emp_length'].fillna(0, inplace=True).
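
The effect on the row count can be illustrated with a toy frame. The column names come from the post, but the values and the names raw, dropped and filled are made up for this sketch:

```python
import numpy as np
import pandas as pd

# Toy data: column names are from the post, values are made up.
raw = pd.DataFrame({
    "emp_length": [10.0, np.nan, 3.0, np.nan],
    "title": ["rent", "car", None, "home"],
})

# Route 1: drop every row with a missing value in the listed columns.
dropped = raw.dropna(subset=["emp_length", "title"])

# Route 2: keep all rows and fill missing emp_length with 0 instead.
filled = raw.copy()
filled["emp_length"] = filled["emp_length"].fillna(0)

print(len(dropped), len(filled))
```

The filled route keeps every row, so a frame prepared that way ends up longer than one prepared with dropna.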

This will produce a DataFrame that has 38707 rows. I believe I could download the files and compare the DataFrames from missions 135 and 134, but I will leave this issue where it is and move on with the missions.

nice investigative work @cheungkasing !!!