Multi-Class Classification, verification

Screen Link: <Multiclass Classification → 6. Choose the Origin>

My Code:

FWIW, I understand all of the code in the lesson. I don’t think it’s helpful to put all the code from the module, but here is where I got stuck.

# create a df to contain the predicted probabilities
testing_probs = pd.DataFrame(columns=unique_origins)

# add to the 'testing_probs' dataframe
for i in unique_origins:
    # select testing features
    X_test = test[features]
    # compute probability of observation being in the origin
    testing_probs[i] = models[i].predict_proba(X_test)[:,1]


# classify each observation (in the test set)
predicted_origins = testing_probs.idxmax(axis=1)

What I expected to happen:
I was expecting an additional screen where we would concatenate the multi-class result back into the ‘test’ dataframe, and then compare our results to the original ‘origin’ column.

What actually happened:
The course just ended, seemingly without a conclusion.

I was curious if anyone had any ideas on how to wrap this up? I tried to add the resulting Series to the test dataframe:

test['classify_origin'] = predicted_origins

…but I ran into issues. Did the np.random.permutation from screen 3 cause the final values to not line up with the test dataframe?

Pretty new here, I hope I’m making sense :confused: thanks for all your help!


Hi lahguitarist

Here test is a part of the cars DataFrame.

testing_probs is a newly created DataFrame that contains predicted probabilities for origins 1, 2, and 3.

And predicted_origins is a Series that contains, for each row, the column label (1, 2, or 3) with the maximum predicted probability.

The predicted_origins Series is derived from the testing_probs DataFrame, so it carries that DataFrame's index rather than the index of test, and its values won’t be in sync with the test DataFrame. Hence,

test['classify_origin'] = predicted_origins

doesn’t make sense.

If you want to

concatenate the multi-class result back into the ‘test’ DataFrame, and then compare the results to the original ‘origin’ column

then you should have predicted_origins in sync with the cars DataFrame.
Otherwise, you could use SVC (Support Vector Classifier), or KNN (k-Nearest Neighbors) algorithms for this.
Hope this makes sense.
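
A minimal sketch of why that assignment goes wrong. It assumes a toy test frame whose index labels (250, 17, 301) stand in for the shuffled rows, and a toy testing_probs with made-up probabilities. pandas aligns a Series to a DataFrame by index label, not by position, so a Series indexed 0..n-1 does not match the shuffled labels in test.

```python
import pandas as pd

# Toy stand-in for the shuffled `test` split: index labels are NOT 0..n-1.
test = pd.DataFrame({"origin": [1, 3, 2]}, index=[250, 17, 301])

# Toy stand-in for `testing_probs` (made-up numbers); it gets a default 0..n-1 index.
testing_probs = pd.DataFrame({
    1: [0.7, 0.2, 0.1],
    2: [0.2, 0.3, 0.6],
    3: [0.1, 0.5, 0.3],
})

# Column label with the highest probability in each row, indexed 0..n-1.
predicted_origins = testing_probs.idxmax(axis=1)

# Assignment aligns on index labels; none of 0, 1, 2 exist in test's index,
# so every value in the new column is missing.
test["classify_origin"] = predicted_origins
print(test)
```

With the real lesson data some labels would probably overlap, so the new column would likely hold a mix of missing values and predictions taken from the wrong rows.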


Ok. How can I keep predicted_origins in sync with cars and test?

You could use the predict method to get the predictions.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X_train = train[features]
y_train = train['origin']
X_test = test[features]
y_test = test['origin']

# fit each model on the training set, then score its predictions on the test set
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)
y_pred = lr_model.predict(X_test)
print('LogisticRegression - accuracy score:', accuracy_score(y_test, y_pred))

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print('KNN - accuracy score:', accuracy_score(y_test, y_pred))

svc = SVC()
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
print('SVC - accuracy score:', accuracy_score(y_test, y_pred))

The output would be,

LogisticRegression - accuracy score: 0.6610169491525424
KNN - accuracy score: 0.652542372881356
SVC - accuracy score: 0.6610169491525424

I see what you’re doing here, but it doesn’t really answer my question so I can’t mark it as a solution. Thanks for your input tho.

For each row in the test DataFrame, find the origin with the maximum predicted probability and assign it to test['predicted_origins'].
If you're wondering how to do that, the apply and map methods could be used. To learn how to use those methods, please refer to Transforming Data with Pandas.
Someone has already done the job for us, so instead of reinventing the wheel we can use those methods to get our job done.
Anyway, now you have an idea of how to add the results to test and reach a conclusion, so please do that and share it here for all of us to learn.
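
One possible way to do that assignment, sketched with idxmax rather than apply or map. The toy frames, their index labels, and the column names predicted_origin and predicted_origin_b are made up for illustration. Both options only work if the rows of testing_probs are in the same order as the rows of test, which holds when the probabilities were computed from test[features].

```python
import pandas as pd

# Toy stand-ins: `test` keeps shuffled labels, `testing_probs` has a 0..n-1 index.
test = pd.DataFrame({"origin": [1, 3, 2]}, index=[250, 17, 301])
testing_probs = pd.DataFrame({
    1: [0.7, 0.2, 0.1],
    2: [0.2, 0.3, 0.6],
    3: [0.1, 0.5, 0.3],
})

# Option A: strip the index and assign by position.
test["predicted_origin"] = testing_probs.idxmax(axis=1).to_numpy()

# Option B: give testing_probs the same index as test, so label alignment works.
testing_probs.index = test.index
test["predicted_origin_b"] = testing_probs.idxmax(axis=1)

print(test)
```

Option B could also be applied up front by creating testing_probs with index=test.index, so every later step stays aligned by label.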



If I understand you correctly, you want a dataframe that shows both the actual and predicted values for comparison purposes. And probably you want to calculate the accuracy or other metrics for this classification problem.

Kindly refer to the code below:

predicted_origins = testing_probs.idxmax(axis=1)

result_df = pd.DataFrame(columns=['actual', 'predicted'])
result_df['actual'] = test.origin
result_df['predicted'] = np.array(predicted_origins)

from sklearn.metrics import accuracy_score

accuracy_score(test.origin, predicted_origins)
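
As a possible follow-up to the accuracy score, a cross-tabulation of actual against predicted origins shows which classes get confused with each other. This is only a sketch: the values and index labels in result_df below are made up, not lesson output.

```python
import pandas as pd

# Made-up actual/predicted labels standing in for the lesson's result_df.
result_df = pd.DataFrame(
    {"actual": [1, 1, 2, 3, 3, 1], "predicted": [1, 1, 3, 3, 1, 1]},
    index=[250, 17, 301, 8, 122, 64],
)

# Share of rows where the prediction matches the actual origin.
accuracy = (result_df["actual"] == result_df["predicted"]).mean()

# Rows are actual origins, columns are predicted origins, cells are counts.
confusion = pd.crosstab(result_df["actual"], result_df["predicted"])

print(accuracy)
print(confusion)
```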

On screen 3, the shuffled dataframe was split and stored as test and train.

On screen 6, you can get the correct values of the shuffled test set with test.origin as you still have access to the stored test and train.
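
A rough sketch of that shuffle-and-split step. The toy cars frame, the seed, and the 70/30 split are assumptions, not a copy of the lesson code. Shuffling reorders the rows but each row keeps its original index label and its origin value, which is why test.origin still holds the correct answers.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the cars DataFrame.
cars = pd.DataFrame({"origin": [1, 2, 3, 1, 2, 3, 1, 2, 3, 1]})

np.random.seed(1)  # arbitrary seed, only to make the sketch repeatable
shuffled_rows = np.random.permutation(cars.index)
shuffled_cars = cars.iloc[shuffled_rows]

# Assumed 70/30 split of the shuffled rows.
highest_train_row = int(cars.shape[0] * 0.70)
train = shuffled_cars.iloc[:highest_train_row]
test = shuffled_cars.iloc[highest_train_row:]

# The index labels are shuffled, but each label still points at its own row.
print(test)
```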

Hope this helps!


ah-ha! Thanks @monorienaghogho this is exactly what I was looking for! You gave me some ideas for further research too! Bravo!
