Screen Link: <Multiclass Classification → 6. Choose the Origin>
My Code:
FWIW, I understand all of the code in the lesson. I don’t think it’s helpful to put all the code from the module, but here is where I got stuck.
# create a df to contain the predicted probabilities
testing_probs = pd.DataFrame(columns=unique_origins)
# add to the 'testing_probs' dataframe
for i in unique_origins:
    # select testing features
    X_test = test[features]
    # compute probability of observation being in the origin
    testing_probs[i] = models[i].predict_proba(X_test)[:,1]
testing_probs.head()
# classify each observation (in the test set)
predicted_origins = testing_probs.idxmax(axis=1)
predicted_origins
What I expected to happen:
I was expecting an additional screen where we would concatenate the multi-class result back into the ‘test’ dataframe, and then compare our results to the original ‘origin’ column.
What actually happened:
The course just ended, seemingly without a conclusion.
I was curious if anyone had any ideas on how to wrap this up. I tried to add the resultant series to the test dataframe:
test['classify_origin'] = predicted_origins
…but I ran into issues. Did the np.random.permutation from screen 3 cause the final values to not line up with the test dataframe?
Pretty new here, I hope I’m making sense.
Thanks for all your help!
Hi lahguitarist,
Here test is a part of the cars DataFrame. testing_probs is a newly created DataFrame that contains predicted probabilities for origins 1, 2, and 3.

And predicted_origins is a Series that contains, for each row, the column label (1, 2, or 3) with the maximum predicted probability.
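
To make the idxmax(axis=1) step concrete, here is a minimal sketch on a made-up probability table. The numbers and the name toy_probs are invented for illustration and are not taken from the lesson.

```python
import pandas as pd

# Hypothetical probabilities for three observations; columns are origins 1, 2, 3.
toy_probs = pd.DataFrame({
    1: [0.70, 0.20, 0.10],
    2: [0.20, 0.50, 0.30],
    3: [0.10, 0.30, 0.60],
})

# For each row, return the column label that holds the largest value.
toy_predicted = toy_probs.idxmax(axis=1)
print(toy_predicted.tolist())
```

Each entry in the result is an origin label, not a probability, which is why it can be compared directly against the origin column.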

The predicted_origins Series is derived from the testing_probs DataFrame, and the values in predicted_origins won’t be in sync with the test DataFrame. Hence,
test['classify_origin'] = predicted_origins
doesn’t make sense.
If you want to “concatenate the multi-class result back into the ‘test’ DataFrame, and then compare the results to the original ‘origin’ column”, then you should have predicted_origins in sync with the cars DataFrame.
Otherwise, you could use SVC (Support Vector Classifier), or KNN (k-Nearest Neighbors) algorithms for this.
Hope this makes sense.
Ok. How can I keep predicted_origins in sync with cars and test?
You could use the predict method to get the predictions.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
X_train = train[features]
y_train = train['origin']
X_test = test[features]
y_test = test['origin']
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)
y_pred = lr_model.predict(X_test)
print('LogisticRegression - accuracy score:', accuracy_score(y_test, y_pred))
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print('KNN - accuracy score:', accuracy_score(y_test, y_pred))
svc = SVC()
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
print('SVC - accuracy score:', accuracy_score(y_test, y_pred))
The output would be,
LogisticRegression - accuracy score: 0.6610169491525424
KNN - accuracy score: 0.652542372881356
SVC - accuracy score: 0.6610169491525424
I see what you’re doing here, but it doesn’t really answer my question so I can’t mark it as a solution. Thanks for your input tho.
Calculate the maximum probability for each row of the test DataFrame, and assign the corresponding origin to test['predicted_origins'].
Now you could ask how to do that. The answer is that the apply and map methods could be used. To learn how to use those methods, please refer to Transforming Data with Pandas.
Someone has already done the job for us, so instead of reinventing the wheel we could use those methods to get our job done.
Anyway, now you have an idea of how to add the results to test and reach the conclusion, so please do that and share it here for all of us to learn.
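
A rough sketch of what the apply and map route might look like, on invented numbers. The helper most_likely_origin and the origin_names labels are hypothetical names added for illustration only.

```python
import pandas as pd

# Hypothetical probabilities for two observations; columns are origins 1, 2, 3.
toy_probs = pd.DataFrame({1: [0.6, 0.1], 2: [0.3, 0.2], 3: [0.1, 0.7]})

def most_likely_origin(row):
    # row is a Series indexed by origin label; return the label of its maximum.
    return row.idxmax()

# Row-wise apply: one predicted origin per observation.
by_apply = toy_probs.apply(most_likely_origin, axis=1)

# map translates each predicted label through a lookup (placeholder names here).
origin_names = {1: "origin_1", 2: "origin_2", 3: "origin_3"}
named = by_apply.map(origin_names)

print(by_apply.tolist())
print(named.tolist())
```

On this toy table the row-wise apply gives the same labels as toy_probs.idxmax(axis=1), so the two approaches should be interchangeable here.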
@lahguitarist
If I understand you correctly, you want a dataframe that shows both the actual and predicted values for comparison purposes. And probably you want to calculate the accuracy or other metrics for this classification problem.
Kindly refer to the code below:
predicted_origins = testing_probs.idxmax(axis=1)
result_df = pd.DataFrame(columns=['actual', 'predicted'])
result_df['actual'] = test.origin
result_df['predicted'] = np.array(predicted_origins)
from sklearn.metrics import accuracy_score
accuracy_score(test.origin, predicted_origins)
If you got to screen 3, the shuffled dataframe was split and stored as test and train.
On screen 6, you can get the correct values of the shuffled test set with test.origin, as you still have access to the stored test and train.
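
For a fuller comparison than a single accuracy number, something like the sketch below could work. It uses invented labels, computes accuracy by hand rather than with accuracy_score, and adds a pd.crosstab table, which is not part of the lesson. All toy_ names are hypothetical.

```python
import pandas as pd

# Toy stand-ins: actual labels carry a shuffled index, predictions a default one.
toy_actual = pd.Series([1, 3, 2, 1], index=[9, 4, 6, 0], name="actual")
toy_predicted = pd.Series([1, 3, 1, 1], name="predicted")

# Pair the values up by position, keeping the shuffled index of the actual labels.
toy_result = pd.DataFrame(
    {
        "actual": toy_actual.to_numpy(),
        "predicted": toy_predicted.to_numpy(),
    },
    index=toy_actual.index,
)

# Share of rows where the prediction matches the actual label.
toy_accuracy = (toy_result["actual"] == toy_result["predicted"]).mean()

# Rows are actual origins, columns are predicted origins.
toy_confusion = pd.crosstab(toy_result["actual"], toy_result["predicted"])

print(toy_accuracy)
print(toy_confusion)
```

The cross-tabulation shows which origins get confused with each other, which the single accuracy figure hides.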
Hope this helps!
ah-ha! Thanks @monorienaghogho this is exactly what I was looking for! You gave me some ideas for further research too! Bravo!