Beware of the scoring function: a "good" mean accuracy can hide a bad result!

Hi all,

I’ve just realized something pretty annoying: fitting a Logistic Regression on a binary target, I first thought I was getting not-so-bad results (though overfitting a bit at first glance):

But looking further, something was definitely going wrong:

Since the sklearn score function is based on “mean accuracy”, it can be very misleading!

Indeed, 78% of the predicted outputs are correct, but the model is not able to classify the 1 outputs correctly at all.

How do you deal with this kind of issue? Has it happened to you before? Did you rely on mean accuracy like me before realizing it was a “trap”?



Hi, @WilfriedF .

It’s not really “mean accuracy” but just accuracy. Imagine that you have 100 rows of data: 99 labeled 1 and a single one labeled 0. Your model predicts 1 for every row, so it has an accuracy of 99% even though it never gets the 0 right. This is most noticeable with unbalanced data, as in your case, where one class greatly outweighs the other.

It is for this reason that accuracy is a poor metric for model evaluation.
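A tiny sketch of that illusion (the arrays below are made up for illustration; scikit-learn is assumed available):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 100 rows: 99 labeled 1 and a single 0, as in the example above
y_true = np.array([1] * 99 + [0])
# a "model" that predicts 1 for every row
y_pred = np.ones(100, dtype=int)

acc = accuracy_score(y_true, y_pred)
rec_0 = recall_score(y_true, y_pred, pos_label=0)
print(acc)    # 0.99 -- looks great
print(rec_0)  # 0.0  -- the lone 0 is never found
```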

Look at Type I and Type II errors.
And also at the metrics in sklearn.metrics:

The “precision” and “recall” metrics will be important to you.

As for methods to combat such errors: try training the model on balanced data.
Another is changing the probability threshold at which the model assigns a value of 0 or 1.
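For what it’s worth, a minimal sketch of that per-class view with sklearn.metrics (the label arrays are invented):

```python
from sklearn.metrics import classification_report, precision_score, recall_score

y_true = [0, 0, 0, 0, 0, 1, 1, 1]   # imbalanced toy labels
y_pred = [0, 0, 0, 0, 0, 1, 0, 0]   # a model that rarely predicts 1

p = precision_score(y_true, y_pred)  # 1.0: every predicted 1 was a true 1
r = recall_score(y_true, y_pred)     # ~0.33: only 1 of the 3 true 1s found
print(classification_report(y_true, y_pred))
```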

Regards, Max


Hi @moriturus7, thanks for the answer.

From the beginning, I trained with class_weight=“balanced”, but it looks like that’s not enough to rebalance the classes.

Note that all my features (n = 62) are binary features.

Indeed, the recall score for class 1 is very bad, so I guess I should focus only on the recall metric or the f1-score. “Precision” is misleading here.

I didn’t try changing the threshold, but it’s a bit arbitrary, don’t you think? It means, if I understand well, that I could force the model to output “1” if proba(1) >= 0.3 instead of 0.5, for example? Fine, but that kind of supervision looks strange to me, because I couldn’t find any “objective” reason to fix the threshold at 0.3.

Since the 0 outputs weigh 76% of the dataset, p = 0.76, so maybe it makes sense, from a logical point of view, to fix the threshold at 1 - p = 0.24? I am afraid this could lead to overfitting.
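If it helps, a minimal sketch of what that threshold change looks like in code; make_classification stands in for the real dataset, and 0.24 is just the 1 - p value discussed above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# toy imbalanced data: class 0 is roughly 76% of the samples
x, y = make_classification(n_samples=500, weights=[0.76], random_state=0)
model = LogisticRegression(max_iter=1000).fit(x, y)

threshold = 0.24                       # 1 - p, with p the majority-class share
proba_1 = model.predict_proba(x)[:, 1]
y_pred = (proba_1 >= threshold).astype(int)

# lowering the threshold below 0.5 can only add predicted 1s, never remove any
assert y_pred.sum() >= model.predict(x).sum()
```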


After changing the probability threshold, I improved a lot indeed, but I need an extreme threshold:
predict 1 whenever p(0) < 0.97

Do you think it’s overfitting now?


Study some articles on precision-recall curves; they will help you define the threshold parameter better, and also explore the approaches used for this.

In simple terms, you are right in this observation. You can see that, due to the unbalanced data, the model tends to strongly underestimate the probability values for class 1. So you lower the threshold, allowing it to predict 1 more often.

But obviously, if you lower the threshold too much, there will be a lot of errors on class 0, so you need to find a balance. To do this, use precision-recall curves.
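One common way to find that balance is to scan the curve for the threshold that maximizes F1. A sketch on toy data (make_classification stands in for the real dataset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

x, y = make_classification(n_samples=1000, weights=[0.76], random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(x, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)

precision, recall, thresholds = precision_recall_curve(
    y_te, model.predict_proba(x_te)[:, 1])
# precision/recall have one more point than thresholds, so drop the last point
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best = thresholds[np.argmax(f1)]
print(f"best threshold by F1: {best:.2f}")
```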

Overfitting is best seen when you check your model on the test dataset. If the metrics on the test dataset are much worse than on the train and validation sets, then the model has been overfitted.


Thanks for all this Max, I will check precision-recall curves and post my results later. Note that my scores on the training and testing datasets are not so bad; the recall score only becomes bad when I apply the model to the whole dataset.

Happy to help. It will be interesting to know your final results.


Well you made me go deep into the subject!

I understand a bit better why I should use the precision-recall curve instead of the ROC curve, since the calculation for the former doesn’t take the true negatives into account. Though I am still a bit confused by the true/false positive/negative terminology.

Reviewing both precision and recall is useful in cases where there is an imbalance in the observations between the two classes. Specifically, there are many examples of no event (class 0) and only a few examples of an event (class 1).

The reason for this is that typically the large number of class 0 examples means we are less interested in the skill of the model at predicting class 0 correctly, e.g. high true negatives.

Key to the calculation of precision and recall is that the calculations do not make use of the true negatives. It is only concerned with the correct prediction of the minority class, class 1.

Indeed, this is exactly the case here. I am above all interested in the correct prediction of the minority class 1.

Source: this tutorial

The same source mentions also this article: The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets

Note that I cannot import plot_precision_recall_curve from sklearn.metrics, I got an error! (If you are on a recent scikit-learn, it may have been removed in favour of PrecisionRecallDisplay.)
So, I finally managed to plot the precision-recall curve using the code contained in the tutorial.

I think we got it, with just three lines of code that are finally easy to understand:

from sklearn.metrics import precision_recall_curve

# predict probabilities
lr_probs = model.predict_proba(x_test)
# keep probabilities for the positive outcome only
lr_probs = lr_probs[:, 1]
# compute precision/recall pairs for successive thresholds
lr_precision, lr_recall, _ = precision_recall_curve(y_test, lr_probs)

And then you plot lr_precision against lr_recall, while the thresholds can be retrieved from the _ variable (give it a real name, e.g. thresholds, if you want to keep them).
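For completeness, a self-contained sketch of the plotting step (matplotlib assumed; make_classification stands in for the real data):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve

x, y = make_classification(n_samples=500, weights=[0.76], random_state=0)
model = LogisticRegression(max_iter=1000).fit(x, y)
lr_precision, lr_recall, thresholds = precision_recall_curve(
    y, model.predict_proba(x)[:, 1])

plt.plot(lr_recall, lr_precision, marker=".")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.savefig("pr_curve.png")
```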

How would you read this curve? If I understand well, when recall increases, precision quickly falls, so this is not looking very good, right?

As for understanding the concepts of “false positive” and “false negative”, you can use the following description.

Imagine that you have developed a test to detect a disease.

  1. The test reports that the patient is healthy, but he is sick: a false-negative result (0, 1).
  2. The test reports that the patient is sick, and he is sick: a true-positive result (1, 1).
  3. The test reports that the patient is sick, but in fact he is healthy: a false-positive result (1, 0).
  4. The test reports that the patient is healthy, and he is healthy: a true-negative result (0, 0).

At the end of each example, I gave what it looks like for a binary classifier, where the first number is the prediction and the second is the expected result.

Concerning the reading of the chart: when your model does not have 100% accuracy, this behavior is perfectly normal. It’s worth remembering that the points on the graph correspond to changes of the probability threshold we talked about earlier, so you can read the graph from right to left.
Recall equals 1 only if there are no false negatives; this is possible with a probability threshold of 0.0, when all probability values are greater than or equal to 0.0, so no true 1 is predicted as 0, since absolutely all predictions are 1.
Precision equals 1 only when there are no false positives; this is possible at a probability threshold of 1.0.

Now let’s read one of the points on the graph. If we take the point with recall 0.6 and precision 0.78, it means the following: with a probability threshold of about 0.4, we correctly predict 1 for 60% of the total number of expected 1s. At the same time, the precision of predicting 1 is 78%, that is, for 22% of the predicted 1s the true label is 0.

It may be easier for you to read the graph if you redraw it with the probability thresholds on the x-axis; on the y-axis you can then plot both recall and precision.

Next, you need to determine the balance of these indicators for your task. Or you can continue working on improving the model’s hyperparameters.

I’m not sure whether I was able to explain these concepts clearly enough.

Regards, Max


Thanks again, your explanations are always well structured and super interesting! I think we are really developing a great discussion about something important!

This is the confusion matrix for thresholds [0,1] on the y_test
Let’s see if I am right:

  • 489 true negatives
  • 164 true positives
  • 31 false negatives (so recall is pretty good)
  • 124 false positives (so precision is pretty bad)

I got it, no?
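(For anyone reproducing this: the four counts can be read straight off sklearn’s confusion_matrix; the tiny arrays below are invented, not the thread’s data.)

```python
from sklearn.metrics import confusion_matrix

y_test = [0, 0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 1, 0, 0, 1]

# ravel() flattens the 2x2 matrix in (tn, fp, fn, tp) order
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(tn, fp, fn, tp)  # 3 1 1 2
```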

In my case, I think precision is more important. False negatives are not really a problem, but false positives may lead to bad outcomes.

Now I followed your advice and made the following chart:

Note that the shape of _ is only 783 while lr_precision and lr_recall have shape 784, do you know why?

Let’s say now I decide to select 0.8 as the optimal threshold; how can I then force the model to act accordingly? I mean, there is an implicit threshold in the logistic regression predict method: predict 1 if p(x) > 0.5, where p is the predicted probability and 0.5 the default threshold value. Manually, this is easy to do, but I guess it is not the best way.

Also, I am very interested in plotting (with imshow), for analysis purposes, the coef_ attributes of the features in the decision function. If I change the threshold manually, it will not affect coef_, right? That is, I am not sure I understand very well the relation, if there is one, between coef_ and the probability threshold, since as far as I know there is no clear way to “redo” the model with a new built-in threshold different from 0.5 (not sure I explain myself very well here). Since I am searching for heuristics that are easy for humans to memorize, it’s very important to extract clear general rules from the features that best predict the positive class, and that’s why the coef_ are so important.

Hi, @WilfriedF.

Glad that my explanations help you.

You got it right in the matrix values.

Unfortunately, off the top of my head I can’t say what caused the difference in size. But it doesn’t affect the shape of the chart anyway.

For your other questions, I recommend you read these discussions on Stack Overflow; they cover a similar situation.

And in simple words from me: if you want to retrain the model using this knowledge, you should build several models with different values of class_weight and compare the results. This should affect coef_ as well.
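A quick experiment along those lines (toy data, not the thread’s dataset) confirms both points: thresholding is pure post-processing and leaves coef_ untouched, while refitting with another class_weight changes it:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

x, y = make_classification(n_samples=400, weights=[0.76], random_state=0)

m1 = LogisticRegression(max_iter=1000).fit(x, y)
coef_before = m1.coef_.copy()
_ = (m1.predict_proba(x)[:, 1] >= 0.3)           # custom threshold, post-hoc
assert np.array_equal(coef_before, m1.coef_)     # coefficients unchanged

m2 = LogisticRegression(class_weight={0: 1, 1: 3}, max_iter=1000).fit(x, y)
assert not np.array_equal(m1.coef_, m2.coef_)    # refit with weights: changed
```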

But when you have already trained the model, there is nothing wrong with writing a prediction function that applies a different threshold. You can see examples in the links I gave you.


Some news.

Playing with class_weight as you said, I faced the following behavior after applying the model to (x, y): the precision increases a lot indeed, but so do the false negatives, and above all, I lost nearly 100% of the true positives!

But finally, after doing more research, I tried LogisticRegressionCV (built-in cross-validation support) and got promising results:

import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import StratifiedKFold

w = [{0:1000,1:100},{0:1000,1:10}, ... ]  # class_weight takes one dict, so try each entry separately
c_range = np.arange(0.5, 20.0, 0.5)
skf = StratifiedKFold(n_splits=5)
model = LogisticRegressionCV(Cs=c_range, random_state=13, cv=skf, scoring="f1", class_weight=w[0]).fit(x, y)

A significant improvement:

Precision: 0.82
Recall: 0.64
f1-score: 0.72
Accuracy: 0.88

I think I can do better, but this is a great result compared to the first confusion matrix posted at the start of the discussion!

What I am looking at now are the scoring functions provided by scikit-learn. So much to learn here!

!!! This is great!

I completed my model by adding other features that hadn’t made any significant difference until now. But this is not the case anymore: the new features changed everything!



Can’t believe it. Machine learning is really surprising sometimes: you think you are stuck and suddenly you make a giant step! :slight_smile:



I am glad that you were able to achieve such good results.
