Linear relationship between RMSE and Mean Absolute Percentage Error

Hi community,

I am working on something difficult to interpret (related to the Guided Project about k-nearest neighbors).

Let’s say we create a univariate model and want to test two different features. We also want to compare two error metrics: RMSE and Mean Absolute Percentage Error (MAPE). I tuned the k parameter (n_neighbors) over different values and obtained the scores for each feature shown in the figure below.

The caveat here is that Feature 1 gives the better RMSE scores (see the 2 red points on the upper left) but with worse MAPE scores than Feature 2!


The two models have been trained with the same random split for the training set.
Also, for Feature 2, the RMSE and MAPE scores show a pretty clean linear relationship, but this is not the case for Feature 1, at least to my eyes.

Does it mean that Feature 1 induces some kind of instability in the model and that Feature 2 should be seen as a more conservative choice for the model? What lesson could we draw from it?

I am aware this is probably the kind of problem that arises when no clear choice has been made about which error metric to optimize, but I find it disturbing.


My advice here is: do not trust your eyes. You have tools to verify this. Run a correlation test for the results of both features. Don’t make assumptions about something you do not know yet.

Hi @otavios.s

Thanks for the response, but this is not my main point here. In any case, don’t you think it’s striking that we can pretty easily draw a line through the blue points, but not through the red points?
But the main point is in fact the relationship between RMSE and MAPE. Both are scoring functions. The problem with Feature 1 is that it gives the better RMSE scores, but paired with the worse MAPE scores! I guess there is a mathematical explanation, but what is this fact telling us about Feature 1? If I want to minimize RMSE, I may think: oh, Feature 1 is definitely my best feature. But that contradicts the MAPE score!

No, I don’t. The way I see it, you can pretty easily draw a line between both blue and red points:

[Image: Untitled]

There’s no reason to believe only your eyes. You may run a test and find that the correlation really is greater for the blue points than for the red points, but then you’d be sure of it. All I’m saying here is: run the tests.
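To make "run the tests" concrete, here is a minimal sketch of how the check could look with scipy.stats.linregress. The helper name linear_fit_summary and all the score values are placeholders I made up for illustration; they are not the numbers from the plot, so swap in your own (RMSE, MAPE) pairs, one per value of k.

```python
import numpy as np
from scipy.stats import linregress


def linear_fit_summary(rmse_scores, mape_scores):
    """Fit MAPE against RMSE and summarize how linear the relationship is."""
    fit = linregress(rmse_scores, mape_scores)
    return {
        "slope": fit.slope,
        "r_squared": fit.rvalue ** 2,  # share of variance explained by the line
        "p_value": fit.pvalue,         # test of the null hypothesis "slope is zero"
    }


# Placeholder scores (hypothetical, NOT taken from the thread's figure).
rmse_blue = np.array([4000.0, 4200.0, 4400.0, 4600.0, 4800.0])
mape_blue = np.array([0.150, 0.158, 0.165, 0.173, 0.181])
rmse_red = np.array([3500.0, 3600.0, 4100.0, 4300.0, 4700.0])
mape_red = np.array([0.210, 0.190, 0.230, 0.200, 0.225])

for name, x, y in [("blue", rmse_blue, mape_blue), ("red", rmse_red, mape_red)]:
    summary = linear_fit_summary(x, y)
    print(f"{name}: R^2 = {summary['r_squared']:.3f}, p = {summary['p_value']:.3f}")
```

With only a handful of points per feature, the p-value will be shaky, so I would read R-squared as a rough description rather than as proof.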

Different features will yield different results for the same metric. Also, MAPE and RMSE are under no obligation to be equal or to present a linear relationship. For instance, RMSE squares the errors, which means higher errors will have an even greater impact on this metric.
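A small made-up example may help show how one model can win on RMSE and lose on MAPE. The prices and predictions below are invented for illustration only: model A misses three cheap cars by 1500 each, while model B is exact on those and misses one expensive car by 4000.

```python
import numpy as np


def rmse(target, predictions):
    """Root mean squared error."""
    target = np.asarray(target, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    return np.sqrt(np.mean((target - predictions) ** 2))


def mape(target, predictions):
    """Mean absolute percentage error, as a fraction (targets must be nonzero)."""
    target = np.asarray(target, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    return np.mean(np.abs((target - predictions) / target))


# Invented prices: three cheap cars and one expensive car.
target = np.array([5000.0, 6000.0, 8000.0, 40000.0])
pred_a = np.array([6500.0, 7500.0, 9500.0, 40000.0])  # off by 1500 on each cheap car
pred_b = np.array([5000.0, 6000.0, 8000.0, 36000.0])  # off by 4000 on the expensive car

print(f"A: RMSE = {rmse(target, pred_a):.2f}, MAPE = {mape(target, pred_a):.3f}")
print(f"B: RMSE = {rmse(target, pred_b):.2f}, MAPE = {mape(target, pred_b):.3f}")
# A: RMSE = 1299.04, MAPE = 0.184
# B: RMSE = 2000.00, MAPE = 0.025
```

Model A has the lower RMSE because its absolute errors are smaller, but the higher MAPE because those errors are large relative to the cheap prices. Whether something similar is happening with Feature 1 is only a guess until you look at where its errors fall.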

My suggestion is that you first define the metric you’re using based on the problem you’re trying to solve. Then you choose the best feature to optimize this metric.


Well, I followed your advice (I am aware it’s general advice), checked it with scipy.stats.linregress, and my first impression is confirmed:


Indeed, so it means that MSE and RMSE are not very outlier resistant.

MAPE is doing the following:

np.mean(np.abs((target - predictions) / target))

The mean is well known to be very sensitive to outliers too, so MAPE will suffer from the same weakness as RMSE.
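One way to check how strongly each metric reacts is to inject a single large error and compare. This is only a sketch on invented numbers (ten targets of 100, every prediction off by 10, then one prediction off by 100), not on the project data.

```python
import numpy as np


def rmse(target, predictions):
    """Root mean squared error."""
    return np.sqrt(np.mean((target - predictions) ** 2))


def mape(target, predictions):
    """Mean absolute percentage error, as a fraction (targets must be nonzero)."""
    return np.mean(np.abs((target - predictions) / target))


# Invented data: ten identical targets, every prediction off by 10.
target = np.full(10, 100.0)
clean = target + 10.0

# Same predictions, except one that is off by 100 (the outlier).
with_outlier = clean.copy()
with_outlier[0] = target[0] + 100.0

# How many times larger each metric becomes once the outlier is present.
rmse_ratio = rmse(target, with_outlier) / rmse(target, clean)
mape_ratio = mape(target, with_outlier) / mape(target, clean)

print(f"RMSE grows by a factor of {rmse_ratio:.2f}")
print(f"MAPE grows by a factor of {mape_ratio:.2f}")
```

In this toy case both metrics move, but RMSE moves more because the error is squared before averaging.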

I think these two different R-squared values are telling us something about an instability that may affect the model depending on whether we choose feature 1 or feature 2, but I am not able to define it clearly. I am thinking, for example, of this scikit-learn example, which examines “pitfalls” due to feature collinearities and things like that.

I have done the KNN guided project but am now investigating further. If you refer to the Data Set Description, past usage of the dataset obtained 11.84% and 14.12% using “Percent Average Deviation Error of Prediction from Actual”. I would like to compare my scores to this past performance (benchmarks), and that cannot be done with RMSE. But since the course uses RMSE, I also use RMSE. In other words, although there is apparently no necessarily strong linear relationship between these two scoring functions, I would like to build a model that minimizes both of them, or at least find the best tradeoff.
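As a rough sketch of what looking for a tradeoff could mean, the code below scores each k on both metrics and picks the k with the lowest sum of min-max scaled scores. The scaling rule is just one assumption among many possible ones, the helper names score_k_values and best_tradeoff are hypothetical, and the data is synthetic rather than the project dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor


def rmse(target, predictions):
    return np.sqrt(np.mean((target - predictions) ** 2))


def mape(target, predictions):
    return np.mean(np.abs((target - predictions) / target))


def score_k_values(X, y, k_values, random_state=1):
    """Return (k, RMSE, MAPE) for each k, using one fixed train/test split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=random_state
    )
    rows = []
    for k in k_values:
        model = KNeighborsRegressor(n_neighbors=k)
        model.fit(X_train, y_train)
        predictions = model.predict(X_test)
        rows.append((k, rmse(y_test, predictions), mape(y_test, predictions)))
    return rows


def best_tradeoff(rows):
    """Pick the k with the lowest sum of min-max scaled RMSE and MAPE.

    Assumption: both metrics matter equally; change the weighting if they do not.
    """
    scores = np.array([[r, m] for _, r, m in rows])
    span = scores.max(axis=0) - scores.min(axis=0)
    span[span == 0] = 1.0  # avoid dividing by zero when a metric is constant
    scaled = (scores - scores.min(axis=0)) / span
    return rows[int(np.argmin(scaled.sum(axis=1)))][0]


# Synthetic stand-in for a single feature and a strictly positive price target.
rng = np.random.default_rng(0)
X = rng.uniform(50, 250, size=(200, 1))
y = 5000 + 100 * X[:, 0] + rng.normal(0, 2000, size=200)

rows = score_k_values(X, y, k_values=[1, 3, 5, 7, 9])
for k, r, m in rows:
    print(f"k={k}: RMSE={r:.1f}, MAPE={m:.3f}")
print("k with the best combined score:", best_tradeoff(rows))
```

Running this once per feature with the same random_state would keep the split identical across features, as in the original comparison.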
