Which metric to use for Classification Problems?

There are many metrics for classification: accuracy, precision, recall, F1 score, the ROC curve, AUC, etc. So which single metric should someone focus on during analysis to understand model performance?

Precision and recall are one level up from accuracy, in that they handle imbalanced classes. Attaching a cost matrix to the confusion matrix clarifies whether the business cares more about precision or recall (usually a trade-off). For example, a customer service company wants high precision to minimize false positives wasting its resources (assuming it spends resources intervening on positively predicted entities), while medical diagnosis wants high recall to avoid missing actual positive cases.
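A minimal sketch of attaching a cost matrix to the confusion matrix; the labels, predictions, and dollar costs below are made-up placeholders, not numbers from any real business.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels and predictions
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0])

cm = confusion_matrix(y_true, y_pred)     # rows = actual, cols = predicted: [[TN, FP], [FN, TP]]

# Hypothetical business costs, aligned with the confusion matrix layout
cost_matrix = np.array([[0,   50],        # a false positive wastes $50 of intervention
                        [200,  0]])       # a missed positive (false negative) costs $200

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("expected business cost:", (cm * cost_matrix).sum())
```

Whichever cell dominates the total cost tells you whether to chase precision (false positives expensive) or recall (false negatives expensive).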

The F1 score is the harmonic mean of precision and recall, for those who want both; the lower of the two dominates the resulting F1 score, giving a more pessimistic view than the arithmetic mean would. The previous three metrics each have a single value defined at a fixed threshold (0.5 by default in sklearn), where the output of predict_proba in sklearn classification models is cut into the result given by predict.
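A small sketch of both points, on a synthetic imbalanced dataset: F1 is the harmonic mean of precision and recall, and predict() is effectively predict_proba() cut at 0.5.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score

X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

proba = clf.predict_proba(X)[:, 1]              # continuous scores
hard = (proba >= 0.5).astype(int)               # the default 0.5 cut that predict() applies
print("disagreements with predict():", (hard != clf.predict(X)).sum())  # expect 0

p, r = precision_score(y, hard), recall_score(y, hard)
print(f1_score(y, hard), 2 * p * r / (p + r))   # harmonic mean, same number
```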

Moving on to variable thresholds, we get ROC: a curve plotted by starting from the bottom left of the graph at the maximum threshold (where all predictions are negative), moving to the top right (where all predictions are positive), and tracing out (FPR, TPR) at each threshold. AUC is a summary statistic of the ROC curve. The C in AUC usually refers to the ROC curve, but it may also refer to the precision-recall curve or any other curve.
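A minimal sketch of tracing that curve and summarising it, using sklearn's roc_curve and roc_auc_score on made-up scores:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9, 0.2, 0.7])  # e.g. predict_proba outputs

fpr, tpr, thresholds = roc_curve(y_true, scores)   # thresholds run from high to low
print(np.column_stack([thresholds, fpr, tpr]))     # one (FPR, TPR) point per threshold
print("AUC:", roc_auc_score(y_true, scores))
```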

Also, because the curve now traces out an entire range of thresholds, a user may have to decide exactly which threshold (a single point on the ROC curve) to operate at. One model may be better than another at a certain threshold but worse at another; this can show up as two intersecting ROC curves. Where that operating threshold sits could be tied to the business costs (the cost matrix) mentioned earlier.
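One hedged way to pick that operating point, assuming you trust a cost matrix like the hypothetical one above: sweep the thresholds and keep the one with the lowest expected cost.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1, 0, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9, 0.2, 0.7, 0.3, 0.05])

cost_matrix = np.array([[0,   50],    # made-up false positive cost
                        [200,  0]])   # made-up false negative cost

thresholds = np.linspace(0.05, 0.95, 19)
costs = [(confusion_matrix(y_true, (scores >= t).astype(int), labels=[0, 1])
          * cost_matrix).sum()
         for t in thresholds]

print("lowest-cost threshold:", thresholds[int(np.argmin(costs))])
```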

Wikipedia says AUC is the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. Note that AUC may not be a good choice for imbalanced datasets with a lot of actual negatives, as it is harder for the ROC curve to grow rightwards than upwards as the threshold is lowered: the x-axis is the false positive rate, FPR = FP / (FP + TN), so a large true negative count in the denominator keeps rightward movement small. Nevertheless, if all your models are evaluated on the same imbalanced dataset, it doesn't matter much, since every model's AUC is biased in the same way.
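A quick numerical check of that ranking interpretation, on synthetic imbalanced labels: count the fraction of (positive, negative) pairs where the positive instance gets the higher score, with ties counted as half, and compare it to roc_auc_score.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = np.array([0] * 150 + [1] * 50)            # imbalanced labels
scores = rng.random(200) + 0.3 * y            # scores mildly correlated with the label

pos, neg = scores[y == 1], scores[y == 0]
diffs = pos[:, None] - neg[None, :]           # every (positive, negative) pair
rank_prob = (diffs > 0).mean() + 0.5 * (diffs == 0).mean()

print(rank_prob, roc_auc_score(y, scores))    # should print the same value twice
```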

Teach the decision maker what these metrics mean, clarify the goal, and gather information about what they want in order to select the metric. Of course, you can even invent your own metric. Designing good loss functions (for training) and metrics (for evaluation) is a huge contributor to machine learning progress.
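As a sketch of inventing your own metric, sklearn's make_scorer lets you plug a custom function into cross-validation; the weighting below (recall counting three times as much as precision) is an arbitrary example, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, make_scorer
from sklearn.model_selection import cross_val_score

def weighted_pr(y_true, y_pred):
    # Hypothetical business metric: recall matters 3x as much as precision
    p = precision_score(y_true, y_pred, zero_division=0)
    r = recall_score(y_true, y_pred, zero_division=0)
    return (p + 3 * r) / 4

X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)
print(cross_val_score(LogisticRegression(max_iter=1000), X, y,
                      scoring=make_scorer(weighted_pr), cv=5))
```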