How an MMA fan did a better job than the experts (and made a few bucks) with predictive modeling

Growing up in the late '90s and early 2000s, one of the biggest things around was professional wrestling. It was an awesome time to be a fan, with seemingly unlimited wrestling content in the form of the Monday Night Wars, and with the '90s-'00s culture of taking everything to the extreme personified in the storylines, wrestling personalities, and matches. While it was great to see these over-the-top spectacles and edgier content, it also made the divide between fantasy and reality more apparent in the product. Although it was still enjoyable to watch, some part of me wanted a sense of authenticity. That's where mixed martial arts (MMA) came in.

MMA appeared to be the best of both worlds. It provided the legitimacy of competitive sports but also had the things that I loved about pro wrestling like:

  • The platform to show off an array of technical skills and strategies to solve complex problems with dire consequences
  • A growing and passionate fan base
  • A capacity to generate storylines that lead to an emotional investment in a match
  • The ability to display the triumph of human will in its truest sense

The moment I saw my first MMA event, I became an instant diehard fan.

However, as an avid fan, one of the most frustrating things is the existing dichotomy between the "casuals" and the "hardcore". While it's understood that we hardcore fans are in the vast minority, there is still an expectation of some representation in front of the general population. Yet every time I happen to catch commentary on an upcoming fight card or a certain match-up, I'm baffled by some of the analyses given by these MMA experts. Particularly their takes on fight outcomes: it seems that for every correct fight pick, there is an upset to match. In fact, it happens so frequently that it's become commonplace for these analysts/experts to attribute it to the volatile nature of the sport. While this is a fair comment, it still seems like a major cop-out, and I figured I could do a better job than some of these MMA experts.

Now, before I get labeled as an "I know better because I trained 'UFC'" guy, I'll be using a more data-driven approach. Specifically, I'll analyze what the data has to say about winners and losers from previous fights in the largest MMA organization, the UFC, and build some predictive models to see whether I can do a better job at predicting outcomes. Also, considering the nature of this project, let's see if I can actually make more money from my machine learning models than by just following the betting lines.


DISCLAIMER: Everything that is being used in this article is entirely for educational purposes and entertainment. DO NOT USE IT FOR THE PURPOSE OF FINANCIAL GAIN. You are way better off going with r/WallStreetBets on Reddit or investing in cryptocurrency than what’s shown here.


The data

To accomplish this task, I used data that had been previously scraped from the UFC stats website and made available on Kaggle. The data is split into two separate data sets:

  1. Historical data containing information on completed fights from 2010-03-21 to 2021-02-10.
  2. Statistics of the competitors on an "upcoming UFC card", in this case UFC 258, held on 2021-02-13.

The information found in these data sets includes:

| Variable | Description |
| --- | --- |
| fighter | name of fighter in the match-up |
| date | when the fight took place |
| location | city where the fight took place |
| country | country where the fight took place |
| winner | which fighter won the fight |
| weight_class | weight class the fight took place in |
| gender | gender of the fighters |
| stance | fighting stance of the underdog/favorite fighter |
| finish | how the fight ended |
| finish_details | specifically how the fight ended |
| finish_round_time | time in the round when the fight ended |
| title_bout | whether the fight was a title bout |
| odds | betting odds for the favorite/underdog |
| ev | payout for a favorite/underdog win |
| no_of_rounds | number of rounds scheduled for the fight |
| current_lose_streak | number of consecutive losses at the time of the fight |
| current_win_streak | number of consecutive wins at the time of the fight |
| draw | total number of draws up until the time of the fight |
| losses | total number of losses up until the time of the fight |
| wins | total number of wins up until the time of the fight |
| longest_win_streak | longest career win streak up until the time of the fight |
| avg_SIG_STR_landed | average number of significant strikes landed per minute |
| avg_SIG_STR_pct | accuracy of significant strikes landed vs. thrown |
| avg_SUB_ATT | average number of submission attempts in a fight |
| avg_TD_landed | average number of takedowns landed per minute |
| avg_TD_pct | accuracy of takedowns landed vs. attempted |
| Height_cms | height of the respective fighter in the match-up |
| Reach_cms | reach of the respective fighter in the match-up |
| weight_lbs | weight of the respective fighter in the match-up |
| age | age of the respective fighter in the match-up |
| total_title_bouts | total number of title bouts in career up until the time of the fight |
| total_round_fought | total number of rounds fought in the UFC up until the fight |
| win_by_decision | total number of wins by decision |
| win_by_submission | total number of wins by submission |
| win_by_KO.TKO | total number of wins by knockout |
| win_by_TKO_Doctor | total number of wins by doctor stoppage |
| win_dif | total win differential between the fighters |
| loss_dif | total loss differential between the fighters |
| lose_streak_dif | losing streak differential between the fighters |
| win_streak_dif | win streak differential between the fighters |
| longest_win_streak_dif | longest win streak differential between the fighters |
| total_round_dif | differential between the fighters in total rounds fought |
| total_title_bout_dif | differential between the fighters in total title bouts fought |
| ko_dif | differential in total knockout + doctor stoppage wins |
| sub_dif | differential in total submission wins |
| sig_str_dif | differential in significant strikes landed |
| avg_sub_att_dif | differential in average submission attempts |
| avg_td_dif | differential in takedowns landed |
| empty_arena | whether the fight was held in an empty arena |
| weightclass_rank | weight class rank of the respective fighter |
| finish_round | the round in which the fight ended |
| total_fight_time_secs | total length of the fight in seconds |
| kd_bout | average number of knockdowns per bout |
| sig_str_landed_bout | average number of significant strikes landed per bout |
| sig_str_attempted_bout | average number of significant strikes attempted per bout |
| sig_str_pct_bout | average accuracy of significant strikes landed per bout |
| tot_str_landed_bout | average total number of strikes landed per bout |
| tot_str_attempted_bout | average total number of strikes attempted per bout |
| td_landed_bout | average number of takedowns landed per bout |
| td_attempted_bout | average number of takedowns attempted per bout |
| td_pct_bout | average accuracy of takedowns landed per bout |
| sub_attempts_bout | average number of submissions attempted per bout |
| pass_bout | average number of guard passes per bout |
| rev_bout | average number of reversals (i.e., sweeps) per bout |

Both of these data sets need a fair amount of processing, such as handling missing values, reclassifying variable types, and variable engineering, most notably computing differences between the fighters listed in the red corner and those in the blue corner. Given how intensive this process was, I'll leave it out of this article. If you would like to see this process, you can see the code here.
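To give a flavor of what that variable engineering looks like, here is a minimal sketch of the corner-difference features. The file name and the per-corner column names (B_wins, R_wins, etc.) are assumptions based on the Kaggle download; the exact names may differ.

library(dplyr)

UFC_raw = read.csv("ufc-master.csv") # file name assumed from the Kaggle download

UFC = UFC_raw %>% 
  mutate(
    win_dif   = B_wins - R_wins,          # blue corner minus red corner throughout
    loss_dif  = B_losses - R_losses, 
    age_dif   = B_age - R_age, 
    reach_dif = B_Reach_cms - R_Reach_cms
  )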

What does the data say?

Examining past UFC match-ups over the last 11 years, the fighter designated in the red corner (a.k.a. the "favorite") comes out as the victor the majority of the time. However, when looking further into the context of betting odds, this only appears to be the case when they are heavily favored.

[Figures: win percentage by corner designation, and red corner win percentage across betting odds]
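The charts are easy to sanity-check directly. Here is a minimal sketch, assuming the cleaned UFC frame with the Winner and R_odds columns used later (the odds buckets are arbitrary cut-offs of my own choosing):

# Overall win rate by corner
UFC %>% count(Winner) %>% mutate(pct = round(100 * n / sum(n), 1))

# Red corner win rate, conditioned on how heavily the red corner is favored
UFC %>% 
  mutate(odds_bucket = cut(R_odds, 
                           breaks = c(-Inf, -300, -150, 0, Inf), 
                           labels = c("heavy favorite", "moderate favorite", "slight favorite", "underdog"))) %>% 
  group_by(odds_bucket) %>% 
  summarise(red_win_rate = round(mean(Winner == "Red") * 100, 1))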

Exploring further with a correlation matrix, other variables that have a relationship with the winner outcome include:

  1. Differences in current win streaks between fighters
  2. Differences in total losses in the UFC between fighters
  3. Differences in age between fighters
  4. Differences in reach between fighters
  5. Differences in the number of significant strikes between fighters
  6. Differences in the number of takedowns landed between fighters
  7. Differences in the average accuracy of significant strikes landing between fighters
  8. Differences in the average accuracy of takedowns landing between fighters

[Figure: correlation matrix of the fighter-difference variables against fight outcome]

NOTE: The difference score is calculated by Blue corner score - Red corner score where a negative score actually favors the Red corner fighter.
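For those following along, a minimal sketch of reproducing this correlation check, assuming the difference columns listed above and the corrplot package for the visual:

library(corrplot)

cor_data = UFC %>% 
  mutate(winner_numeric = ifelse(Winner == "Red", 1, 0)) %>% 
  dplyr::select(winner_numeric, win_streak_dif, loss_dif, age_dif, reach_dif, 
                sig_str_dif, avg_td_dif, avg_str_pct, avg_td_pct)

cor_matrix = cor(cor_data, use = "pairwise.complete.obs") # pairwise handles missing values
corrplot(cor_matrix, method = "color", type = "lower")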

Building Some Models

In establishing a reference model, it'll be assumed that these MMA analysts primarily rely on betting odds to determine their fight picks. Since this data set uses American odds, a negative number indicates the betting favorite (e.g., -200 means wagering $200 to win $100, while +150 means a $100 wager wins $150).

# Creating a variable to identify the likely pick of an MMA analyst based solely on betting odds

library(dplyr) # for the pipe and data wrangling verbs
library(caret) # for createDataPartition() below

UFC = UFC %>% 
  mutate(analyst_pick = as.factor(ifelse(R_odds > 0, "underdog", ifelse(R_odds < 0, "favorite", NA)))) %>% 
  mutate(analyst_pick = relevel(analyst_pick, ref = "underdog"))

# Create a training/testing data split 

set.seed(1234) 

indexing = createDataPartition(UFC$Winner, p = 0.75, list = F)

training_set = UFC[indexing, ]
testing_set = UFC[-indexing, ]

# Looking good so far...let's see how good this basic metric looks in predicting correct outcomes

table_baseline = table(testing_set$Winner, testing_set$analyst_pick)
accuracy_baseline = sum(diag(table_baseline))/sum(table_baseline)
accuracy_baseline = round(accuracy_baseline * 100, 3)

Five supervised machine learning algorithms will be used to predict fight outcomes:

Approach 1: K-Nearest Neighbors

normalize = function(x) {return((x - min(x))/(max(x) - min(x)))} # Function used for scaling numeric variables

# Processing the data set for k-NN modeling

training_for_knn.pt_1 = training_dif.only %>%
  mutate(
    total_fight_time_secs = normalize(total_fight_time_secs), 
    rank_dif = normalize(rank_dif), 
    ev_dif = normalize(ev_dif), 
    win_dif = normalize(win_dif), 
    win_streak_dif = normalize(win_streak_dif), 
    longest_win_streak_dif = normalize(longest_win_streak_dif), 
    draw_dif = normalize(draw_dif), 
    loss_dif = normalize(loss_dif), 
    lose_streak_dif = normalize(lose_streak_dif),
    KO.TKO.Doctor_Stoppage_win_dif = normalize(KO.TKO.Doctor_Stoppage_win_dif),
    sub_win_dif = normalize(sub_win_dif), 
    dec_dif = normalize(dec_dif), 
    age_dif = normalize(age_dif), 
    height_dif = normalize(height_dif),
    reach_dif = normalize(reach_dif), 
    sig_str_dif = normalize(sig_str_dif), 
    avg_str_pct = normalize(avg_str_pct), 
    avg_td_dif = normalize(avg_td_dif), 
    avg_td_pct = normalize(avg_td_pct),
    avg_sub_att_dif = normalize(avg_sub_att_dif), 
    total_round_dif = normalize(total_round_dif), 
    total_title_bout_dif = normalize(total_title_bout_dif)
  ) %>% 
  mutate(
    punch_elbow_slam = ifelse(finish_details == "Punch/Elbow/Slam", 1, 0),
    kick_knee = ifelse(finish_details == "Kick/Knee", 1, 0), 
    choke = ifelse(finish_details == "Choke", 1, 0), 
    decision = ifelse(finish_details == "Decision", 1, 0),
    lowerbody = ifelse(finish_details == "Lowerbody Joint Lock", 1, 0), 
    upperbody = ifelse(finish_details == "Upperbody Joint Lock", 1, 0), 
    heavyweight = ifelse(weight_class == "Heavyweight", 1, 0),
    light.heavyweight = ifelse(weight_class == "Light Heavyweight", 1, 0), 
    middleweight = ifelse(weight_class == "Middleweight", 1, 0), 
    welterweight = ifelse(weight_class == "Welterweight", 1, 0), 
    lightweight = ifelse(weight_class == "Lightweight", 1, 0), 
    featherweight = ifelse(weight_class == "Featherweight", 1, 0), 
    bantamweight = ifelse(weight_class == "Bantamweight", 1, 0), 
    flyweight = ifelse(weight_class == "Flyweight", 1, 0), 
    women_featherweight = ifelse(weight_class == "Women's Featherweight", 1, 0), 
    women_bantamweight = ifelse(weight_class == "Women's Bantamweight", 1, 0), 
    women_flyweight = ifelse(weight_class == "Women's Flyweight", 1, 0), 
    women_strawweight = ifelse(weight_class == "Women's Strawweight", 1, 0), 
    gender = ifelse(gender == "MALE", 1, 0), 
    is_foreign = ifelse(is_foreign == TRUE, 1, 0), 
    title_bout = ifelse(title_bout == TRUE, 1, 0), 
    stance_comparison = ifelse(stance_comparison == "Same", 0, 1)
  )

training_for_knn = training_for_knn.pt_1 %>% 
  dplyr::select(Winner, punch_elbow_slam, kick_knee, choke, lowerbody, upperbody, decision, total_fight_time_secs, heavyweight, light.heavyweight, middleweight, welterweight, lightweight, featherweight, bantamweight, flyweight, women_featherweight, women_bantamweight, women_flyweight, women_strawweight, gender, is_foreign, title_bout, empty_arena, ev_dif, win_dif, win_streak_dif, longest_win_streak_dif, draw_dif, loss_dif, lose_streak_dif, KO.TKO.Doctor_Stoppage_win_dif, sub_win_dif, dec_dif, age_dif, height_dif, reach_dif, stance_comparison, sig_str_dif, avg_str_pct, avg_td_dif, avg_td_pct, avg_sub_att_dif, total_round_dif, total_title_bout_dif)


testing_for_knn.pt_1 = testing_dif.only %>%
  mutate(
    total_fight_time_secs = normalize(total_fight_time_secs), 
    rank_dif = normalize(rank_dif), 
    ev_dif = normalize(ev_dif), 
    win_dif = normalize(win_dif), 
    win_streak_dif = normalize(win_streak_dif), 
    longest_win_streak_dif = normalize(longest_win_streak_dif), 
    draw_dif = normalize(draw_dif), 
    loss_dif = normalize(loss_dif), 
    lose_streak_dif = normalize(lose_streak_dif),
    KO.TKO.Doctor_Stoppage_win_dif = normalize(KO.TKO.Doctor_Stoppage_win_dif),
    sub_win_dif = normalize(sub_win_dif), 
    dec_dif = normalize(dec_dif), 
    age_dif = normalize(age_dif), 
    height_dif = normalize(height_dif),
    reach_dif = normalize(reach_dif), 
    sig_str_dif = normalize(sig_str_dif), 
    avg_str_pct = normalize(avg_str_pct), 
    avg_td_dif = normalize(avg_td_dif), 
    avg_td_pct = normalize(avg_td_pct),
    avg_sub_att_dif = normalize(avg_sub_att_dif), 
    total_round_dif = normalize(total_round_dif), 
    total_title_bout_dif = normalize(total_title_bout_dif)
  ) %>% 
  mutate(
    punch_elbow_slam = ifelse(finish_details == "Punch/Elbow/Slam", 1, 0),
    kick_knee = ifelse(finish_details == "Kick/Knee", 1, 0), 
    choke = ifelse(finish_details == "Choke", 1, 0), 
    decision = ifelse(finish_details == "Decision", 1, 0),
    lowerbody = ifelse(finish_details == "Lowerbody Joint Lock", 1, 0), 
    upperbody = ifelse(finish_details == "Upperbody Joint Lock", 1, 0), 
    heavyweight = ifelse(weight_class == "Heavyweight", 1, 0),
    light.heavyweight = ifelse(weight_class == "Light Heavyweight", 1, 0), 
    middleweight = ifelse(weight_class == "Middleweight", 1, 0), 
    welterweight = ifelse(weight_class == "Welterweight", 1, 0), 
    lightweight = ifelse(weight_class == "Lightweight", 1, 0), 
    featherweight = ifelse(weight_class == "Featherweight", 1, 0), 
    bantamweight = ifelse(weight_class == "Bantamweight", 1, 0), 
    flyweight = ifelse(weight_class == "Flyweight", 1, 0), 
    women_featherweight = ifelse(weight_class == "Women's Featherweight", 1, 0), 
    women_bantamweight = ifelse(weight_class == "Women's Bantamweight", 1, 0), 
    women_flyweight = ifelse(weight_class == "Women's Flyweight", 1, 0), 
    women_strawweight = ifelse(weight_class == "Women's Strawweight", 1, 0), 
    gender = ifelse(gender == "MALE", 1, 0), 
    is_foreign = ifelse(is_foreign == TRUE, 1, 0), 
    title_bout = ifelse(title_bout == TRUE, 1, 0), 
    stance_comparison = ifelse(stance_comparison == "Same", 0, 1)
  )

testing_for_knn = testing_for_knn.pt_1 %>% 
  dplyr::select(Winner, punch_elbow_slam, kick_knee, choke, lowerbody, upperbody, decision, total_fight_time_secs, heavyweight, light.heavyweight, middleweight, welterweight, lightweight, featherweight, bantamweight, flyweight, women_featherweight, women_bantamweight, women_flyweight, women_strawweight, gender, is_foreign, title_bout, empty_arena, ev_dif, win_dif, win_streak_dif, longest_win_streak_dif, draw_dif, loss_dif, lose_streak_dif, KO.TKO.Doctor_Stoppage_win_dif, sub_win_dif, dec_dif, age_dif, height_dif, reach_dif, stance_comparison, sig_str_dif, avg_str_pct, avg_td_dif, avg_td_pct, avg_sub_att_dif, total_round_dif, total_title_bout_dif)

#  Create a basic model using default parameters 

library(class) # Needed for k-NN modeling 

set.seed(1234)

proposed_k_value = round(sqrt(nrow(training_for_knn)), 0)

training_knn_cl = training_for_knn[, 1]
testing_knn_cl = testing_for_knn[, 1]

KNN.basic.performance = expand.grid( "k.value" = c(proposed_k_value, proposed_k_value+1),  "model.accuracy" = 0)

for (i in 1:nrow(KNN.basic.performance)) {
  
  set.seed(1234)
  
  KNN_model = knn(train = training_for_knn[, -1], 
                  test = testing_for_knn[, -1], 
                  cl = training_knn_cl,
                  k = KNN.basic.performance$k.value[i])
  
  accuracy = sum(KNN_model == testing_knn_cl) / NROW(testing_knn_cl)
  accuracy = round(accuracy*100, 3)
  KNN.basic.performance$model.accuracy[i] = accuracy  

}

# Report the better-performing of the two proposed k-values

best_row = which.max(KNN.basic.performance$model.accuracy)

cat("Best proposed k-value for the basic model is", 
    KNN.basic.performance$k.value[best_row], 
    "with an accuracy of", 
    KNN.basic.performance$model.accuracy[best_row], "%\n")

# Finding the optimal parameters and tuning the model

knn.opt = numeric(499) # holds the test-set accuracy for each k value

kNN.accuracy = expand.grid("K.value" = seq(1, 499, 1), "accuracy_score" = 0)

for (i in 1:499) {
  
  set.seed(1234)
  
  knn.mod = knn(train = training_for_knn[, -1], 
            test = testing_for_knn[, -1], 
            cl = training_knn_cl, 
            k = i)
  
  knn.opt[i] = 100 * sum(knn.mod == testing_knn_cl) / NROW(testing_knn_cl)
  
  k = i
  
  cat(k, "=", knn.opt[i], "")
  
  kNN.accuracy$accuracy_score[i] = knn.opt[i]
}

plot(knn.opt, type = "b", xlab ="K-value", ylab = "Accuracy") # Plot the accuracy of model with testing data across all selected K-values

kNN.accuracy = kNN.accuracy %>% arrange(desc(accuracy_score))

accuracy_KNN_opt = round(max(knn.opt), 3)


Approach 2: Logistic Regression Models

Four different variable selection methods are used in building the logistic regression models.

The first is to select all variables of interest. When putting a fight together, a fair number of negotiations and agreements need to be made between both camps before the match-up can be finalized, covering things such as venue, date, and weight class. As such, it's assumed that these factors aren't as pressing in dictating outcomes as the others, so they are dropped.

relevant_training_data =  training_dif.only %>% select(-weight_class, -gender, -is_foreign, -empty_arena) 

log.reg_all = glm(Winner ~ ., data = relevant_training_data, family = binomial(link = "logit"))

pred_log.reg_all = ifelse(predict(log.reg_all, testing_dif.only, type = "response") > 0.5, "Red", "Blue")
accuracy_log.reg_all = round(mean(pred_log.reg_all == testing_dif.only$Winner)*100, 3)

The second method uses correlation filtering based on the correlation matrix above.

log.reg_cor = glm(Winner ~ ev_dif + win_streak_dif + loss_dif + lose_streak_dif + age_dif + reach_dif + sig_str_dif + avg_str_pct + avg_td_dif + avg_td_pct,  data = training_dif.only, family = binomial(link = "logit"))

pred_log.reg_cor = ifelse(predict(log.reg_cor, testing_dif.only, type = "response") > 0.5, "Red", "Blue")
accuracy_log.reg_cor = round(mean(pred_log.reg_cor == testing_dif.only$Winner)*100, 3)

The third method uses a bi-directional stepwise regression approach.

set.seed(1234)

start_model = glm(Winner ~ 1, data = training_dif.only, family = binomial(link = "logit"))
all_model = glm(Winner ~ ., data = training_dif.only, family = binomial(link = "logit"))

retention = MASS::stepAIC(start_model, direction = 'both', scope = formula(all_model))
 
log.reg_step = glm(retention$formula, data = training_dif.only, family = binomial(link = "logit"))

accuracy_log.reg_step = round(mean(ifelse(predict(log.reg_step, testing_dif.only, type = "response") > 0.5, "Red", "Blue") == testing_dif.only$Winner)*100, 3)

The last method selects only the variables that significantly differ between red and blue corner winners.

holder = data.frame('variables' = colnames(training_dif.only), 'p-value' = NA_real_, check.names = FALSE)

for(i in 1:length(training_dif.only)) {
    
    if(is.numeric(training_dif.only[, i])) {
      
      # t-test for numeric (and integer) variables
      sig.test = t.test(training_dif.only[, i] ~ training_dif.only$Winner) 
      holder$`p-value`[i] = sig.test$p.value
      
    } else if (is.factor(training_dif.only[, i]) | is.logical(training_dif.only[, i])) {
      
      # chi-squared test for categorical variables
      sig.test = chisq.test(training_dif.only$Winner, training_dif.only[, i], correct = T)
      holder$`p-value`[i] = sig.test$p.value
    }
    
}
  

significantly_different_variables = holder %>% filter(`p-value` < 0.05 & variables != "Winner") %>% arrange(`p-value`)

significantly_different_variables$variables
# These should include ev_dif, age_dif, win_streak_dif, loss_dif,  reach_dif,  avg_td_dif, finish_details,  rank_dif,  lose_streak_dif,  sig_str_dif,  height_dif,  avg_td_pct,  total_round_dif,  title_bout,  longest_win_streak_dif,  empty_arena,  KO.TKO.Doctor_Stoppage_win_dif


log_reg.association = glm(Winner ~ ev_dif + age_dif + win_streak_dif + loss_dif + reach_dif + avg_td_dif + finish_details + rank_dif + lose_streak_dif + sig_str_dif + height_dif + avg_td_pct + total_round_dif + title_bout + longest_win_streak_dif + empty_arena + KO.TKO.Doctor_Stoppage_win_dif, data = training_dif.only, family = binomial(link = "logit"))

accuracy_log.reg_association = round(mean(ifelse(predict(log_reg.association, testing_dif.only, type = "response") > 0.5, "Red", "Blue") == testing_dif.only$Winner)*100, 3)

Approach 3: Decision Trees

library(rpart)

# Creating a parameter grid used for selecting the optimal tuning parameters
rpart_control = expand.grid(
  minsplit = seq(10, 40, 5), 
  minbucket = seq(3, 10, 1),
  cp = c(0.001, 0.003, 0.005, 0.01, 0.03, 0.05, 0.1),
  maxdepth = 30, 
  xerror = 0
)

# Use a for-loop to select the best parameters based on the lowest cross-validation error 

for (i in 1:nrow(rpart_control)) {
  
  set.seed(1234)
  
  model = rpart(Winner ~., 
                data = training_dif.only, 
                method = "class", 
                parms = list(split = "gini"),
                control = rpart.control(minsplit = rpart_control$minsplit[i],
                                        minbucket = rpart_control$minbucket[i],
                                        maxdepth = rpart_control$maxdepth[i], 
                                        cp = rpart_control$cp[i]))
  
  rpart_control$xerror[i] = min(model$cptable[, 4]) # column 4 of rpart's cptable is the cross-validated error (xerror)
  
}

decision_tree.best_parameters = rpart_control %>% arrange(xerror) %>% slice_min(xerror, n = 1)

# Using the optimal parameters to tune the final model 

for (i in 1:nrow(decision_tree.best_parameters)) {
  
  set.seed(1234)
  
  model = rpart(Winner ~., 
                data = training_dif.only, 
                method = "class", 
                parms = list(split = "gini"),
                control = rpart.control(minsplit = decision_tree.best_parameters$minsplit[i],
                                        minbucket = decision_tree.best_parameters$minbucket[i],
                                        maxdepth =  decision_tree.best_parameters$maxdepth[i],
                                        cp = decision_tree.best_parameters$cp[i]))
  
  
  accuracy = round(mean(predict(model, testing_dif.only, type = "class") == testing_dif.only$Winner) * 100, 3)
  decision_tree.best_parameters$accuracy[i] = accuracy
  
}

accuracy_decision.tree.tuned = max(decision_tree.best_parameters$accuracy)

Note: With respect to the more "black box" machine learning algorithms, the decision to rely on a more exhaustive approach to parameter optimization (i.e., grid search) comes down to the objective of the task, which is to find the best possible model for predicting outcomes. So while grid search is quite computationally expensive compared to other approaches, it provides a little more assurance that I will get the best results.
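To put that cost in perspective, the decision-tree grid above alone already amounts to a few hundred candidate models:

nrow(rpart_control) # 7 minsplit x 8 minbucket x 7 cp values = 392 candidate fits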

Approach 4: Random Forest Model

library(ranger) # a faster engine for random forest 

# Using a for-loop to find the optimal parameters

rf.hyper_grid = expand.grid(
  mtry = seq(1, 10, by = 1), 
  min.node.size = seq(1, 20, by = 1), 
  splitrule = "gini",
  accuracy_score = 0
)

for (i in 1:nrow(rf.hyper_grid)) {
  
  # train model 
  set.seed(1234)
  model = ranger(Winner ~., 
                 data = training_dif.only, 
                 num.trees = 1500, 
                 mtry = rf.hyper_grid$mtry[i],
                 importance = "impurity",
                 min.node.size = rf.hyper_grid$min.node.size[i], 
                 splitrule = "gini")
  
  pred_model = predict(model, testing_dif.only, type = "response")$predictions
  accuracy = round(mean(pred_model == testing_dif.only$Winner)*100, 5)
  rf.hyper_grid$accuracy_score[i] = accuracy
}

best_rf_parameters = rf.hyper_grid %>% arrange(desc(accuracy_score)) %>% top_n(1)
best_rf_parameters = best_rf_parameters[1,]

set.seed(1234)

rf_best.tuned = ranger(Winner ~., 
                       data = training_dif.only, 
                       num.trees = 1500, 
                       mtry = best_rf_parameters$mtry,
                       importance = 'impurity', 
                       min.node.size = best_rf_parameters$min.node.size, 
                       splitrule = "gini")

accuracy_rf.tuned = round(mean(predict(rf_best.tuned, testing_dif.only)$predictions == testing_dif.only$Winner)*100, 3)

Among the top variables driving the node splits are the difference in wager payouts, the average numbers of significant strikes and takedowns landed, and the average accuracy of significant strikes and takedowns.
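If you want to see this ranking yourself, the tuned ranger model stores the impurity-based importance scores that were requested above; a quick way to pull them out:

# Inspect which variables the tuned forest leans on most
importance_scores = sort(rf_best.tuned$variable.importance, decreasing = TRUE)
head(importance_scores, 10) # the top drivers of the node splits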

Approach 5: Extreme Gradient Boost

Since this process is quite computationally expensive, I would suggest using parallel computing to speed things up. This is accomplished using the parallel and doParallel packages.

library(parallel)
library(doParallel) # provides registerDoParallel()

no_cores = detectCores() - 1 # use one less than the available CPU cores so the machine stays responsive
cl = makePSOCKcluster(no_cores)
registerDoParallel(cl)
# Remember to call stopCluster(cl) once the tuning below is finished

NOTE: Just for the sake of speeding things up (otherwise it will literally take an entire day or so to run), I’ve broken up the grid search hyperparameter tuning into multiple steps.

library(xgboost)

set.seed(1234)

training_for_xgboost = training_dif.only
testing_for_xgboost = testing_dif.only

# Step 1: Pre-Process

training_for_xgboost = training_for_xgboost %>% 
  mutate(
    punch_elbow_slam = ifelse(finish_details == "Punch/Elbow/Slam", 1, 0),
    kick_knee = ifelse(finish_details == "Kick/Knee", 1, 0), 
    choke = ifelse(finish_details == "Choke", 1, 0), 
    decision = ifelse(finish_details == "Decision", 1, 0),
    lowerbody = ifelse(finish_details == "Lowerbody Joint Lock", 1, 0), 
    upperbody = ifelse(finish_details == "Upperbody Joint Lock", 1, 0), 
    heavyweight = ifelse(weight_class == "Heavyweight", 1, 0),
    light.heavyweight = ifelse(weight_class == "Light Heavyweight", 1, 0), 
    middleweight = ifelse(weight_class == "Middleweight", 1, 0), 
    welterweight = ifelse(weight_class == "Welterweight", 1, 0), 
    lightweight = ifelse(weight_class == "Lightweight", 1, 0), 
    featherweight = ifelse(weight_class == "Featherweight", 1, 0), 
    bantamweight = ifelse(weight_class == "Bantamweight", 1, 0), 
    flyweight = ifelse(weight_class == "Flyweight", 1, 0), 
    women_featherweight = ifelse(weight_class == "Women's Featherweight", 1, 0), 
    women_bantamweight = ifelse(weight_class == "Women's Bantamweight", 1, 0), 
    women_flyweight = ifelse(weight_class == "Women's Flyweight", 1, 0), 
    women_strawweight = ifelse(weight_class == "Women's Strawweight", 1, 0), 
    gender = ifelse(gender == "MALE", 1, 0), 
    is_foreign = ifelse(is_foreign == TRUE, 1, 0), 
    title_bout = ifelse(title_bout == TRUE, 1, 0), 
    stance_comparison = ifelse(stance_comparison == "Same", 0, 1)
  )

testing_for_xgboost = testing_for_xgboost %>% 
  mutate(
    punch_elbow_slam = ifelse(finish_details == "Punch/Elbow/Slam", 1, 0),
    kick_knee = ifelse(finish_details == "Kick/Knee", 1, 0), 
    choke = ifelse(finish_details == "Choke", 1, 0), 
    decision = ifelse(finish_details == "Decision", 1, 0),
    lowerbody = ifelse(finish_details == "Lowerbody Joint Lock", 1, 0), 
    upperbody = ifelse(finish_details == "Upperbody Joint Lock", 1, 0), 
    heavyweight = ifelse(weight_class == "Heavyweight", 1, 0),
    light.heavyweight = ifelse(weight_class == "Light Heavyweight", 1, 0), 
    middleweight = ifelse(weight_class == "Middleweight", 1, 0), 
    welterweight = ifelse(weight_class == "Welterweight", 1, 0), 
    lightweight = ifelse(weight_class == "Lightweight", 1, 0), 
    featherweight = ifelse(weight_class == "Featherweight", 1, 0), 
    bantamweight = ifelse(weight_class == "Bantamweight", 1, 0), 
    flyweight = ifelse(weight_class == "Flyweight", 1, 0), 
    women_featherweight = ifelse(weight_class == "Women's Featherweight", 1, 0), 
    women_bantamweight = ifelse(weight_class == "Women's Bantamweight", 1, 0), 
    women_flyweight = ifelse(weight_class == "Women's Flyweight", 1, 0), 
    women_strawweight = ifelse(weight_class == "Women's Strawweight", 1, 0), 
    gender = ifelse(gender == "MALE", 1, 0), 
    is_foreign = ifelse(is_foreign == TRUE, 1, 0), 
    title_bout = ifelse(title_bout == TRUE, 1, 0), 
    stance_comparison = ifelse(stance_comparison == "Same", 0, 1)
  )


training_for_xgboost = training_for_xgboost %>% 
  dplyr::select(
    Winner, punch_elbow_slam, kick_knee, choke, lowerbody, upperbody, decision, total_fight_time_secs, heavyweight, light.heavyweight, middleweight, welterweight, lightweight, featherweight, bantamweight, flyweight, women_featherweight, women_bantamweight, women_flyweight, women_strawweight, gender, is_foreign, title_bout, empty_arena, ev_dif, win_dif, win_streak_dif, longest_win_streak_dif, draw_dif, loss_dif, lose_streak_dif, KO.TKO.Doctor_Stoppage_win_dif, sub_win_dif, dec_dif, age_dif, height_dif, reach_dif, stance_comparison, sig_str_dif, avg_str_pct, avg_td_dif, avg_td_pct, avg_sub_att_dif, total_round_dif, total_title_bout_dif
    )

testing_for_xgboost = testing_for_xgboost %>% 
  dplyr::select(
    Winner, punch_elbow_slam, kick_knee, choke, lowerbody, upperbody, decision, total_fight_time_secs, heavyweight, light.heavyweight, middleweight, welterweight, lightweight, featherweight, bantamweight, flyweight, women_featherweight, women_bantamweight, women_flyweight, women_strawweight, gender, is_foreign, title_bout, empty_arena, ev_dif, win_dif, win_streak_dif, longest_win_streak_dif, draw_dif, loss_dif, lose_streak_dif, KO.TKO.Doctor_Stoppage_win_dif, sub_win_dif, dec_dif, age_dif, height_dif, reach_dif, stance_comparison, sig_str_dif, avg_str_pct, avg_td_dif, avg_td_pct, avg_sub_att_dif, total_round_dif, total_title_bout_dif
    )

# One quirk of the XGBoost package is that it cannot take a data frame as input.
# It only accepts a matrix, so we need to convert the data beforehand.

xgb_training = as.matrix(training_for_xgboost %>% dplyr::select(- Winner))
xgb_testing = as.matrix(testing_for_xgboost %>% dplyr::select(- Winner))

training_label = training_for_xgboost$Winner
testing_label = testing_for_xgboost$Winner 

# First part of creating an optimized model - finding the optimal learning rate (ETA)

xgb_trcontrol = trainControl(
  method = "repeatedcv", 
  number = 3,
  repeats = 5,
  allowParallel = T, 
  classProbs = T,
  summaryFunction = twoClassSummary
)

xgb.grid.1 = expand.grid(
  eta = c(0.001, 0.005, 0.01, 0.05, 0.1, 0.3, 0.5, 1), # this is the only parameter varied in this step
  gamma = 0, 
  max_depth = 6, 
  min_child_weight = 1, 
  subsample = 1, 
  colsample_bytree = 1,
  nrounds = 500
)

set.seed(1234)
xgb_model.A = caret::train(x = xgb_training, 
                           y = training_label,
                           trControl = xgb_trcontrol, 
                           method = "xgbTree",
                           metric = "ROC", # essentially using area under the curve to determine optimal model
                           tuneGrid = xgb.grid.1
                        )

xgb_model.A$bestTune # Find out what ends up being the best eta value and use that going forward

# Second part of creating an optimized model - determining the best subsample, min_child_weight, and max_depth

xgb.grid.2 = expand.grid(
  eta = xgb_model.A$bestTune$eta, # the optimal eta value found in the first step
  gamma = 0,
  max_depth = seq(3, 10, by = 1), 
  min_child_weight = seq(1, 10, by = 1), 
  subsample = seq(0.4, 1, by = 0.1),
  colsample_bytree = 1, 
  nrounds = 100
)

set.seed(1234)

xgb_model.B = caret::train(x = xgb_training, 
                           y = training_label,
                           trControl = xgb_trcontrol,
                           method = "xgbTree", 
                           metric = "ROC", 
                           tuneGrid = xgb.grid.2)

xgb_model.B$bestTune # get optimal subsample, max_depth and min_child_weight

# Third part of creating an optimized model - determining the best colsample_bytree and gamma values 

xgb.grid.3 = expand.grid(
  eta = xgb_model.B$bestTune$eta, 
  gamma = seq(0, 8, 1),
  max_depth = xgb_model.B$bestTune$max_depth, 
  min_child_weight = xgb_model.B$bestTune$min_child_weight, 
  subsample = xgb_model.B$bestTune$subsample,
  colsample_bytree = seq(0.4, 1, 0.1), 
  nrounds = 500
)

set.seed(1234)

xgb_model.C = caret::train(x = xgb_training, 
                           y = training_label,
                           trControl = xgb_trcontrol,
                           method = "xgbTree", 
                           metric = "ROC", 
                           tuneGrid = xgb.grid.3)

xgb_model.C$bestTune 

#  Now that we have the optimal parameters, we can run the tuned model 

xgb.grid.tuned = expand.grid(
  eta = xgb_model.C$bestTune$eta, 
  gamma = xgb_model.C$bestTune$gamma,
  max_depth = xgb_model.C$bestTune$max_depth, 
  min_child_weight = xgb_model.C$bestTune$min_child_weight, 
  subsample = xgb_model.C$bestTune$subsample,
  colsample_bytree = xgb_model.C$bestTune$colsample_bytree, 
  nrounds = 500
)

set.seed(1234)
xgb_model.best = caret::train(x = xgb_training, 
                              y = training_label, 
                              trControl = xgb_trcontrol, 
                              method = "xgbTree", 
                              metric = "ROC", 
                              tuneGrid = xgb.grid.tuned)

accuracy_xgboost.tuned = round(mean(predict(xgb_model.best, xgb_testing) == testing_label)*100, 3)

So, how did I do?

Looking at the performance of the various models, the only ones that outperformed the proposed MMA analyst approach were the logistic regression models. While it's a little annoying to see so much effort go into optimizing the more complex algorithms for nothing, I guess it goes to show that Einstein's old adage is true: "Everything should be made as simple as possible, but no simpler."
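To line the results up side by side, here's a minimal sketch collecting the accuracy scores computed in the sections above into one table:

model_comparison = data.frame(
  model = c("Analyst baseline (betting odds)", "k-NN (tuned)", "Logistic (all variables)", 
            "Logistic (correlation-filtered)", "Logistic (stepwise)", "Logistic (association-filtered)", 
            "Decision tree (tuned)", "Random forest (tuned)", "XGBoost (tuned)"), 
  accuracy = c(accuracy_baseline, accuracy_KNN_opt, accuracy_log.reg_all, 
               accuracy_log.reg_cor, accuracy_log.reg_step, accuracy_log.reg_association, 
               accuracy_decision.tree.tuned, accuracy_rf.tuned, accuracy_xgboost.tuned)
)

model_comparison %>% arrange(desc(accuracy)) # best performers at the top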

# Make predictions using the proposed method
prediction.UFC_258.log_reg_cor = ifelse(predict(log.reg_cor, UFC_258_clean, type = 'response') > 0.5, "Red", "Blue")
UFC_258_clean = UFC_258_clean %>% 
  mutate(analyst_pick = as.factor(ifelse(R_odds > 0, "underdog", ifelse(R_odds < 0, "favorite", NA)))) %>% 
  mutate(analyst_pick = ifelse(analyst_pick == "favorite", "Red", "Blue"))
Baseline_picks = UFC_258_clean$analyst_pick
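For context, here's a hedged sketch of how the Evaluation comparison frame referenced below could be assembled; the R_fighter / B_fighter column names are assumed from the Kaggle data, and the actual winners have to be entered by hand once the event is over:

actual_winners_ufc_258 = rep(NA_character_, 10) # placeholder: fill in "Red"/"Blue" after the event

Evaluation = data.frame(
  "Red Corner" = UFC_258_clean$R_fighter[1:10], 
  "Blue Corner" = UFC_258_clean$B_fighter[1:10], 
  "Model Predictions" = prediction.UFC_258.log_reg_cor[1:10], 
  "Analyst Picks" = Baseline_picks[1:10], 
  "Actual_winners" = actual_winners_ufc_258, 
  check.names = FALSE
)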

When comparing one of the better-performing models (the correlation-filtered logistic regression model) against the proposed MMA analyst method on the "upcoming" fight data set, it looks like our model did a marginally better job (8 correct picks vs. 7 correct picks).

Not the best win, but I’ll take it!

Can I make money with my model?

While I don't condone it, let's see what would happen if I used my model to make bets on fights. A quick look at the payouts from the picks derived from my predictive model shows that if I had placed a $100 wager on each fight, I would stand to make a net profit of $269.07 from wagering $1,000. While it's great that I made a profit, especially compared to making equivalent wagers based on the proposed MMA analyst approach (a net profit of $53.13), it's not an optimal approach at all.
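The arithmetic behind those figures is simple enough to express as a small helper (a sketch, not the original notebook code): a correct pick earns its listed payout on a $100 stake, while a miss loses the stake.

# Minimal sketch of the flat-bet profit arithmetic
flat_bet_profit = function(correct, payouts, stake = 100) {
  # correct: logical vector, did each pick win?
  # payouts: winnings per $100 stake for each pick
  sum(ifelse(correct, payouts, -stake))
}

Feeding in the ten picks' hit/miss results and their payouts is what produces the net figures quoted above.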

A better approach is to make more selective wagers by weighing the probability of victory between fighters in a given match-up. This can be accomplished by establishing a cut-off criterion: if the projected fighter does not have a ___ % chance of winning, then don't place a bet. Although there are many ways to derive this criterion, we can use one that is commonly used: the break-even point, where the expected value of the wager is zero. For a $100 stake with a potential payout of ev:

P(win) × ev − P(lose) × 100 = 0

Since winning-or-losing is a binary outcome, the two probabilities must sum to one, so P(lose) = 1 − P(win) and the formula can be rewritten as:

P(win) × ev − (1 − P(win)) × 100 = 0, which solves to P(win) = 100/(ev + 100)

In the data set we already know what each payout would be, and breaking even means a net value of zero, so we can use this formula to find the cut-off probability for deciding whether to place a wager. For example, an even-money payout (ev = 100) gives a cut-off of 100/(100 + 100) = 50%.

# Create a data frame listing each fighter and their prospective payout 

making.money = as.data.frame(cbind(
  "Actual Winner" = Evaluation$Actual_winners, 
  "My Picks" = Evaluation$`Model Predictions`, 
  "Red Corner" = Evaluation$`Red Corner`, 
  "Red Corner Payout" = round(UFC_258_clean$R_ev[1:10], 2), 
  "Blue Corner" = Evaluation$`Blue Corner`, 
  "Blue Corner Payout" = round(UFC_258_clean$B_ev[1:10], 2)
))

# Evaluation is the data frame (sketched above) that compares my projected fight picks against the proposed MMA analyst picks.

# Function needed to find the break-even probability cutoff for placing a wager
# (ev is the payout on a $100 wager)
prob_to_win = function(ev) {
  
  p_value = 100/(ev+100)
  p_value = p_value * 100
  p_value = round(p_value, 2)
  return(p_value)
}

# The payouts for my projected picks 
projected_winner_payouts = list(35.97, 93.48, 48.78, 39.53, 56.50, 20.53, 23.37, 100, 112, 72.46) 

min.prob.to.make.bet = unlist(lapply(projected_winner_payouts, prob_to_win))

making.money = making.money %>% mutate(`probability to bet` = min.prob.to.make.bet) # minimum win probability needed to break even on each pick

# Using the logistic regression model to get the probability of fighter winning 

red_corner_probability_to_win = round(predict(log.reg_cor, UFC_258_clean, type = 'response')*100, 3)

# Since some of our picks are blue corner fighters, subtract the red corner probability from 100 to get their probability of winning

prob_pick.to.win = c(75.51, (100-47.55), 60.28, 72.14, (100-47.41), 82.68, 77.10, 54.18, 57.61, 51.56)

picks = c("Kamaru Usman", "Alexa Grasso", "Kelvin Gastelum", "Ricky Simon", "Julian Marquez", "Rodolfo Vieira", "Belal Muhammad", "Polyana Viana", "Andre Ewell", "Gabe Green")

should_bet_or_not = data.frame(
  "Actual Winner" = making.money$`Actual Winner`, 
  "My picks" = picks, 
  "Probability to win" = prob_pick.to.win, 
  "Probability to bet" = as.numeric(making.money$`probability to bet`), 
  check.names = FALSE
) # data.frame() keeps the probability columns numeric so the comparison below works

should_bet_or_not = should_bet_or_not %>% mutate(
    `Place bet` = as.factor(ifelse(`Probability to win` > `Probability to bet`, "Yes", "No")), 
    `Wager payouts` = c(35.97, 93.46, 48.78, 39.53, 56.50, 20.53, 22.37, 100, 112, 72.46)
    )

Using this approach, it looks like I would have placed 5 wagers instead of 10 and would stand to make $168.96 off of $500 in wagers (assuming equivalent $100 wagers). That's a net 33.79% return!

There you have it…

With a bit of prep work and some machine learning knowledge, I've successfully built a model that ended up doing a better job than some MMA analysts out there. Now, does this mean we should completely discredit these folks in favor of machine learning algorithms? Absolutely not! If anything, it shows that data is a great complement to existing base knowledge.

MMA analysts certainly have their place by providing perspectives that may not be completely fleshed out in the data. Plus, while making predictions is part of their role, they are essentially there to drive up interest in upcoming fights, and they do a great job at that. (Those who do very poorly at fight predictions do an even better job of driving up interest.) Some of my favorite MMA analysis comes from folks like Robin Black, Brett Okamoto, and Luke Thomas. I would definitely recommend them if you are interested in learning more about the sport.

Now, what I showed here is very elementary, and it's something you can definitely build on further. Additional data or other black-box machine learning models like deep neural networks would likely provide better and more consistent performance than what's shown here. So it's probably best that I not quit my day job to take up MMA gambling as a full-time gig.

You can check out my repository to see all of the code used in this project.

If there are any inconsistencies or questions, let me know in the comments below.

Thanks for reading.


Love this article @michael.hoang17 ! How do you write so well? :heart_eyes:


Thanks, @theparidhi0. It’ll take some time to find your “voice” when it comes to writing, but if you reference some of your favorite articles, writers/authors, or blogs, you’ll find it eventually.

A pro-tip I got from some writers/media folks that I know is to make an initial draft that would interest you as a reader and then just adjust the setup and tone to whoever your audience would be or would like to be.