Personal Project: Simulation of Missing Data with Categorical Data Types

Hey guys,
I kinda fell into a pit here.
I have been starting out at work with some rather basic statistical tests.
I deal with quite some missing data.
To improve my knowledge and capabilities I started a personal project.
Right now I am trying to simulate a missing data mechanism called “Missing Completely at Random”.
Furthermore, to improve my skills, I have decided to work with as many functions as possible AND necessary.

Below you can see my code.
I am struggling with two things.
First, my code throws an error:
Error in rbind(deparse.level, ...): numbers of columns of arguments do not match
I get that error but I do not understand how to fix it because the mutate() function just adds another column to the data set. I want it to replace a variable with NA values.
How do I fix that?

Second one.
If I replace the whole line of mutate with:
data_frame.NA.sample[, feature] <- NA
I get another error:
Error in Summary.factor… : min not meaningful with factors

My emphasis is on nominal/categorical data. I do not want to use dummy variables.

Any suggestions would be welcome.

The dataset I am using is:
https://vincentarelbundock.github.io/Rdatasets/csv/gamclass/german.csv

mcar_sample <- function(data_frame, percent, feature, index = NULL){
  sample <- sample(nrow(data_frame), nrow(data_frame) * percent, replace = FALSE)
  NA.sample <- -sample
  data_frame.sample <- data_frame[sample, ]
  data_frame.NA.sample <- data_frame[NA.sample, ] %>% mutate(feature = NA)
  data_frame.total <- rbind(data_frame.NA.sample, data_frame.sample)
}

Sincerely

Hello @chjherzog,

Sorry for the delayed answer.

I think it’s a very good initiative to try things on your own. That’s how you learn. Three things to learn from your code.

  • Your function must return a result if you want to have access to the output of your function outside the function (in your function you should return data_frame.total)

  • When using the sample() function, it’s worth using set.seed(1) to make sure your result will be reproducible.

  • In R, there are basically two ways to process your data: functions from the “Tidyverse” packages or built-in (base R) functions. It is recommended to avoid mixing the two as much as possible, especially within a single instruction. In Dataquest, we mainly use the “Tidyverse” because we find it more intuitive and faster. Also, be aware that a “tidyverse” instruction is often easier to write outside a function than inside one. A typical example of this mixture is data_frame.NA.sample <- data_frame[NA.sample, ] %>% mutate(feature = NA), which combines base R indexing with mutate().
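
The first two points can be checked with a small base R sketch. The names first_draw, second_draw and add_one are made up for illustration and are not part of your code.

```r
# Resetting the seed before each draw gives the same sample both times.
set.seed(1)
first_draw <- sample(x = 10, size = 3, replace = FALSE)

set.seed(1)
second_draw <- sample(x = 10, size = 3, replace = FALSE)

identical(first_draw, second_draw)  # TRUE

# A function hands back the value of its last evaluated expression.
# (add_one is a hypothetical helper, used only for this illustration.)
add_one <- function(x) {
  result <- x + 1
  result  # this value is returned to the caller
}

add_one(2)
```

Without the second set.seed(1) call, the two draws would generally differ.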

Let’s now dive, in detail, into your code.

I was not able to reproduce this error. Could you please share the code that produced it? Nevertheless, here is a modified version of your function that works for me.

mcar_sample <- function(data_frame, percent, feature, index = NULL){ 
  set.seed(1) #Note the use of set.seed() here
  
  #Naming the arguments explicitly keeps the code clear and makes errors easier to spot.
  sample <- sample(x = nrow(data_frame),
                   size = floor(nrow(data_frame) * percent), 
                   replace = FALSE)
  
  out_data_frame <- data_frame #make a copy of the original dataframe
  
  out_data_frame[sample, feature] <- NA #replace the selected rows of the column named feature  by NA

  out_data_frame #Note how we return the output
}

I checked the function with this code:

#Reading the dataset using the function `read_csv()` from the package `readr`
german <- readr::read_csv("german.csv", col_names = T)

#Using `mcar_sample()` function to introduce some NAs on the variable `V3`
out_german <- mcar_sample(german, 0.25, "V3" )

#Checking if V3 contains 25% of NAs
sum(is.na(out_german$V3))/nrow(out_german) #should be equal to 0.25
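
Since your emphasis is on categorical data, here is a small self-contained check of the same base R approach on a factor column, without dummy variables. The toy data frame and its purpose column are invented for this sketch, and the function is repeated in a shortened form (without set.seed() inside) so the block runs on its own.

```r
# Shortened copy of the base R version, repeated so this block is self-contained.
mcar_sample <- function(data_frame, percent, feature) {
  idx <- sample(x = nrow(data_frame),
                size = floor(nrow(data_frame) * percent),
                replace = FALSE)
  out_data_frame <- data_frame
  out_data_frame[idx, feature] <- NA  # overwrite the selected rows with NA
  out_data_frame
}

# Hypothetical toy data with a nominal (factor) column.
toy <- data.frame(
  id = 1:20,
  purpose = factor(rep(c("car", "education", "furniture", "radio"), times = 5))
)

set.seed(1)
toy_na <- mcar_sample(toy, 0.25, "purpose")

sum(is.na(toy_na$purpose))  # 5 of the 20 rows are now NA
is.factor(toy_na$purpose)   # the column is still a factor
levels(toy_na$purpose)      # the four levels are unchanged
```

Assigning NA to a factor does not touch its levels, so the column stays categorical.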
  • If you want to use the mutate() function, I would recommend not wrapping it in a function but writing this instead:
set.seed(1)

sample <- sample.int(n = nrow(german),
                 size = floor(nrow(german) * 0.25), 
                 replace = F) 

library(dplyr)
library(tidyr)
out_german <-  german %>%
  mutate( V3 = replace(V3, row_number() %in% sample, NA_character_))

Five ingredients are needed here.

  1. Creating the sample.
  2. Using the function row_number() to identify each row in the dataframe.
  3. Using the operator %in% to check which of those row numbers are in the sample vector.
  4. Using the replace() function to replace the values where the previous condition is satisfied by NA. We use NA_character_ because each atomic type in R has its own NA value (for example NA_integer_ and NA_real_ for numbers).
  5. Overwriting the existing column with the mutate() function.
  • If you want to keep your function for this problem, you cannot write mutate(feature = NA) inside it. The mutate() function does not look up the value stored in the variable feature; it takes the name on the left of the = sign literally. Hence, in the code data_frame.NA.sample <- data_frame[NA.sample, ] %>% mutate(feature = NA), instead of overwriting the column whose name is contained in the variable feature, a new column named “feature” is created in the dataframe. That extra column is also why rbind() then reports that the numbers of columns do not match. One solution to this problem is to use the mutate_at() and vars() functions.
mcar_sample <- function(data_frame, percent, feature, index = NULL){ 
  set.seed(1)
  
  sample <- sample(x = nrow(data_frame),
                   size = floor(nrow(data_frame) * percent), 
                   replace = FALSE) 
  
  #requires library(dplyr); work on the data_frame argument, not on the global `german`
  out_data_frame <- data_frame %>%
    mutate_at( vars(feature), ~replace(., row_number() %in% sample, NA_character_))
   
  out_data_frame #return the result
}

In addition to the previous ingredients, note the use of mutate_at(), which takes two arguments:

  1. vars(feature), which selects the column whose name is stored in the variable feature.
  2. The action to apply to that column. The symbol ~ turns the expression into an anonymous function, and the dot (.) stands for the selected column.
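
In dplyr 1.0.0 and later, mutate_at() is superseded by across(), so the same idea can also be sketched as below. This assumes a recent dplyr is installed; the function name mcar_sample_across and the toy housing column are made up for illustration. A plain NA is used here so that the column keeps whatever type it already has.

```r
library(dplyr)

# Hypothetical variant of the function using across() and all_of().
mcar_sample_across <- function(data_frame, percent, feature) {
  idx <- sample.int(n = nrow(data_frame),
                    size = floor(nrow(data_frame) * percent),
                    replace = FALSE)

  data_frame %>%
    # all_of(feature) selects the column named by the string in `feature`;
    # .x is that column, and seq_along(.x) gives its row positions.
    mutate(across(all_of(feature), ~ replace(.x, seq_along(.x) %in% idx, NA)))
}

# Invented toy data with a character column.
toy <- data.frame(
  id = 1:8,
  housing = c("rent", "own", "own", "free", "rent", "own", "free", "own"),
  stringsAsFactors = FALSE
)

set.seed(1)
toy_na <- mcar_sample_across(toy, 0.25, "housing")

sum(is.na(toy_na$housing))  # 2 of the 8 rows are now NA
names(toy_na)               # still only "id" and "housing"
```

Calling set.seed() outside the function, as done here, lets each call draw a different sample while the script as a whole stays reproducible.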
  • If you want to use rbind() or other similar functions you can write these pieces of code:
mcar_sample <- function(data_frame, percent, feature, index = NULL){ 
  set.seed(1)
  
  sample <- sample(x = nrow(data_frame),
                   size = floor(nrow(data_frame) * percent), 
                   replace = FALSE) 
  
  notNA.sample <- -sample
  
  data_frame.sample <- data_frame[notNA.sample, ] 
  
  data_frame.NA.sample <- data_frame[sample, ]
  
  data_frame.NA.sample[, feature] <- NA #replace the selected rows of the column named feature  by NA

  #`rbind()` needs `ncol(data_frame.sample)` and `ncol(data_frame.NA.sample)` to be equal; this holds here because the column is overwritten, not added
  data_frame.total <- rbind(data_frame.NA.sample, data_frame.sample) 
  
  data_frame.total #return the result (note that the NA rows now come first)
}
  • Using “tidyverse” functions:
notna.sample <-  german %>%
  filter(!(row_number() %in% sample)) #Note the use of `!` to take the negation of the condition.

na.sample <-  german %>%
  filter(row_number() %in% sample) %>%
  mutate(V3 = NA)

out_german <- bind_rows(notna.sample, na.sample) #bind_rows() is the dplyr counterpart of the `rbind()` function.
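
To see why the original rbind() call failed while the versions above work, the column mismatch can be reproduced on a toy data frame. This is only a sketch: the data values and the names wrong, right and rbind_fails are invented for illustration.

```r
library(dplyr)

# Invented toy data; `feature` holds a column name as a string.
toy <- data.frame(id = 1:4, V3 = c("A30", "A32", "A34", "A32"))
feature <- "V3"

# mutate(feature = NA) creates a NEW column literally called "feature".
wrong <- toy %>% mutate(feature = NA)
names(wrong)              # "id" "V3" "feature"
ncol(wrong) == ncol(toy)  # FALSE, so rbind() cannot stack the two pieces

# try() captures the rbind() error instead of stopping the script.
rbind_fails <- inherits(try(rbind(wrong, toy), silent = TRUE), "try-error")
rbind_fails               # TRUE

# Base R indexing with the string overwrites the existing column instead.
right <- toy
right[, feature] <- NA
names(right)              # "id" "V3"
```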

Of course, don’t hesitate to ask if something above or elsewhere is not clear to you.

Best,
John.
