Hello @chjherzog,
Sorry for the delayed answer.
I think it’s a very good initiative to try things on your own. That’s how you learn. Here are three things to take away from your code.
- Your function must return a result if you want to access its output outside the function (in your function, you should return `data_frame.total`).
- When using the `sample()` function, it’s worth calling `set.seed(1)` first to make sure your result is reproducible.
- In R, there are basically two ways to process your data: using functions from “Tidyverse” packages or using built-in functions. It is recommended to avoid mixing the two as much as possible (especially in the same instruction). At Dataquest, we mainly use the “Tidyverse” because we find it more intuitive and faster to work with. Also, a heads-up: a “tidyverse” instruction can be easier to write outside a function. (This is a typical example of such a mixture: `data_frame.NA.sample <- data_frame[NA.sample, ] %>% mutate(feature = NA)`.)
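To make the point about `set.seed()` concrete, here is a minimal sketch with made-up numbers (the variable names are only for illustration):

```r
# Setting the same seed before each draw makes sample() return the same values.
set.seed(1)
first_draw <- sample(x = 100, size = 5)

set.seed(1)
second_draw <- sample(x = 100, size = 5)

identical(first_draw, second_draw)  # TRUE
```

Keep in mind that every draw made right after `set.seed(1)` restarts from the same state, so a function that sets the seed internally picks the same rows on every call.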
Let’s now dive into your code in detail.
I was not able to reproduce this error. Could you please share the code that produced it? In the meantime, here’s a modified version of your function that works for me.
mcar_sample <- function(data_frame, percent, feature, index = NULL){
  set.seed(1) # Note the use of set.seed() here
  # Naming the arguments explicitly keeps the code clear and makes errors easier to spot.
  sample <- sample(x = nrow(data_frame),
                   size = floor(nrow(data_frame) * percent),
                   replace = FALSE)
  out_data_frame <- data_frame # make a copy of the original dataframe
  out_data_frame[sample, feature] <- NA # replace the selected rows of the column named in `feature` with NA
  out_data_frame # Note how we return the output
}
I checked the function with this code:
#Reading the dataset using the function `read_csv()` from the package `readr`
german <- readr::read_csv("german.csv", col_names = TRUE)
#Using `mcar_sample()` function to introduce some NAs on the variable `V3`
out_german <- mcar_sample(german, 0.25, "V3" )
#Checking if V3 contains 25% of NAs
sum(is.na(out_german$V3))/nrow(out_german) #should be equal to 0.25
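If you don’t have `german.csv` at hand, the same check can be sketched on a toy data frame. The data below is made up, and the indexing logic is simply copied from the function body:

```r
# Toy data standing in for german.csv (hypothetical values).
toy <- data.frame(V1 = 1:8, V3 = letters[1:8], stringsAsFactors = FALSE)
feature <- "V3"

set.seed(1)
rows <- sample(x = nrow(toy),
               size = floor(nrow(toy) * 0.25),
               replace = FALSE)

out_toy <- toy                 # work on a copy
out_toy[rows, feature] <- NA   # a character column name works as an index

mean(is.na(out_toy[[feature]]))  # 0.25 (2 of 8 rows)
sum(is.na(toy$V3))               # 0: the original is untouched
```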
- If you want to use the `mutate()` function, I would recommend not wrapping it in a function but rather writing this:
set.seed(1)
sample <- sample.int(n = nrow(german),
                     size = floor(nrow(german) * 0.25),
                     replace = FALSE)
library(dplyr)
library(tidyr)
out_german <- german %>%
  mutate(V3 = replace(V3, row_number() %in% sample, NA_character_))
Five ingredients are necessary:
- Creating the sample.
- Using the `row_number()` function to identify each row in the dataframe.
- Using the `%in%` operator to check which of those row numbers are in the `sample` vector.
- Using the `replace()` function to replace the values where the previous condition is satisfied with `NA`. We use `NA_character_` because in R each type has its own NA value, and `NA_character_` is the one for character columns.
- Overwriting the existing column with the `mutate()` function.
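These ingredients can be tried out on a plain vector first. In this small sketch with made-up values, `seq_along(x)` plays the role that `row_number()` plays inside `mutate()`:

```r
x <- c("a", "b", "c", "d", "e")
rows <- c(2, 5)                        # positions to blank out (made up)

in_sample <- seq_along(x) %in% rows    # FALSE TRUE FALSE FALSE TRUE
replace(x, in_sample, NA_character_)   # "a" NA "c" "d" NA

# Each type has its own NA:
typeof(NA)             # "logical"
typeof(NA_character_)  # "character"
typeof(NA_integer_)    # "integer"
typeof(NA_real_)       # "double"

# Using the wrong one silently changes the type of the result:
num_out <- replace(c(1, 2, 3), 2, NA_character_)
typeof(num_out)        # "character"
```

So if the column you are blanking out is numeric, `NA_real_` (or `NA_integer_`) is the safer choice.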
- If you want to use your function for this problem, then you cannot use `mutate()` inside it the way you did. `mutate()` doesn’t expect a variable that contains the name of the column to create or overwrite; it expects the column name itself. Hence, in the code `data_frame.NA.sample <- data_frame[NA.sample, ] %>% mutate(feature = NA)`, instead of overwriting the column whose name is stored in the variable `feature`, a new column named “feature” is created in the dataframe. One solution to this problem is to use the `mutate_at()` and `vars()` functions.
mcar_sample <- function(data_frame, percent, feature, index = NULL){
  set.seed(1)
  sample <- sample(x = nrow(data_frame),
                   size = floor(nrow(data_frame) * percent),
                   replace = FALSE)
  # Work on the `data_frame` argument (not the global `german`) and return the result
  data_frame %>%
    mutate_at(vars(feature), ~replace(., row_number() %in% sample, NA_character_))
}
In addition to the previous ingredients, note the use of `mutate_at()`, which takes two arguments:
- The `vars()` function, which tells R that the name of the column is contained in the variable `feature`.
- The action to perform on this column. The `~` symbol here turns `replace(...)` into a function to apply to the column, with `.` standing for the column’s values.
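Here is a small sketch of the pitfall on a toy tibble (the data and the `rows` vector are made up). It also shows an alternative spelling with `across()` and `all_of()`, assuming dplyr 1.0.0 or later, where `mutate_at()` is marked as superseded:

```r
library(dplyr)

toy <- tibble(V1 = 1:4, V3 = c("a", "b", "c", "d"))
feature <- "V3"   # the column name is stored in a variable
rows <- c(1, 3)   # rows to blank out (made up)

# Pitfall: this creates a new column literally named "feature".
wrong <- toy %>% mutate(feature = NA)
names(wrong)   # "V1" "V3" "feature"

# across() + all_of() selects the column whose name is stored in `feature`.
right <- toy %>%
  mutate(across(all_of(feature), ~ replace(.x, row_number() %in% rows, NA)))
right$V3       # NA "b" NA "d"
```

Inside `replace()`, a plain `NA` keeps the column’s type, so this version does not depend on the column being character.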
- If you want to use `rbind()` or other similar functions, you can write this code:
mcar_sample <- function(data_frame, percent, feature, index = NULL){
  set.seed(1)
  sample <- sample(x = nrow(data_frame),
                   size = floor(nrow(data_frame) * percent),
                   replace = FALSE)
  notNA.sample <- -sample # negative indices drop the sampled rows
  data_frame.sample <- data_frame[notNA.sample, ]
  data_frame.NA.sample <- data_frame[sample, ]
  data_frame.NA.sample[, feature] <- NA # replace the selected rows of the column named in `feature` with NA
  # `rbind()` needs both pieces to have the same columns; `mutate(feature = NA)` broke this by adding an extra column
  data_frame.total <- rbind(data_frame.NA.sample, data_frame.sample)
  data_frame.total # return the result
}
- Using “tidyverse” functions:
notna.sample <- german %>%
  filter(!(row_number() %in% sample)) # Note the use of `!` to negate the condition.
na.sample <- german %>%
  filter(row_number() %in% sample) %>%
  mutate(V3 = NA)
out_german <- bind_rows(notna.sample, na.sample) # bind_rows() is the dplyr counterpart of `rbind()`.
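One side effect of the split-and-bind approach is worth knowing: the rows come back in a different order than in the original dataframe. A minimal sketch on toy data, assuming an identifier column `id` (hypothetical, not in your dataset) that lets you restore the order:

```r
toy <- data.frame(id = 1:6, V3 = letters[1:6], stringsAsFactors = FALSE)
rows <- c(2, 5)              # rows to blank out (made up)

na_part <- toy[rows, ]
na_part[, "V3"] <- NA
rest <- toy[-rows, ]

bound <- rbind(na_part, rest)
bound$id                     # 2 5 1 3 4 6: the NA rows moved to the top

restored <- bound[order(bound$id), ]
restored$id                  # 1 2 3 4 5 6
```

The `replace()`-based versions above avoid this step because they never move any rows.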
Of course, don’t hesitate to ask if anything above or elsewhere is unclear.
Best,
John.