Data Cleaning with R

I am currently working on the Data Cleaning with R section (page 11 of 14).
I am confused about why we need the names() function in the code below.
I appreciate your kind help in advance.


ny_schools <- list(sat_results, ap_2010, class_size, demographics, graduation, hs_directory)
names(ny_schools) <- c("sat_results", "ap_2010", "class_size", "demographics", "graduation", "hs_directory")

duplicate_DBN <- ny_schools %>%
  map(mutate, is_dup = duplicated(DBN)) %>%
  map(filter, is_dup == "TRUE")


Thanks

Hi @jaehoown! The code runs the same way with or without names(). The function only labels each element of the list, which makes it easier to tell which dataframe is which when you inspect the results. For example, if we ran this code without names():

ny_schools <- list(sat_results, ap_2010, class_size, demographics, graduation, hs_directory)

duplicate_DBN <- ny_schools %>%
  map(mutate, is_dup = duplicated(DBN))  %>%
  map(filter, is_dup == "TRUE")

print(duplicate_DBN)

The output looks like this:

[[1]]
# A tibble: 0 x 8
# … with 8 variables: DBN <chr>, `SCHOOL NAME` <chr>, `Num of SAT Test
#   Takers` <dbl>, `SAT Critical Reading Avg. Score` <dbl>, `SAT Math Avg.
#   Score` <dbl>, `SAT Writing Avg. Score` <dbl>, avg_sat_score <dbl>,
#   is_dup <lgl>

[[2]]
# A tibble: 1 x 8
  DBN   SchoolName `AP Test Takers` `Total Exams Ta… `Number of Exam…
  <chr> <chr>                 <dbl>            <dbl>            <dbl>
1 04M6… YOUNG WOM…               NA               NA               NA
# … with 3 more variables: exams_per_student <dbl>, high_score_percent <dbl>,
#   is_dup <lgl>

[[3]]
# A tibble: 0 x 8
# Groups:   CSD, SCHOOL CODE [0]
# … with 8 variables: CSD <dbl>, `SCHOOL CODE` <chr>, `SCHOOL NAME` <chr>,
#   avg_class_size <dbl>, avg_largest_class <dbl>, avg_smallest_class <dbl>,
#   DBN <chr>, is_dup <lgl>

[[4]]
# A tibble: 0 x 13
# … with 13 variables: DBN <chr>, Name <chr>, frl_percent <dbl>,
#   total_enrollment <dbl>, ell_percent <dbl>, sped_percent <dbl>,
#   asian_per <dbl>, black_per <dbl>, hispanic_per <dbl>, white_per <dbl>,
#   male_per <dbl>, female_per <dbl>, is_dup <lgl>

[[5]]
# A tibble: 0 x 5
# … with 5 variables: DBN <chr>, `School Name` <chr>, `Total Grads - % of
#   cohort` <chr>, `Dropped Out - % of cohort` <chr>, is_dup <lgl>

[[6]]
# A tibble: 0 x 4
# … with 4 variables: DBN <chr>, school_name <chr>, `Location 1` <chr>,
#   is_dup <lgl>

Each element is labeled only by its position ([[1]], [[2]], and so on), so it’s not clear at first glance which dataframe is which. But if we run this code:

ny_schools <- list(sat_results, ap_2010, class_size, demographics, graduation, hs_directory)
names(ny_schools) <- c("sat_results", "ap_2010", "class_size", "demographics", "graduation", "hs_directory")

duplicate_DBN <- ny_schools %>%
  map(mutate, is_dup = duplicated(DBN))  %>%
  map(filter, is_dup == "TRUE")

print(duplicate_DBN)

The output is easier to understand because we’ve named each dataframe included in the list:

$sat_results
# A tibble: 0 x 8
# … with 8 variables: DBN <chr>, `SCHOOL NAME` <chr>, `Num of SAT Test
#   Takers` <dbl>, `SAT Critical Reading Avg. Score` <dbl>, `SAT Math Avg.
#   Score` <dbl>, `SAT Writing Avg. Score` <dbl>, avg_sat_score <dbl>,
#   is_dup <lgl>

$ap_2010
# A tibble: 1 x 8
  DBN   SchoolName `AP Test Takers` `Total Exams Ta… `Number of Exam…
  <chr> <chr>                 <dbl>            <dbl>            <dbl>
1 04M6… YOUNG WOM…               NA               NA               NA
# … with 3 more variables: exams_per_student <dbl>, high_score_percent <dbl>,
#   is_dup <lgl>

$class_size
# A tibble: 0 x 8
# Groups:   CSD, SCHOOL CODE [0]
# … with 8 variables: CSD <dbl>, `SCHOOL CODE` <chr>, `SCHOOL NAME` <chr>,
#   avg_class_size <dbl>, avg_largest_class <dbl>, avg_smallest_class <dbl>,
#   DBN <chr>, is_dup <lgl>

$demographics
# A tibble: 0 x 13
# … with 13 variables: DBN <chr>, Name <chr>, frl_percent <dbl>,
#   total_enrollment <dbl>, ell_percent <dbl>, sped_percent <dbl>,
#   asian_per <dbl>, black_per <dbl>, hispanic_per <dbl>, white_per <dbl>,
#   male_per <dbl>, female_per <dbl>, is_dup <lgl>

$graduation
# A tibble: 0 x 5
# … with 5 variables: DBN <chr>, `School Name` <chr>, `Total Grads - % of
#   cohort` <chr>, `Dropped Out - % of cohort` <chr>, is_dup <lgl>

$hs_directory
# A tibble: 0 x 4
# … with 4 variables: DBN <chr>, school_name <chr>, `Location 1` <chr>,
#   is_dup <lgl>
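
Naming the list also helps beyond printing. Here is a minimal sketch of that, using two made-up toy data frames (toy_a and toy_b are hypothetical stand-ins, not the course datasets). It assumes dplyr and purrr are installed, and it writes the map() calls with ~ lambdas, which should behave the same as the map(mutate, ...) form above. Because map() keeps the names of its input, you can pull out one result by name or count the duplicates per dataframe in a labeled vector:

```r
library(dplyr)
library(purrr)

# Hypothetical toy data standing in for the course data frames
toy_a <- tibble(DBN = c("01M001", "01M002", "01M003"))  # no repeated DBN
toy_b <- tibble(DBN = c("02M001", "02M001", "02M002"))  # one repeated DBN

toy_list <- list(toy_a, toy_b)
names(toy_list) <- c("toy_a", "toy_b")
# Equivalent alternative: toy_list <- list(toy_a = toy_a, toy_b = toy_b)

# Flag repeated DBN values, then keep only the flagged rows
toy_dups <- toy_list %>%
  map(~ mutate(.x, is_dup = duplicated(DBN))) %>%
  map(~ filter(.x, is_dup))

# map() preserves the names, so a single result can be accessed by name
toy_dups$toy_b

# Named integer vector: number of duplicated rows in each data frame
dup_counts <- map_int(toy_dups, nrow)
dup_counts
```

With an unnamed list you would have to remember that toy_b is element [[2]], and dup_counts would come back without labels.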

I hope this helps. Please let me know if you have any follow-up questions. Best,
-Casey
