Creating a single dataset containing multiple .txt documents, how would I restructure it into the one-token-per-row format using unnest_tokens()?

Hey guys, I’m doing a text-analysis project where I analyse Trump’s speeches.
Text files used are here if you’re interested:
Trump speech files

At the moment, my goal is to create a single dataset based on the text files I have provided

BemidjiSep18_2020.txt
FayettevilleSep19_2020.txt
FreelandSep10_2020.txt
HendersonSep13_2020.txt
LatrobeSep3_2020.txt
MindenSep12_2020.txt
MosineeSep17_2020.txt
OhioSep21_2020.txt
PittsburghSep22_2020.txt
Winston-SalemSep8_2020.txt

Table looks like this:

[screenshot of the combined table]

My code looks like this:

# Read the files in
# lapply() returns a list the same length as txt_files_ls
# Create a data frame for each file by reading it in with read.table()
# Set header = FALSE as we will be adding the column name later
# sep = "\t" means the data is tab delimited
# read.table("file.txt", header = T/F, sep = "\t") is an alternative to read.delim()
txt_files_df_list <- lapply(txt_files_ls, function(x) {
  read.table(file = x, header = FALSE, sep = "\t")
})
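
(For reference, txt_files_ls is the character vector of file paths. It was built earlier with something along these lines, assuming the .txt files sit in the working directory:)

# txt_files_ls: character vector of the speech file paths
txt_files_ls <- list.files(pattern = "\\.txt$", full.names = TRUE)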

# Combine them and set the column name to "Speech" using the setNames() function
# do.call() constructs and executes a function call from a name or function, in this case "rbind"
combined_df <- setNames(do.call("rbind", txt_files_df_list),
                        c("Speech"))

# Create an R object for the locations of the speeches, listed in the same order as the files were read in
location <- c("Bemidji", "Fayetteville", "Freeland", "Henderson", "Latrobe", "Minden", "Mosinee", "Ohio", "Pittsburgh", "Winston-Salem" )


# Using the dplyr package and the mutate() function, add the locations as a new column in a new data frame
combined_df_2 <- mutate(combined_df, Location = location)

# Create an R object for the dates of the speeches, extracted from the file names and listed in the same order as the files were read in
date <- c("2020-09-18", "2020-09-19", "2020-09-10", "2020-09-13", "2020-09-03", "2020-09-12", "2020-09-17", "2020-09-21", "2020-09-22", "2020-09-08")

# Convert the strings to Date objects using the as_date() function, supplying the format the dates are written in
date_2 <- lubridate::as_date(date, format = "%Y-%m-%d")

# Again using the dplyr package and the mutate() function, add the dates as a new column
combined_df_3 <- mutate(combined_df_2, Date = date_2)

# Check the structure of the combined dataset: Speech and Location should be characters and Date should be a Date
str(combined_df_3)

view(combined_df_3)

My question is: how would I break the text into individual tokens and transform it into a tidy data structure?
How would I tokenize the dialogue, splitting each sentence into separate words?

When I try to do it myself with the code:

test_df <- combined_df_3 %>% 
  unnest_tokens(word, combined_df_3$Speech) 

I get the error :

[screenshot of the error message]
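
From reading the tidytext docs I suspect unnest_tokens() wants bare column names via tidy evaluation rather than the $ form, so maybe it should look like the sketch below, but I’m not sure:

library(dplyr)
library(tidytext)

# Tokenize: one row per word, keeping the Location and Date columns
test_df <- combined_df_3 %>%
  unnest_tokens(output = word, input = Speech)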

Any guidance would be appreciated!
Also, if there’s a way to make my original code smaller by extracting the location and date from each file name and putting them into individual columns alongside the file contents (Speech, Location and Date), that would also be helpful!
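
Something like this untested sketch is the kind of shortening I have in mind, reading each file as a single string and pulling Location and Date straight out of file names shaped like "BemidjiSep18_2020.txt" (the regular expressions here are my own guesses):

library(dplyr)
library(purrr)
library(readr)
library(stringr)
library(lubridate)

# Read each file as one string and derive Location/Date from the file name
combined_df_3 <- map_dfr(txt_files_ls, function(x) {
  name <- basename(x)
  tibble(
    Speech   = read_file(x),
    # Location: everything before the three-letter month abbreviation
    Location = str_extract(name, "^[A-Za-z-]+?(?=[A-Z][a-z]{2}\\d)"),
    # Date: e.g. "Sep18_2020", parsed as month-day-year
    Date     = mdy(str_extract(name, "[A-Z][a-z]{2}\\d{1,2}_\\d{4}"))
  )
})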
