# When is the Witching Hour? Documenting the Process

## Planning

I think I want to do something related to Halloween for this project, so I’m going to go to social-searcher.com to find my data set, searching for the word “ghosts.” I’d like to see whether the social network a post comes from correlates with the amount of fear that sentiment analysis detects in it. I don’t really have a hypothesis yet, but I have a hunch that comparing the different networks will turn up something interesting.

## Initial analysis

### Import data

Reading the file with a plain call to read.csv() gives me the error “more columns than column names,” which suggests a mismatch between the first row (where column names are defined) and later rows. I’ll try reading it again, skipping the first line, to see what’s wrong:

ghosts <- read.csv("ghosts.csv", skip = 1)

This seems to have worked perfectly. I guess the CSV file was malformed in some way. But too many of my columns are factors, so I’m going to change one parameter. And since I want to make sure all the characters are read correctly, I’m going to declare my encoding:

ghosts <- read.csv("ghosts.csv",
                   skip = 1,
                   stringsAsFactors = FALSE,
                   encoding = "UTF-8")

dark <- ghosts
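
For anyone curious what kind of problem hides behind that first error, one way to check is to count the fields on each line. This is only a sketch under assumptions: it writes a small made-up file (not my real data) whose first line has fewer fields than the rest, and the names `demo_file`, `field_counts` and `demo` are invented for the illustration. `count.fields()` is part of base R.

```r
# Hypothetical stand-in for a malformed export: the first line is a
# one-field banner, and the real header starts on line two.
demo_file <- tempfile(fileext = ".csv")
writeLines(c("Exported data",
             "network,posted,text",
             "twitter,2019-10-01 23:15:00,boo",
             "reddit,2019-10-02 08:30:00,ghosts again"),
           demo_file)

# Count the comma-separated fields on every line of the file
field_counts <- count.fields(demo_file, sep = ",", quote = "\"")
print(field_counts)

# Skipping the stray first line lets read.csv() find the real header
demo <- read.csv(demo_file, skip = 1, stringsAsFactors = FALSE)
str(demo)
```

A first line with a different count from all the others is a good hint that `skip = 1` is the right fix.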

Now a few columns should be switched back to factors:

dark$network <- as.factor(dark$network)
dark$type <- as.factor(dark$type)
dark$popularity_name <- as.factor(dark$popularity_name)

Things are mostly well formed, but my data is still a little messy. Some of the “networks” in column one are bizarre, so I’m going to remove those rows, and then drop these levels. The same thing happens in a couple other factor columns.

# I've added "vkontakte" here to the networks, as its text is almost wholly in the Cyrillic alphabet, and I can't read it.
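badrows <- c(grep("Bust That Ghost", as.character(dark$network)),
             grep("vkontakte", as.character(dark$network)),
             grep("2019", as.character(dark$type)),
             grep("http", as.character(dark$popularity_name)))

# Keep only the rows that were not flagged (safe even if nothing matched)
dark <- dark[setdiff(seq_len(nrow(dark)), badrows), ]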

dark$network <- droplevels(dark$network)
dark$type <- droplevels(dark$type)
dark$popularity_name <- droplevels(dark$popularity_name)
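
To show why the `droplevels()` step matters, here is a minimal sketch on a made-up miniature of the data. The data frame `toy` and its values are invented for illustration; only the two unwanted network names come from my real data.

```r
# Made-up miniature of the data; the values are illustrative only
toy <- data.frame(
  network = factor(c("twitter", "reddit", "Bust That Ghost",
                     "vkontakte", "twitter")),
  text    = c("boo", "eek", "stray title", "privet", "ghosts"),
  stringsAsFactors = FALSE
)

# Find the row numbers whose network matches an unwanted pattern
bad <- grep("Bust That Ghost|vkontakte", as.character(toy$network))

# Keep every row that was not flagged; setdiff() is safe even when
# nothing matched
toy <- toy[setdiff(seq_len(nrow(toy)), bad), ]

# The removed values still linger as empty factor levels...
levels(toy$network)

# ...until droplevels() discards the unused ones
toy$network <- droplevels(toy$network)
levels(toy$network)
```

Without that last step, the empty levels would still show up in tables and plots as categories with zero rows.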

Now I’m going to limit the columns to make the data easier to handle:

my_data <- dark[ , c("network", "posted", "text", "type")]

At this point, I think I’m ready to pull in the sentiment analysis.

### Sentiment analysis

Before getting too far along, I need to load up some useful packages:

library(syuzhet)
library(dplyr)
library(tm)

Now, I’m going to do some quick cleaning of the “text” column of my_data:

# Remove URLs
my_data$text <- gsub(pattern = "http[s]?\\S*", replacement = "", my_data$text)

# Remove @handles
my_data$text <- gsub(pattern = "@[A-Za-z0-9_]*", replacement = "", my_data$text)

# Remove any punctuation
my_data$text <- gsub(pattern = "\\W", replacement = " ", my_data$text)

# Remove any digits
my_data$text <- gsub(pattern = "\\d", replacement = " ", my_data$text)

# Remove any single letters (digits are already gone)
my_data$text <- gsub(pattern = "\\b[A-Za-z]\\b", replacement = " ", my_data$text)

# Remove linebreaks, convert to lower case, and remove extra spacing
my_data$text <- gsub("\n", " ", my_data$text) %>%
tolower() %>%
stripWhitespace()
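
To see what this cleaning does to a single post, the steps can be bundled into one function and run on a sample string. This is a sketch, not the code I ran: the helper name `clean_text` and the sample post are made up, it uses base R only (whitespace is collapsed with `gsub()` rather than tm's `stripWhitespace()`), and it also trims leading and trailing spaces, which the steps above don't do.

```r
# Hypothetical helper bundling the cleaning steps with base R only
clean_text <- function(x) {
  x <- gsub("http[s]?\\S*", "", x)     # URLs
  x <- gsub("@[A-Za-z0-9_]*", "", x)   # @handles
  x <- gsub("\\W", " ", x)             # punctuation
  x <- gsub("\\d", " ", x)             # digits
  x <- gsub("\\b[A-Za-z]\\b", " ", x)  # single letters
  x <- tolower(x)
  x <- gsub("\\s+", " ", x)            # collapse runs of whitespace
  trimws(x)                            # extra step: trim both ends
}

# Made-up post for illustration
sample_post <- "@spooky_fan I saw 3 GHOSTS last night!! https://example.com/boo"
clean_text(sample_post)
# → "saw ghosts last night"
```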

Now add the columns of sentiment:
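
As a rough illustration of what adding sentiment columns with syuzhet can look like, here is a minimal sketch. It assumes the NRC emotion lexicon (the syuzhet method that reports a fear score) via `get_nrc_sentiment()`, and the names `demo_text`, `demo_scores` and `demo_sentiment` are invented for the example, as are the two posts.

```r
library(syuzhet)

# Two made-up posts standing in for the cleaned text column
demo_text <- c("the ghost in the attic terrified me",
               "what a lovely sunny morning")

# get_nrc_sentiment() returns one row per text and one column per
# NRC emotion (anger, fear, joy, ...) plus negative and positive
demo_scores <- get_nrc_sentiment(demo_text)

# Bind the scores next to the original text
demo_sentiment <- cbind(data.frame(text = demo_text,
                                   stringsAsFactors = FALSE),
                        demo_scores)
names(demo_sentiment)
```
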

this_data <- fears_group_dark[fears_group_dark$group == grp, ]
barplot(this_data$fear, names.arg = this_data$darkness, main = grp, col = c("yellow", "navy"), ylab = "average fear level")
}

Some of these groups are surprising. I maybe could have guessed that the search term for vampires would show more fearfulness during the day, but I never would have guessed it for witches!

### Visualizing by hour of posting

Since night time postings don’t behave how I expected them to, I want to see how the fear values break down by hour.

fear_time <- my_sentiment %>%
group_by(hour) %>%
summarise(fear = mean(fear))

barplot(fear_time$fear, names.arg = fear_time$hour, ylab = "average fear level", xlab = "hour of post")

Visualizing the fear values for each hour looks super interesting! In fact, it suggests trends that are more interesting than my original question: Instead of showing fear of the dark, these posts on social media seem to have greater values of fear right at daybreak. That’s really dark. (Pardon the pun.) I would like to modify my project to show something about this chart, and I think I want to include something like it in my final analysis. But I want it to look nice, so I’m going to add color by day and night:

the_colors <- c(rep("navy", 6), rep("yellow", 12), rep("navy", 6))
barplot(fear_time$fear, names.arg = fear_time$hour, col = the_colors, ylab = "average fear level", xlab = "hour of post")
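
The hourly breakdown can also be sketched end to end on made-up data. This version is an illustration under assumptions, not my actual pipeline: the timestamp format, the data frame `toy`, and the daylight window of 6:00 to 17:59 are all assumed, and it uses base R's `aggregate()` in place of dplyr. Deriving the colors from the hour values means they stay correct even when some hours have no posts.

```r
# Made-up posts: a timestamp and a fear score for each
toy <- data.frame(
  posted = c("2019-10-01 05:10:00", "2019-10-01 05:40:00",
             "2019-10-01 14:00:00", "2019-10-01 23:30:00"),
  fear   = c(2, 4, 1, 3),
  stringsAsFactors = FALSE
)

# Pull the hour (0-23) out of the timestamp string
toy$hour <- as.integer(format(as.POSIXct(toy$posted, tz = "UTC"), "%H"))

# Average fear per hour
toy_time <- aggregate(fear ~ hour, data = toy, FUN = mean)

# Assumed daylight window of 6:00 to 17:59; colors come from the hour
# values themselves, so gaps in the data cannot shift them
toy_time$color <- ifelse(toy_time$hour >= 6 & toy_time$hour < 18,
                         "yellow", "navy")

barplot(toy_time$fear, names.arg = toy_time$hour, col = toy_time$color,
        ylab = "average fear level", xlab = "hour of post")
```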

I’m happy with this visualization! And I’m happy with these analyses. From a hunch, I moved toward a hypothesis and further analysis that was just a little more complex.

There’s nothing inherently wrong with the original question, and it could have been answered: Vimeo’s network shows the greatest average value of fear. But that wasn’t very interesting to me. I could also have stopped after answering my second question, where the answer turned out to be negative: No, social networkers don’t show more fear in the posts they make at night time. But answering this question made me curious about other questions, which led to other analyses. All that remains is to write up the report (available as a webpage, a PDF or an R Markdown file).

## Save

Now that I have everything done, I’m going to make sure I back up the work by saving the important objects to disk:

saveRDS(my_sentiment, "my_sentiment.RDS")
saveRDS(fear_time, "fear_time.RDS")

It’s always a good idea to back up your work once you’ve reached certain milestones, but I have an ulterior motive, too: I’ll be continuing this analysis with visualizations in ggplot2 on another webpage, and I want to be able to import these tables there with little difficulty.
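
Reading a saved table back in takes a single call to `readRDS()`. The round trip below is a small sketch on a made-up table written to a temporary folder, so the names `toy_time`, `rds_path` and `restored` are invented and nothing real gets overwritten.

```r
# Made-up table standing in for fear_time
toy_time <- data.frame(hour = 0:2, fear = c(0.4, 0.1, 0.3))

# Save to a temporary path so nothing real is overwritten
rds_path <- file.path(tempdir(), "toy_time.RDS")
saveRDS(toy_time, rds_path)

# readRDS() returns the object, so it can be given any name on import
restored <- readRDS(rds_path)
identical(toy_time, restored)
```

Unlike `save()` and `load()`, this pair stores one object per file and doesn't restore the original variable name, which makes it easy to avoid clobbering anything in the next session.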