Text mining
Data retrieval
Now let’s move on to simple text analysis. First, as usual, we need to prepare the data.
tokens <- tweets %>%
mutate(id = 1:nrow(tweets)) %>% # This creates a tweet id
select(id, text) %>% # Keeps only id and text of the tweet
unnest_tokens(word, text) # Creates tokens!
tokens
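As a minimal sketch of what unnest_tokens() does, the toy example below shows that, with its default settings, the function lowercases the text, strips punctuation and returns one row per word. The data frame toy and its two texts are made up for illustration and are not taken from the tweets.

```r
library(dplyr)
library(tidytext)

# Hypothetical two-document corpus, for illustration only
toy <- data.frame(id = 1:2,
                  text = c("Hello, world!", "HELLO again."),
                  stringsAsFactors = FALSE)

toy_tokens <- toy %>%
  unnest_tokens(word, text)   # One row per word; the text column is dropped

toy_tokens$word               # Lowercased words, punctuation removed
```

Each token keeps the id of the document it came from, which is what makes later joins and counts per tweet possible.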
Let’s have a look at word frequencies.
tokens %>%
count(word, sort = TRUE)
The ranking is polluted by short words. Let’s filter them out (FIRST METHOD), starting by computing the length of each word.
tokens %>% mutate(length = nchar(word))
Data frequencies
Now let’s omit the short words (fewer than 5 characters).
NOTE: all the thresholds below depend on the sample!
tokens %>%
mutate(length = nchar(word)) %>%
filter(length > 4) %>% # Keep words with length larger than 4
count(word, sort = TRUE) %>% # Count words
head(21) %>% # Keep only the top 21 words
ggplot(aes(y = reorder(word,n), x = n)) + geom_col() + ylab("Words")

A better way to proceed is to remove “stop words” like “a”, “I”, “of”, “the”, etc. (SECOND METHOD). It also makes sense to remove the search term and “https”.
data("stop_words")
tidy_tokens <- tokens %>%
anti_join(stop_words, by = "word") # Remove irrelevant terms (stop words)
tidy_tokens %>%
count(word, sort = TRUE) %>% # Count words
head(20) %>% # Keep only the top 20 words
ggplot(aes(y = reorder(word,n), x = n)) + geom_col() + ylab("Words")

Problem: strange characters remain. We remove them by converting the text to ASCII: words that cannot be converted become NA and are then omitted.
tidy_tokens <- tokens %>%
anti_join(stop_words, by = "word") %>% # Remove irrelevant terms (stop words)
mutate(word = iconv(word, from = "UTF-8", to = "ASCII")) %>% # Convert to ASCII; unconvertible words become NA
na.omit() %>% # Remove missing
filter(nchar(word) > 1, # Remove small words
!(word %in% c("https", "t.co", search_term)) # search_term defined above
)
tidy_tokens %>%
count(word, sort = TRUE) %>% # Count words
head(20) %>% # Keep only top words
ggplot(aes(y = reorder(word,n), x = n)) + geom_col() + ylab("Words")

Perfect!
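To see why the ASCII conversion followed by na.omit() removes words with strange characters, here is a minimal sketch in base R. The vector words is made up for illustration. It relies on the default behaviour of iconv(), which returns NA for an element containing characters that have no ASCII equivalent.

```r
# Hypothetical tokens: two contain accented (non-ASCII) characters
words <- c("market", "caf\u00e9", "na\u00efve", "price")

# With the default sub = NA, unconvertible elements become NA
ascii <- iconv(words, from = "UTF-8", to = "ASCII")

# Dropping the NAs keeps only the plain ASCII words
kept <- ascii[!is.na(ascii)]
kept
```

Note that this approach discards the whole word rather than just the offending character, which is acceptable for English tweets but would lose a lot of vocabulary in languages that rely on accents.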
n-grams
See https://www.tidytextmining.com/ngrams.html
tweets %>%
mutate(id = 1:nrow(tweets)) %>% # This creates a tweet id
select(id, text, created_at) %>% # Keeps id, text and date of the tweet
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
group_by(bigram) %>%
count(sort = TRUE) %>%
head(20) %>%
ggplot(aes(y = reorder(bigram, n), x = n)) + geom_col()

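Again, we face the same issue with stop words, so we must remove them. This is more complicated now because each bigram contains two words. The separate() function helps: it splits each bigram into two columns, which we can filter one by one.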
tweets %>%
mutate(id = 1:nrow(tweets)) %>% # This creates a tweet id
select(id, text, created_at) %>% # Keeps id, text and date of the tweet
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
mutate(bigram = iconv(bigram, from = "UTF-8", to = "ASCII")) %>%
na.omit() %>%
separate(bigram, c("word1", "word2"), sep = " ", remove = FALSE) %>% # Keep the bigram column alongside word1 and word2
filter(!(word1 %in% c(stop_words$word, "https", search_term)),
!(word2 %in% c(stop_words$word, "https", search_term)),
bigram != search_term) %>%
group_by(bigram) %>%
count(sort = TRUE) %>%
head(20) %>%
ggplot(aes(y = reorder(bigram, n), x = n)) + geom_col() + ylab("Bi-gram")
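The separate-then-filter logic can be checked on a toy example. This is only a sketch: the two texts and the short stop list toy_stop are hypothetical and stand in for the tweets and for stop_words$word.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical corpus and stop list, for illustration only
toy <- data.frame(id = 1:2,
                  text = c("the market is falling fast",
                           "stock market crash of the century"),
                  stringsAsFactors = FALSE)
toy_stop <- c("the", "is", "of")

toy_bigrams <- toy %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%            # Consecutive word pairs
  separate(bigram, c("word1", "word2"), sep = " ", remove = FALSE) %>% # Split each pair in two columns
  filter(!(word1 %in% toy_stop),                                       # Drop the pair if its first word...
         !(word2 %in% toy_stop))                                       # ...or its second word is a stop word

toy_bigrams$bigram
```

A bigram is kept only if neither of its words is in the stop list, so "the market" and "crash of" disappear while "stock market" survives.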

Sentiment
This section is inspired by: https://www.tidytextmining.com/sentiment.html
During the process, you may be asked whether you really want to download data (lexicons).
Just say yes in the console (type the expected answer, otherwise you will be stuck).
First, we need to load a sentiment lexicon. AFINN is one such sentiment database.
if(!require(textdata)){install.packages("textdata", repos = "https://cloud.r-project.org/")}
Loading required package: textdata
library(tidytext)
library(textdata)
afinn <- get_sentiments("afinn")
afinn
To create a nice visualization, we need to extract the time of the tweets.
tokens_time <- tweets %>%
mutate(id = 1:nrow(tweets)) %>% # This creates a tweet id
select(id, text, created_at) %>% # Keeps id, text and date of the tweet
unnest_tokens(word, text) # Creates tokens!
tokens_time
We then use inner_join() to merge the two sets. This function keeps only the rows for which a match occurs, so words that are absent from the lexicon are dropped.
library(lubridate)
Attaching package: ‘lubridate’
The following objects are masked from ‘package:base’:
date, intersect, setdiff, union
sentiment <- tokens_time %>%
inner_join(afinn) %>%
mutate(day = day(created_at),
hour = hour(created_at) / 24,
minute = minute(created_at) / 60 / 24,
time = day + hour + minute)
Joining, by = "word"
sentiment
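The time variable built above encodes each tweet as a fractional day of the month. The sketch below, on a single made-up time stamp, checks that arithmetic and shows lubridate's floor_date() as one possible alternative for grouping at the minute level. It is not the approach used in this notebook.

```r
library(lubridate)

# Hypothetical time stamp, for illustration only
created_at <- ymd_hms("2020-10-16 07:45:30", tz = "UTC")

# Fractional-day encoding, as in the code above
time_frac <- day(created_at) +
  hour(created_at) / 24 +
  minute(created_at) / 60 / 24   # 16 + 7/24 + 45/1440, about 16.32

# Alternative: truncate the time stamp to the minute and keep it as a date-time
minute_stamp <- floor_date(created_at, unit = "minute")
minute_stamp
```

The fractional encoding drops the month and the year, which is why they must be supplied by hand when the date-time is rebuilt for plotting. A truncated time stamp keeps them.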
We then compute the average sentiment, minute-by-minute.
Of course, average sentiment can be misleading. If a text contains the phrase “I’m not happy”, only “happy” is tagged, which conveys the opposite of the intended meaning.
sentiment %>%
group_by(time, day, hour, minute) %>%
summarise(avg_sentiment = mean(value)) %>%
mutate(time = make_datetime(year = 2020, month = 10, day = day, hour = hour*24, min = minute*24*60)) %>%
ggplot(aes(x = time, y = avg_sentiment)) + geom_col() + geom_smooth()
`summarise()` regrouping output by 'time', 'day', 'hour' (override with `.groups` argument)

There are 24 bars per day, but the y-axis is not optimal…
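The caveat about negation mentioned earlier can be made concrete with a toy example. This is a sketch under stated assumptions: toy_lexicon is a hypothetical two-word lexicon with made-up scores (not the AFINN data) and toy_tweets contains two invented texts.

```r
library(dplyr)
library(tidytext)

# Hypothetical lexicon and texts, for illustration only
toy_lexicon <- data.frame(word = c("happy", "sad"),
                          value = c(3, -2),
                          stringsAsFactors = FALSE)
toy_tweets <- data.frame(id = 1:2,
                         text = c("I'm not happy", "I'm happy"),
                         stringsAsFactors = FALSE)

toy_scores <- toy_tweets %>%
  unnest_tokens(word, text) %>%             # One row per word
  inner_join(toy_lexicon, by = "word") %>%  # Keep only words found in the lexicon
  group_by(id) %>%
  summarise(score = sum(value))             # Word-by-word score per text

toy_scores
```

Both texts receive the same score because "not" is absent from the lexicon and word-level matching ignores context.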
What about emotions? The NRC lexicon assigns words to emotion categories (e.g., anger, fear, joy) as well as to the broader positive and negative classes. The code below assumes that the lexicon has been loaded into an object named nrc (for instance with get_sentiments("nrc")) and that its emotions have been ordered, the most important distinction being the dichotomy between positive and negative emotions.
We then create the merged dataset.
emotions <- tokens_time %>%
inner_join(nrc) %>% # Merge data with sentiment
mutate(day = day(created_at),
hour = hour(created_at)/24,
minute = minute(created_at)/24/60,
time = day + hour + minute) # Create the time column (fractional day of the month)
Joining, by = "word"
emotions # Show the result
The merging has reduced the size of the dataset, but there still remains enough to pursue the study.
Finally, we build the pivot table that counts emotions at each point in time, and we plot it for one day (the 16th).
g <- emotions %>%
group_by(time, sentiment, day, hour, minute) %>%
summarise(intensity = n()) %>%
mutate(time = make_datetime(year = 2020, month = 10, day = day, hour = hour*24, min = minute*24*60)) %>%
filter(day == 16) %>%
ggplot(aes(x = time, y = intensity, fill = sentiment)) + geom_col() +
theme(axis.text.x = element_text(angle = 80,
size = 10,
hjust = 1)) + xlab("Time") +
scale_fill_viridis(option = "magma", discrete = T, direction = -1)
`summarise()` regrouping output by 'time', 'sentiment', 'day', 'hour' (override with `.groups` argument)
ggplotly(g)
This can also be shown in percentage format.
emotions %>%
mutate(sentiment = if_else(sentiment < "negative", "positive", "negative")) %>% # Relies on the ordering of emotions: those ranked below "negative" count as positive
group_by(time, sentiment, day, hour, minute) %>%
summarise(intensity = n()) %>%
mutate(time = make_datetime(year = 2020, month = 10, day = day, hour = hour*24, min = minute*24*60)) %>%
ggplot(aes(x = time, y = intensity, fill = sentiment)) + geom_col(position = "fill") +
theme(axis.text.x = element_text(angle = 80,
size = 10,
hjust = 1)) + xlab("Time") +
scale_fill_manual(values = c("#001144", "#FFDD99"))
Advanced sentiment
The problem with the preceding methods is that they ignore valence shifters, i.e., negators, amplifiers (intensifiers), de-amplifiers (downtoners) and adversative conjunctions. If a tweet says “not happy”, counting the word “happy” as positive is not a good idea! The sentimentr package is built to handle these issues: have a look at https://github.com/trinker/sentimentr
(see also: https://www.sentometrics.org and the book Supervised Machine Learning for Text Analysis in R hosted at https://smltar.com)
if(!require(sentimentr)){install.packages(c("sentimentr", "textcat"))}
library(sentimentr)
library(textcat)
First, let’s keep only the tweets written in English!
tweets_en <- tweets %>%
mutate(language = textcat(text)) %>% # Guess the language of each tweet
filter(language == "english") %>%
dplyr::select(created_at, text)
NOTE: the code above serves to illustrate the textcat() function. The language is already coded in the tweets via the lang column, so it suffices to keep the rows for which lang == "en".
Next, we compute advanced sentiment.
tweet_sent <- tweets_en$text %>%
get_sentences() %>% # Intermediate function: splits each tweet into sentences
sentiment() # Sentiment!
tweet_sent
These sentence-level scores can be linked back to the tweets: the element_id column of tweet_sent is the position of the tweet in tweets_en.
tweets_en %>%
rowid_to_column("element_id") # This creates a new column with the row number
tweets_en %>%
rowid_to_column("element_id") %>%
left_join(tweet_sent, by = "element_id") # Attach the sentiment scores to each tweet
We can then average the sentiment by day.
tweets_en %>%
rowid_to_column("element_id") %>%
left_join(tweet_sent, by = "element_id") %>%
group_by(day = day(created_at)) %>%
summarise(avg_sent = mean(sentiment)) %>%
ggplot(aes(x = as.factor(day), y = avg_sent)) + geom_col()
The full distribution of sentiment for each day can also be displayed.
tweets_en %>%
rowid_to_column("element_id") %>%
left_join(tweet_sent, by = "element_id") %>%
ggplot(aes(x = as.factor(day(created_at)), y = sentiment)) +
geom_jitter(size = 0.2) +
geom_boxplot(aes(color = as.factor(day(created_at))), alpha = 0.5) +
theme(legend.position = "none") + xlab("day")
NOTE: depending on how often the search term is tweeted, it may be better to analyze sentiment at the daily or hourly scale. If a term is very popular, higher frequencies become relevant.

