---
title: "Third party data and basic text mining"
output:
  html_notebook:
    toc: true
    toc_float: true
---

# The general idea

Data transfer is highly controlled. The key notions are **authentication** and **protocol**.

# Downloading tweets with *rtweet*

There are several packages that interface with Twitter: *rtweet*, *RTwitterAPI*, *streamR* and *twitteR*.  
Recent packages are preferable: firms regularly update their API policies (and access rules), so older protocols sometimes stop working!

## First things first
**First**, the packages. Download...

```{r, warning = FALSE, message = FALSE}
if(!require(rtweet)){install.packages("rtweet")}
```

... and activate.

```{r, warning = FALSE, message = FALSE}
library(tidyverse)
library(plotly)
library(rtweet)
```

## Authentication

**Second**: you need Twitter credentials (and hence a Twitter account).  
You also need a **developer account**: https://developer.twitter.com/en/apply-for-access  
Log in to Twitter and go to: https://developer.twitter.com 

![](twitter1.png)

The next step is crucial: we need to retrieve identification credentials.   
In order to do that, you need to create a Twitter app. Below, you can see mine. 
To create one, simply click on the "Create an app" button (on the right).

![](twitter2.png)

If you click on the "details" of an app, you can see this:

![](twitter3.png)

The second tab is called "**Keys and tokens**" $\rightarrow$ that's where the info is!!!

![](twitter4.png)


Now we are ready to proceed. The lines below open the connection with the API.

```{r, warning = FALSE, message = FALSE}
consumer_key <- "your_consumer_key"
consumer_secret <- "your_consumer_secret"
access_token <- "your_access_token"
access_secret <- "your_access_secret"

create_token(app = "the_name_of_your_app",
             consumer_key = consumer_key, 
             consumer_secret = consumer_secret, 
             access_token = access_token, 
             access_secret = access_secret
)
```




Authentication is an important part of the process. For more info on that:  
- https://cran.r-project.org/web/packages/googlesheets/vignettes/managing-auth-tokens.html   
- https://httr.r-lib.org/reference/index.html (section Authentication)
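
Hard-coding keys in a notebook is risky if the file is shared. As a minimal sketch, the credentials could instead be read from environment variables (defined, for instance, in `~/.Renviron`). The four variable names below are hypothetical: pick your own.

```{r}
# Hypothetical variable names: choose your own and define them in ~/.Renviron
cred_names <- c("TWITTER_CONSUMER_KEY", "TWITTER_CONSUMER_SECRET",
                "TWITTER_ACCESS_TOKEN", "TWITTER_ACCESS_SECRET")

creds <- Sys.getenv(cred_names)          # Named vector; "" when a variable is not set
not_set <- names(creds)[creds == ""]     # Which credentials are missing?

if (length(not_set) > 0) {
  message("Missing credentials: ", paste(not_set, collapse = ", "))
} else {
  message("All four credentials found.")
}
```

The values can then be passed on, e.g. `consumer_key = creds[["TWITTER_CONSUMER_KEY"]]`, so that no secret appears in the notebook itself.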

## Extraction

If no error appears, we are ready to query. Depending on the number of requested tweets, this can take some time.

```{r, message = FALSE, warning = FALSE}
search_term <- "edge computing"
tweets <- search_tweets(
  search_term,          # What to search for
  n = 5000,             # Number of tweets to download
  include_rts = FALSE   # Exclude re-tweets
)
```
For large queries, the progress bar helps.   
Note that many options are available, such as excluding retweets or limiting the search to particular geographical zones (within a given radius).

# Text mining

## References
The reference book is: https://www.tidytextmining.com      
A great interactive tutorial: https://juliasilge.shinyapps.io/learntidytext/    
And the package is:

```{r, message = FALSE, warning = FALSE}
if(!require(tidytext)){install.packages("tidytext", repos = "https://cloud.r-project.org/")}
library(tidytext)
```
(see also: https://quanteda.io/index.html)

## Data retrieval

Now, let's move forward to simple text analysis. First, we need to prepare the data! (as usual)

```{r, warning = FALSE, message = FALSE}
tokens <- tweets %>% 
  mutate(id = 1:nrow(tweets)) %>%  # This creates a tweet id
  select(id, text) %>%             # Keeps only id and text of the tweet
  unnest_tokens(word, text)        # Creates tokens!
tokens
```
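
To see what the tokenization step does, here is a small sketch on made-up text (the two sentences below are invented and stand in for downloaded tweets). With its default settings, `unnest_tokens()` returns one row per word, converts to lower case and drops punctuation.

```{r}
library(dplyr)
library(tidytext)

# Toy data standing in for the downloaded tweets (made-up text)
toy_tweets <- tibble(text = c("Edge computing is great!",
                              "Is edge computing the future?"))

toy_tokens <- toy_tweets %>%
  mutate(id = row_number()) %>%   # Tweet id
  select(id, text) %>%            # Keep id and text
  unnest_tokens(word, text)       # One row per word, lower-cased, punctuation dropped

toy_tokens                        # 9 rows: 4 words from tweet 1, 5 from tweet 2
```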

Let's have a look at word frequencies.

```{r, warning = FALSE, message = FALSE}
tokens %>%
  count(word, sort = TRUE)
```

This is polluted by small words. Let's filter that (*FIRST METHOD*).

```{r, warning = FALSE, message = FALSE}
tokens %>% mutate(length = nchar(word))
```


## Data frequencies
Now let's omit the small words (smaller than 5 characters).   
**NOTE**: all the thresholds below depend on the sample! 

```{r, warning = FALSE, message = FALSE}
tokens %>%
  mutate(length = nchar(word)) %>%
  filter(length > 4) %>%             # Keep words with length larger than 4
  count(word, sort = TRUE) %>%       # Count words
  head(21) %>%                       # Keep only top 21 words
  ggplot(aes(y = reorder(word,n), x = n)) + geom_col() + ylab("Words")
```

A better way to proceed is to remove "stop words" like "a", "I", "of", "the", etc (*SECOND METHOD*).
Also, it would make sense to remove the search item and "https".

```{r, warning = FALSE, message = FALSE}
data("stop_words")
tidy_tokens <- tokens %>% 
  anti_join(stop_words)                    # Remove irrelevant terms
tidy_tokens %>%
  count(word, sort = TRUE) %>%             # Count words
  head(20) %>%                             # Keep only top 20 words
  ggplot(aes(y = reorder(word,n), x = n)) + geom_col() + ylab("Words")
```
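
The mechanics of `anti_join()` can be illustrated on a toy case: it keeps the rows of the first table that have **no** match in the second. The mini stop list below is hypothetical and much shorter than `stop_words`.

```{r}
library(dplyr)

toy_words <- tibble(word = c("the", "edge", "is", "the", "future"))
toy_stops <- tibble(word = c("the", "is", "of"))   # Hypothetical mini stop list

toy_kept <- toy_words %>%
  anti_join(toy_stops, by = "word")   # Drop every word found in the stop list

toy_kept                              # Only "edge" and "future" remain
```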

**Problem**: strange characters remain. We are going to remove them by converting the text to ASCII format and omit *NA* data. 

```{r, warning = FALSE, message = FALSE}
tidy_tokens <- tokens %>% 
  anti_join(stop_words) %>%                            # Remove irrelevant terms
  mutate(word = iconv(word, from = "UTF-8", to = "ASCII")) %>% # Convert to ASCII (failures become NA)
  na.omit() %>%                                        # Remove missing
  filter(nchar(word) > 1,                              # Remove one-character words
         !(word %in% c("https", "t.co",                # Remove URL fragments...
                       strsplit(search_term, " ")[[1]])) # ...and each word of search_term
  )
tidy_tokens %>%
  count(word, sort = TRUE) %>%         # Count words
  head(20) %>%                         # Keep only top words
  ggplot(aes(y = reorder(word,n), x = n)) + geom_col() + ylab("Words")
```

Perfect!
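
The ASCII trick used above can be checked on a few made-up words. This is only a sketch: with the default settings of `iconv()`, a word that cannot be converted is expected to come back as `NA` (the exact behaviour may depend on the platform's iconv implementation), and `na.omit()` then discards it.

```{r}
toy_vec <- c("cloud", "caf\u00e9", "5g")   # "caf\u00e9" contains a non-ASCII character
toy_ascii <- iconv(toy_vec, from = "UTF-8", to = "ASCII")

toy_ascii                      # The non-convertible entry is expected to be NA
toy_ascii[!is.na(toy_ascii)]   # What survives the removal of missing values
```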

## Word cloud

This data can also be shown with a word cloud. We simply use the *wordcloud* package: https://cran.r-project.org/web/packages/wordcloud/index.html 

The package *wordcloud2* adds a few features: https://cran.r-project.org/web/packages/wordcloud2/vignettes/wordcloud.html

```{r, warning = FALSE, message = FALSE}
if(!require(wordcloud)){install.packages("wordcloud")}
library(wordcloud)
cloud_data <- tidy_tokens %>% count(word)
wordcloud(words = cloud_data$word, 
          freq = cloud_data$n, min.freq = 2,
          max.words = 100, random.order = FALSE, rot.per = 0.15, 
          colors = brewer.pal(8, "Dark2"))
```

## n-grams

See https://www.tidytextmining.com/ngrams.html

```{r bigrams, message = F, warning = F}
tweets %>% 
  mutate(id = 1:nrow(tweets)) %>%    # This creates a tweet id
  select(id, text, created_at) %>%   # Keeps id, text and date of the tweet
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  group_by(bigram) %>%
  count(sort = T) %>%
  head(20) %>%
  ggplot(aes(y = reorder(bigram, n), x = n)) + geom_col()
```

Again: same issue with stop words! So we must remove them again. But it's more complicated now.
We can use the *separate*() function to help us.

```{r}
tweets %>% 
  mutate(id = 1:nrow(tweets)) %>%    # This creates a tweet id
  select(id, text, created_at) %>%   # Keeps id, text and date of the tweet
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  mutate(bigram = iconv(bigram, from = "UTF-8", to = "ASCII")) %>%
  na.omit() %>%
  separate(bigram, c("word1", "word2"), sep = " ", remove = F) %>%
  filter(!(word1 %in% c(stop_words$word, "https", search_term)),
         !(word2 %in% c(stop_words$word, "https", search_term)),
         bigram != search_term) %>%
  group_by(bigram) %>%
  count(sort = T) %>%
  head(20) %>%
  ggplot(aes(y = reorder(bigram, n), x = n)) + geom_col() + ylab("Bi-gram")
```
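
A toy version of the pipeline above shows the role of `separate()`: each bi-gram is split into its two words, and the bi-gram is kept only if neither word is a stop word. The sentence and the two-word stop list are made up for the illustration.

```{r}
library(dplyr)
library(tidyr)
library(tidytext)

toy_stop_list <- c("is", "the")   # Hypothetical mini stop list

toy_bigrams <- tibble(text = "edge computing is the future") %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%            # 4 bi-grams
  separate(bigram, c("word1", "word2"), sep = " ", remove = FALSE) %>% # Split, keep original
  filter(!(word1 %in% toy_stop_list),
         !(word2 %in% toy_stop_list))

toy_bigrams                       # Only "edge computing" survives
```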


## Sentiment

This section is inspired by: https://www.tidytextmining.com/sentiment.html    
Sometimes, you may be asked in the process if you *really* want to download data (lexicons).  
Just say yes in the **console** (type the correct answer: if not, you will be stuck).

First, we need to load some sentiment lexicon. AFINN is one such sentiment database. 

```{r}
if(!require(textdata)){install.packages("textdata", repos = "https://cloud.r-project.org/")}
library(tidytext)
library(textdata)
afinn <- get_sentiments("afinn")
afinn
```

To create a nice visualization, we need to extract the **time** of the tweets.

```{r}
tokens_time <- tweets %>% 
  mutate(id = 1:nrow(tweets)) %>%    # This creates a tweet id
  select(id, text, created_at) %>%   # Keeps id, text and date of the tweet
  unnest_tokens(word, text)          # Creates tokens!
tokens_time
```

We then use **inner_join**() to merge the two sets. This function removes the cases when a match does not occur.

```{r}
library(lubridate)
sentiment <- tokens_time %>% 
  inner_join(afinn) %>%
  mutate(day = day(created_at),
         hour = hour(created_at) / 24,
         minute = minute(created_at) / 60 / 24,
         time = day + hour + minute)
sentiment
```
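
The effect of `inner_join()` is easy to see on a toy case. The lexicon below is hypothetical (two words with made-up scores, not taken from AFINN). Words absent from the lexicon vanish, and both sentences end up with the same score even though the second one is negated.

```{r}
library(dplyr)
library(tidytext)

# Hypothetical two-word lexicon (values are made up, not taken from AFINN)
toy_lexicon <- tibble(word = c("happy", "sad"), value = c(3, -2))

toy_scores <- tibble(id = 1:2,
                     text = c("I am happy", "I am not happy")) %>%
  unnest_tokens(word, text) %>%
  inner_join(toy_lexicon, by = "word")   # "i", "am" and "not" have no match: dropped

toy_scores                               # Two rows, both for "happy"
```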

We then compute the average sentiment, minute-by-minute.   
Of course, average sentiment can be misleading. Indeed, if a text contains the terms "*I'm not happy*", then only "*happy*" will be tagged, which is the opposite of the intended meaning.

```{r}
sentiment %>%
  group_by(time, day, hour, minute) %>%
  summarise(avg_sentiment = mean(value)) %>%
  mutate(time = make_datetime(year = 2020, month = 10, day = day, hour = hour*24, min = minute*24*60)) %>%
  ggplot(aes(x = time, y = avg_sentiment)) + geom_col() + geom_smooth()
```
There are 24 bars per day, but the *y*-axis is not optimal...  
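
As a possible alternative to rebuilding the time stamp from day, hour and minute fractions, *lubridate* offers `floor_date()`, which rounds a date-time down to a chosen unit. The sketch below uses made-up time stamps; in the pipeline above, the same call could be applied to `created_at` before grouping.

```{r}
library(lubridate)

# Made-up time stamps
toy_stamps <- ymd_hms(c("2020-10-16 13:05:42",
                        "2020-10-16 13:59:10",
                        "2020-10-16 14:00:01"))

toy_hourly <- floor_date(toy_stamps, unit = "hour")   # Round down to the hour
toy_hourly                                            # 13:00, 13:00 and 14:00
```

Grouping by such a column gives one bar per hour (or per minute with `unit = "minute"`) and keeps the year and month of the original data.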

What about emotions? The **NRC** lexicon categorizes emotions. Below, we order the emotions from the most positive to the most negative. The main purpose of this ordering is to separate positive from negative emotions. 

```{r, message = FALSE, warning = FALSE}
nrc <- get_sentiments("nrc")
nrc <- nrc %>%
  mutate(sentiment = as.factor(sentiment),
         sentiment = recode_factor(sentiment,
                                   joy = "joy",
                                   trust = "trust",
                                   surprise = "surprise",
                                   anticipation = "anticipation",
                                   positive = "positive",
                                   negative = "negative",
                                   sadness = "sadness",
                                   anger = "anger",
                                   fear = "fear",
                                   disgust = "disgust",
                                   .ordered = T))
```

We then create the merged dataset.

```{r}
emotions <- tokens_time %>% 
  inner_join(nrc) %>%                  # Merge data with sentiment
  mutate(day = day(created_at),
         hour = hour(created_at)/24,
         minute = minute(created_at)/24/60,
         time = day + hour + minute)   # Create day column
emotions                               # Show the result
```

The merging has reduced the size of the dataset, but there still remains enough to pursue the study.   
Finally, we move to the pivot table that counts emotions at each point in time.

```{r}
g <- emotions %>% 
  group_by(time, sentiment, day, hour, minute) %>%
  summarise(intensity = n()) %>%
  mutate(time = make_datetime(year = 2020, month = 10, day = day, hour = hour*24, min = minute*24*60)) %>%
  ggplot(aes(x = time, y = intensity, fill = sentiment)) + geom_col() + 
  theme(axis.text.x = element_text(angle = 80, 
                                   size = 10,
                                   hjust = 1)) + xlab("Time") +
  scale_fill_viridis_d(option = "magma", direction = -1)  # Discrete viridis scale built into ggplot2
ggplotly(g)
```

This can also be shown in percentage format. 

```{r}
g <- emotions %>% 
  group_by(time, sentiment, day, hour, minute) %>%
  summarise(intensity = n()) %>%
  mutate(time = make_datetime(year = 2020, month = 10, day = day, hour = hour*24, min = minute*24*60)) %>%
  ggplot(aes(x = time, y = intensity, fill = sentiment)) + geom_col(position = "fill") +
  theme(axis.text.x = element_text(angle = 80, 
                                   size = 10,
                                   hjust = 1)) + xlab("Time") +
  scale_fill_viridis_d(option = "magma", direction = -1)  # Discrete viridis scale built into ggplot2
ggplotly(g)
```

```{r}
emotions %>% 
  mutate(sentiment = if_else(sentiment < "negative", "positive", "negative")) %>% 
  group_by(time, sentiment, day, hour, minute) %>%
  summarise(intensity = n()) %>%
  mutate(time = make_datetime(year = 2020, month = 10, day = day, hour = hour*24, min = minute*24*60)) %>%
  ggplot(aes(x = time, y = intensity, fill = sentiment)) + geom_col(position = "fill") +
  theme(axis.text.x = element_text(angle = 80, 
                                   size = 10,
                                   hjust = 1)) + xlab("Time") +
  scale_fill_manual(values = c("#001144", "#FFDD99"))
```
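
The condition `sentiment < "negative"` in the chunk above works because the NRC categories were stored as an **ordered** factor: every level listed before "negative" counts as smaller. A minimal base R sketch, with the levels typed in the same order as in the recoding step:

```{r}
toy_levels <- c("joy", "trust", "surprise", "anticipation", "positive",
                "negative", "sadness", "anger", "fear", "disgust")
toy_emotions <- factor(c("joy", "positive", "negative", "fear"),
                       levels = toy_levels, ordered = TRUE)

toy_is_positive <- toy_emotions < "negative"          # Compares positions in toy_levels
toy_is_positive                                       # TRUE TRUE FALSE FALSE
ifelse(toy_is_positive, "positive", "negative")       # Two-category version
```

With an unordered factor, the same comparison would return `NA` with a warning, hence the `.ordered = T` argument used when recoding the lexicon.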




## Advanced sentiment 

The problem with the preceding methods is that they don't take into account **valence shifters** (i.e., negators, amplifiers (intensifiers), de-amplifiers (downtoners), and adversative conjunctions). If a tweet says *not happy*, counting the word *happy* is not a good idea! The package *sentimentr* is built to circumvent these issues: have a look at https://github.com/trinker/sentimentr  
(see also: https://www.sentometrics.org and the book **Supervised Machine Learning for Text Analysis in R** hosted at https://smltar.com)

```{r, warning = FALSE, message = FALSE}
if(!require(sentimentr)){install.packages(c("sentimentr", "textcat"))}
library(sentimentr)
library(textcat)
```

First, let's keep only the tweets written in English!

```{r}
tweets_en <- tweets %>%
  mutate(language = textcat(text)) %>%
  filter(language == "english") %>%
  dplyr::select(created_at, text)
```

**NOTE**: the code above serves to illustrate the function *textcat*. The language is already coded in the tweets via the **lang** column/variable, so it suffices to keep the instances for which lang == "en".

Next, we compute advanced sentiment. 

```{r}
tweet_sent <- tweets_en$text %>%
  get_sentences() %>%  # Intermediate function
  sentiment()          # Sentiment!
tweet_sent
```
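
To check that valence shifters are taken into account, here is a small sketch on two made-up sentences. It uses `sentiment_by()` from *sentimentr*, which aggregates the sentence scores into one row per text element. The negated sentence is expected to receive a negative score; exact values are not shown because they may depend on the package version and its lexicons.

```{r}
library(sentimentr)

toy_text <- c("I am happy.", "I am not happy.")       # Made-up sentences
toy_sent <- sentiment_by(get_sentences(toy_text))     # One row per element of toy_text
toy_sent                                              # See the ave_sentiment column
```

Since `sentiment()` returns one row per *sentence*, a tweet with several sentences appears several times after a join on `element_id`; `sentiment_by()` is one way to get a single score per tweet.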

**NOTE**: the appropriate time scale depends on how often the term is tweeted: daily or hourly aggregation is usually safer. If a term is very popular, there are enough tweets for higher frequencies to be relevant. 

```{r}
tweets_en %>%
  rowid_to_column("element_id") # This creates a new column with row number

tweets_en %>%
  rowid_to_column("element_id") %>%
  left_join(tweet_sent, by = "element_id")

tweets_en %>%
  rowid_to_column("element_id") %>%
  left_join(tweet_sent, by = "element_id") %>%
  group_by(day = day(created_at)) %>%
  summarise(avg_sent = mean(sentiment)) %>%
  ggplot(aes(x = as.factor(day), y = avg_sent)) + geom_col() + xlab("day")

tweets_en %>%
  rowid_to_column("element_id") %>%
  left_join(tweet_sent, by = "element_id") %>%
  ggplot(aes(x = as.factor(day(created_at)), y = sentiment)) + 
  geom_jitter(size = 0.2) +
  geom_boxplot(aes(color = as.factor(day(created_at))), alpha = 0.5) +
  theme(legend.position = "none") + xlab("day")
```



# Resources

Below, a short list of resources (to access third-party data):   

- **text mining with R** (online book): https://www.tidytextmining.com      
- **Bloomberg**: https://cran.r-project.org/web/packages/Rblpapi/index.html   
- **gmail**: https://cran.r-project.org/web/packages/gmailr/vignettes/gmailr.html   
- **Google Maps**: https://cran.rstudio.com/web/packages/mapsapi/vignettes/intro.html  
- **Google trends**: https://github.com/PMassicotte/gtrendsR
- **Google APIs** (more generally): https://cran.r-project.org/web/packages/gargle/vignettes/auth-from-web.html
- **Facebook API**: https://developers.facebook.com/ads/blog/post/v2/2018/05/15/facebook-reach-frequency-api/  

Possibly deprecated:  
- **Facebook**: https://cran.r-project.org/web/packages/Rfacebook/index.html    
- **Instagram**: https://cran.r-project.org/web/packages/instaR/index.html

