class: center, middle, inverse, title-slide

# rtweet-workshop
## Collecting and analyzing Twitter data
### Michael W. Kearney
School of Journalism
Informatics Institute
University of Missouri

### @kearneymw
### @mkearney
---
background-image: url(img/logo.png)
background-size: 350px auto
background-position: 50% 20%
class: center, bottom

View these slides at [rtweet-workshop.mikewk.com](https://rtweet-workshop.mikewk.com)

View companion script at [rtweet-workshop.mikewk.com/script.Rmd](https://rtweet-workshop.mikewk.com/script.Rmd)

View the Github repo at [github.com/mkearney/rtweet-workshop](https://github.com/mkearney/rtweet-workshop)

---
class: tight

## About {rtweet}

- On the Comprehensive R Archive Network (CRAN) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)[![CRAN status](https://www.r-pkg.org/badges/version/rtweet)](https://cran.r-project.org/package=rtweet)
- Growing base of users ![Downloads](https://cranlogs.r-pkg.org/badges/grand-total/rtweet)[![Depsy](http://depsy.org/api/package/cran/rtweet/badge.svg)](http://depsy.org/package/r/rtweet)
- Fairly stable [![Build Status](https://travis-ci.org/mkearney/rtweet.svg?branch=master)](https://travis-ci.org/mkearney/rtweet)[![Coverage Status](https://codecov.io/gh/mkearney/rtweet/branch/master/graph/badge.svg)](https://codecov.io/gh/mkearney/rtweet?branch=master)[![lifecycle](https://img.shields.io/badge/lifecycle-maturing-blue.svg)](https://www.tidyverse.org/lifecycle/#maturing)
- Package website: [rtweet.info](http://rtweet.info) [![Generic badge](https://img.shields.io/badge/website-up-green.svg)](http://rtweet.info/)
- Github repo: [mkearney/rtweet](https://github.com/mkearney/rtweet) [![Generic badge](https://img.shields.io/badge/stars-352-blue.svg)](https://github.com/mkearney/rtweet/)[![Generic badge](https://img.shields.io/badge/forks-81-blue.svg)](https://github.com/mkearney/rtweet/)

---

## Roadmap

1. Search tweets
1. Timelines
1. Favorites
1. Lookup tweets
1. Friends/followers
1. Lookup users
1. Lists
1. Stream

---

## Install

- Install **{rtweet}** from [CRAN](https://cran.r-project.org/package=rtweet).

```r
install.packages("rtweet")
```

- Or install the **development version** from [Github](https://github.com/mkearney/rtweet).

```r
devtools::install_github("mkearney/rtweet")
```

- Load **{rtweet}**

```r
library(rtweet)
```

---

## httpuv

To authorize rtweet's embedded **rstats2twitter** app via web browser, the **{httpuv}** package is required.

```r
## install httpuv for browser-based authentication
install.packages("httpuv")
```

---
class: inverse, center, middle

# Accessing web APIs

---

## Some background

**Application Programming Interfaces** (APIs) are sets of protocols that govern interactions between sites and users.

+ APIs are similar to web browsers but serve a different purpose:
  - Web browsers render **visual content**
  - Web APIs manage and organize **data**
+ For public APIs, many sites only allow **authorized** users
  - Twitter, Facebook, Instagram, Github, etc.

---

## developer.twitter.com

To create your own token (with write and DM-read access), you must...

1. Apply and get approved for a developer account with Twitter
1. Create a Twitter app (fill out a form)

- For step-by-step instructions on how to create a Twitter app and corresponding token (sketched on the next slide), see [rtweet.info/articles/auth.html](https://rtweet.info/articles/auth.html)
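---

## `create_token()`

Once your app is approved, you can create and save a token with `create_token()`. A minimal sketch (the key values below are placeholders; copy the real ones from your app's "Keys and tokens" page):

```r
## create and save a token (replace placeholders with your app's keys)
token <- create_token(
  app = "my_rtweet_app",
  consumer_key = "XXXXXXXXXX",
  consumer_secret = "XXXXXXXXXX",
  access_token = "XXXXXXXXXX",
  access_secret = "XXXXXXXXXX"
)
```

By default, `create_token()` saves the token as an environment variable, so future R sessions on the same machine load it automatically.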
---
class: inverse, center, middle

# Twitter Data!

---
class: inverse, center, middle

# 1.
<br />
Searching for tweets

---

## `search_tweets()`

Search for one or more keyword(s)

```r
rds <- search_tweets("rstats data science")
rds
```

<br>
> *Note*: implicit `AND` between words

---

## `search_tweets()`

Search for an exact phrase

```r
## single quotes around doubles
ds <- search_tweets('"data science"')

## or escape the quotes
ds <- search_tweets("\"data science\"")
ds
```

---

## `search_tweets()`

Search for keyword(s) **and** a phrase

```r
rpds <- search_tweets("rstats python \"data science\"")
rpds
```

---

## `search_tweets()`

+ `search_tweets()` returns the 100 most recent matching tweets by default
+ Increase `n` to return more (tip: use multiples of 100)

```r
rstats <- search_tweets("rstats", n = 10000)
rstats
```

<br>
> Rate limit of 18,000 tweets per fifteen minutes

---

## `search_tweets()`

**PRO TIP #1**: Get the firehose for free by searching for tweets posted by verified **or** non-verified accounts

```r
fff <- search_tweets(
  "filter:verified OR -filter:verified", n = 18000)
fff
```

Visualize second-by-second frequency

```r
ts_plot(fff, "secs")
```

---

<p style="text-align:center"> <img src="img/fff.png" /> </p>

---

## `search_tweets()`

**PRO TIP #2**: Use search operators provided by Twitter, e.g.,

+ filter by language and exclude retweets and replies

```r
rt <- search_tweets("rstats", lang = "en",
  include_rts = FALSE, `-filter` = "replies")
```

+ filter only tweets linking to news articles

```r
nws <- search_tweets("filter:news")
```

---

## `search_tweets()`

+ filter only tweets that contain links

```r
links <- search_tweets("filter:links")
links
```

+ filter only tweets that contain video

```r
vids <- search_tweets("filter:video")
vids
```

---

## `search_tweets()`

+ filter only tweets sent from (`from:{screen_name}`) or to (`to:{screen_name}`) certain users

```r
## vector of screen names
users <- c("cnnbrk", "AP", "nytimes", "foxnews",
  "msnbc", "seanhannity", "maddow")
paste0("from:", users, collapse = " OR ")
#> "from:cnnbrk OR from:AP OR from:nytimes OR from:foxnews OR
#>  from:msnbc OR from:seanhannity OR from:maddow"

## search for tweets sent from those users
from_users <- search_tweets(paste0("from:", users, collapse = " OR "))
from_users
```

---

## `search_tweets()`

+ filter only tweets with at least 100 favorites or 100 retweets

```r
pop <- search_tweets(
  "(filter:verified OR -filter:verified) (min_faves:100 OR min_retweets:100)")
```

+ filter by the type of device that posted the tweet

```r
rt <- search_tweets("lang:en", source = '"Twitter for iPhone"')
```

---

## `search_tweets()`

**PRO TIP #3**: Search by geolocation (ex: tweets within 25 miles of Columbia, MO)

```r
como <- search_tweets(
  geocode = "38.9517,-92.3341,25mi", n = 10000
)
como
```

---

## `search_tweets()`

Use `lat_lng()` to convert geographical data into `lat` and `lng` variables.

```r
como <- lat_lng(como)
par(mar = c(0, 0, 0, 0))
maps::map("state", fill = TRUE, col = "#ffffff",
  lwd = .25, mar = c(0, 0, 0, 0),
  xlim = c(-96, -89), ylim = c(35, 41))
with(como, points(lng, lat, pch = 20, col = "red"))
```

<br>
> This code plots geotagged tweets on a map of Missouri

---

<p style="text-align:center"> <img src="img/como.png" /> </p>

---

## `search_tweets()`

**PRO TIP #4**: (for developer accounts only) Use `bearer_token()` to increase the rate limit to 45,000 tweets per fifteen minutes.

```r
mosen <- search_tweets(
  "mccaskill OR hawley", n = 45000,
  token = bearer_token()
)
```
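---

## `search_tweets()`

Returned data include dozens of variables, some stored as list columns. A quick sketch for tallying hashtags in the `rstats` search results from earlier (the `hashtags` column holds a character vector per tweet):

```r
## flatten the hashtags list column into a single character vector
ht <- unlist(rstats$hashtags)

## ten most common hashtags (ignoring case)
head(sort(table(tolower(ht)), decreasing = TRUE), 10)
```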
---
class: inverse, center, middle

# 2.
<br />
User timelines

---

## `get_timeline()`

Get the most recent tweets posted by a user.

```r
cnn <- get_timeline("cnn")
```

---

## `get_timeline()`

Get up to the most recent 3,200 tweets (endpoint max) posted by multiple users.

```r
nws <- get_timeline(c("cnn", "foxnews", "msnbc"), n = 3200)
```

---

## `ts_plot()`

Group by `screen_name` and plot hourly frequencies of tweets.

```r
nws %>%
  dplyr::group_by(screen_name) %>%
  ts_plot("hours")
```

---
class: inverse, center, middle

# 3.
<br />
User favorites

---

## `get_favorites()`

Get up to the most recent 3,000 tweets favorited by a user.

```r
kmw_favs <- get_favorites("kearneymw", n = 3000)
```

---
class: inverse, center, middle

# 4.
<br />
Lookup statuses

---

## `lookup_tweets()`

Look up tweets associated with a vector of status IDs.

```r
## lookup tweets by status ID
status_ids <- c("947235015343202304", "947592785519173637",
  "948359545767841792", "832945737625387008")
twt <- lookup_tweets(status_ids)
```

---
class: inverse, center, middle

# 5.
<br />
Getting friends/followers

---

## Friends/followers

Twitter's API documentation distinguishes between **friends** and **followers**.

+ **Friend** refers to an account a given user follows
+ **Follower** refers to an account following a given user

---

## `get_friends()`

Get user IDs of accounts **followed by** (AKA friends of) [@jack](https://twitter.com/jack), the co-founder and CEO of Twitter.

```r
fds <- get_friends("jack")
fds
```

---

## `get_friends()`

Get friends of **multiple** users in a single call.

```r
fds <- get_friends(
  c("hadleywickham", "NateSilver538", "Nate_Cohn")
)
fds
```

---

## `get_followers()`

Get user IDs of accounts **following** (AKA followers of) [@mizzou](https://twitter.com/mizzou).

```r
mu <- get_followers("mizzou")
mu
```

---

## `get_followers()`

Unlike friends (which Twitter limits to 5,000 for most accounts), there is **no limit** on the number of followers.

To get user IDs of all 55(ish) million followers of @realDonaldTrump, you need two things:

1. A stable **internet** connection
1. **Time**: approximately five and a half days

---

## `get_followers()`

Get all of Donald Trump's followers.

```r
## get all of trump's followers
rdt <- get_followers(
  "realdonaldtrump",
  n = 56000000,
  retryonratelimit = TRUE
)
```

---
class: inverse, center, middle

# 6.
<br />
Lookup users

---

## `lookup_users()`

Look up user-level data (and each user's most recent tweet) associated with a vector of `user_id` or `screen_name` values.

```r
## vector of users
users <- c("hadleywickham", "NateSilver538", "Nate_Cohn")

## lookup users twitter data
usr <- lookup_users(users)
usr
```

---

## `search_users()`

It's also possible to search for users. Twitter will look for matches in user names, screen names, and profile bios.

```r
## search for breaking news accounts
bkn <- search_users("breaking news")
bkn
```

---
class: inverse, center, middle

# 7.
<br />
Lists

---

## `lists_memberships()`

+ Get an account's list memberships (lists that include the account)

```r
## lists that include Nate Silver
nsl <- lists_memberships("NateSilver538")
nsl
```

---

## `lists_members()`

+ Get all list members (accounts on a list)

```r
## all members of congress
cng <- lists_members(owner_user = "cspan",
  slug = "members-of-congress")
cng
```
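---

## `lists_statuses()`

Lists also work as a tweet source. A minimal sketch using `lists_statuses()` to get recent tweets posted by accounts on the same C-SPAN list:

```r
## most recent tweets posted by members of congress (via cspan's list)
cng_tweets <- lists_statuses(
  slug = "members-of-congress",
  owner_user = "cspan",
  n = 200
)
cng_tweets
```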
---
class: inverse, center, middle

# 8.
<br />
Streaming tweets

---

## `stream_tweets()`

**Sampling**: small random sample (`~ 1%`) of all publicly available tweets

```r
ss <- stream_tweets("")
```

**Filtering**: search-like query (up to 400 keywords)

```r
sf <- stream_tweets("mueller,fbi,investigation,trump,realdonaldtrump")
```

---

## `stream_tweets()`

**Tracking**: vector of user IDs (up to 5,000 user IDs)

```r
## user IDs from congress members (lists_members ex output)
st <- stream_tweets(cng$user_id)
```

**Location**: geographical coordinates (1-360 degree location boxes)

```r
## world-wide bounding box
sl <- stream_tweets(c(-180, -90, 180, 90))
```

---

## `stream_tweets()`

The default stream duration is thirty seconds (`timeout = 30`)

+ Specify a different duration in seconds

```r
## stream for 10 minutes
stm <- stream_tweets(timeout = 60 * 10)
```

---

## `stream_tweets()`

Stream JSON data directly to a text file

```r
stream_tweets(timeout = 60 * 10,
  file_name = "random-stream-2018-11-13.json",
  parse = FALSE)
```

Read in a streamed JSON file

```r
rj <- parse_stream("random-stream-2018-11-13.json")
```

---

## `stream_tweets()`

Stream tweets indefinitely.

```r
stream_tweets(timeout = Inf,
  file_name = "random-stream-2018-11-13.json",
  parse = FALSE)
```

---

## `lookup_coords()`

A useful convenience function for quickly looking up coordinates (though it now requires a Google Maps API key)

```r
## stream tweets sent from london
luk1 <- stream_tweets(q = lookup_coords("London, UK"), timeout = 60)

## search tweets sent from london
luk2 <- search_tweets(
  geocode = lookup_coords("London, UK"), n = 1000)
```

---
class: inverse, center, middle

# Analyzing Twitter data

---

## Data set

For these examples, let's gather a data set of iPhone and Android users

```r
iphone_android <- search_tweets(
  paste0('(filter:verified OR -filter:verified) ',
    '(source:"Twitter for iPhone" OR source:"Twitter for Android")'),
  include_rts = FALSE,
  n = 18000
)

## view breakdown of tweet source (device)
table(iphone_android$source)
```

---

## Text processing

Tokenize tweets into words

```r
## tokenize each tweet into a vector of words
wds <- tokenizers::tokenize_tweets(iphone_android$text)

## collapse back into strings
txt <- purrr::map_chr(wds, paste, collapse = " ")

## get sentiment of the processed text using afinn dictionary
iphone_android$sent <- syuzhet::get_sentiment(
  txt, method = "afinn"
)
```

---

## Compare groups

Group by source and summarise some numeric variables

```r
library(dplyr)

iphone_android %>%
  group_by(source) %>%
  summarise(sent = mean(sent, na.rm = TRUE),
    avg_rt = mean(retweet_count, na.rm = TRUE),
    avg_fav = mean(favorite_count, na.rm = TRUE),
    tweets = mean(statuses_count, na.rm = TRUE),
    friends = mean(friends_count, na.rm = TRUE),
    followers = mean(followers_count, na.rm = TRUE),
    ff_rat = (friends + 1) / (friends + followers + 1)
  )
```

---

## Features

Easily automate feature extraction for Twitter data.

```r
## install package
remotes::install_github("mkearney/textfeatures")

## feature extraction
tf <- textfeatures::textfeatures(iphone_android)

## add dependent variable (1 = iPhone, 0 = Android)
tf$y <- as.integer(iphone_android$source == "Twitter for iPhone")
```

---

## Machine learning

Run a boosted model

```r
## load gbm and estimate model
library(gbm)
m1 <- gbm(y ~ ., data = tf[1:15000, -1], n.trees = 200)
## summary(m1)

## generate predictions
p <- predict(m1, newdata = tf[15001:nrow(tf), -1],
  type = "response", n.trees = 200)

## how'd we do?
table(p > .50, tf$y[15001:nrow(tf)])
```
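---

## Machine learning

To reduce that confusion table to a single number, a quick sketch of holdout accuracy (using the same `p` and `tf` objects from the previous slide):

```r
## proportion of correctly classified tweets in the holdout set
mean((p > .50) == (tf$y[15001:nrow(tf)] == 1))
```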
---

## Tweetbotornot

A package designed to estimate the probability of an account being a bot.

```r
## install from Github
remotes::install_github("mkearney/tweetbotornot")

## estimate probabilities for some accounts
bp <- tweetbotornot::tweetbotornot(
  c("kearneymw", "realdonaldtrump", "netflix_bot",
    "tidyversetweets", "thebotlebowski")
)
bp
```
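---

## Tweetbotornot

A quick sketch for ranking the estimates (this assumes the returned data frame stores the bot probability in a column named `prob_bot`):

```r
## order accounts from most to least bot-like
bp[order(bp$prob_bot, decreasing = TRUE), ]
```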