class: center, middle, inverse, title-slide

# rtweet-workshop
## Collecting and analyzing Twitter data
### Michael W. Kearney
School of Journalism
Informatics Institute
University of Missouri

### @kearneymw
### @mkearney
---
background-image: url(img/logo.png)
background-size: 350px auto
background-position: 50% 20%
class: center, bottom

View these slides at [rtweet-workshop.mikewk.com](https://rtweet-workshop.mikewk.com)

View companion script at [rtweet-workshop.mikewk.com/script.Rmd](https://rtweet-workshop.mikewk.com/script.Rmd)

View the Github repo at [github.com/mkearney/rtweet-workshop](https://github.com/mkearney/rtweet-workshop)

---
class: tight

## About {rtweet}

- On the Comprehensive R Archive Network (CRAN) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)[![CRAN status](https://www.r-pkg.org/badges/version/rtweet)](https://cran.r-project.org/package=rtweet)
- Growing base of users ![Downloads](https://cranlogs.r-pkg.org/badges/grand-total/rtweet)[![Depsy](http://depsy.org/api/package/cran/rtweet/badge.svg)](http://depsy.org/package/r/rtweet)
- Fairly stable [![Build Status](https://travis-ci.org/mkearney/rtweet.svg?branch=master)](https://travis-ci.org/mkearney/rtweet)[![Coverage Status](https://codecov.io/gh/mkearney/rtweet/branch/master/graph/badge.svg)](https://codecov.io/gh/mkearney/rtweet?branch=master)[![lifecycle](https://img.shields.io/badge/lifecycle-maturing-blue.svg)](https://www.tidyverse.org/lifecycle/#maturing)
- Package website: [rtweet.info](http://rtweet.info) [![Generic badge](https://img.shields.io/badge/website-up-green.svg)](http://rtweet.info/)
- Github repo: [mkearney/rtweet](https://github.com/mkearney/rtweet) [![Generic badge](https://img.shields.io/badge/stars-352-blue.svg)](https://github.com/mkearney/rtweet/)[![Generic badge](https://img.shields.io/badge/forks-81-blue.svg)](https://github.com/mkearney/rtweet/)

---

## Roadmap

1. Search tweets
1. Timelines
1. Favorites
1. Lookup tweets
1. Friends/followers
1. Lookup users
1. Lists
1. Stream

---

## Install

- Install **{rtweet}** from [CRAN](https://cran.r-project.org/package=rtweet).

```r
install.packages("rtweet")
```

- Or install the **development version** from [Github](https://github.com/mkearney/rtweet).

```r
devtools::install_github("mkearney/rtweet")
```

- Load **{rtweet}**

```r
library(rtweet)
```

---

## httpuv

To authorize rtweet's embedded **rstats2twitter** app via web browser, the **{httpuv}** package is required.

```r
## install httpuv for browser-based authentication
install.packages("httpuv")
```

---
class: inverse, center, middle

# Accessing web APIs

---

## Some background

**Application Programming Interfaces** (APIs) are sets of protocols that govern interactions between sites and users.

+ APIs are similar to web browsers but serve a different purpose:
  - Web browsers render **visual content**
  - Web APIs manage and organize **data**
+ For public APIs, many sites only allow **authorized** users
  - Twitter, Facebook, Instagram, Github, etc.

---

## developer.twitter.com

To create your own token (with write and DM-read access), you must...

1. Apply and get approved for a developer account with Twitter
1. Create a Twitter app (fill out a form)

- For step-by-step instructions on how to create a Twitter app and corresponding token (sketched on the next slide), see [rtweet.info/articles/auth.html](https://rtweet.info/articles/auth.html)
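---

## `create_token()`

Once your app is approved, you can create and save a token with `create_token()`. A minimal sketch (the key values below are placeholders; copy the real ones from your app's "Keys and tokens" page):

```r
## create and save a token (replace placeholders with your app's keys)
token <- create_token(
  app = "my_rtweet_app",
  consumer_key = "XXXXXXXXXX",
  consumer_secret = "XXXXXXXXXX",
  access_token = "XXXXXXXXXX",
  access_secret = "XXXXXXXXXX"
)
```

By default, `create_token()` saves the token as an environment variable, so future R sessions on the same machine load it automatically.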
---
class: inverse, center, middle

# Twitter Data!

---
class: inverse, center, middle

# 1.
<br />
Searching for tweets

---

## `search_tweets()`

Search for one or more keyword(s)

```r
rds <- search_tweets("rstats data science")
rds
```

<br>
> *Note*: implicit `AND` between words

---

## `search_tweets()`

Search for an exact phrase

```r
## single quotes around doubles
ds <- search_tweets('"data science"')

## or escape the quotes
ds <- search_tweets("\"data science\"")
ds
```

---

## `search_tweets()`

Search for keyword(s) **and** a phrase

```r
rpds <- search_tweets("rstats python \"data science\"")
rpds
```

---

## `search_tweets()`

+ `search_tweets()` returns the 100 most recent matching tweets by default
+ Increase `n` to return more (tip: use multiples of 100)

```r
rstats <- search_tweets("rstats", n = 10000)
rstats
```

<br>
> Rate limit of 18,000 tweets per fifteen minutes

---

## `search_tweets()`

**PRO TIP #1**: Get the firehose for free by searching for tweets posted by verified **or** non-verified accounts

```r
fff <- search_tweets(
  "filter:verified OR -filter:verified", n = 18000)
fff
```

Visualize second-by-second frequency

```r
ts_plot(fff, "secs")
```

---

<p style="text-align:center"> <img src="img/fff.png" /> </p>

---

## `search_tweets()`

**PRO TIP #2**: Use search operators provided by Twitter, e.g.,

+ filter by language and exclude retweets and replies

```r
rt <- search_tweets("rstats", lang = "en",
  include_rts = FALSE, `-filter` = "replies")
```

+ filter only tweets linking to news articles

```r
nws <- search_tweets("filter:news")
```

---

## `search_tweets()`

+ filter only tweets that contain links

```r
links <- search_tweets("filter:links")
links
```

+ filter only tweets that contain video

```r
vids <- search_tweets("filter:video")
vids
```

---

## `search_tweets()`

+ filter only tweets sent from (`from:{screen_name}`) or to (`to:{screen_name}`) certain users

```r
## vector of screen names
users <- c("cnnbrk", "AP", "nytimes", "foxnews",
  "msnbc", "seanhannity", "maddow")
paste0("from:", users, collapse = " OR ")
#> "from:cnnbrk OR from:AP OR from:nytimes OR from:foxnews OR
#>  from:msnbc OR from:seanhannity OR from:maddow"

## search for tweets sent from those users
from_users <- search_tweets(paste0("from:", users, collapse = " OR "))
from_users
```

---

## `search_tweets()`

+ filter only tweets with at least 100 favorites or 100 retweets

```r
pop <- search_tweets(
  "(filter:verified OR -filter:verified) (min_faves:100 OR min_retweets:100)")
```

+ filter by the type of device that posted the tweet

```r
rt <- search_tweets("lang:en", source = '"Twitter for iPhone"')
```

---

## `search_tweets()`

**PRO TIP #3**: Search by geolocation (ex: tweets within 25 miles of Columbia, MO)

```r
como <- search_tweets(
  geocode = "38.9517,-92.3341,25mi", n = 10000
)
como
```

---

## `search_tweets()`

Use `lat_lng()` to convert geographical data into `lat` and `lng` variables.

```r
como <- lat_lng(como)
par(mar = c(0, 0, 0, 0))
maps::map("state", fill = TRUE, col = "#ffffff",
  lwd = .25, mar = c(0, 0, 0, 0),
  xlim = c(-96, -89), ylim = c(35, 41))
with(como, points(lng, lat, pch = 20, col = "red"))
```

<br>
> This code plots geotagged tweets on a map of Missouri

---

<p style="text-align:center"> <img src="img/como.png" /> </p>

---

## `search_tweets()`

**PRO TIP #4**: (for developer accounts only) Use `bearer_token()` to increase the rate limit to 45,000 tweets per fifteen minutes.

```r
mosen <- search_tweets(
  "mccaskill OR hawley", n = 45000,
  token = bearer_token()
)
```
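---

## `search_tweets()`

Returned data include dozens of variables, some stored as list columns. A quick sketch for tallying hashtags in the `rstats` search results from earlier (the `hashtags` column holds a character vector per tweet):

```r
## flatten the hashtags list column into a single character vector
ht <- unlist(rstats$hashtags)

## ten most common hashtags (ignoring case)
head(sort(table(tolower(ht)), decreasing = TRUE), 10)
```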
---
class: inverse, center, middle

# 2.
<br />
User timelines

---

## `get_timeline()`

Get the most recent tweets posted by a user.

```r
cnn <- get_timeline("cnn")
```

---

## `get_timeline()`

Get up to the most recent 3,200 tweets (endpoint max) posted by multiple users.

```r
nws <- get_timeline(c("cnn", "foxnews", "msnbc"), n = 3200)
```

---

## `ts_plot()`

Group by `screen_name` and plot hourly frequencies of tweets.

```r
nws %>%
  dplyr::group_by(screen_name) %>%
  ts_plot("hours")
```

---
class: inverse, center, middle

# 3.
<br />
User favorites

---

## `get_favorites()`

Get up to the most recent 3,000 tweets favorited by a user.

```r
kmw_favs <- get_favorites("kearneymw", n = 3000)
```

---
class: inverse, center, middle

# 4.
<br />
Lookup statuses

---

## `lookup_tweets()`

Look up tweets associated with a vector of status IDs.

```r
## lookup tweets by status ID
status_ids <- c("947235015343202304", "947592785519173637",
  "948359545767841792", "832945737625387008")
twt <- lookup_tweets(status_ids)
```

---
class: inverse, center, middle

# 5.
<br />
Getting friends/followers

---

## Friends/followers

Twitter's API documentation distinguishes between **friends** and **followers**.

+ **Friend** refers to an account a given user follows
+ **Follower** refers to an account following a given user

---

## `get_friends()`

Get user IDs of accounts **followed by** (AKA friends of) [@jack](https://twitter.com/jack), the co-founder and CEO of Twitter.

```r
fds <- get_friends("jack")
fds
```

---

## `get_friends()`

Get friends of **multiple** users in a single call.

```r
fds <- get_friends(
  c("hadleywickham", "NateSilver538", "Nate_Cohn")
)
fds
```

---

## `get_followers()`

Get user IDs of accounts **following** (AKA followers of) [@mizzou](https://twitter.com/mizzou).

```r
mu <- get_followers("mizzou")
mu
```

---

## `get_followers()`

Unlike friends (which Twitter limits to 5,000 for most accounts), there is **no limit** on the number of followers.

To get user IDs of all 55(ish) million followers of @realDonaldTrump, you need two things:

1. A stable **internet** connection
1. **Time**: approximately five and a half days

---

## `get_followers()`

Get all of Donald Trump's followers.

```r
## get all of trump's followers
rdt <- get_followers(
  "realdonaldtrump",
  n = 56000000,
  retryonratelimit = TRUE
)
```

---
class: inverse, center, middle

# 6.
<br />
Lookup users

---

## `lookup_users()`

Look up user-level data (and each user's most recent tweet) associated with a vector of `user_id` or `screen_name` values.

```r
## vector of users
users <- c("hadleywickham", "NateSilver538", "Nate_Cohn")

## lookup users twitter data
usr <- lookup_users(users)
usr
```

---

## `search_users()`

It's also possible to search for users. Twitter will look for matches in user names, screen names, and profile bios.

```r
## search for breaking news accounts
bkn <- search_users("breaking news")
bkn
```

---
class: inverse, center, middle

# 7.
<br />
Lists

---

## `lists_memberships()`

+ Get an account's list memberships (lists that include the account)

```r
## lists that include Nate Silver
nsl <- lists_memberships("NateSilver538")
nsl
```

---

## `lists_members()`

+ Get all list members (accounts on a list)

```r
## all members of congress
cng <- lists_members(owner_user = "cspan",
  slug = "members-of-congress")
cng
```
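---

## `lists_statuses()`

Lists also work as a tweet source. A minimal sketch using `lists_statuses()` to get recent tweets posted by accounts on the same C-SPAN list:

```r
## most recent tweets posted by members of congress (via cspan's list)
cng_tweets <- lists_statuses(
  slug = "members-of-congress",
  owner_user = "cspan",
  n = 200
)
cng_tweets
```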
---
class: inverse, center, middle

# 8.
<br />
Streaming tweets

---

## `stream_tweets()`

**Sampling**: small random sample (`~ 1%`) of all publicly available tweets

```r
ss <- stream_tweets("")
```

**Filtering**: search-like query (up to 400 keywords)

```r
sf <- stream_tweets("mueller,fbi,investigation,trump,realdonaldtrump")
```

---

## `stream_tweets()`

**Tracking**: vector of user IDs (up to 5,000 user IDs)

```r
## user IDs from congress members (lists_members ex output)
st <- stream_tweets(cng$user_id)
```

**Location**: geographical coordinates (1-360 degree location boxes)

```r
## world-wide bounding box
sl <- stream_tweets(c(-180, -90, 180, 90))
```

---

## `stream_tweets()`

The default stream duration is thirty seconds (`timeout = 30`)

+ Specify a different duration in seconds

```r
## stream for 10 minutes
stm <- stream_tweets(timeout = 60 * 10)
```

---

## `stream_tweets()`

Stream JSON data directly to a text file

```r
stream_tweets(timeout = 60 * 10,
  file_name = "random-stream-2018-11-13.json",
  parse = FALSE)
```

Read in a streamed JSON file

```r
rj <- parse_stream("random-stream-2018-11-13.json")
```

---

## `stream_tweets()`

Stream tweets indefinitely.

```r
stream_tweets(timeout = Inf,
  file_name = "random-stream-2018-11-13.json",
  parse = FALSE)
```

---

## `lookup_coords()`

A useful convenience function for quickly looking up coordinates (though it now requires a Google Maps API key)

```r
## stream tweets sent from london
luk1 <- stream_tweets(q = lookup_coords("London, UK"), timeout = 60)

## search tweets sent from london
luk2 <- search_tweets(
  geocode = lookup_coords("London, UK"), n = 1000)
```

---
class: inverse, center, middle

# Analyzing Twitter data

---

## Data set

For these examples, let's gather a data set of iPhone and Android users

```r
iphone_android <- search_tweets(
  paste0('(filter:verified OR -filter:verified) ',
    '(source:"Twitter for iPhone" OR source:"Twitter for Android")'),
  include_rts = FALSE,
  n = 18000
)

## view breakdown of tweet source (device)
table(iphone_android$source)
```

---

## Text processing

Tokenize tweets into words

```r
## tokenize each tweet into a vector of words
wds <- tokenizers::tokenize_tweets(iphone_android$text)

## collapse back into strings
txt <- purrr::map_chr(wds, paste, collapse = " ")

## get sentiment of the processed text using afinn dictionary
iphone_android$sent <- syuzhet::get_sentiment(
  txt, method = "afinn"
)
```

---

## Compare groups

Group by source and summarise some numeric variables

```r
library(dplyr)

iphone_android %>%
  group_by(source) %>%
  summarise(sent = mean(sent, na.rm = TRUE),
    avg_rt = mean(retweet_count, na.rm = TRUE),
    avg_fav = mean(favorite_count, na.rm = TRUE),
    tweets = mean(statuses_count, na.rm = TRUE),
    friends = mean(friends_count, na.rm = TRUE),
    followers = mean(followers_count, na.rm = TRUE),
    ff_rat = (friends + 1) / (friends + followers + 1)
  )
```

---

## Features

Easily automate feature extraction for Twitter data.

```r
## install package
remotes::install_github("mkearney/textfeatures")

## feature extraction
tf <- textfeatures::textfeatures(iphone_android)

## add dependent variable (1 = iPhone, 0 = Android)
tf$y <- as.integer(iphone_android$source == "Twitter for iPhone")
```

---

## Machine learning

Run a boosted model

```r
## load gbm and estimate model
library(gbm)
m1 <- gbm(y ~ ., data = tf[1:15000, -1], n.trees = 200)
## summary(m1)

## generate predictions
p <- predict(m1, newdata = tf[15001:nrow(tf), -1],
  type = "response", n.trees = 200)

## how'd we do?
table(p > .50, tf$y[15001:nrow(tf)])
```
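---

## Machine learning

To reduce that confusion table to a single number, a quick sketch of holdout accuracy (using the same `p` and `tf` objects from the previous slide):

```r
## proportion of correctly classified tweets in the holdout set
mean((p > .50) == (tf$y[15001:nrow(tf)] == 1))
```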
---

## Tweetbotornot

A package designed to estimate the probability of an account being a bot.

```r
## install from Github
remotes::install_github("mkearney/tweetbotornot")

## estimate probabilities for some accounts
bp <- tweetbotornot::tweetbotornot(
  c("kearneymw", "realdonaldtrump", "netflix_bot",
    "tidyversetweets", "thebotlebowski")
)
bp
```
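---

## Tweetbotornot

A quick sketch for ranking the estimates (this assumes the returned data frame stores the bot probability in a column named `prob_bot`):

```r
## order accounts from most to least bot-like
bp[order(bp$prob_bot, decreasing = TRUE), ]
```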