16 Nov

Tidytext Tutorials

Recently, we ran our workshop on tidytext. This is one of the most popular basic text-as-data packages available in R and is a great introductory tool for analyzing English text computationally.

One of the benefits of tidytext is that it is easy to use with other packages. Want to analyze Twitter content? Collect it with rtweet, and then analyze the tweets using tidytext. Interested in analyzing books or news articles instead? You can load existing corpora into R, or import your own.

Another benefit is that it is a well-supported package. Not only is there an amazing textbook by Julia Silge and David Robinson (which you can find here), but many people have produced guides and tutorials for using tidytext with a variety of text data (books, song lyrics, blog posts, tweets, and news articles, to name a few).

Below is a list of tidytext tutorials available online, made by dozens of R programmers and teachers who are much more experienced than I am. The content they analyze varies widely, and I encourage you to check out the ones that match your interests. I’ve organized them into three categories: easy, moderate, and advanced.

Easy

These tutorials guide you through using tidytext. They explain each processing step individually, from unnesting tokens to using sentiment dictionaries.
Tidytext and Jane Austen (from Silge & Robinson’s textbook): [link]
Tidytext and Harry Potter: [link]
Tidytext and Manifesto corpus: [link] (Learn about the Manifesto Corpus here)
Tidytext and Wickham’s Twitter: [link] (also a great introduction for those who want to learn Mike Kearney’s excellent rtweet package)
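
To give a sense of the two steps these tutorials start with, here is a minimal sketch of unnesting tokens and then matching them against a sentiment dictionary. The sample sentences and the four-word `toy_lexicon` are made up for illustration (they are not taken from any of the tutorials above); in practice you would more likely use a real lexicon, for example via tidytext's `get_sentiments()`.

```r
library(dplyr)
library(tidytext)

# Toy data, invented for this example
docs <- tibble(
  id = 1:2,
  text = c("I love this happy book", "What a sad, terrible ending")
)

# Hypothetical mini sentiment dictionary (a stand-in for a real lexicon)
toy_lexicon <- tibble(
  word = c("love", "happy", "sad", "terrible"),
  sentiment = c("positive", "positive", "negative", "negative")
)

# Step 1: unnest tokens, giving one lowercase word per row
tokens <- docs %>%
  unnest_tokens(word, text)

# Step 2: keep only the words found in the dictionary, then count by document
sentiment_counts <- tokens %>%
  inner_join(toy_lexicon, by = "word") %>%
  count(id, sentiment)

print(sentiment_counts)
```

The one-word-per-row format is what makes the second step a plain `inner_join()`; most of the tutorials build on this same pattern.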

 

Moderate

These tutorials focus more on interpreting tidytext results. Rather than explaining each step, they present most code in chunks (often with many pipes). They also analyze the content being processed in more depth.
Tidytext and Harry Potter: [link]
Tidytext and Lord of the Rings: [link]
Tidytext and Twitter: [link]
Tidytext and Wickham’s blog: [link] (also includes some statistics)
Tidytext and the Weinstein Effect (news): [link]
Tidytext and Hacker News headlines (news): [link]

 

Advanced

These tutorials combine tidytext with other packages, such as rtweet or httr. They show how tidytext can be used as part of a broader toolkit of R packages. Some tutorials also require regular expressions or other coding techniques.
Tidytext and tweets via TAGS (moderate/advanced): [link] (includes using httr)
Tidytext and Twitter (advanced): [link] (uses regular expressions, processes JSON files)

 

R developer Emil Hvitfeldt has also compiled a collection of available text data on his GitHub, which you can find here. This collection includes movie reviews (via text2vec), French corpora (via proustr), and fairy tales in multiple languages (via hcandersenr). It is a true treasure trove for aspiring text analysts.

Hopefully, you will find some of these tutorials useful. Please let me know if you find others! Happy coding!

Jo