16 Nov

Tidytext Tutorials

Recently, we ran our workshop on tidytext. This is one of the most popular basic text-as-data packages available in R and is a great introductory tool for analyzing English text computationally.

One of the benefits of tidytext is that it is easy to use with other packages. Want to analyze Twitter content? Collect it with rtweet, and then analyze the tweets using tidytext. Interested in analyzing books or news articles instead? You can load existing corpora into R or import your own.

Another benefit is that it is a well-supported package. Not only is there an amazing textbook by Julia Silge and David Robinson (which you can find here), but many people have produced guides and tutorials for using tidytext with a variety of text data (books, song lyrics, blog posts, tweets, and news articles, to name a few).

Below is a list of tidytext tutorials available online, made by dozens of R programmers and teachers who are much more experienced than I am. The content analyzed varies widely, and I encourage you to check out the ones that match your interests. I’ve organized them into three categories: easy, moderate, and advanced.

Easy

These tutorials guide you through using tidytext. They explain each processing step individually, from unnesting tokens to using sentiment dictionaries.
Tidytext and Jane Austen (from Silge & Robinson’s textbook): [link]
Tidytext and Harry Potter: [link]
Tidytext and Manifesto corpus: [link] (Learn about the Manifesto Corpus here)
Tidytext and Wickham’s Twitter: [link] (also a great introduction for those who want to learn Mike Kearney’s excellent rtweet package)
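
To give a feel for the two steps these tutorials walk through (splitting text into one token per row, then looking each word up in a sentiment dictionary), here is a minimal sketch in plain Python rather than tidytext itself. The tiny lexicon, the sample reviews, and the function names are all made up for illustration. In R, you would normally use tidytext's unnest_tokens() together with a published lexicon from get_sentiments().

```python
import re
from collections import Counter

# Hypothetical toy lexicon for illustration only; not a published dictionary.
TOY_LEXICON = {
    "happy": "positive",
    "love": "positive",
    "good": "positive",
    "sad": "negative",
    "hate": "negative",
    "bad": "negative",
}


def one_token_per_row(documents):
    """Turn {doc_id: text} into a list of (doc_id, word) rows, one per token."""
    rows = []
    for doc_id, text in documents.items():
        # Lowercase the text and keep runs of letters and apostrophes as words.
        for word in re.findall(r"[a-z']+", text.lower()):
            rows.append((doc_id, word))
    return rows


def count_sentiment(rows, lexicon):
    """Count lexicon matches per document and label (like an inner join on word)."""
    counts = Counter()
    for doc_id, word in rows:
        if word in lexicon:
            counts[(doc_id, lexicon[word])] += 1
    return counts


# Made-up sample documents.
docs = {
    "review_1": "I love this book. It made me happy!",
    "review_2": "A bad plot and a sad ending; I hate it.",
}

tokens = one_token_per_row(docs)
sentiment = count_sentiment(tokens, TOY_LEXICON)

for (doc_id, label), n in sorted(sentiment.items()):
    print(doc_id, label, n)
```

Words that are not in the lexicon simply drop out of the count, which is why the choice of dictionary matters so much for the results.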

Moderate

These tutorials are more focused on interpreting tidytext results. Rather than explaining each step, most code is presented in chunks (often with many pipes). There is also more extensive analysis here of the content being processed.
Tidytext and Harry Potter: [link]
Tidytext and Lord of the Rings: [link]
Tidytext and Twitter: [link]
Tidytext and Wickham’s blog: [link] (also includes some statistics)
Tidytext and the Weinstein Effect (news): [link]
Tidytext and Hacker News headlines (news): [link]

Advanced

These tutorials combine tidytext with other packages, such as rtweet or httr. They show how tidytext can be used as part of a broader library repertoire. Some tutorials also require the use of regular expressions or other coding mechanics. 
Tidytext and tweets via TAGS (moderate/advanced): [link] (includes using httr)
Tidytext and Twitter (advanced): [link] (uses regular expressions, processes JSON files)
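
Two of the mechanics mentioned above, reading JSON records and cleaning text with regular expressions, can be previewed in a few lines. This is a rough Python sketch and is not taken from either tutorial. The two sample records, the "text" field name, and the function name are invented for illustration, and real Twitter exports may be structured differently.

```python
import json
import re

# Hypothetical records, one JSON object per line; the "text" field is an assumption.
RAW_LINES = [
    '{"id": 1, "text": "Loving #rstats today! https://example.com/post @example_user"}',
    '{"id": 2, "text": "Text analysis is fun #tidytext cc @someone"}',
]

URL_PATTERN = re.compile(r"https?://\S+")   # links
MENTION_PATTERN = re.compile(r"@\w+")       # @usernames
HASHTAG_PATTERN = re.compile(r"#(\w+)")     # captures the tag without the '#'


def clean_tweet(text):
    """Strip URLs and @mentions, then collapse leftover whitespace."""
    text = URL_PATTERN.sub("", text)
    text = MENTION_PATTERN.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


# Parse each line into a dictionary, then clean the text and pull out hashtags.
records = [json.loads(line) for line in RAW_LINES]
cleaned = [clean_tweet(record["text"]) for record in records]
hashtags = [HASHTAG_PATTERN.findall(record["text"]) for record in records]

for text, tags in zip(cleaned, hashtags):
    print(text, tags)
```

Once the text is cleaned like this, it is ready for the same tokenizing and dictionary steps covered in the easier tutorials.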

R developer Emil Hvitfeldt has also compiled a collection of available text data on his GitHub, which you can find here. This collection includes movie reviews (via text2vec), French corpora (via proustr), and fairy tales in multiple languages (via hcandersenr). It is a true treasure trove for aspiring text analysts.

Hopefully, you will find some of these tutorials useful. Please let me know if you find others! Happy coding!

Jo

15 Nov

Updated Computational Methods Workshop List

Date | Topic | Tools
Sept 14 | Setting up R, data wrangling basics | dplyr (R)
Sept 21 | Data collection – MediaCloud | MediaCloud archive and Python
Sept 28 | No meeting
Oct 5 | Data collection – scraping | rvest (R)
Oct 12 | No meeting
Oct 19 | Data collection – Working with Twitter Data | rtweet (R)
Oct 26 | Text Analysis – tidytext | tidytext (R)
Nov 2 | Text Analysis – Programs with GUIs (Graphical User Interfaces) | LIWC, Diction, and Leximancer
Nov 9 | CANCELED
Nov 16 | CANCELED
Nov 23 | THANKSGIVING
Nov 30 | Reporting Results – Data visualization | ggplot2 (R)
Dec 7 | rmarkdown | rmarkdown (R)

23 Sep

Week 2 Follow Up

Hello computational fans!

Thanks to those who attended the MediaCloud workshop! Zhongkai and I were thrilled with the attendance and eagerness to learn.

Zhongkai’s Python code and my PowerPoint are available in our Box folder: https://uwmadison.box.com/v/computational1819

For those who are not familiar with Python, I would recommend DataCamp’s newly developed course, Analyzing Social Media Data in Python. The instructor, Alex, is an alumna of our Computational Methods Research Group and the Social Media and Democracy research group. Though only the first chapter is free, subscriptions are relatively cheap (and DataCamp is a great resource for people hoping to teach themselves data science skills independently).

Next week, we will not have a Computational Methods workshop. Our department will be hosting Rod Hart, former dean of UT-Austin’s Moody College of Communication. A pioneer of computational methods in mass communication, Hart developed a dictionary-based text analysis program called DICTION, which measures various rhetorical and semantic features such as certainty, optimism, and commonality. We will be teaching DICTION at our Nov 2 workshop, so if you are interested in learning more, I encourage you to attend his talk.

The week after (Oct 5), we will be going over the R scraping package, rvest. For the workshop, we will be using the data we collected from MediaCloud, so please make sure you save any results you want to use.

Jo

15 Sep

Week 1 Workshop Follow Up

Hello everyone!

First, an important meeting update: all our subsequent meetings will be held in Vilas 5011.

Thanks to everyone who attended the Computational Methods Research Group today. Resources from our workshop are available through this link (materials for all of this year’s workshops will be posted here: https://uwmadison.box.com/v/computational1819).

To practice these skills, I recommend the fourth chapter of the textbook R for Data Science, which can be found here: http://r4ds.had.co.nz/workflow-basics.html. This was my first R textbook and is great for learning important basics in both data science methodology and R.

Next week, we will be going over MediaCloud, an open-source media “archive” (it includes news outlets from many countries). This archive can be accessed via a website or Python (we will teach both). The workshop will last between an hour and an hour and a half. I strongly encourage attending both this workshop and the October 5 rvest (web scraping) workshop, as we will be scraping (or collecting) MediaCloud content. It may also be a good opportunity for you to scrape new data you are interested in.

Looking forward to a great semester!

Jo

13 Sep

2018-2019 Computational Methods Workshops

Hello friends and colleagues!

The computational methods group will begin our workshops this Friday, on September 14th. We will meet weekly in Nafziger at 3:30 (workshops will occasionally be held later). This Friday, we will be going over the basics of R and data science, including some commands in dplyr.

No prior experience in R, statistics, or programming is necessary. Though each class can operate as a stand-alone workshop, the skills build upon one another.

Below is a schedule of our workshops:

Date | Topic | Tools
Sept 14 | Setting up R, data wrangling basics | dplyr (R)
Sept 21 | Data collection – MediaCloud | MediaCloud archive and Python
Sept 28 | No meeting
Oct 5 | Data collection – scraping | rvest (R)
Oct 12 | No meeting
Oct 19 | Data collection – Working with Twitter Data | rtweet (R)
Oct 26 | Text Analysis – tidytext | tidytext (R)
Nov 2 | Text Analysis – Programs with GUIs (Graphical User Interfaces) [may begin late] | LIWC, Diction, and Leximancer
Nov 9 | Reporting Results – LaTeX [may begin late] | LaTeX
Nov 16 | Reporting Results – rmarkdown [may begin late] | rmarkdown (R)
Nov 23 | THANKSGIVING
Nov 30 | Reporting Results – Data visualization | ggplot2 (R)
Dec 7 | Text Analysis II – quanteda | quanteda (R)

Prior to our first meeting, please make sure you have downloaded and installed R and RStudio.

You can install R from here: https://www.r-project.org/

You can install RStudio from here: https://www.rstudio.com/products/rstudio/download/#download

Best,
Jo
10 Mar

The Twitter Exploit

Research Paper: The Twitter Exploit

Authors: Josephine Lukito, Chris Wells, Yini Zhang, Larisa Doroshenko, Sang Jung Kim, Min-Hsin Su, Yiping Xia, Deen Freelon

Cover of "The Twitter Exploit" (Lukito et al., 2018)

Executive Summary: Researchers have begun to describe the behavior of social media accounts associated with Russia’s Internet Research Agency (IRA). The bulk of this work, including public statements by social media platforms Twitter and Facebook, has focused on the impact these accounts had within the social media sphere and estimated that American exposure to IRA content numbered in the hundreds of millions of impressions.

We describe a new dimension to the reach of IRA accounts, demonstrating that it extended beyond social media and into American journalistic media. Searching 33 major media outlets during and after the 2016 election, we found 32 outlets with at least one story that embedded a tweet from IRA accounts: a total of 116 articles. These findings suggest that projections of IRA reach based on social media metrics are likely underestimated.

Moreover, the deep penetration of IRA content into news media is indicative of the extent to which the Russian information operation affected American political discourse, and points to significant challenges facing journalism in the social media age.

In this study, we describe the nature of the IRA content that appeared in news media, and what it was doing there—what role it played in the construction of journalistic content. We also discuss how American journalism must respond in the face of intentional information manipulation in the political sphere. Here, we summarize our key findings.

To download a copy of our research paper, visit this link.

This piece was published in tandem with a piece at Columbia Journalism Review.

20 Jan

Computational Methods Spring Workshop Schedule

Computational Methods will meet every other week at 3:30 PM in the MCRC lab this semester (Spring 2018). This semester, we will be focusing on natural language processing and machine learning strategies and will have stand-alone workshops for a multitude of other programs. Scheduled workshops include:

2/2/2018 – tidytext (R package), taught by Josephine Lukito
2/16/2018 – SQL Workshop, taught by Research Data Services
3/16/2018 – LaTeX workshop, taught by Chuan Liu
4/13/2018 – LDA Topic Modeling using topicmodels (R package), taught by Josephine Lukito
4/27/2018 – Structural Topic Modeling using stm (R package), taught by Chuan Liu
5/4/2018 – Leximancer workshop, taught by Josephine Lukito

Hope to see you all at one (or more) of our workshops!

Best,
Josephine Lukito
Computational Methods Lead

04 Sep

Computational Methods Workshop Schedule

Computational Methods will meet every other week at 3:30 PM in the MCRC lab this semester (Fall 2017). This semester, we will be focusing on data collection and preliminary analysis. Scheduled workshops include:

9/15/2017 – MediaCloud (Desktop and Python Access), taught by Josephine Lukito and Zhongkai Sun
10/6/2017 – WordStat, taught by Josephine Lukito
10/20/2017 – rvest (R package), taught by Josephine Lukito
11/17/2017 – LIWC and Diction, taught by Josephine Lukito
12/1/2017 – OpenNLP (R package), taught by Josephine Lukito

Hope to see you all at our workshops!

Josephine Lukito
Computational Methods Lead

15 Aug

Hello world!

Welcome to the Computational Methods Research Group! We hope to use this blog to talk about recent research and work we have done. Please check back for more updates!

Josephine Lukito
Computational Methods Lead