R & Flourish Tutorial: Textual Analysis Visualization

Flourish is a powerful online tool for creating embeddable, interactive visualizations, and it’s been essential for our textual analysis of the Carletonian. It allows you to upload a dataset and instantly create a graph, but the catch is that your data has to be formatted in a very specific way. In this tutorial, I’ll show you how to use R to clean your text data for basic textual analysis and how to build a sentiment analysis graph in Flourish.

Step 1: Set up R Studio

If you already have R installed, you can skip this step. For those who have never used R Studio, however, the easiest way to access it is through your web browser. 

Go to Maize and log in with your Carleton username and password.

It’s good practice to create a new project every time you work on something new. Click on Project in the upper right corner and select New Project. Pick a destination for your project, and then Create Project.

In the lower right quadrant, select New File > R Script. Give it a descriptive name and select OK.

Now you’re all set up! You should see your R file open in the upper left quadrant of your screen. This is where we will be working.

Step 2: Upload your text file

The first step in any R project is usually to load the libraries that you will be working with (if you haven’t installed them yet, run install.packages() for each one first). I won’t go into the details here; simply copy and paste the code below into your file:

library(tidyverse)
library(tidytext)
library(stringr)

Next, we need to upload our actual .txt file. For the sake of this tutorial, we will use King_James_Bible.txt, which I downloaded from Project Gutenberg and from which I removed the header and footer notes.

To import this dataset into R, upload the file into your working directory by clicking Upload in the lower right quadrant, then paste the following code into your file:

bible_raw <- read_file("King_James_Bible.txt")
bible_df <- tibble(text = bible_raw)
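
If you want to confirm the file loaded correctly, a quick optional check (assuming the upload above succeeded) is to look at the size and the opening characters of the raw text:

```r
library(stringr)

# how many characters did we read in?
nchar(bible_raw)

# peek at the first 200 characters of the text
str_sub(bible_raw, 1, 200)
```

If nchar() returns 0 or str_sub() shows something unexpected, double-check the file name and that it sits in your working directory.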

Step 3: Clean & tokenize your data

Now that we’ve uploaded our data, we need to clean the text by removing numbers, punctuation, and extra whitespace (the tokenizer in the next step will handle lowercasing for us):

bible_clean <- bible_df |>
  mutate(
    text = str_remove_all(text, "[0-9]+"),
    text = str_remove_all(text, "[[:punct:]]"),
    text = str_squish(text)
  )

In order to arrange the text in an analyzable format, we also need to tokenize it, which means turning the text into a list of individual words that we can count and analyze.

bible_tokens <- bible_clean |>
  unnest_tokens(word, text)
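
To see what tokenization actually does, here is a small self-contained sketch using a toy sentence of my own (not part of the tutorial dataset). Note that unnest_tokens() also lowercases each word by default:

```r
library(tidyverse)
library(tidytext)

# a toy one-row tibble, just to demonstrate tokenization
demo <- tibble(text = "The LORD is my shepherd I shall not want")

demo_tokens <- demo |>
  unnest_tokens(word, text)

demo_tokens
# one lowercase word per row: "the", "lord", "is", "my", ...
```

Our bible_clean tibble goes through exactly the same transformation, just with far more rows coming out the other side.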

Finally, we will need to remove stopwords, which are common filler words that don’t contribute much to sentiment or meaning.

data("stop_words") 
bible_tokens_clean <- bible_tokens |>
  anti_join(stop_words, by = "word") |>
  filter(str_detect(word, "[a-z]"))

Your bible_tokens_clean dataset should now be a list of lowercase words with the stopwords removed and ready for analysis.
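
Before moving on, it can be worth a quick sanity check. This optional snippet (continuing from the objects above) shows the most frequent tokens after stopword removal:

```r
# the ten most common remaining words
bible_tokens_clean |>
  count(word, sort = TRUE) |>
  head(10)
```

If you still see words like “the” or “and” near the top, the anti_join step didn’t run correctly.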

Step 4: Format data for sentiment analysis

At this point, there are lots of different types of analyses and visualizations you could create with your tokenized dataset. For this tutorial specifically, we will create a bar chart of the 20 most common positive and negative words in the King James Bible.

First, we will need sentiment labels for each word in our dataset. The tidytext R library has a built-in Bing sentiment lexicon that labels each word as “positive” or “negative.” The code below attaches these labels to our dataset and drops any words that are not in the lexicon (you may get a many-to-many warning, but this is okay and you can ignore it).

bible_sent <- bible_tokens_clean |>
  inner_join(get_sentiments("bing"), by = "word")
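
The warning appears because a handful of words in the Bing lexicon carry both a positive and a negative label. If you are on dplyr 1.1.0 or later and would rather silence it explicitly, you can pass the relationship argument to the join:

```r
bible_sent <- bible_tokens_clean |>
  inner_join(
    get_sentiments("bing"),
    by = "word",
    relationship = "many-to-many"  # acknowledges duplicate lexicon entries
  )
```

Either version produces the same table; the second just tells dplyr the duplicates are expected.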

Next, we need to count how often each labeled word appears, then pull the top 20 words for positive and for negative. The following code produces one table with columns word, n (count), and sentiment, containing the 20 most frequent positive words and the 20 most frequent negative words.

top_words <- bible_sent |>
  count(word, sentiment, sort = TRUE) |> 
  group_by(sentiment) |>
  slice_head(n = 20) |>
  ungroup()
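
If you would like to preview the result in R before uploading anything to Flourish, here is an optional ggplot2 sketch that mirrors the faceted layout we will build in Step 6 (one panel per sentiment, with independent y-axes):

```r
top_words |>
  mutate(word = reorder(word, n)) |>
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ sentiment, scales = "free_y") +
  labs(x = "Count", y = NULL)
```

This is purely a sanity check; the Flourish version will be the interactive, embeddable one.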

Step 5: Export new dataset for Flourish

Finally, we need to save our new dataset as a CSV file so that we can import it into Flourish.

write_csv(top_words, "top_words.csv")

To export the file from the web-browser version of R Studio, click the checkbox in the lower right quadrant next to the file name, and select More > Export.

Step 6: Import to Flourish and enjoy your new graph 😀

Congratulations, you have made it to the final section of the tutorial! Log in to Flourish and select New Visualization. You will see an assortment of options to choose from. For now, we’ll choose a regular ol’ Bar Chart.

Flourish will give you a default dataset when you select your visualization type, but we don’t care about that. Let’s upload our own dataset! Select Data at the top of the page, then Upload Data File. Import the csv we just created.

Next, we need to tell Flourish what we want to see in our visualization. In the right side panel, set Labels to the word column (A), Values to the n column (C), and Charts Grid to our sentiment column (B).

Now, switch back to the Preview panel. On the right side, switch the grid mode to Grid of Charts.

Finally, in the Grid of Charts section on the right side of the screen, turn off the Y axis same across grid setting.

That’s it! You can now publish, export, or embed your interactive bar chart. You can see my results and my full R code below.

View my full R Code:

# ================= load libraries =========================

library(tidyverse)
library(tidytext)
library(stringr)

# ================= upload & clean data =========================

bible_raw <- read_file("King_James_Bible.txt")

bible_df <- tibble(text = bible_raw)

bible_clean <- bible_df |>
  mutate(
    text = str_remove_all(text, "[0-9]+"),
    text = str_remove_all(text, "[[:punct:]]"),
    text = str_squish(text)
  )

bible_tokens <- bible_clean |>
  unnest_tokens(word, text)

data("stop_words") 
bible_tokens_clean <- bible_tokens |>
  anti_join(stop_words, by = "word") |>
  filter(str_detect(word, "[a-z]"))

# ================= sentiment analysis =========================

bible_sent <- bible_tokens_clean |>
  inner_join(get_sentiments("bing"), by = "word")

top_words <- bible_sent |>
  count(word, sentiment, sort = TRUE) |> 
  group_by(sentiment) |>
  slice_head(n = 20) |>
  ungroup()

write_csv(top_words, "top_words.csv")

