My first R package: openEndedAnalyzeR

Recently I created my first R package and deployed it to GitHub. I developed it as part of my work at Consumers Energy, where I frequently work with survey data to help improve customer experience. Below is the vignette I created to document the package's basic usage.

You can visit the site repo and get installation instructions here.

openEndedAnalyzeR enables fast, pre-packaged analysis of qualitative survey responses (“open-ended” responses, or “verbatims”). It is built on the framework of the tidytext package, combining tidytext's text-mining functionality with tidyverse data manipulation to lower the barrier to entry to text mining for beginning R users.

Usage

The package contains three types of functions:

  1. Tidying: These functions prepare the survey data for analysis. You should always start an analysis with tidy_verbatim(), turning the survey responses into a tidy bag-of-phrases.
  2. Pre-packaged analysis: As analysis methods that are useful in more than one context and can be generalized are identified, functions are added so that others can produce canned analysis outputs. At the launch of the package two were available: create_phraseclouds() and quantify_verbatim().
  3. Intermediate analysis prep: These functions apply additional processing to the tidy data before being utilized for analysis (pre-packaged, or customized by the user). Presently this is the response_sentiment() function.

Suggested workflow

One should always start with tidy_verbatim() to obtain a tidy bag-of-phrases. Bag-of-words is a text mining term meaning that a body of text is turned into a list of all of the words in the text. I use the term bag-of-phrases because tidy_verbatim() allows the user to specify the number of words in each phrase (also known as an n-gram).
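To illustrate what an n-gram is, here is a minimal base-R sketch. make_ngrams() is a hypothetical helper written for this example, not part of openEndedAnalyzeR, and its tokenization (lowercase, strip punctuation, split on whitespace) is an assumption that will not match tidytext's tokenizer in every case.

```r
# Hypothetical helper: split one response into overlapping n-word phrases.
make_ngrams <- function(text, n = 1) {
  # Lowercase, drop punctuation, then split on runs of whitespace
  cleaned <- tolower(gsub("[^[:alnum:][:space:]]", "", text))
  words <- strsplit(cleaned, "\\s+")[[1]]
  words <- words[nzchar(words)]
  # A response shorter than n words yields no phrases
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Uni-grams and tri-grams from the start of the first survey response
make_ngrams("Gap between stated goal to provide", n = 1)
make_ngrams("Gap between stated goal to provide", n = 3)
```

Each tri-gram overlaps the next by two words, which is why a larger n gives more context per phrase but fewer repeated phrases.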

Users with more advanced R skills, or who need to perform a highly customized analysis, may stop at this point and write their own code. Otherwise the next step is to pass the tidy_verbatim() output into one of the analysis functions.

This is not necessarily a linear process. openEndedAnalyzeR does create report-worthy visuals but is also helpful for exploratory analysis. You might start with uni-grams (single words), do analysis, and then desire to see more context for particular words. Then you would start over, creating a bag-of-phrases with tri-grams (three words), repeating the analysis.

Example analysis on Bloomington, IN Community Survey data

This vignette will analyze the bloom_survey data frame included with the package. This data frame contains responses to a survey that the city of Bloomington, IN conducted in 2017. It contains mostly quantitative questions with two qualitative questions at the end: What is the thing you like the least/most about Bloomington? The vignette will demonstrate the package’s functions on these two questions.

Here is a preview of the data, with unnecessary columns removed and columns re-named:

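library(dplyr)
library(openEndedAnalyzeR)  # provides the bloom_survey data frame

# Keep the id, two quantitative questions, and the coded and verbatim like least/most answers
bloom_survey2 <- bloom_survey %>%
  mutate(id = as.character(id)) %>%
  select(id, q3a, q3b, q19, q20, q19verb, q20verb) %>%
  rename(recommend_living = q3a, remain_five_years = q3b,
         like_least_code = q19, like_most_code = q20,
         like_least_verb = q19verb, like_most_verb = q20verb)

head(bloom_survey2)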
id | recommend_living | remain_five_years | like_least_code | like_most_code | like_least_verb | like_most_verb
1 | 1 | 1 | 3 | 6 | Gap between stated goal to provide transportation alternatives to cars and the reality of poorly maintained sidewalks intersections unsafe for pedestrians, and under-funding of bike/ped. infrastructure. | Compact urban form around campus & downtown.
2 | 1 | 1 | 2 | 2 | Lack of affordable housing/ homelessness. | Trails.
3 | 1 | 1 | 2 | 5 | Lack of good, affordable housing. | A liberal oasis in a far right state.
4 | 1 | 1 | NA | NA | NA | NA
5 | NA | 2 | 1 | 1 | Being bothered by panhandlers downtown. | Educational & cultural opportunities.
6 | 1 | 1 | 2 | 3 | Over-saturation of student housing downtown: the city feels like a dorm. | Sense of vitality & community.

Create tidy bag-of-phrases

I create unigram and trigram phrases here, and summarize the top six phrases of each by count. Stop words are not yet removed at this stage, so the top unigrams are all uninformative words. The analysis functions will remove the stop words (or allow the user to choose to leave them).

The function automatically removes blank survey responses (which would be NA). This data also contained the literal string “na”, which I had to filter out myself.

selected_cols <- c("like_least_verb", "like_most_verb")
tidy_verbatims1 <- tidy_verbatim(bloom_survey2, n = 1) %>%
  filter(phrase != "na")
head(tidy_verbatims1)
id column_nm phrase
1 like_least_verb gap
1 like_least_verb between
1 like_least_verb stated
1 like_least_verb goal
1 like_least_verb to
1 like_least_verb provide
tidy_verbatims1 %>% group_by(phrase) %>% summarise(phrase_count = n()) %>% arrange(-phrase_count) %>% head
phrase phrase_count
the 423
of 349
to 239
and 230
a 129
is 125
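
The top unigrams above are all stop words. As a rough sketch of what removing them looks like, the example below anti-joins a tidy bag-of-phrases against a stop-word list. The tiny toy_stop_words list is an assumption made for illustration (tidytext ships a much fuller stop_words data frame with a word column), and this is not necessarily how the package's analysis functions do it.

```r
library(dplyr)

# Toy bag-of-phrases: the first six unigrams of response 1, as shown above
tidy_toy <- data.frame(
  id = "1",
  column_nm = "like_least_verb",
  phrase = c("gap", "between", "stated", "goal", "to", "provide"),
  stringsAsFactors = FALSE
)

# Hypothetical, deliberately tiny stop-word list for this example only
toy_stop_words <- data.frame(
  word = c("the", "of", "to", "and", "a", "is", "between"),
  stringsAsFactors = FALSE
)

# Keep only the phrases that do NOT appear in the stop-word list
content_words <- tidy_toy %>%
  anti_join(toy_stop_words, by = c("phrase" = "word"))

content_words %>% count(phrase, sort = TRUE)
```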

tidy_verbatims3 <- tidy_verbatim(bloom_survey2, n = 3)
head(tidy_verbatims3)
id column_nm phrase
1 like_least_verb gap between stated
1 like_least_verb between stated goal
1 like_least_verb stated goal to
1 like_least_verb goal to provide
1 like_least_verb to provide transportation
1 like_least_verb provide transportation alternatives
tidy_verbatims3 %>% group_by(phrase) %>% summarise(phrase_count = n()) %>% arrange(-phrase_count) %>% head
phrase phrase_count
sense of community 19
the lack of 13
a lot of 12
i like the 9
things to do 9
quality of life 8

Looking for topics with phrase clouds

The two questions from this survey were coded by researchers and put into groups. How well can we identify broad topics using the package without manually reading the responses? The create_phraseclouds() function counts the frequency of phrases and produces a phrase cloud. It includes several parameters that control the size of the cloud (e.g., top x% of phrases, maximum number of phrases, minimum number of phrases). Reasonable defaults are used if nothing is changed.
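
To make the size controls concrete, here is a sketch of one way such a selection could work. top_phrases() and its top_pct and min_phrases arguments are hypothetical names invented for this example; only max_phrases appears in the package calls in this vignette, and the package's actual selection logic may differ.

```r
library(dplyr)

# Hypothetical helper: count phrases, keep the top share of them,
# but never fewer than min_phrases or more than max_phrases.
top_phrases <- function(tidy_df, top_pct = 0.1, min_phrases = 5, max_phrases = 50) {
  counts <- tidy_df %>% count(phrase, sort = TRUE, name = "phrase_count")
  n_keep <- ceiling(nrow(counts) * top_pct)
  n_keep <- min(max(n_keep, min_phrases), max_phrases, nrow(counts))
  head(counts, n_keep)
}

# Toy data with made-up frequencies (not the survey's real counts)
toy <- data.frame(
  phrase = c(rep("sense of community", 3), rep("the lack of", 2), "a lot of"),
  stringsAsFactors = FALSE
)

top_phrases(toy, top_pct = 0.1, min_phrases = 2, max_phrases = 50)
```

The resulting phrase and phrase_count columns are the kind of input a word-cloud plotting function needs.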

After creating a phrase cloud for each question and n-gram size, I briefly reviewed each visual and noted the topics that emerged.

NOTE: RStudio will often give warnings when creating wordclouds because the plot area is not large enough to display everything. Within this code chunk I have set the height and width of plots to 10 inches. Pay attention to the warnings when you run this function! Ultimately the easiest way to make sure you get everything is to write the wordcloud out to an external file and give it plenty of space.
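
Writing a plot to an external file can be sketched with base R's png() device. save_plot_png() is a hypothetical helper for this example, and the commented-out call assumes that create_phraseclouds() draws to the active graphics device, which this vignette does not confirm.

```r
# Hypothetical helper: open a PNG device, run the drawing code, close the device.
save_plot_png <- function(path, draw, width_in = 10, height_in = 10, res = 150) {
  png(path, width = width_in, height = height_in, units = "in", res = res)
  on.exit(dev.off(), add = TRUE)  # close the device even if draw() fails
  draw()
  invisible(path)
}

# Demonstration with a stand-in plot written to a temporary file
out_file <- file.path(tempdir(), "phrase_cloud_demo.png")
save_plot_png(out_file, function() plot(1:10, main = "stand-in plot"))
file.exists(out_file)

# With the package loaded, the call might look like this (assumption, see above):
# save_plot_png("trigram_cloud.png",
#               function() create_phraseclouds(tidy_verbatims3, max_phrases = 50))
```

If the drawing code produces more than one plot, a file name containing a counter such as "cloud%03d.png" lets png() write one file per plot.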

Tri-gram cloud topics

create_phraseclouds(tidy_verbatims3, max_phrases = 50)

Least like

  • Housing options
  • Parking
  • Recycling

Most like

  • Small-town feel/sense of community/friendliness
  • Outdoor activities (parks, trails)
  • Quality of life related
  • Diversity
  • Entertainment options

Unigram cloud topics

create_phraseclouds(tidy_verbatims1, max_phrases = 50)

Least like (topics not seen in the tri-gram cloud are marked as new)

  • Housing options
  • Parking
  • Recycling
  • Homelessness (new)
  • Public safety (new)
  • Transportation issues (new)
  • Job opportunities (new)
  • Presence of university (new)

Most like

  • Small-town feel/sense of community/friendliness
  • Outdoor activities (parks, trails)
  • Quality of life related
  • Diversity
  • Entertainment options
  • Presence of university (new)

How does this compare to the researchers’ manual analysis?

Per the report prepared by National Research Center Inc, the verbatims were all analyzed and coded into topics. If a response spanned more than one topic, the researchers assigned it to the first one mentioned (p. 63 of the NRC report). The plots below show the topics created and the count of surveys assigned to each:

I identified eight topics in the least like question and six in the most like question. The researchers identified eight in each, excluding Other and Don’t know. They grouped the responses differently and made the categories more specific, probably at least in part due to knowing more about the full context of each response.

That said, my review of the phrase clouds yielded a rough equivalent for at least six of the topics in each question.

If one wanted to classify each response into a topic, this could be used as a starting point for at least two different methods:

  • Unsupervised clustering: Utilize six/four as the k in k-modes clustering. The tidy data would need to be transposed and turned into dummies so that the data is structured as one row per survey, one column per phrase. This is done easily with the tidyr and fastDummies packages.
  • Supervised classification: Using these topics, write a case_when() expression to label the responses that contain one of the most common phrases from the cloud. The labeled responses are then used to train a model that classifies the responses without these phrases. Any response where the predicted probabilities are all below a threshold (for example, if none of the topics have a probability over 0.5) would be placed into an “Other” topic.
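
Both starting points can be sketched on toy data. The example below is only an illustration under stated assumptions: the toy phrases and the phrase-to-topic mapping are made up, it uses tidyr's pivot_wider() rather than fastDummies to build the indicator columns, and it stops short of fitting any clustering or classification model.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for tidy_verbatim() output (hypothetical responses)
tidy_toy <- data.frame(
  id = c("1", "1", "2", "2", "3"),
  column_nm = "like_least_verb",
  phrase = c("parking", "housing", "housing", "housing", "recycling"),
  stringsAsFactors = FALSE
)

# Unsupervised prep: one row per survey, one 0/1 column per phrase
phrase_matrix <- tidy_toy %>%
  distinct(id, phrase) %>%
  mutate(present = 1L) %>%
  pivot_wider(names_from = phrase, values_from = present, values_fill = 0L)

# Supervised prep: seed labels from common phrases, keeping the first
# topic mentioned in each response (mirroring the researchers' rule)
seed_labels <- tidy_toy %>%
  mutate(topic = case_when(
    phrase %in% c("housing")   ~ "Housing options",
    phrase %in% c("parking")   ~ "Parking",
    phrase %in% c("recycling") ~ "Recycling",
    TRUE                       ~ NA_character_
  )) %>%
  filter(!is.na(topic)) %>%
  group_by(id) %>%
  slice(1) %>%
  ungroup() %>%
  select(id, topic)

phrase_matrix
seed_labels
```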

Quantify verbatims

Many surveys, including the Bloomington community survey, combine quantitative and qualitative questions. When verbatims are used to let a customer explain why they gave a particular numeric response, the analyst has to sort through many responses and look for common themes related to good/bad numeric scores.

The quantify_verbatim() function aims to help with this. It regresses the numeric scores of a quantitative question on the unique n-grams found in a qualitative response.
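
To make the idea concrete, here is a minimal sketch of regressing a score on phrase indicators with base R's lm(). It is not quantify_verbatim()'s implementation, which this vignette does not show; the toy data, the column names, and the choice of ordinary least squares are all assumptions.

```r
# Hypothetical toy data: one row per response, with a numeric score
# (lower = more favorable) and 0/1 indicators for two phrases
toy <- data.frame(
  score   = c(1, 1, 2, 4, 3, 4, 1, 2),
  parking = c(0, 0, 0, 1, 1, 1, 0, 0),
  trails  = c(1, 1, 0, 0, 0, 0, 1, 0)
)

# Regress the score on the phrase indicators
fit <- lm(score ~ parking + trails, data = toy)

# Coefficient estimates, 90% intervals, and p-values in one table
est <- cbind(
  estimate = coef(fit),
  confint(fit, level = 0.9),
  p_value = summary(fit)$coefficients[, "Pr(>|t|)"]
)
est
```

A positive coefficient means responses containing the phrase tend to have higher scores than responses without it, holding the other phrases fixed.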

This should be used, and interpreted, carefully. Spurious associations will likely be identified if you just start throwing random combinations of questions against the wall!

The Bloomington data is not a great example for this. Splitting the verbatims into two questions (what do you like least, and what do you like most) already tells us the direction of the association. For example, with “recycling” showing up frequently in the least liked question, we know that something related to recycling is a negative for the city’s residents.

Consider this a limited demonstration of how the function works and the output it produces, not necessarily a scenario in which it should be used.

A question about one’s likelihood to recommend that others live in Bloomington is used as the quantitative response, with the likelihood one will stay living in Bloomington for five years as an optional regressor. The response options are on what is to me a strange scale: 1 to 4, with 1 as very likely and 4 as very unlikely; and 5 representing Don’t know. For this reason I converted the 5’s to 2.5 to try to put them in a more reasonable place on a continuous scale.

Coefficient estimates and intervals colored red have p-values below an alpha of 0.1. As the points get lighter, the p-values get further from alpha.

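# Recode 5 ("Don't know") to the scale midpoint; as.numeric() keeps both
# if_else() branches the same type even if the columns are stored as integers
adjusted_verbatim <- bloom_survey2 %>%
  mutate(recommend_living = if_else(recommend_living == 5, 2.5, as.numeric(recommend_living)),
         remain_five_years = if_else(remain_five_years == 5, 2.5, as.numeric(remain_five_years)))

quantify_verbatim(adjusted_verbatim, tidy_verbatims3,
                  quant_var = "recommend_living", xreg = c("remain_five_years"))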