Anyone who has been the unfortunate subject of a manager who makes too many decisions by gut feeling knows how important it is for leaders to ground their decisions in facts and observations. Good leaders know this, and in a bid to make more data-driven decisions, their organizations have become data-centric in nature.
But as valuable as dashboards and benchmarks are for making good decisions, reducing months of hard work into a few key metrics can make a project feel like it's missing its soul.
Achieving this balance is where qualitative data shines. In the era of AI and automation, the beautiful thing about qualitative data is that language is inherently human in nature, and so analysts can balance art and science to find the subtleties in unstructured data. If quantitative data tells leaders and analysts what they should be doing, then qualitative data tells them how they should do it.
The qualitative data collected by business, governments, NGOs, and education institutions is predominantly text data (surveys, interview transcripts, online reviews, etc.), and there’s a lot of it. To handle these enormous datasets analysts employ text analysis techniques from the field of Natural Language Processing (NLP). NLP is a big field, but simply put it is the process of breaking down human language into a form a computer can understand.
This article will walk through a basic NLP data pipeline and describe some of the preprocessing elements unique to NLP, break down some of the NLP techniques that we use at Kai Analytics, and address some of the areas where NLP can be improved. The article will conclude by describing some broader applications of NLP outside the data science community.
This is a big article, so use the table of contents below to find the parts that interest you.
Learn more about qualitative data analysis in our free e-book.
A guide to qualitative analysis tools and best practices for professionals.
On this page:
The Argument for NLP
To better understand the role that qualitative data plays in informed decision making it is helpful to examine how NLP techniques are used in specific case studies. To illustrate this, we’ll examine how NLP techniques can reduce student attrition for colleges and universities.
Education institutions in the US can spend as much as $2,357 to recruit a single student, only for an estimated 30% of them to drop out after just one year of study. This costs institutions approximately $37 billion USD a year, and it means that 30% of new students leave with a year’s worth of debt and no degree to show for it.
To prevent students from leaving school, the first step is to understand why the students who dropped out left. This analysis will hopefully reveal a common theme for the school to address.
Traditionally, this analysis would rely heavily on quantitative measures such as average GPA, attendance records, or level of satisfaction. Notice that none of these measures actually asks students what prompted their decision to leave; hence they produce solutions that are prescriptive in nature and don’t address the underlying concerns and challenges faced by students.
By using qualitative data to understand student concerns, an analyst will have a more descriptive understanding of the problem. They can then deliver actionable recommendations to leadership that will address the “why” of student attrition.
It’s worth noting that many students leave because they do not feel safe on campus. Analysts can use NLP techniques to quickly identify and flag reports of Title IX violations on campus (e.g., discrimination, sexual violence, harassment, retaliation, and hostility). Effective intervention first requires that survivors are heard.
Most people’s first exposure to NLP is the infamous word cloud, but the truth is that word clouds are only a small part of the picture. More often NLP is used to analyze sentiment and key themes, with the ability to drill down by sub-population on important issues. Represented graphically, NLP makes navigating complex issues intuitive, and communicating results straightforward.
Through a combination of techniques, including thematic analysis and segmentation (discussed below), analysts can create powerful stakeholder personas. These personae illustrate the archetypes of a population, including those quieter individuals whose ideas are often drowned out by their louder peers.
Kai Analytics used some of these more advanced techniques at Bastyr University, where giving students and faculty the feeling they were being listened to increased approval rates to 91%, and reduced attrition by 4%, saving an estimated $259,000 USD in revenue compared to the year before.
Applied correctly, NLP can help research departments reduce the personal bias of their analysts. Anytime a person reads something it is filtered through their own experiences and opinions before they understand it, and so naturally, some sort of bias must occur. A traditional solution is to assign two analysts to the same dataset, which will take twice as long and cost twice as much.
NLP is a more efficient solution, so long as care is taken to train the model well. Bias in NLP models is discussed in more detail below, but by using NLP to minimize personal bias, analysts can save on costs and secure buy-in for their more informed recommendations.
Using NLP to increase operational efficiency does more than just save on costs. Analysts know that cleaning data is half the battle and that attempting to read, categorize, and analyze thousands of student responses can feel very daunting. In a Kai Analytics survey of institutional research and assessment professionals, respondents reported spending up to 2 weeks analyzing student comments with a team of 3-4 analysts. In comparison, an NLP program can do this preprocessing in a matter of seconds, giving analysts more time and emotional bandwidth to interpret results.
When qualitative data is used to evaluate how people felt about an experience (university course, tour, workplace environment, etc.) it can be difficult to separate constructive feedback from opinions on specific people. NLP helps to tackle this problem by automatically masking names, sensitive numbers, and protected groups or organizations. When survey respondents know that their feedback won’t be directly attached to a specific person, they often feel more comfortable expressing authentic feelings.
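As a rough sketch of the masking idea (not Kai Analytics' actual implementation), anonymization can start with simple pattern substitution; the name list and the ID pattern below are purely hypothetical, and real tools use trained named-entity recognizers instead:

```python
import re

def mask_pii(comment):
    """Mask names and sensitive numbers in a survey comment.

    A toy illustration: production masking relies on named-entity
    recognition, not a hand-made name list.
    """
    # Hypothetical list of names known to appear in this dataset
    known_names = ["Dr. Smith", "Professor Jones"]
    for name in known_names:
        comment = comment.replace(name, "[NAME]")
    # Mask anything that looks like a student ID or phone number
    comment = re.sub(r"\b\d{6,}\b", "[NUMBER]", comment)
    return comment

print(mask_pii("Dr. Smith never returned my call at 5551234567."))
# → "[NAME] never returned my call at [NUMBER]."
```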
The field of NLP offers analysts some very powerful techniques that make incorporating qualitative data into their research more efficient and rewarding. The benefits of using these techniques for data science are usually increased efficiency, a defensible process for reducing personal bias and a system for protecting privacy. Even an understanding of the basic concepts of the field can help you to evaluate options for processing qualitative data.
What is an NLP Pipeline?
All data science involves some sort of data processing pipeline, and text analysis is no different. Since the techniques used here fall under the NLP umbrella, it’s called an NLP Pipeline. Pipelines like this can be used to achieve many different results. This article examines each step of the NLP pipeline, breaking it down into its components and the techniques used to interpret data, by following a qualitative dataset from the higher education space as it travels through the pipeline and is used to create recommendations that guide effective policy.
Working with Survey Data
Survey data is one form of qualitative data, and a specialty of Kai Analytics. However, this pipeline could be used to analyze any sort of text data, from tweets and reviews to focus group notes and interview transcripts.
After collecting the results of their survey, an analyst will typically aggregate them together in a .csv format. While modern survey platforms offer export formats like Excel, .csv, or even SPSS, .csv is the standard choice because its plain-text nature typically handles written responses better.
For example, survey respondents will often make lists, which they identify by putting a dash in front of the statement.
-The content was interesting.
-The marking was fair.
If the analyst were to use an Excel format, Excel would interpret the “-” as a negative sign and expect a numeric value, generating a #NAME? error. The analyst could, in theory, manually place an apostrophe in front of each dash, but it’s exactly that kind of manual work that they are trying to avoid by using NLP.
To learn more about designing surveys and collecting data from diverse populations, check out these two articles:
Data Preprocessing
Now that the data has been collected and exported, preferably in a .csv format, it is time to do some pre-processing. These steps prepare the qualitative survey data for the analysis techniques used later. Pre-processing is all about turning messy human thought into a format that a computer can understand and analyze. This is done by cleaning up the data, breaking down sentences into parts, removing stop words (like me, I, you, the), tagging grammatical structures to give the program context (I like to bike vs. road biking is like running a marathon), and simplifying the data to make analysis easier.
Step 1: Basic Data Cleaning
In this first step, the analyst will perform some basic data cleaning steps, such as accounting for blank responses, removing duplicates, and spell checking. The idea is to reduce noise in the data set so that analysis is more efficient. The problem with passionate survey responses is that they are usually not passionately edited, so that needs to be taken care of before any meaningful analysis can begin. This can be done in a couple of ways. The most obvious is to use the spell check function in a word processor and go through the errors one by one. This works for small datasets but isn’t scalable. To deal with a larger dataset, an analyst can use either a rule-based or a deep learning approach. Both work by establishing a benchmark for what is correct and then having the program automatically correct other mistakes. These approaches are covered in more detail below under “Sentiment Analysis”; ultimately, they are just more scalable ways to do a spell check.
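The bookkeeping part of this step (blank responses and duplicates; spell checking set aside) can be sketched in a few lines of Python:

```python
def clean_responses(responses):
    """Basic cleaning: drop blanks and exact duplicates, collapse whitespace.

    A minimal sketch; spell checking would be layered on top with a
    rule-based or deep learning corrector.
    """
    seen, cleaned = set(), []
    for r in responses:
        r = " ".join(r.split())        # normalize stray whitespace
        if not r:                      # account for blank responses
            continue
        key = r.lower()
        if key in seen:                # remove duplicates
            continue
        seen.add(key)
        cleaned.append(r)
    return cleaned

raw = ["The course was  great!", "", "the course was great!", "Fair marking."]
print(clean_responses(raw))  # → ['The course was great!', 'Fair marking.']
```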
Step 2: Tokenization
Next, the analyst breaks sentences down into individual words (or tokens!).
So, this list of responses:
Would be broken down into this:
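In code, a minimal tokenizer might look like the sketch below; production pipelines would use a library tokenizer such as NLTK's or spaCy's, and the sample responses here are just illustrative:

```python
import re

def tokenize(response):
    """Split a response into lowercase word tokens using a simple regex."""
    return re.findall(r"[a-z']+", response.lower())

responses = ["The content was interesting.", "The marking was fair."]
print([tokenize(r) for r in responses])
# → [['the', 'content', 'was', 'interesting'], ['the', 'marking', 'was', 'fair']]
```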
Step 3: Removing Stopwords
When humans write sentences, we write them in such a way that they sound good and make sense when we say them. We use words like “the” to give our sentences structure, but these words are not really necessary to derive the subject, theme, or sentiment of the sentence, and on a macro scale they will clutter up data. Analysts call these Stop Words, and they remove them.
Not all stop words clutter the data, and this is where context comes into play. Depending on the situation, “not” and other negation words are crucial for sentiment analysis, even though they would clutter the results of other techniques. So, analysts must review the list of stop words and add to or subtract from it as necessary.
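A sketch of stop word removal, using a deliberately tiny hand-picked list (production lists, such as NLTK's, hold 100+ words):

```python
# Note that "not" is deliberately left off this list, because negation
# matters for sentiment analysis.
STOP_WORDS = {"the", "a", "an", "i", "me", "you", "was", "is", "to"}

def remove_stopwords(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stopwords(["the", "content", "was", "not", "interesting"]))
# → ['content', 'not', 'interesting']
```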
Step 4: Tagging Parts of Speech
This step is crucial to the preprocessing phase as it allows the computer to understand the context of sentences despite the absence of stop words. Part of Speech tagging is where the analyst assigns “tags” to each word in the sentence, like in the example below. These tags are crucial in the next step of the preprocessing phase, Lemmatization.
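To show what tagged output looks like, here is a toy rule-based tagger; real taggers (NLTK's `pos_tag`, spaCy) are statistical models, and these suffix rules are illustrative only:

```python
def tag_pos(tokens):
    """Assign a crude part-of-speech tag to each token.

    Toy suffix rules, only meant to show the shape of the output.
    """
    tagged = []
    for tok in tokens:
        if tok in {"he", "she", "it"}:
            tagged.append((tok, "PRP"))   # pronoun
        elif tok.endswith("ing"):
            tagged.append((tok, "VBG"))   # gerund / present participle
        elif tok.endswith("s"):
            tagged.append((tok, "VBZ"))   # 3rd-person verb (naively)
        else:
            tagged.append((tok, "NN"))    # default: noun
    return tagged

print(tag_pos(["he", "bikes"]))  # → [('he', 'PRP'), ('bikes', 'VBZ')]
```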
Step 5: Lemmatization
Lemmatization is the process of reducing each word to its dictionary root. For example, biking becomes bike, [he] bikes becomes bike, and so does [the] bikes. So bike, bike, bike. Hold on, that doesn’t make sense; now they all look the same. This is where the earlier steps come in. “He bikes” and “The bikes” obviously mean two different things, communicated by the words he and the. But since those words were removed in step 3, the meaning is lost. So the analyst restored the meaning by adding Part of Speech tags in step 4: “He bikes” would be given a verb tag, becoming “bikes (VBZ)”. Now, when these words are lemmatized in step 5, the part of speech tag stays with each word while it is reduced to its lemma. The example becomes: biking, [he] bikes, and [the] bikes are reduced to bike (VBG), bike (VBZ), and bike (NNS). This allows the program to analyze all these words as a single item while retaining the context needed for techniques like sentiment analysis.
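A sketch of tag-preserving lemmatization, using a tiny hypothetical lemma dictionary; real lemmatizers (such as NLTK's WordNetLemmatizer) consult full dictionaries and use the POS tag to pick the right entry:

```python
# Toy lemma dictionary; entries here are illustrative only.
LEMMAS = {"biking": "bike", "bikes": "bike", "running": "run", "ran": "run"}

def lemmatize(tagged_tokens):
    """Reduce each (word, tag) pair to (lemma, tag).

    The tag travels with the lemma, so grammatical context survives
    the reduction.
    """
    return [(LEMMAS.get(word, word), tag) for word, tag in tagged_tokens]

print(lemmatize([("biking", "VBG"), ("bikes", "VBZ"), ("bikes", "NNS")]))
# → [('bike', 'VBG'), ('bike', 'VBZ'), ('bike', 'NNS')]
```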
Text Analysis Techniques
Now that the qualitative data has been cleaned and preprocessed it’s ready to tell the analyst some important things about the people who created it. The story of the data is told by the techniques used to analyze it. What follows are some useful NLP techniques for analyzing qualitative survey data, each with a description of the theory, an explanation of the results, and the strengths and weaknesses of each one.
N-Grams
Theory
N-Grams form the core of NLP and are the foundation of many of the other techniques in this article. While most techniques build on the N-Gram in some way, N-grams themselves can still tell an analyst some useful surface-level statistics about the qualitative data they are analyzing.
An N-gram is simply a way of breaking down text data into manageable pieces, where N is the number of words in the gram: a unigram holds one word, a bigram two, a trigram three, and so on.
How to Interpret Results
The results of N-gram analysis can be used for many things, but here is a simple application that uses the N-gram to build a network graph, and a better word cloud. If you are interested in the python code used to perform this analysis, as well as all the preprocessing steps above, check out this video of our CEO Kai Chang presenting this concept.
In this application, the analyst will use a Bigram. Bigrams are the “goldilocks zone” for most analyses, as they offer more context than a single word on its own but are not as noisy as a Trigram.
Break the data apart into Bigrams.
Count the number of times each Bigram occurs to show how often the idea is repeated throughout the data.
These results can be displayed with a network graph, like the one shown below. In this style of chart, words are displayed on their own, with the lines that join them showing the Bigram relationship. The thickness of the line shows the “strength” of the relationship, or how often that Bigram was repeated throughout the dataset. This chart also shows multiple viewpoints on the same topic. The course has links to great, good, and excellent, so it could be interpreted that this course is having a positive impact on respondents.
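The two steps above can be sketched in a few lines; the token lists are made-up survey fragments, and the resulting counts are what a network graph would render as edge thickness:

```python
from collections import Counter

def bigrams(tokens):
    """Pair each token with its neighbour: the classic bigram."""
    return list(zip(tokens, tokens[1:]))

# Hypothetical tokenized responses
responses = [
    ["great", "course", "content"],
    ["course", "content", "interesting"],
    ["great", "course", "overall"],
]

# Count how often each bigram repeats across the dataset
counts = Counter(bg for toks in responses for bg in bigrams(toks))
print(counts.most_common(2))
# → [(('great', 'course'), 2), (('course', 'content'), 2)]
```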
Strengths:
An easy technique to understand and implement
Easily built on to perform more complicated analyses
Flexible
Limitations:
If the response data covers a wide range of topics, N-grams won’t provide enough clarity to interpret and identify major themes. Solution: Segment your dataset.
Topic Modelling
Theory
Topic modelling, sometimes called thematic extraction, is an NLP technique that statistically uncovers themes or topics from a large body of text. Topic modelling is very effective for understanding what the main ideas are in a body of text, and how those ideas are related to one another.
Like most NLP techniques, Topic modelling starts by using N-grams to break down the text and then converting the words (tokens) into numbers so that the computer can understand and store the data. This process of converting words to numbers is called vectorization, pictured in the graph on the left. Vectorization can be compared to giving each word a position in 3D space, called a vector. In a large document, or over a large dataset, words with more similar meanings (semantics) will appear closer together in this 3D space. By measuring the distance between words, analysts can create a matrix of words with a similarity score for each pair of words, otherwise known as word embeddings. In the graph on the right, Australia appears closer to Canberra, and Peru closer to Lima. The principle of vectorization is also important to understanding bias in Machine Learning, which is discussed later.
Vectorization is flexible enough to be applied on larger scales. Remember that N-grams can hold any number of words (n = number of words in the gram), so sentences can be vectorized as well. Or whole articles, if these articles were assigned semantic meaning. In this way, an analyst could analyze all Wikipedia entries or New York Times articles to see which topics are covered most often, which topics are talked about together the most, and how the publication talked about various issues. This could provide valuable insight into how our society perceives different issues, or how brands position themselves. But how does a computer go about learning the distance between words?
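To make the distance idea concrete, here is a toy cosine-similarity calculation over hypothetical 3-dimensional word vectors; real embeddings have hundreds of dimensions and are learned from huge corpora, so the numbers below are invented for illustration:

```python
import math

# Hypothetical 3D word vectors (real embeddings are learned, not hand-set)
vectors = {
    "australia": [0.9, 0.1, 0.3],
    "canberra":  [0.8, 0.2, 0.3],
    "peru":      [0.1, 0.9, 0.4],
    "lima":      [0.2, 0.8, 0.5],
}

def cosine(u, v):
    """Similarity of two vectors: values near 1.0 mean similar meaning."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine(vectors["australia"], vectors["canberra"]))  # high, ~0.99
print(cosine(vectors["australia"], vectors["lima"]))      # much lower
```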
More Theory
Take the example of a student, Sarah, writing a course evaluation. When Sarah is asked about the course, she might write about both final exams and course grading. This prompts the model to associate Sarah’s evaluation with the topics “exam” and “grading”. The model repeats this process with every student in the class and sees which topics are most often associated with each other.
To use an analogy, picture the analyst as a dating coach tasked with leading a speed dating event. Their goal is to match the highest number of couples with shared interests. As the coach, they dictate how many rounds, or iterations, will take place. Each round randomly pairs up two singles to talk. Now for the sake of this example, imagine that all these conversations are about course evaluations and that shared interests are found by grouping words into topics. If a topic makes sense, then that couple would successfully match.
In the first round of conversation, the words that match up may be professionals, material, course (Topic 1); work, articles (Topic 2); and love, textbook (Topic 3). Clearly, these topics don’t make sense because the words inside them don’t share similar semantic meanings. So, the dating coach continues to shuffle singles around for rounds 2, 3, 4, 5...all the way up to n.
After the nth round of speed dating, or after every word has been compared to every other word in a giant matrix, words that have the closest semantic meaning will end up grouped together like a single person finding someone who shares their interests. In the successful round, the words that match up may be course, relevant, useful (Topic 1); material, textbook, articles (Topic 2); and work, professionals (Topic 3). These topics finally make sense, and the singles can leave the event as couples with shared interests.
When a dating coach meets a couple, they may instinctively know if and why a couple works or not. This is called domain knowledge, and the analyst will use theirs to understand that Topic 1 is probably talking about courses that students found relevant and useful, Topic 2 is about course materials, and that Topic 3 may have to do with jobs and alumni success.
It’s important to note that topic modelling uses a probabilistic distribution, so some words might choose to spend weekdays with one Topic and weekends with another!
How to Interpret Results
One method of visualizing topic modelling results is to display the distance between topics on a 2D plot. This is called an Inter-Topic Distance map, a form of the Latent Dirichlet Allocation visualization. The size of the circles represents how often the topic comes up in the text. The distance between the circles shows how related the topics are. If 2 circles were to overlap, then they would be closely related. In the graph above, topics 2 and 4 share quite a bit of overlap. So, comments in theme 2 will likely share some similarities with theme 4. Perhaps theme 2 is about great courses, and theme 4 is about textbooks. So, an analyst could surmise that the quality of the textbook has a significant impact on students' perception of the course. The analyst could then dig deeper into the themes to see what about textbooks was closely related to great courses, and then recommend that those practices be implemented to improve course ratings.
Strengths:
A quick way to understand major topics or themes in a large amount of data
Limitations:
Still requires domain knowledge to understand what the topics are about.
The number of topics needs to be estimated using a coherence score, statistical test or best practice.
Sentiment Analysis
Theory
Sentiment Analysis, or opinion mining, is a technique used to determine how respondents feel about a subject. At its core, sentiment analysis shows whether data is positive, negative, or neutral. In more advanced forms it can be used to recognize basic feelings and emotions (happy, grateful, sad, angry), the urgency of a statement (urgent or not urgent), and the intention of a respondent (enrolling or dropping out). There are 3 basic approaches to sentiment analysis used in NLP that rely on different levels of machine learning. These 3 approaches apply to many more NLP techniques, as the approaches are really just different ways of leveraging machine learning to achieve text analysis results.
Rule-Based Approach
In this approach, rules are established by the analyst to tell the computer the meaning behind grammatical structures. These rules take the form of the techniques used in preprocessing, such as stemming, tokenization, and part of speech tags.
The analyst creates a large dictionary of polarized words, maybe a list of positive words and a list of negative ones, or words that describe anger or joy.
The program counts the number of words and tallies up their polarity.
The aggregated score shows the overall sentiment of the response. Other methods can help create weighted averages that account for the length of the comment or the use of strong words.
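These three steps can be sketched as a word-counting tally; the polarity dictionary below is hypothetical, whereas real lexicons (such as VADER's) hold thousands of scored words:

```python
# Hypothetical polarity dictionary: positive words score +, negative -
POLARITY = {"great": 1, "good": 1, "fair": 1, "boring": -1, "unfair": -1, "hate": -2}

def score_sentiment(tokens):
    """Tally word polarities; a positive total means positive sentiment."""
    return sum(POLARITY.get(t, 0) for t in tokens)

print(score_sentiment(["great", "course", "fair", "marking"]))      # → 2
print(score_sentiment(["boring", "lectures", "unfair", "marking"])) # → -2
```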
The obvious advantage of the rule-based approach is that it is simple to understand and implement, but be warned: as rules are added, models become more complex, and the system requires ongoing fine-tuning and maintenance to operate, an inefficiency often expressed in dollars.
Machine Learning Approach
This approach leverages machine learning algorithms to counteract some of the long-term maintenance costs and complexity of a rule-based approach, as well as to give the writer of this article something to talk about at dinner parties. This article won’t delve into the science behind machine learning models, but the core concept is simple.
A machine learning model is fed training data, which in this case would consist of comments alongside an overall score, kind of like an online review and its star rating. These “correct answers” train the model to categorize words and to recognize what different grammatical structures look like (concepts like irony and sarcasm are the holy grail for these models). The model is then fed raw data, and it uses what it has learned to categorize words and ultimately analyze sentiment.
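As a sketch of the concept, here is a toy Naive Bayes sentiment classifier trained on four made-up comments; real systems use far larger training sets and stronger models:

```python
from collections import Counter
import math

# Toy training data: comments with "correct answer" labels, like
# online reviews paired with star ratings.
train = [
    ("great course loved the material", "pos"),
    ("excellent professor great lectures", "pos"),
    ("boring lectures hated the marking", "neg"),
    ("terrible course unfair marking", "neg"),
]

# Count how often each word appears under each label
word_counts = {"pos": Counter(), "neg": Counter()}
for text, label in train:
    word_counts[label].update(text.split())

vocab = {w for c in word_counts.values() for w in c}

def predict(text):
    """Naive Bayes with add-one smoothing: pick the label whose
    training words best explain the new comment."""
    scores = {}
    for label, counts in word_counts.items():
        total = sum(counts.values())
        score = 0.0
        for w in text.split():
            score += math.log((counts[w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("great material"))  # → 'pos'
print(predict("unfair boring"))   # → 'neg'
```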
The obvious strength of this approach is its ability to operate independently; however, this learning takes time, and machine learning algorithms can have spotty results, especially at the start. They are also vulnerable to bias in the training data, online trolling, and review bombing. Picture a naive toddler.
Hybrid Approach
This approach combines rule-based and machine learning approaches to get the best of both worlds, reducing the maintenance costs of a complex rule-based approach while still giving analysts enough control to increase the accuracy of machine learning approaches. An example of a program like this is our tool Unigrams (bear with me). Unigrams is built on the rule-based systems developed by our analysts to analyze higher education data and utilizes machine learning algorithms to take these systems a step further and build its own domain knowledge of word categories and grammatical structures. In the case of Unigrams, this domain knowledge is specific to the words that are important to professionals in Higher Education. For example, at the University of Victoria "SUB" refers to the Student Union Building and not a popular sandwich from the cafeteria. An analyst could tell Unigrams to store this rule in its domain knowledge to make analysis more efficient, but could just as easily remove it without breaking other rules.
So, analysts can use machine learning models to immortalize the things they’ve learned in a computer that can do monotonous tasks, like preprocessing, in seconds. The analyst can then use the machine learning model in new situations they encounter or give it to team members who are solving similar problems. By doing this the analyst can save the time they would usually spend setting up data and use that time to try new methods, solve new problems, or go try that new SUB they keep reading about.
How to Interpret Results
The results of sentiment analysis are easy to read. Since sentiment analysis ranks phrases on a polar scale, bar graphs like the one below can be used to show the distribution of responses, with the average highlighted to show overall sentiment. However, it is important to remember that these are the feelings of the group overall and may not show the thoughts and feelings of various subpopulations. To do that the analyst must segment the data, discussed below.
Strengths:
Gives analysts context for the feelings around a topic.
A quick way to assess the overall feelings of respondents toward a particular topic.
Limitations:
Models struggle to handle implied meaning such as sarcasm.
Works best if the questions are narrowly focused. For example, “tell us what you liked or disliked about this course,” works better if it is split into two questions.
Segmentation
Theory
Segmentation in this context refers to the technique in statistical analytics of separating data according to the segment of the survey population it came from. While segmentation itself isn’t an NLP technique, it is very useful in the field of survey analytics since it can be used to break up survey respondents into different subpopulations. The techniques above can then be applied to these subpopulations to discover the thoughts and feelings of different groups on campus, making recommendations more in line with strategic goals.
How to Interpret Results
The method used to interpret results will vary from technique to technique. In the graph below results are split up into different personas, based on demographic information, to see which groups said what. The larger the bar, the greater the number of people from that group whose responses fell into that category.
Segmentation cannot exist in a vacuum. To segment respondents, the survey needs to ask some demographic questions (age, sex, gender, race, sexual orientation, or more specific questions like housing: On-campus or off-campus?). These responses will then be used to match text responses to the groups that gave them. This is one of the reasons why privacy is so important in NLP. If an analyst expects respondents to fill in personal information to improve the quality of analysis, then that analyst must be able to guarantee to the respondent that their information will remain private and will only be used to aggregate responses.
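The grouping itself is straightforward; here is a sketch with made-up segments and sentiment scores (in practice these would come from the demographic questions and the sentiment analysis above):

```python
from collections import defaultdict

# Hypothetical survey rows: (housing segment, sentiment score)
responses = [
    ("on-campus", 1), ("on-campus", -2), ("on-campus", -1),
    ("off-campus", 2), ("off-campus", 1),
]

# Group sentiment scores by segment
segments = defaultdict(list)
for segment, score in responses:
    segments[segment].append(score)

# Average sentiment per segment reveals sub-population differences
for segment, scores in segments.items():
    print(segment, sum(scores) / len(scores))
# on-campus averages negative; off-campus averages positive
```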
Strengths:
Gives the analysis context
Gives analysts greater insight into the needs and desires of different sub-populations.
Limitations:
Limited by the socio-demographic questions asked to respondents.
Not appropriate for small populations where respondents risk being identified.
Bias in Natural Language Processing and Computational Linguistics
While NLP is very good for reducing the personal bias of the analyst reading the comments, the field faces some challenges with biases in the machine learning models. These biases are not maliciously programmed by the computer scientists who write these models, they are learned by the model itself from the training data it is fed.
Most of the time, the training data given to a machine learning model takes the form of some text written by human beings for human beings. As such, it contains the bias of the people who wrote it. The model then learns these biases as correct. Whether it is beautiful or ugly, the bias in our machine learning models is a reflection of ourselves, and the society that shapes us. By seeking to understand the bias in our models, we can learn where and how we can improve to make our society a fairer and more equitable place for all peoples.
For example, GPT-3, a model developed by OpenAI, picked up inherent biases related to race, gender, and religion. Some of the gender bias was related to occupation. While words like “king” and “queen” are associated with specific genders, words like “computer programmer” and “homemaker” are not. However, as the model was fed training data, it learned that:
Man is to Woman as King is to Queen
Man is to Woman as Computer Programmer is to Homemaker
While the first statement is correct, the second is not, and is problematic if the model is applied to real-world problems, such as hiring.
The principle at work here was first mentioned in the discussion of Topic Modelling above, which explained how models store words with similar meanings closer together through vectorization. In the case of bias, words that should not have similar meanings are stored close together in error. The graph on the right displays proper storage for the gendered examples of Man is to Woman as King is to Queen. The graph on the left shows the improper storage of the non-gendered words computer programmer and homemaker.
There are some solutions. Since the issue stems from the training data, there are before and after approaches to solving the problem.
De-Bias the data set. This includes removing problematic data sets from use but also compensating for bias. For instance, if a data set included the statement “He was a great computer programmer” then the statement “She was a great computer programmer” would be added to balance the data set. This way, the model learns equal weight for each word.
De-Bias the model. In this method, separate algorithms are written that can identify and modify biased statements in the model's memory. Algorithms like the Double-Hard Debias algorithm are showing promising results and are favoured because they allow machine learning models that are already running to be de-biased.
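The first approach, de-biasing the data set, can be sketched as counterfactual augmentation: for each gendered sentence, a gender-swapped copy is added so the model learns equal associations. The swap table below is deliberately minimal:

```python
# Minimal swap table; real augmentation pipelines cover many more terms
# and handle capitalization and grammar.
SWAPS = {"he": "she", "she": "he", "his": "her", "her": "his"}

def swap_gender(sentence):
    """Return a copy of the sentence with gendered words swapped."""
    return " ".join(SWAPS.get(w, w) for w in sentence.split())

data = ["he was a great computer programmer"]
augmented = data + [swap_gender(s) for s in data]
print(augmented)
# → ['he was a great computer programmer', 'she was a great computer programmer']
```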
So, while bias in machine learning models poses a formidable threat to their widespread adoption, steps are already being taken to address and correct these problems. However, there is still a long way to go before a machine learning model can be reliably used for life-or-death situations, like predicting crime.
NLP in the World
This article described a process we use at Kai Analytics: employing NLP techniques to analyze a large amount of qualitative survey data, deriving insight into what stakeholders care about, and using that insight to inform recommendations. But many of these core NLP techniques can be layered with other technologies, like chatbots, AI, social media, or word processors, to create other exciting tools. After all, Natural Language Processing (NLP) is nothing more than the process of translating human thought into something a computer can comprehend, communicated through the medium of natural language.
Great developers can also sprinkle NLP concepts into their code to vastly improve user experience, like the way Grammarly slowly learns to recognize how the writer wants a piece of writing to sound and then makes recommendations so the writer can better achieve that goal.
To the left is the window that pops up when starting to edit a new document in Grammarly. Now, Kai Analytics is in no way associated with Grammarly so we can’t say for sure what is going on here. But it isn’t hard to spot the similarities between these “goals” and the sentiment analysis discussed above. In fact, those “experimental” disclaimers sure do look like someone collecting training data for a machine learning model.
Models like the one used by Grammarly are difficult to program and take a great deal of time and depth of understanding to achieve. But that doesn’t mean that the average developer or analyst cannot start to apply these topics in their own code. To learn how the NLP pipeline discussed above looks in a Python use case, check out this video of our CEO Kai Chang presenting on Topic Modelling and walking participants through some of his own Python code.