Posted by Ray Poynter, 14 June 2020
In order to assess automated text analytics, we need to think about what an insights professional wants from text analytics, and that depends on the project and the data sources. Too often we see text analytics assessed in terms of what the software can do, but that can miss the point: the question should be about the benefits it can give to the researcher or insighter, not the attributes of the software.
Here are some common situations along with the use that text analytics can be put to:
- Quant survey data: Analyse Open-ended comments to extract themes and quantify main themes.
- Quant survey data: Convert open-ended comments into codes and quantify.
- Quant survey data: Use open-ended comments to assess differences in themes and sentiment between groups (e.g. between Promoters and Detractors).
- Quant survey data: Track themes and sentiment over time.
- Depth interviews and focus groups: Analyse text to identify themes.
- Depth interviews and focus groups: Compare differences in themes between different groups (e.g. between younger and older users, users and non-users).
- Online discussions: Analyse text to extract themes.
- Online discussions: Compare differences in themes by groups (e.g. older & younger) and in terms of context (when prompted with X compared with when prompted with Y).
- Social media posts: Extract themes and sentiment.
- Social media posts: Track themes and sentiment over time and quantify.
Themes, sentiment and codes?
The list above only makes sense if you know what I mean by themes, sentiment and codes – because each of them has quite a different meaning in different contexts and different disciplines.
I am using codes in the sense of coding open-ended comments to turn text data into quantifiable numbers. This only makes sense when the underlying study supports quantification. The open-ended comments in a quant study can usually support quantification, for example the answers to the question “What did you like about your stay at the resort?”. We want to know how many people said things that could be coded as, for example, “Room”, “Staff”, “Restaurants”, etc. If twice as many people at location A said “Staff” as said “Room”, while at location B more people said “Room” than “Staff”, we know there is something worth investigating.
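As a sketch of how this kind of coding can be quantified, here is a minimal keyword-based coder in Python. The codeframe and keyword lists are hypothetical illustrations; a real codeframe would be built by reading a sample of the verbatims (and real systems go well beyond substring matching).

```python
from collections import Counter

# Hypothetical codeframe: each code is defined by a list of trigger words.
# In practice a researcher builds this from a sample of the actual verbatims.
CODEFRAME = {
    "Room": ["room", "bed", "view"],
    "Staff": ["staff", "reception", "friendly"],
    "Restaurants": ["restaurant", "food", "breakfast"],
}

def code_comment(comment):
    """Return the set of codes whose trigger words appear in the comment.
    Note: simple substring matching, so e.g. 'bathroom' would trigger 'room'."""
    text = comment.lower()
    return {code for code, words in CODEFRAME.items()
            if any(w in text for w in words)}

def tabulate(comments):
    """Count how many comments received each code (a comment can get several)."""
    counts = Counter()
    for c in comments:
        counts.update(code_comment(c))
    return counts

# Toy verbatims for one location.
location_a = ["The staff were so friendly", "Loved the breakfast",
              "Reception staff helped us a lot"]
print(tabulate(location_a))  # e.g. Staff mentioned by 2, Restaurants by 1
```

Running the same tabulation for each location gives the kind of “Staff vs Room” comparison described above.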
When working with text such as the transcript of a focus group, counting the codes usually makes little sense. For example, when discussing people’s experience of using a new credit card, knowing that everybody mentioned “easy to use” has some value, and knowing that nobody mentioned “colour of the card” has some value. But knowing that 25% mentioned the interest rate and 50% mentioned the mobile app has no value, because a) the sample was not created for that purpose, b) the sample is too small, and c) it depends on how the moderator ran the discussion.
Mostly, researchers divide sentiment into Positive, Negative, and Neutral. This is not the only way to look at sentiment, but it is a very common one. Like codes, sentiment is normally a way of converting text data into numbers – which means the underlying data needs to be suitable for quantification. A good example of where sentiment is appropriate is measuring social media comments about a brand over time. Measuring sentiment over time can highlight changes, which the researcher can then try to link to other phenomena (for example advertising, PR, news stories, or changes in product or service).
One of the challenges with sentiment analysis is knowing how to tweak it for different contexts. In general, sentiment analysis works by taking comments and scoring them, for example by using word lists. If a comment contains the word ‘happy’ it gets a positive score; if it contains the word ‘unhappy’ it gets a negative score. The better systems also take into account modifiers and amplifiers, such as ‘not’ and ‘very’. However, some words can shift from negative to positive (or positive to negative) depending on the context. For example, ‘serious’ is a negative word in many sentiment dictionaries, but in fields such as study, books and games it can often be positive. Out-of-the-box sentiment analysis is generally to be avoided; the dictionary of terms (and of modifiers) should be inspected and tweaked for the context.
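To illustrate the list-based scoring described above, here is a minimal dictionary-based sentiment scorer. The word lists, amplifier weight and negation handling are toy assumptions, not a real lexicon; note that ‘serious’ has been placed in the positive list here, as it might be when tuning for, say, book reviews.

```python
# Toy sentiment lexicon -- real ones contain thousands of entries
# and should be inspected and tweaked for each context.
POSITIVE = {"happy", "great", "serious"}   # 'serious' assumed positive in this context
NEGATIVE = {"unhappy", "bad", "slow"}
AMPLIFIERS = {"very": 2.0}                 # multiply the following word's score
NEGATORS = {"not"}                         # flip the following word's score

def score(comment):
    """Score a comment: +1 per positive word, -1 per negative word,
    with simple handling of a negator or amplifier immediately before it."""
    tokens = comment.lower().split()
    total = 0.0
    for i, tok in enumerate(tokens):
        value = 1.0 if tok in POSITIVE else -1.0 if tok in NEGATIVE else 0.0
        if value == 0.0:
            continue
        prev = tokens[i - 1] if i > 0 else ""
        if prev in NEGATORS:
            value = -value
        elif prev in AMPLIFIERS:
            value *= AMPLIFIERS[prev]
        total += value
    return total

print(score("very happy with the service"))  # amplified positive
print(score("not happy at all"))             # negated positive
```

Even this sketch shows why tuning matters: swapping ‘serious’ between the lists changes every score that contains it.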
There is some scope to use sentiment analysis with data from sources like focus groups and online discussions, but in a quite specialised way. If you have hundreds of transcripts from different focus groups, you would be able to say, for example, that the response to the ideas in this group was systematically more negative than in 80% of all focus group discussions – but the moderator would probably have noticed that and already reported on it. You could compare the sentiment of discussions with older people and with younger people – but that is likely to tell you more about whether older and younger people express themselves differently (and/or whether they have a more or less positive/critical outlook in general) than about the product itself.
I am using the term themes very widely. At one end of my scale, it might be as simple as word counts and n-grams. At the other end of the scale, the themes could be topics derived by clustering the language used, as in topic modelling. [N-grams are sequences of words used together: bi-grams are counts of pairs of words, e.g. how often the pair “big data” occurs; tri-grams are combinations of three words, e.g. how often “yellow brick road” occurs in the text.]
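The simple end of that scale is easy to make concrete. Here is a minimal n-gram counter, using the bi-gram example from above (a real pipeline would also handle punctuation, casing and stop words):

```python
from collections import Counter

def ngrams(text, n):
    """Count the n-word sequences (n-grams) in a text.
    Naive whitespace tokenisation -- no punctuation or stop-word handling."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

text = "big data needs big data tools"
print(ngrams(text, 2))  # bi-grams: the pair ('big', 'data') occurs twice
print(ngrams(text, 3))  # tri-grams: each three-word run occurs once
```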
In quantitative studies, themes can be used to highlight differences between groups and over time. For example, if we looked at themes from surveys showing what people wanted from shopping malls over the last 12 months, we would expect to see an increase in the number of mentions of a theme related to safety/hygiene. To use themes quantitatively, the underlying text has to be capable of being quantified. If we are tracking social media mentions, then we can ascribe numbers to themes that express the relativities we would expect to find in a specific population (e.g. the population of people who tweet about fast food outlets). If we are analysing the transcripts from an online discussion, based on 50 people and perhaps generating 10,000 words, the numbers behind the themes have a much less precise meaning. For example, we might say from an online discussion, that older people mentioned ‘social distancing / hygiene’ much more than younger people, but we would not say 2.6 times as much (because the structure of the data does not support quantification and the role of the moderator needs to be accounted for).
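The point about directional rather than precise claims can be sketched in code. This hypothetical example counts mentions of a theme by group but reports only which group mentioned it more, deliberately avoiding a ratio like “2.6 times as much”:

```python
def theme_mentions(comments, theme_words):
    """Count how many comments mention any of the theme's trigger words."""
    return sum(any(w in c.lower() for w in theme_words) for c in comments)

# Hypothetical discussion contributions, split by age group.
older = ["worried about hygiene", "social distancing matters", "hygiene again"]
younger = ["just want it open", "hygiene is fine"]

theme = ["hygiene", "distancing"]
older_rate = theme_mentions(older, theme) / len(older)
younger_rate = theme_mentions(younger, theme) / len(younger)

# Report direction only -- discussion data does not support a precise ratio,
# and the moderator's influence on the conversation is not accounted for.
if older_rate > younger_rate:
    print("Older participants mentioned the theme more often")
elif older_rate < younger_rate:
    print("Younger participants mentioned the theme more often")
```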
In qual and quant studies, themes can be used to understand what people were talking about. Unlike human-only analysis, text analytics is not swayed by articulate posts, extreme cases, hard cases, or amusing stories. The creation of themes can springboard the start of human analysis or double-check its results. For example, when I have analysed text myself and then used text analytics, there have been situations where a theme I thought was important turned out to be much less relevant than I had thought (usually a theme I already believed was important and where there were one or two very convincing comments in the text). There have also been situations where the text analytics suggested a theme that I had either overlooked or thought was not so important.
Analysis using themes works best when it adopts what is sometimes referred to as a centaur approach, i.e. when it combines human analysis and text analytics.
Final thoughts on automated text analytics
I want to finish this post with two thoughts about using and assessing automated text analytics.
1) Fully automated text analysis tends to work best for repeat projects, where the system has been tuned to the context. For example, if you conduct several hundred concept and ad tests for breakfast cereals per year in the USA, there is a good chance that you could fully automate the analysis of the text. However, even in these cases, I would tend to include a process whereby the software could flag up problems with analysing text, for example based on lower levels of certainty, or higher numbers of outliers.
2) You can’t assess automated text analytics in the same way you typically assess human analysis. Human analysis is typically assessed in terms of how accurate each step is. When we do that with automated analysis we see many anomalies. I recall years ago looking at a comment “Thank you Lufthansa, you made me miss my brother’s wedding”, which the software had coded as positive. Automated analysis works by dealing with large numbers; there will be mistakes, but if the mistakes are randomly distributed and sufficiently few in number, they have NO impact on the analysis at the total or group level. Automated analysis and human analysis should be compared in terms of the stories and messages, not in terms of the components. Human analysis misses most of the data (e.g. from 10 million comments we might analyse 1,000); computers misallocate some of the individual comments, but they look at all of them. When we compare two cakes, we don’t compare the baking process, we compare the end product – the same should be true for text analytics.
Your thoughts and suggestions?
What would you add, remove, or change from this post?