What You Can Learn About a Popular Book by Looking at its Word Clouds

A word cloud is a curious technique for creating eye-catching and intuitively understandable text data visualizations, where the size of each word reflects how often it appears in the text. We have full control over the text input, which can (and should) be cleaned beforehand to obtain the most meaningful result, and there are plenty of parameters to tune for improving the aesthetics and readability of the resulting visualization. In addition, we can create a word cloud based not on word frequency but on another attribute assigned to each word. For instance, having a dictionary of film titles, it’s possible to assign to each title the year of its release, and then display this data.
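As a quick illustration of this last idea, here is a minimal sketch with made-up film titles and release years (the films dictionary is hypothetical), using the generate_from_frequencies() method that we’ll also use later in this article:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Hypothetical data: film titles and their release years
films = {'Casablanca': 1942, 'Psycho': 1960, 'Alien': 1979, 'Titanic': 1997}

# Here the font size of each title reflects its release year rather than any frequency
wordcloud = WordCloud(background_color='white', random_state=1).generate_from_frequencies(films)
plt.imshow(wordcloud)
plt.axis('off')
plt.show()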

Let’s combine business with pleasure and take a look at the mysterious content of “The Little Prince” by creating word clouds for it. Seventy-eight years after its first publication, this tiny book still attracts a lot of interest, and numerous wise phrases from it have become world-famous aphorisms:

  • The stars are lit up so that each of us can find his own.
  • Anything essential is invisible to the eyes.
  • You become responsible forever for what you’ve tamed.

Many people love “The Little Prince”, while others find it too confusing and misleading (you might check this article), but the fact is that nobody is indifferent to it.

So, can we unriddle some of the mysteries of this story using word clouds? Let’s try and see.

Web Scraping the Text

First, we’ll scrape the full book’s text from this website:

import requests
from bs4 import BeautifulSoup

url = 'https://englishonline.vn/learn-english-through-story-★-the-little-prince-by-antoine-de-saint-exupery/'

# This code was used for obtaining the initial text to see its structure and define the patterns in it
# soup = BeautifulSoup(requests.get(url).content,"lxml")
# soup

# The chapters are grouped into 5 tabs on the page; only the first tab carries the "active" class
classes = ["panel active entry-content"] + ["panel entry-content"] * 4
ids = ['1-2-3-4-5', '6-7-8-9-10', '11-12-13-14-15', '16-17-18-19-20-21', '22-23-24-25-26-27']

# Downloading and parsing the page once instead of re-fetching it for every tab
soup = BeautifulSoup(requests.get(url).content, "lxml")

text = ''
for j in range(5):
    div = soup.find_all('div', class_=classes[j], id="tab_chapter-"+ids[j])
    # Extracting the chapter text between the known patterns and stripping the html tags
    div_str = str(div).split("Chapter "+ids[j].split('-')[0]+"</strong></p><p>")[1]\
    .split("</p></div>]")[0].replace('<p>', ' ').replace('</p>', ' ').replace('\xa0', ' ')\
    .replace('<strong>', ' ').replace('</strong>', ' ')
    text += ' ' + div_str

# Normalizing the whitespace
text = ' '.join(text.split())
    
print('Text beginning:', 2*'\n', text[:300], '\n')
print('Text end:', 2*'\n', text[-300:])

Output:

Text beginning: 
    
Once when I was six I saw a magnificent picture in a book about the jungle, called True Stories. It showed a boa constrictor swallowing a wild beast. Here is a copy of the picture. In the book it said: “Boa constrictors swallow their prey whole, without chewing. Afterward they are no longer able to  
    
Text end: 
    
I beg you not to hurry past. Wait a little, just under the star! Then if a child comes to you, if he laughs, if he has golden hair, if he doesn’t answer your questions, you’ll know who he is. If this should happen, be kind! Don’t let me go on being so sad: send word immediately that he’s come back…

Creating a Basic Word Cloud

Now, we’ll create a word cloud from this text using the wordcloud library for Python (installation: pip install wordcloud). At this step, we won’t do any data preparation or plot adjustment; we’ll only use the library’s built-in STOPWORDS list to filter out auxiliary words. Let’s see if we can get some insights from it:

from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(12,10))
wordcloud = WordCloud(random_state=111).generate(text)
plt.title('The most frequent words in "The Little Prince"', fontsize=27)
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

[Word cloud: the most frequent words in “The Little Prince”]

We can make the following observations:

  • The most frequent word combination is little prince (what a surprise! :grinning:). Other frequent combinations are prince said, grown ups, and little fellow.
  • The word prince occurs twice on the word cloud, while little appears as many as three times.
  • Other popular words are planet, flower, sheep, star, and king, which is also quite expected. The little prince visited several planets, including the one with a solitary king on it, and loved the unique flower on his own planet. He received a sheep in a box as a present from the narrator. In the story, there was a businessman thinking that he possessed all the stars in the world, and also there was philosophical reasoning about looking at the stars and thinking of creatures who are far away from us, maybe on another planet.
  • We can distinguish some other main characters: fox, geographer, lamplighter, snake, baobab.
  • The most frequent cognitive verbs: know, asked, answered, understand. The little prince was always asking a lot of questions while ignoring the ones addressed to himself.
  • Despite using the STOPWORDS list, we see a lot of auxiliary and low-informative words: will, ll, s, m, come, nothing, never, make, etc.

Adjusting the Stopword List and Fixing the Word Cloud

It seems that the STOPWORDS list is not perfect and needs to be expanded further. For this purpose, we have to do some manipulations on the text:

  • Remove all the punctuation symbols, including some specific ones for this text.
  • Count the number of times each word occurs in the text, ignoring case.
  • Order the words by their frequency in descending order.
  • Manually select the words to exclude from the next word cloud.
import pandas as pd
import re, string 

def clean_text(text, chars):
    '''Removing unnecessary symbols from a text'''
    # re.escape ensures the regex special characters in `chars` are treated literally
    return re.sub('[%s]' % re.escape(chars), ' ', text)

text = clean_text(text, string.punctuation+string.digits)\
               .replace('−−', ' ').replace('–',' ').replace('…',' ')\
               .replace('“',' ').replace('”',' ')\
               .replace('‘',' ').replace('’',' ')

def convert_str_to_lst(text_str):
    '''Converting a string into a list of lower case words'''
    # Splitting on a single space keeps empty strings where several spaces meet,
    # which is why '' shows up in the frequency preview below
    return text_str.lower().split(" ")

# Counting the number of times each word occurs in the text
word_list = convert_str_to_lst(text)
s = pd.Series(word_list)
freq = s.value_counts()

print('Number of unique words in "The Little Prince":\t', s.nunique(), '\n')
print('Preview: ', freq.index.tolist()[:50])

Output:

Number of unique words in "The Little Prince":	 1976 
    
Preview:  ['', 'the', 'i', 'a', 'to', 'and', 'you', 'of', 'he', 'little', 'it', 'that', 's', 'prince', 'was', 'said', 'my', 'but', 'for', 't', 'in', 'me', 'is', 'on', 'one', 'be', 'his', 'they', 'have', 'at', 'all', 'what', 'are', 'if', 'so', 'as', 'then', 'had', 'll', 'this', 'planet', 'no', 'him', 'very', 'not', 'like', 'with', 'm', 'them', 'there']

Now that we have a list of all the words in descending order of frequency, we can manually select the words to ignore in the next visualization. We’ll focus on auxiliary words, adding to them the main characters and cognitive verbs from the previous graph, since we’re trying to dig deeper now and find new popular words. Besides, we’ll exclude words such as chapter (referring to the book’s chapters), numbers written out like seven, and a few low-informative but frequent words like time and people. While selecting the words to exclude, we don’t have to check the whole descending list of words but only its first, say, 500 items. All in all, we’re going to display at most the next 150 most frequent words in our visualization.
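For such a manual review, it’s handy to print word-count pairs from the top of the freq series we built earlier; a minimal sketch:

# Reviewing only the ~500 most frequent words when picking extra stopwords;
# printing the counts alongside makes the manual selection easier
for word, count in freq.head(500).items():
    print(word, count)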

This time, we’ll also fix some other issues related to the previous graph:

  • The combination of the default background color and the default color palette for the words is not great and hurts the overall readability. Indeed, some of the words were almost indistinguishable not because of their size but because of their color. Even though the little prince himself said that “anything essential is invisible to the eyes”, let’s not listen to him in this case and use the colormap and background_color parameters to improve the graph readability :wink:
  • Earlier, we saw word combinations like little prince and little fellow. By default, two-word collocations are included in the word cloud, alongside the same words taken separately. This can be useful if we want to keep some reference to the context of the original text, especially if this text is long and we aren’t familiar with it. Let’s set the collocations parameter to False this time to avoid word duplication in the resulting graph.
  • Using the stopwords parameter, we’ll update the built-in STOPWORDS list with our findings.
  • The prefer_horizontal parameter sets the ratio of times to try horizontal fitting over vertical. Its default value is 0.9, meaning that the algorithm will try rotating a word vertically if it doesn’t fit horizontally. Let’s set it to 1 to exclude vertical words.
  • The random_state parameter takes a seed number for reproducing always the same word cloud.
  • Also, we’ll set the width and height of the word cloud canvas (width and height), display at most 150 words instead of the default 200 (max_words), and only those with at least 3 letters (min_word_length).
stopwords = ['the', 'and', 'you', 'little', 'that', 'prince', 'was', 'said', 'but', 'for', 'one', 
             'his', 'they', 'have', 'all', 'what', 'are', 'then', 'this', 'planet', 'him', 'very',
             'not', 'them', 'there', 'when', 'flower', 'from', 'can', 'your', 'she',  'never', 
             'who', 'stars', 'out', 'know', 'too', 'sheep', 'time', 'would', 'where', 'just',
             'again', 'don', 'about', 'asked', 'here', 'only', 'king', 'made', 'nothing', 'more',
             'which', 'answered', 'how', 'hundred', 'come', 'were', 'course', 'people', 'now',
             'will', 'her', 'way', 'been', 'man', 'make', 'day', 'didn', 'quite', 'over',
             'understand', 'chapter', 'well', 'two', 'much', 'himself', 'first', 'does', 'could',
             'since', 'even', 'why', 'go', 'because', 'down', 'those', 'back', 'put', 'thousand',
             'told', 'has', 'five', 'into', 'anything', 'own', 'three', 'once', 'any', 'get',
             'other', 'friend', 'yes', 'than', 'same', 'couldn', 'four', 'must', 'another', 'let',
             'away', 'something', 'six', 'tell', 'myself', 'flowers', 'went', 'still', 'say',
             'always', 'thing', 'million', 'did', 'their', 'around', 'without', 'after', 'should',
             'sometimes', 'being', 'twenty', 'having', 'anyone', 'star', 'came', 'won', 'yet',
             'things', 'enough', 'already', 'some', 'far', 'whole', 'seven', 'really', 'isn',
             'haven', 'doesn', 'everything', 'yourself', 'next', 'ever', 'under', 'also', 'such',
             'might', 'others', 'going', 'soon', 'each', 'before', 'someone', 'thirty', 'whatever',
             'shall', 'somewhere']

fig, ax = plt.subplots(figsize=(12,10))

# Extending the built-in STOPWORDS set with our custom list
# (note that set.update() returns None, so it can't be passed inline)
all_stopwords = STOPWORDS.union(stopwords)

wordcloud = WordCloud(width=1000, height=700,
                      collocations=False,
                      stopwords=all_stopwords,
                      colormap='Dark2',
                      background_color='white',
                      random_state=100,
                      prefer_horizontal=1,
                      min_word_length=3,
                      max_words=150).generate(text)
plt.text(x=0, y=-70, s='The most frequent words in "The Little Prince"', fontsize=29)
plt.text(x=300, y=-30, s='(digging deeper)', fontsize=29)
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

We discovered a new layer of frequent words in the book. Among them, the word good is the most popular one, mainly deriving from the numerous greetings that the little prince exchanged with the other characters. Since we cannot see the two-word collocations anymore, and given that the word morning is also rather frequent, we can guess that he greeted the majority of characters in the morning. The word grown is now separated from ups as well, so we can conclude that it was sometimes mentioned in the text on its own. The characters that were hardly distinguishable on the previous graph are now displayed much better, especially the fox, whom the little prince tamed and felt responsible for. We can also notice some new characters: the businessman, the drunkard, the switchman, and the vain (referring to the vain man). In addition, we see the volcanoes from the little prince’s tiny planet, the desert on Earth where he landed, and the rose, the little prince’s flower. The word drawing refers to the few drawings the narrator said he had made in his life; a couple of them represent a boa constrictor that swallowed an entire elephant and, seen from outside, looked like a hat.

There are still a lot of words like see and look that don’t give us any additional information. We could continue filtering out such words and discovering more and more story-specific words and characters, but… what for? Either we are familiar with the story, and then we already know its characters and general context, or we aren’t, and then the words we find can be interpreted in almost any way. At most, in the first case, we can rank the characters by their popularity.

Creating a Two-Dimensional Word Cloud

If we want to get the most from a word cloud, we have to know in advance what exactly we’re trying to investigate, prepare the data properly, and, possibly, add one more dimension to the information to be displayed by introducing colors for each category.

For example, “The Little Prince” is usually considered to be quite a sad book, and there is even an opinion that it’s actually a war story. Let’s see if using a word cloud can give us some insights about it. For this purpose, we’ll do the following:

  • Manually selecting all the words with a clearly positive or negative meaning and distributing them into 2 separate lists. Since there are not so many unique words in this story (1976, as we saw earlier), it’s still feasible to use the whole text. Otherwise, we could consider searching for such words among the most popular ones, with a certain cut-off by frequency (see the sketch after this list).
  • Combining the words in both lists into distinct groups where possible. We can use both cognate words (admire, admiration, admirer, admirers) and synonyms (ashamed, abashed, embarrassed).
  • Selecting in each group a representative word to display in the word cloud, giving priority to the most typically used one. For example, in the group abandoned-forsaken, it will be abandoned.
  • Counting the number of times each word occurs in the text. For the groups of words, the frequencies of all words inside each group are summed up and assigned to the representative word of the group.
  • Creating separate dictionaries for positive and negative words and then merging them.
  • Creating a word cloud. This time, we have to define a function for coloring positive and negative words differently (unfortunately, there is no easier way for this task, at least for now). Let’s use an intuitively comprehensible “temperature” template for it: red for positive and blue for negative words.
  • Using not the text but the merged dictionary as a direct input, unlike in the previous visualizations, via the generate_from_frequencies() method.
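As for the frequency cut-off mentioned in the first bullet, it would be a one-liner with our freq series; the threshold of 5 below is purely hypothetical:

# Keeping only the words that occur at least 5 times (a hypothetical cut-off)
frequent_words = freq[freq >= 5].index.tolist()
print('Words above the cut-off:', len(frequent_words))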

One more remark about data preparation: since the word good is both positive and by far the most frequent one among the emotionally charged words, let’s exclude from further analysis its low-informative occurrences: the ones from the numerous greetings (26 occurrences, by manual counting).
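By the way, the manual counting can be roughly cross-checked in code; the set of greeting phrases below is my assumption, so the result is worth verifying against the actual text:

# A rough cross-check of the manually counted greetings containing "good";
# the list of phrases is an assumption and may not cover every case
greetings = ['good morning', 'good evening', 'good night', 'good afternoon']
print(sum(text.lower().count(g) for g in greetings))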

# Manually selected positive and negative words
positive = ['good', 'laughed', 'love', 'glad', 'beautiful', 'laugh', 'great', 'funny', 'happy',
            'proud', 'interesting', 'useful', 'best', 'responsible', 'reasonable', 'laughing',
            'admire', 'treasure', 'lovely', 'consoled', 'better', 'satisfied', 'magnificent',
            'blossoming', 'excited', 'liked', 'rich', 'wonderful', 'amazed', 'sweetly', 'fun',
            'perfumed', 'admiration', 'fresh', 'tenderness', 'loved', 'politely', 'pretty', 
            'majestic', 'smile', 'smiled', 'beauty', 'faithful', 'favorable', 'sweet', 'fonder',
            'intelligent', 'modest', 'enjoy', 'truly', 'gently', 'loves', 'harmless', 'gift',
            'prettiest', 'improved', 'pleasure', 'veritable', 'feast', 'happier', 'respectfully',
            'adornment', 'forgiveness', 'charming', 'smiles', 'pleasant', 'majestically', 
            'successful', 'amusing', 'praise', 'witty', 'courage', 'entertainment', 'sweetness',
            'richest', 'thanks', 'admirer', 'laughs', 'finest', 'greatest', 'trust', 'laughter',
            'succeeding', 'miraculous', 'convenient', 'loveliest', 'fortunately', 'peace', 
            'discreet', 'thank', 'nice', 'advantage', 'patient', 'elegant', 'indulgent', 
            'respected', 'warming', 'interested', 'loyalty', 'carefree', 'likes', 'goodwill',
            'handsomest', 'admirers', 'masterpiece', 'luck', 'innocent', 'pride', 'blessed', 'wise',
            'succeed', 'perfect', 'modestly', 'splendid', 'fine', 'entertaining', 'kindly', 'lucky',
            'inspired', 'attractive', 'delightful', 'inspiring']

negative = ['sad', 'bad', 'death', 'lost', 'lonely', 'trouble', 'danger', 'wrong', 'thirst', 
            'annoyed', 'fault', 'sadly', 'absurd', 'dying', 'frightened', 'die', 'silly', 'crash',
            'neglected', 'unfortunately', 'complicated', 'jammed', 'lazy', 'terrible', 'ashamed',
            'tired', 'tears', 'scared', 'afraid', 'ridiculous', 'nasty', 'discouraged', 
            'threatened', 'fear', 'problem', 'horror', 'imminent', 'worried', 'crying', 'timidly',
            'harm', 'suffer', 'weep', 'bored', 'unhappy', 'tolerate', 'sorry', 'cry', 'anxious',
            'regret', 'inconsequential', 'sobbing', 'complained', 'clumsy', 'tragedy', 'isolated',
            'irreparable', 'pain', 'disobey', 'beg', 'grave', 'tire', 'disobedience', 'condemn',
            'saddest', 'condemning', 'troublesome', 'exhausting', 'dictator', 'grumpily', 'forced',
            'insubordination', 'groans', 'reproaches', 'false', 'exhaustion', 'twinge', 'waste',
            'bother', 'intimidates', 'fatigue', 'terribly', 'feverish', 'rejected', 'uncomfortable',
            'wearies', 'sick', 'suffering', 'hardest', 'aggrieve', 'regretted', 'disasters',
            'exasperated', 'humiliated', 'revolver', 'fat', 'inflict', 'guns', 'boasted', 'painful',
            'bothered', 'disturbed', 'wept', 'misfortunes', 'impatiently', 'weed', 'humiliate',
            'failure', 'homesick', 'catastrophe', 'shipwrecked', 'touchy', 'violation', 'mistakes',
            'monotony', 'untidy', 'bitterness', 'misunderstandings', 'fade', 'mistrust', 'crossly',
            'puzzled', 'dangerous', 'disheartened', 'abandoned', 'unfair', 'worst', 'problems',
            'confuse', 'weeping', 'lack', 'annoying', 'kill', 'groaned', 'difficulties',
            'regretting', 'tedious', 'humbled', 'tormenting', 'abashed', 'war', 'ugly', 'blind',
            'irritating', 'shock', 'gloomy', 'embarrassed', 'rheumatism', 'pour', 'mercilessly',
            'drama', 'grief', 'trifles', 'retorted', 'failed', 'killed', 'depression',
            'disappointed', 'remorse', 'shot', 'despised', 'broken', 'pressing', 'dead', 'vanity',
            'critic', 'frightening', 'bewildered', 'hampers', 'weeds', 'monotonous', 'forsaken',
            'rage', 'hunger', 'infested']

# Creating distinct groups inside each list
positive_groups = [
    ['admire', 'admiration', 'admirer', 'admirers'], ['beautiful', 'beauty', 'handsomest'], 
    ['better', 'best'], ['entertaining', 'entertainment'], ['loyalty', 'faithful', 'trust'],
    ['fine', 'finest', 'nice'], ['fun', 'funny', 'amusing'], ['great', 'greatest'],
    ['happy', 'happier', 'glad'], ['innocent', 'harmless'], ['inspiring', 'inspired'],
    ['interesting', 'interested'], ['laugh', 'laughed', 'laughing', 'laughs', 'laughter'],
    ['liked', 'likes'], ['love', 'loved', 'loves'], ['lovely', 'loveliest', 'delightful', 'charming'],
    ['luck', 'lucky'], ['majestic', 'majestically'], ['modest', 'modestly', 'discreet'],
    ['pleasure', 'pleasant'], ['pretty', 'prettiest'], ['pride', 'proud'], 
    ['respected', 'respectfully'], ['rich', 'richest'], ['smile', 'smiled', 'smiles'],
    ['splendid', 'magnificent', 'wonderful'], ['succeed', 'succeeding', 'successful'],
    ['sweet', 'sweetly', 'sweetness'], ['tenderness', 'fonder'], ['thank', 'thanks'],
    ['truly', 'veritable']
]
negative_groups = [
    ['abandoned', 'forsaken'], ['annoying', 'annoyed', 'irritating'],
    ['ashamed', 'abashed', 'embarrassed'], ['bored', 'tedious', 'wearies'],
    ['bother', 'bothered', 'disturbed'], ['condemn', 'condemning'],
    ['confuse', 'bewildered', 'puzzled', 'shock'], ['crash', 'catastrophe', 'shipwrecked'],
    ['crossly', 'grumpily'], ['cry', 'crying', 'weep', 'weeping', 'wept', 'tears', 'sobbing'],
    ['danger', 'dangerous'], ['death', 'dead', 'die', 'dying'], ['discouraged', 'disheartened'],
    ['disobey', 'disobedience', 'insubordination'], ['exhaustion', 'exhausting'],
    ['failure', 'failed'], ['fear', 'scared', 'frightening', 'frightened', 'afraid'],
    ['grief', 'tragedy', 'disasters', 'misfortunes'], ['groans', 'groaned'], ['guns', 'revolver'],
    ['humiliate', 'humiliated', 'humbled'], ['kill', 'killed', 'shot'], ['lonely', 'isolated'],
    ['monotonous', 'monotony'], ['pain', 'painful'], ['problem', 'problems', 'difficulties'],
    ['rage', 'exasperated'], ['regret', 'regretted', 'regretting'], ['remorse', 'twinge'],
    ['sad', 'saddest', 'sadly'], ['sick', 'rheumatism'], ['suffer', 'suffering'], 
    ['terrible', 'terribly'], ['tire', 'tired', 'fatigue'], ['trouble', 'troublesome'], 
    ['threatened', 'intimidates'], ['vanity', 'boasted'], ['weed', 'weeds'], ['worried', 'anxious']
]

# Creating a dictionary from the earlier created `freq` series
dct = freq.to_dict()

# Excluding all the occurrences of `good` used in greetings
dct['good']-=26

def find_group_freq(group_list):
    '''Takes a list of lists; in each sub-list, for every word starting from the
       second one that is present in the dictionary `dct`, adds its frequency to
       that of the representative (first) word and removes the word from `dct`.
    '''
    for lst in group_list:
        for word in lst[1:]:
            if word in dct:
                dct[lst[0]] += dct[word]
                # Deleting only inside the check avoids a KeyError on absent words
                del dct[word]
    return dct

# Summing up the word frequencies inside each group of positive and negative words
# and updating the dictionary
find_group_freq(positive_groups)
find_group_freq(negative_groups)

# Leaving only the representative or stand-alone words for each group in both lists 
positive = list(set(positive) & set(dct.keys()))
negative = list(set(negative) & set(dct.keys()))
print('Number of positive words: ', len(positive))
print('Number of negative words: ', len(negative), '\n')

# Creating two separate dictionaries for positive and negative words and then merging them
positive_freq = {k: dct[k] for k in dct.keys() & set(positive)}
negative_freq = {k: dct[k] for k in dct.keys() & set(negative)}
merged = {**positive_freq, **negative_freq}

def color_text(word, font_size, position, orientation, font_path, random_state):
    '''Coloring the words of a word cloud according to their presence in `positive_freq`
       or `negative_freq` using a "temperature" template
    '''
    if word in positive_freq:
        return 'orangered'
    else:
        return 'royalblue'

# Creating a word cloud from the merged dictionary and coloring positive and negative words differently
fig, ax = plt.subplots(figsize=(10,8))
wordcloud = WordCloud(width=1000, height=700, 
                      color_func=color_text,
                      background_color='white', 
                      random_state=1,
                      prefer_horizontal=1).generate_from_frequencies(merged)
plt.text(x=0, y=-40, s='Positive vs. Negative words in "The Little Prince"', fontsize=26)
plt.imshow(wordcloud)
plt.axis('off')
plt.tight_layout(pad=0)
plt.show()

Output:

Number of positive words:  73
Number of negative words:  110 

So we know that our story is widely considered a sad one and, indeed, it contains 1.5 times as many unique negative words (excluding synonyms and cognate words) as positive ones. However, now we also see that among all the emotionally charged words of the book, by far the most frequent ones are those with a positive meaning: good (even after removing all the occurrences of “good morning” etc.), laugh (the famous melodic laughter of the little prince, sounding like many little bells or like laughing stars), and happy, followed by love, fun, and beautiful. Among the negative words, the most frequent ones are sad, death, fear, and cry.

Removing the Noise

Another interesting observation here is that the smallest words seem to be mostly the negative ones. Let’s remove from our word cloud the words that occur in the text only once:

# Creating new dictionaries for positive and negative words, excluding those that occur only once
positive_freq_new = {k: v for k, v in positive_freq.items() if v > 1}
negative_freq_new = {k: v for k, v in negative_freq.items() if v > 1}
merged_new = {**positive_freq_new, **negative_freq_new}
print('Number of positive words occurring more than once: ', len(positive_freq_new))
print('Number of negative words occurring more than once: ', len(negative_freq_new), '\n')

# Creating a word cloud from the new merged dictionary
fig, ax = plt.subplots(figsize=(10,8))
wordcloud = WordCloud(width=1000, height=700, 
                      color_func=color_text,
                      background_color='white', 
                      random_state=1,
                      prefer_horizontal=1).generate_from_frequencies(merged_new)
plt.text(x=0, y=-70, s='Positive vs. Negative words in "The Little Prince"', fontsize=26)
plt.text(x=230, y=-30, s='(occurring more than once)', fontsize=26)
plt.imshow(wordcloud)
plt.axis('off')
plt.tight_layout(pad=0)
plt.show()

Output:

Number of positive words occurring more than once:  48
Number of negative words occurring more than once:  60 

The last word cloud looks much less “noisy” and, as we assumed, we lost mostly negative words. Indeed, this word cloud seems redder.

We have to be aware of some potential issues and limitations when creating a word cloud:

  • It requires proper data preparation, which is sometimes manual, iterative, and time-consuming. The noise has to be filtered out; otherwise, we’ll just obtain a mixture of obvious and low-informative words.
  • It lacks context, so it can be prone to misinterpretation. Indeed, the word constrictor that we saw in our second word cloud could easily be related to a muscle rather than a snake. Also, word clouds don’t perceive the difference between “happy” and “not happy”, so it’s always a good idea to cross-check such words in their real context (and I checked them as well), especially the most frequent ones. In fairness, it must be said that any text-mining tool suffers to some extent from the problem of context.
  • Unlike bar plots, a word cloud doesn’t allow a clear ranking of the words. We can distinguish the most frequent word, then the second one, the third, maybe the fourth. Then everything becomes more difficult.
  • A word cloud lacks a quantitative approach: we cannot translate a font size to a precise value of the word frequency.
  • If we use a continuous matplotlib colormap for our word cloud, like inferno, cool, etc., we should remember that the color for each word will be selected from it randomly. In this case, we shouldn’t expect any continuity or graduality of colors as an additional indicator of word frequency. However, there are workarounds to obtain this effect anyway, by creating user functions (see the sketch after this list) or using stylecloud.
  • With word clouds, there can be an optical illusion that longer words, or words with ascenders (like k, b, l) or descenders (like j, q, g), seem bigger and hence more important than words of the same frequency without such features.
  • Having many vertical words (assigning lower values to the prefer_horizontal parameter) or using masked word clouds (based on a particular shape) reduces the graph readability.
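To illustrate the user-function workaround mentioned in the list above, here is a minimal sketch; the make_freq_color_func helper is my own naming rather than part of the wordcloud API, and it simply maps each word’s frequency onto a continuous colormap:

import matplotlib.colors as mcolors
import matplotlib.pyplot as plt

def make_freq_color_func(frequencies, colormap='inferno'):
    '''Builds a color_func that maps each word's frequency onto a continuous
       colormap, so that the color gradient follows the word importance'''
    norm = mcolors.Normalize(vmin=min(frequencies.values()), vmax=max(frequencies.values()))
    cmap = plt.get_cmap(colormap)
    def color_func(word, font_size, position, orientation, font_path, random_state):
        # Converting the RGBA tuple returned by the colormap into an 'rgb(...)' string
        r, g, b, _ = cmap(norm(frequencies[word]))
        return 'rgb({}, {}, {})'.format(int(r * 255), int(g * 255), int(b * 255))
    return color_func

# Possible usage with the `merged` dictionary created earlier:
# wordcloud = WordCloud(color_func=make_freq_color_func(merged),
#                       background_color='white').generate_from_frequencies(merged)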

Conclusion

Keeping in mind all the points above, as well as the results of our experiments, we can conclude that the word cloud technique is a good choice for creating beautiful and colorful visualizations for qualitative text data analysis, intuitively comprehensible by a large audience even without additional annotations. Indeed, there aren’t many other tools (at least for now) specialized in displaying text data, and those that exist are more complicated to use and much harder for people to understand. The contextual issue can be partly fixed by using collocations. To add another dimension to our graphs, we can use colors for categorizing words according to some of their properties. Finally, using the generate_from_frequencies() method, we can create a word cloud based on any other attribute assigned to each word.

As for “The Little Prince”, we refreshed our memory of some of its main characters and objects. More importantly, using word clouds helped us find out that, even being a sad book in general, the story still transmits many positive emotions and feelings, such as happiness, laughter, love, and beauty.

Thanks for reading!


Another brilliant article, Elena. Surprised at how much is involved with word clouds. Thanks for breaking down the variations in using this method of visualisation.


Thanks a lot, @Achi! :star_struck: And yes, even though the resulting visualizations look cool, there is a lot of preliminary cleaning and adjusting to do if we want to receive a really meaningful one. Otherwise, we risk obtaining a cloud overloaded with “this”, “him”, “set”, “each” and other super-useful words! :joy:
