"Spaceborn" visualizations: some interesting plot types applied to a UFO dataset 🔭

While bar charts, histograms, scatter plots, line charts, and box plots are widespread and efficient tools for displaying data and finding patterns in it, there are other plot types that are less popular but still very useful for creating excellent visualizations. In this article, we’re going to explore the following ones:

1. Stem Plot
2. Word Cloud
3. Treemap
4. Venn Diagram
5. Swarm Plot

To make our experiments with these plots more interesting, we’ll apply them to another type of lesser-known objects: those unidentified flying :flying_saucer:. For this purpose, we’ll use a Kaggle dataset of UFO sightings reported in North America from 1969 to 2019.

First, we’ll import the dataset and do some essential cleaning. The province abbreviations were sorted out based on the corresponding Wikipedia pages for the USA and Canada.

import pandas as pd
import numpy as np 

df = pd.read_csv('nuforc_reports.csv')

print('Number of UFO sightings:', len(df), '\n')
print(df.columns.tolist())

Output:

Number of UFO sightings: 88125 
    
 ['summary', 'city', 'state', 'date_time', 'shape', 'duration', 'stats', 'report_link', 'text', 'posted', 'city_latitude', 'city_longitude']  

Data cleaning:

# Leaving only the necessary columns
df = df[['city', 'state', 'date_time', 'shape', 'text']]

# Removing rows with missing values
df = df.dropna(axis=0).reset_index(drop=True)

# Fixing an abbreviation duplication issue
df['state'] = df['state'].apply(lambda x: 'QC' if x=='QB' else x)

# Creating a list of Canadian provinces
canada = ['ON','QC','AB','BC','NB','MB','NS','SK','NT','NL','YT','PE']  

# Creating new columns: `country`, `year`, `month`, and `time`
df['country'] = df['state'].apply(lambda x: 'Canada' if x in canada else 'USA')
df['year'] = df['date_time'].apply(lambda x: x[:4]).astype(int)
df['month'] = df['date_time'].apply(lambda x: x[5:7]).astype(int)
df['month'] = df['month'].replace({1: 'Jan', 2: 'Feb', 3: 'Mar', 4: 'Apr',
                                   5: 'May', 6: 'Jun',  7: 'Jul', 8: 'Aug', 
                                   9: 'Sep', 10: 'Oct', 11: 'Nov', 12: 'Dec'})
df['time'] = df['date_time'].apply(lambda x: x[-8:-6]).astype(int)

# Dropping an already used column
df = df.drop(['date_time'], axis=1)

# Dropping duplicated rows
df = df.drop_duplicates().reset_index(drop=True)

print('Number of UFO sightings after data cleaning:', len(df), '\n')
print(df.columns.tolist(), '\n')
print(df.head(3))

Output:

Number of UFO sightings after data cleaning: 79507 
    
['city', 'state', 'shape', 'text', 'country', 'year', 'month', 'time'] 
    
         city  state     shape                                        text country  year month  time    
0     Chester     VA     light    My wife was driving southeast on a fa...    USA   2019   Dec    18
1  Rocky Hill     CT    circle    I think that I may caught a UFO on th...    USA   2019   Mar    18
2      Ottawa     ON  teardrop   I was driving towards the intersection... Canada   2019   Apr     2
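The string slicing used above for the year, month, and hour relies on every date_time value having the same fixed layout. As a rough alternative sketch, assuming the values are ISO 8601 strings such as '2019-12-12T18:43:00' (a hypothetical sample, inferred from the slice positions), the same three pieces can be extracted with the standard library. The helper name split_timestamp is made up for illustration:

```python
from datetime import datetime
import calendar

def split_timestamp(date_time: str):
    """Return (year, month abbreviation, hour) from an ISO 8601 timestamp string."""
    parsed = datetime.fromisoformat(date_time)
    # calendar.month_abbr follows the current locale (English by default)
    return parsed.year, calendar.month_abbr[parsed.month], parsed.hour

# Hypothetical sample value in the assumed format
print(split_timestamp('2019-12-12T18:43:00'))
```

Unlike slicing, parsing raises a ValueError on a malformed value instead of silently producing a wrong number.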

Now we have a cleaned dataset of 79,507 UFO sightings that occurred from 1969 to 2019 inclusive in the USA and Canada. Keep in mind that the vast majority of them (96%) were reported in the USA:

round(df['country'].value_counts(normalize=True)*100)

Output:

USA       96.0
Canada     4.0
Name: country, dtype: float64

Let’s finally start our ufological experiments.

1. Stem Plot

A stem plot is a kind of modified bar plot. Indeed, it’s a good alternative to both bar plots (especially those with a lot of bars, or with bars of similar length) and pie charts, since it helps to maximize the data-ink ratio of a chart, making it more readable and comprehensible.

To create a stem plot, we can use the stem() function, or the hlines() and vlines() functions. The stem() function plots vertical lines at each x location from the baseline to y, and places a marker there.

We’ll start by creating a basic stem plot of UFO occurrences by month, adding only some common matplotlib customization. For a classical stem plot (categories along the x-axis and stems drawn vertically, called horizontal in this article), we can use either stem() or vlines() combined with plot(); the result will be the same. The second approach is given as an alternative in the commented-out code below:

import matplotlib.pyplot as plt
import seaborn as sns

# Creating a series object for UFO occurrences by month, in %
months = df['month'].value_counts(normalize=True)[['Jan', 'Feb', 'Mar', 
                                                   'Apr', 'May', 'Jun', 
                                                   'Jul', 'Aug', 'Sep', 
                                                   'Oct', 'Nov', 'Dec']]*100

# Defining a function for creating and customizing a figure in matplotlib
def create_customized_fig():
    fig, ax = plt.subplots(figsize=(12,6))
    plt.title('UFO occurrences by month, %', fontsize=27)
    plt.ylim(0,15)
    plt.xticks(fontsize=20)
    plt.yticks(fontsize=20)
    ax.tick_params(bottom=False)
    sns.despine()
    return ' '

# PLOTTING
create_customized_fig()

# Creating a stem plot
plt.stem(months.index, months) 

# ALTERNATIVE WAY TO CREATE A STEM PLOT
# plt.vlines(x=months.index, ymin=0, ymax=months)
# plt.plot(months.index, months, 'o')

plt.show()

We see that the majority of UFO sightings in the USA and Canada fall in the summer and autumn months, with a maximum of around 12% in July, while the winter and spring months show much less activity, with a minimum of around 5% in February.

There are a few optional parameters for adjusting a stem plot:

  • linefmt – a string defining the properties of the vertical lines (color or line style). The lines can be solid ('-'), dashed ('--'), dash-dot ('-.'), dotted (':'), or there can be no lines at all.
  • markerfmt – a string defining the properties of the markers at the stem heads: 'o', '*', 'D', 'v', 's', 'x', etc., including ' ' for the absence of markers.
  • basefmt – a string defining the properties of the baseline (analogous to linefmt).
  • bottom – the y-position of the baseline.

Let’s apply them to our plot:

# Creating and customizing a figure in matplotlib
create_customized_fig()

# Creating and customizing a stem plot
plt.stem(months.index, months, 
         linefmt='C2:',   # line color and style
         markerfmt='D',   
         basefmt=' ')     

plt.show()

There are also some other properties, such as linewidth and markersize, not included in the standard keyword arguments of the stem() function. To tune them, we have to create markerline, stemlines, and baseline objects:

# Creating and customizing a figure in matplotlib
create_customized_fig()

# Creating `markerline`, `stemlines`, and `baseline` objects
# with the same properties as in the code above
markerline, stemlines, baseline = plt.stem(months.index, months, 
                                           linefmt='C2:', 
                                           markerfmt='D', 
                                           basefmt=' ') 

# Advanced stem plot customization
plt.setp(markerline, markersize=10)      
plt.setp(stemlines, 'linewidth', 5)      
markerline.set_markerfacecolor('yellow') 

plt.show()

Finally, we can consider creating a vertical stem plot, i.e. one with the categories on the y-axis and the stems drawn horizontally. In matplotlib versions before 3.4, stem() draws only vertical lines, so we can’t use it here (newer versions accept orientation='horizontal'). Instead, we can use hlines() in combination with plot(), which works in any version. Apart from the required parameters y, xmin, and xmax (the last two are the beginning and the end of each line, respectively), we can also tune the optional parameters color and linestyle ('solid', 'dashed', 'dashdot', 'dotted'). In addition, the plot() function itself offers plenty of options to adjust, including colors, markers, and lines.

Let’s create a vertical stem plot for the UFO shape frequency distribution, to check whether some shapes are more common than the others:

# Creating a series of shapes and their frequencies in ascending order
shapes = df['shape'].value_counts(normalize=True, ascending=True)*100

fig, ax = plt.subplots(figsize=(12,9))

# Creating a vertical stem plot
plt.hlines(y=shapes.index, 
           xmin=0, xmax=shapes, 
           color='slateblue',
           linestyle='dotted', linewidth=5)
plt.plot(shapes, shapes.index, 
         '*', ms=17, 
         c='darkorange')

plt.title('UFO shapes by sighting frequency, %', fontsize=29)
plt.xlim(0,25)
plt.yticks(fontsize=20)
plt.xticks(fontsize=20)
ax.tick_params()
sns.despine()
plt.show()

We see that UFOs, according to their witnesses, can take a wide range of incredible forms, including diamonds, cigars, chevrons, teardrops, and crosses. By far the most frequent form (22%), however, is described as just a light, while all those highly descriptive forms were seen much more rarely.

Here a vertical stem plot looks like a better choice, since the names of the shapes are rather long, and in a horizontal plot they would have to be rotated vertically, reducing their readability.

As a reminder, for creating horizontal stem plots, we can use the similar function vlines() instead of stem(). All the parameters are the same as for hlines(), except for the “mirrored” required parameters x, ymin, and ymax.
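For readers on matplotlib 3.4 or newer, stem() itself can also draw the stems horizontally through its orientation parameter. Below is a minimal sketch with made-up shape frequencies (the labels and values are hypothetical, not taken from the dataset):

```python
import matplotlib.pyplot as plt

# Hypothetical shape frequencies in %, for illustration only
labels = ['light', 'circle', 'triangle', 'cigar']
values = [22.0, 11.0, 9.0, 2.5]
positions = list(range(len(labels)))

fig, ax = plt.subplots(figsize=(8, 4))

# orientation='horizontal' requires matplotlib 3.4 or newer
markerline, stemlines, baseline = ax.stem(positions, values,
                                          orientation='horizontal',
                                          basefmt=' ')

# Putting the category names on the y-axis
ax.set_yticks(positions)
ax.set_yticklabels(labels)

# plt.show()  # uncomment when running as a script
```

The returned markerline and stemlines objects can be customized in the same way as for the stem plots with vertical stems.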

That’s enough stem plot customization. Let’s learn something else about our alien friends.

2. Word Cloud

A word cloud is a text data visualization, where the size of each word indicates its frequency. Using it, we can find the most important words in any piece of text. In particular, this technique is applied to sentiment analysis of reviews and survey results, and for SEO keyword identification.

Let’s analyze all the descriptions of UFO sightings given by American witnesses. For this purpose, we’ll install and import the wordcloud library (installation: pip install wordcloud), and create a basic graph:

from wordcloud import WordCloud, STOPWORDS

# Gathering sighting descriptions from all American witnesses
text = ' '.join(df[df['country']=='USA'].loc[:, 'text'])

fig = plt.subplots(figsize=(10,10)) 

# Creating a basic word cloud
wordcloud = WordCloud(width=1000, height=1000, 
                      collocations=False).generate(text)

plt.title('USA collective description of UFO', fontsize=27)
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

# Saving the word cloud
wordcloud.to_file('wordcloud_usa.png')

The most common words are light, object, and sky, followed by bright, time, moving, white, red, craft, star. Among the most frequent words, there are also some uninformative ones, like one, second, saw, see, seen, looked, etc. All in all, we can assume that American witnesses mostly observed bright craft objects of white and red color, moving in the sky and emitting light.

In the word cloud above, we used the following parameters:

  • width and height – the width and height of the word cloud canvas.
  • collocations – whether to include collocations of two words. We set it to False to avoid word duplication in the resulting graph.

To add more advanced functionality and cosmetics to our word cloud, we can use the following parameters:

  • colormap – a matplotlib colormap to randomly draw colors from for each word.
  • background_color – word cloud background color.
  • stopwords – the words to be excluded from the analysis. The library already has the built-in STOPWORDS list containing some low-informative words like how, not, the, etc. This list can be supplemented with a user word list, or replaced with it.
  • prefer_horizontal – the ratio of times to try horizontal fitting as opposed to vertical. If this parameter is less than 1, the algorithm will try rotating the word if it doesn’t fit.
  • include_numbers – whether to include numbers as phrases or not (False by default).
  • random_state – a seed number used for reproducing always the same cloud.
  • min_word_length – a minimum number of letters a word must have to be included.
  • max_words – a maximum number of words to display in the word cloud.
  • min_font_size and max_font_size – minimum and maximum font sizes to be used for displaying words.
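To get a feeling for what stopwords, min_word_length, and max_words do to the underlying word counts, here is a rough sketch using only the standard library. It is not the tokenizer of the wordcloud library, just a simplified imitation; the function top_words, the small stand-in stopword set, and the sample sentence are all invented for illustration:

```python
import re
from collections import Counter

def top_words(text, stopwords=frozenset(), min_word_length=0, max_words=200):
    """Count words roughly the way a word cloud would weight them."""
    words = re.findall(r"[a-z']+", text.lower())
    kept = [w for w in words
            if w not in stopwords and len(w) >= min_word_length]
    return Counter(kept).most_common(max_words)

builtin_stopwords = {'the', 'a', 'in', 'was', 'i'}   # stand-in for STOPWORDS
user_stopwords = ['saw', 'one']

# union() returns a new set; set.update() would modify the set in place
# and return None
all_stopwords = builtin_stopwords.union(user_stopwords)

sample = 'I saw a bright light in the sky. The light was red, one red light.'
print(top_words(sample, all_stopwords, min_word_length=3, max_words=3))
```

The words that survive the filtering, together with their counts, are what determines the relative sizes in the picture.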

Armed with this new information, let’s create another, more finely tuned word cloud. The customization will include adding a colormap for the words and a background color, reducing the maximum number of words to be displayed from 200 (the default) to 100, considering only words with 3+ letters (to avoid words like u and PD), allowing more vertical words (0.85 instead of the default 0.9), excluding some uninformative words from the analysis, and ensuring the reproducibility of the word cloud.

This time, however, we’re curious to know Canadian people’s collective opinion about UFO:

# Gathering sighting descriptions from all Canadian witnesses
text = ' '.join(df[df['country']=='Canada'].loc[:, 'text'])

# Creating a user stopword list
stopwords = ['one', 'two', 'first', 'second', 'saw', 'see', 'seen', 'looked', 'looking', 'look', 'went', 'minute', 'back', 
             'noticed', 'north', 'south', 'east', 'west', 'nuforc', 'appeared', 'shape', 'side', 'witness', 'sighting', 
             'going', 'note', 'around', 'direction', 'approximately', 'still', 'away', 'across', 'seemed', 'time']

fig = plt.subplots(figsize=(10,10)) 

# Creating and customizing a word cloud
wordcloud = WordCloud(width=1000, height=1000, 
                      collocations=False,
                      colormap='cool',
                      background_color='yellow',
                      stopwords=STOPWORDS.union(stopwords),  # union() returns a new set; update() returns None
                      prefer_horizontal=0.85,
                      random_state=100,
                      max_words=100,
                      min_word_length=3).generate(text)

plt.title('Canadian collective description of UFO', fontsize=27)
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

# Saving the word cloud
wordcloud.to_file('wordcloud_canada.png')

It seems that the descriptions given by Canadian people are rather similar to those from Americans, with the addition of some other frequent words: orange, plane, night, minutes, seconds, cloud, flying, speed, sound. Hence, we can assume that Canadians witnessed bright craft objects, of white, red, or orange color, mostly at night time, moving/flying in the sky and emitting light and, probably, sound. At first, the objects looked like stars, planes, or clouds, and the whole process lasted several seconds to minutes.

The difference between Canadian and American collective descriptions can be partially explained by adding some more words to the stopword list. Or, maybe, “Canadian” aliens are really more orange, plane- or cloud-like, and noisy :grinning:

3. Treemap

A treemap is a visualization of hierarchical data as a set of nested rectangles, where the area of each rectangle is proportional to the value of the corresponding data. In other words, treemaps show what the whole data consists of, and can be a good alternative to pie charts.
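To make the "area proportional to value" idea concrete, here is a minimal sketch of the simplest possible treemap layout, where a rectangle is cut into vertical strips. This is not the squarified algorithm that the squarify library implements, and the function name, the dictionary keys, and the counts are assumptions made up for illustration:

```python
def slice_layout(sizes, x=0.0, y=0.0, width=100.0, height=100.0):
    """Split a rectangle into vertical strips with areas proportional to sizes."""
    total = sum(sizes)
    rects = []
    for size in sizes:
        strip_width = width * size / total
        rects.append({'x': x, 'y': y, 'dx': strip_width, 'dy': height})
        x += strip_width   # the next strip starts where this one ends
    return rects

counts = [60, 30, 10]   # hypothetical sighting counts
rects = slice_layout(counts)
for count, r in zip(counts, rects):
    print(count, r['dx'] * r['dy'])
```

Strips become thin and hard to read when there are many categories, which is why squarified layouts, aiming at rectangles close to squares, are normally preferred.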

Let’s find out which US states UFOs especially prefer to visit. We’ll install and import the squarify library (installation: pip install squarify), and create a basic treemap:

import squarify

# Extract the data
states = df[df['country']=='USA'].loc[:, 'state'].value_counts()

fig = plt.subplots(figsize=(12,6))

# Creating a treemap
squarify.plot(sizes=states.values, label=states.index)

plt.title('UFO sighting frequencies by state, the USA', fontsize=27)
plt.axis('off')
plt.show()


Looks like California is a real extraterrestrial base in the USA! It’s followed, at a considerable distance, by Florida, Washington, and Texas, while the District of Columbia and Puerto Rico are visited by UFOs very rarely.

The parameters sizes and label used above represent the numeric input for squarify and the corresponding label text. Other parameters that can be adjusted:

  • color – a user list of colors for the rectangles,
  • alpha – a parameter regulating color intensity,
  • pad – whether to draw rectangles with a small gap between them,
  • text_kwargs – a dictionary of keyword arguments (color, fontsize, fontweight, etc.) to tune the label text properties.

Let’s check at what hours the most and the fewest aliens were seen, and in the meantime practice the optional parameters:

import matplotlib

# Extracting the data
hours = df['time'].value_counts()

# Creating a list of colors from 2 matplotlib colormaps `Set3` and `tab20`
cmap1 = matplotlib.cm.Set3
cmap2 = matplotlib.cm.tab20
colors = []
for i in range(len(hours.index)):
    colors.append(cmap1(i))
    if cmap2(i) not in colors:
        colors.append(cmap2(i))
        
fig = plt.subplots(figsize=(12,6))

# Creating and customizing a treemap
squarify.plot(sizes=hours.values, label=hours.index,
              color=colors, alpha=0.8, 
              pad=True,
              text_kwargs={'color': 'indigo',
                           'fontsize': 20, 
                           'fontweight': 'bold'})

plt.title('UFO sighting frequencies by hour', fontsize=27)
plt.axis('off')
plt.show()

The respondents from our dataset mostly observed UFOs in the time range from 20:00 till 23:00, or, more generally, from 19:00 till midnight. The least “UFO-prone” hours are 07:00-09:00. However, this doesn’t necessarily mean a “lack of aliens” at certain hours of the day and can instead be explained more pragmatically: people usually have free time in the evening after work, while in the morning the majority of people are going to work and are a bit too immersed in their thoughts to notice interesting phenomena around them.

4. Venn Diagram

A Venn diagram shows the relationships between several datasets, where each group is displayed as an area-weighted circle, and the overlaps (if any) of the circles represent the intersection and its size between the corresponding datasets. In Python, we can use the matplotlib-venn library to create Venn diagrams for 2 or 3 datasets. For the first case, the package provides the venn2 and venn2_circles functions, for the second, correspondingly, venn3 and venn3_circles.

Let’s practice this tool on 2 subsets from our UFO dataset. For example, we want to extract the data for all cross-shaped and cigar-shaped UFO sightings (for simplicity, from now on we’ll call them crosses and cigars) that occurred in North America in the last 5 years of our dataset (from 2015 to 2019 inclusive), and check if there are any cities where both shapes were observed in that period. Let’s install and import the matplotlib-venn library (installation: pip install matplotlib-venn), and create a basic Venn diagram for crosses and cigars:

from matplotlib_venn import venn2, venn2_circles, venn3, venn3_circles

# Creating the subsets for crosses and cigars
crosses = df[(df['shape']=='cross')&(df['year']>=2015)&(df['year']<=2019)].loc[:, 'city']
cigars = df[(df['shape']=='cigar')&(df['year']>=2015)&(df['year']<=2019)].loc[:, 'city']

fig = plt.subplots(figsize=(12,8))

# Creating a Venn diagram
venn2(subsets=[set(crosses), set(cigars)], 
      set_labels=['Crosses', 'Cigars'])

plt.title('Crosses and cigars by number of cities, 2015-2019', fontsize=27)
plt.show()

Hence in the period from 2015 till 2019 inclusive, there were 18 cities in North America where both crosses and cigars were registered. In 79 cities, only crosses were observed (from these 2 shapes), in 469 – only cigars.

Now, we’re going to add one more exotic UFO shape from our collection – diamonds – and apply some customization to the Venn diagram. Earlier, we’ve already used a self-explanatory optional parameter set_labels. In addition, we can add to the venn2() and venn3() functions:

  • set_colors – a list of colors of the circles, based on which the colors of intersections will be computed,
  • alpha – a parameter regulating color intensity, 0.4 by default.

The other 2 functions – venn2_circles() and venn3_circles() – serve to adjust the circumferences of the circles using the parameters color, alpha, linestyle (or ls), and linewidth (or lw).

# Creating a subset for diamonds
diamonds = df[(df['shape']=='diamond')&(df['year']>=2015)&(df['year']<=2019)].loc[:, 'city']

# Creating a list of subsets
subsets=[set(crosses), set(cigars), set(diamonds)]

fig = plt.subplots(figsize=(15,10))

# Creating a Venn diagram for the 3 subsets
venn3(subsets=subsets, 
      set_labels=['Crosses', 'Cigars', 'Diamonds'],
      set_colors=['magenta', 'dodgerblue', 'gold'],
      alpha=0.3)

# Customizing the circumferences of the circles 
venn3_circles(subsets=subsets,
              color='darkviolet', alpha=0.9, 
              ls='dotted', lw=4)

plt.title('Crosses, cigars, and diamonds \nby number of cities, 2015-2019', fontsize=26)
plt.show()

This diagram shows that in the period of interest there were 6 cities in North America where all 3 shapes were registered, 66 cities – where only cigars and diamonds, 260 – where only diamonds, etc. Let’s check those 6 cities in common for all the 3 shapes:

print(set(crosses) & set(cigars) & set(diamonds))

Output:

{'Albuquerque', 'Rochester', 'Staten Island', 'Lakewood', 'Savannah', 'New York'}

All of them are located in the USA.

Venn diagrams can be further beautified through the get_patch_by_id() and get_label_by_id() methods. They allow us to select any of the diagram zones by its id and, for the patch, change its color (using the set_color() method) and transparency (set_alpha()), or, for the label, change the text (set_text()) and adjust its font size (set_fontsize()). The possible values of id for a two-circle Venn diagram are '10', '01', '11', for a three-circle one – '100', '010', '001', '110', '101', '011', '111'. The logic behind these values is the following:

  • the number of digits reflects the number of circles,
  • each digit represents a dataset (subset) in the order of their assignment,
  • 1 means the presence of a dataset in the zone, while 0 – the absence.

For example, '101' is related to the zone where the 1st and 3rd datasets are present, and the 2nd is absent in a three-circle diagram, i.e. to the intersection of the 1st and the 3rd circles excluding the 2nd one. In our case, it’s the crosses-diamonds intersection, which is equal to 9 cities where only these two shapes were observed in the period of interest.
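The logic of the zone ids can be reproduced with plain Python sets. The sketch below is only an illustration of the id convention, not code from the matplotlib-venn library; the function name and the tiny city sets are made up:

```python
def venn3_region_sizes(a, b, c):
    """Map each three-circle zone id to the number of elements in that zone."""
    sets = (a, b, c)
    sizes = {}
    for region_id in ('100', '010', '001', '110', '101', '011', '111'):
        members = set.union(*sets)
        for flag, s in zip(region_id, sets):
            # '1' keeps only the elements present in the set, '0' removes them
            members = members & s if flag == '1' else members - s
        sizes[region_id] = len(members)
    return sizes

# Hypothetical city sets, for illustration only
crosses = {'Albany', 'Boston', 'Denver'}
cigars = {'Boston', 'Chicago', 'Denver'}
diamonds = {'Denver', 'El Paso', 'Albany'}
print(venn3_region_sizes(crosses, cigars, diamonds))
```

Here the zone '101' contains only Albany (seen as a cross and as a diamond, but not as a cigar), while Denver, present in all three sets, belongs to '111'.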

Let’s try to change the color of the intersection zones of our Venn diagram and add short pieces of text instead of numbers to the zones representing only one shape. Furthermore, to make it more fun, let it be not just boring text, but some ASCII art symbols reflecting each shape:

fig = plt.subplots(figsize=(15,10))

# Assigning the Venn diagram to a variable
v = venn3(subsets=subsets, 
          set_labels=['Crosses', 'Cigars', 'Diamonds'],
          set_colors=['magenta', 'dodgerblue', 'gold'],
          alpha=0.3)

# Changing the color of the intersection zones
v.get_patch_by_id('111').set_color('white')
v.get_patch_by_id('110').set_color('lightgrey')
v.get_patch_by_id('101').set_color('lightgrey')
v.get_patch_by_id('011').set_color('lightgrey')

# Changing text and font size
v.get_label_by_id('100').set_text('✠')
v.get_label_by_id('100').set_fontsize(25)
v.get_label_by_id('010').set_text('(̅_̅_̅_̅(̅_̅_̅_̅_̅_̅_̅_̅_̅̅_̅()~~~')
v.get_label_by_id('010').set_fontsize(9)
v.get_label_by_id('001').set_text('♛')
v.get_label_by_id('001').set_fontsize(35)

# Customizing the circumferences of the circles
venn3_circles(subsets=subsets,
              color='darkviolet', alpha=0.9, 
              ls='dotted', lw=4)

plt.title('Crosses, cigars, and diamonds \nby number of cities, 2015-2019', fontsize=26)
plt.show()

Finally, it’s possible to adjust any of the circles separately, by assigning the result of the venn3_circles() function to a variable and then referring to the circles by index (0, 1, or 2, in the case of a three-circle Venn diagram). The methods to be used here are self-explanatory and similar to the ones discussed above: set_color(), set_edgecolor(), set_alpha(), set_ls(), and set_lw().

Let’s emphasize the circle for diamonds (well, everybody likes diamonds! :slightly_smiling_face::gem:).

##### PREVIOUS CODE #####

fig = plt.subplots(figsize=(15,10))

# Assigning the Venn diagram to a variable
v = venn3(subsets=subsets, 
          set_labels=['Crosses', 'Cigars', 'Diamonds'],
          set_colors=['magenta', 'dodgerblue', 'gold'],
          alpha=0.3)

# Changing the color of the intersection zones
v.get_patch_by_id('111').set_color('white')
v.get_patch_by_id('110').set_color('lightgrey')
v.get_patch_by_id('101').set_color('lightgrey')
v.get_patch_by_id('011').set_color('lightgrey')

# Changing text and font size
v.get_label_by_id('100').set_text('✠')
v.get_label_by_id('100').set_fontsize(25)
v.get_label_by_id('010').set_text('(̅_̅_̅_̅(̅_̅_̅_̅_̅_̅_̅_̅_̅̅_̅()~~~')
v.get_label_by_id('010').set_fontsize(9)
v.get_label_by_id('001').set_text('♛')
v.get_label_by_id('001').set_fontsize(35)

##### NEW CODE #####

# Assigning the Venn diagram circles to a variable
c = venn3_circles(subsets=subsets,
                  color='darkviolet', alpha=0.9, 
                  ls='dotted', lw=4)

# Changing the circle for diamonds by index
c[2].set_color('gold')
c[2].set_edgecolor('darkgoldenrod')
c[2].set_alpha(0.6)
c[2].set_ls('dashed')
c[2].set_lw(6)

plt.title('Crosses, cigars, and diamonds \nby number of cities, 2015-2019', fontsize=26)
plt.show()

5. Swarm Plot

While its more famous “relative” box plot is great at displaying the overall distribution statistics, and the less known violin plot describes the distribution of the data for one or several categories, the under-estimated swarm plot provides some additional information about the dataset. Namely, it gives us an idea of:

  • the sample size,
  • the overall distribution of a numeric variable across one or more categories,
  • where exactly the individual observations are located in the distribution.

The points in a swarm plot are adjusted along the categorical axis so that they are close to each other but do not overlap. Consequently, this plot works well only for a relatively small number of data points, while for larger samples violin plots are more suitable (they, on the contrary, require a sufficient number of data points to avoid misleading estimates). Also, as we’ll see soon, swarm plots are good for distinguishing individual data points from different groups (optimally, no more than 3 groups) by applying corresponding colors.
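The idea of pushing points apart along the categorical axis can be sketched with a simple greedy procedure. This is not the algorithm seaborn uses internally, only a simplified illustration; the function name, the diameter parameter, and the sample years are hypothetical:

```python
def swarm_offsets(values, diameter=1.0):
    """Greedy beeswarm: return an offset along the categorical axis per value."""
    placed = []                      # (value, offset) pairs already positioned
    offsets = [0.0] * len(values)
    order = sorted(range(len(values)), key=lambda i: values[i])
    for i in order:
        step = 0
        while True:
            # Try 0, then +d and -d, then +2d and -2d, and so on
            candidates = [0.0] if step == 0 else [step * diameter,
                                                  -step * diameter]
            # A candidate is free if the point would not overlap any placed one
            free = [c for c in candidates
                    if all((values[i] - v) ** 2 + (c - o) ** 2 >= diameter ** 2
                           for v, o in placed)]
            if free:
                offsets[i] = free[0]
                break
            step += 1
        placed.append((values[i], offsets[i]))
    return offsets

years = [2015, 2015, 2015, 2016, 2019]
print(swarm_offsets(years))
```

The three sightings from 2015 get the offsets 0, +1, and -1, so they sit next to each other instead of being drawn on top of one another. With many points per value the offsets grow quickly, which is the reason why swarm plots suit only small samples.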

A swarm plot can be a good alternative or supplement to a box plot or a violin plot.

Let’s extract a couple of relatively small subsets from our UFO dataset, create for them swarm plots, and compare them with box and violin plots. In particular, we can select one state from the USA and one from Canada, extract all the UFO sightings of conic or cylindric shapes for both, and observe the corresponding data point distribution along the years (from 1969 till 2019). From our treemap experiments, we remember that the biggest number of UFO sightings in the USA was registered in California. Let’s now find the leader in Canada:

df[df['country']=='Canada'].loc[:, 'state'].value_counts()[:3]

Output:

ON    1363
BC     451
AB     369
Name: state, dtype: int64

So, we’ll select California from the USA and Ontario from Canada as the candidates for our further plotting. First, let’s extract the subsets and create for them basic swarm plots, superimposed on the corresponding box plots for comparison:

# Creating the subsets for California and Ontario
ca_on_cylinders_cones = df[((df['state']=='CA')|(df['state']=='ON'))&\
                           ((df['shape']=='cylinder')|(df['shape']=='cone'))]

fig = plt.subplots(figsize=(12,7))
sns.set_theme(style='white')

# Creating swarm plots
sns.swarmplot(data=ca_on_cylinders_cones, 
              x='year', y='state', 
              palette=['deeppink', 'blue'])

# Creating box plots
sns.boxplot(data=ca_on_cylinders_cones, 
            x='year', y='state', 
            palette=['palegreen', 'lemonchiffon'])

plt.title('Cylinders and cones in California and Ontario', fontsize=29)
plt.xlabel('Years', fontsize=18)
plt.ylabel('States', fontsize=18)
sns.despine()
plt.show()

We can make the following observations here:

  • Since the numeric variable in question (year) is an integer, the data points are aligned.
  • Both subsets are quite different in terms of their sample size. It’s clearly seen on the swarm plots, while the box plots hide this information.
  • The Californian subset is heavily left-skewed and contains a lot of outliers.
  • None of the box plots gives us an idea about the underlying data distributions. In the case of the Californian subset, the swarm plot shows that there are a lot of conic or cylindric UFO related to the 3rd quartile of the distribution, as well as to the most recent year, 2019.
  • We definitely should add to our “wish list” the possibility to distinguish between cylinders and cones for each dataset.
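To see how a box plot decides which points count as outliers, we can compute the quartiles and the whisker limits by hand. The sketch below assumes the common 1.5 × IQR rule (the default whisker setting in seaborn) and uses a made-up list of years, not the actual Californian subset:

```python
from statistics import quantiles

# Hypothetical sighting years, for illustration only
years = sorted([1975, 1998, 2005, 2008, 2008, 2012,
                2015, 2015, 2017, 2019, 2019])

# Quartiles with linear interpolation between data points
q1, median, q3 = quantiles(years, n=4, method='inclusive')
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [y for y in years if y < lower_fence or y > upper_fence]

print(q1, median, q3, outliers)
```

With this sample, only 1975 falls below the lower limit, while the long tail towards the earlier years is what makes the distribution left-skewed.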

So, our next steps will be:

  • to exclude the outliers from the visualization and zoom it in on the x-axis,
  • to add the hue parameter to the swarm plots, to be able to display the second categorical variable (shape).

fig = plt.subplots(figsize=(12,7))

# Creating swarm plots
sns.swarmplot(data=ca_on_cylinders_cones, 
              x='year', y='state', 
              palette=['deeppink', 'blue'], 
              hue='shape')

# Creating box plots
sns.boxplot(data=ca_on_cylinders_cones, 
            x='year', y='state', 
            palette=['palegreen', 'lemonchiffon'])

plt.title('Cylinders and cones in California and Ontario', fontsize=29)
plt.xlim(1997,2020)
plt.xlabel('Years', fontsize=18)
plt.ylabel('States', fontsize=18)
plt.legend(loc='upper left', frameon=False, fontsize=15)
sns.despine()
plt.show()

Now both swarm plots show that the vast majority of UFOs in these 2 subsets are cylinders. For the Californian subset, we can distinguish the years with particularly frequent occurrences of cylindric/conic UFOs: 2008, 2015, and 2019. Moreover, in 2015, we observe an unexpected boom of cones, even though they are much rarer in general.

Let’s now put apart box plots and compare swarm and violin plots for each subset. This time, though, we’ll customize the swarm plots a bit more, using some of the parameters below:

  • order, hue_order – the order to plot the categorical variables in. If we create a swarm-box hybrid plot like above (or swarm-violin), we have to apply this order also to the second type of plot.
  • dodge – assigning it to True will separate the strips for different hue levels (if applicable) along the categorical axis.
  • marker, color, alpha, size, edgecolor, linewidth – marker style ('o' by default), color, transparency, radius (5 by default), edge color ('gray' by default), and edge width (0 by default).
  • cmap – a colormap name.

fig = plt.subplots(figsize=(12,7))

# Creating and customizing swarm plots
sns.swarmplot(data=ca_on_cylinders_cones, 
              x='year', y='state', 
              palette=['deeppink', 'blue'], 
              hue='shape',
              marker='D',              
              size = 8,
              edgecolor='green',
              linewidth = 0.8)

# Creating violin plots
sns.violinplot(data=ca_on_cylinders_cones, 
               x='year', y='state', 
               palette=['palegreen', 'lemonchiffon'])

plt.title('Cylinders and cones in California and Ontario', fontsize=29)
plt.xlim(1997,2020)
plt.xlabel('Years', fontsize=18)
plt.ylabel('States', fontsize=18)
plt.legend(loc='upper left', frameon=False, fontsize=15)
sns.despine()
plt.show()

Here we can make the following observations:

  • As it was with the box plots, the violin plots don’t reflect the sample size of both subsets.
  • The violin plots don’t distinguish between cylinders and cones.

We could resolve the last issue by creating grouped violin plots instead (using the parameters split and hue). However, given that our subsets are already rather small, splitting them to create grouped violin plots would further decrease the sample size and data density of each part, making these plots even less representative. Hence, in such cases, swarm plots look like a better choice.

Conclusion

To sum up, we’ve explored five rarely used plot types, their use cases, limitations, alternatives, customization options, and approaches to analyzing the resulting graphs. Besides, we’ve taken a brief look into the mysterious world of UFOs.

If, by any chance, there are some extraterrestrial beings reading this right now, I would like to thank them for visiting our planet every now and again. Next time, please come to my country as well; I will probably be able to visualize you better :alien::art:.

Thank you, dear reader, for your attention. I hope you enjoyed my article and found something useful in it.

38 Likes

Absolutely loved reading your article @Elena_Kosourova !

Very informative, I learned a lot and will be using these plotting techniques that seem so useful. You explain everything so well and make it easy to understand, which is something I often find difficult with reading articles that explain concepts like these. I also love that you’re using such a fun and interesting example to showcase these examples of plots. It made me curious why 2015 was such a big year for alien visits in CA.

My only 2 feedback points:

  • Maybe to add something in the beginning sort of like a table of contents just to let the reader know which types of plots you’ll be explaining. You mention you will talk about “lesser known” plots but don’t list which ones. This would be helpful so that if someone is looking for particular plot examples they will know you’ll be talking about them.

  • The conclusion was a bit short and you only mention the extraterrestrials, but your article was more so focused on how to create these lesser-known plots so I would suggest to also mention that we learned how to use each of those plots.

Overall - really great, exemplary work! I look at your projects for ways to improve my own :smile:

5 Likes

Wow, thank you so much @ywbadri for your nice words and super-helpful and detailed feedback!!! :heart_eyes: I’m very glad that I managed to share some interesting and useful information! Absolutely agree with both your suggestions, and I’ll definitely introduce them both in my article! :star_struck: Indeed, as for unusual visualizations, initially I was going to describe many more types. Then I noticed that my article was already becoming too long and decided to cut it, and probably this is also the reason for such a short conclusion :sweat_smile: Probably one day I should also write part 2 of this article, about the remaining original plot types which I left out. Thanks again for your time and cool ideas, much appreciated indeed! :heavy_heart_exclamation:

3 Likes

Oh no I only meant to add in your conclusion that you described how to use those plots. I think the article was long enough! :sweat_smile:

Glad you found my feedback useful! :blush:

2 Likes

Yes-yes, I got it, I mean I’ll probably write another article in the future on the remaining graphs that I didn’t consider in this one. This article is already rather long, I agree, better not to turn it into an encyclopedia! :joy:

1 Like

Fantastic article, Elena! Thanks for taking the time to write this. I will definitely be using this as a reference in upcoming projects!

2 Likes

Thanks a lot, Mike, your words are really encouraging for me! :star_struck: I’m very happy that you appreciated my work and found it useful, it means that I wrote some good stuff! :heavy_heart_exclamation:

2 Likes

Whilst most of the concepts you highlighted in the detailed and informative article (as averred by @ywbadri and @mathmike314 ) sound pretty much esoteric to me as a newbie (lol), I can’t but appreciate the time you put into coming up with this article - your passion can be felt in every word and illustration therein.
I look forward to a point in time when I can fully comprehend and appreciate the concepts discussed.

Great job Elena.

1 Like

Feedback on article: “Spaceborn” visualizations: some interesting plot types …

  1. Easy to follow, not too much detail, great examples!
  2. I will keep this as a bookmark - it provides me a wealth of coding tidbits which I expect to reference as I continue my learning journey! Very, very helpful!
  3. I am familiar with many different types of graphs/plots from my 35 years of Quality Engineering service. There were even a couple of new ones for me in this article.
1 Like

That’s super-cool @j.adeyemi.thomas, you’re inspiring me to write more articles that nobody understands! :grinning: Don’t worry, soon you’ll arrive there and even further, and will be able to do much more! Thanks a lot for your feedback, and happy learning!

Thank you very much, Bruce, for your positive feedback! I’m very glad that my article was helpful for you as well, with such huge experience. That’s really great and makes me feel proud of myself! :star2:

Wow, rather interesting article! :wink:

  • as always easy to read both text and code.
  • plus to plotting possibilities
  • especially I liked treemap and word cloud. Now I know where to take the logic of code :grin:
  • as for the topic of the article, it’s worth reading; it seems US Hollywood did its job and people are looking at the sky for UFOs :rofl: or aliens don’t like Canada because it’s too cold there in comparison with sunny California :alien:
2 Likes

Thanks a lot, Serhii, for your cool comments! :heavy_heart_exclamation: Well, then I understand now why aliens avoid visiting Russia as well! :grinning:

1 Like

Enjoyed the exposure to new visualisations + the smoking cigar was a pretty cool find!

The venn diagram & the treemap are the most useful.

I would’ve enjoyed & benefited from you mapping the sightings with a geo map of North America (USA) (and cross-tabulating it with LSD use? ahahah!).

Be careful though… The Truth is Out There! :alien:

4 Likes

Thank you, Burhaan, for your kind words! :heart_eyes: Great idea about mapping (I think a choropleth map would be a great choice here), and especially about the correlation with a possible explanation! :sweat_smile: :joy:

This is fantastic, @Elena_Kosourova! very enjoyable to read, I have learned a lot!

I have bookmarked this article and am sure I will be using it as a resource in the future.

2 Likes

Thank you very much @gosaints! It’s great to know that my work can be useful for other people! :star_struck:

Hey @Elena_Kosourova ,
This is one of the coolest projects out there. The term ‘ufology’ made your descriptions all the more interesting. :joy:
Everything about this project : the information, the plots, the data manipulation techniques that you’ve used are nothing short of elegant and well written !

You’ve done an amazing job, so keep up the good work ! :smile:

P.s I would love to colab on a kaggle project with you sometime :grin:

3 Likes

Great project @Elena_Kosourova! UFOs are definitely a fun subject for exploring these lesser-used visualisations. You’ve structured this article well and the code is easy to read.

One feedback point I have is perhaps to be cautious when drawing conclusions from graphs like the UFO sighting frequencies by state (USA) - California may have the most UFO sightings but it’s also the most populous state by a large margin (Florida and Texas also have high populations). Maybe it would be interesting to see the same information presented on a per-capita basis!

Super work overall!

2 Likes

Oh, I’m very happy @shubhkirti.prasad that you found my work cool! :heavy_heart_exclamation: And yes, it will be great to collaborate on some project! :grinning: In my profile here, you can find my linkedin and medium links, so let’s connect and keep in touch! :star_struck: