11 Cool Names in Data Science

When we meet a person with an unusual name, or come across an intriguing book or film title or an ingenious company name, we immediately get curious about what lies behind it. In this sense, data science, with its rich selection of creative names begging to be deciphered, gives us plenty of scope for investigation. Let’s play Sherlock Holmes and try to figure out the meanings of some of them.

Python and Anaconda

Python is a general-purpose programming language and the most popular one in data science. It’s characterized by a simple, intuitive syntax and a relatively small core language complemented by a large number of extensions (libraries). To understand its main principles, we can refer to The Zen of Python, a collection of guidelines such as the following:

  • Explicit is better than implicit.
  • Simple is better than complex.
  • Complex is better than complicated.
  • Readability counts.
  • Special cases aren’t special enough to break the rules.
  • Errors should never pass silently.
  • There should be one — and preferably only one — obvious way to do it.
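
The full list ships with the interpreter itself as an Easter egg. Here is a minimal sketch of how to read it from code. It assumes CPython, where the `this` module keeps the text ROT13-encoded in an attribute called `s`; that attribute is an implementation detail rather than a documented API.

```python
import codecs
import this  # importing the module prints the whole Zen of Python once

# Assumption: CPython stores the text ROT13-encoded in `this.s`.
zen = codecs.decode(this.s, "rot_13")

# Check that one of the guidelines quoted above is really in there.
print("Readability counts." in zen)  # True
```

In an interactive session, typing `import this` alone is enough to see the text.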

Why was this language named after a snake? The answer is that it wasn’t. When Python’s creator, Guido van Rossum, was developing it, he was a big fan of Monty Python’s Flying Circus, a BBC sketch comedy television show created by a British troupe called Monty Python (or, for short, the Pythons) and popular in the 1970s. Van Rossum wanted the name of his language to be funny, short, unique, and slightly mysterious, so he decided that Python would be the best choice.

Monty Python’s theme is also reflected in Python’s metasyntactic variables: placeholder words used in code examples that are meant to be replaced with real names. Instead of the traditional foo and bar, Python examples often use spam, ham, and eggs, a reference to Spam, one of the show’s sketches.
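
To make this concrete, here is a small illustrative snippet. The function `pair_up` is a hypothetical helper invented for this example; the point is only that spam, eggs, and ham carry no meaning of their own and stand in for whatever real names your code would use.

```python
def pair_up(spam, eggs):
    """Zip two sequences into a dict; the parameter names are deliberately meaningless."""
    return dict(zip(spam, eggs))


# `ham` is just another placeholder name for the result.
ham = pair_up(["a", "b"], [1, 2])
print(ham)  # {'a': 1, 'b': 2}
```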

Later, Python inspired the name of Anaconda, a distribution of the Python and R programming languages for data science and machine learning that aims to simplify package management and runs on various operating systems. It comes with Anaconda Navigator, its own command line, over 250 preinstalled data science packages, and the option to install over 7,500 additional open-source packages. Here the name is an evident allusion to the “snaky” name of Python: even before founding Anaconda Inc. in 2012, the team pioneered the use of this language for data science, and it remains their main focus. Both the color and the pattern of the company logo confirm it, clearly resembling a green snake’s skin.

R

R is a programming language and free open-source environment providing a great variety of statistical and graphical techniques. It’s easily extensible through additional packages and functions and stands out for its high-quality plots, including mathematical symbols and formulae, and the ability to add dynamic and interactive graphics. R is the second most popular language in data science after Python.

In 1993, Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, announced an alternative and completely independent implementation of the S programming language, which two years later was officially released as a new language called R. Its laconic name derives from the first names of its creators and is partly a play on the name of S.

Jupyter

Jupyter is a free open-source project and community, spun off from the IPython Project in 2014, that supports interactive data science and scientific computing across all programming languages. For this purpose, Jupyter provides numerous kernels (execution environments) for running code in different programming languages, and it has developed the following products:

  • Jupyter Notebook — a web application for creating and sharing notebooks that contain code, plots, and text,
  • JupyterHub — a multi-user server for Jupyter Notebooks,
  • JupyterLab — a new flexible user interface for Jupyter Notebooks released in 2018.

At first glance, the name of the project looks like a reference to the biggest planet of the Solar System (well, let’s not be picky and ignore the spelling error :slightly_smiling_face:). In reality, however, it comes from the 3 core programming languages supported by the project: Julia, Python, and R. There is a bit of astronomical context, though: the logo of Jupyter was inspired by Galileo Galilei’s notes from 1610 with observations of Jupiter and its moons.

Pandas

Pandas is a free open-source Python library for data manipulation and analysis covering a vast variety of operations: importing data from different file formats, reading, writing, cleaning, subsetting, aggregating, merging, reshaping, and slicing data, handling missing values, etc. It’s particularly powerful when working with numerical tables and time series. Since it’s a fast, multi-purpose, flexible, and very efficient tool, it’s not surprising that pandas is one of the major Python libraries for data science, one that appears in practically any top-5 list.
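
A few of these operations can be sketched in a handful of lines. The tiny tables below (`sales`, `regions`) and their values are made up purely for illustration; the snippet assumes pandas is installed.

```python
import pandas as pd

# Hypothetical sales table with one missing value.
sales = pd.DataFrame({
    "city": ["Rome", "Rome", "Oslo", "Oslo"],
    "units": [10, None, 7, 5],
})

# Handling missing data: treat the missing count as zero.
sales["units"] = sales["units"].fillna(0)

# Aggregating: total units per city.
totals = sales.groupby("city")["units"].sum()

# Merging: attach a region label to each city.
regions = pd.DataFrame({"city": ["Rome", "Oslo"], "region": ["South", "North"]})
merged = totals.reset_index().merge(regions, on="city")

print(merged)
```

Each step returns a new pandas object, so the operations chain naturally from cleaning to aggregation to merging.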

Logically, the first association with the name of this library is the cute Chinese animals. Wikipedia gives us 2 other explanations, though:

  • Pandas’ name stands for the term panel data, used in statistics and econometrics to refer to two- or multi-dimensional data collected by measurements over time and some other dimensions (if any) for the same individuals.
  • It’s an abbreviation for Python Data Analysis.

However, it seems that the real explanation is the first one, according to the library’s creator Wes McKinney. In his article pandas: a Foundational Python Library for Data Analysis and Statistics (thanks a lot, @Bruno, for sharing it and for your suggestion :wink:) he states:

The library’s name derives from panel data, a common term for multidimensional data sets encountered in statistics and econometrics

Definitely, this source of information is more convincing here.

As we see, Python’s pandas has nothing to do with panda bears. It’s just a coincidence, but a nice one :smiling_face_with_three_hearts:

Koalas

The Koalas library is much less famous than its friend (or friends?) pandas. It’s an implementation of the pandas API built on Apache Spark, first released in 2019 by Databricks with the main goal of combining the flexibility and intuitive syntax of pandas with the functionality and distributed nature of Spark, which is designed to work fast with particularly big datasets. In practice, Koalas took:

  • scalability, dataframe and query implementation from Spark,
  • all methods and functions, dataframe mutability, and indexing system from pandas,

making it easier for data scientists to switch efficiently and smoothly from processing relatively small datasets to very large ones while using the familiar pandas syntax and keeping a single codebase for both pandas and Spark. In the Koalas documentation, we can find the so-called 10 minutes to Koalas tutorial, which demonstrates how closely Koalas syntax matches that of pandas, allowing code to be converted from one library to the other almost immediately.

Koalas is still in a transitional state of development: the majority of pandas methods and functions are already included, while some other features are yet to be ported and tuned.

In the case of this library’s name, there are no secrets: it’s simply a play on pandas, since koalas and pandas are two species that people always love but often confuse. It’s interesting, though, that Koalas is written with a capital letter, as opposed to pandas.

Seaborn

Seaborn is a free open-source Python visualization library built on top of matplotlib and designed to work with pandas dataframes and NumPy arrays. It stands out for its rich gallery of common (and less common) plot types, simple and comprehensible syntax, and wide choice of themes, styles, and colors. All these features make it possible to create well-designed and informative statistical graphics for further data analysis.

As for the origin of its name, it turned out to be a tough nut to crack. It isn’t mentioned in either the documentation or Wikipedia, and in general, I didn’t find any trace of it elsewhere. Moreover, the word itself doesn’t seem particularly informative, let alone related to a visualization library: born in the sea? :astonished: To satisfy my curiosity, I contacted the creator of seaborn, Michael Waskom, directly; I found his contact information in the copyright section of the library:

[Image: question]

And here is the answer:

[Image: answer]

After further investigation on Google, I found out that the television show in question is The West Wing. Indeed, seaborn was named after Samuel Norman Seaborn (by the way, this also explains seaborn’s traditional alias, sns), and on Waskom’s GitHub page we can find the other characters: Lyman, Moss, Cregg, and Ziegler.

Hence, as in the previous cases, there is no connection between the name of the seaborn library and its purpose or features.

Folium

Folium is a free open-source Python wrapper for the JavaScript library Leaflet.js, designed for creating interactive maps based on geospatial data. It’s a very efficient and easy-to-use tool that combines data manipulation in Python with visualization of the results on an interactive Leaflet map. The library has a lot of built-in tiles from OpenStreetMap, Mapbox, and Stamen, and also supports custom tiles. Among many other things, folium makes it possible to create choropleth and heat maps, pass vector/raster/HTML visualizations as markers on the map, customize those markers, add pop-up data, etc.

Both the name and the logo of folium are a clear reference to the original Leaflet.js library. As Vladimir Agafonkin, the creator of the latter, explains, he came up with this name because it reflects the simplicity and lightness of his library. In addition, another meaning of the word leaflet is flyer, and maps are often printed on flyers. As for folium, which means leaf in Latin, it seems to have inherited only the “botanical” component of the original name. Well, at least this time the library name shows a more or less direct relation to its features, after all those koalas and anacondas :slightly_smiling_face:

Theano

Theano is an open-source Python library for deep learning built on top of NumPy and one of the first in this field. Its development started in 2007 at the University of Montreal in Canada. Some of the advantages of this library are:

  • strong integration with NumPy syntax and structures including multi-dimensional arrays,
  • fast numerical computation and evaluation of mathematical expressions that can be run on the CPU or GPU,
  • speed, stability, and code optimizations,
  • a powerful system for detecting bugs and diagnosing potential issues,
  • suitability for working with large amounts of data and large neural networks of any type.

Theano can be used for creating DL models directly or through numerous libraries built on top of it, such as Keras, which makes the whole process much easier.

The name of the library derives from Theano of Crotone, the first known woman mathematician and philosopher of antiquity, whose biography is quite enigmatic and full of open questions. According to the majority of sources, she was a talented student and/or wife of Pythagoras and the daughter of Brontinus, while according to some others it was just the opposite: she was the wife of Brontinus and the daughter of Pythagoras. Based on the few works of Theano that have survived to the present day, she is believed to have worked on the golden mean and the golden rectangle.

Beautiful Soup

Beautiful Soup is a free open-source Python library designed for web scraping. Using it, we can quickly retrieve from the HTML document of a web page whatever specific content we’re interested in: tables, links, tags, images, text styles, headings, particular combinations of elements and styles, etc.

The story behind the name of this package is quite intriguing. Before its release in 2004, the existing parsers were able to scrape only well-formed XML and HTML documents. Documents with malformed markup, invalid structure or syntax, or undefined elements were called tag soups and could be parsed only by a web browser. Beautiful Soup was implemented as an HTML parser for fixing tag soups and transforming them into “beautiful soups”, despite all the issues with their structure or syntax.
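
Here is a minimal sketch of what that looks like in practice. The markup string is an invented example of tag soup with no closing tags at all, and the snippet assumes the `beautifulsoup4` package is installed and uses Python’s built-in `html.parser` backend; other backends may repair the same markup slightly differently.

```python
from bs4 import BeautifulSoup

# Invented "tag soup": none of the opened tags is ever closed.
tag_soup = '<html><body><a href="http://example.com/">link<b>bold'

# Beautiful Soup builds a navigable tree anyway, closing the tags for us.
soup = BeautifulSoup(tag_soup, "html.parser")

print(soup.a["href"])     # http://example.com/
print(soup.b.get_text())  # bold
print(soup.prettify())    # the repaired document, nicely indented
```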

However, that’s not the end of the story yet. The name of this library also directly refers to the “beautiful soup” from the song Turtle Soup, sung by the Mock Turtle, a character from Alice’s Adventures in Wonderland by Lewis Carroll. Furthermore, throughout the documentation of Beautiful Soup, we encounter plenty of other hints, such as images from the book and the use of a piece of text from the story as an example HTML document for parsing. But the most astonishing thing (and the most disguised one) is the meaning of the opening lines of the library documentation: “You didn’t write that awful page. You’re just trying to get some data out of it. Beautiful Soup is here to help”. These lines are a masterful rephrasing of the dialog between the King and the Knave, where the latter was trying to prove that he hadn’t written the letter.

As a bonus, here are 3 more curious facts about the “original” beautiful soup from the song:

  • The Turtle Soup song is, in turn, a parody of the poem Star of the Evening by James M. Sayles.
  • Mock Turtle soup existed as a real dish in the 19th century. It was considered a cheap analogue of green turtle soup, made, however, not from turtles but from calf offal.
  • The beautiful soup from Alice inspired not only the HTML parser creators but also Leon Coward, an Australian composer, who presented his lyric composition Beautiful Soup in 2014, as one of his works for Alice’s Adventures in Wonderland.

Caffe

Continuing our “culinary” topic, let’s consider Caffe, a free open-source deep learning framework written in C++, with a Python interface, originally developed at the University of California, Berkeley, and maintained by an active community of contributors. This framework is a popular choice for both academic research and industrial deep learning projects since it supports different types of neural networks and can be applied to image classification, image segmentation, speech and multimedia recognition, etc. The main features of Caffe are high processing speed, extensible code, an expressive and modular architecture for efficient creation and optimization of DL models, and excellent community support.

Although the word caffe means coffee in Italian (well, Yangqing Jia, the creator of Caffe, definitely isn’t Italian) and the logo has featured a cup of coffee since the 2nd release of the software, the name of the framework is nothing more than an abbreviation: Convolutional Architecture for Fast Feature Embedding.

Conclusion

We see now that behind some names in data science there is a story to discover: sometimes symbolic, sometimes funny, sometimes enigmatic. Alternatively, the name can be a play on an already existing one or even a random choice by its creator.

Thanks for reading!

Oh my god, I can’t believe Seaborn is named after Sam Seaborn that’s SO FREAKING COOL!!! I’m actually watching it right now!! You should tell Michael that it’s an absolutely exciting story for a west wing fan! :laughing: Love love love it!

It was a great surprise also for me, Vera! :star_struck: I was convinced this name refers somehow to whales or, at least, to seagulls, or whatever other creature related to the sea :grinning:

Thanks for writing this article and sharing it with us. Would be a shame to miss this fun trivia!

Thanks a lot! :star2: Very glad that it was appreciated! :smiling_face_with_three_hearts:

Great article, @Elena_Kosourova

Thank you @vishallbabu5! :relaxed: