Data and sustainable development: The case for world changers in a digital world

When I graduated as a Sustainable Development Engineer in 2018, little did I imagine that my career would be drawn into the world of data. I am glad it was. Data is becoming a powerful tool to solve some of the world’s biggest problems in the 21st century. If you are an enthusiast of sustainability and development, like myself, I hope to convince you to start learning data analytics yourself. And if you are already a data professional, I hope to inspire you to start using your superpowers to help make progress on such issues.

I landed my first post-college job as an analyst at a startup based in Mexico City. I have not stopped working with data a single day since I started working there. I am lucky to be able to apply my previous knowledge in data projects related to energy and water usage efficiency, an issue of growing importance in our world. Chances are that if you are working on issues related to development or sustainability, you will also be faced with huge amounts of data at some point.

Data science has become a hot topic, present to some degree or another within any company seeking to become more competitive. Less known is the importance that data analysis has gained in projects related to sustainability and humanitarian development. Be it energy access, water availability, biodiversity conservation or even energy efficiency, it is hard to imagine any significant progress in any of these topics without reliable, available, and open data for scientists and curious minds alike to analyze. Data is also becoming incredibly useful for organizations aiming to solve some of the world’s most pressing challenges, and a great opportunity for skilled professionals to volunteer and put their abilities into practice while helping others.

It is no surprise that such initiatives are appearing in our highly digitalized world. If the banking, transport, entertainment and even manufacturing industries are now using vast amounts of data to become more competitive, why would organizations and social enterprises around the world not use it to make better progress on such important issues? Data access is critical to making better-informed decisions, to assessing human progress and the state of the world, and to designing successful public policies that can improve our response to contingencies such as the ongoing COVID-19 pandemic. But so is data literacy among aspiring world changers. The truth is that data does not hold much worth in itself until we dig into it to extract useful information which can be translated into actionable insight.

Consider this example of an analysis of energy access data using Python. Of course, this is not meant to be an exhaustive public policy analysis, but rather a demonstration of the use of data to identify valuable trends. The data set, Access to electricity by percentage of population, is openly provided by The World Bank. If you are working on anything related to sustainability or development, chances are you will spend long hours analyzing data sets such as this.

I will be using two libraries to explore this data set: pandas for the analysis and Matplotlib for visualizations. I will also add some style using Matplotlib’s ggplot style and use NumPy to perform some operations. My starting code looks like this:


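    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import matplotlib

    matplotlib.style.use('ggplot')

    # Path to the CSV file downloaded from The World Bank
    url = 'API_EG.ELC.ACCS.ZS_DS2_en_csv_v2_2257264.csv'

    # Skip the first 4 lines of the file, drop the columns that are not needed
    # and replace missing values with zeros
    data = pd.read_csv(url, skiprows=4).drop(['Country Code', 'Indicator Code', 'Indicator Name'], axis=1).fillna(0)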

Do not worry too much about the specifics; I am just importing the data from the CSV file into a DataFrame to analyze with pandas. I am also doing some cleanup, a task that you will come across incredibly often, especially when dealing with open access data sets like this. Notice that I am using the .fillna() method to replace all missing values with zeros. This will allow me to filter these values out after “melting” all the years’ data, included as different columns in the original data set, into a single column:

    # Use 'melt' to condense all the years' data into a single column
    data = data.melt(id_vars=['Country Name'], value_vars=[col for col in data][1:-1], var_name='Year', value_name='Value')

    # Select only the data with values bigger than 0
    data = data[data['Value'] > 0]
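
To see what this reshaping does in isolation, here is a minimal sketch on a tiny made-up table. The country names and values are invented for illustration and do not come from the World Bank file.

```python
import pandas as pd

# Hypothetical wide table: one row per country, one column per year.
# Values are made up; None stands for a missing reading.
wide = pd.DataFrame({
    'Country Name': ['A', 'B'],
    '2017': [50.0, None],
    '2018': [55.0, 80.0],
}).fillna(0)

# Wide to long: every (country, year) pair becomes its own row.
long = wide.melt(id_vars=['Country Name'], var_name='Year', value_name='Value')

# Drop the rows that were missing in the source (they are zeros now).
long = long[long['Value'] > 0].reset_index(drop=True)
print(long)
```

One caveat of this approach is that a genuine reading of 0% would be filtered out together with the missing values.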

I want to focus this analysis on the world’s data and the top and bottom 5 countries in the last year available in the data set, hence the use of the .max() method over the Year column. I will also add my own country, Mexico, for the sake of curiosity. Can you guess what the last four lines of code do?

    # Get the last year in the data set
    last_year = data['Year'].max()

    top_5_countries = data[(data['Year'] == last_year)].sort_values(by='Value', ascending=False).head()
    bottom_5_countries = data[data['Year'] == last_year].sort_values(by='Value', ascending=True).head()
    world_data = data[data['Country Name'] == 'World']
    mexico_data = data[data['Country Name'] == 'Mexico']
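
A detail worth knowing here: after melting, the year labels come from the column headers, so they are strings. For four-digit years the string maximum is also the latest year, which is why .max() works. The sketch below shows this on made-up data, together with nlargest() and nsmallest() as an alternative to sorting and taking the head. The names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical long-format data; names and values are invented.
data = pd.DataFrame({
    'Country Name': ['A', 'B', 'C', 'A', 'B', 'C'],
    'Year': ['2017', '2017', '2017', '2018', '2018', '2018'],
    'Value': [40.0, 90.0, 70.0, 45.0, 95.0, 72.0],
})

# Year labels are strings; for four-digit years the string maximum
# is also the most recent year.
last_year = data['Year'].max()
latest = data[data['Year'] == last_year]

# Alternative to sort_values(...).head(n): pick the n largest/smallest rows.
top_2 = latest.nlargest(2, 'Value')
bottom_2 = latest.nsmallest(2, 'Value')
print(top_2)
print(bottom_2)
```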

Now that I have found the countries that I am looking for, I can use a SQL-style trick to limit my original data set to the top 5 and bottom 5 countries, respectively. With one extra DataFrame for each group, I can do an inner join with the complete data set using the .merge() method to keep only the rows whose country appears in both sets.

    # Limit original data to the selected countries
    top_5_data = data.merge(top_5_countries, on='Country Name', how='inner')
    bottom_5_data = data.merge(bottom_5_countries, on='Country Name', how='inner')
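
The following sketch, on made-up data, shows what the inner join produces: because both DataFrames share the Year and Value columns, pandas adds the _x and _y suffixes to tell them apart. It also shows a filter with .isin() as one possible alternative that keeps the original column names; this is not the approach used in this article, just a comparison.

```python
import pandas as pd

# Hypothetical long-format data; names and values are invented.
data = pd.DataFrame({
    'Country Name': ['A', 'A', 'B', 'B', 'C', 'C'],
    'Year': ['2017', '2018'] * 3,
    'Value': [40.0, 45.0, 90.0, 95.0, 70.0, 72.0],
})

# One row per selected country (here: B and C in 2018).
selected = data[(data['Year'] == '2018') & (data['Value'] > 70)]

# Inner join on the country name: overlapping columns get _x / _y suffixes.
merged = data.merge(selected, on='Country Name', how='inner')
print(merged.columns.tolist())

# Alternative: filter by membership, which leaves the columns untouched.
filtered = data[data['Country Name'].isin(selected['Country Name'])]
print(filtered)
```

In the merged result, the _x columns hold the full time series and the _y columns repeat the selected year’s row for each country.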

Finally, I will select my columns, rename them to keep them consistent after the merge and create pivot tables using the Country Name column to create visualizations:

    # Select columns from each data set
    world_data = world_data[['Country Name', 'Year', 'Value']]
    mexico_data = mexico_data[['Country Name', 'Year', 'Value']]
    top_5_data = top_5_data[['Country Name', 'Year_x', 'Value_x']]
    bottom_5_data = bottom_5_data[['Country Name', 'Year_x', 'Value_x']]

    # Rename columns accordingly
    top_5_data = top_5_data.rename(columns={'Year_x': 'Year', 'Value_x': 'Value'})
    bottom_5_data = bottom_5_data.rename(columns={'Year_x': 'Year', 'Value_x': 'Value'})

    # Pivot data tables to create visualizations
    top_5_data_pivot = pd.pivot_table(top_5_data, values='Value', index='Year', columns='Country Name', aggfunc=np.sum).reset_index()
    bottom_5_data_pivot = pd.pivot_table(bottom_5_data, values='Value', index='Year', columns='Country Name', aggfunc=np.sum).reset_index()
    world_data_pivot = pd.pivot_table(world_data, values='Value', index='Year', columns='Country Name').reset_index()
    mexico_data_pivot = pd.pivot_table(mexico_data, values='Value', index='Year', columns='Country Name').reset_index()
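
As a rough illustration of what the pivot does, here is a minimal sketch on made-up data. It turns the long table back into one column per country, which is the shape the plotting step expects. The names and values are invented, and the string 'sum' is used as the aggregation function in place of np.sum.

```python
import pandas as pd

# Hypothetical long-format data; names and values are invented.
long = pd.DataFrame({
    'Country Name': ['A', 'A', 'B', 'B'],
    'Year': ['2017', '2018', '2017', '2018'],
    'Value': [40.0, 45.0, 90.0, 95.0],
})

# One row per year, one column per country.
pivot = pd.pivot_table(long, values='Value', index='Year',
                       columns='Country Name', aggfunc='sum').reset_index()
print(pivot)
```

Since each country has a single value per year, summing and the default aggregation (the mean) give the same result here.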

We are finally ready to create some visualizations! This is the sample code; you can just modify the reference DataFrame and titles and you will have your four plots:

    ax = world_data_pivot.plot(x='Year', legend=True, title='Energy access (%) in the world')
    ax.set_ylabel('% of population')
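
One way to avoid repeating these two lines for each plot is to wrap them in a small function. The sketch below is only a suggestion: plot_access is a hypothetical helper name, the demo table is made up, and the Agg backend is selected just so the example runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd


def plot_access(pivot, title):
    """Hypothetical helper: line plot of a pivoted table that has a 'Year' column."""
    ax = pivot.plot(x='Year', legend=True, title=title)
    ax.set_ylabel('% of population')
    return ax


# Made-up stand-in for one of the pivoted tables.
demo_pivot = pd.DataFrame({
    'Year': ['2016', '2017', '2018'],
    'Country A': [60.0, 65.0, 70.0],
})

ax = plot_access(demo_pivot, 'Energy access (%) - demo')
ax.figure.savefig('demo_plot.png')
```

With the pivoted tables from this article, the same function could be called once per table, changing only the title.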

The results look like this:
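[Figures: four line plots of electricity access (% of population) by year, for the world, Mexico, the top 5 countries and the bottom 5 countries]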

Consider the trends that we just found. In all cases, except maybe for the top 5 countries, electricity access seems to have improved since the early 1990s. This is magnificent news! By 2018, nearly 100% of the population in Mexico had access to electricity, and in the world the figure is close to 90%. This is especially good considering that in 1998 the figure was around 72.5%. While there are also considerable increases in the bottom 5 countries, there is still a lot to be done there. Only one plot seems suspiciously atypical: the top 5 countries exhibit curious patterns, different from the others. How reliable is the data in these countries? Have they sustained 100% electricity access across all the plotted years? Are the drops observed in Aruba anomalies in the reported data? The truth is that you will need specific knowledge of the issue and of the data itself to better explain such trends. But imagine how much you could enhance your capabilities by knowing how to do this yourself.
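
A first step towards answering questions like these could be to flag the years in which a series drops sharply, so that they can be checked against the source. The sketch below uses made-up values (not the actual Aruba figures), and the 5-point threshold is an arbitrary assumption.

```python
import pandas as pd

# Made-up series for a single country; these are not real figures.
series = pd.DataFrame({
    'Year': ['2014', '2015', '2016', '2017', '2018'],
    'Value': [100.0, 100.0, 93.0, 100.0, 100.0],
})

# Year-over-year change in percentage points.
series['Change'] = series['Value'].diff()

# Flag drops larger than an arbitrary threshold of 5 points.
drops = series[series['Change'] < -5]
print(drops)
```

A flagged year is only a candidate for review; deciding whether it is a reporting anomaly still requires knowledge of the data source.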

As you can see, this is as far as you will get using data analysis alone. Of course, you could dive deeper into the data with more complicated analyses, correlations, and fancy technologies. But in the end, useful insight will only come through a deep understanding of the problem. This is why we need skilled data professionals and passionate individuals with a strong interest in such issues to look for ways to deliver useful information that supports better policy and drives progress towards solving the most pressing issues in the world.

Data is our new superpower. It allows us to make better-informed decisions, better understand the state of the world and be more critical about what we hear in the news every day. For the first time, individuals have access to huge amounts of data and high computing power to perform such analyses on their own! It is our responsibility to use it correctly and to be mindful of its power to drive progress like, possibly, never before. If you are a professional data scientist, you can put your skills into practice and help address some of the world’s most pressing issues. Your abilities are highly valued and required in many organizations looking for volunteers every year! Data literacy is more important today than ever, especially if you aspire to change the world, and I hope you can see why you should seek to mature your analytics skills if that is the case. I am sure you will find this a thrilling, highly rewarding endeavor. And you will find it even more rewarding because of the great impact that you can make.

GitHub: @JaviSandoval94
LinkedIn: https://www.linkedin.com/in/javier-sandoval-bustamante/?locale=en_US

Super inspiring article, thank you and welcome to the community! Since I am currently living in Mexico (Guanajuato), it was nice to see some singled out data on my “new home.” Gracias, amigo! :smiley:

I also couldn’t help but wonder about one of your plots: Energy access (%) by country - Bottom 5 (2018). It appears that all five countries have nearly identical (positive) slopes. Does this suggest that the progress being made is due to some lurking variable(s) rather than the decisions/choices being made by these particular countries? For example, is access to/cost of new technologies driving this progress, do you think?

Hello, Mike! Thank you for your comment! This particular data refers exclusively to electricity access, so my first guess would be a wider reach of transmission and distribution grids in the last two decades in these countries. It might be interesting to look at data related to GDP and industrialization; it might be the case that an accelerated industrialization, coupled with economic growth, demands enhanced energy infrastructure, which can also help deliver electricity to a wider population. It is definitely a great question to get into! :smiley:

This was a really inspiring article. Thanks so much for sharing!

Thank you for reading!

Thank you for the article. I’m an ecologist and I love data science. I would love to work with the two things I like the most.

Thank you for reading!