Issues Creating A Bar Graph

I am trying to create a bar graph whose x-axis is labelled AGE_YRS with the age ranges ['0-6', '7-12', '13-19', '20-29', '30-39', '40-49', '50-59', '60+'] and whose y-axis shows the total number of people in each age range. I am doing a project to find out how many people in the VAERS database (along with their state and vaccine lot number) suffered from major cardiovascular symptoms such as tachycardia, arrhythmias, pericarditis, myocarditis, bradycardia, palpitations and atrial fibrillation.
Here is some relevant information about the dataset (I merged 6 datasets, 3 from 2021 and 3 from 2022, using an outer join on VAERS_ID):

<class 'pandas.core.frame.DataFrame'>
Int64Index: 31463 entries, 30 to 1220849
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   VAERS_ID      31463 non-null  int64
 1   VAX_TYPE      30869 non-null  object
 2   VAX_MANU      30869 non-null  object
 3   VAX_LOT       23426 non-null  object
 4   VAX_NAME      30869 non-null  object
 5   SYMPTOM1      31463 non-null  object
 6   SYMPTOM2      29181 non-null  object
 7   SYMPTOM3      25211 non-null  object
 8   SYMPTOM4      20567 non-null  object
 9   SYMPTOM5      15715 non-null  object
 10  STATE         28475 non-null  object
 11  AGE_YRS       30285 non-null  float64
 12  SEX           31463 non-null  object
 13  SYMPTOM_TEXT  31463 non-null  object
dtypes: float64(1), int64(1), object(12)
memory usage: 3.6+ MB

This is the code I used:

# Take 2: Create a simple bar graph sectioning age group to the total number of people who experienced major cardiovascular symptoms
# Import Matplotlib
import matplotlib.pyplot as plt

# Create a Bar Chart
plt.bar(x=covid_vaers_1['AGE_YRS'], height = covid_vaers_1['Count'], color = 'roygbivr')
plt.xlabel('AGE (years)')
plt.ylabel('Count')

# Set x-axis values
plt.xticks(x, ['0-6', '7-12', '13-19', '20-29', '30-39', '40-49', '50-59', '60+'])

# Add a Title
plt.title('Cardiovascular Adverse Events Distribution Across All Ages')
plt.show()

This is my error message:

KeyError Traceback (most recent call last)
~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
3360 try:
-> 3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:

~\anaconda3\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

~\anaconda3\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'Count'

The above exception was the direct cause of the following exception:

KeyError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_8156/2596297499.py in <module>
4
5 # Create a Bar Chart
----> 6 plt.bar(x=covid_vaers_1['AGE_YRS'], height = covid_vaers_1['Count'], color = 'roygbivr')
7 plt.xlabel('AGE (years)')
8 plt.ylabel('Count')

~\anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
3456 if self.columns.nlevels > 1:
3457 return self._getitem_multilevel(key)
-> 3458 indexer = self.columns.get_loc(key)
3459 if is_integer(indexer):
3460 indexer = [indexer]

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
-> 3363 raise KeyError(key) from err
3364
3365 if is_scalar(key) and isna(key) and not self.hasnans:

KeyError: 'Count'

This leaves me with 2 problems:

  1. Do I need to change the data type of the columns? If so, how do I do that?
  2. It seems I must define the y-axis values myself. The y-axis I want is the total number (count) of people who fall into each x-axis category. How do I get that?

Hi @MfonobongAmana

For your future queries, please refer to this guide, as a properly formatted question helps the community help you better.

If I understand correctly, you would like to display a frequency distribution of the AGE_YRS column.
The core of the error is that the “Count” column does not exist in your data frame, hence the KeyError.
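To make the cause concrete, here is a minimal sketch on a tiny made-up frame. Only the column names AGE_YRS and SEX come from the post; the values and the name `df` are invented for illustration. It shows that “Count” is not a stored column, so the bar heights have to be computed first.

```python
import pandas as pd

# Made-up stand-in for covid_vaers_1; only the column names come from the post.
df = pd.DataFrame({
    "AGE_YRS": [5.0, 5.0, 34.0, 61.0, 61.0, 61.0, None],
    "SEX": ["F", "M", "F", "M", "U", "F", "M"],
})

# The frame has no "Count" column, which is why df["Count"] raises KeyError.
has_count = "Count" in df.columns

# The counts are derived: how many rows share each age (missing ages are skipped).
age_counts = df["AGE_YRS"].value_counts().sort_index()

print(has_count)
print(age_counts)
```

The resulting `age_counts` holds one number per distinct age, which is the kind of value `plt.bar` expects for `height`.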

You have several options here.

  • The easiest one is to use plt.hist and pass the desired number of intervals to the bins argument.
  • If you still wish to plot this using plt.bar, you can use the Series.value_counts() function and again pass a value for the bins argument. The bins argument divides the age column into that many equal-width intervals.
  • You can create a custom function with explicitly defined intervals and apply it to the AGE_YRS column.

Check out the notebook attached for details on the above points.
Age_Intervals.ipynb (28.3 KB)
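For readers without the notebook, the binning options can be sketched roughly as follows. This is not the notebook's code: the ages are invented, and `pd.cut` with explicit edges is swapped in here as a stand-in for the hand-written interval function. The edges are an assumption based on the ranges listed in the first post.

```python
import pandas as pd

# Invented ages for illustration; the real data would be the AGE_YRS column.
ages = pd.Series([2, 5, 9, 15, 18, 24, 33, 37, 45, 52, 58, 63, 71, 80], dtype="float64")

# Let pandas split the observed range into 4 equal-width intervals.
equal_width = ages.value_counts(bins=4, sort=False)

# Explicit edges with pd.cut; right=False makes each bin include its left
# edge and exclude its right edge, so 7.0 lands in "7-12" and not in "0-6".
edges = [0, 7, 13, 20, 30, 40, 50, 60, float("inf")]
labels = ["0-6", "7-12", "13-19", "20-29", "30-39", "40-49", "50-59", "60+"]
groups = pd.cut(ages, bins=edges, labels=labels, right=False)

# sort_index() orders the counts by category order, i.e. youngest group first.
custom = groups.value_counts().sort_index()

print(equal_width)
print(custom)
```

Either result can be passed to `plt.bar`, with the index converted to strings for the x positions and the values as heights.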


The easiest option, which I tried, is not giving me what I want. It shows far too many individual readings, which I do not want.

I want to group all individuals who are, for example, 0-6 years old as one value, 7-12 as another, and so on. I also want to separate each x-axis value by gender (M, F, and U for unknown).

This is an example of what I would like, including the totals:
https://matplotlib.org/stable/gallery/lines_bars_and_markers/barchart.html#sphx-glr-gallery-lines-bars-and-markers-barchart-py

In addition, my main issue is how to use a CSV dataset within the Matplotlib plotting code so that the ages are ordered as '0-6', '7-12', '13-18', etc. on the x-axis and male, female and unknown appear as separate bars on the graph.

I tried creating filters instead, to see where that takes me:


## Have another go making a graph showing how cardiovascular symptoms are distributed among the COVID19 vaccinated

width = 0.3

babies = covid_vaers_cardio[(covid_vaers_cardio.AGE_YRS < '7')]
kids = covid_vaers_cardio[('7' <= covid_vaers_cardio.AGE_YRS < '13')]
teens = covid_vaers_cardio[('13' <= covid_vaers_cardio.AGE_YRS < '19')]
young_adults = covid_vaers_cardio[('19' <= covid_vaers_cardio.AGE_YRS < '30')]
adults = covid_vaers_cardio[('30' <= covid_vaers_cardio.AGE_YRS < '40')]
older_adults = covid_vaers_cardio[('40' <= covid_vaers_cardio.AGE_YRS < '50')]
halfway_adults = covid_vaers_cardio[('50' <= covid_vaers_cardio.AGE_YRS < '60')]
retirement_adults = covid_vaers_cardio[('60' <= covid_vaers_cardio.AGE_YRS < '70')]
oldies = covid_vaers_cardio[(covid_vaers_cardio.AGE_YRS >= '70')]

x = ['babies', 'kids', 'teens', 'young_adults', 'adults', 'older_adults', 'halfway_adults', 'retirement_adults', 'oldies']
Male = covid_vaers_cardio[(covid_vaers_cardio.SEX == 'M')]
Female = covid_vaers_cardio[(covid_vaers_cardio.SEX == 'F')]
Unknown = covid_vaers_cardio[(covid_vaers_cardio.SEX == 'U')]

bar1 = np.arange(len(x))
bar2 = [i+width for i in bar1]
bar3 = [i+width for i in bar2]


plt.bar(bar1, Male, width, label = 'Male')
plt.bar(bar2, Female, width, label = 'Female')
plt.bar(bar3, Unknown, width, label = 'Unknown')

plt.xlabel("Age")
plt.ylabel("Total Number of Adverse Events")
plt.title("How Cardiovascular Symptoms are Distributed among the COVID19 Vaccinated by Age and Sex")
plt.xticks(bar1+width, x)
plt.legend()

plt.show()

TypeError: Invalid comparison between dtype=float64 and str

Not working at all. I really need help and assistance.

Hi @MfonobongAmana

It wasn’t mentioned in the first post that the main idea is to have a multi-category grouped bar chart for your data.

Anyway, there are several issues with the code here.

  • Let’s start with the actual error you are getting. The AGE_YRS column has a float data type, so the quotes around '7' have to go. The code below will suffice (the 7 is not inside quotes), as it compares a float column with a numeric value:

babies = covid_vaers_cardio[(covid_vaers_cardio.AGE_YRS < 7)]

kids = covid_vaers_cardio[(7 <= covid_vaers_cardio.AGE_YRS) & (covid_vaers_cardio.AGE_YRS < 13)]

The same applies to almost all of the categories you wish to create.
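As a small illustration of why the range filter needs two masks joined with `&`, here is a sketch on invented ages. The Series `ages` and its values are made up. `Series.between` is shown as an alternative spelling, and its `inclusive="left"` form assumes pandas 1.3 or newer.

```python
import pandas as pd

# Invented ages; the missing value stands in for reports with no recorded age.
ages = pd.Series([3.0, 7.0, 12.5, 13.0, 40.0, None], name="AGE_YRS")

# Two element-wise masks joined with &. The parentheses are required
# because & binds more tightly than the comparison operators.
kids_mask = (ages >= 7) & (ages < 13)

# Series.between expresses the same half-open range in one call.
kids_mask_alt = ages.between(7, 13, inclusive="left")

# The chained form (with the quotes removed) still fails on a Series:
# Python turns it into "a and b", which needs a single True/False.
try:
    ages[7 <= ages < 13]
    chained_error = None
except ValueError as err:
    chained_error = type(err).__name__

print(ages[kids_mask].tolist())
print(chained_error)
```

Rows with a missing age compare as False in both masks, so they drop out of every age filter.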

  • “Male”, “Female” and “Unknown” are not numerical values but complete data frames. plt.bar cannot plot a whole data frame as the bar heights!
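A short sketch of the difference, using an invented SEX column (the values and the names `df`, `sex_order` and `heights` are hypothetical):

```python
import pandas as pd

# Invented SEX values for illustration.
df = pd.DataFrame({"SEX": ["F", "M", "F", "U", "F", "M"]})

male_rows = df[df["SEX"] == "M"]   # a DataFrame of matching rows, not a number
male_count = len(male_rows)        # a single number that can serve as a bar height

# One count per category in a fixed order, ready to pass to plt.bar as heights.
sex_order = ["M", "F", "U"]
heights = df["SEX"].value_counts().reindex(sex_order, fill_value=0).tolist()

print(male_count)
print(heights)
```

The `reindex` call fixes the order of the bars and fills in 0 for any category that has no rows.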

Also, regarding the same post: are the string values inside your data frame and the values given in the filter the same? In Python (as in most languages), values like “COVID” and “covid” are completely different, which could be one reason for the data not matching. Please check code cell [8] in the attached Jupyter notebook for an example.
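The case-sensitivity point can be checked with a sketch like the one below. It is not the notebook's cell [8]; the values are invented, and normalizing with `str.strip().str.upper()` is just one possible approach.

```python
import pandas as pd

# Invented values showing inconsistent case and stray spaces.
raw = pd.Series(["COVID19", "covid19", " Covid19 ", "FLU"], name="VAX_TYPE")

# An exact comparison only matches the identically written value.
exact = int((raw == "COVID19").sum())

# Normalizing whitespace and case first makes the filter tolerant.
normalized = raw.str.strip().str.upper()
tolerant = int((normalized == "COVID19").sum())

print(exact, tolerant)
```

Running `raw.unique()` on the real column is a quick way to see which spellings actually occur before writing the filter.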

Can you achieve what you want using the example code from the official documentation? Yes. But it requires multiple steps and a better understanding of how the values can be grouped and how the chart is plotted.

I am not sure whether you checked out the Jupyter notebook attached earlier. I have now attached an updated version to clear up some of your doubts from both of these posts. Please break down each piece of code and understand what has been done and why.

In case this wasn’t helpful at all, please share a dummy “.csv” file with the same structure and a few records of the data you wish to work with. Also share your complete code file, either “.py” or “.ipynb”. This helps the community understand your question better and help you more efficiently.

Age_Intervals_Updated.ipynb (44.4 KB)


  1. So I understand that the AGE_YRS column is a float, and that I do not need the quotes in the code?
  2. I do not need to create multiple filters in one. I just need the number of adverse events under a specific category, like the ‘babies’ example.
  3. Male, Female and Unknown are meant to be different bars within each category (x-axis label). I have had trouble incorporating them per x-axis label.
  4. I did check out the earlier Jupyter notebook. There were 4 options. The first 2 were easy to run but not what I was looking for. The third and fourth were difficult. The 4th option required making a custom x-axis label, but there was no connection with uploaded csv files.

I will test out the jupyter notebook file you sent, and I’ll keep you posted

Hi @MfonobongAmana

  1. Yes. You can’t compare a float with a string data type, hence the error.

  2. I don’t understand this.

  3. What you want is a value list. What you were passing to the plt.bar method was a whole data frame.

  4. I generated dummy data within the notebook. There was no external data used anywhere. Not sure what “uploaded csv files” refer to.
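On point 1, a minimal sketch of the failing comparison and of converting text to numbers when a column really does hold strings. The data is invented; `pd.to_numeric(..., errors="coerce")` is one way to convert and is not something the notebook is known to use.

```python
import pandas as pd

# A float column compared with a quoted number raises TypeError.
floats = pd.Series([5.0, 34.0], name="AGE_YRS")
try:
    floats < "7"
    message = None
except TypeError as err:
    message = str(err)

# If ages had been read in as text, they could be converted first;
# anything that is not a number becomes NaN instead of raising.
ages_as_text = pd.Series(["5", "34.0", "unknown", None])
ages = pd.to_numeric(ages_as_text, errors="coerce")
under_seven = int((ages < 7).sum())

print(message)
print(ages.dtype, under_seven)
```

Since `info()` already reports AGE_YRS as float64 in this thread, no conversion is needed there; removing the quotes is enough.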

Okay. All the best!

Hello. Seems to be some progress.

This is what I did:

## Have another go making a graph showing how cardiovascular symptoms are distributed among the COVID19 vaccinated
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### define function to create custom intervals
def age_intervals(age):
    # missing ages get their own label instead of falling through to "70+"
    if pd.isna(age):
        return "Unknown"
    elif age < 7:
        return "6 or below"
    elif age < 13:
        return "7 - 12"
    elif age < 20:
        return "13 - 19"
    elif age < 30:
        return "20 - 29"
    elif age < 40:
        return "30 - 39"
    elif age < 50:
        return "40 - 49"
    elif age < 60:
        return "50 - 59"
    elif age < 70:
        return "60 - 69"
    else:
        return "70+"

### create new column with customized intervals    
covid_vaers_cardio["age_group"] = covid_vaers_cardio["AGE_YRS"].apply(age_intervals)

covid_vaers_cardio["age_group"].value_counts()

### Group the SEX and AGE_YRS together and sort the values by age_group 

covid_vaers_cardio_grouped = covid_vaers_cardio.groupby(["SEX", "age_group"], dropna = False, as_index = False).agg({"AGE_YRS" : np.size}).sort_values("age_group")

## age_groups to be used as labels
labels = covid_vaers_cardio["age_group"].unique()

print("labels:", labels)

## no. of male patients for each age_group
men_count = covid_vaers_cardio_grouped.loc[covid_vaers_cardio_grouped["SEX"] == "M", "AGE_YRS"].values

## no. of female patients for each age_group
women_count = covid_vaers_cardio_grouped.loc[covid_vaers_cardio_grouped["SEX"] == "F", "AGE_YRS"].values

## no. of unknown gender patients for each age_group
unknown_count = covid_vaers_cardio_grouped.loc[covid_vaers_cardio_grouped["SEX"] == "U", "AGE_YRS"].values

print("Men count = {}, Women count = {}, Unknown count = {}".format(men_count, women_count, unknown_count))

x = np.arange(len(labels))  ## the label locations
width = 0.4  ## the width of the bars


fig, ax = plt.subplots(figsize = (20, 12))
rects1 = ax.bar(x = x + width/2, 
                height = men_count, 
                width = width, 
                label='Males')
rects2 = ax.bar(x = x - width/2,
                height = women_count,
                width = width,
                label='Females')


# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('Total Number of Adverse Events')
ax.set_xlabel('Age Range')
ax.set_title('How Cardiovascular Symptoms are Distributed among the COVID19 Vaccinated by Age and Sex')
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.legend(bbox_to_anchor = (1.15, 1))

ax.bar_label(rects1, padding=2)
ax.bar_label(rects2, padding=2)


fig.tight_layout()

plt.show()

The only issue I have is that I want the labels arranged in ascending order of age.
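One possible way to get the age groups in ascending order is to make them an ordered categorical, so that pandas sorts by age rather than alphabetically. The sketch below assumes invented rows in place of covid_vaers_cardio and bin edges based on the ranges discussed in this thread. Using `pd.cut` plus `pd.crosstab` is a swapped-in technique, not the code from the attached notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Invented rows standing in for covid_vaers_cardio.
df = pd.DataFrame({
    "AGE_YRS": [3, 8, 15, 22, 22, 35, 47, 55, 64, 71, 71, np.nan],
    "SEX":     ["F", "M", "F", "M", "F", "U", "F", "M", "F", "M", "F", "U"],
})

# Assumed edges; right=False means each bin includes its left edge only.
edges = [0, 7, 13, 20, 30, 40, 50, 60, 70, np.inf]
labels = ["6 or below", "7 - 12", "13 - 19", "20 - 29", "30 - 39",
          "40 - 49", "50 - 59", "60 - 69", "70+"]
df["age_group"] = pd.cut(df["AGE_YRS"], bins=edges, labels=labels, right=False)

# Rows are age groups, columns are sexes; rows with a missing age are left out.
# reindex pins the order and fills 0 for empty combinations.
counts = (
    pd.crosstab(df["age_group"], df["SEX"])
    .reindex(index=labels, columns=["M", "F", "U"], fill_value=0)
)

x = np.arange(len(labels))  # one slot per age group
width = 0.27                # three bars share each slot
fig, ax = plt.subplots(figsize=(12, 6))
for i, sex in enumerate(counts.columns):
    # offsets of -width, 0 and +width centre the three bars on each tick
    bars = ax.bar(x + (i - 1) * width, counts[sex].to_numpy(), width, label=sex)
    ax.bar_label(bars, padding=2)

ax.set_xlabel("Age Range")
ax.set_ylabel("Total Number of Adverse Events")
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.legend()
fig.tight_layout()
# In a notebook, finish with plt.show().
```

Because the tick labels and the bar heights both come from the same `counts` table, they cannot drift out of step, which is the risk when labels come from `unique()` and counts come from a separately sorted group-by.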

@MfonobongAmana

That is precisely why I said the grouped plot will require a lot of customization! And I intentionally left the wrong plot in the attached Jupyter notebook to emphasize this.

If I may ask, how have you learnt, or how are you learning, Python and Matplotlib? Are you following certain tutorials or enrolled somewhere? I ask because of your new post with a pie-chart-related question.

We can discuss that too, but I would first request this info from you. Thanks.

Me, I just go straight into it. I did some Python courses before but I hated them, because I do not learn well by studying all sorts of different tools in isolation. I know the basics in isolation. Most of my data analyst learning is in Excel, SQL and R. I learn better and work faster and more efficiently by just doing projects.

Even if a lot of customization is needed, as long as I can follow the trail and understand what to do, that's not an issue.

Hey @MfonobongAmana

Okay, thanks.

Some basic understanding does help in figuring out what perhaps may be causing the error and how to debug that.