Hello!
I was automating the process of getting data from a website and pushing it to Google Sheets and encountered a problem and I want to share the solution I found.
The website was returning a very nested JSON that was very difficult to handle even with pandas.json_normalize(). This function returned a nice dataframe but there was only one problem: one of the columns contained a list of dictionaries from which I needed just one value from each dictionary… The example is below:
[{'value': {'files': [{'mimetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
'perma_link': None,
'hosted_by': 'podio',
'description': None,
'hosted_by_humanized_name': 'Podio',
'size': 43477,
'thumbnail_link': None,
'link': 'www.test.com',
'file_id': 1356087,
'external_file_id': None,
'link_target': '_self',
'name': 'file.docx'}],
'sharefile_vault_url': None,
'space': {'sharefile_vault_url': None,
'name': 'AIESEC Italy',
'url': 'www.test.com',
'url_label': 'crm-igcdp-test',
'item_accounting_info': None,
'org_id': 845436,
'type': 'regular',
'space_id': 43396326},
'app_item_id': 15253,
'app': {'status': 'active',
'sharefile_vault_url': None,
'name': User Selection',
'space_id': 43396326,
'default_view_id': None,
'url_add': 'www.test.com',
'icon_id': 192,
'link_add': 'www.test.com',
'app_id': 169234471,
'current_revision': 44,
'is_default': False,
'item_name': 'User',
'link': 'www.test.com',
'url': 'www.test.com',
'url_label': 'user-selection',
'config': {'item_name': 'User',
'icon_id': 192,
'type': 'standard',
'name': 'User Selection',
'icon': '192.png'},
'item_accounting_info': None,
'icon': '192.png'},
'title': 'Marta Wilson',
'initial_revision': {'item_revision_id': 4312859392,
'created_via': {'url': None,
'auth_client_id': 18,
'display': True,
'name': 'Webform',
'id': 18},
'created_by': {'user_id': -1,
'name': 'User Selection',
'url': 'www.test.com',
'type': 'app',
'image': None,
'avatar_type': 'icon',
'avatar': -1,
'id': 163457471,
'avatar_id': 192,
'last_seen_on': None},
'created_on': '2020-09-29 21:00:24',
'user': {'user_id': -1,
'name': 'User Selection',
'url': 'www.test.com',
'type': 'app',
'image': None,
'avatar_type': 'icon',
'avatar': -1,
'id': 163245471,
'avatar_id': 192,
'last_seen_on': None},
'type': 'creation',
'revision': 0},
'created_via': {'url': None,
'auth_client_id': 18,
'display': True,
'name': 'Webform',
'id': 18},
'created_by': {'user_id': -1,
'name': 'User Selection',
'url': 'www.test.com',
'type': 'app',
'image': None,
'avatar_type': 'icon',
'avatar': -1,
'id': 162354471,
'avatar_id': 192,
'last_seen_on': None},
'created_on': '2020-09-29 21:18:24',
'link': 'www.test.com',
'item_id': 1823412172,
'sharefile_vault_folder_id': None,
'revision': 8}}
]
This list could contain multiple dictionaries and I need just this value: ‘title’: ‘Marta Wilson’ (fake name) from each of the dictionaries. Hence I had to somehow loop through this list (which could be variable in length).
And here is my solution:
def extract_names(row):
names = []
for i in range(len(row)):
names.append(row[i]["value"]["title"])
return names
I then applied this function to the columns:
df_norm["User"] = df_norm["User"].apply(lambda row: extract_names(row) if np.all(pd.notnull(row)) else row)
The if...else
statements is very important here because otherwise, the function will throw a TypeError
:
TypeError: object of type 'float' has no len()
because NaN
values are “float” which length cannot be measured.
Hope it can be useful to somebody:) Check out my article on how I automated a mundane process in a non-profit with Python.
Happy coding