UnicodeDecodeError when using pandas.read_csv on crunchbase-investments.csv

Screen Link: 167-1

After looking through the solution book, it appears the correct encoding for this CSV is ISO-8859-1.
The problem here is that we are given no way to figure this out on our own!
The only way the student can resolve this issue is by looking at the solution.
The character encoding lessons from the Programming Concepts in Python class taught us to use chardet to detect encoding. I’ve attempted to determine the encoding using chardet, and it detects ASCII with the exception of one problem line.
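
One way to pin down the problem line without relying on chardet's confidence score is to try decoding each raw line as ASCII and record the ones that fail. The following is only a minimal sketch: `find_non_ascii_lines` is a hypothetical helper name, and the demo rows are made up, not taken from crunchbase-investments.csv.

```python
import os
import tempfile


def find_non_ascii_lines(path):
    """Return (line_number, offending_bytes) for each line that is not pure ASCII."""
    hits = []
    with open(path, 'rb') as f:
        for line_no, raw in enumerate(f, start=1):
            try:
                raw.decode('ascii')
            except UnicodeDecodeError:
                # Keep only the bytes outside the 7-bit ASCII range
                hits.append((line_no, bytes(b for b in raw if b > 0x7F)))
    return hits


# Demo on a small throwaway file (made-up rows, not the real dataset)
with tempfile.NamedTemporaryFile('wb', suffix='.csv', delete=False) as tmp:
    tmp.write(b'name,city\n')
    tmp.write(b'Acme,Boston\n')
    tmp.write('Caf\u00e9 Labs,Paris\n'.encode('iso-8859-1'))
    demo_path = tmp.name

demo_hits = find_non_ascii_lines(demo_path)
os.remove(demo_path)
print(demo_hits)  # [(3, b'\xe9')]
```

Calling `find_non_ascii_lines('crunchbase-investments.csv')` should list the line numbers and raw bytes that make the ASCII decode fail, which gives something concrete to check candidate encodings against.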

My Code:
What triggered the encoding issue:

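import pandas as pd
from IPython.display import display

CHUNKSIZE = 5000  # placeholder value; the post does not show how CHUNKSIZE was set
chunk_iter = pd.read_csv('crunchbase-investments.csv', encoding='ASCII', chunksize=CHUNKSIZE)

col_missing = []
col_mem = []
err_chunks = []
for i, ci in enumerate(chunk_iter):
    # Note: the UnicodeDecodeError is raised while pandas reads and parses the file,
    # which happens outside this try block, so the except clause below does not catch it.
    try:
        # DataFrame.items() replaces iteritems(), which was removed in pandas 2.0
        col_missing.append({l: c.isna().sum() for l, c in ci.items()})
        col_mem.append(ci.memory_usage(deep=True))
    except UnicodeDecodeError:
        err_chunks.append(i)
total_mem = [cm.sum() for cm in col_mem]
display(total_mem)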

To detect the encoding issue:

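import pandas as pd
import chardet
from IPython.display import display
from itertools import islice

with open('crunchbase-investments.csv', 'rb') as f:
    for i in range(10000):
        # Read the next 5 lines as raw bytes; f.read(5) would read 5 bytes, not 5 lines
        block = b''.join(islice(f, 5))
        if not block:
            break  # reached the end of the file
        encoding = chardet.detect(block)
        if encoding['confidence'] != 1.0:
            detect = "Detected unusual encoding between lines {} and {}:"
            print(detect.format(i * 5, (i + 1) * 5))
            print(encoding)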
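
For a raw line that fails to decode as ASCII, a rough way to compare candidate encodings is to trial-decode the bytes with each one and look at the result. This is a sketch under assumptions: `try_encodings` is a hypothetical helper, the candidate list is my own choice, and the sample bytes are made up rather than copied from the file.

```python
def try_encodings(raw, candidates=('ascii', 'utf-8', 'cp1252', 'iso-8859-1')):
    """Map each candidate encoding to the decoded text, or None if decoding fails."""
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# Made-up sample: 0xE9 is "é" in ISO-8859-1 and cp1252, but invalid here as ASCII or UTF-8
sample = b'Caf\xe9 Labs'
demo_results = try_encodings(sample)
for name, text in demo_results.items():
    print(name, '->', text)
```

Keep in mind that ISO-8859-1 assigns a character to every possible byte, so decoding with it never raises an error. A successful decode alone does not prove it is the right encoding; the decoded text still has to be read and judged for plausibility.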