How do we know the encoding to use with pd.read_csv? chardet doesn't help

Guided Project: Analyzing Startup Fundraising Deals from Crunchbase
Screen Link:

Trying to read some rows from the CSV file (pandas defaults to UTF-8) didn’t work; it throws the UnicodeDecodeError below.
My Code:

fivek_rows = pd.read_csv(csvfile, nrows=5000)

What actually happened:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8e in position 6: invalid start byte

I tried using chardet by reading in 128 bytes to detect the encoding, but the result seems unreliable: it returns ascii, and that still didn’t work.

After some guesswork and googling, I used encoding='ISO-8859-1' and that worked.

How can we definitively find and apply the right encoding in cases like this?


Apparently, correctly detecting the encoding with certainty is impossible; the best you can do is make an educated guess.

Using chardet, as you already tried, is one approach that can help, but of course it’s not a guarantee.

Trying out some of the more common encodings is the recommended option:

  • UTF-8
  • Latin-1 (also known as ISO-8859-1)
  • Windows-1251

That’s correct; as I mentioned, I did use encoding='ISO-8859-1' and that worked.
But that seems to me like shooting in the dark, no?

Hey Durga, it is very difficult to determine exactly which encoding a file uses. As @the_doctor said, files tend to be encoded in one of these:

  • UTF-8
  • Latin-1 (also known as ISO-8859-1)
  • Windows-1251

Most of the time the choice comes down to UTF-8 or Latin-1; Windows-1251 is a Cyrillic encoding, so it’s unlikely for English-language data.

'crunchbase-investments.csv' is encoded in Mac OS Roman, which Python calls 'mac_roman'.

No other character set that I know of correctly decodes 0x8e as é. With this encoding we correctly get the names of companies like MiaSolé, Qualtré, and Furnésh.