Changing the encoding of a csv file

Screen Link:

My Code:

import csv
import chardet

# Detect the encoding of the original file from its first 4 bytes.
with open('kyoto_restaurants.csv', mode='rb') as file:
    first_4 = file.read(4)
    answer = chardet.detect(first_4)['encoding']
    print('Initial Encoding: ', answer)

# Read the rows using the original encoding.
with open('kyoto_restaurants.csv', encoding='UTF-16', newline='') as file:
    rows = list(csv.reader(file))

# Write the rows back out as UTF-8.
with open('kyoto_restaurants_utf8.csv', mode='w', encoding='UTF-8', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(rows)

# Check the encoding of the new file from its first 16 bytes.
with open('kyoto_restaurants_utf8.csv', mode='rb') as file:
    sample = file.read(16)
    answer = chardet.detect(sample)['encoding']
    print('After processing: ', answer)

# Read the new file back and print the first five rows.
with open('kyoto_restaurants_utf8.csv', encoding='UTF-8', newline='') as file:
    check = list(csv.reader(file))
    print(check[:5])

What I expected to happen:
I expected the program to print “After processing: UTF-8”.

What actually happened:
chardet.detect() reported the encoding of the new CSV file as ‘ascii’:

After processing:  ascii

The program prints the first five rows properly, including non-ASCII characters (katakana, kanji, etc.). I don’t get it.

You are not running chardet.detect() on the entire file.

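The 16 in file.read(16) is the number of bytes read from the file, and the first 16 bytes of this file contain only ASCII characters, so that is what chardet reports. With a larger sample that reaches the non-ASCII characters, you may see a different encoding. For example, reading 400 bytes returns utf-8.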
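To see why a short sample can look like pure ASCII, it can help to find where the first non-ASCII byte sits. Here is a minimal sketch using only the standard library; the helper name `first_non_ascii_offset` and the sample data are made up for illustration and are not taken from the actual kyoto_restaurants file.

```python
def first_non_ascii_offset(data: bytes) -> int:
    """Return the index of the first byte >= 0x80, or -1 if every byte is ASCII."""
    for index, byte in enumerate(data):
        if byte >= 0x80:
            return index
    return -1


# Hypothetical stand-in for the start of a UTF-8 CSV file:
# an ASCII header row followed by a row containing kanji.
sample = "name,area,rating\nSushi Bar,京都,4.5\n".encode("utf-8")

# The first 16 bytes are just the header, so they contain no non-ASCII byte.
print(first_non_ascii_offset(sample[:16]))

# The full sample does contain non-ASCII bytes, starting at the kanji.
print(first_non_ascii_offset(sample))
```

Any detector that is only shown the bytes before that offset has nothing to distinguish UTF-8 from plain ASCII.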

However, this is not an issue because ASCII is a subset of UTF-8. As the documentation puts it:

A string of ASCII text is also valid UTF-8 text.
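A quick way to convince yourself of this is to decode the same bytes both ways. The byte strings below are made-up examples, not data from the file in question.

```python
# Pure ASCII bytes decode to the same text under both codecs.
ascii_bytes = b"name,area,rating"
same_text = ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")
print(same_text)

# The reverse does not hold: UTF-8 bytes for kanji are not valid ASCII.
kanji_bytes = "京都".encode("utf-8")
try:
    kanji_bytes.decode("ascii")
    decodes_as_ascii = True
except UnicodeDecodeError:
    decodes_as_ascii = False
print(decodes_as_ascii)
```

So an "ascii" result for a sample of a UTF-8 file does not contradict the file being UTF-8. It only means the sample contained no bytes outside the ASCII range.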

Encodings can be complicated to understand and work with, and chardet can apparently get it wrong at times (it also appears to be no longer maintained). If you pass it the contents of the whole file, it returns utf-8.
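If the goal is only to confirm that the converted file really is UTF-8, one alternative to chardet (not what the code above does) is to strictly decode the whole file and see whether it fails. This is a rough sketch; `is_valid_utf8` is a hypothetical helper, and the temporary files stand in for the real CSV files.

```python
import os
import tempfile
from pathlib import Path


def is_valid_utf8(path) -> bool:
    """Return True if the entire file decodes as UTF-8 without errors."""
    data = Path(path).read_bytes()
    try:
        data.decode("utf-8")  # strict decoding raises on any invalid byte
    except UnicodeDecodeError:
        return False
    return True


# Demo with two throwaway files holding the same made-up row.
text = "Sushi Bar,京都,4.5\n"
with tempfile.TemporaryDirectory() as tmp:
    utf8_path = os.path.join(tmp, "utf8.csv")
    utf16_path = os.path.join(tmp, "utf16.csv")
    Path(utf8_path).write_bytes(text.encode("utf-8"))
    Path(utf16_path).write_bytes(text.encode("utf-16"))

    utf8_ok = is_valid_utf8(utf8_path)
    utf16_ok = is_valid_utf8(utf16_path)

print(utf8_ok, utf16_ok)
```

Note that this checks validity rather than guessing an encoding: a pure ASCII file also passes, which is consistent with ASCII being valid UTF-8.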
