348-x Exit Surveys unicode error

Mission Link: https://app.dataquest.io/m/348/guided-project%3A-clean-and-analyze-employee-exit-surveys

Your Code: tafe_survey = pd.read_csv('tafe_survey.csv')

Error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 13: invalid start byte

Question: For this project we need to read in two data sets. One of them (DETE) is read just fine; the other one (TAFE) gives me a unicode error. I’ve tried a couple of other common codecs, but none have worked, and I haven’t found encoding info about the data set. I tried redownloading the data and the same error occurs. Any ideas on why the error is occurring (since it doesn’t seem like it has for others) and how to resolve it would be greatly appreciated!

Hey, Granny.

If you download the files we provided in the app, it should work fine. It’s possible the author of the course made some modifications to the dataset, leading to this inconsistency of behavior. I’ll let Julie clarify this. @juliechipko

If you want to work with the original dataset, the encoding is cp1252 (also known as Windows-1252). Here’s how I found out:

>>> import chardet
>>> file = "tafe-employee-exit-survey-access-database-december-2013.csv"
>>> with open(file, "rb") as f:
...     data = f.read()
... 
>>> chardet.detect(data)
{'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}

The function chardet.detect takes as input a bytes object that represents the file (hence the use of the mode rb) and returns a dictionary with the most likely encoding and how confident it is that the encoding is correct.

The confidence value is there because the function uses heuristics to determine the encoding; the result is a best guess, not a sure thing.
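To see why the original error mentions byte 0xa0, here is a minimal sketch using only the standard library. The sample bytes are made up for illustration (they are not taken from the TAFE file); the only thing borrowed from the thread is the byte value in the error message.

```python
# Hypothetical sample: ASCII text with the byte 0xa0 from the error message.
raw = b"Record\xa0ID"

# 0xa0 cannot start a UTF-8 sequence, so decoding as UTF-8 fails.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as exc:
    utf8_ok = False
    print("utf-8 failed:", exc.reason)

# In cp1252 the same byte maps to U+00A0 (non-breaking space).
text = raw.decode("cp1252")
print(repr(text))
```

Note that cp1252 assigns a character to almost every byte value, so a successful decode on its own does not prove the guess is right. That is part of why chardet reports a confidence.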


Hey Bruno, thanks for demonstrating how to figure out the most likely encoding! This was really helpful.

~granny


Thanks Bruno! You are right - this dataset was modified for this mission. I’ve made a note to address this in the course. Thank you!


Hi, I’m also having issues with this. Any chance that you could provide the files you used in the Learn section?

You should be able to download them through the interface. Did you run into any issues?

Also, that shouldn’t matter for this particular problem. What issue are you running into precisely?


The interface is through Jupyter and doesn’t contain the files. One of them in particular, the Tafe survey cannot be read.

Thank you!


I know I’m bringing up an old thread, but it’s relevant to me and I am having similar issues.

Even if you do know the encoding of the file you’re trying to import, what do you need to do after that?

  1. Open the CSV and just specify its encoding? Will this ensure that the data is read correctly (i.e., through the standard convention, UTF-8)?

    tafe = pd.read_csv('tafe_survey.csv', encoding='cp1252')
    
  2. Temporarily open the CSV file in binary mode and assign it to a variable?

    with open('tafe_survey.csv', 'rb') as f:
        tafe = pd.read_csv(f)
    
  3. Rewrite the file’s encoding to a desired encoding type (UTF-8) [using Python]? (not sure how to do this in Python as my attempts have failed - but wouldn’t you read in the csv, decode and then loop through line by line and rewrite the csv(??))

  4. Convert the encoding of the file in something like notepad/excel/etc.? (I tried this and it didn’t actually work)
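For option 3, one possible approach is sketched below, assuming the source file really is cp1252. The helper name `reencode` and the sample file contents are hypothetical, made up for this demo. The idea is to open the source in text mode with its actual encoding and write the same text back out as UTF-8.

```python
import os
import tempfile


def reencode(src_path, dst_path, src_encoding="cp1252", dst_encoding="utf-8"):
    """Hypothetical helper: copy a text file while changing its encoding."""
    # newline="" keeps the original line endings untouched on both sides.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        for line in src:
            dst.write(line)


# Demo on a throwaway file with made-up cp1252 content.
tmp_dir = tempfile.mkdtemp()
src_file = os.path.join(tmp_dir, "sample_cp1252.csv")
dst_file = os.path.join(tmp_dir, "sample_utf8.csv")
with open(src_file, "wb") as f:
    f.write(b"Record\xa0ID,Reason\r\n1,Caf\xe9\r\n")

reencode(src_file, dst_file)

# Read the result back as raw bytes to inspect the new encoding.
with open(dst_file, "rb") as f:
    converted = f.read()
print(converted)
```

Rewriting the file is optional. Once text is decoded it is an ordinary Python `str`, so passing `encoding='cp1252'` to `pd.read_csv` (option 1) should be enough if the goal is only to load the data.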

Also, my attempts to rewrite or convert the CSV file’s encoding don’t seem to work. The method I have been using to check the encoding constantly lists the 2 CSV files as having a `cp1252` encoding; see the code below:

    with open('tafe_survey.csv') as f:
        print(f)

output:

<_io.TextIOWrapper name='tafe_employee_exit_survey_2013.csv' mode='r' encoding='cp1252'>
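A likely explanation for that output: the `encoding` shown on the file object is the one Python was told to use (or the platform default when none is given), not something detected from the file's contents. The sketch below illustrates this with a made-up sample file: the object reports `utf-8` even though the bytes are not valid UTF-8.

```python
import os
import tempfile

# Throwaway file containing cp1252 bytes that are not valid UTF-8.
tmp_path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(tmp_path, "wb") as f:
    f.write(b"Record\xa0ID\n")

# The file object reports the encoding it was asked to use,
# regardless of what the bytes on disk actually are.
with open(tmp_path, encoding="utf-8") as f:
    reported = f.encoding
    try:
        f.read()
        read_ok = True
    except UnicodeDecodeError:
        read_ok = False

print(reported, read_ok)
```

So printing the file object says nothing about how the file was saved. To guess the real encoding, you need to inspect the raw bytes, as `chardet.detect` does earlier in the thread.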

BQ, it’s okay to revive old threads if the post tightly concerns the old topic.

In this case, however, it seems like your questions are sufficiently different to merit a new topic. Since I also didn’t understand what you want exactly, I’ll ask you to reword your questions and ask them in a new topic.

To be more specific, I just didn’t understand your goal.

This depends on your goal.


I know this is probably not useful, but I missed your answer until now. Do you still need help with this?

Ok, understood.

I’ll reword and repost in a new post.

Kr,
BQ
