UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0

Hi,

I´m triying to open a downloaded Google Ads csv for my project but i´m reiciving an error:

from csv import reader
opened_file = open('adw_kws.csv')
read_file = reader(opened_file)
kws = list(read_file)
print(kws)
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-6-b145c4bac599> in <module>
      2 opened_file = open('adw_kws.csv')
      3 read_file = reader(opened_file)
----> 4 kws = list(read_file)
      5 print(kws)

/opt/anaconda3/lib/python3.8/codecs.py in decode(self, input, final)
    320         # decode input (taking the buffer into account)
    321         data = self.buffer + input
--> 322         (result, consumed) = self._buffer_decode(data, self.errors, final)
    323         # keep undecoded input until the next call
    324         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

Anyone knows anything about it? In other cases, when I tried to open a csv (for example from Twitter) I didn´t get any error.

Thank you

Hi @javiermarzalb,

You can try to add ignoring errors to your code:

opened_file = open('adw_kws.csv', encoding='utf8', errors='ignore').  

In this case, the symbols that the encoding system fails to encode (those that most probably have issues with their format, etc.) will be just ignored, and all the other symbols will be successfully encoded.

1 Like

Hi javiermarzalb

How about providing the encoding value explicitly in the open() method

opened_file = open('adw_kws.csv', encoding='utf8')

Could you try it out?

Thanks.

1 Like

Hi, Dash! I tried before but it didn´t worked: same error :frowning:

Hi, Elena, thank you! seems to work but founded different error in the next line code:


Error                                     Traceback (most recent call last)
<ipython-input-8-cc40c2ffe1db> in <module>
      2 opened_file = open('adw_kws.csv', encoding="utf8", errors='ignore')
      3 read_file = reader(opened_file)
----> 4 kws = list(read_file)
      5 print(kws)

Error: line contains NUL
1 Like

To exclude also this error, try using a more advanced encoding system utf-16:

import csv
import codecs
opened_file = open('adw_kws.csv')
read_file = csv.reader(codecs.open('adw_kws.csv', 'rU', 'utf-16'))
kws = list(read_file)
print(kws)
1 Like

Hi @javiermarzalb

While using the errors='ignore' argument (as @Elena_Kosourova suggested) works at the first glance your results might not be as expected because the reader() function still doesn’t know how to interpret certain parts of the file. What this error tells you is that the encoding you used when opening the file, doesn’t match the actually encoding of the file itself. So instead of passing ignore to the function, figuring out the original encoding is most likely a better approach in the long run. For starters you could try latin1 (another common encoding).

DQ links to this as well in some of its courses, but if you haven’t read it, it is really worth your time

Best htw

3 Likes

@javiermarzalb

In my recent guided project, I tried the below code

import chardet

with open("file.csv", 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))
print(result)

You just need to replace the file.csv with your file path.

You will then get an output something like this

{‘encoding’: ‘Windows-1252’, ‘confidence’: 0.73, ‘language’: ‘’}

Use the encoding given in the result and it will most probably sort out your issue. I hope this helps.

3 Likes

Thank you, Elena.

I tried but something in the data is not right because I get something like this:

`[’\tCampaign1\t\t\t\t\t\t\t\tEspaña - Móvil\t"80138’, ‘87"\t"1673630’, ‘75"\t"29113’, ‘53"\t"4’, ‘8%"\t"0’, ‘36"\t\tEUR\t\t\t\t\t\t\t\t\t\t’], [’\tCampaign1\t\t\t\t\t\t\t\tEspaña - Tablet\t"5642’, ‘61"\t"82020’, ‘7"\t"3106’, ‘51"\t"6’, ‘9%"\t"0’, ‘55"\t\tEUR\t\t\t\t\t\t\t\t\t\t’]

Hi Javier,

Uhm, then I think that the issue here is definitely not with your data, but with the fact that both of my suggested encodings (utf-8 and utf-16) are really not good exactly for this data :thinking: Hence here it isn’t just a question of filtering out some “strange” symbols from your data, which errors='ignore' does, because in your case there are too many “strange” symbols for these 2 encoding systems. So I agree with @htw and @jithins123: you have first to figure out what encoding system is the best for you and then use exactly that system. Use the code recommended by @jithins123 to find out the ideal encoding system exactly for your data. It can happen that even after using that system you’ll still have some non-encoded data (because none of the encoding systems is really ideal), but for sure there will be much, much less of such data, and their impact on your further work will be minimized.

1 Like

Yes, you are right, I tried as @jithins123 suggested, and the encoding system detected it´s the one you pointed before:

   import chardet
        with open("adw_kws.csv", 'rb') as rawdata:
            result = chardet.detect(rawdata.read(10000))
        print(result)

{‘encoding’: ‘UTF-16’, ‘confidence’: 1.0, ‘language’: ‘’}

Then I tried to open the file using the encodign system:

from csv import reader
opened_file = open('adw_kws_sp.csv',encoding='utf-16')
read_file = reader(opened_file)
kws_data = list(read_file)
print(kws_data[:5])

And this is what I get:

[[‘Date Range\tCampaign\tLocation\tLocation ID\tLanguages\tNetworks\tAd Group\tKeyword\tKeyword Type\tSegmentation\tEstimated Clicks\tEstimated Impressions\tEstimated Cost\tEstimated CTR\tEstimated Average CPC\tEstimated Average Position\tAccount Currency\tConversion Rate\tValue Per Conversion\tConversions\tAverage CPA\tConversion Value\tROAS\tSuggested Daily Budget\tCampaign Max CPC\tMax CPC\tBid Override’], [‘1 de enero de 2021 - 31 de marzo de 2021 (quarter)\tCampaign1\t\t\t\tGoogle Search\t\t\t\tTotal de la campaña\t"111419’, ‘63"\t"2002880’, ‘38"\t"42415’, ‘34"\t"5’, ‘6%"\t"0’, ‘38"\t"0’, ‘56"\tEUR\t6%\t"15’, ‘00"\t6700\t"6’, ‘34"\t"100000’, ‘00"\t"2’, ‘4"\t"585’, ‘13"\t"3’, ‘09"\t\t’], [’\tCampaign1\t\t\t\t\t\t\t\tOrdenadores\t"13885’, ‘86"\t"192275’, ‘91"\t"8369’, ‘88"\t"7’, ‘2%"\t"0’, ‘60"\t\tEUR\t\t\t\t\t\t\t\t\t\t’], [’\tCampaign1\t\t\t\t\t\t\t\tMóvil\t"91254’, ‘29"\t"1726853’, ‘5"\t"30857’, ‘20"\t"5’, ‘3%"\t"0’, ‘34"\t\tEUR\t\t\t\t\t\t\t\t\t\t’], [’\tCampaign1\t\t\t\t\t\t\t\tTablet\t"6279’, ‘51"\t"83750’, ‘85"\t"3188’, ‘26"\t"7’, ‘5%"\t"0’, ‘51"\t\tEUR\t\t\t\t\t\t\t\t\t\t’]]

So I figure that “/t” it´s encode as space. It´s strange that the enconding i´ts not better because it´s a doc downloaded from Google.
I will read the article that @htw linked before

Thank you everyone with your help :slight_smile:

Then your data is tab-delimited, try this:

opened_file = open('adw_kws_sp.csv',encoding='utf-16', delimiter='\t')
2 Likes

Thank you!

when i try to open with the delimiter, says that:

TypeError: ‘delimiter’ is an invalid keyword argument for open()

@javiermarzalb: Maybe you could try using a nested for loop since its a list? Something similar to this:

for row in kws_data[:5]:
   for data in row:
      data.split('\t').strip('\t')     

But then again looking at your data looks like it has inconsistent formatting so it probably will not work. Would be better if you could upload your csv file so that we can better assist.

Btw I got the idea from here

1 Like

@javiermarzalb: Looks like the delimiter for each row is different. Please upload your csv file so we can assist. Thanks

Hi
You are right this doesn´t seem the right encoding format so I saved the csv as UTF8 encoding and this is what I get using your help:

import chardet
with open("adw_kws.csv", 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))
print(result)

{‘encoding’: ‘UTF-8-SIG’, ‘confidence’: 1.0, ‘language’: ‘’}

Then I tried to open the file. Notice that I have to check wich kind of data I´m getting from the file:
Also, I need to delete the first 10 rows because they are a briefing data.

    from csv import reader
    opened_file = open('adw_kws.csv',encoding='utf-8')
    read_file = reader(opened_file)
    kws_data = list(read_file)
    header=kws_data[0]
    kws_data=kws_data[11:]
    print(header)
    print(kws_data[6])
    print(type(kws_data[6]))
    print(kws_data[6][1])
    print(type(kws_data[6][1]))

[’\ufeffDate Range;Campaign;Location;Location ID;Languages;Networks;Ad Group;Keyword;Keyword Type;Segmentation;Estimated Clicks;Estimated Impressions;Estimated Cost;Estimated CTR;Estimated Average CPC;Estimated Average Position;Account Currency;Conversion Rate;Value Per Conversion;Conversions;Average CPA;Conversion Value;ROAS;Suggested Daily Budget;Campaign Max CPC;Max CPC;Bid Override’]
[’;Campaign1;;;;;Grupo de anuncios 1;ajedrez damas y backgammon;BROAD;;0’, ‘06;41’, ‘17;0’, ‘03;0’, ‘10%;0’, ‘46;;EUR;;;;;;;;;;’]
<class ‘list’>
06;41
<class ‘str’>

I think this is more likely what I should get, but I found the delimiter is a ‘;’ and I´m not sure if the element of the list/row it´s right.

Here is the attachment:

adw_kws.csv (132.2 KB)

Wow looks quite messy. Could you save it in a different encoding format? I think that is the first step :sweat_smile:. Is it possible to get a screenshot of the original spreadsheet?

@javiermarzalb to get rid of the encoding signature you might be able to do it with this:

f = open('file', mode='r', encoding='utf-8-sig')

From this article. Are you able to save it in ASCII encoding instead?

1 Like

Hi, @javiermarzalb

If it is a CSV file saved with Excel then the divider in it is really ‘;’. I think we could help you better if you provided the source file.

This is what the last file looks like to you

1 Like

Yes, that.s what it looks like. It´s a csv form Google Ads. Exactly form the Keyword Planner.
My objetive it´s to add ‘+’ at the beginning of each word in the Keyword column. I think this process will be easy and itá common thing to do when you set up a campaign in Adwords.
The problem here it´s that I tried to do it with Python but the encoding thing has stopped me.
In this example, I have to split a column because the encoding it´s not rigt.