Removing non-English apps: non-English names appended as English names

My Code:

dict_name_nonenglish = []
dict_row_nonenglish = []
dict_name_english = []
dict_row_english = []

def only_english_modified1(android_clean):
        for row in android_clean:
            name = row[0]
            non_english_charac = []
            for character in name: 
                if ord(character) > 127:
                    non_english_charac.append(character)  
                elif len(non_english_charac) > 3:
                    dict_name_nonenglish.append(name)
                    dict_row_nonenglish.append(row)
                    break
            if name not in dict_name_nonenglish:
                    dict_name_english.append(name)
                    dict_row_english.append(row)
                    

        print(len(dict_row_nonenglish)) 
        print(len(dict_name_nonenglish))
        print(len(dict_name_english))
        print(len(dict_row_english))




trial = only_english_modified1(android_clean)
print(trial)

What I expected to happen:

The number of rows or names should be 9614

What actually happened:

The number of rows and/or names I actually get is 9625

Hi everyone,

In order to get rid of the non-English apps (more than 3 non-English characters), I have used an alternative to the solution code. Up to the code I have posted here, my code is the same as the one in the solutions section, and android_clean contains 9659 rows as expected. For this reason I do not understand why, after running the code I am posting here, some non-English names are appended as English names (giving 9625 names instead of 9614). I have compared the results of my code with those from the solutions section, and I have realized that a few non-English names, such as РИА Новости or Bonjour 2017 Abidjan CI :heart::heart::heart::heart::heart:, are appended as English names by my code. Could someone help me out with this?

Thank you very much

Hi @aitor.susperregui,

The problem with your code is at the if-elif block here:

if ord(character) > 127:
    non_english_charac.append(character)
elif len(non_english_charac) > 3:
    dict_name_nonenglish.append(name)
    dict_row_nonenglish.append(row)
    break

Because you used an if-elif statement, the length of non_english_charac was not checked after every non-English character, so some non-English apps were never added to dict_name_nonenglish and dict_row_nonenglish. To solve this, use a nested if statement or two successive if statements. Either way, the length of non_english_charac is checked every time a character is added to it. You should get the correct number of English apps if you replace elif with if. In my case, I used a nested if statement, like this:

if ord(character) > 127:
    non_english_charac.append(character)
    if len(non_english_charac) > 3:
        dict_name_nonenglish.append(name)
        dict_row_nonenglish.append(row)
        break
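
The same idea can also be sketched as a small helper that returns True or False instead of appending to global lists. This is only an illustration: the function name is_english, the max_non_ascii parameter, and the sample names are assumptions of mine, not taken from the course solution.

```python
def is_english(name, max_non_ascii=3):
    """Return False once more than max_non_ascii characters fall outside ASCII.

    Hypothetical helper for illustration; not the course's solution code.
    """
    non_ascii_count = 0
    for character in name:
        if ord(character) > 127:
            non_ascii_count += 1
            # Check the count right after it changes, so a run of
            # non-ASCII characters at the end of the name is still caught.
            if non_ascii_count > max_non_ascii:
                return False
    return True


print(is_english("Instagram"))                       # True
print(is_english("РИА Новости"))                     # False
print(is_english("Docs To Go™ Free Office Suite"))   # True
```

With a helper like this, the English rows could be collected with a single loop over android_clean that keeps a row whenever is_english(row[0]) is True.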

Hi adewalade,

Many thanks for your help! Now it really works: I get 9614 names for English apps. Can I ask what the logic behind this change is? Why were some non-English apps not detected with the if-elif statement, while with the nested if statement they are?

Thank you in advance

Hi @aitor.susperregui,

I’m glad I could help. It comes down to how Python handles if-elif chains. Python evaluates the conditions in order, runs the block of the first one that is true, and skips the remaining branches. In your code, whenever ord(character) > 127 is True, the character is appended to non_english_charac and the elif branch is skipped. The length check len(non_english_charac) > 3 therefore runs only when the loop reaches a character for which ord(character) > 127 is False, such as an English letter, a digit, or a space.

In many non-English names, few or no such characters follow the native characters. The non-English characters are still added to the list, but if the count first exceeds 3 inside a run that lasts to the end of the name, the loop finishes without ever checking the length.

For example, in the case of Bonjour 2017 Abidjan CI :heart::heart::heart::heart::heart:, the emojis are appended to non_english_charac, but because no English character follows them, the length of non_english_charac is never checked. The code therefore never flags the name as non-English (even though it contains more than three non-English characters) and adds it to the list of English apps.
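
To see the difference side by side, here is a small sketch that runs both versions of the check on a few names. The two function names and the name "日本語アプリ Pro" are made up for illustration; only РИА Новости comes from this thread.

```python
def flagged_with_elif(name):
    """Original logic: the length check sits in an elif branch."""
    non_english_charac = []
    for character in name:
        if ord(character) > 127:
            non_english_charac.append(character)
        elif len(non_english_charac) > 3:
            # Only reached when the current character is ASCII.
            return True
    return False


def flagged_with_nested_if(name):
    """Fixed logic: the length check runs after every append."""
    non_english_charac = []
    for character in name:
        if ord(character) > 127:
            non_english_charac.append(character)
            if len(non_english_charac) > 3:
                return True
    return False


for name in ["РИА Новости", "日本語アプリ Pro", "Instagram"]:
    print(name, flagged_with_elif(name), flagged_with_nested_if(name))
# РИА Новости False True
# 日本語アプリ Pro True True
# Instagram False False
```

In РИА Новости the only ASCII character is the space after the third letter, when the list holds exactly 3 characters, so the elif version never sees a length above 3. In the second name a space follows six non-ASCII characters, so both versions flag it.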

Hi @adewalade,
Many thanks for your time and the great explanation! I was stuck on this problem for a long time without being able to work out what the issue was.

Thanks again!

Hi @aitor.susperregui,

You are welcome. I’m glad my explanation was helpful to you.

Happy learning!