I ran the is_english function from the solution to detect whether an app name is English or non-English. I wonder why the third string, which is clearly non-English, was recognized as English (i.e. ‘爱奇艺PPS -《欢乐颂2》电视剧热播’).
When I ran the code to build a list of the English app names, it doesn’t seem to have worked: I still get the same number of rows as in the android_clean and ios data sets (9659 rows for android vs. the expected 9614; 7197 rows for ios vs. the expected 6183).
def app_english(string):
    non_ascii = 0
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
        if non_ascii > 3:
            return False
        else:
            return True
Excluding the non-English apps from the dataset:
android_english = []
ios_english = []
for app in android_clean:
    name = app[0]
    if app_english(name):
        android_english.append(app)
for app in ios:
    name = app[1]
    if app_english(name):
        ios_english.append(app)
explore_data(android_english, 0, 3, True)
print('\n')
explore_data(ios_english, 0, 3, True)
I’m not able to replicate the error. I copied and pasted your code and got an indentation error. After fixing the indentation, it ran fine and I got the expected values for the English Android and iOS apps. If you could upload a copy of your .ipynb file, I can have a look at it and see what else might be going on.
Thanks for sharing your notebook! I was able to spot the problem.
def app_english(string):
    non_ascii = 0
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
        if non_ascii > 3:
            return False
        else:
            return True
The if/else part of your function is indented inside the for loop. That means that on the first iteration, the function checks the first character, increments non_ascii if needed, and then immediately runs the if/else. Since non_ascii can only be 0 or 1 at that point, the else branch runs and the function returns True. It never looks at any of the other characters.
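If it helps to see the early return in action, here is a small sketch. The name app_english_buggy is just a label I made up for this version of the function, and the string is the one from the question. It counts the non-ASCII characters separately so you can compare that count with what the function returns.

```python
def app_english_buggy(string):
    # Hypothetical name for the version with the if/else inside the loop.
    non_ascii = 0
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
        # This check runs on the very first character, so the function
        # returns before the loop can reach the second character.
        if non_ascii > 3:
            return False
        else:
            return True


name = '爱奇艺PPS -《欢乐颂2》电视剧热播'

# Count the non-ASCII characters independently of the function.
non_ascii_total = sum(1 for character in name if ord(character) > 127)

print(non_ascii_total)          # well above the threshold of 3
print(app_english_buggy(name))  # True, because only the first character was examined
```

Because every non-empty name returns True after one character, nothing gets filtered out, which would explain why the row counts stay the same.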
To fix this, take the if/else out of the loop:
def app_english(string):
    non_ascii = 0
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
    if non_ascii > 3:
        return False
    else:
        return True
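A quick way to confirm the fix is to run the function on a few names before filtering the full data sets. This is only a sketch: the sample list is something I put together for illustration (the second entry is the string from the question), and the final return line is a condensed form of the same if/else, not a different rule.

```python
def app_english(string):
    non_ascii = 0
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
    # Decide only after every character has been counted.
    # Equivalent to: if non_ascii > 3, return False, else return True.
    return non_ascii <= 3


# Illustrative sample names (an assumption, not pulled from the data sets).
samples = [
    'Instagram',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
]

results = [app_english(name) for name in samples]
print(results)  # [True, False, True, True]
```

Names with up to three non-ASCII characters (such as ™ or an emoji) still pass, while the Chinese name is now rejected. After this change, the lengths of android_english and ios_english should drop to the expected values.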