Profitable App Profiles - Does ASCII 0-127 leave more languages than just English?

Hi, the first guided project in the Fundamentals Course of the Data Science path (Profitable App Profiles) suggests removing non-English entries from the Google Data Set by detecting characters that fall outside the standard ASCII range (code table 0-127).

But as a native Dutch speaker :wink: I wonder whether that would really filter out all the non-English app names. We hardly use any accents in our language (extended code table, 128 and upwards), and I suspect this is true of more languages than just Dutch.

Not really looking for a solution, just saying. But in a real project, I wonder if this solution would be acceptable! Or am I missing something?

Thanks,
Annemarie


Hello, Annemarie!

No, I don’t think it really filters out all the non-English app names. As a native Portuguese speaker, I can say that, although accents are common, there are lots and lots of Portuguese words with no accents, and app names made of those words would not be filtered out.

But it helps a lot with filtering out apps from countries that use different alphabets, leaving only the apps whose names we’re at least able to read. I suppose it would be extremely hard to keep only the truly English apps without a country column. I’d say we’d have to check app by app manually :sweat_smile:
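
To make the point concrete, here is a minimal sketch of an ASCII-range check. The function name `is_ascii_only` and the app names are made up for illustration (they are not taken from the data set), and the course's own solution may be written differently.

```python
def is_ascii_only(name):
    """Return True if every character's code point is in the ASCII range 0-127."""
    return all(ord(ch) <= 127 for ch in name)


# Hypothetical app names, invented for this example
names = [
    "Instagram",              # English
    "Fietsroutes Nederland",  # Dutch, no accents
    "Receitas de bolo",       # Portuguese, no accents
    "Notícias do dia",        # Portuguese, with an accented character
    "爱奇艺",                  # Chinese
]

# Keep only the names that pass the ASCII check
kept = [name for name in names if is_ascii_only(name)]
print(kept)
# ['Instagram', 'Fietsroutes Nederland', 'Receitas de bolo']
```

The Chinese name and the accented Portuguese name are dropped, but the unaccented Dutch and Portuguese names pass the check just like the English one.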

Good question, though.
