App Store / Google Play Store non-guided project

I decided to do my own project based on the guided project ‘Profitable Apps Guided Project’, using larger datasets instead of the smaller, more manageable ones, to see what it would be like. I am trying to do the project with pandas, and I got stuck trying to separate English and non-English apps.

I decided to write a function like the one in the guided project:

def ascii(string):
    non_ascii = 0
    
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
            
    if non_ascii > 3:
            
        return False
    else: 
        return True

However, it doesn’t seem to work well with pandas: once I run the data sets through the function, I can no longer call df.info():

ios_eng = []
android_eng = []

for app in ios:
    name = ios['App_Name']
    if ascii(name):
        ios_eng.append(app)
        
for app in android:
    name = android['App Name']
    if ascii(name):
        android_eng.append(app)
        
print(ios_eng.info())
print('\n')
print(android_eng.info())

This returns:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_6700/2693046771.py in <module>
     14         android_eng.append(app)
     15 
---> 16 print(ios_eng.info())
     17 print('\n')
     18 print(android_eng.info())

AttributeError: 'list' object has no attribute 'info'

What would be a better way to separate non-English and English with pandas?

I may post more here as I run into more challenges with this self-inflicted challenge. Right now I am in the Data cleaning section of the Data Science path, Part 3.

Instead of creating a list, you might want to create a dataframe.
Please have a look at these links. They might help with your challenge. Good luck.
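
As a rough sketch of the dataframe idea: looping over a DataFrame with `for app in ios` yields column labels rather than rows, and appending to a list gives a plain list with no `.info()`. One option is to apply the function to the name column and use the resulting boolean Series as a filter. The function name `is_english`, the `max_non_ascii` parameter, and the sample rows below are made up for illustration; the check mirrors the "more than 3 non-ASCII characters" rule from the function above.

```python
import pandas as pd


def is_english(name, max_non_ascii=3):
    """Return True if name has at most max_non_ascii non-ASCII characters."""
    non_ascii = sum(1 for character in str(name) if ord(character) > 127)
    return non_ascii <= max_non_ascii


# Tiny made-up stand-in for the real android data set.
android = pd.DataFrame({
    "App Name": ["Instagram", "Docs To Go™ Free", "爱奇艺视频播放器"],
    "Rating": [4.5, 4.1, 3.9],
})

# apply() calls the function once per app name and returns a boolean Series.
mask = android["App Name"].apply(is_english)

# Boolean indexing keeps the rows where the mask is True; the result is
# still a DataFrame, so .info() is available.
android_eng = android[mask]
android_eng.info()
```

The function is renamed here only because `ascii` is also the name of a Python built-in; the original name would work too.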


Okay, so I am still running into issues going from data set to list to data frame, so now I am trying something a little different. I want to delete each row whose app name has 3 or more non-ASCII characters directly from the data frame, without creating a list. At this point I’m just experimenting, and I’ve got this code:

android_eng = android.drop(android[android['App Name'] == (r'[^\x00-\x7F]+') >= 3].index, inplace=True)

ios_eng = ios.drop(ios[ios['App_Name'] == (r'[^\x00-\x7F]+') >= 3].index, inplace=True)

That raises the error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_6700/147107051.py in <module>
      1 ### remove non-english apps
      2 
----> 3 android_eng = android.drop(android[android['App Name'] == (r'[^\x00-\x7F]+') >= 3].index, inplace=True)
      4 
      5 ios_eng = ios.drop(ios[ios['App_Name'] == (r'[^\x00-\x7F]+') >= 3].index, inplace=True)

~\Anaconda3\lib\site-packages\pandas\core\generic.py in __nonzero__(self)
   1535     @final
   1536     def __nonzero__(self):
-> 1537         raise ValueError(
   1538             f"The truth value of a {type(self).__name__} is ambiguous. "
   1539             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Looking at the code, it describes what I want to happen: if the name of the app has 3+ non-ASCII characters, drop the row. However, I’m not 100% sure how to write that, and other sources focus on removing characters rather than rows.
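
For what it's worth, the ValueError comes from the chained comparison: Python reads `a == b >= 3` as `(a == b) and (b >= 3)`, and the `and` asks for the truth value of the Series produced by `a == b`. Separately, `==` compares each name with the pattern as a literal string instead of matching it, and `drop(..., inplace=True)` returns `None`, so the assignment would not hold a data frame anyway. A possible way to write the intended "drop rows with 3+ non-ASCII characters" step, using `Series.str.count` on made-up sample rows:

```python
import pandas as pd

# Made-up sample rows; the real data set has many more columns.
ios = pd.DataFrame({
    "App_Name": ["Facebook", "Pokémon GO", "微信读书阅读器"],
})

# Count individual non-ASCII characters per name. There is no "+" after the
# character class: with "+", a run of several characters would count once.
non_ascii_count = ios["App_Name"].str.count(r"[^\x00-\x7F]")

# A single comparison gives a boolean Series, one value per row.
too_many = non_ascii_count >= 3

# Without inplace=True, drop() returns the new DataFrame.
ios_eng = ios.drop(ios[too_many].index)
print(ios_eng)
```

Here the threshold is the "3 or more" from the post; the guided-project function drops names only when they have more than 3, so `> 3` may be closer to the original behaviour.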

Okay, so I believe I got it to work to some extent, but it’s not as accurate as I want it to be.
After cleaning out all NaN values and unneeded columns, I ran the data frames through:


android_eng = android[~android['App Name'].str.contains(r'[^\x00-\x7F]+')]

ios_eng = ios[~ios['App_Name'].str.contains(r'[^\x00-\x7F]+')]

This works, but it also removes a lot of data that could be useful, i.e. English apps with a few non-ASCII characters. How can I make it more accurate?
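
One direction that might help, offered as a sketch and not as the guided project's method: instead of rejecting any name that contains a non-ASCII character, look at what share of the name is non-ASCII. A fixed count can let short non-English names through (a two-character name never reaches 3), while a share treats short and long names alike. The cutoff `MAX_SHARE` and the sample names are assumptions to be tuned against the real data.

```python
import pandas as pd

# Made-up sample names for illustration.
android = pd.DataFrame({
    "App Name": ["Instagram", "Pokémon GO", "中文"],
})

names = android["App Name"]

# Number of non-ASCII characters in each name (no "+", so each one counts).
non_ascii = names.str.count(r"[^\x00-\x7F]")

# Share of the name that is non-ASCII. An empty name gives NaN, which
# fails the comparison below and is therefore dropped.
share = non_ascii / names.str.len()

MAX_SHARE = 0.3  # assumed cutoff, not taken from the guided project

android_eng = android[share <= MAX_SHARE]
print(android_eng)
```

Names with a stray accent or trademark symbol stay in, while names written mostly in another script are filtered out. Checking a sample of the removed rows by eye is probably the simplest way to pick a sensible cutoff.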