Fandango movies rating: problem with cleaning movie_budget

After the glorious work done by @adam.kubalica scraping all the budgets, I’m trying to clean each string and convert it into a float.
This is the function I wrote. It works just fine when I test it with random strings, but not with the dataframe:

def budget_neat(df):
    budget = df['movie_budget']
    if budget is not None:
        #splitting the budget with more than 1 value
        stripped = budget.strip()
        #need this if statement because some '-' are encoded differently
        if re.search('–', stripped):
            splitted = stripped.split('–')
            #converting values into float
            floated = [float(i) for i in splitted]
            #calculating the mean
            mean_value = sum(floated) / len(floated)
            return mean_value
        else:
            splitted = budget.split('-')
            floated = [float(i) for i in splitted]
            #calculating the mean
            mean_value = sum(floated) / len(floated)
            return mean_value
    else:
        return None

It shows me this error:
AttributeError: 'float' object has no attribute 'strip'

I don’t understand why. Could you please help?
I’ll leave the csv I created and part of my work on the jupyter notebook
fandango_2015.csv (12.7 KB)

fandango_cleaning.ipynb (98.5 KB)

Thanks for your kind help in advance

Haven’t tested your code, but given the DataFrame the error most likely occurs because of the NaN values.

NaN values are of type float. You need to deal with those NaN values in some way.

But I added the if statement, so it should run correctly, no?

Aah, good question!

None is of type NoneType, while NaN is a float. So the check `budget is not None` is True for NaN and the if condition doesn’t catch it.

You need to check for NaN separately or find another way to handle the missing values. This could be a starting point - pandas.isnull — pandas 1.3.4 documentation
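
As a minimal sketch of that idea, the function from the question could test for missing values with `pd.isnull`, which is True for both None and NaN. The toy dataframe and the `budget_clean` column name are assumptions for illustration, and the sketch assumes, as the original function does, that each budget is already a plain number or a range like `10–20`:

```python
import re

import pandas as pd


def budget_neat(row):
    """Return the mean of a budget string such as '10-20', or None if missing."""
    budget = row['movie_budget']
    # pd.isnull is True for both None and NaN, unlike `is not None`
    if pd.isnull(budget):
        return None
    # split on either a plain hyphen or an en dash
    parts = re.split('[-–]', budget.strip())
    values = [float(p) for p in parts]
    return sum(values) / len(values)


# toy frame for illustration; the values in fandango_2015.csv may differ
df = pd.DataFrame({'movie_budget': ['10–20', ' 35 ', float('nan'), '5-7']})
df['budget_clean'] = df.apply(budget_neat, axis=1)
print(df)
```

Splitting on the character class `[-–]` handles both dash encodings in one step, so the two branches of the original function are no longer needed.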

Ok, so it’s better if I reformulate the if statement as

if pd.isnull(budget):
etc.

Now I get this message while scraping data for the 2016 dataframe:

ConnectionError: HTTPSConnectionPool(host='en.wikipedia.orghttps', port=443): Max retries exceeded with url: //en.wikipedia.org/wiki/Special:Search (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x0000027ECFEF7F10>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))

Does anyone know how to solve it?
@adam.kubalica

: Max retries exceeded with url:

That’s your answer: you’ve tried to scrape too much, too often, and their server cuts you off because your scraper asks for too much and doesn’t behave like a normal human being.

If you look at how I actually applied the function in my Fandango project, you’ll notice that I’ve smuggled in

time.sleep(np.random.randint(1,19))

every now and then. That line of code puts everything to sleep for a random number of seconds (between 1 and 18, since the upper bound of np.random.randint is exclusive), which is more natural behaviour. Also, if you’ve just had 30 tries in 5 minutes while testing your function, the server can cut you off.
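
A rough sketch of that pattern, not the code from the Fandango project: `polite_fetch` and its parameters are hypothetical names, and `fetch` stands for whatever function downloads a single page.

```python
import random
import time


def polite_fetch(urls, fetch, min_wait=1, max_wait=18, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random time between calls."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            # random.randint includes both ends, so this waits 1 to 18 seconds
            sleep(random.randint(min_wait, max_wait))
        results.append(fetch(url))
    return results
```

With the requests library, for example, `fetch` could be `requests.get`. The `sleep` argument is only there so the pause can be swapped out when testing.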

helpful article
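
One more thing worth checking in the traceback: the host is reported as `en.wikipedia.orghttps`, and `getaddrinfo failed` means that name could not be resolved. That pattern can appear when a base URL is glued in front of a link that is already absolute. How the scraper builds its URLs isn't shown in the thread, so the sketch below is only an assumption about the cause. `BASE` and `full_url` are made-up names, and `urljoin` from the standard library handles relative and absolute links alike:

```python
from urllib.parse import urljoin, urlparse

BASE = 'https://en.wikipedia.org'  # assumed base URL


def full_url(href):
    """Build a full URL whether href is relative or already absolute."""
    return urljoin(BASE, href)


absolute = 'https://en.wikipedia.org/wiki/Special:Search'

# plain string concatenation produces the broken host seen in the error
naive = BASE + absolute
print(urlparse(naive).hostname)  # en.wikipedia.orghttps

print(full_url('/wiki/Special:Search'))  # https://en.wikipedia.org/wiki/Special:Search
print(full_url(absolute))                # https://en.wikipedia.org/wiki/Special:Search
```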

I think I screwed everything up :sweat_smile: :sweat_smile: :sweat_smile:
It doesn’t let me through anymore, even if I put the function to sleep between one request and the next…

You should try setting up Kaggle. You could scrape through a Kaggle notebook in the cloud and then download the data locally, or continue working in your notebook on Kaggle (btw, scraping and doing calculations is way faster on Kaggle, because… cloud et al.).

How can I scrape through kaggle?
Is there some tutorial?

It’s just another notebook solution; think of it as a Jupyter notebook in the cloud (so the internet connection is better).