How to download an Excel Online File

Hi there!

I’m trying to download the following excel online file: http://www.presupuesto.pr.gov/PRESUPUESTOPROPUESTO2020-2021/_layouts/15/WopiFrame.aspx?sourcedoc={566feecf-1e0d-46b8-a505-7cd762665268}&action=edit&source=http%3A%2F%2Fwww.presupuesto.pr.gov%2FPRESUPUESTOPROPUESTO2020-2021%2FFOMB%20Budget%20Requirements%20FY%202021%2FForms%2FAllItems.aspx%3FRootFolder%3D%2FPRESUPUESTOPROPUESTO2020%2D2021%2FFOMB%20Budget%20Requirements%20FY%202021

The idea is download it, convert it into a Data Frame and then save it as an excel file.
I tried this but didn’t work:

import urllib
dls = “http://www.presupuesto.pr.gov/PRESUPUESTOPROPUESTO2020-2021/_layouts/15/WopiFrame.aspx?sourcedoc={566feecf-1e0d-46b8-a505-7cd762665268}&action=edit&source=http%3A%2F%2Fwww.presupuesto.pr.gov%2FPRESUPUESTOPROPUESTO2020-2021%2FFOMB%20Budget%20Requirements%20FY%202021%2FForms%2FAllItems.aspx%3FRootFolder%3D%2FPRESUPUESTOPROPUESTO2020%2D2021%2FFOMB%20Budget%20Requirements%20FY%202021
urllib.urlretrieve(dls, “test.xls”)

Any thoughts? :slight_smile:

Hi @santamariaots:

Please check out this article. Excel online does not have download functionality but perhaps exporting/copy pasting into google sheets and then downloading the sheets as a .csv would help. In addition, you may want to know that the method that you tried has the same link as the excel online link.

1 Like

Hi @masterryan.prof:

Thank you so much for the advice. The thing is that I do not have Excel Online. This document is posted by the government with Excel Online and I want to download it. I tried copy/paste but seems to be too big:
This message appears: “Retrieving data. Wait a few seconds and try to cut or copy again.”
I run out of ideas and I need the file badly to proceed with my analysis :frowning:

Hi @santamariaots: you are right… seems like the file is too big… Do you require all the data for your analysis? Perhaps if you do you could just copy chunks (which may be time consuming) or copy a select portion and not the entire file if you need not process all the data.

1 Like

@masterryan.prof Yep. I need all the info. The problem is that just let me copy/paste 100 rows at a time (and there are 15,000 rows!) It’ll take forever :slightly_frowning_face:

I have one final idea-- if you are using jupyter and you are able to view the raw file, maybe you could read from the website like this:

import pandas as pd
url = '<url_name>'
df = pd.read_csv(url, delimiter=',')

Note that this needs to be the raw file.

1 Like

@masterryan.prof . I really appreciate your final idea. I tried and gave me this error:

ParserError: Error tokenizing data. C error: Expected 1 fields in line 6, saw 2

import pandas as pd
url = ‘http://www.presupuesto.pr.gov/PRESUPUESTOPROPUESTO2020-2021/_layouts/15/WopiFrame.aspx?sourcedoc={566feecf-1e0d-46b8-a505-7cd762665268}&action=edit&source=http%3A%2F%2Fwww.presupuesto.pr.gov%2FPRESUPUESTOPROPUESTO2020-2021%2FFOMB%20Budget%20Requirements%20FY%202021%2FForms%2FAllItems.aspx%3FRootFolder%3D%2FPRESUPUESTOPROPUESTO2020%2D2021%2FFOMB%20Budget%20Requirements%20FY%202021
df = pd.read_csv(url, delimiter=’,’)

Could you screenshot your notebook with your code and the error instead? Thanks!

1 Like

@masterryan.prof

Try this:

  df = pd.read_csv(url, header=None,error_bad_lines=False, delimiter=’,’)

To be honest I’m not sure if this will work because the file isn’t in the raw format. Add the delimiter argument to circumvent a possible ‘exit code 0’

1 Like

@masterryan.prof

 df = pd.read_csv(url, header=None,error_bad_lines=False, delimiter=',')

The formatting was abit off because I copy pasted the previous example. It should work now.

1 Like

Thanks @masterryan.prof I guess since it is not a raw file it downloads that:

Hi @santamariaots I think all forms (open with excel and download) to retrieve the data require authentication. Maybe you should check with the relevant government department about this?

1 Like

@masterryan.prof. Yes, good advice. I’ll email them. Thanks!!!

No worries @santamariaots! Anytime!

1 Like

Hey @santamariaots

It didn’t work because there’s no data where you go. The data is loaded dynamically, and it’s also protected from manual copying.

I attach a complete Excel data file.
I hope you won’t need to download something from Azur again anytime soon :slight_smile:
Sabana May 8 2020 on FY 2021.xlsx (948.4 КБ)

1 Like

@moriturus7 Excellent!!! Thank you so much!!! :grinning:

@moriturus7
How did you get this excel file? Did you ask for permissions to download through the UI or there is a direct link you can find from the DOM to use request.get with?

@santamariaots
Would there be any issues if i download the information using selenium? I guess if it’s open for reading and anyone can copy the data manually into their own excel, there wouldn’t be a problem if i write a program to do the reading? I don’t understand the language/domain, what is this data about?

I have finished the single threaded code version already (1.5 minutes runtime), but still trying to get multithread working (unlikely time savings, just as an exercise). I started the project a month ago but put it on hold for GCP exam, so couldn’t make it useful then :frowning:

1 Like