How to download an Excel Online File

Hi, @hanqi

I learned how Excel Online uploads data and faked this behavior with requests.get. Then I just kept loading pages, scrolling until I reached a page with no data.
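
A minimal sketch of that loop, not the exact code used here. It assumes a hypothetical `fetch_block(index)` callable that returns the rows of one block (for example by issuing the same request the page makes when it scrolls) and an empty list once the data runs out:

```python
from typing import Callable, List


def read_all_blocks(fetch_block: Callable[[int], List[list]],
                    max_blocks: int = 10_000) -> List[list]:
    """Collect rows block by block until an empty block comes back."""
    rows: List[list] = []
    # Hard cap so a faulty stop condition cannot loop forever.
    for index in range(max_blocks):
        block = fetch_block(index)
        if not block:  # a block without data marks the end of the sheet
            break
        rows.extend(block)
    return rows
```

How a block index maps to a request URL depends on the page, so that part is left to `fetch_block`.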

@moriturus7
Could you explain what you mean by faking uploading behaviour? How is it related to downloading this data?

I gave up on multi-threading: there is just too much time overhead (>1 sec per block of 28 rows vs 0.16 secs single-threaded). I was trying to open multiple headless drivers and scroll each one to a different block of 28 rows before reading, but there were too many bugs and deadlocks, and I don’t understand multi-threading well enough. I also don’t want to debug multi-processing not working normally on Windows.

Finally, my hardest article is done. Any suggestions on how to handle dirty data when it’s not as simple as values missing only in the 1st column?

@hanqi
I have already given some examples of how to work with JavaScript-driven sites using the requests library.

You studied the page correctly and found the iframe in its source. But every time your browser loads new rows, it just makes a request to the server. Instead of driving a hidden (headless) page with Selenium, you could make these requests yourself.

@moriturus7
Thanks, I get what you mean and found the request in the Network tab. By opening a new tab and pasting in the request URL, I can see the data in a CDATA structure. I can also paste the same URL into more new tabs and all the CDATA shows properly.

However, using requests.get or session.get with that URL beginning with http://azure... and providing the 2 headers Upgrade-Insecure-Requests and User-Agent gives me an error page.

This error page also appears if I refresh the previously successfully loaded page, or if I paste the same URL into the tab and press Enter. After this error, all future attempts at pasting the same URL into a new tab show the error. Then I have to go back to the original page to find a new set of GetRangeContent... to repeat my experiments.

  1. Do you know what is wrong with my requests.get in Python? Am I missing some headers?
  2. Why does pasting the request URL manually into multiple new tabs load the data properly, but once I refresh or paste into the address bar + Enter (within the same tab), the error page comes up and all future attempts at pasting the URL into new tabs show the error?

Edit to add own answer to question 1
The "Content-Type": "application/json" header is missing. This request header is also missing when a refresh is done. Inspiration came from looking at Chrome Developer Tools: right-click the request --> Copy --> Copy as cURL (cmd/bash), then either run the bash cURL directly in Jupyter or feed it to https://curl.trillworks.com/ to generate a program using the Python requests library. The cmd cURL first requires cleaning of ^%^ to % and ^& to & before it is fed to the converter, or it can be run in Windows cmd.
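
The cleaning step could be scripted as below. This is a small sketch covering only the two replacements named above; `clean_cmd_curl` is a hypothetical helper name, and a real "Copy as cURL (cmd)" string may contain other caret escapes that this does not handle.

```python
def clean_cmd_curl(command: str) -> str:
    """Undo the two cmd.exe caret escapes mentioned in the post."""
    # ^%^ stands for a literal % and ^& for a literal & in the copied command.
    return command.replace("^%^", "%").replace("^&", "&")
```

For example, `clean_cmd_curl('curl "http://host/x?a=1^&b=^%^20"')` gives `curl "http://host/x?a=1&b=%20"` (the URL here is made up for illustration).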

Edit to add own answer to question 2
The point is refresh (a new GET request) vs no refresh (served from disk cache). Every URL pasted into a new tab was served from disk cache, but a URL pasted into the same tab, or F5 on the same tab, triggers a refresh that sends a new GET request, and this new GET request lacks the required Content-Type header.

Also, on the point of pasting the URL into a new tab, I realize I have to paste it in the same Chrome window where I got the request URL from for it to work. Pasting from the Chrome window that Selenium opened into a manually opened Chrome window gets that error, and vice versa.

  3. What kind of behaviour is this? It feels like something knows which Chrome window/session I’m in, and that I must provide it in requests.

Edit to add own answer to question 3
A new window fails because the request headers have no Content-Type.

Observations from Investigating request url


This is a text-compare.com comparison (differences highlighted) of the requests made before (left) and after (right) a refresh with

driver.refresh()
driver.switch_to.frame(0)

I can see that SessionId has changed, but I don’t know if it’s necessary to provide one to requests.get, and if so, what its value should be. I went through the Excel web service protocol https://interoperability.blob.core.windows.net/files/MS-EXSPWS3/[MS-EXSPWS3].pdf to try to understand the Query String Parameters (the part after GetRangeContent?context=...), but it seems the SessionId is randomly provided. In the end I’m still not sure whether this SessionId is relevant to a correct requests.get call.
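
The same comparison can be done in code rather than with a text diff. A rough sketch, assuming the values of interest appear as separate query-string parameters; if they are nested inside the `context` value, that value would need further decoding. `diff_query_params` is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urlsplit


def diff_query_params(url_a: str, url_b: str) -> dict:
    """Return {name: (values_in_a, values_in_b)} for parameters that differ."""
    params_a = parse_qs(urlsplit(url_a).query, keep_blank_values=True)
    params_b = parse_qs(urlsplit(url_b).query, keep_blank_values=True)
    return {
        name: (params_a.get(name), params_b.get(name))
        for name in sorted(set(params_a) | set(params_b))
        if params_a.get(name) != params_b.get(name)
    }
```

A parameter present in only one URL shows up with `None` on the other side.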


This is a later comparison between the Selenium-opened Chrome (left) and a manually opened Chrome (right), where the SessionId and Configurations (I can’t find this definition in the protocol PDF) are different.

This excel did not handle missing data in 1st column so has translation errors.

Hi, @hanqi

Yes, you are missing headers. If you look closely at the list of headers, you will see that many of them are responsible for authorization.

Without these headers Azure treats you as an invalid user and returns an error.

Selenium doesn’t run as your browser user, so it doesn’t know about the tokens, sessions and authorization headers given to you.

Are you sure the headers are required? I deleted all of them, and only "Content-Type": "application/json" is required to get proper data. I’m not sure why it is needed, since it’s supposed to be a response header, or a request header specifying the payload type, and there’s no payload here.
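
A minimal sketch of such a request, sending only that one header. The function name `fetch_range` and the timeout value are assumptions for illustration, and the URL has to be a fresh GetRangeContent... URL copied from the Network tab.

```python
HEADERS = {"Content-Type": "application/json"}


def fetch_range(url, session=None, timeout=10):
    """GET one GetRangeContent URL and return the response body as text."""
    if session is None:
        # Third-party library; imported lazily so a stand-in session can be passed.
        import requests
        session = requests.Session()
    response = session.get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # raise on HTTP error status codes
    return response.text
```

If the error page comes back with a 200 status, `raise_for_status()` will not catch it, so the returned text still needs a check for the expected CDATA content.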

Finally, I’ve completed a full run in 5 seconds using requests + multithreading.
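
One way this kind of run could be structured, as a sketch only and not the code actually used here: a thread pool that maps a fetch function over a list of block URLs. `fetch_all`, its `fetch` argument and the worker count are hypothetical names and values.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=8):
    """Fetch every URL concurrently; results come back in the order of `urls`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order even though requests finish out of order.
        return list(pool.map(fetch, urls))
```

Threads suit this workload because each worker spends most of its time waiting on the network, unlike the earlier attempt that had to drive one headless browser per thread.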
