Scraping Excel Online Read-Only File With Requests
Hit the nail on the head
In a previous article https://readmedium.com/scraping-excel-online-read-only-file-with-selenium-and-javascript-in-python-7bb549f05d66, I used Selenium to scrape this Excel Online file, but that felt a little indirect and slow, so here is a new attempt with new tools and knowledge gained. Full notebook at https://gist.github.com/gitgithan/b9f48e1b23e88f1fb1c56ad9b739adef
Creating the request
In the previous article, the strategy was to scroll, find, parse, scroll, find, parse, and so on. Now, the goal is to send requests using the Python requests library to directly target the information we want.
Begin by pressing F12 to open Developer Tools → Network tab in Chrome, then load http://www.presupuesto.pr.gov/PRESUPUESTOPROPUESTO2020-2021/_layouts/15/WopiFrame.aspx?sourcedoc=%7B566feecf-1e0d-46b8-a505-7cd762665268%7D&action=edit&source=http%3A%2F%2Fwww%2Epresupuesto%2Epr%2Egov%2FPRESUPUESTOPROPUESTO2020%2D2021%2FFOMB%2520Budget%2520Requirements%2520FY%25202021%2FForms%2FAllItems%2Easpx%3FRootFolder%3D%252FPRESUPUESTOPROPUESTO2020%252D2021%252FFOMB%2520Budget%2520Requirements%2520FY%25202021 (or press F5 to reload the page) to see the list of network requests being recorded. We want to focus on/filter for the GetRangeContent requests (discovered by manually scrolling and observing the new requests made) because they contain the data.
You will see 7 different GetRangeContent requests being made on page load. One of them contains the column names; the other 6 are the permutations of 3 starting row positions (0, 28, 56) × 2 starting column positions (0, 30). The 3 requests starting at column 30 do not provide useful data because there are only 14 columns in this dataset.
As in the screenshot above, we can right-click → Copy as cURL (bash) and paste it into https://curl.trillworks.com/ to generate the requests code required to emulate the call made by the browser. Copy as cURL (cmd) is possible if you want to run it in Windows cmd directly, but after using the code generator you will have to replace ^%^ with % and ^& with & before the code works (other converters may not include so many erroneous ^ characters that are not actually used for line continuation).
Not all the request headers you see in Chrome will be copied (e.g. Accept-Encoding: gzip, deflate), but as we will see later, much less information is required to make a successful request.
We can see that the code generator nicely organizes the headers, params, and the requests.get call, ready to run.
By gradually deleting information and manipulating the request, we can learn by experiment what is not important and what is.
Not important:
Everything in the headers except Content-Type: application/json can be deleted. If no Content-Type is supplied, the server returns an error. You can verify this by opening the GetRangeContent URL in a new tab and refreshing it, which does not supply a Content-Type. It is odd that this is required, since Content-Type is usually a response header rather than a request header, and even as a request header it is meant to describe a payload (https://stackoverflow.com/questions/5661596/do-i-need-a-content-type-header-for-http-get-requests#:~:text=In%20that%20sense%2C%20a%20Content,there%20needs%20no%20Content%2DType%20.), but we have no payload here.
cookies can be deleted too, as the server always sends the same cookie no matter how you refresh.
Important:
context should contain up-to-date information for each session, which times out after ~10 minutes, so the whole refresh-page + copy-cURL process must be repeated and the relevant code blocks updated before things run with correct responses.
row specifies the starting row of the requested data. This is varied in both the single-threaded and multi-threaded make_requests functions.
rowCount specifies the number of rows of data to get, beginning from row. We do not have to stick to the default 28 that Chrome/Selenium sends and can actually get all 15253 rows at once; however, such large requests take much longer to finish. Experimentation shows 100–6000 rows per request is a good range.
blockPosition → PaneType: PaneType 1 returns data only; PaneType 3 contains the column names (sent by the 7th GetRangeContent request we see after refresh). A pared-down request carrying only the important pieces above is sketched below.
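As a rough illustration, here is a minimal sketch of such a pared-down request. The fetch_block helper name is my own, and the endpoint URL, context value, and exact parameter set are placeholders: copy the real ones from your own "Copy as cURL" capture, since the session-specific context expires after roughly 10 minutes.

```python
import requests

# Placeholders: copy the real endpoint URL, context value, and full parameter
# set from your own DevTools capture; the context expires after ~10 minutes.
GET_RANGE_CONTENT_URL = '<GetRangeContent URL copied from DevTools>'
CONTEXT = '<context value copied from DevTools>'


def fetch_block(row, row_count, headers=None):
    """Request one block of rows from the GetRangeContent endpoint."""
    params = {
        'context': CONTEXT,          # fresh session token
        'row': str(row),             # starting row of the requested block
        'rowCount': str(row_count),  # rows per block (Chrome defaults to 28)
        # keep the blockPosition/PaneType parameter exactly as copied:
        # PaneType 1 returns data, PaneType 3 returns the column names
    }
    return requests.get(
        GET_RANGE_CONTENT_URL,
        headers=headers or {'Content-Type': 'application/json'},  # the only header the server insists on
        params=params,
    )
```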
Defining Functions
def make_requests
This is a while loop with row and rowCount parameterized so that each call starts the request at a later row, and so that we can experiment with run speeds for different requested block sizes.
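A minimal sketch of that loop, reusing the hypothetical fetch_block helper above and a parse_response like the one described further down:

```python
def make_requests(row_count=500, total_rows=15253):
    """Fetch the whole sheet in blocks of `row_count` rows, single-threaded."""
    blocks = []
    row = 0
    while row < total_rows:
        response = fetch_block(row=row, row_count=row_count)  # hypothetical helper sketched above
        blocks.append(parse_response(response))               # parse_response is described below
        row += row_count                                      # the next block starts where this one ended
    return blocks
```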
def clean_data
This function uses similar logic to the previous Selenium article to fill empty strings in the 1st column, but it is more accurate because it also checks the last block, which the Selenium implementation did not (so missing values in the last block would not have been identified or filled).
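As a rough sketch of that fill logic, assuming each row is a list of cell strings and that the rule is to carry the last non-empty 1st-column value forward (the notebook is the authority if the actual rule differs):

```python
def clean_data(rows):
    """Forward-fill empty strings in the 1st column of every row, including the last block."""
    last_seen = ''
    for row in rows:
        if row and row[0] == '':
            row[0] = last_seen  # fill the gap with the last value seen above it
        elif row:
            last_seen = row[0]
    return rows
```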
def parse_response
This function extracts the valuable information in the CDATA section manually with string slicing [166:-25] to avoid XMLSyntaxError: CData section too big found. The resulting string is parsed into a dictionary, and the relevant data inside GridHtml is extracted. These paths are found by manually inspecting response.text to see where the data is.
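A sketch of that step; the [166:-25] offsets and the GridHtml key are specific to this workbook and were found by inspecting response.text, so adjust them if your response looks different:

```python
import json

def parse_response(response):
    """Strip the CDATA envelope by position and pull out the grid HTML."""
    payload = json.loads(response.text[166:-25])  # slice off the XML/CDATA wrapper
    return payload['GridHtml']  # adjust this key path if the dictionary is nested differently
```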
Once we get inside GridHtml, we can use the Developer Tools live DOM to analyze which class to target in our XPath search for data extraction and cleaning. We can see that every row starts with ewr-nglr (Ctrl+F shows 28 rows), so that class was used to identify and clean rows in def clean_data. Since every cell has ewr-vab, that could also be used for XPath data searches, but it is more convenient to directly do gridhtml.xpath('//text()').
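For example, continuing from the grid HTML returned by the parse_response sketch above (the class name is as observed in DevTools for this workbook):

```python
from lxml import html

def extract_text(grid_html):
    """Pull all cell text out of the GridHtml fragment returned by parse_response."""
    gridhtml = html.fromstring(grid_html)
    rows = gridhtml.xpath('//*[contains(@class, "ewr-nglr")]')  # one marker per row; 28 per default block
    print(f'{len(rows)} rows in this block')
    return gridhtml.xpath('//text()')  # simplest: grab every text node directly
```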
When you meet the missing data beginning at row 9242, you can see there is no left:0px; the data starts at left:64px (every cell is 64px wide), so it is visually displayed in the 2nd column, leaving the 1st column empty. This shows that the Excel Online client does not know which columns are missing data; it only knows the correct horizontal positions at which to draw the cells among the gridlines, so that the user can interpret the missingness. These positions could be used for more complex missing-data filling in cases where more than just the 1st column is missing.
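To illustrate, here is a sketch of how those inline left offsets could be mapped back to column indices, assuming the 64px cell width and the ewr-vab cell class observed here (both are specific to this workbook):

```python
import re
from lxml import html

def cell_column_indices(grid_html, cell_width=64):
    """Map each cell to a column index from its inline `left: Npx` style."""
    indices = []
    for cell in html.fromstring(grid_html).xpath('//*[contains(@class, "ewr-vab")]'):
        match = re.search(r'left:\s*(\d+)px', cell.get('style', ''))
        if match:
            indices.append(int(match.group(1)) // cell_width)  # 0 = 1st column, 1 = 2nd column, ...
    return indices  # a row whose first cell maps to index 1 is missing its 1st column
```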
Speeding up with Multi-threading
Since requests are I/O-bound tasks, there is no need to wait for one block of rows to be returned and parsed before the next request is sent (as happens in the single-threaded while loop in the notebook). The sweet spot is around 300–500 rows for rowCount. range is used to create a different starting row position for the function in each thread.
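A minimal sketch of the multi-threaded version, again assuming the hypothetical fetch_block and parse_response helpers sketched earlier:

```python
from concurrent.futures import ThreadPoolExecutor

ROW_COUNT = 500                            # sweet spot observed: roughly 300-500 rows per request
start_rows = range(0, 15253, ROW_COUNT)    # a different starting row for each task

def fetch_and_parse(row):
    response = fetch_block(row=row, row_count=ROW_COUNT)
    return parse_response(response)

with ThreadPoolExecutor(max_workers=8) as executor:
    # executor.map keeps the results in the same order as start_rows
    blocks = list(executor.map(fetch_and_parse, start_rows))
```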
Errors to Watch
If it has been a while since the page was refreshed, a refresh is needed to get a working session. Scrolling in the window to revive it will not work, as it will refresh anyway and give a new SessionId.
The 4 errors shown at the end can appear when rowCount is changed or reduced. Both the single-threaded and multi-threaded code break at different rowCount ranges. When rowCount goes from 298 to 299, the response changes from a known Content-Length to Transfer-Encoding: chunked, and ChunkedEncodingError starts creeping in; however, at much higher rowCount values in the thousands this error goes away, and the same goes for ContentDecodingError.
These seem like issues with the server or the requests library implementation. The ContentDecodingError can be resolved by specifying a request header Accept-Encoding of br or identity so that the server uses a different compression scheme than the default gzip, but the time to get the response will double. JSONDecodeError happens more rarely, and I interpret it as bad data returned from the server.
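If you would rather work around these errors than tune rowCount, a small retry wrapper is one option. This is a sketch, assuming the hypothetical fetch_block helper from earlier accepts a headers override:

```python
import requests

identity_headers = {
    'Content-Type': 'application/json',
    'Accept-Encoding': 'identity',  # skip gzip so ContentDecodingError goes away (transfer takes about twice as long)
}

def fetch_block_with_retry(row, row_count, retries=3):
    """Retry a block when the chunked/decoding errors described above appear."""
    for attempt in range(retries):
        try:
            return fetch_block(row=row, row_count=row_count, headers=identity_headers)
        except (requests.exceptions.ChunkedEncodingError,
                requests.exceptions.ContentDecodingError):
            if attempt == retries - 1:
                raise
```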
I suspect that Jupyter getting stuck at the running symbol (*) when running executor.map is due to my incomplete understanding of how ThreadPoolExecutor really works.
Comparison with Selenium
Selenium Pros
- Selenium allows the user to observe the open window scrolling to track progress. Also, it will never error out with these parsing errors if sufficient waits are given
- Selenium does not require the handling of different starting rows
- More intuitive, mimics how a user actually interacts with the browser
Selenium Cons
- A lot slower (overhead in scrolling new elements into DOM)
- Multi-threading is difficult/ineffective. To multi-thread, one has to scroll each thread to a different starting point before reading. All this scrolling is unnecessary overhead.
- Cannot get more than 28 rows at a time because the information is constrained by what is attached to the DOM at any moment
Lessons
- Observe requests and compare their differences to learn about what information is returned (column names vs actual data)
- Do not trust Copy as cURL blindly; it may require cleaning and filling in missing headers
- Manually slicing strings to remove outer layers of nested information can hack through lxml limitations on CDATA parsing
- Experiment with the request parameters (do not stick to the default of 28 rows per request; delete useless headers)
- The client does not know where data is missing; that is left to human interpretation of the inline-style CSS, but those styles could be useful for programmatically filling in missing data
- Changing the number of rows requested can change the way the server passes data back, and how efficiently multi-threading works