Tips. An alternative to Selenium exists: Playwright

Hello, everybody.

I propose creating a category of posts that could be called “Tips, Tricks, and Hacks”, where the community could share useful libraries and ask questions about them.

But since I can’t create a separate category, let’s work around it.

So let’s get started.

Web scraping and Selenium.

Data collection is quite an important issue in our community. I know that many people use web scraping or are just learning it.

Many of you use Selenium, and I can understand why. This approach has several advantages.

  1. It is visually understandable: you see the browser, and you see how the script works with it.
  2. It is very simple: you can look at the element selectors and just write code that does the same thing you do when you use the browser.
  3. It is very easy to find many examples and articles about Selenium on the Internet.

But this approach has disadvantages.

  1. The most important one: poor scalability. When you try to run several Selenium instances, the difficulties begin. And if you need to crawl hundreds of thousands of pages, it becomes a real problem.
  2. Following from the first: speed. Even in headless mode, Selenium is a rather slow tool.
  3. A bloated API. It may seem unimportant, but over time the sheer number of methods for finding an element by different identifiers and selectors gets tiring.
  4. No way to get data from the developer tools. Sometimes, when you use a browser as an intermediate tool, you need to pull some data from DevTools and pass it on to your code.

That’s why I want to tell you about a library that is quite well known among JS developers but so far little known in the Python community.

Playwright for Python. This is not a myth.

So far we have been using Selenium. Meanwhile, browsers and the ways we interact with them have evolved, and the language closest to the browser has always been JavaScript.

So Puppeteer came along as a replacement for Selenium. The problem was that there was no reliable Python library for working with it, which is why its release passed us by.

Puppeteer, in turn, was followed by Playwright, developed at Microsoft by engineers who previously worked on Puppeteer. Most importantly, Microsoft did not wait for someone else to write a Python library and has been actively developing one itself:
https://github.com/microsoft/playwright-python

This is an open-source library. Yes, Microsoft realized that open source is not an obstacle and decided to join in.

So what does Playwright give us?

One Playwright to rule them all.
Let’s go through the pluses.

  1. We finally get headless Firefox and WebKit. Even if Firefox is not that interesting, WebKit opens up interesting possibilities: we can easily emulate the browser of a mobile device and substitute its geolocation data within a browser context.
  2. Asynchronous support. Python does not stand still, and asynchronous programming is the trend today. And our window onto the Internet is exactly where asynchronous programming shows its strengths best.
  3. Parallelism support: you can work with several pages at once, receiving data from all of them. (In fact, even the synchronous interface is implemented asynchronously; this is just hidden from us.)
  4. Built-in waiting for elements, with timeouts. If you want to make a quick test, clicking an element will not instantly kill the script just because you forgot to set a wait for rendering.
  5. Easier-to-read code that lets you work with different kinds of selectors.
  6. It lets you get some data from the developer tools.
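To make the first three points a bit more concrete, here is a minimal sketch using Playwright’s asynchronous Python API. It is only an illustration under assumptions: the helper names (`mobile_context_options`, `fetch_title`, `fetch_titles`), the "iPhone 11" device descriptor, and the coordinates are my own choices for the example, not something taken from a real project.

```python
import asyncio


def mobile_context_options(device, latitude, longitude):
    """Merge a Playwright device descriptor with a substituted geolocation."""
    options = dict(device)  # copy: viewport, user agent, touch support, ...
    options["geolocation"] = {"latitude": latitude, "longitude": longitude}
    options["permissions"] = ["geolocation"]  # allow pages to read the location
    return options


async def fetch_title(context, url):
    """Open one page in the shared context and return its title."""
    page = await context.new_page()
    try:
        await page.goto(url)  # by default resolves once the load event has fired
        return await page.title()
    finally:
        await page.close()


async def fetch_titles(urls):
    """Load all URLs concurrently in an emulated mobile WebKit browser."""
    # Imported here so the helpers above stay usable without Playwright installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.webkit.launch()  # headless unless headless=False
        try:
            # Arbitrary example coordinates; any latitude/longitude pair works.
            options = mobile_context_options(p.devices["iPhone 11"], -37.8136, 144.9631)
            context = await browser.new_context(**options)
            # One page per URL, all loading at the same time.
            return await asyncio.gather(*(fetch_title(context, u) for u in urls))
        finally:
            await browser.close()
```

Calling `asyncio.run(fetch_titles([...]))` with a list of URLs should return the page titles in the same order. It needs the `playwright` package and the WebKit browser installed (`playwright install webkit`).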

It has one minus, but quite a significant one: very few articles and examples. You will mainly have to work with the documentation and with examples written for JavaScript.

A few examples.

To show the readability of the code, I will take an example from a project of mine that was written with Selenium and then rewritten in Playwright.

Selenium:

import os
from selenium.webdriver.common.by import By

# `driver` is an existing Selenium WebDriver with the form page already loaded.
# The find_element_by_* helpers were removed in Selenium 4, so By locators are used.
driver.find_element(By.XPATH, '//*[@id="welcome_form"]/div[1]/div/div[1]/label').click()
driver.find_element(By.XPATH, '//*[@id="energy-loc"]/div/div/div[1]/label').click()
driver.find_element(By.XPATH, '//*[@id="energy-loc-home-only"]/div[1]/label').click()
driver.find_element(By.XPATH, '//*[@id="twelve-months-enquiry"]/div[2]/div/label[1]').click()
driver.find_element(By.XPATH, '//*[@id="retailer-select"]/div[2]/span/span[1]/span/span[2]').click()
driver.find_element(By.XPATH, '//*[@class="select2-results__option"][contains(., "AGL")]').click()
driver.find_element(By.XPATH, '//*[@id="postcode"]').send_keys("3000")
driver.find_element(By.XPATH, '//*[@id="postcode-btn"]').click()
driver.find_element(By.XPATH, '//*[@for="upload"]').click()
driver.find_element(By.XPATH, '//*[@class="select2-selection__rendered"][contains(., "Please")]').click()
driver.find_element(By.XPATH, '//*[@class="select2-results__option"][contains(., "AGL")]').click()
driver.find_element(By.ID, "fileupload").send_keys(os.getcwd()+"/File for Home 16.xls")
# These elements did not respond to normal clicks, so they are clicked via jQuery in the page.
driver.execute_script("$('#energy-concession-yes').click()")
driver.execute_script("$('#disclaimer_chkbox').click()")
driver.execute_script("$('#btn-proceed').click()")

The code looks ugly because, with simpler paths to an element, Selenium simply could not interact with it.

Playwright:

# `page` is an open Playwright Page; `sub_file` is the path of the file to upload.
# Current Playwright for Python uses snake_case method names (select_option, set_input_files).
page.click('xpath=//label[@for="electricity"]')
page.click('xpath=//label[@for="home"]')
page.click('xpath=//label[@for="home-here"]')
page.click('xpath=//label[@for="twelve-months-yes"]')
page.select_option('xpath=//select[@id="retailer"]', '5314')
page.fill('xpath=//*[@id="postcode"]', '3000')
page.click('xpath=//*[@id="postcode-btn"]')
page.click('xpath=//label[@for="upload"]')
page.click('xpath=//label[@for="upload-yes"]')
page.select_option('xpath=//select[@id="file-provider"]', 'agl')
page.set_input_files('xpath=//input[@id="fileupload"]', sub_file)
page.click('xpath=//label[@for="energy-concession-yes"]')
page.check('xpath=//*[@id="disclaimer_chkbox"]')
page.click('xpath=//*[@id="btn-proceed"]')

Because Playwright sits closer to the browser engine, it does not have the same problems performing these actions. And at the same time, look how much cleaner and clearer the code is.
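For anyone wondering where `page` comes from, here is a minimal sketch of the surrounding script with the synchronous API. The `url` argument and the function names `submit_postcode` and `run` are placeholders I made up for the illustration, and Chromium is an arbitrary choice. Only the two selectors are taken from the code above.

```python
def submit_postcode(page, postcode):
    """Fill in the postcode field and press the button next to it."""
    # fill() and click() wait until the element is ready to be acted on,
    # up to the page's default timeout, so no explicit wait is written here.
    page.fill('xpath=//*[@id="postcode"]', postcode)
    page.click('xpath=//*[@id="postcode-btn"]')


def run(url, postcode="3000", timeout_ms=10_000):
    """Open `url` in headless Chromium and submit the postcode form."""
    # Imported here so submit_postcode() stays usable without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.set_default_timeout(timeout_ms)  # applies to the built-in waits
            page.goto(url)
            submit_postcode(page, postcode)
        finally:
            browser.close()
```

If an element never appears, the call fails with a timeout error after `timeout_ms` milliseconds instead of failing immediately. That is the built-in waiting mentioned in the list of pluses.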

There are also very good examples in the documentation:
https://github.com/microsoft/playwright-python

And in this article, there is an example of how you can intercept the requests that occur when a page loads.
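To give an idea of what request interception looks like, here is a minimal sketch of my own (it is not the example from the article). It listens to the page’s `request` event and keeps only XHR and fetch calls, roughly what you would look for in the Network tab of the developer tools. The names `summarize` and `collect_requests` are invented for this illustration.

```python
def summarize(request):
    """Reduce a Playwright Request to a few fields of interest."""
    return {
        "method": request.method,
        "type": request.resource_type,
        "url": request.url,
    }


def collect_requests(url, types=("xhr", "fetch")):
    """Load `url` and return a summary of each request of the given resource types."""
    # Imported here so summarize() stays usable without Playwright installed.
    from playwright.sync_api import sync_playwright

    seen = []

    def on_request(request):
        if request.resource_type in types:
            seen.append(summarize(request))

    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            page = browser.new_page()
            page.on("request", on_request)  # fires for every request the page issues
            page.goto(url)  # returns once the load event has fired
        finally:
            browser.close()
    return seen
```

The listener has to be registered before `page.goto()`, otherwise the first requests are missed. Requests that the page sends long after loading are not captured by this sketch.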

Thank you all. Never stop learning new things and developing your skills.
