Good sources on advanced scraping?

Hello

I have built many scrapers based on Node + Puppeteer, with PostgreSQL as data storage and queue.

Everything works great, but I want to keep improving. My next goal is to scrape data from more than 100,000,000 pages (I have my own proxy server running Squid). Where can I find sources to learn about advanced scraping, performance optimization, etc.?

What would you recommend?

Hello mxcdh
Welcome to the community!

I hope these articles help:

  1. How to build a scaleable crawler to crawl million pages with a single machine in just 2 hours
  2. HOW DO YOU CRAWL AND SCRAPE MILLIONS OF ECOMMERCE PRODUCTS?
  3. Easy Way to Scrape Data from Website By Yourself

Thanks,
Debasmita

Hi @mxcdh

I don’t know much about Node-based scraping, but I can say that the bottleneck in your approach is Puppeteer, because it drives a full web browser, even when that browser is headless. It will always be slower and less scalable than making requests from a lightweight HTTP client. In Python, such clients include requests, aiohttp, and httpx.

So I would recommend exploring approaches that scrape without a web browser, since you are likely to run into resource issues when building large scraping systems.
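To illustrate the browserless idea, here is a minimal sketch in Python using only the standard library. It assumes the target pages return their data in plain HTTP responses (no JavaScript rendering needed). The names `fetch_blocking`, `fetch`, `crawl`, and the `max_concurrency` parameter are made up for this example, and a real system would more likely use one of the async clients mentioned above plus retries, proxy rotation, and a persistent queue.

```python
import asyncio
import urllib.request
from typing import Awaitable, Callable, Dict, Iterable, Union


def fetch_blocking(url: str, timeout: float = 10.0) -> bytes:
    """Download one page over plain HTTP, with no browser involved."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


async def fetch(url: str) -> bytes:
    # Run the blocking call in a worker thread so the event loop stays free.
    # Note: the default thread pool also caps how many calls run at once.
    return await asyncio.to_thread(fetch_blocking, url)


async def crawl(
    urls: Iterable[str],
    fetch_one: Callable[[str], Awaitable[bytes]] = fetch,
    max_concurrency: int = 20,  # hypothetical default, tune for your setup
) -> Dict[str, Union[bytes, Exception]]:
    """Fetch all URLs with at most `max_concurrency` requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)
    results: Dict[str, Union[bytes, Exception]] = {}

    async def worker(url: str) -> None:
        async with semaphore:
            try:
                results[url] = await fetch_one(url)
            except Exception as exc:
                # Keep the error so one failed page does not stop the crawl.
                results[url] = exc

    await asyncio.gather(*(worker(url) for url in urls))
    return results
```

It could be run with something like `asyncio.run(crawl(list_of_urls))`. The `fetch_one` parameter is there so the HTTP client can be swapped (for example for an aiohttp- or httpx-based coroutine) without touching the concurrency logic.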

Regards, Max