Adding tests around web scraping

Question: When scraping data from the web, what kind of process or test can be written to make sure the data is complete? E.g., the site is scraped and 10 new products are found; the next day 8 new products are found (2 were removed from the site, which is OK); on the 3rd day 5 products are found (the structure of the site changed and the scraper broke). Is there a way to add any type of check? Otherwise it will be days before you catch that your scraping program broke.


This is a great question, though it is really about general programming logic, in the sense that there is no universal answer. I usually test that the connection request was successful and that the scraped data is not null. The rest of your question depends on the specifics of your data. You know the minimum number of entries the scraped data should contain, so you can check for that. For instance, one of my scripts, running on a VPS, scrapes a certain site (through the official API) every two minutes. If the number of scraped entries is less than a threshold, the script does nothing. Otherwise, it processes them further. Not sure if this answer has helped in any way, so feel free to ask more questions if you want to :slight_smile:.
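
To make the checks above a bit more concrete, here is a rough Python sketch, not my actual script. The function name `find_scrape_problems`, the `min_items` threshold and the `required_fields` tuple are all made up for illustration, and it assumes each scraped entry is a dict. It returns a list of problems instead of silently doing nothing, so you can log or alert on it the same day the scraper breaks.

```python
from typing import Any, Dict, List, Optional, Sequence


def find_scrape_problems(
    status_code: int,
    items: Optional[List[Dict[str, Any]]],
    min_items: int,
    required_fields: Sequence[str] = ("name",),
) -> List[str]:
    """Return human-readable problems; an empty list means the scrape looks sane.

    All names and thresholds here are hypothetical; tune them to your own data.
    """
    problems: List[str] = []

    # 1. The request itself must have succeeded.
    if status_code != 200:
        problems.append(f"request failed with HTTP status {status_code}")

    # 2. The scraped data must not be null or empty.
    if not items:
        problems.append("no entries were scraped")
        return problems

    # 3. There must be at least the minimum number of entries you expect.
    if len(items) < min_items:
        problems.append(f"only {len(items)} entries scraped, expected at least {min_items}")

    # 4. Every entry must carry the fields you rely on. A site redesign often
    #    shows up as entries that still exist but have empty fields.
    for index, item in enumerate(items):
        missing = [f for f in required_fields if item.get(f) in (None, "")]
        if missing:
            problems.append(f"entry {index} is missing fields: {', '.join(missing)}")

    return problems
```

You would call it right after each scrape, for example `problems = find_scrape_problems(response.status_code, products, min_items=5, required_fields=("name", "price"))`, and send yourself a notification whenever `problems` is not empty. Since the number of new products legitimately varies from day to day, it is probably safer to apply the threshold to the total number of products on the page than to the number of new ones.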

