I’m trying to scrape some data from a website in a Jupyter Notebook using BeautifulSoup, and I keep getting this error:
```
IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)
```
Obviously I’ve googled this, and it seems the limit can be raised via a Jupyter configuration file (see the sketch after this list), but I haven’t been able to get that to work. I also suspect that’s not the real problem in my case, because:
- I’m only trying to scrape text data
- I have scraped the same data I’m *trying* to extract from this page before without getting this error; I was just extracting it in a slightly different way
- I’ve tried a workaround (inserting some `time.sleep()` breaks into my loops), but even that doesn’t seem to help
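For reference, this is the configuration change the error message points to. It’s a sketch of the standard approach I found, assuming a default Jupyter install (the config file path and the `--generate-config` flag are the stock Jupyter ones):

```python
# Generate a config file if one doesn't exist yet (creates
# ~/.jupyter/jupyter_notebook_config.py):
#   jupyter notebook --generate-config
#
# Then, in jupyter_notebook_config.py, raise the limit:
c.NotebookApp.iopub_data_rate_limit = 1.0e10  # bytes/sec (default is 1e6)
```

Alternatively, the limit can apparently be passed on the command line when starting the server: `jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10`. But as I said, raising the limit feels like treating the symptom rather than the cause here.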
Here’s the code I have currently:
```python
import time

import requests
import requests_cache
from bs4 import BeautifulSoup

# Create a cache that we can access after scraping, just in case
requests_cache.install_cache('missing_children_cache')

# URLs for each page in the relevant section of the Baobeihuijia forum
url_list1 = ["https://bbs.baobeihuijia.com/forum-191-{}.html".format(page) for page in range(1, 2)]

# Create the list we'll use to collect data
datalist = []

# SCRAPE DATA
for url in url_list1:
    print('processing: ' + str(url))  # Print scraping progress so we know it's running
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    thread = soup.find_all('tr')
    for single_thread in thread:
        em_container = soup.find_all('em')  # Find em tags, which sometimes contain gender data
        time.sleep(0.2)
        data_container = soup.find_all('a', class_='s xst')  # Isolate the forum post title links
        for em in em_container:  # Iterate through the em containers
            em_text = em.get_text()  # Get text associated with em tags
            time.sleep(0.1)
            if "孩" in str(em_text):
                for link in data_container:  # Iterate through forum title links
                    link_text = link.get_text()  # Get text from forum title links
                    # time.sleep(0.3)
                    if "出生" in str(link_text):
                        datalist.append([em_text, link_text])
```
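To get a sense of how much data this actually touches, here’s a quick check I can run on a single page (a minimal sketch using the same libraries; `url_list1[0]` is just the first forum page from above):

```python
# Quick sanity check: how many tags does one page contain, and how many
# times would the inner loops run in the worst case?
page = requests.get(url_list1[0])
soup = BeautifulSoup(page.content, 'html.parser')

n_tr = len(soup.find_all('tr'))
n_em = len(soup.find_all('em'))
n_links = len(soup.find_all('a', class_='s xst'))

print(n_tr, n_em, n_links)
# Because the loop above searches the whole soup once per <tr> row, the
# em/link pairs get revisited up to n_tr * n_em * n_links times.
print('worst-case inner iterations:', n_tr * n_em * n_links)
```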
This may be a bit tough to read, but basically I’m trying to:

1. Scrape any snippet of HTML that’s inside `<tr> </tr>` tags from this page
2. From within that, isolate the text that’s inside `<em> </em>` tags and grab it if it includes the string `'孩'` (“child”; this is gender data)
3. If it does include that string, then also look for the text inside link tags with `class_ = 's xst'` and grab that as well (this is case data)
4. When the conditions from 2 and 3 are met, append both those values as a list to `datalist`, to make a list of lists where each entry contains the gender data and its associated case data
(As I said, I’ve done previous scrapes grabbing the exact same data without getting this error, so I think the problem is my code. Unfortunately I can’t use the results of those previous scrapes, as they weren’t accurately associating the gender data with the corresponding case data: sometimes the page contains one but not the other, and I only want to grab them when both are there.)
I think that’s what my code is doing, but my guess is that something here is wrong and is grabbing far more data than I actually want, which is what’s causing the error.
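My current suspicion (just a guess, not a confirmed fix) is that calling `soup.find_all(...)` inside the `for single_thread in thread:` loop searches the entire page once per `<tr>` row, multiplying the results. Here’s a sketch of how I think the inner searches could be scoped to each row instead, using the same libraries and URL list as above:

```python
datalist = []

for url in url_list1:
    print('processing: ' + str(url))
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')

    for single_thread in soup.find_all('tr'):
        # Search only within this <tr> row, not the whole page,
        # so each em/title pair is visited at most once
        em_container = single_thread.find_all('em')
        data_container = single_thread.find_all('a', class_='s xst')

        for em in em_container:
            em_text = em.get_text()
            if "孩" in em_text:  # "child" (the gender marker)
                for link in data_container:
                    link_text = link.get_text()
                    if "出生" in link_text:  # "born" (the case marker)
                        datalist.append([em_text, link_text])
```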