"IOPub data rate exceeded" error in Jupyter Notebook web scraper

I’m trying to scrape some data from a website in a Jupyter Notebook using BeautifulSoup, and I keep getting this error:

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)

Obviously I’ve googled this, and it seems the limit can be changed with a Jupyter configuration file, but I haven’t been able to get that to work. I also suspect that’s not the real problem in my case, because:

  1. I’m only trying to scrape text data.
  2. I have scraped the same data I’m TRYING to extract from this page before without getting this error; I was just getting it in a slightly different way.
  3. I’ve tried a workaround (inserting some time.sleep() breaks into my loops), but even that doesn’t seem to help.

Here’s the code I have currently:

import requests
import requests_cache
import time
from bs4 import BeautifulSoup

# Create a cache that we can access after scraping, just in case
requests_cache.install_cache('missing_children_cache')

# Iterate through each URL in the relevant section of Baobeihuijia forum
url_list1 = ["https://bbs.baobeihuijia.com/forum-191-{}.html".format(str(page)) for page in range(1, 2)]
 
# Create the lists we'll use to collect data
datalist = []

# SCRAPE DATA 
for url in url_list1:
    print('processing:' + ' ' + str(url)) #Print scraping progress so we know it's running
    page = requests.get(url)
    soup = BeautifulSoup(page.content,'html.parser') 
    
    thread = soup.find_all('tr')
    for single_thread in thread:
        em_container = soup.find_all('em')  # Find em tags, which sometimes contain gender data
        time.sleep(0.2)
        data_container = soup.find_all('a', class_ = 's xst')  # isolate the forum post title links
        for em in em_container: # iterate through the em containers
            em_text = em.get_text() # get text associated with em tags
            time.sleep(0.1)
            if "孩" in str(em_text): 

                for link in data_container:   # iterate through forum title links
                    link_text = link.get_text()   # get text from forum title links
#                     time.sleep(0.3)
                    if "出生" in str(link_text):  
                        datalist.append([em_text, link_text])

This may be a bit tough to read, but basically I’m trying to:

  1. Scrape any snippet of HTML that’s inside <tr> </tr> tags from this page
  2. From within that, isolate the text that’s inside <em> </em> tags and grab it if it includes the string '孩' (this is gender data)
  3. If it does include that string, then also look for the text that’s inside link tags with class_ = 's xst' and grab that as well (this is case data)
  4. When the conditions from 2 and 3 are met, append both those values as a list to datalist, making a list of lists, where each entry contains the gender data and the case data associated with it.

(As I said, I’ve done previous scrapes grabbing the exact same data and not gotten this error, so I think the problem is my code. Unfortunately I can’t use the results of the previous scrapes, as they were not accurately associating the correct gender data with the corresponding case data - sometimes the page will contain one but not the other, and I only want a record when both are there.)

I think that’s what my code is doing, but my guess is something here is wrong and is getting tons more data than I actually want, which is causing the error.
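
To spell out what I mean, here’s a rough sketch of the per-row matching I think I’m aiming for (untested, and it assumes each <tr> row contains both the <em> tag and the 's xst' title link when there’s a match):

# Rough, untested sketch of the intended logic: search within each <tr> row
# instead of searching the whole soup again on every pass
for url in url_list1:
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    for row in soup.find_all('tr'):             # step 1: each <tr> row
        em = row.find('em')                     # step 2: gender data, from this row only
        link = row.find('a', class_='s xst')    # step 3: case title link, from this row only
        if em and link and "孩" in em.get_text() and "出生" in link.get_text():
            datalist.append([em.get_text(), link.get_text()])  # step 4: keep only rows with both

I’m not sure this is right either, so any pointers are appreciated.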


To fix this behaviour you can either start the notebook with the command:

jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10

i.e. with a higher iopub data rate limit than the default, or create a jupyter_notebook_config.py as explained here:
http://jupyter-notebook.readthedocs.io/en/latest/config.html
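
For example (assuming the default config location), you can generate the file with `jupyter notebook --generate-config` and then add a line like the following; 1.0e10 is just an arbitrarily large limit:

# In ~/.jupyter/jupyter_notebook_config.py
c.NotebookApp.iopub_data_rate_limit = 1.0e10  # bytes/sec; the default is 1000000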


Please refer to the link below; it may help you.

https://www.drjamesfroggatt.com/python-and-neural-networks/iopub-data-rate-exceeded-the-notebook-server-will-temporarily-stop-sending-output-to-the-client-in-order-to-avoid-crashing-it/