Learn Web Scraping the Fun Way by Building A Discord Bot

You know web scraping is a highly beneficial skill to have; it’s been on your list of things to learn for a very long time now.

What’s holding you back?

We both know the answer: the most elusive force in the universe, motivation. :slightly_smiling_face: While the simple but factual ‘it’s good for you’ usually has no bearing on us, by some lucky chance, we humans are easily motivated by fun.

:robot: Let’s get the fun started with a Discord bot!

Before we get our hands dirty, I want to make it clear that while it may seem like I’m complicating web scraping by throwing in a Discord bot, the bot is actually really simple to make thanks to the awesome discord.py package.

By the end of this post, we will have a live Discord bot that goes to the website and searches for us.

Bot Setup

What you need before setup:

  • A Discord account.
  • A Discord server. If you don’t already have one, it’s crazy easy to create and completely free. It’s like a group chat on steroids. :muscle:

The setup process is fairly straightforward, with only 3 steps:

  1. Create an application. Go to the Discord developer portal and sign in with your Discord account. Click the [New Application] button at the top right corner.

  2. Create a Bot. After the application is created, go to [Bot] in the left panel — it’s time to give your bot a name and an avatar! If you are like me — always spending too much time on a ‘cool name’ — don’t worry, you can always come back and change those settings.

You don’t really need to worry about the bot configuration settings for now.

  3. Add the bot to a server. Click [OAuth2] on the sidebar and check the ‘bot’ box under [SCOPES]. After the box is checked, you will see [BOT PERMISSIONS] show up. It’s not terribly important for a first project; for now, we will just check all the boxes under [TEXT PERMISSIONS] and [View Channels] under [GENERAL PERMISSIONS].

Boxes to Check For Your Bot:

Copy the URL shown under [SCOPES] and paste it into your browser. You will be brought to a webpage where you can choose which server to add your bot to:

Choose Which Server You Want to Add the Bot to:

Congratulations, you now have a bot in your server!

Bring Your Bot Online

What you need before coding:

  1. If you are not developing the bot on repl.it — the easier route if you don’t want to deal with package installation and version compatibility issues — and your Python version is older than 3.5, you need to upgrade to 3.5 or later to use async functions. Read why in this StackOverflow thread.
  2. Install the discord package through the command line: pip3 install discord.py. If you use repl.it, you can simply import discord in your code and repl.it will install the package on run.
  3. Copy your bot token from the Discord developers application page.

Copy Bot Token from The Bot Section on the Application Page:

Note: I highly recommend using repl.it to code up your first Discord bot. But if you really want to code on your local machine, you can find a few issues I ran into at the end of this post.

To bring your bot online, all you need to do is import the necessary packages, instantiate the Discord Client, and run it with client.run(your bot token).

When your bot is online, you will see in the Discord server that the bot avatar has a little green dot indicating it’s online. But we still want to make sure, so we will add a Discord event to print a message to the console that tells us that the bot is indeed online.

Bring Your Bot To Life with A Few Lines of Code:

import discord
import os
import search_runpee # search class we will implement later

'''
# If you are coding the bot on a local machine, use the `python-dotenv` package to get variables stored in the `.env` file of your project
from dotenv import load_dotenv
load_dotenv()
'''

# instantiate discord client 
client = discord.Client()

# discord event to check when the bot is online 
@client.event
async def on_ready():
  print(f'{client.user} is now online!')

# get bot token from .env and run client
# has to be at the end of the file
client.run(os.getenv('TOKEN'))

You may have noticed in the code above that I’m using os.getenv('TOKEN') to get the bot token. That’s because your bot token is like a password; you really want to keep it safe. What I did here was create a .env file and add the token as a variable. Note that when you create the token variable, don’t add quotes around the value, i.e. TOKEN=<paste your token as is>.

I forgot to add the .env file to .gitignore just now and immediately got a DM from Discord warning me to be more careful. :slightly_smiling_face: I hope you are not as careless as I was, but if something like that happens, it’s best to go back to the page where you copied the token and regenerate it.

Teach The Bot To Recognize Commands

Teach the Bot to respond to $hello with on_message Discord event:

@client.event
async def on_message(message): 
  # make sure the bot doesn't respond to its own messages to avoid an infinite loop
  if message.author == client.user:
      return  
  # lower case message
  message_content = message.content.lower()  
  
  if message_content.startswith('$hello'):
    await message.channel.send('''Hello there! I\'m the fidgeting bot from RunPee. 
    Sorry but I really need to go to the bathroom... Please read my manual by typing $help or $commands while I'm away.''')

One other thing that may have caught your eye is the use of the @client.event decorator. Here’s the definition from the discord.py docs:

A decorator that registers an event to listen to. The events must be a coroutine, if not, TypeError is raised.

It’s okay if you are not familiar with asynchronous programming; we are only using two events here: on_ready to check that our bot is online, and on_message for our bot to recognize when a user sends a message.

In the code above, we used the built-in on_message event (API Reference) from discord.py to listen for the command $hello from users. There are four key points I want to explain a little further:

  1. Check the user. If the author and the user are the same, your bot is responding to its own message, so return immediately. This prevents the bot from triggering itself in an infinite loop.
  2. You want to prefix commands with a symbol that’s uncommon in chat text, like $, to distinguish commands from regular messages.
  3. The event on_message is called when a Message is created and sent. The message parameter in the on_message function is a Message object. Here we access the content attribute of the message — which is a string — then use the Python str method startswith to match the command.
  4. Use message.channel.send() to send a message back to the user in the same channel.
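Points 2 and 3 come down to plain string matching on the message content. Here is a quick sketch with made-up messages (is_command is my own helper name, not part of discord.py):

```python
# command messages start with an uncommon symbol like '$'
def is_command(content, command):
    # lower-case first so '$Hello' and '$hello' both match
    return content.lower().startswith(command)

print(is_command('$Hello bot!', '$hello'))    # True
print(is_command('just chatting', '$hello'))  # False
```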

Attributes and Methods of the Message class:

Your bot just learned manners. It’s time to prepare it for the world (wide web)!

Basic Packages for Web Scraping

I know, I know, at this point it’s taking more words than I expected to get through building the Discord bot. But since you are still here, let’s introduce the bot to the web!

There are only two basic packages we need:

  • requests, to send HTTP requests and get the page content back.
  • Beautiful Soup (bs4), to parse the HTML and pull out the data we want.
After installing these two packages, you need to pick a website to scrape. Here I’m using the RunPee website as an example. We have a lot of movie-related content there that can be found through a search box.

The next thing to do is to get the URL or a base URL you are using to access the page you want to scrape. In my case, when I search for “wonder woman” related posts on RunPee.com through the search box, I get this URL: https://runpee.com/?s=wonder+woman

Search Results Page for Wonder Woman from RunPee.com:

It’s pretty clear that my base URL here is https://runpee.com/?s= and that the spaces between my search words are replaced with a “+” sign.
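In code, building the full search URL is one line of string stitching. Here’s a minimal sketch (build_search_url is my own helper name):

```python
BASE_URL = 'https://runpee.com/?s='

def build_search_url(query):
    # replace the spaces between search words with '+' signs
    return BASE_URL + '+'.join(query.split())

print(build_search_url('wonder woman'))  # https://runpee.com/?s=wonder+woman
```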

I wanted a separate file and class for web scraping to keep my code clean. So here it is: the class that scrapes the web and returns search results as links to the posts:

import requests
from bs4 import BeautifulSoup

class RunPeeWeb:
  def __init__(self):
        self.headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.3'}
        self.url = 'https://runpee.com/?s='

  def key_words_search_words(self, user_message):
    words = user_message.split()[1:]
    keywords = '+'.join(words)
    search_words = ' '.join(words)
    return keywords, search_words

  def search(self, keywords):
    response = requests.get(self.url+keywords, headers = self.headers)
    content = response.content
    soup = BeautifulSoup(content, 'html.parser')
    result_links = soup.findAll('a')
    return result_links
      
  def send_link(self, result_links, search_words): 
    send_link = set()
    for link in result_links:
        text = link.text.lower()
        if search_words in text:  
          send_link.add(link.get('href'))
    return send_link

Let me break it down step by step:

1. Initialize Search Class:


import requests
from bs4 import BeautifulSoup

class RunPeeWeb:
  def __init__(self):
        self.headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.3'}
        self.url = 'https://runpee.com/?s='

After importing the two packages we need, we initialize the class with the base URL. If you get a 403 Forbidden error, try adding User-Agent info to the headers as I did here. Open Chrome’s inspect mode and check the [Headers] information under [Network].

Let’s think about what we want next. Now that we have the base URL, the first thing we want is to complete the link with the words we are using to search and stitch them together with a “+” sign.

Also, looking at the results page, there are featured posts from the site that are not part of our search, plus possibly a few footer links and even ad links that we don’t want. We need to be able to compare the results against the search words after the scraping comes back.

That’s what the key_words_search_words function is for. We pass in the user_message, split it into a list of words (excluding the first word, which is our command $search), and reform the search content to meet our needs: keywords for completing the URL and search_words for later matching the links we want to send.
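To make that concrete, here is what the splitting produces for a hypothetical user message:

```python
user_message = '$search wonder woman'

words = user_message.split()[1:]  # drop the '$search' command itself
keywords = '+'.join(words)        # for completing the URL
search_words = ' '.join(words)    # for matching link text later

print(keywords)      # wonder+woman
print(search_words)  # wonder woman
```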

2. Create search_words and keywords & start scraping:


  # Create search words and URL keywords 
  def key_words_search_words(self, user_message):
    words = user_message.split()[1:]
    keywords = '+'.join(words)
    search_words = ' '.join(words)
    return keywords, search_words

  
  def search(self, keywords):
    # create requests and get the response
    response = requests.get(self.url+keywords, headers = self.headers)
    content = response.content
    # parse the HTML and pull the data we want 
    soup = BeautifulSoup(content, 'html.parser')
    result_links = soup.findAll('a')
    return result_links

Now that we have the complete URL, it’s time to request the content! That’s done with the first two lines of the search function, through requests.get() and the content attribute of the response. Then we parse the HTML using Beautiful Soup.

A bowl of soup is a great analogy for the content we are getting. You can print(soup), but beware of the flood of HTML that’s about to hit your console. :sweat_smile:

In this example, what I want returned are the relevant posts, as links. So I’m using findAll('a') to get all the links on the page. If you don’t know HTML at all, I encourage you to check out the HTML reference on MDN, then use Chrome inspect mode to check out the site you are scraping.

In conclusion, the search function sends a request to the site you are scraping and gets the content back as HTML. Then you can use Beautiful Soup to parse the HTML and use its methods to find the specific data you want. In this example, we only used the findAll('a') method because our need here is fairly simple. But more often than not, you might want to scrape a certain class of a certain element, or scrape by name, id, or even CSS selector. What I’m doing here is leading you through the door; I’ll leave you to explore the world inside on your own.
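For instance, here is a sketch of those more targeted lookups. The HTML snippet, class names, and id below are all made up for illustration:

```python
from bs4 import BeautifulSoup

html = '''
<div class="post">
  <h2 id="title">Wonder Woman</h2>
  <a class="more-link" href="https://example.com/ww">Read more</a>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# by element + class attribute
links = soup.find_all('a', class_='more-link')
# by id
title = soup.find(id='title')
# by CSS selector
selected = soup.select('div.post a.more-link')

print(links[0].get('href'))  # https://example.com/ww
print(title.text)            # Wonder Woman
print(len(selected))         # 1
```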

3. Send Unique Links that Are Relevant to Our Search:

  def send_link(self, result_links, search_words): 
    send_link = set()
    for link in result_links:
        text = link.text.lower()
        if search_words in text:  
          send_link.add(link.get('href'))
    return send_link

With all the links we scraped from the results page, it’s time to filter them and send only the ones we want. It’s a simple, straightforward function for matching and filtering links. One thing to mention is the use of link.text: text is an attribute that simply gets the human-readable text of the element. For example, in <a href="https://example.com">This is the text</a>, the text is “This is the text”. That’s what we compare the search_words against.
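Here is link.text in action on that example anchor tag:

```python
from bs4 import BeautifulSoup

# parse a single anchor tag and grab the first <a> element
link = BeautifulSoup('<a href="https://example.com">This is the text</a>',
                     'html.parser').a

print(link.text)         # This is the text
print(link.get('href'))  # https://example.com
```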

Integrating The Search Class With Discord Bot

With the search class in place, all we need to do now is to import the class file and use it in the Discord on_message event.

If we get results back from the search, we send the user the relevant links; if not, we reply with a no_result_message:

import search_runpee

# instantiate RunPeeWeb class from search_runpee.py
runpee_web = search_runpee.RunPeeWeb()

# no result message 
no_result_message = '''Sorry, we can\'t find what you are searching for. We may not have written anything about it yet, 
but you can subscribe to our newsletter for updates on our newest content 
--> https://runpee.com/about-runpee/runpee-movie-newsletter/'''

@client.event
async def on_message(message): 
  if message.author == client.user:
      return  
  # lower case message
  message_content = message.content.lower()  

  if message_content.startswith('$hello'):
    await message.channel.send('Hello there! I\'m the bad robot you fart face.')
    
  if '$search' in message_content:

    key_words, search_words = runpee_web.key_words_search_words(message_content)
    result_links = runpee_web.search(key_words)
    links = runpee_web.send_link(result_links, search_words)
    
    if len(links) > 0:
      for link in links:
       await message.channel.send(link)
    else:
      await message.channel.send(no_result_message)

Voilà! Your Discord search bot is ready to crawl the web!

Extra info for coding on a local machine

  • To get the TOKEN variable from the .env file in the same folder, you will need to pip install python-dotenv. Then import and call load_dotenv().
  • If you installed a new Python version, an error you might run into is “SSLCertVerificationError: certificate has expired”. Here’s the thread to read and fix that issue.
  • To host the bot and keep it running, I recommend this quick video tutorial from YouTube.

Conclusion

This is really a very basic introduction to web scraping, but I hope it intrigues you to explore more on your own.

Having a bot to play with is surprisingly motivating. You can also integrate it with an API, or ideas and needs you have for your Discord channel. Let your imagination fly. :wink: Hope you enjoy coding your own Discord bot!

Click here for the repo.
