Sunday, April 26, 2020

Web Scraping Using Requests - BeautifulSoup Intro

Intro to Requests and BeautifulSoup

We will use BeautifulSoup to extract some data from the HTML that we get with Requests. For now we will do it step by step.
 
Before we proceed, please make sure you have read the first post in this series and completed the prerequisites.
 

What to do next?

We will now install the beautifulsoup4 library.
  1. Run one of the following commands.
    pip install beautifulsoup4
    or
    python -m pip install beautifulsoup4
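
To confirm that the installation worked, a quick check like the one below can be run. It is only a sketch: the one-line HTML snippet and the variable name check_text are made up for illustration.

```python
# Quick sanity check that beautifulsoup4 is installed and importable.
from bs4 import BeautifulSoup

# A made-up one-line HTML document, used only for this check.
soup = BeautifulSoup("<p>It works</p>", "html.parser")
check_text = soup.p.text
print(check_text)
```

If the import fails with ModuleNotFoundError, the library was probably installed for a different Python interpreter. The python -m pip form installs it for the interpreter that the python command runs.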
 

Let's start coding.

  1. Create a file named simplebsoup.py and paste the following code.
    from bs4 import BeautifulSoup
    import requests
    url = "https://slackingslacker.github.io/simple.html"
    html_doc = requests.get(url).text
    soup = BeautifulSoup(html_doc, "html.parser")
    print(soup.find(id="d").text)
    
  2. Run simplebsoup.py. It should print the text:
            Inside the div tag but outside the p tag.
            This is inside the p tag.
    
  3. Add one more line at the end of the file:
    print(soup.find(id="p").text)
    
    The final code should be:
    from bs4 import BeautifulSoup
    import requests
    url = "https://slackingslacker.github.io/simple.html"
    html_doc = requests.get(url).text
    soup = BeautifulSoup(html_doc, "html.parser")
    print(soup.find(id="d").text)
    print(soup.find(id="p").text)
    
  4. Run simplebsoup.py again. It should print the text:
            Inside the div tag but outside the p tag.
            This is inside the p tag.
    
    This is inside the p tag.
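
The script above assumes that the request succeeds and that both ids exist on the page. A slightly more defensive variant is sketched below. The helper names fetch_html and text_by_id and the 10-second timeout are assumptions made for illustration; they are not part of the original script.

```python
from bs4 import BeautifulSoup
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML text (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.text


def text_by_id(html_doc, element_id):
    """Return the text of the element with the given id, or None if it is missing."""
    soup = BeautifulSoup(html_doc, "html.parser")
    element = soup.find(id=element_id)
    # soup.find returns None when nothing matches, so guard before reading .text.
    return element.text if element is not None else None
```

With these helpers, print(text_by_id(fetch_html(url), "d")) should give the same text as step 2, while an id that is not on the page gives None instead of raising an AttributeError.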
    
 

What did we do?

  • We import the BeautifulSoup class from the bs4 library.
    from bs4 import BeautifulSoup
    
  • We import the requests library.
    import requests
    
  • We declare the URL that we will use for scraping.
    url = "https://slackingslacker.github.io/simple.html"
    
  • On the 4th line, we request the page and assign the text of the response (the HTML) to html_doc.
    html_doc = requests.get(url).text
    
  • We create a BeautifulSoup instance from the HTML text in html_doc and assign it to the variable soup. We pass "html.parser" so that Python's built-in HTML parser is used to parse the HTML.
    soup = BeautifulSoup(html_doc, "html.parser")
    
  • On the 6th line,
    print(soup.find(id="d").text)
    
    we find the element whose id is d and print its text. Notice that the whole text, "Inside the div tag but outside the p tag. This is inside the p tag.", was printed, because the text of a tag includes the text of every tag nested inside it. JavaScript behaves the same way when you read the text content of the div tag. The HTML is:
    <!DOCTYPE html>
    <html>
    <head>
        <meta charset="UTF-8">
        <title>Tutorials</title>
    </head>
    <body>
        <div id="d">
            Inside the div tag but outside the p tag.
            <p id="p">This is inside the p tag.</p>
        </div>
    </body>
    </html>
    
  • On the 7th line,
    print(soup.find(id="p").text)
    
    we find the element whose id is p and print its text. Only "This is inside the p tag." was printed, because that is the only text inside the p tag.
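
The difference between the two calls can be reproduced without a network request by parsing the same HTML from a string. The sketch below also shows one possible way to read only the text that sits directly inside the div. The variable names all_text, own_text and p_text are chosen for illustration.

```python
from bs4 import BeautifulSoup

# The same markup as the page above, kept in a string so no request is needed.
html_doc = """<!DOCTYPE html>
<html>
<head>
    <meta charset="UTF-8">
    <title>Tutorials</title>
</head>
<body>
    <div id="d">
        Inside the div tag but outside the p tag.
        <p id="p">This is inside the p tag.</p>
    </div>
</body>
</html>"""

soup = BeautifulSoup(html_doc, "html.parser")
div = soup.find(id="d")

# get_text with strip=True trims each piece of text and joins them with the separator.
all_text = div.get_text(" ", strip=True)

# Only the strings that are direct children of the div, skipping the nested p tag.
own_text = "".join(div.find_all(string=True, recursive=False)).strip()

p_text = soup.find(id="p").text

print(all_text)
print(own_text)
print(p_text)
```

Here all_text holds both sentences on one line, own_text holds only the sentence outside the p tag, and p_text holds only the sentence inside it.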
 

Conclusion

We have extracted some data from a web page with just a few lines of code.
 
