Basic Programming - Do It Simpler: Web Scraping Using Requests

Intro to Requests and Beautifulsoup

We will use BeautifulSoup to extract some data from the HTML that a request returns. We will take it step by step.

Before we proceed, please make sure you have read the first post in this series and completed its prerequisites.

What to do next?

We will now install the beautifulsoup4 library.

You can run either of the following commands.

pip install beautifulsoup4

python -m pip install beautifulsoup4
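
To confirm that the installation worked, a quick check like the following can help. It is only a sketch, and it assumes the package was installed into the same Python environment that you use to run it.

```python
# Sanity check: import the package and show which version was installed.
import bs4

print("beautifulsoup4 version:", bs4.__version__)
```

If the import fails with ModuleNotFoundError, the install most likely went to a different Python environment.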

Let's start to code.

Create a file named simplebsoup.py and paste the following code.

from bs4 import BeautifulSoup
import requests
url = "https://slackingslacker.github.io/simple.html"
html_doc = requests.get(url).text
soup = BeautifulSoup(html_doc, "html.parser")
print(soup.find(id="d").text)

Run simplebsoup.py. It should print the following text.

        Inside the div tag but outside the p tag.
        This is inside the p tag.

Add another line containing the following code.

print(soup.find(id="p").text)

The final code should be:

from bs4 import BeautifulSoup
import requests
url = "https://slackingslacker.github.io/simple.html"
html_doc = requests.get(url).text
soup = BeautifulSoup(html_doc, "html.parser")
print(soup.find(id="d").text)
print(soup.find(id="p").text)

Run simplebsoup.py again. It should print the following text.

        Inside the div tag but outside the p tag.
        This is inside the p tag.

This is inside the p tag.

What did we do?

We import the BeautifulSoup class from the bs4 library.
```
from bs4 import BeautifulSoup
```
We import the requests library.
```
import requests
```

We declare the URL that we will use for scraping.

url = "https://slackingslacker.github.io/simple.html"

On the 4th line, we request the page and assign the text of the HTML response to html_doc.
```
html_doc = requests.get(url).text
```
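
This one-liner assumes the request always succeeds. As a hedged sketch, the fetch can be wrapped in a small helper that sets a timeout and raises an exception on HTTP error statuses. The name fetch_html and the 10-second timeout are assumptions made for illustration and are not part of the original script.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the body of `url` as text, raising if the request fails.

    `fetch_html` and the default timeout are illustrative choices.
    """
    response = requests.get(url, timeout=timeout)
    # Turn 4xx/5xx responses into a requests.HTTPError instead of
    # silently handing an error page to the parser.
    response.raise_for_status()
    return response.text
```

With this helper, the 4th line could become html_doc = fetch_html(url), and any network or HTTP problem would surface as a requests.RequestException.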
We create a BeautifulSoup instance from the HTML text in html_doc and assign it to the variable soup. We also tell it to use html.parser to parse the HTML.
```
soup = BeautifulSoup(html_doc, "html.parser")
```
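
Once the soup object exists, it can be explored without any further network calls. The sketch below parses a local copy of the tutorial page's HTML, so it runs offline, and shows that find returns None when nothing matches, which is worth checking before reading .text. The id "missing" is a made-up value used only for the demonstration.

```python
from bs4 import BeautifulSoup

# Local copy of the tutorial page, so the example needs no request.
html_doc = """<!DOCTYPE html>
<html>
<head>
    <meta charset="UTF-8">
    <title>Tutorials</title>
</head>
<body>
    <div id="d">
        Inside the div tag but outside the p tag.
        <p id="p">This is inside the p tag.</p>
    </div>
</body>
</html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# Tags can be reached by name; .string is the tag's single text child.
print(soup.title.string)  # Tutorials

# find() returns None when no element matches, so guard before using .text.
missing = soup.find(id="missing")  # "missing" is a hypothetical id
if missing is None:
    print("No element with that id")
else:
    print(missing.text)
```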
On the 6th line,
```
print(soup.find(id="d").text)
```
we find the element whose id is d and print its text. Notice that the whole text, "Inside the div tag but outside the p tag. This is inside the p tag.", was printed, because the text of a tag includes the text of every tag nested inside it. This is the same behaviour you get when using JavaScript to read the text value of the div tag. The HTML is:
```
<!DOCTYPE html>
<html>
<head>
    <meta charset="UTF-8">
    <title>Tutorials</title>
</head>
<body>
    <div id="d">
        Inside the div tag but outside the p tag.
        <p id="p">This is inside the p tag.</p>
    </div>
</body>
</html>
```
On the 7th line,
```
print(soup.find(id="p").text)
```
we find the element whose id is p and print its text. Only the text "This is inside the p tag." was printed, because it is the only text inside the p tag.
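
The difference between the two calls can be reproduced offline. The following sketch uses a trimmed copy of the HTML above as a string and compares the text of the div, the text of the p, and the div's own text without its children. Collecting only the direct NavigableString children is one possible way to do the last step, chosen here for illustration.

```python
from bs4 import BeautifulSoup, NavigableString

# Trimmed copy of the tutorial page's body.
html_doc = """
<div id="d">
    Inside the div tag but outside the p tag.
    <p id="p">This is inside the p tag.</p>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")
div = soup.find(id="d")
p = soup.find(id="p")

# get_text() with strip=True trims each piece of text and drops empty ones;
# the first argument is the separator placed between the pieces.
div_text = div.get_text(" ", strip=True)
p_text = p.get_text(strip=True)

# Keep only the strings that are direct children of the div,
# which leaves out the text that belongs to the nested p tag.
own_text = "".join(
    child for child in div.children if isinstance(child, NavigableString)
).strip()

print(div_text)
print(p_text)
print(own_text)
```

Using get_text with strip=True also removes the leading indentation that appears when the plain .text value is printed.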

Conclusion

We have extracted some data from a web page with just a few lines of code.

Basic Programming - Do It Simpler

Sunday, April 26, 2020

Web Scraping Using Requests - Beautifulsoup Intro
