Intro to Requests and BeautifulSoup
We will use BeautifulSoup to extract some data from the HTML that we get with Requests. For now we will do it step by step.
Before we proceed, please make sure you have read the first blog in this series and completed the prerequisites.
What to do next?
We will now install the beautifulsoup4 library. You can run the command
pip install beautifulsoup4
or
python -m pip install beautifulsoup4
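To confirm that the installation worked, a quick check like the one below can help. It is only a sketch: it assumes requests was already installed as part of the prerequisites, and the version numbers it prints will depend on your environment.

```python
# Quick sanity check: both imports should succeed after installation.
import bs4
import requests

# Print the installed versions (the values depend on your environment).
print("beautifulsoup4", bs4.__version__)
print("requests", requests.__version__)
```

If either import raises ModuleNotFoundError, the package was probably installed for a different Python interpreter than the one running the script.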
Let's start coding.
- Create a file simplebsoup.py and paste the following code
from bs4 import BeautifulSoup
import requests
url = "https://slackingslacker.github.io/simple.html"
html_doc = requests.get(url).text
soup = BeautifulSoup(html_doc, "html.parser")
print(soup.find(id="d").text)
- Run simplebsoup.py. It should print the text
Inside the div tag but outside the p tag. This is inside the p tag.
- Add another line containing the following code
print(soup.find(id="p").text)
The final code should be:
from bs4 import BeautifulSoup
import requests
url = "https://slackingslacker.github.io/simple.html"
html_doc = requests.get(url).text
soup = BeautifulSoup(html_doc, "html.parser")
print(soup.find(id="d").text)
print(soup.find(id="p").text)
- Run simplebsoup.py again. It should print the text
Inside the div tag but outside the p tag. This is inside the p tag. This is inside the p tag.
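If you want to experiment without hitting the network, the same lookups can be tried against a local copy of the page. The sketch below assumes the page's markup matches the HTML quoted in this post. The line breaks inside the string are a guess, so get_text(" ", strip=True) is used to normalise the whitespace instead of .text.

```python
from bs4 import BeautifulSoup

# Assumption: a local copy of the page's markup. The whitespace layout
# here may differ from the live page.
html_doc = """<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Tutorials</title>
</head>
<body>
<div id="d">
Inside the div tag but outside the p tag.
<p id="p">This is inside the p tag.</p>
</div>
</body>
</html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# get_text(" ", strip=True) strips each piece of text and joins the
# non-empty pieces with a single space.
div_text = soup.find(id="d").get_text(" ", strip=True)
p_text = soup.find(id="p").get_text(" ", strip=True)

print(div_text)
print(p_text)
```

With this local copy, the first print shows both sentences and the second shows only the sentence inside the p tag.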
What did we do?
- We import the BeautifulSoup class from the bs4 library.
from bs4 import BeautifulSoup
- We import the requests library.
import requests
- We declare the URL that we will use for scraping.
url = "https://slackingslacker.github.io/simple.html"
- On the 4th line, we request the page and assign the text of the HTML response to html_doc.
html_doc = requests.get(url).text
- We create a BeautifulSoup instance from the HTML text in the variable html_doc and assign it to
the variable soup. We also tell it to use html.parser to parse the HTML.
soup = BeautifulSoup(html_doc, "html.parser")
- On the 6th line,
print(soup.find(id="d").text)
we find the element with the id d and print its text. Notice that the whole text Inside the div tag but outside the p tag. This is inside the p tag. was printed. This is the same behaviour you get when using JavaScript to read the text value of the div tag: the text of nested tags is included. The HTML is:
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Tutorials</title>
</head>
<body>
<div id="d">
Inside the div tag but outside the p tag.
<p id="p">This is inside the p tag.</p>
</div>
</body>
</html>
- On the 7th line,
print(soup.find(id="p").text)
we find the element with the id p and print its text. Only the text This is inside the p tag. was printed because it is the only text inside the p tag.
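The difference between the two lookups can also be seen by separating the div's own text from the text of its nested p tag. The sketch below is one possible approach, not the only one: it works on an inline copy of the div (an assumption about the markup's whitespace), and the id "missing" is a hypothetical example used to show that find returns None when nothing matches.

```python
from bs4 import BeautifulSoup, NavigableString

# Assumption: an inline copy of the div from the page.
html_doc = """<div id="d">
Inside the div tag but outside the p tag.
<p id="p">This is inside the p tag.</p>
</div>"""

soup = BeautifulSoup(html_doc, "html.parser")
div = soup.find(id="d")

# .text collects the text of every descendant, including the nested p tag.
full_text = " ".join(div.text.split())

# Keep only the strings that are direct children of the div,
# which skips the text inside the p tag.
own_text = "".join(
    child for child in div.children if isinstance(child, NavigableString)
).strip()

print(full_text)
print(own_text)

# find returns None when no element matches, so check before using .text.
missing = soup.find(id="missing")  # hypothetical id, not present in the page
if missing is None:
    print("No element with id 'missing'")
```

Checking for None first avoids the AttributeError that would be raised by calling .text on a lookup that found nothing.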