Wednesday, April 29, 2020

Web Scraping Using Requests - Searching Through Document

Requests and Beautifulsoup Search In Document

We will use BeautifulSoup's find_all method to search for tag elements. We will have different samples, and we can run them one by one.
 
Before we proceed, please make sure you have read the first and second blog posts in this series to do the prerequisites.
 

Let's start to code.

  1. Create a file simpletagattr.py and paste the following code
    from bs4 import BeautifulSoup
    import requests
    import re
    
    url = "https://slackingslacker.github.io/simpletags.html"
    html_doc = requests.get(url).text
    soup = BeautifulSoup(html_doc, "html.parser")
    
  2. Add the code below.
    print(soup.find_all("div"))
    
    Run the code. It should print all the div tag elements that were found in the HTML document. The first parameter of the find_all method is the tag name(s). It can be a string, a regular expression, or a list. It can also be a function.
  3. Comment the recently added line by putting # at the start of the line.
    # print(soup.find_all("div"))
    
    Add this line and run the code again. It should print all the span and b tags.
    print(soup.find_all(["span", "b"]))
    
  4. Comment the recently added line and add the line below then run the code.
    print(soup.find_all(re.compile("^s")))
    
    We have used regular expressions here to find all tags whose names start with s. It should print tags like span, style, sub, small, etc.
  5. Comment the recently added line and add the line below then run the code.
    print(soup.find_all(string=re.compile("bold text")))
    
    We have used the string argument to find the literal text bold text. If a match exists, the matching text will be displayed in the output. If the search does not find any matches, the result will be an empty list.
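To illustrate with a small inline snippet (an assumption, not the tutorial page), note that searching by string returns the matching text nodes themselves rather than Tag objects:

```python
from bs4 import BeautifulSoup
import re

html = "<b>some bold text here</b><i>plain</i>"
soup = BeautifulSoup(html, "html.parser")

# string= matches text nodes rather than tags,
# so the results are the matching strings themselves
matches = soup.find_all(string=re.compile("bold text"))
print(matches)
```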
  6. Comment the recently added line and add the line below then run the code.
    print(soup.find_all(id="tableDiv"))
    
    We have used the id attribute to find all tags with the value tableDiv. In the real world we should find only one, as id must be a unique identifier for an element. We can use different attributes here, like href, width, height, data-lang, etc.
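A caveat worth noting: an attribute name like data-lang contains a hyphen, so it cannot be passed as a Python keyword argument. A minimal sketch, using a small inline HTML snippet instead of the tutorial page, passes it through the attrs dictionary:

```python
from bs4 import BeautifulSoup

# A small inline HTML snippet (not the tutorial page) for illustration
html = '<div data-lang="python">py</div><div data-lang="java">jv</div>'
soup = BeautifulSoup(html, "html.parser")

# data-lang contains a hyphen, so it cannot be a keyword argument;
# pass it through the attrs dictionary instead
matches = soup.find_all(attrs={"data-lang": "python"})
print(matches)
```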
  7. Comment the recently added line and add the line below then run the code.
    print(soup.find_all("span", "red"))
    
    The parameters we pass are the tag name and the class, respectively. It will find all span tags with the class red.
  8. Comment the recently added line and add the line below then run the code.
    print(soup.find_all(width="400"))
    
    Like the id example, but using the width attribute instead. It should find 2 table tags.
  9. Comment the recently added line and add the line below then run the code.
    print(soup.find_all(width="400", height="100", border="1", id="secondTable"))
    
    In the previous example we used the width attribute. In this case we use the width, height, border and id attributes to get a specific table.
  10. Comment the recently added line and add the lines below then run the code.
    print(soup.find_all(name="table", width="400", limit=1))
    print(soup.find_all(name="b", limit=2))
    
    These examples limit the search to 1 and 2 tags respectively. This is useful when you just need the first tag or the first few tags.
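When only the first match is needed, find is a convenient shortcut for find_all with limit=1; a small sketch on an inline snippet:

```python
from bs4 import BeautifulSoup

html = "<b>first</b><b>second</b>"
soup = BeautifulSoup(html, "html.parser")

# find returns the first matching Tag directly (or None if nothing matches),
# roughly equivalent to find_all(..., limit=1)[0]
first = soup.find("b")
limited = soup.find_all("b", limit=1)
print(first.text, limited[0].text)
```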
  11. Comment the recently added lines and add the lines below then run the code.
    print(soup.find_all(class_="red"))
    print(soup.find_all(name="span", class_="red"))
    print(soup.find_all(name=re.compile("^sp"), class_="red", string="span"))
    print(soup.find_all(name="span", class_="black red-bg"))
    
    The examples above search using the class of a tag. There are four examples, each with a different set of search criteria.
  12. Comment the recently added lines and add the lines below then run the code.
    def with_specific(tag):
        return tag.has_attr("id") and tag.get("id", "") == "listDiv"
    print(soup.find_all(with_specific))
    
    The example above uses a function to find a specific tag. The function checks if the tag has an id attribute
    tag.has_attr("id")
    
    and then checks if the value of the id attribute matches a specific string.
    tag.get("id", "") == "listDiv"
    
    The value of id defaults to an empty string so that the code does not raise an error if the id attribute is not present.
    soup.find_all(with_specific)
    
    The above is the sample call passing a function as the parameter.
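Filter functions can also express conditions that keyword arguments cannot; as another sketch (with a hypothetical inline snippet), here is a function that matches tags having a class but no id:

```python
from bs4 import BeautifulSoup

html = '<div class="red">a</div><div class="red" id="x">b</div><span>c</span>'
soup = BeautifulSoup(html, "html.parser")

def class_but_no_id(tag):
    # True only for tags that have a class attribute but no id
    return tag.has_attr("class") and not tag.has_attr("id")

matches = soup.find_all(class_but_no_id)
print(matches)
```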
  13. Comment the recently added lines and add the lines below then run the code.
    for tag in soup.find_all(True):
        print(tag.name)
    
    Passing True will match every tag in the document. We just loop through the list and print the tag names.
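Combined with a set, this gives a quick inventory of the distinct tag names a document uses; a small sketch on an inline snippet:

```python
from bs4 import BeautifulSoup

html = "<div><span>a</span><span>b</span><b>c</b></div>"
soup = BeautifulSoup(html, "html.parser")

# find_all(True) matches every tag; a set comprehension collects the distinct names
names = {tag.name for tag in soup.find_all(True)}
print(sorted(names))
```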
 

Conclusion

We used find_all to search for tags in several different ways.
 

Monday, April 27, 2020

Web Scraping Using Requests - Beautifulsoup Tag Class

Requests and Beautifulsoup Using Tag

We will use BeautifulSoup to get the name of a tag and some of its attributes.
 
Before we proceed, please make sure you have read the first and second blog posts in this series to do the prerequisites.
 

Let's start to code.

  1. Create a file simpletagattr.py and paste the following code
    from bs4 import BeautifulSoup
    import requests
    url = "https://slackingslacker.github.io/simpletags.html"
    html_doc = requests.get(url).text
    soup = BeautifulSoup(html_doc, "html.parser")
    print(soup.find(id="hDiv").name)
    image_tag = soup.find(id="imgId")
    print(image_tag["src"])
    print(image_tag["width"])
    print(image_tag["height"])
    print(image_tag["alt"])
    
  2. Run the simpletagattr.py. It should print the text
    div
    simple.png
    200
    200
    Sample Image
    
 

What did we do?

  • We import the BeautifulSoup class from bs4 library.
    from bs4 import BeautifulSoup
    
  • We import the requests library.
    import requests
    
  • We declare the URL that we will use for scraping.
    url = "https://slackingslacker.github.io/simpletags.html"
    
  • We get the text value of the HTML on the 4th line, assigning it to html_doc.
    html_doc = requests.get(url).text
    
  • We create a BeautifulSoup instance using the HTML text value in variable html_doc, assigning it to variable soup. We also use html.parser to parse the HTML.
    soup = BeautifulSoup(html_doc, "html.parser")
    
  • On the 6th line,
    print(soup.find(id="hDiv").name)
    
    we print the name of the Tag. This code gets the Tag object of whatever it finds using the id hDiv
    soup.find(id="hDiv")
    
  • On the 7th line,
    image_tag = soup.find(id="imgId")
    
    we assign the Tag object to image_tag. We will use this object to print the element attributes.
  • On the succeeding lines, we print the src, width, height and alt attributes respectively
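Indexing a Tag like a dictionary raises a KeyError when the attribute is missing; tag.get returns a default instead, and tag.attrs exposes everything at once. A small sketch using an inline img tag (a trimmed-down stand-in for the one on the page):

```python
from bs4 import BeautifulSoup

# An inline img tag for illustration (the real page has more attributes)
html = '<img id="imgId" src="simple.png" width="200">'
soup = BeautifulSoup(html, "html.parser")
image_tag = soup.find(id="imgId")

print(image_tag["src"])             # dictionary-style access; raises KeyError if missing
print(image_tag.get("alt", "n/a"))  # safe access with a default value
print(image_tag.attrs)              # all attributes as a dict
```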
 

Conclusion

We used the Tag class to obtain different properties of an HTML element.
 

Sunday, April 26, 2020

Web Scraping Using Requests - Beautifulsoup Intro

Intro to Requests and Beautifulsoup

We will use BeautifulSoup to extract some data from the HTML that we get. For now we will do it step by step.
 
Before we proceed, please make sure you have read the first blog post in this series to do the prerequisites.
 

What to do next?

We will now install the beautifulsoup4 library.
  1. You can run the command.
    pip install beautifulsoup4
    or
    python -m pip install beautifulsoup4
 

Let's start to code.

  1. Create a file simplebsoup.py and paste the following code
    from bs4 import BeautifulSoup
    import requests
    url = "https://slackingslacker.github.io/simple.html"
    html_doc = requests.get(url).text
    soup = BeautifulSoup(html_doc, "html.parser")
    print(soup.find(id="d").text)
    
  2. Run the simplebsoup.py. It should print the text
            Inside the div tag but outside the p tag.
            This is inside the p tag.
    
  3. Add another line containing the following code
    print(soup.find(id="p").text)
    
    The final code should be:
    from bs4 import BeautifulSoup
    import requests
    url = "https://slackingslacker.github.io/simple.html"
    html_doc = requests.get(url).text
    soup = BeautifulSoup(html_doc, "html.parser")
    print(soup.find(id="d").text)
    print(soup.find(id="p").text)
    
  4. Run the simplebsoup.py again. It should print the text
            Inside the div tag but outside the p tag.
            This is inside the p tag.
    
    This is inside the p tag.
    
 

What did we do?

  • We import the BeautifulSoup class from bs4 library.
    from bs4 import BeautifulSoup
    
  • We import the requests library.
    import requests
    
  • We declare the URL that we will use for scraping.
    url = "https://slackingslacker.github.io/simple.html"
    
  • We get the text value of the HTML on the 4th line, assigning it to html_doc.
    html_doc = requests.get(url).text
    
  • We create a BeautifulSoup instance using the HTML text value in variable html_doc, assigning it to variable soup. We also use html.parser to parse the HTML.
    soup = BeautifulSoup(html_doc, "html.parser")
    
  • On the 6th line,
    print(soup.find(id="d").text)
    
    we find the element whose id is d and print its text. Notice that the whole text Inside the div tag but outside the p tag. This is inside the p tag. was printed, including the text of the nested p tag. This matches the behaviour of getting the text value of the div tag in JavaScript. The HTML is:
    <!DOCTYPE html>
    <html>
    <head>
        <meta charset="UTF-8">
        <title>Tutorials</title>
    </head>
    <body>
        <div id="d">
            Inside the div tag but outside the p tag.
            <p id="p">This is inside the p tag.</p>
        </div>
    </body>
    </html>
    
  • On the 7th line,
    print(soup.find(id="p").text)
    
    we find the element whose id is p and print its text. Only the text This is inside the p tag. was printed, because it is the only text inside the p tag.
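If you want the nested text split into tidy pieces instead of one blob, stripped_strings yields each fragment with whitespace trimmed; a sketch on the same structure, inlined here for convenience:

```python
from bs4 import BeautifulSoup

# The same structure as the tutorial page, inlined for illustration
html = ('<div id="d">Inside the div tag but outside the p tag.'
        '<p id="p">This is inside the p tag.</p></div>')
soup = BeautifulSoup(html, "html.parser")

# stripped_strings yields each text fragment with surrounding whitespace removed
pieces = list(soup.find(id="d").stripped_strings)
print(pieces)
```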
 

Conclusion

We have extracted some data from a web page with just a few lines of code.
 

Saturday, April 25, 2020

Web Scraping Using Requests - Intro

Intro to Requests Scraper

We will write a simple scraper using python as our base language. We will create the very simplest web scraper.
 

What is Web Scraping?

Web scraping is data extraction from a website.
 

What to do next?

We will now start to code. But first let's make sure we have the requirements.
  1. Install python. You can download python here depending on your OS. The installation of python will depend on the OS that you are using
  2. Install the required library, in this case requests. You can run the command.
    pip install requests
    or
    python -m pip install requests
  3. (Optional) You can download PyCharm here to make your coding faster. I will be using PyCharm for these tutorials, but you can also use Notepad and the command line.
 

Let's start to code.

  1. Create a directory where you will put your codes
  2. Create a file simple.py and paste the following code
    import requests
    url = "https://slackingslacker.github.io/simple.html"
    print(requests.get(url).text)
    
  3. Run the simple.py.
    python simple.py
    or Run on pycharm
     
    It should print the HTML
    <!DOCTYPE html>
    <html>
    <head>
        <meta charset="UTF-8">
        <title>Tutorials</title>
    </head>
    <body>
        <div id="d">
            Inside the div tag but outside the p tag.
            <p id="p">This is inside the p tag.</p>
        </div>
    </body>
    </html>
    
 

What did we do?

  • We import the requests library in order for us to use it.
    import requests
    
  • We then declare the URL that we will use for scraping.
    url = "https://slackingslacker.github.io/simple.html"
    
  • On the third line we access the URL using
    requests.get(url)
    
    Then get the text value by
    .text
    
    And print the HTML using
    print()
    
    method.
 

Conclusion

With just 3 lines of code we successfully scraped a website by getting its whole page.
 

Just an update to our simple scraper. The following are samples of different HTTP methods.

  1. Create a directory where you will put your codes
  2. Create a file simple.py and paste the following code
    import requests
    url = "https://slackingslacker.github.io/simple.html"
    print(requests.get(url).text)
    
    url = "http://slackingslacker.pythonanywhere.com"
    # Using GET Method
    print(requests.get(url+"/get").text)
    # Using POST Method
    print(requests.post(url+"/post").text)
    # Using PUT Method
    print(requests.put(url+"/put").text)
    # Using DELETE Method
    print(requests.delete(url+"/delete").text)
    # Using POST Method With FORM Submission
    print(requests.post(url+"/postdata",
                        data={"name":"slackingslacker",
                              "location": "earth",
                              "height": "normal human"}).text)
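As a side note (a sketch, not part of the original sample), requests can also build a request without sending it, which is handy for inspecting the form payload a POST will carry:

```python
import requests

# Prepare a form POST without sending it, to inspect what will go over the wire.
# The URL is the one from the sample above; nothing is transmitted here.
req = requests.Request("POST", "http://slackingslacker.pythonanywhere.com/postdata",
                       data={"name": "slackingslacker", "location": "earth"})
prepared = req.prepare()

print(prepared.method)
print(prepared.body)  # URL-encoded form payload
```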
    
 
