Wednesday, April 29, 2020

Web Scraping Using Requests - Searching Through Document

Requests and BeautifulSoup: Searching in a Document

We will use BeautifulSoup's find_all method to search for tag elements. There are several samples, and we can run them one by one.
 
Before we proceed, please make sure you have read the first and second posts in this series and completed the prerequisites.
 

Let's start coding.

  1. Create a file simpletagattr.py and paste the following code.
    from bs4 import BeautifulSoup
    import requests
    import re
    
    url = "https://slackingslacker.github.io/simpletags.html"
    html_doc = requests.get(url).text
    soup = BeautifulSoup(html_doc, "html.parser")
    
  2. Add the code below.
    print(soup.find_all("div"))
    
    Run the code. It should print all the div tag elements found in the HTML document. The first parameter of find_all is the tag name filter. It can be a string, a regular expression, a list, a function, or True.
  3. Comment out the line you just added by putting # at the start of the line.
    # print(soup.find_all("div"))
    
    Add this line and run the code again. It should print all the span and b tags.
    print(soup.find_all(["span", "b"]))
    
  4. Comment out the previous line, add the line below, then run the code.
    print(soup.find_all(re.compile("^s")))
    
    We used a regular expression here to find all tags whose names start with s. It should print tags such as span, style, sub, small, etc.
  5. Comment out the previous line, add the line below, then run the code.
    print(soup.find_all(string=re.compile("bold text")))
    
    We used the string argument to search for text rather than tags. Because the value is a regular expression, any text node containing bold text matches. The matching strings are printed; if nothing matches, the result is an empty list.
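  6. Comment out the previous line, add the line below, then run the code.
    print(soup.find_all(id="tableDiv"))
    
    We used the id attribute to find all tags whose id is tableDiv. In a well-formed page there should be only one, because an id must uniquely identify an element. Other attributes such as href, width, or height work the same way. Attribute names that are not valid Python identifiers, such as data-lang, cannot be written as keyword arguments and have to be passed through the attrs dictionary instead, for example attrs={"data-lang": "en"}.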
  7. Comment out the previous line, add the line below, then run the code.
    print(soup.find_all("span", "red"))
    
    The two positional parameters are the tag name and the CSS class, respectively. This finds all span tags with the class red.
  8. Comment out the previous line, add the line below, then run the code.
    print(soup.find_all(width="400"))
    
    This works like the id search, but on the width attribute. It should find two table tags.
  9. Comment out the previous line, add the line below, then run the code.
    print(soup.find_all(width="400", height="100", border="1", id="secondTable"))
    
    In the previous example we used only the width attribute. Here we combine the width, height, border and id attributes to get one specific table; a tag must match all of them.
  10. Comment out the previous line, add the lines below, then run the code.
    print(soup.find_all(name="table", width="400", limit=1))
    print(soup.find_all(name="b", limit=2))
    
    These examples limit the results to 1 and 2 tags respectively. This is useful when you only need the first tag or the first few.
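  11. Comment out the previous lines, add the lines below, then run the code.
    print(soup.find_all(class_="red"))
    print(soup.find_all(name="span", class_="red"))
    print(soup.find_all(name=re.compile("^sp"), class_="red", string="span"))
    print(soup.find_all(name="span", class_="black red-bg"))
    
    The examples above search by a tag's class. The keyword is class_ because class is a reserved word in Python. Each of the four uses a different set of criteria: any tag with the class red; only span tags with the class red; tags whose name starts with sp, that have the class red, and whose text is exactly span; and span tags whose class attribute is exactly the string black red-bg.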
  12. Comment out the previous lines, add the lines below, then run the code.
    def with_specific(tag):
        return tag.has_attr("id") and tag.get("id", "") == "listDiv"
    print(soup.find_all(with_specific))
    
    The example above uses a function to find a specific tag. The function first checks whether the tag has an id attribute:
    tag.has_attr("id")
    
    It then checks whether the value of the id attribute matches a specific string:
    tag.get("id", "") == "listDiv"
    
    The id value defaults to an empty string, so the comparison is safe even when the id attribute is missing. Finally, the function itself is passed to find_all:
    soup.find_all(with_specific)
    
    find_all calls the function for each tag and keeps the tags for which it returns True.
  13. Comment out the previous lines, add the lines below, then run the code.
    for tag in soup.find_all(True):
        print(tag.name)
    
    Passing True matches every tag in the document. We just loop through the resulting list and print each tag name.
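
The name filters from steps 2, 3, 4 and 13 can also be tried without a network connection. The sketch below is only an illustration: the HTML string is a made-up stand-in for the sample page, and the helper names() is hypothetical, not part of BeautifulSoup.

```python
import re
from bs4 import BeautifulSoup

# Made-up HTML standing in for the page fetched in step 1 (an assumption).
html_doc = """
<html><head><style>b { color: red; }</style></head>
<body>
  <div id="first"><span>one</span> <b>bold text</b></div>
  <div id="second"><small>two</small> <sub>three</sub></div>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")


def names(tags):
    """Hypothetical helper: reduce a result list to its tag names."""
    return [tag.name for tag in tags]


by_string = names(soup.find_all("div"))            # a single tag name
by_list = names(soup.find_all(["span", "b"]))      # any name in the list
by_regex = names(soup.find_all(re.compile("^s")))  # names starting with s
all_tags = names(soup.find_all(True))              # every tag

print(by_string, by_list, by_regex, all_tags, sep="\n")
```

Results come back in document order, so the list filter returns the span before the b here simply because it appears first in the markup.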
 
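The attribute searches from steps 6, 8, 9 and 10 can be sketched the same way. The markup below is invented for the example (the table ids other than secondTable, and the data-lang paragraph, are assumptions), and ids() is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Invented markup, loosely modelled on the tables described in the steps.
html_doc = """
<div id="tableDiv">
  <table id="firstTable" width="400" height="200" border="0"></table>
  <table id="secondTable" width="400" height="100" border="1"></table>
  <table id="thirdTable" width="300"></table>
</div>
<p data-lang="en">hello</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")


def ids(tags):
    """Hypothetical helper: reduce a result list to the id values."""
    return [tag.get("id") for tag in tags]


by_id = ids(soup.find_all(id="tableDiv"))
by_width = ids(soup.find_all(width="400"))
# Several keyword filters must all match on the same tag.
combined = ids(soup.find_all(width="400", height="100", border="1"))
# limit stops the search after the first match.
limited = ids(soup.find_all("table", width="400", limit=1))
# data-lang is not a valid keyword name, so it goes through attrs.
by_data = [tag.name for tag in soup.find_all(attrs={"data-lang": "en"})]

print(by_id, by_width, combined, limited, by_data, sep="\n")
```

Attribute values are compared as strings, which is why width is written as "400" rather than the number 400.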

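Class matching in step 11 has one behaviour that is easy to miss: a multi-class string such as "black red-bg" is compared against the class attribute exactly as written, so the same classes in a different order do not match. The sketch below shows this on made-up markup, next to a CSS selector as one possible order-independent alternative, and repeats the function and string filters from steps 12 and 5. The helper names texts() and has_list_id() are hypothetical.

```python
import re
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
html_doc = """
<div id="listDiv">
  <span class="red">span one</span>
  <span class="black red-bg">span two</span>
  <span class="red-bg black">span three</span>
  <b class="red">bold text</b>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")


def texts(tags):
    """Hypothetical helper: reduce a result list to the tags' text."""
    return [tag.get_text() for tag in tags]


any_red = texts(soup.find_all(class_="red"))            # any tag with class red
red_spans = texts(soup.find_all("span", class_="red"))  # only span tags
exact = texts(soup.find_all("span", class_="black red-bg"))  # exact string
either_order = texts(soup.select("span.black.red-bg"))  # CSS selector


def has_list_id(tag):
    """Function filter: True only for the tag whose id is listDiv."""
    return tag.get("id") == "listDiv"


by_function = [tag.name for tag in soup.find_all(has_list_id)]
# string= returns the matching text nodes, not the tags around them.
by_text = [str(s) for s in soup.find_all(string=re.compile("bold text"))]

print(any_red, red_spans, exact, either_order, by_function, by_text, sep="\n")
```

Note that tag.get("id") returns None when the attribute is missing, so the comparison in has_list_id is safe without a separate has_attr check.
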
Conclusion

We used find_all to search for tags in several ways: by name, list, regular expression, text, attribute, class, and function.
 
