Tuesday, May 5, 2020

Web Scraping Using Requests - Query Selectors/CSS Selectors

Requests and BeautifulSoup Searching Using CSS Selectors

We will use BeautifulSoup to find different tags using different CSS selectors.
 
Before we proceed, please make sure you have read the first and second posts in this series and completed the prerequisites.
 

Let's start coding.

  1. Create a file simpleselectors.py and paste the following code. This time we wrap the search in a function that prints every tag it finds.
    from bs4 import BeautifulSoup
    import requests
    
    # Download the sample page and parse it with Python's built-in HTML parser.
    url = "https://slackingslacker.github.io/simpleselectors.html"
    html_doc = requests.get(url).text
    soup = BeautifulSoup(html_doc, "html.parser")
    
    
    def find_by_selector(selector: str):
        # select() returns every element that matches the CSS selector.
        elements = soup.select(selector)
        for el in elements:
            print("============================")
            print("name : " + el.name)
            print("attributes : " + str(el.attrs))
            print(el)
    
  2. Add the line below to the end of the file.
    find_by_selector(".column")
    
    Run the code. It should print every element whose class attribute includes column, which on this page are the div tags.
  3. Comment out the line you just added by putting # at the start of it, then add the line below and run the code.
    find_by_selector(".table.is-narrow")
    
    It should print all table tags that have both the table and is-narrow classes.
  4. Comment out the previous line, add the line below, then run the code.
    find_by_selector(".columns .table")
    
    It should print all table tags that have the class table and are descendants, at any depth, of an element (here a div) with the class columns. The space in the selector does not require a direct parent-child relationship.
  5. Comment out the previous line, add the line below, then run the code.
    find_by_selector("#total")
    
    It should print the span tag whose id attribute is total.
  6. Comment out the previous line, add the line below, then run the code.
    find_by_selector("div")
    
    It should print all div tags.
  7. Comment out the previous line, add the line below, then run the code.
    find_by_selector("td.has-text-link")
    
    It should print all td tags that have the class has-text-link.
  8. Comment out the previous line, add the line below, then run the code.
    find_by_selector("b,i")
    
    It should print all b tags and all i tags.
  9. Comment out the previous line, add the line below, then run the code.
    find_by_selector("table th")
    
    It should print all th tags nested anywhere inside a table tag.
  10. Comment out the previous line, add the line below, then run the code.
    find_by_selector("div > span")
    
    It should print all span tags that are direct children of a div tag.
  11. Comment out the previous line, add the line below, then run the code.
    find_by_selector("span~p")
    
    It should print all p tags that come after a span tag and share the same parent with it.
  12. Comment out the previous line, add the line below, then run the code.
    find_by_selector("[colspan]")
    
    It should print all tags that have a colspan attribute.
  13. Comment out the previous line, add the line below, then run the code.
    find_by_selector("[colspan='2']")
    
    It should print all tags whose colspan attribute has the value 2.
  14. Comment out the previous line, add the line below, then run the code.
    find_by_selector("[class^='has']")
    
    It should print all tags whose class attribute value starts with has.
  15. Comment out the previous line, add the line below, then run the code.
    find_by_selector("[class$='link']")
    
    It should print all tags whose class attribute value ends with link.
  16. Comment out the previous line, add the line below, then run the code.
    find_by_selector("[class*='text']")
    
    It should print all tags whose class attribute value contains text.
  17. Comment out the previous line, add the line below, then run the code.
    find_by_selector("span:empty")
    
    It should print all span tags that have no children, meaning no child tags and no text.
  18. Comment out the previous line, add the line below, then run the code.
    find_by_selector("tr:first-child")
    
    It should print all tr tags that are the first child of their parent tag.
  19. Comment out the previous line, add the line below, then run the code.
    find_by_selector("tr:last-child")
    
    It should print all tr tags that are the last child of their parent tag.
  20. Comment out the previous line, add the line below, then run the code.
    find_by_selector("td:nth-child(3)")
    
    It should print all td tags that are the 3rd child of their parent tag.
  21. Comment out the previous line, add the line below, then run the code.
    find_by_selector("a:first-of-type")
    
    It should print all a tags that are the first a tag (first of type anchor) among the children of their parent. The tag does not have to be the first child overall.
  22. Comment out the previous line, add the line below, then run the code.
    find_by_selector("a:last-of-type")
    
    It should print all a tags that are the last a tag (last of type anchor) among the children of their parent. The tag does not have to be the last child overall.
  23. Comment out the previous line, add the line below, then run the code.
    find_by_selector("a:nth-of-type(2)")
    
    It should print all a tags that are the 2nd a tag (2nd of type anchor) among the children of their parent.
  24. Comment out the previous line, add the line below, then run the code.
    find_by_selector("span:only-child")
    
    It should print all span tags that are the only child of their parent tag.
  25. Comment out the previous line, add the line below, then run the code.
    find_by_selector("table > tr:not(:first-child)")
    
    It should print all tr tags that are direct children of a table and are not its first child. In other words, it prints every row except the one that holds the column titles.
  26. Comment out the previous line, add the line below, then run the code.
    find_by_selector("div.column:nth-of-type(1) > table > tr:nth-child(4) > td:nth-child(2)")
    
    It should print every td tag that is the 2nd child of a tr, where that tr is the 4th child of a table, and that table is a direct child of a div with the class column that is the first div (first of type) within its parent tag.
  27. Comment out the previous line, add the line below, then run the code.
    print(soup.select_one("div.column:nth-of-type(1) > table > tr:nth-child(4) > td:nth-child(2)").text)
    
    This uses the same selector as the previous step, but select_one returns only the first tag that matches the query. Printing its text should display Chicken Legs.
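
The difference between the combinators used in the steps above (space, >, ~) is easier to see on a tiny document. The sketch below is only an illustration: the markup in SAMPLE_HTML and the helper texts are made up for this example and are not the tutorial page. It also shows the + combinator, which the steps above do not use, for comparison with ~.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only; it is not the tutorial's page.
SAMPLE_HTML = """
<div class="columns">
  <div class="column">
    <p>intro</p>
    <span id="total">3</span>
    <p>first</p>
    <section>
      <table class="table"><tr><th>Name</th></tr></table>
    </section>
    <p>second</p>
  </div>
</div>
"""

sample = BeautifulSoup(SAMPLE_HTML, "html.parser")


def texts(selector: str) -> list:
    """Return the stripped text of every element matched by the selector."""
    return [el.get_text(strip=True) for el in sample.select(selector)]


# Descendant combinator (space): the table sits three levels below .columns.
print(len(sample.select(".columns .table")))    # 1
# Child combinator (>): the table is not a direct child of .columns.
print(len(sample.select(".columns > .table")))  # 0
# General sibling combinator (~): every p after a span with the same parent.
print(texts("span ~ p"))                        # ['first', 'second']
# Adjacent sibling combinator (+): only the p immediately after the span.
print(texts("span + p"))                        # ['first']
```

Note that the p containing intro is never matched by span ~ p because it comes before the span, not after it.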
 
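
The attribute selectors and pseudo-classes from the steps above can also be tried without a network request. This is a minimal sketch on a made-up table (TABLE_HTML and matched_texts are names invented for this example). One detail it illustrates: as far as I can tell, ^=, $= and *= compare against the whole class attribute string, so a tag with class="is-bold has-text-link" does not match [class^='has'] even though td.has-text-link matches it.

```python
from bs4 import BeautifulSoup

# Made-up table for illustration only; it is not the tutorial's page.
TABLE_HTML = """
<table>
  <tr><th colspan="2">Item</th><th>Price</th></tr>
  <tr><td class="has-text-link">Eggs</td><td><span></span></td><td>2</td></tr>
  <tr><td class="is-bold has-text-link">Milk</td><td><span>x</span></td><td>3</td></tr>
</table>
"""

table_soup = BeautifulSoup(TABLE_HTML, "html.parser")


def matched_texts(selector: str) -> list:
    """Return the stripped text of every element matched by the selector."""
    return [el.get_text(strip=True) for el in table_soup.select(selector)]


# Class selector: matches any tag that has this class among its classes.
print(matched_texts("td.has-text-link"))   # ['Eggs', 'Milk']
# Prefix match on the whole attribute string: "is-bold has-text-link" fails.
print(matched_texts("[class^='has']"))     # ['Eggs']
# Suffix match on the whole attribute string: both values end with "link".
print(matched_texts("[class$='link']"))    # ['Eggs', 'Milk']
# Exact attribute value.
print(matched_texts("[colspan='2']"))      # ['Item']
# Third child of each row; the header row has only two children.
print(matched_texts("td:nth-child(3)"))    # ['2', '3']
# Only the span with no child tags and no text counts as empty.
print(len(table_soup.select("span:empty")))                    # 1
# Every row except the first child of the table (the header row).
print(len(table_soup.select("table > tr:not(:first-child)")))  # 2
```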

Conclusion

We used select to get all matching tags and select_one to get only the first matching tag, using a variety of CSS selectors.
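
One practical difference between the two methods is what happens when nothing matches: select returns an empty list, while select_one returns None, so calling .text on the result directly would raise an error. The sketch below shows one way to guard against that. The markup in MENU_HTML and the helper first_text are made up for this example.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only; it is not the tutorial's page.
MENU_HTML = "<table><tr><td>Item</td><td>Chicken Legs</td></tr></table>"
menu = BeautifulSoup(MENU_HTML, "html.parser")


def first_text(soup, selector, default=""):
    """Return the text of the first match, or default when nothing matches."""
    match = soup.select_one(selector)
    if match is None:
        return default
    return match.get_text(strip=True)


print(first_text(menu, "td:nth-child(2)"))    # Chicken Legs
print(first_text(menu, "td.missing", "n/a"))  # n/a
print(len(menu.select("td.missing")))         # 0
```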
 
