Saturday, May 16, 2020

Web Scraping Using Selenium - Intro

Selenium Introduction

The purpose of this series is to work through the different features described in the Selenium documentation. Each feature will get its own example so that we can better understand when and where to use it.
 

Background

Selenium is usually used for automated testing, but it can also be used for scraping websites. Many modern websites are built as Single-Page Applications (SPAs) whose content is lazy loaded: the basic structure (HTML and CSS) is loaded first, and the data is loaded afterwards through API calls and JavaScript. HTTP clients such as the requests library for Python (tutorials can be found here) or Guzzle for PHP do not execute JavaScript, so they only see the initial HTML. There are workarounds for handling those JavaScript interactions, such as calling the underlying API directly, but sometimes they lead to a dead end. Selenium solves this kind of problem by driving a real browser, so it interacts with the website just like a person would.
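
To make the lazy-loading point concrete, here is a small sketch that needs no network access. The HTML string is a made-up example of what an HTTP client such as requests might receive from a single-page application (the `app` id and the `/static/bundle.js` path are assumptions for illustration, not taken from any real site). It uses Python's standard html.parser to collect the text inside the container that the JavaScript would normally fill in.

```python
from html.parser import HTMLParser

# Hypothetical initial HTML of a single-page application: the structure is
# there, but the data container is empty until JavaScript runs in a browser.
INITIAL_HTML = """
<html>
  <head><title>Products</title></head>
  <body>
    <div id="app"></div>
    <script src="/static/bundle.js"></script>
  </body>
</html>
"""


class ContainerText(HTMLParser):
    """Collect the text found inside the element with a given id.

    This simple depth counter assumes the container holds no void
    elements such as <br> written without a closing slash.
    """

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # greater than 0 while inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def text_in_container(html, target_id="app"):
    """Return the visible text inside the element with id target_id."""
    parser = ContainerText(target_id)
    parser.feed(html)
    return " ".join(parser.chunks)


print(repr(text_in_container(INITIAL_HTML)))
```

The printed result is an empty string: the structure arrived, but the data did not. A browser controlled by Selenium would run the script first, so the same container would have content by the time we read it.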
 

Getting the Software Required for Selenium

We will now start to code. But first, let's make sure that we have everything required.
  1. Install Python. You can download Python here depending on your OS. The installation steps will depend on the OS that you are using.
  2. Install the required library, in this case selenium. You can run the command:
    pip install selenium
    or
    python -m pip install selenium
  3. (Optional) You can download PyCharm here to make your coding faster. I will be using PyCharm for these tutorials, but you can also use Notepad and the command line.
  4. Download the browser drivers and put them in a directory that is accessible to the app. You can also paste them there later. I will be using Firefox for most of the tutorials, as geckodriver and Firefox tend to stay compatible even when the browser is updated.
  5. (Optional) Install the Chrome and Firefox browsers if you do not already have them.
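
Before moving on, it can help to check that the steps above worked. The sketch below is one possible way to do that using only the standard library. The helper names `is_installed` and `find_driver` are made up for this post, and the current directory is only assumed as the place where the drivers were pasted.

```python
import importlib.util
import shutil
from pathlib import Path


def is_installed(module_name):
    """Return True if a top-level module can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None


def find_driver(name, folder="."):
    """Look for a driver executable in folder first, then on the PATH.

    Returns the path as a string, or None if the driver was not found.
    """
    candidates = (name, name + ".exe")
    # First look in the given folder (for example, the project directory)
    for candidate in candidates:
        local = Path(folder) / candidate
        if local.is_file():
            return str(local)
    # Then fall back to anything discoverable through the PATH
    for candidate in candidates:
        on_path = shutil.which(candidate)
        if on_path:
            return on_path
    return None


if __name__ == "__main__":
    print("selenium installed:", is_installed("selenium"))
    for driver_name in ("geckodriver", "chromedriver"):
        print(driver_name, "->", find_driver(driver_name))
```

If a driver shows up as None, it has not been downloaded yet or sits in a directory that the script cannot see.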
 

Coding My First Selenium Program

  1. Create a directory where you will put your code.
  2. Copy the drivers that you have downloaded and paste them in the directory you've created.
  3. Create a file seleniumintro.py and paste the following code:
    from selenium import webdriver
    import time
    
    The code above imports the libraries that we will use.
  4. Add this line
    driver = webdriver.Firefox(executable_path="geckodriver.exe")
    
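    The code above will create a webdriver instance for Firefox. The executable_path argument points to the geckodriver file copied in step 2. This form targets Selenium 3; Selenium 4 deprecated the executable_path argument in favor of a Service object.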
  5. Add this line
    driver.get("https://slackingslacker.github.io/seleniumindex")
    
    The line will go to the website (https://slackingslacker.github.io/seleniumindex).
  6. Add this line
    time.sleep(5)
    
    We will pause the program for 5 seconds. This is to ensure that you can see what's happening in the browser.
  7. Add this line
    driver.close()
    
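    The line will close the browser window. Since it is the only window open, the browser closes as well. To also shut down the driver process, driver.quit() can be used instead.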
  8. Add this line
    driver = webdriver.Chrome(executable_path="chromedriver.exe")
    
    This time we are going to use the Chrome browser to access the website.
  9. Add this line
    driver.get("https://slackingslacker.github.io/seleniumindex")
    
    The line will go to the website (https://slackingslacker.github.io/seleniumindex) in Chrome.
  10. Add this line
    time.sleep(5)
    
    We will again pause the program for 5 seconds.
  11. Add this line
    driver.close()
    
    The line will close the browser window, and with it the browser.
  12. Run seleniumintro.py:
    python seleniumintro.py
    or run it in PyCharm.
     
    It should do the following:
    1. Opens the Firefox browser (assuming you have Firefox installed).
    2. Goes to the website https://slackingslacker.github.io/seleniumindex
    3. Waits for 5 seconds.
    4. Closes the Firefox browser.
    5. Opens the Chrome browser (assuming you have Chrome installed).
    6. Goes to the website https://slackingslacker.github.io/seleniumindex
    7. Waits for 5 seconds.
    8. Closes the Chrome browser.
 

Final Selenium Code

from selenium import webdriver
import time

# Open the page in Firefox, pause so the browser is visible, then close it
driver = webdriver.Firefox(executable_path="geckodriver.exe")
driver.get("https://slackingslacker.github.io/seleniumindex")
time.sleep(5)
driver.close()

# Repeat the same steps in Chrome
driver = webdriver.Chrome(executable_path="chromedriver.exe")
driver.get("https://slackingslacker.github.io/seleniumindex")
time.sleep(5)
driver.close()
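
The final code repeats the same four steps for each browser, and if driver.get fails, the program never reaches driver.close(), so the browser stays open. One possible way to tidy this up is sketched below. The helper names `visit` and `run_both_browsers` are made up for this post, and the driver setup assumes a Selenium 4 installation, where the driver path is passed through a Service object rather than executable_path. Treat it as a sketch under those assumptions, not as the only way to do it.

```python
import time

URL = "https://slackingslacker.github.io/seleniumindex"


def visit(make_driver, url, pause_seconds=5):
    """Create a driver, open url, pause, and always quit the browser.

    make_driver is any zero-argument callable that returns a
    WebDriver-like object with get(), title and quit().
    """
    driver = make_driver()
    try:
        driver.get(url)
        time.sleep(pause_seconds)  # only so that you can watch the browser
        return driver.title
    finally:
        # quit() ends the whole session, even if get() raised an error
        driver.quit()


def run_both_browsers(url=URL):
    """Repeat the visit in Firefox and then Chrome (assumes Selenium 4)."""
    # Imported here so that visit() can be tried without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service as ChromeService
    from selenium.webdriver.firefox.service import Service as FirefoxService

    factories = [
        lambda: webdriver.Firefox(
            service=FirefoxService(executable_path="geckodriver.exe")),
        lambda: webdriver.Chrome(
            service=ChromeService(executable_path="chromedriver.exe")),
    ]
    for make_driver in factories:
        print(visit(make_driver, url))
```

Calling run_both_browsers() should go through the same eight steps as the final code, with the difference that each browser is closed even when something goes wrong in between. Passing the driver in as a callable also means visit can be tried with a stand-in object before any browser is installed.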

 

Conclusion

Using Selenium, we can open a browser and automatically use the functionality of a website. With just a few lines of code, we can get started with Selenium.
 
