In this beginner-friendly tutorial, I explain how to set up a Selenium Scrapper in Flask on EC2. If you are stuck trying to do it or just beginning to wonder What is Selenium then this blog is a great place to get started.
The prerequisite to this Tutorial is just that you have a running EC2 instance on AWS and you are connected to it.
If not this will help you.
⚠️⚠️ WARNING ⚠️⚠️
Before going ahead with this tutorial make sure that your Security Group on your EC2 instance has HTTP access from anywhere and also exposed the 5000 port number as it is very important. Else you won't be able to access the site at the end of this tutorial.
Start a new screen and become root user
Install python3 , pip, git and virtualenv
yum install python3 -y
yum install git -y
yum install pip
pip install virtualenv
Create a virtual environment with python3 and activate that environment. Replace 3.7 with whatever version you have. To get version use:
Create a new virtual env name my-app
virtualenv -p python3.7 my_app
Activate virtual env
You need some browser to run webdrivers. I am using chrome browser and chrome webdriver here.
It is important to download browser first and then identify it's version and accordingly install webdriver for that particular version. Please follow these steps carefully.
Enter the below command to install chrome
curl https://intoli.com/install-google-chrome.sh | bash
sudo mv /usr/bin/google-chrome-stable /usr/bin/google-chrome
To get the version of google chrome installed
The above code will print the chrome version.
Now go to the following website to get the suitable webdriver for that particular version of chrome.
now my version of chrome is 104.
So based on that I will select the following chrome driver
Click on your driver link and from there copy the link to Linux O.S.
Than use the following command along with your version of webdriver.
sudo mv chromedriver /usr/bin/chromedriver
To check if chrome driver is installed or not
Install necessary python libraries
Now we will install all necessary lib in python using pip
pip install flask
pip install selenium
pip install waitress
Run the program
mkdir your_repo && cd your_repo/
use your fav editor to paste this code into the application.py file.
from selenium import webdriver from flask import Flask from selenium.webdriver.chrome.options import Options from waitress import serve app = Flask(__name__) @app.route('/') def hello(): options = Options() options.add_argument("--headless") options.add_argument("--disable-gpu") options.add_argument("--no-sandbox") options.add_argument("enable-automation") options.add_argument("--disable-infobars") options.add_argument("--disable-dev-shm-usage") driver = webdriver.Chrome(options=options) driver.get("https://www.google.com/") element_text = driver.page_source driver.quit() return element_text if __name__ == '__main__': serve(app,host = '0.0.0.0',port = 5000)
and run the below command in the command terminal after saving the application.py file.
The above code will return the HTML source code of the google home page. You are basically running selenium webdriver here and getting the site data of google.com. You can change it to any website you like to scrap.
Use the public IP of your EC2 instance and add the :5000 port number in back of the Public URL.(use HTTP instead of HTTPS)
When the website is google.com
When the website is youtube.com
Thank you for READING 😊😊