In the era of digitization, businesses have focused on building their customer base through online platforms. In recent years, most products and services have become available online. So how do companies stay ahead in this competitive market? Are you curious to explore how it is done?
Before answering that, let's understand what web scraping is.
Web scraping is the process of extracting data and content from a website. The information acquired is collected and exported into a format the user can easily interpret, such as a spreadsheet. Although you can perform web scraping manually, it is advisable to opt for an automated tool: such tools make scraping tasks faster and comparatively cheaper. As the nature of websites continues to evolve, web scrapers have continued to vary in their functionality and features. Legitimate web scraping lets a business collect and rank website content, compare prices against competitors' websites, and support market research, such as understanding user sentiment on a social media platform. However, many web scraping tools have also been used for illegal purposes, such as copyright theft. Therefore, it is essential to understand how authorized tools can be used effectively for legitimate web scraping tasks. Amazon's AWS Lambda offers one of the most trusted services that can be used for web scraping.
Still wondering what AWS Lambda is and how it contributes to web scraping? Let's dive in.
What is AWS Lambda?
While automated tools are the preferred choice for web scraping tasks, AWS Lambda takes things a notch higher. AWS Lambda is a computing service that lets you run code without managing servers or runtimes or maintaining event integrations. Using AWS Lambda, you can run code for different types of applications or backend services. Does that raise the question of cost? It is simple: you pay only for the compute time consumed while your task runs.
Additionally, there are no charges when your code is not running. Another key aspect of this service is the minimal administration it requires. AWS Lambda automatically executes your code in response to an incoming request or event, with complete management of the computing resources: the operating system, server maintenance, automatic scaling, logging, and code monitoring.
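To make that concrete, here is a minimal, hypothetical sketch of what a Python Lambda handler looks like; the event fields and the greeting logic are purely illustrative and not part of any specific AWS setup.

import json

def lambda_handler(event, context):
    # AWS invokes this function with the triggering event (an HTTP request,
    # a schedule tick, a file upload, etc.) and a context object with runtime info.
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": "Hello, {}!".format(name)}),
    }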
AWS Lambda for Web Scraping
Now that you know what AWS Lambda is and what it does, let's shift our focus to the primary question: why should you use AWS Lambda for web scraping? There are several tools available for web scraping, but AWS is a reliable solution from a well-known company. One of the key advantages of using AWS Lambda for such work is cost. You don't pay for dedicated servers; you pay only for the execution of the task. That makes it an economically viable choice for scraping jobs that run regularly, every few hours or days.
Some web pages last only a short time, such as news flashes, airline booking pages, or e-commerce pages displaying deal-of-the-day offers. To capture that data, you need your scraping tool to run on an automated schedule. With AWS Lambda, you can schedule a function to run automatically, without having to start or stop a server yourself, as shown in the sketch below. You can also trigger your code directly from a web app or mobile app. Besides, you can write a Lambda function in the language you are proficient in, such as Python, Node.js, Java, and several more. Finally, you have access to both serverless framework and container tooling for your web scraping solutions.
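For instance, Chalice (the framework used later in this article) exposes Lambda's scheduling through a decorator. The sketch below is only illustrative; the app name, the hourly rate, and the function body are placeholders.

from chalice import Chalice, Rate

app = Chalice(app_name='scheduled-scraper')

# Run the scraping job automatically every hour; no server needs to be
# started or stopped for this to happen.
@app.schedule(Rate(1, unit=Rate.HOURS))
def periodic_scrape(event):
    # Placeholder for the actual scraping logic.
    print("Scrape triggered at:", event.time)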
The main downside of using AWS Lambda is the lack of local storage. To meet your storage requirements, you will need to connect to other Amazon services, such as S3, that offer storage while working with AWS Lambda. Another aspect that can be challenging for new users is the documentation: with an enormous number of tutorials available, it is easy to get lost navigating them.
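For example, scraped results are often written to Amazon S3 with boto3. The sketch below assumes a bucket you have already created; the bucket name and object key are placeholders.

import json
import boto3

def save_results_to_s3(results):
    # Lambda's local filesystem is ephemeral, so persist scraped data to S3 instead.
    s3 = boto3.client("s3")
    s3.put_object(
        Bucket="my-scraper-results",          # placeholder bucket name
        Key="producthunt/top-product.json",   # placeholder object key
        Body=json.dumps(results),
    )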
Building your Serverless Web Scraper with AWS Lambda, Python, and Chalice
Setting up the Dev Environment
First, create a Python virtual environment and install Chalice. You should also install the requests_html dependency with pip, which makes writing the web scraper much easier.
mkvirtualenv serverless_scraping
pip install chalice
pip install requests_html
Set Up Chalice with AWS
You need to ensure your machine is authenticated with your AWS account. You will need access keys; to get them, open the security credentials page from the dropdown menu in the top right corner. Expand the access keys section and click on "Create New Access Key." Finally, save the keys in an AWS config file. To do this, create the AWS config folder, then create and open a new file:
mkdir ~/.aws
nano ~/.aws/config
Paste the following into the file, replacing the placeholder keys and region with the specific keys and region you created.
[default]
aws_access_key_id=YOUR_ACCESS_KEY_HERE
aws_secret_access_key=YOUR_SECRET_ACCESS_KEY
region=YOUR_REGION (such as us-west-2, us-west-1, etc)
Creating a Scraper Script
Create the Chalice project using the following command:
chalice new-project producthunt-scraper
Replace the app.py file inside the Chalice project with the following code:
from chalice import Chalice
from requests_html import HTMLSession

app = Chalice(app_name='producthunt-scraper')

@app.route("/product-hunt/top-product")
def get_top_product_product_hunt():
    # Fetch the Product Hunt home page.
    session = HTMLSession()
    url = 'https://www.producthunt.com/'
    resp = session.get(url)

    # Locate the container that holds the list of products.
    product_list_containers = resp.html.find(".postsList_b2208")
    if len(product_list_containers) == 1:
        product_list = product_list_containers[0]
    elif len(product_list_containers) > 1:
        product_list = product_list_containers[1]
    else:
        product_list = None

    if product_list:
        # The first list item is the top-ranked product of the day.
        top_product = product_list.find("li")[0]
        product_obj = {
            "name": top_product.find(".content_31491", first=True).find("h3", first=True).text,
            "url": "https://producthunt.com{url}".format(url=top_product.find("a", first=True).attrs["href"]),
            "description": top_product.find(".content_31491", first=True).find("p", first=True).text,
            "upvote_count": top_product.find(".voteButtonWrap_4c515", first=True).text,
        }
        return product_obj
    else:
        return {"error": "Product List Element Not Found"}
The key part of this code is the decorated function definition:
@app.route("/product-hunt/top-product")
def get_top_product_product_hunt():
Serverless functions in Chalice look much like the regular Python functions you already write. The only addition is an @app decorator that wires the function up. In this example, @app.route is used so the function is called when an HTTP request is made.
Note that the main body of the function uses the requests_html package to parse the HTML document and extract elements based on class names and HTML tags, as in the standalone sketch below. You can also observe that it returns either an object describing the top product or an error.
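As a quick, standalone illustration of those requests_html calls (the URL and selectors here are placeholders, not part of the scraper):

from requests_html import HTMLSession

session = HTMLSession()
resp = session.get("https://example.com/")  # placeholder URL

# find() takes a CSS selector; first=True returns a single element instead of a list.
heading = resp.html.find("h1", first=True)
links = resp.html.find("a")

print(heading.text)                             # visible text of the element
print([a.attrs.get("href") for a in links])     # attributes are exposed via .attrs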
Deployment
Once you have built the scraper, you can test it locally using the chalice local command.
When you are ready to proceed, use the chalice deploy command for deployment. Chalice handles the rest, including creating the AWS Lambda function in the console and packaging all the required dependencies for AWS Lambda to use. The deploy command prints a URL: the public URL of the serverless function, your Product Hunt scraper.
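Once deployed, the endpoint can be called like any other HTTP API. In the sketch below, the base URL is a placeholder for the one chalice deploy prints for you.

import requests

# Replace with the URL printed by `chalice deploy`.
API_URL = "https://abc123.execute-api.us-west-2.amazonaws.com/api"

resp = requests.get(API_URL + "/product-hunt/top-product")
print(resp.json())  # e.g. {"name": ..., "url": ..., "description": ..., "upvote_count": ...}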
Conclusion
Anyone looking for web-scraping solutions should gain an in-depth knowledge of AWS Lambda and proficiency in a language such as Python or Java. If you are building your own web scraper to deploy on AWS Lambda, there are several factors you should take care of, such as better error handling, protecting the API with an API key, and having a database or storage service for your data (the sketch after this paragraph illustrates two of these). This article aimed to provide crucial information on AWS Lambda and its uses for web scraping, and to get you started with a simple example for gaining more insight.
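As a rough illustration of two of those hardening steps, the sketch below adapts the scraper route to require an API key and to raise a structured error. The selector is the one used earlier; the simplified return value is only for illustration.

from chalice import Chalice, NotFoundError
from requests_html import HTMLSession

app = Chalice(app_name='producthunt-scraper')

# api_key_required=True tells API Gateway to reject requests without a valid API key.
@app.route("/product-hunt/top-product", api_key_required=True)
def get_top_product_product_hunt():
    session = HTMLSession()
    resp = session.get("https://www.producthunt.com/")
    containers = resp.html.find(".postsList_b2208")
    if not containers:
        # A structured Chalice error gives the caller a clean 404 instead of a generic 500.
        raise NotFoundError("Product list element not found")
    top_product = containers[0].find("li")[0]
    return {"name": top_product.find("h3", first=True).text}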
References:
Here is the reference link to the original project for gaining more insight: