This repository contains code for a serverless Fav Icon Scraper implemented using AWS CDK and TypeScript.
This project consists of an AWS CDK stack that sets up an S3 bucket, an SQS queue, and two Lambda functions to scrape and download favicons from URLs.
The first Lambda function, list_parser, is triggered when a new .txt file is uploaded to the S3 bucket. It reads the file and creates one SQS message for each URL found in it.
The second Lambda function, scrapper, is triggered by the messages that list_parser adds to the SQS queue. It scrapes the favicons from the URLs in the messages and saves them to the S3 bucket.
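The parsing step that list_parser performs can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the names parseUrlList, toSqsBatches, and BatchEntry are hypothetical, and the sketch assumes one URL per line. The batch size of 10 reflects the SQS limit on entries per SendMessageBatch request.

```typescript
// Hypothetical helpers for illustration; names are not taken from the repository.

/** One entry of an SQS SendMessageBatch request. */
interface BatchEntry {
  Id: string;
  MessageBody: string;
}

/** SQS accepts at most 10 entries per SendMessageBatch call. */
const MAX_BATCH_SIZE = 10;

/** Split the uploaded file's text into trimmed, non-empty URL lines. */
function parseUrlList(fileContents: string): string[] {
  return fileContents
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
}

/** Group URLs into batches of at most 10 entries, one message per URL. */
function toSqsBatches(urls: string[]): BatchEntry[][] {
  const batches: BatchEntry[][] = [];
  for (let i = 0; i < urls.length; i += MAX_BATCH_SIZE) {
    const chunk = urls.slice(i, i + MAX_BATCH_SIZE);
    // Ids only need to be unique within a single batch; the index is enough.
    batches.push(
      chunk.map((url, j) => ({ Id: `url-${i + j}`, MessageBody: url }))
    );
  }
  return batches;
}
```

Each inner array could then be passed as the Entries of one batch send, so a file with 23 URLs would need three requests.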
The architecture of the Fav Icon Scraper is as follows:
- A text file containing a list of website URLs is uploaded to the Amazon S3 bucket.
- The file upload triggers an S3 Put event, which invokes a Lambda function called list_parser.
- The list_parser Lambda function reads the file from S3 and adds the URLs to an SQS queue named url_sns_queue.
- The url_sns_queue triggers another Lambda function named scrapper.
- The scrapper Lambda function takes the URL from the message, downloads the favicon, and saves it to the same S3 bucket in a separate directory.
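The last step, turning a URL from the queue into a download location and a storage key, might look something like the sketch below. It rests on assumptions: the favicon is taken from the conventional /favicon.ico path at the site root (a real scraper may instead parse the page's icon link tags), and the key layout icons/<hostname>.ico is a guess at how the separate directory is organised. The names faviconTargetFor and FaviconTarget are hypothetical.

```typescript
// Hypothetical helper for illustration; not taken from the repository.

/** Where to fetch a favicon from and where to store it in the bucket. */
interface FaviconTarget {
  faviconUrl: string;
  s3Key: string;
}

/**
 * Derive the favicon URL and S3 object key for a page URL.
 * Assumes the icon lives at /favicon.ico on the site's origin.
 */
function faviconTargetFor(pageUrl: string): FaviconTarget {
  // The URL constructor throws a TypeError on malformed input.
  const page = new URL(pageUrl);
  const faviconUrl = new URL("/favicon.ico", page.origin).toString();
  // Assumed key layout: one object per hostname under the icons/ prefix.
  return { faviconUrl, s3Key: `icons/${page.hostname}.ico` };
}
```

Keying by hostname means several pages from the same site map to a single stored icon, which keeps repeated URLs from producing duplicate objects.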
Before you can deploy the Fav Icon Scraper, you will need:
- An AWS account
- AWS CLI installed on your local machine
- AWS CDK
- Node.js version 12.x or higher installed on your local machine
To set up a local development environment:
- Clone this repository to your local machine: git clone https://github.com/nijeesh4all/fav-icon-scrapper-aws-cdk.git
- Navigate to the project directory: cd fav-icon-scrapper-aws-cdk
- Install the dependencies: npm install
- Configure your AWS credentials using the AWS CLI: aws configure
Deploy the application to AWS by running cdk deploy.
- Create a file containing a list of URLs, one URL per line.
- Upload the file to the S3 bucket created during the deployment.
- Wait for a few minutes for the Lambda functions to download the favicon images from the URLs and save them to the S3 bucket.
- Check the icons directory in the S3 bucket to find the downloaded images.
Run cdk destroy to remove the AWS CDK stack and all associated resources.