Commit

Merge branch 'myscraper'
ajits-github committed Aug 16, 2023
2 parents b19b85d + 090f93c commit 0e104e4
Showing 20 changed files with 447 additions and 0 deletions.
10 changes: 10 additions & 0 deletions .gitignore
@@ -0,0 +1,10 @@
data/
scraper_spider/
venv/

# Python cache files
__pycache__/
*.pyc
*.pyo

.vscode/
33 changes: 33 additions & 0 deletions DECISIONS.md
@@ -0,0 +1,33 @@
# Decisions Made During the Project

Throughout the development of this project, various decisions were made to ensure effective scraping of data, the optimal structure of the database, and the smooth execution of the API. This document outlines those key decisions.

## Choice of FastAPI
Although FastAPI was only an initial recommendation, it emerged as the preferred framework for the API development because of its speed, efficiency, and ease of use. A notable advantage of FastAPI is the automatic generation of a Swagger UI, which simplifies testing and documenting the API endpoints.

## Database Structure
SQLite was employed given its lightweight nature and ease of setup. It offered a rapid prototyping advantage and negated the need for an external server setup.

The table structure was tailored to accommodate data derived from the scraping process, capturing attributes like manufacturer, category, model, part_number, and part_category.
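For reference, a minimal sketch of that layout, mirroring the `CREATE TABLE` statement used in `database/db_manager.py` and the default path from `config.yml`:

```python
import sqlite3

# Mirror of the table created in database/db_manager.py; the path comes from config.yml.
conn = sqlite3.connect("data/scraped_data.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS parts_data (
        id INTEGER PRIMARY KEY,
        manufacturer TEXT,
        category TEXT,
        model TEXT,
        part_number TEXT,
        part_category TEXT
    )
""")
conn.commit()
conn.close()
```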

## Scraping Strategy

### BeautifulSoup Implementation
The structure of the website to be scraped influenced the decision to use the BeautifulSoup library, given its flexibility and efficiency.

To achieve cleaner code and to facilitate debugging and potential expansion, the scraper was architected modularly: distinct functions were dedicated to scraping manufacturers, models, and parts.
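A rough sketch of that modular layout is shown below; the URL, CSS selectors, and function names are illustrative placeholders rather than the exact ones used in `scraper/`.

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://example.com/manufacturers"  # placeholder; the real URL lives in scraper/

def get_soup(url):
    # Fetch a page and hand it to BeautifulSoup for parsing.
    response = requests.get(url)
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")

def scrape_manufacturers():
    # One function per level of the site: manufacturers, then models, then parts.
    soup = get_soup(BASE_URL)
    return [link.get_text(strip=True) for link in soup.select("a.manufacturer")]

def scrape_models(manufacturer_url):
    soup = get_soup(manufacturer_url)
    return [link.get_text(strip=True) for link in soup.select("a.model")]
```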

### Transition to Scrapy
Recognizing the demand for faster and more concurrent scraping, the project incorporated Scrapy. As a powerful scraping framework, Scrapy offers advantages in speed, concurrency, and handling complex scraping requirements.
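A minimal spider in the spirit of the `scraper_spider/` implementation might look as follows; the spider name, start URL, and selectors are placeholders.

```python
import scrapy

class PartsSpider(scrapy.Spider):
    # Illustrative spider only; the real implementation lives in scraper_spider/.
    name = "parts"
    start_urls = ["https://example.com/manufacturers"]  # placeholder start URL

    def parse(self, response):
        # Follow every manufacturer link; Scrapy schedules these requests concurrently.
        for href in response.css("a.manufacturer::attr(href)").getall():
            yield response.follow(href, callback=self.parse_manufacturer)

    def parse_manufacturer(self, response):
        # Yield one item per page; items can then be written to the SQLite database.
        yield {"manufacturer": response.css("h1::text").get()}
```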

## Configuration and Environment Consistency
The introduction of `config.yml` allowed for centralization of key configurations like the database path. This ensures easier manageability and adaptability.
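Reading that file takes only a few lines of PyYAML, as `config_util.py` does:

```python
import yaml

# Load config.yml once and read the database path, as config_util.py does.
with open("config.yml", "r") as config_file:
    config = yaml.safe_load(config_file)

DB_PATH = config["database"]["path"]  # "data/scraped_data.db" by default
```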

## Dockerization
Docker was utilized to containerize both the scraper (Beautiful Soup and Scrapy versions) and the API. This strategy guaranteed a consistent environment, simplifying the setup process and minimizing disparities between development and production environments.

## Logging
To keep track of the scraping processes, especially with the Scrapy implementation, logging was incorporated. This allows for better monitoring, debugging, and understanding of the scraper's behavior. The verbosity and settings of the logger can easily be adjusted as needed.

## Future Considerations
Considering future requirements, the project might pivot to a robust database like PostgreSQL if the volume of scraped data surges substantially.
103 changes: 103 additions & 0 deletions README.md
@@ -1,2 +1,105 @@
# DNL_Backend_Challenge
This project involves scraping data from a website and storing it in a SQLite database. An API built with FastAPI is also provided to access and query this data.

# DNL Project

This project involves scraping data from a website using both Beautiful Soup and Scrapy, storing the scraped data in a SQLite database, and providing an API built with FastAPI to access and query this data.

## Table of Contents
- [DNL\_Backend\_Challenge](#dnl_backend_challenge)
- [DNL Project](#dnl-project)
- [Table of Contents](#table-of-contents)
- [Getting Started](#getting-started)
- [Setup](#setup)
- [Configuration](#configuration)
- [Running the Services](#running-the-services)
- [Accessing the API](#accessing-the-api)
- [Components](#components)
- [Database](#database)
- [Scrapers](#scrapers)
- [Beautiful Soup Scraper](#beautiful-soup-scraper)
- [Scrapy Spider](#scrapy-spider)
- [API](#api)
- [API Usage](#api-usage)
- [Fetching Parts\_Data (Table)](#fetching-parts_data-table)
- [Swagger UI](#swagger-ui)
- [Structure](#structure)
- [Logger](#logger)
- [Contributing](#contributing)

-----------------------------------------------------------

### Getting Started

#### Setup
- Ensure Docker and Docker Compose are installed on your machine.
- Clone the repository to your local system:
```git clone https://github.com/ajits-github/DNL_Backend_Challenge.git```

#### Configuration
Configuration values, such as the path for the SQLite database, are maintained in `config.yml` in the project repository. Check it, and modify it if necessary, before running the services.

#### Running the Services
- From the project's root directory, use the following command to start the services:
`docker-compose up --build`
The scraper service will start, scrape the required data, and populate the SQLite database. Following this, the API service will be accessible.

#### Accessing the API
- With the services running, access the FastAPI Swagger UI at `http://127.0.0.1:8000/docs`.
- Here, you can test and interact with the available API endpoints.

-----------------------------------------------------------

### Components

#### Database
- SQLite, a file-based database system, serves as the project's database solution. This eliminates the need for separate database services. The SQLite database file is created and populated when the scraper runs.

#### Scrapers

##### Beautiful Soup Scraper
- Located in the `scraper/` directory, this scraper executes once upon initiation. It fetches the required data using Beautiful Soup and stores it in the SQLite database.
- If you want to run it locally and create the database so that it can be mounted as a volume in docker-compose, execute the following command in a terminal:
`python ./scraper/main.py`

##### Scrapy Spider
- Found within the `scraper_spider/` directory, this scraper uses Scrapy to fetch necessary data and stores it in the SQLite database.
- If you want to run it locally and create the database:
`python ./scraper_spider/main.py`

#### API
- Hosted in the `api/` directory, the API taps into the populated SQLite database to deliver data through its endpoints. The FastAPI Swagger UI allows direct interaction and testing of the API.

-----------------------------------------------------------

### API Usage

#### Fetching Parts_Data (Table)
- Endpoint: `/parts`
- Refine results using query parameters, e.g., `?manufacturer=Ammann`.
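
For example, the same query can be issued from Python with the `requests` library (host and port assume the default docker-compose setup):

```python
import requests

# Fetch parts for one manufacturer; drop the params argument to fetch every row.
response = requests.get(
    "http://127.0.0.1:8000/parts/",
    params={"manufacturer": "Ammann"},
)
response.raise_for_status()  # the API returns 404 if no parts match
for part in response.json():
    print(part["model"], part["part_number"])
```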

#### Swagger UI
- Test the API endpoints by accessing the FastAPI Swagger UI at `http://127.0.0.1:8000/docs`.

-----------------------------------------------------------

### Structure

- `scraper/`: Houses the Beautiful Soup scraping logic.
- `scraper_spider/`: Contains the Scrapy logic responsible for web scraping.
- `api/`: Contains the FastAPI server and API logic.
- `database/`: Manages database operations and holds the SQLite file.
- `docker/`: Keeps the Dockerfile and relevant configurations for containerization.

-----------------------------------------------------------

### Logger

Logging is integrated into the application, helping in tracking and debugging activities. You can modify the logging level and format in the Scrapy settings to filter the type of information captured and displayed. This can be especially helpful in identifying issues or optimizing scraper performance.
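
For instance, assuming a standard Scrapy project layout, the log level and format can be set in `settings.py` with values along these lines (shown values are examples only):

```python
# Example Scrapy logging settings; adjust to taste.
LOG_LEVEL = "INFO"    # "DEBUG" for full request detail, "WARNING" for quieter runs
LOG_FORMAT = "%(asctime)s [%(name)s] %(levelname)s: %(message)s"
LOG_FILE = "scrapy.log"  # optional: write logs to a file instead of stderr
```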

-----------------------------------------------------------

### Contributing

To contribute to this project, please fork the repository and submit a pull request.

19 changes: 19 additions & 0 deletions api/Dockerfile
@@ -0,0 +1,19 @@
# Use an official Python runtime as the parent image
FROM python:3.7-slim

# Set the working directory in the container
WORKDIR /app

# Copy the requirements file into the container
COPY api/requirements.txt .

# Install any needed packages specified in requirements.txt
RUN pip install --trusted-host pypi.python.org -r requirements.txt

# Copy the content of the local src directory to the working directory
COPY api .
COPY config_util.py .
COPY config.yml .

# Specify the command to run on container start
CMD ["uvicorn", "api_server:app", "--host", "0.0.0.0", "--port", "8000", "--reload"]
Empty file added api/__init__.py
Empty file.
40 changes: 40 additions & 0 deletions api/api_server.py
@@ -0,0 +1,40 @@
import sqlite3
from fastapi import FastAPI, HTTPException

from config_util import config


DB_PATH = config['database']['path']
app = FastAPI()

def query_database(query, params=()):
    with sqlite3.connect(DB_PATH) as conn:
        cursor = conn.cursor()
        cursor.execute(query, params)
        rows = cursor.fetchall()
        columns = [desc[0] for desc in cursor.description]
        return [dict(zip(columns, row)) for row in rows]


@app.get("/parts/")
async def get_parts(manufacturer: str = None):
    base_query = "SELECT * FROM parts_data"
    params = ()

    if manufacturer:
        base_query += " WHERE manufacturer=?"
        params = (manufacturer, )

    results = query_database(base_query, params)

    if not results:
        raise HTTPException(status_code=404, detail="No parts found")

    return results

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

# This will start the FastAPI server. You can then access your API at http://127.0.0.1:8000/parts/ and filter results by adding a manufacturer query parameter, e.g., http://127.0.0.1:8000/parts/?manufacturer=Ammann.
# Swagger UI is enabled by default in FastAPI, so you can access the interactive API documentation at http://127.0.0.1:8000/docs.
4 changes: 4 additions & 0 deletions api/requirements.txt
@@ -0,0 +1,4 @@
fastapi==0.68.0
uvicorn==0.15.0
pyyaml==6.0.1
# sqlite3==0.0.1
8 changes: 8 additions & 0 deletions config.yml
@@ -0,0 +1,8 @@
database:
  # host: "0.0.0.0"
  # name: ""
  # user: ""
  # pwd: ""
  # port: ""
  # path: "data/scraped_data1.db"
  path: "data/scraped_data.db"
7 changes: 7 additions & 0 deletions config_util.py
@@ -0,0 +1,7 @@
import yaml

def load_config():
    with open('config.yml', 'r') as config_file:
        return yaml.safe_load(config_file)

config = load_config()
Empty file added database/__init__.py
Empty file.
32 changes: 32 additions & 0 deletions database/db_manager.py
@@ -0,0 +1,32 @@
import sqlite3

from config_util import config

DB_PATH = config['database']['path']

def store_to_db(data):
    connection = sqlite3.connect(DB_PATH)
    cursor = connection.cursor()

    # Create the table
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS parts_data (
            id INTEGER PRIMARY KEY,
            manufacturer TEXT,
            category TEXT,
            model TEXT,
            part_number TEXT,
            part_category TEXT
        )
    """)

    # Insert data
    for entry in data:
        cursor.execute("""
            INSERT INTO parts_data (manufacturer, category, model, part_number, part_category)
            VALUES (?, ?, ?, ?, ?)
        """, (entry['manufacturer'], entry['category'], entry['model'], entry['part_number'], entry['part_category']))

    connection.commit()
    connection.close()

4 changes: 4 additions & 0 deletions database/models.py
@@ -0,0 +1,4 @@
SCHEMA = """
CREATE TABLE IF NOT EXISTS parts
(manufacturer TEXT, category TEXT, model TEXT, part_number TEXT, part_category TEXT);
"""
37 changes: 37 additions & 0 deletions docker-compose.yml
@@ -0,0 +1,37 @@
version: '3'

services:
  scraper:
    build:
      context: .
      dockerfile: scraper/Dockerfile
    volumes:
      - ./scraper:/app/scraper
      - database_volume:/app/data

  # scraper:
  #   build:
  #     context: .
  #     dockerfile: scraper_spider/Dockerfile
  #   volumes:
  #     - ./scraper_spider:/app/scraper_spider
  #     - database_volume:/app/data
  api:
    build:
      context: .
      dockerfile: api/Dockerfile
    ports:
      - "8000:8000"
    depends_on:
      - scraper
    volumes:
      - ./api:/app/api
      - database_volume:/app/data

volumes:
  database_volume:


# This configuration defines two services: scraper_spider (or scraper) and api.
# It also sets up a volume data to store the SQLite database file.
# This ensures that the database file created by the scraper is also accessible to the API service.
31 changes: 31 additions & 0 deletions docker/Dockerfile
@@ -0,0 +1,31 @@
# # Use Python 3.7 as the parent image
# FROM python:3.7-slim

# # Set the working directory in the container
# WORKDIR /app

# # Copy the current directory contents into the container at /app
# COPY . /app

# # Install necessary packages and libraries
# RUN pip install --no-cache-dir requests beautifulsoup4

# # Command to run the scraper
# CMD ["python", "scraper/main.py"]


# FROM tiangolo/uvicorn-gunicorn-fastapi:python3.7

# # Set the working directory
# WORKDIR /app

# # Install dependencies
# COPY requirements.txt .
# RUN pip install --no-cache-dir -r requirements.txt

# # Copy the content of the local src directory to the working directory
# COPY ./scraper /app/scraper
# COPY ./api_server.py /app/

# # Specify the command to run on container start
# CMD ["uvicorn", "api_server:app", "--host", "0.0.0.0", "--port", "8000"]
20 changes: 20 additions & 0 deletions requirements.txt
@@ -0,0 +1,20 @@
# The complete requirements of the project.

# Scraper requirements
requests==2.25.1
beautifulsoup4==4.9.3
scrapy==2.9.0

# Database requirements
sqlalchemy==1.3.23
# sqlite==3.*

# FastAPI web service requirements
fastapi==0.63.0
uvicorn==0.13.4
pydantic==1.8.1
starlette==0.13.6
pyyaml==6.0.1

# Additional utility (only if you want to see more detailed logs)
loguru==0.5.3
21 changes: 21 additions & 0 deletions scraper/Dockerfile
@@ -0,0 +1,21 @@
# Use an official Python runtime as the parent image
FROM python:3.7-slim

# Set the working directory in the container
WORKDIR /app

# Copy the requirements file into the container
COPY scraper/requirements.txt .
# COPY scraper /app/scraper

# Install any needed packages specified in requirements.txt
RUN pip install --trusted-host pypi.python.org -r requirements.txt

# Copy the content of the local src directory to the working directory
COPY scraper .
COPY database /app/database
COPY config_util.py .
COPY config.yml .

# Specify the command to run on container start
CMD ["python", "./main.py"]
Empty file added scraper/__init__.py
Empty file.