Commit

Merge branch 'myscraper'
ajits-github committed Aug 16, 2023
2 parents b19b85d + 090f93c commit 0e104e4
Showing 20 changed files with 447 additions and 0 deletions.
10 changes: 10 additions & 0 deletions .gitignore
@@ -0,0 +1,10 @@
data/
scraper_spider/
venv/

# Python cache files
__pycache__/
*.pyc
*.pyo

.vscode/
33 changes: 33 additions & 0 deletions DECISIONS.md
@@ -0,0 +1,33 @@
# Decisions Made During the Project

Throughout the development of this project, various decisions were made to ensure effective scraping of data, the optimal structure of the database, and the smooth execution of the API. This document outlines those key decisions.

## Choice of FastAPI
Although FastAPI was only an initial recommendation, it emerged as the preferred framework for the API development because of its speed, efficiency, and ease of use. A notable advantage of FastAPI is the automatic generation of a Swagger UI, which simplifies testing and documenting the API endpoints.

## Database Structure
SQLite was employed given its lightweight nature and ease of setup. It offered a rapid prototyping advantage and negated the need for an external server setup.

The table structure was tailored to accommodate data derived from the scraping process, capturing attributes like manufacturer, category, model, part_number, and part_category.
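For reference, a minimal sketch of that layout, mirroring the `CREATE TABLE` statement used in `database/db_manager.py` and the default path from `config.yml`:

```python
import sqlite3

# Mirror of the table created in database/db_manager.py; the path comes from config.yml.
conn = sqlite3.connect("data/scraped_data.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS parts_data (
        id INTEGER PRIMARY KEY,
        manufacturer TEXT,
        category TEXT,
        model TEXT,
        part_number TEXT,
        part_category TEXT
    )
""")
conn.commit()
conn.close()
```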

## Scraping Strategy

### BeautifulSoup Implementation
The structure of the website to be scraped influenced the decision to use the BeautifulSoup library, given its flexibility and efficiency.

To achieve cleaner code and to facilitate debugging and potential expansion, the scraper was architected modularly: distinct functions were dedicated to scraping manufacturers, models, and parts.
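A rough sketch of that modular layout is shown below; the URL, CSS selectors, and function names are illustrative placeholders rather than the exact ones used in `scraper/`.

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://example.com/manufacturers"  # placeholder; the real URL lives in scraper/

def get_soup(url):
    # Fetch a page and hand it to BeautifulSoup for parsing.
    response = requests.get(url)
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")

def scrape_manufacturers():
    # One function per level of the site: manufacturers, then models, then parts.
    soup = get_soup(BASE_URL)
    return [link.get_text(strip=True) for link in soup.select("a.manufacturer")]

def scrape_models(manufacturer_url):
    soup = get_soup(manufacturer_url)
    return [link.get_text(strip=True) for link in soup.select("a.model")]
```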

### Transition to Scrapy
Recognizing the demand for faster and more concurrent scraping, the project incorporated Scrapy. As a powerful scraping framework, Scrapy offers advantages in speed, concurrency, and handling complex scraping requirements.
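A minimal spider in the spirit of the `scraper_spider/` implementation might look as follows; the spider name, start URL, and selectors are placeholders.

```python
import scrapy

class PartsSpider(scrapy.Spider):
    # Illustrative spider only; the real implementation lives in scraper_spider/.
    name = "parts"
    start_urls = ["https://example.com/manufacturers"]  # placeholder start URL

    def parse(self, response):
        # Follow every manufacturer link; Scrapy schedules these requests concurrently.
        for href in response.css("a.manufacturer::attr(href)").getall():
            yield response.follow(href, callback=self.parse_manufacturer)

    def parse_manufacturer(self, response):
        # Yield one item per page; items can then be written to the SQLite database.
        yield {"manufacturer": response.css("h1::text").get()}
```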

## Configuration and Environment Consistency
The introduction of `config.yml` allowed for centralization of key configurations like the database path. This ensures easier manageability and adaptability.
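Reading that file takes only a few lines of PyYAML, as `config_util.py` does:

```python
import yaml

# Load config.yml once and read the database path, as config_util.py does.
with open("config.yml", "r") as config_file:
    config = yaml.safe_load(config_file)

DB_PATH = config["database"]["path"]  # "data/scraped_data.db" by default
```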

## Dockerization
Docker was utilized to containerize both the scraper (Beautiful Soup and Scrapy versions) and the API. This strategy guaranteed a consistent environment, simplifying the setup process and minimizing disparities between development and production environments.

## Logging
To keep track of the scraping processes, especially with the Scrapy implementation, logging was incorporated. This allows for better monitoring, debugging, and understanding of the scraper's behavior. The verbosity and settings of the logger can easily be adjusted as needed.

## Future Considerations
Considering future requirements, the project might pivot to a robust database like PostgreSQL if the volume of scraped data surges substantially.
103 changes: 103 additions & 0 deletions README.md
@@ -1,2 +1,105 @@
# DNL_Backend_Challenge
This project involves scraping data from a website and storing it in a SQLite database. An API built with FastAPI is also provided to access and query this data.

# DNL Project

This project involves scraping data from a website using both Beautiful Soup and Scrapy, storing the scraped data in a SQLite database, and providing an API built with FastAPI to access and query this data.

## Table of Contents
- [DNL\_Backend\_Challenge](#dnl_backend_challenge)
- [DNL Project](#dnl-project)
- [Table of Contents](#table-of-contents)
- [Getting Started](#getting-started)
- [Setup](#setup)
- [Configuration](#configuration)
- [Running the Services](#running-the-services)
- [Accessing the API](#accessing-the-api)
- [Components](#components)
- [Database](#database)
- [Scrapers](#scrapers)
- [Beautiful Soup Scraper](#beautiful-soup-scraper)
- [Scrapy Spider](#scrapy-spider)
- [API](#api)
- [API Usage](#api-usage)
- [Fetching Parts\_Data (Table)](#fetching-parts_data-table)
- [Swagger UI](#swagger-ui)
- [Structure](#structure)
- [Logger](#logger)
- [Contributing](#contributing)

-----------------------------------------------------------

### Getting Started

#### Setup
- Ensure Docker and Docker Compose are installed on your machine.
- Clone the repository to your local system:
```git clone https://github.com/ajits-github/DNL_Backend_Challenge.git```

#### Configuration
Configuration values, such as the path for the SQLite database, are maintained in `config.yml` in the project repository. Check it, and modify it if necessary, before running the services.

#### Running the Services
- From the project's root directory, use the following command to start the services:
`docker-compose up --build`
The scraper service will start, scrape the required data, and populate the SQLite database. Following this, the API service will be accessible.

#### Accessing the API
- With the services running, access the FastAPI Swagger UI at `http://127.0.0.1:8000/docs`.
- Here, you can test and interact with the available API endpoints.

-----------------------------------------------------------

### Components

#### Database
- SQLite, a file-based database system, serves as the project's database solution. This eliminates the need for separate database services. The SQLite database file is created and populated when the scraper runs.

#### Scrapers

##### Beautiful Soup Scraper
- Located in the `scraper/` directory, this scraper executes once upon initiation. It fetches the required data using Beautiful Soup and stores it in the SQLite database.
- If you want to run it locally and create the database so that it can be mounted as a volume in docker-compose, execute the following command in a terminal:
`python ./scraper/main.py`

##### Scrapy Spider
- Found within the `scraper_spider/` directory, this scraper uses Scrapy to fetch necessary data and stores it in the SQLite database.
- If you want to run it locally and create the database:
`python ./scraper_spider/main.py`

#### API
- Hosted in the `api/` directory, the API taps into the populated SQLite database to deliver data through its endpoints. The FastAPI Swagger UI allows direct interaction and testing of the API.

-----------------------------------------------------------

### API Usage

#### Fetching Parts_Data (Table)
- Endpoint: `/parts`
- Refine results using query parameters, e.g., `?manufacturer=Ammann`.
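
For example, the same query can be issued from Python with the `requests` library (host and port assume the default docker-compose setup):

```python
import requests

# Fetch parts for one manufacturer; drop the params argument to fetch every row.
response = requests.get(
    "http://127.0.0.1:8000/parts/",
    params={"manufacturer": "Ammann"},
)
response.raise_for_status()  # the API returns 404 if no parts match
for part in response.json():
    print(part["model"], part["part_number"])
```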

#### Swagger UI
- Test the API endpoints by accessing the FastAPI Swagger UI at `http://127.0.0.1:8000/docs`.

-----------------------------------------------------------

### Structure

- `scraper/`: Houses the Beautiful Soup scraping logic.
- `scraper_spider/`: Contains the Scrapy logic responsible for web scraping.
- `api/`: Contains the FastAPI server and API logic.
- `database/`: Manages database operations and holds the SQLite file.
- `docker/`: Keeps the Dockerfile and relevant configurations for containerization.

-----------------------------------------------------------

### Logger

Logging is integrated into the application, helping in tracking and debugging activities. You can modify the logging level and format in the Scrapy settings to filter the type of information captured and displayed. This can be especially helpful in identifying issues or optimizing scraper performance.
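
For instance, assuming a standard Scrapy project layout, the log level and format can be set in `settings.py` with values along these lines (shown values are examples only):

```python
# Example Scrapy logging settings; adjust to taste.
LOG_LEVEL = "INFO"    # "DEBUG" for full request detail, "WARNING" for quieter runs
LOG_FORMAT = "%(asctime)s [%(name)s] %(levelname)s: %(message)s"
LOG_FILE = "scrapy.log"  # optional: write logs to a file instead of stderr
```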

-----------------------------------------------------------

### Contributing

To contribute to this project, please fork the repository and submit a pull request.

19 changes: 19 additions & 0 deletions api/Dockerfile
@@ -0,0 +1,19 @@
# Use an official Python runtime as the parent image
FROM python:3.7-slim

# Set the working directory in the container
WORKDIR /app

# Copy the requirements file into the container
COPY api/requirements.txt .

# Install any needed packages specified in requirements.txt
RUN pip install --trusted-host pypi.python.org -r requirements.txt

# Copy the content of the local src directory to the working directory
COPY api .
COPY config_util.py .
COPY config.yml .

# Specify the command to run on container start
CMD ["uvicorn", "api_server:app", "--host", "0.0.0.0", "--port", "8000", "--reload"]
Empty file added api/__init__.py
Empty file.
40 changes: 40 additions & 0 deletions api/api_server.py
@@ -0,0 +1,40 @@
import sqlite3
from fastapi import FastAPI, HTTPException

from config_util import config


DB_PATH = config['database']['path']
app = FastAPI()

def query_database(query, params=()):
    with sqlite3.connect(DB_PATH) as conn:
        cursor = conn.cursor()
        cursor.execute(query, params)
        rows = cursor.fetchall()
        columns = [desc[0] for desc in cursor.description]
        return [dict(zip(columns, row)) for row in rows]


@app.get("/parts/")
async def get_parts(manufacturer: str = None):
    base_query = "SELECT * FROM parts_data"
    params = ()

    if manufacturer:
        base_query += " WHERE manufacturer=?"
        params = (manufacturer, )

    results = query_database(base_query, params)

    if not results:
        raise HTTPException(status_code=404, detail="No parts found")

    return results

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

# This will start the FastAPI server. You can then access your API at http://127.0.0.1:8000/parts/ and filter results by adding a manufacturer query parameter, e.g., http://127.0.0.1:8000/parts/?manufacturer=Ammann.
# Swagger UI is enabled by default in FastAPI, so you can access the interactive API documentation at http://127.0.0.1:8000/docs.
4 changes: 4 additions & 0 deletions api/requirements.txt
@@ -0,0 +1,4 @@
fastapi==0.68.0
uvicorn==0.15.0
pyyaml==6.0.1
# sqlite3==0.0.1
8 changes: 8 additions & 0 deletions config.yml
@@ -0,0 +1,8 @@
database:
  # host: "0.0.0.0"
  # name: ""
  # user: ""
  # pwd: ""
  # port: ""
  # path: "data/scraped_data1.db"
  path: "data/scraped_data.db"
7 changes: 7 additions & 0 deletions config_util.py
@@ -0,0 +1,7 @@
import yaml

def load_config():
    with open('config.yml', 'r') as config_file:
        return yaml.safe_load(config_file)

config = load_config()
Empty file added database/__init__.py
Empty file.
32 changes: 32 additions & 0 deletions database/db_manager.py
@@ -0,0 +1,32 @@
import sqlite3

from config_util import config

DB_PATH = config['database']['path']

def store_to_db(data):
    connection = sqlite3.connect(DB_PATH)
    cursor = connection.cursor()

    # Create the table
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS parts_data (
            id INTEGER PRIMARY KEY,
            manufacturer TEXT,
            category TEXT,
            model TEXT,
            part_number TEXT,
            part_category TEXT
        )
    """)

    # Insert data
    for entry in data:
        cursor.execute("""
            INSERT INTO parts_data (manufacturer, category, model, part_number, part_category)
            VALUES (?, ?, ?, ?, ?)
        """, (entry['manufacturer'], entry['category'], entry['model'], entry['part_number'], entry['part_category']))

    connection.commit()
    connection.close()

4 changes: 4 additions & 0 deletions database/models.py
@@ -0,0 +1,4 @@
SCHEMA = """
CREATE TABLE IF NOT EXISTS parts
(manufacturer TEXT, category TEXT, model TEXT, part_number TEXT, part_category TEXT);
"""
37 changes: 37 additions & 0 deletions docker-compose.yml
@@ -0,0 +1,37 @@
version: '3'

services:
  scraper:
    build:
      context: .
      dockerfile: scraper/Dockerfile
    volumes:
      - ./scraper:/app/scraper
      - database_volume:/app/data

  # scraper:
  #   build:
  #     context: .
  #     dockerfile: scraper_spider/Dockerfile
  #   volumes:
  #     - ./scraper_spider:/app/scraper_spider
  #     - database_volume:/app/data
  api:
    build:
      context: .
      dockerfile: api/Dockerfile
    ports:
      - "8000:8000"
    depends_on:
      - scraper
    volumes:
      - ./api:/app/api
      - database_volume:/app/data

volumes:
  database_volume:


# This configuration defines two services: scraper_spider (or scraper) and api.
# It also sets up a volume data to store the SQLite database file.
# This ensures that the database file created by the scraper is also accessible to the API service.
31 changes: 31 additions & 0 deletions docker/Dockerfile
@@ -0,0 +1,31 @@
# # Use Python 3.7 as the parent image
# FROM python:3.7-slim

# # Set the working directory in the container
# WORKDIR /app

# # Copy the current directory contents into the container at /app
# COPY . /app

# # Install necessary packages and libraries
# RUN pip install --no-cache-dir requests beautifulsoup4

# # Command to run the scraper
# CMD ["python", "scraper/main.py"]


# FROM tiangolo/uvicorn-gunicorn-fastapi:python3.7

# # Set the working directory
# WORKDIR /app

# # Install dependencies
# COPY requirements.txt .
# RUN pip install --no-cache-dir -r requirements.txt

# # Copy the content of the local src directory to the working directory
# COPY ./scraper /app/scraper
# COPY ./api_server.py /app/

# # Specify the command to run on container start
# CMD ["uvicorn", "api_server:app", "--host", "0.0.0.0", "--port", "8000"]
20 changes: 20 additions & 0 deletions requirements.txt
@@ -0,0 +1,20 @@
# The complete requirements of the project.

# Scraper requirements
requests==2.25.1
beautifulsoup4==4.9.3
scrapy==2.9.0

# Database requirements
sqlalchemy==1.3.23
# sqlite==3.*

# FastAPI web service requirements
fastapi==0.63.0
uvicorn==0.13.4
pydantic==1.8.1
starlette==0.13.6
pyyaml==6.0.1

# Additional utility (only if you want to see more detailed logs)
loguru==0.5.3
21 changes: 21 additions & 0 deletions scraper/Dockerfile
@@ -0,0 +1,21 @@
# Use an official Python runtime as the parent image
FROM python:3.7-slim

# Set the working directory in the container
WORKDIR /app

# Copy the requirements file into the container
COPY scraper/requirements.txt .
# COPY scraper /app/scraper

# Install any needed packages specified in requirements.txt
RUN pip install --trusted-host pypi.python.org -r requirements.txt

# Copy the content of the local src directory to the working directory
COPY scraper .
COPY database /app/database
COPY config_util.py .
COPY config.yml .

# Specify the command to run on container start
CMD ["python", "./main.py"]
Empty file added scraper/__init__.py
Empty file.