DistLM is a distributed training framework for Large Language Models (LLMs) built on Ray. It provides a scalable, efficient way to train LLMs across multiple devices by leveraging Ray's distributed computing capabilities.
- Distributed Training: Utilize multiple devices for parallel training of LLMs.
- Ray Integration: Leverage Ray's distributed computing framework for efficient task distribution and execution.
- FastAPI Backend: Robust API for device registration, task submission, and status monitoring.
- Model and Dataset Management: Easy upload and management of custom models and datasets.
- React Frontend: User-friendly interface for interacting with the training system.
- Flexible Model Support: Compatible with various LLM architectures, including LLaMA.
DistLM consists of the following main components:
- Backend API (`main.py`): FastAPI-based server handling device registration, task submission, and status queries.
- Ray Setup (`ray_setup.py`): Configures and initializes the Ray cluster for distributed computing.
- Training Module (`train.py`): Implements the distributed training logic using Ray (see the sketch after this list).
- Data Loader (`dataloader.py`): Handles loading and preprocessing of datasets.
- Model Loader (`model.py`): Manages loading and initialization of LLM models.
- Frontend (`App.tsx`): React-based user interface for interacting with the system.
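To make the division of labor concrete, here is a minimal sketch of the Ray pattern a training module like `train.py` builds on: define a remote function, fan it out across the cluster, and gather the results. The function name, arguments, and dummy loss below are illustrative assumptions, not DistLM's actual API.

```python
# Minimal Ray fan-out/gather pattern (illustrative; not DistLM's actual API).
import ray

ray.init()  # start or connect to a Ray cluster (DistLM configures this in ray_setup.py)

@ray.remote
def train_shard(model_name: str, data_shard: list) -> float:
    # In DistLM, a model from model.py would be trained here on a shard
    # produced by dataloader.py; this stub just returns a dummy "loss".
    return float(len(data_shard))

# Dispatch one task per shard in parallel, then collect the results.
shards = [["sample"] * 10, ["sample"] * 12]
losses = ray.get([train_shard.remote("llama", shard) for shard in shards])
print(losses)
```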
- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/DistLM.git
  cd DistLM
  ```

- Install the required Python packages:

  ```bash
  pip install -r requirements.txt
  ```

- Install Node.js and npm (for the frontend).

- Install frontend dependencies:

  ```bash
  cd frontend
  npm install
  ```
- Navigate to the project root directory.

- Run the FastAPI server (a quick smoke test follows this list):

  ```bash
  uvicorn main:app --host 0.0.0.0 --port 8000
  ```

- Navigate to the `frontend` directory.

- Start the React development server:

  ```bash
  npm start
  ```
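Once both servers are running, a quick smoke test can confirm the backend is reachable. The sketch below assumes the API from the previous step is listening on `localhost:8000` and that `GET /devices` returns JSON:

```python
# Smoke test: verify the DistLM backend is up and responding.
import requests

resp = requests.get("http://localhost:8000/devices")
resp.raise_for_status()  # raises if the server returned an error status
print(resp.json())       # expected: the list of registered devices (empty at first)
```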
- `POST /devices/register`: Register a new device for distributed training (example call below).
- `GET /devices`: List all registered devices.
- `POST /tasks/submit`: Submit a new training task.
- `GET /tasks`: List all submitted tasks.
- `GET /tasks/{task_id}/status`: Check the status of a specific task.
- `POST /upload/model/`: Upload a custom model file.
- `POST /upload/dataset/`: Upload a custom dataset file.
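As an example, registering a device might look like the call below. The request body is not documented here, so the `device_id` and `device_type` fields are assumptions; adapt them to the actual schema defined in `main.py`:

```python
# Hypothetical device registration call; the payload fields are assumptions.
import requests

payload = {"device_id": "gpu-node-1", "device_type": "cuda"}
resp = requests.post("http://localhost:8000/devices/register", json=payload)
resp.raise_for_status()
print(resp.json())
```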
- Register available devices using the `/devices/register` endpoint.
- Upload your model and dataset using the respective upload endpoints.
- Submit a training task via the `/tasks/submit` endpoint, specifying the model, dataset, and devices to use.
- Monitor the task status using the `/tasks/{task_id}/status` endpoint (an end-to-end sketch follows this list).
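Building on the registration call shown earlier, the upload/submit/poll portion of this workflow might look like the sketch below. The multipart field name (`file`), the submit payload, and the response shape (`task_id`, the status values) are all assumptions rather than DistLM's documented schema:

```python
# End-to-end sketch: upload artifacts, submit a task, poll until done.
# Payload fields and response shapes are assumptions, not a documented schema.
import time
import requests

BASE = "http://localhost:8000"

# Upload a model and a dataset as multipart file uploads.
with open("model.bin", "rb") as f:
    requests.post(f"{BASE}/upload/model/", files={"file": f}).raise_for_status()
with open("data.jsonl", "rb") as f:
    requests.post(f"{BASE}/upload/dataset/", files={"file": f}).raise_for_status()

# Submit a training task referencing the uploaded artifacts.
task = requests.post(
    f"{BASE}/tasks/submit",
    json={"model": "model.bin", "dataset": "data.jsonl", "devices": ["gpu-node-1"]},
).json()
task_id = task["task_id"]  # assumed response field

# Poll the task status until it reaches a terminal state.
while True:
    status = requests.get(f"{BASE}/tasks/{task_id}/status").json()
    print(status)
    if status.get("status") in ("completed", "failed"):  # assumed status values
        break
    time.sleep(5)
```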
Execute the test suite to ensure system integrity:
```bash
pytest test_main.py test_ray.py
```
To add support for new LLM architectures:
- Modify `model.py` to include the new model loading logic (one possible shape is sketched below).
- Update `train.py` to handle the training process for the new model type.
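One possible shape for that change is a small loader registry in `model.py`, sketched below. The registry, the function names, and the use of Hugging Face `transformers` are assumptions; adapt them to however `model.py` actually selects and loads models.

```python
# Sketch of a loader registry for model.py (hypothetical structure).
# Assumes models are loaded via Hugging Face transformers, which DistLM
# may or may not use; swap in the project's real loading logic.
from transformers import AutoModelForCausalLM

MODEL_LOADERS = {
    "llama": lambda path: AutoModelForCausalLM.from_pretrained(path),
}

def register_architecture(name, loader):
    """Make a new architecture available to the training pipeline."""
    MODEL_LOADERS[name] = loader

def load_model(name, path):
    if name not in MODEL_LOADERS:
        raise ValueError(f"Unsupported architecture: {name}")
    return MODEL_LOADERS[name](path)

# Example: register a new architecture, then load it like any other.
register_architecture("mistral", lambda path: AutoModelForCausalLM.from_pretrained(path))
```

A registry like this keeps architecture-specific logic in one place, so `train.py` only ever needs to call a single `load_model` entry point.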
The React-based frontend (`App.tsx`) can be extended to add new features or improve the user interface. Be sure to update the corresponding API calls in the frontend when modifying the backend.
We welcome contributions to DistLM! Please follow these steps to contribute:
- Fork the repository.
- Create a new branch for your feature or bug fix.
- Implement your changes, following the existing code style.
- Write or update tests as necessary.
- Submit a pull request with a clear description of your changes.
This project is licensed under the MIT License. See the `LICENSE` file for details.
For questions, issues, or contributions, please open an issue on the GitHub repository or contact Abhi Vetukuri or Anir Vetukuri.
DistLM - Empowering distributed training for Large Language Models