DistLM is a distributed training framework for Large Language Models (LLMs) built on Ray. It provides a scalable, efficient way to train LLMs across multiple devices by leveraging Ray's distributed computing capabilities.
- Distributed Training: Utilize multiple devices for parallel training of LLMs.
- Ray Integration: Leverage Ray's distributed computing framework for efficient task distribution and execution.
- FastAPI Backend: Robust API for device registration, task submission, and status monitoring.
- Model and Dataset Management: Easy upload and management of custom models and datasets.
- React Frontend: User-friendly interface for interacting with the training system.
- Flexible Model Support: Compatible with various LLM architectures, including LLaMA.
DistLM consists of the following main components:
- Backend API (`main.py`): FastAPI-based server handling device registration, task submission, and status queries.
- Ray Setup (`ray_setup.py`): Configures and initializes the Ray cluster for distributed computing.
- Training Module (`train.py`): Implements the distributed training logic using Ray (see the sketch after this list).
- Data Loader (`dataloader.py`): Handles loading and preprocessing of datasets.
- Model Loader (`model.py`): Manages loading and initialization of LLM models.
- Frontend (`App.tsx`): React-based user interface for interacting with the system.
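To make the division of labor concrete, here is a minimal sketch of the Ray pattern a training module like `train.py` builds on: define a remote function, fan it out across the cluster, and gather the results. The function name, arguments, and dummy loss below are illustrative assumptions, not DistLM's actual API.

```python
# Minimal Ray fan-out/gather pattern (illustrative; not DistLM's actual API).
import ray

ray.init()  # start or connect to a Ray cluster (DistLM configures this in ray_setup.py)

@ray.remote
def train_shard(model_name: str, data_shard: list) -> float:
    # In DistLM, a model from model.py would be trained here on a shard
    # produced by dataloader.py; this stub just returns a dummy "loss".
    return float(len(data_shard))

# Dispatch one task per shard in parallel, then collect the results.
shards = [["sample"] * 10, ["sample"] * 12]
losses = ray.get([train_shard.remote("llama", shard) for shard in shards])
print(losses)
```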
- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/DistLM.git
  cd DistLM
  ```

- Install the required Python packages:

  ```bash
  pip install -r requirements.txt
  ```

- Install Node.js and npm (for the frontend).

- Install frontend dependencies:

  ```bash
  cd frontend
  npm install
  ```
- Navigate to the project root directory.

- Run the FastAPI server (a quick smoke test follows this list):

  ```bash
  uvicorn main:app --host 0.0.0.0 --port 8000
  ```

- Navigate to the `frontend` directory.

- Start the React development server:

  ```bash
  npm start
  ```
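Once both servers are running, a quick smoke test can confirm the backend is reachable. The sketch below assumes the API from the previous step is listening on `localhost:8000` and that `GET /devices` returns JSON:

```python
# Smoke test: verify the DistLM backend is up and responding.
import requests

resp = requests.get("http://localhost:8000/devices")
resp.raise_for_status()  # raises if the server returned an error status
print(resp.json())       # expected: the list of registered devices (empty at first)
```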
- `POST /devices/register`: Register a new device for distributed training (example call below).
- `GET /devices`: List all registered devices.
- `POST /tasks/submit`: Submit a new training task.
- `GET /tasks`: List all submitted tasks.
- `GET /tasks/{task_id}/status`: Check the status of a specific task.
- `POST /upload/model/`: Upload a custom model file.
- `POST /upload/dataset/`: Upload a custom dataset file.
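As an example, registering a device might look like the call below. The request body is not documented here, so the `device_id` and `device_type` fields are assumptions; adapt them to the actual schema defined in `main.py`:

```python
# Hypothetical device registration call; the payload fields are assumptions.
import requests

payload = {"device_id": "gpu-node-1", "device_type": "cuda"}
resp = requests.post("http://localhost:8000/devices/register", json=payload)
resp.raise_for_status()
print(resp.json())
```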
- Register available devices using the `/devices/register` endpoint.
- Upload your model and dataset using the respective upload endpoints.
- Submit a training task via the `/tasks/submit` endpoint, specifying the model, dataset, and devices to use.
- Monitor the task status using the `/tasks/{task_id}/status` endpoint (an end-to-end sketch follows this list).
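Building on the registration call shown earlier, the upload/submit/poll portion of this workflow might look like the sketch below. The multipart field name (`file`), the submit payload, and the response shape (`task_id`, the status values) are all assumptions rather than DistLM's documented schema:

```python
# End-to-end sketch: upload artifacts, submit a task, poll until done.
# Payload fields and response shapes are assumptions, not a documented schema.
import time
import requests

BASE = "http://localhost:8000"

# Upload a model and a dataset as multipart file uploads.
with open("model.bin", "rb") as f:
    requests.post(f"{BASE}/upload/model/", files={"file": f}).raise_for_status()
with open("data.jsonl", "rb") as f:
    requests.post(f"{BASE}/upload/dataset/", files={"file": f}).raise_for_status()

# Submit a training task referencing the uploaded artifacts.
task = requests.post(
    f"{BASE}/tasks/submit",
    json={"model": "model.bin", "dataset": "data.jsonl", "devices": ["gpu-node-1"]},
).json()
task_id = task["task_id"]  # assumed response field

# Poll the task status until it reaches a terminal state.
while True:
    status = requests.get(f"{BASE}/tasks/{task_id}/status").json()
    print(status)
    if status.get("status") in ("completed", "failed"):  # assumed status values
        break
    time.sleep(5)
```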
Execute the test suite to ensure system integrity:
```bash
pytest test_main.py test_ray.py
```
To add support for new LLM architectures:
- Modify `model.py` to include the new model loading logic (one possible shape is sketched below).
- Update `train.py` to handle the training process for the new model type.
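One possible shape for that change is a small loader registry in `model.py`, sketched below. The registry, the function names, and the use of Hugging Face `transformers` are assumptions; adapt them to however `model.py` actually selects and loads models.

```python
# Sketch of a loader registry for model.py (hypothetical structure).
# Assumes models are loaded via Hugging Face transformers, which DistLM
# may or may not use; swap in the project's real loading logic.
from transformers import AutoModelForCausalLM

MODEL_LOADERS = {
    "llama": lambda path: AutoModelForCausalLM.from_pretrained(path),
}

def register_architecture(name, loader):
    """Make a new architecture available to the training pipeline."""
    MODEL_LOADERS[name] = loader

def load_model(name, path):
    if name not in MODEL_LOADERS:
        raise ValueError(f"Unsupported architecture: {name}")
    return MODEL_LOADERS[name](path)

# Example: register a new architecture, then load it like any other.
register_architecture("mistral", lambda path: AutoModelForCausalLM.from_pretrained(path))
```

A registry like this keeps architecture-specific logic in one place, so `train.py` only ever needs to call a single `load_model` entry point.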
The React-based frontend (`App.tsx`) can be extended to add new features or improve the user interface. Be sure to update the corresponding API calls in the frontend when modifying the backend.
We welcome contributions to DistLM! Please follow these steps to contribute:
- Fork the repository.
- Create a new branch for your feature or bug fix.
- Implement your changes, following the existing code style.
- Write or update tests as necessary.
- Submit a pull request with a clear description of your changes.
This project is licensed under the MIT License. See the `LICENSE` file for details.
For questions, issues, or contributions, please open an issue on the GitHub repository or contact Abhi Vetukuri or Anir Vetukuri.
DistLM - Empowering distributed training for Large Language Models