📍[2025/03] Echo slowdown module (v1.0) has been released. This version brings the following features:
**Kernel Metric Collection**
- Utilizes NVIDIA Nsight Compute and Nsight Systems to automatically profile GPU kernels
- Captures detailed execution metrics
- Generates baseline performance profiles for isolated kernel execution
- Outputs structured JSON files containing raw kernel metrics
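As a rough illustration of this workflow (not the module's actual CLI: the benchmark binary, metric list, and output path below are hypothetical), kernel metrics can be collected by driving Nsight Compute from Python and writing the parsed results to JSON:

```python
import csv
import io
import json
import subprocess

# Hedged sketch: profile an isolated kernel benchmark with Nsight Compute and
# store the raw per-kernel metrics as structured JSON. The benchmark binary,
# metric list, and output file are placeholders, not the module's actual CLI.
METRICS = ",".join([
    "sm__cycles_elapsed.avg",  # kernel duration in SM cycles
    "dram__bytes.sum",         # total DRAM traffic
    "sm__throughput.avg.pct_of_peak_sustained_elapsed",  # compute utilization
])

result = subprocess.run(
    ["ncu", "--csv", "--metrics", METRICS, "./kernel_benchmark"],
    capture_output=True, text=True, check=True,
)

# Keep only the CSV portion of ncu's stdout (skip "==PROF==" status lines);
# column names may vary slightly across Nsight Compute versions.
csv_text = "\n".join(
    line for line in result.stdout.splitlines() if not line.startswith("==")
)

profiles = {}
for row in csv.DictReader(io.StringIO(csv_text)):
    kernel = row.get("Kernel Name", "unknown")
    profiles.setdefault(kernel, {})[row["Metric Name"]] = row["Metric Value"]

with open("baseline_kernel_metrics.json", "w") as f:
    json.dump(profiles, f, indent=2)
```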
**Slowdown Collection**
- Analyzes kernel behavior under various overlap scenarios
- Measures actual slowdown factors through controlled experiments
- Generates ground truth data for model training and validation
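Conceptually, a slowdown factor is the ratio of a kernel's duration when it overlaps with communication to its duration in isolation. Below is a minimal sketch of such a controlled experiment, assuming PyTorch with an already-initialized NCCL process group; the kernel, payload size, and file name are illustrative, not the module's actual harness:

```python
import json
import torch
import torch.distributed as dist

# Hedged sketch: assumes dist.init_process_group("nccl") has already been
# called and a CUDA device is available. Sizes and names are placeholders.
a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")
comm_buf = torch.randn(16 * 1024 * 1024, device="cuda")  # ~64 MB all-reduce payload

def time_cuda(fn, iters=20, warmup=5):
    """Average GPU time of a callable in milliseconds, measured with CUDA events."""
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

def matmul_isolated():
    torch.mm(a, b)

def matmul_overlapped():
    # NCCL runs the collective on its own stream, so with async_op=True the
    # all-reduce contends with the matmul for SMs and memory bandwidth.
    handle = dist.all_reduce(comm_buf, async_op=True)
    torch.mm(a, b)
    handle.wait()

isolated_ms = time_cuda(matmul_isolated)
overlapped_ms = time_cuda(matmul_overlapped)
slowdown = overlapped_ms / isolated_ms  # ground-truth label for this overlap scenario

with open("slowdown_ground_truth.json", "w") as f:
    json.dump({"kernel": "matmul_4096", "slowdown": slowdown}, f, indent=2)
```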
**Training & Testing**
- Implements machine learning models for slowdown prediction
- Provides comprehensive evaluation metrics
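For intuition, slowdown prediction can be framed as supervised regression from kernel metrics (plus overlap-scenario features) to measured slowdown factors. The sketch below uses scikit-learn as a stand-in; the module's actual model architecture, feature set, and file formats may differ, and the dataset fields shown are hypothetical:

```python
import json
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split

# Hypothetical dataset: one record per (kernel, overlap scenario), with raw
# Nsight metrics as features and the measured slowdown factor as the label.
with open("slowdown_dataset.json") as f:
    records = json.load(f)

X = np.array([[r["sm_throughput"], r["dram_bytes"], r["comm_message_bytes"]]
              for r in records])
y = np.array([r["slowdown"] for r in records])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = GradientBoostingRegressor().fit(X_train, y_train)
pred = model.predict(X_test)

# Example evaluation metrics: mean absolute percentage error and worst case.
mape = mean_absolute_percentage_error(y_test, pred)
worst = np.max(np.abs(pred - y_test) / y_test)
print(f"MAPE: {mape:.3f}, worst-case relative error: {worst:.3f}")
```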
Echo is a simulation platform for the distributed training of machine learning models and LLMs. It helps researchers and engineers evaluate training scenarios, parallelism strategies, and hardware configurations without deploying them on large physical clusters.
- Support for 3D Parallelism: Echo accurately simulates data parallelism (DP), tensor parallelism (TP), and pipeline parallelism (PP) to reflect real-world large-scale training.
- Workload Tracer: Echo traces runtime execution graphs in an ex-situ manner, allowing a single device to simulate thousands of GPUs without requiring a full-scale deployment.
- Communication-Overlap Estimation: Echo introduces a slowdown predictor to model performance degradation caused by overlapping computation and communication, improving simulation accuracy.
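To illustrate how a slowdown predictor feeds into simulation (a schematic sketch, not Echo's internal API; the class and function names are made up), each overlapped kernel's isolated duration can be scaled by its predicted slowdown before per-step times are summed:

```python
from dataclasses import dataclass

@dataclass
class SimKernel:
    name: str
    isolated_ms: float    # duration profiled in isolation
    overlaps_comm: bool   # whether it runs concurrently with a collective

def simulate_step_time(kernels, predict_slowdown):
    """Sum per-kernel durations, inflating overlapped kernels by their
    predicted slowdown factor (hypothetical simulator logic)."""
    total = 0.0
    for k in kernels:
        factor = predict_slowdown(k) if k.overlaps_comm else 1.0
        total += k.isolated_ms * factor
    return total

# Example usage with a constant stub in place of the trained predictor.
kernels = [
    SimKernel("attention_fwd", 1.8, overlaps_comm=False),
    SimKernel("mlp_fwd", 2.4, overlaps_comm=True),  # overlaps a TP all-reduce
]
print(simulate_step_time(kernels, predict_slowdown=lambda k: 1.3))
```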
We evaluated Echo’s end-to-end performance on a 96-GPU H800 cluster. Echo achieves an average accuracy of 92% in predicting training step time while completing the simulation in under 2 minutes.
We will gradually release Echo's core components to the community.
Currently, we have published the slowdown module. This module predicts GPU kernel performance slowdowns caused by computation-communication overlap during distributed training. It provides tools for kernel metric collection, slowdown data generation, and training/testing prediction models.
We are actively developing new features and plan to support:
- Expert Parallelism (EP) for Mixture of Experts (MoE) models.
- Context Parallelism (CP) to optimize long-sequence training.
- A visualization and analysis tool for inspecting events across different GPUs/ranks.
- Expanded NCCL communication modeling to enhance network simulation accuracy by incorporating more complex factors such as congestion control, adaptive routing, and bandwidth contention.
- Support for diverse hardware backends, including future NVIDIA and AMD architectures.
If you use this repo in your research, please cite our paper:
@article{echo2024,
  title={Echo: Simulating Distributed Training At Scale},
  author={Yicheng Feng and Yuetao Chen and Kaiwen Chen and Jingzong Li and Tianyuan Wu and Peng Cheng and Chuan Wu and Wei Wang and Tsung-Yi Ho and Hong Xu},
  journal={arXiv preprint arXiv:2412.12487},
  year={2024}
}
Email Yicheng Feng ([email protected]) or Eric, Kin Hang Sew ([email protected]) if you have any questions.
This project is licensed under the MIT license - see the LICENSE file for details.