OpenSeek Logo

OpenSeek is dedicated to uniting the global open-source community to drive collaborative innovation in algorithms, data, and systems, with the goal of developing next-generation models that surpass DeepSeek.

English | 简体中文


📌 Project Overview

OpenSeek is an open-source project initiated by the Beijing Academy of Artificial Intelligence (BAAI). It aims to unite the global open-source community to drive collaborative innovation in algorithms, data, and systems, and to develop next-generation models that surpass DeepSeek. Drawing inspiration from large model initiatives such as BigScience and OPT, the project is dedicated to building an independent open-source algorithmic innovation system. Since the open-sourcing of the DeepSeek model, academia has produced numerous algorithmic improvements and breakthroughs, but these innovations often lack complete code implementations, the necessary computational resources, and high-quality data support. By uniting the open-source community, OpenSeek aims to explore mechanisms for constructing high-quality datasets, open-source the entire large model training pipeline, build innovative training and inference code that supports a variety of AI chips beyond Nvidia, and promote independent technological innovation and application development.

Objectives of OpenSeek:

  • Innovative data synthesis technology: Address the challenge of acquiring high-quality data and break through data barriers.
  • Support for multiple AI chips: Reduce dependency on specific chips and improve model universality and adaptability.
  • Build an independent open source algorithmic innovation system: Promote independent algorithmic innovation and technology sharing through open source collaboration.

Project: https://github.com/orgs/FlagAI-Open/projects/1

Acknowledgments & Contribution Guidelines

We extend our sincere gratitude to the FlagScale team for their foundational framework support. This project is built upon FlagScale's robust infrastructure.

  • For framework-related discussions/issues: Please direct your questions and report framework-specific issues through FlagScale's GitHub Issues. Code contributions should be submitted via Pull Requests (PRs) to the FlagScale repository.

  • For data strategies & training methodologies: Discussions, proposals, and PRs regarding dataset implementations, training optimizations, and experimental configurations should be initiated through this project's GitHub Issues and Pull Requests.

📢 News

Getting Started

Installation

git clone https://github.com/FlagAI-Open/OpenSeek.git
cd OpenSeek
cd flagscale/install
./install-requirements.sh --env train

The above commands create the conda environment flagscale-train, which contains the dependencies required for training.

🚀 Training

Phase 1: Training

| Category | Data | ckpt | Evaluation Results | Training Hyperparameters | Wandb | Discussion |
|---|---|---|---|---|---|---|
| Content | Aquila-3B data validation model<br>OpenSeek-PT-1.3T v0.1 | -- | Eval | seqlen: 4096<br>gbs: 8M<br>lr: 3.0e-3<br>lr_decay_style: WSD | Loss: https://wandb.ai/aquila3/OpenSeek-3B-v0.1/runs/aquila_3b_exp02-rank-63 | -- |
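
The lr_decay_style: WSD entry above refers to a Warmup-Stable-Decay learning-rate schedule. Below is a minimal sketch of such a schedule in plain Python; the function name, phase proportions, and step counts are illustrative assumptions, not the exact schedule used in this run or in FlagScale.

```python
def wsd_lr(step, total_steps, peak_lr=3.0e-3, min_lr=0.0,
           warmup_frac=0.01, decay_frac=0.1):
    """Warmup-Stable-Decay (WSD) schedule sketch.

    Assumed shape: linear warmup -> constant plateau at peak_lr ->
    linear decay to min_lr over the final `decay_frac` of training.
    The fractions are illustrative defaults, not OpenSeek's settings.
    """
    warmup_steps = int(total_steps * warmup_frac)
    decay_steps = int(total_steps * decay_frac)
    decay_start = total_steps - decay_steps

    if step < warmup_steps:                      # warmup phase
        return peak_lr * step / max(1, warmup_steps)
    if step < decay_start:                       # stable phase
        return peak_lr
    progress = (step - decay_start) / max(1, decay_steps)  # decay phase
    return peak_lr + (min_lr - peak_lr) * progress

# Example with the Phase 1 peak lr of 3.0e-3:
print([round(wsd_lr(s, 1000), 5) for s in (0, 5, 500, 950, 999)])
```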

๐Ÿ‘ Project Highlights

  • High-Quality Data Accessibility: Open-source 10TB-level, high-quality Chinese and English pretraining data, ensuring robust and diverse model training resources.
  • Scalable Data Synthesis Strategy: A streamlined and scalable approach to synthesizing Chain-of-Thought (CoT) data, leveraging Webpage, Code, Math, Wiki, and Book sources to enhance reasoning capabilities.
  • Multi-AI Chip Support: Built on Triton, the project offers optimized support for multiple AI chips, ensuring flexibility and adaptability across diverse hardware ecosystems (a minimal Triton kernel sketch follows this list).
  • High-Performance Training Infrastructure: Highly optimized training support, designed to maximize efficiency and accelerate model development.
  • Advanced Model Architecture: A more efficient model structure, optimized for performance and scalability, enabling superior computational efficiency and inference speed.
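
To make the Triton-based operator approach above concrete, here is a minimal, self-contained Triton vector-add kernel in the style used by Triton operator libraries. It is a generic illustration, not code from OpenSeek or FlagGems.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one contiguous block of elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

# Usage (requires a GPU supported by Triton):
# x = torch.randn(4096, device="cuda"); y = torch.randn(4096, device="cuda")
# assert torch.allclose(add(x, y), x + y)
```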

โ˜Ž๏ธ Open-Source Co-construction Plan

OpenSeek thrives on community collaboration. We believe in the collective intelligence of developers worldwide and welcome contributions that advance this project toward excellence.

For detailed information on how to contribute, please refer to our Contribution Guide.

Together, we can explore the frontiers of large language models and drive technological innovation through open source collaboration.

[WeChat QR code]

โฐ RoadMap

✅ Phase 1: Complete OpenSeek-data-1.3TB Creation & OpenSeek-Small Distributed Training

📊 Data

  • Build data processing and synthesis pipeline (a toy filtering/deduplication sketch follows this list)
  • Build OpenSeek-PT-1.3T-v0.1
  • Construct OpenSeek-data-1.3T official version based on OpenSeek-Small data ratio experiments
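
As context for the data pipeline item above, the toy sketch below shows two common pretraining-data stages: a length filter and exact deduplication by content hash. The function and thresholds are hypothetical and far simpler than a production pipeline, which would also include quality scoring, language identification, fuzzy deduplication, and more.

```python
import hashlib

def clean_and_dedup(docs, min_chars=200):
    """Toy sketch of two pretraining-data pipeline stages:
    a length filter and exact deduplication by SHA-256 hash."""
    seen = set()
    for text in docs:
        text = text.strip()
        if len(text) < min_chars:        # drop very short documents
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:               # drop exact duplicates
            continue
        seen.add(digest)
        yield text

sample = ["short", "a" * 300, "a" * 300, "b" * 250]
print(len(list(clean_and_dedup(sample))))  # -> 2
```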

🔄 Training

  • Validate 3B model effects on OpenSeek-PT-1.3T-v0.1 (Baseline)
  • Complete experimental training of OpenSeek-Small (~100B)

💻 System

  • Support distributed training for MLA, DeepSeek MoE, MTP, Auxiliary-Loss-Free load balancing, etc. (a conceptual routing sketch follows this list)
  • Convert and load DeepSeek V3 parameters
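
For background on the auxiliary-loss-free item above, the sketch below illustrates the general idea described in the DeepSeek-V3 technical report: a per-expert bias is added to routing scores only for top-k expert selection, and the bias is nudged up for under-loaded experts and down for over-loaded ones, so no auxiliary balancing loss is needed. Names and update-rule details here are illustrative assumptions, not FlagScale's implementation.

```python
import torch

def route_aux_loss_free(scores, expert_bias, top_k=8, update_rate=1e-3):
    """Conceptual sketch of auxiliary-loss-free MoE routing.

    scores:      [num_tokens, num_experts] affinity scores (e.g. sigmoid output)
    expert_bias: [num_experts] bias used only for expert selection
    Returns selected expert indices, gate weights, and the updated bias.
    """
    # 1) Select experts using biased scores (bias affects selection only).
    biased = scores + expert_bias
    topk_idx = biased.topk(top_k, dim=-1).indices            # [num_tokens, top_k]

    # 2) Gate weights come from the unbiased scores of the chosen experts.
    gate = torch.gather(scores, -1, topk_idx)
    gate = gate / gate.sum(dim=-1, keepdim=True)

    # 3) Update the bias from the observed load: raise it for under-loaded
    #    experts, lower it for over-loaded ones (no auxiliary loss term).
    num_experts = scores.shape[-1]
    load = torch.zeros(num_experts, device=scores.device).scatter_add_(
        0, topk_idx.reshape(-1), torch.ones(topk_idx.numel(), device=scores.device))
    expert_bias = expert_bias + update_rate * torch.sign(load.mean() - load)
    return topk_idx, gate, expert_bias
```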

⚡ Phase 2: Expand Data Scale & Optimize Distributed Training Performance

📊 Data

  • Expand data scale, build OpenSeek-PT-8T
  • Construct Long-CoT-Backward synthetic dataset and verify effects

🔄 Training

  • ⚡ Complete hyperparameter experiments for OpenSeek-Small
  • ⚡ Validate OpenSeek-PT-8T effects
  • ⚡ Complete full training of OpenSeek-Small on OpenSeek-PT-1.3T-v1.0

💻 System

  • ⚡ Support Node-limited Routing MoE
  • ⚡ Support FP8 distributed training (a simplified FP8 quantization sketch follows this list)
  • ⚡ Integrate the Triton-based operator library FlagGems
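
To illustrate the FP8 item above, the snippet below simulates per-tensor FP8 (E4M3) quantization with a dynamic scale, the basic building block of FP8 mixed-precision training. It relies on PyTorch's torch.float8_e4m3fn dtype (available in recent PyTorch releases) and is a simplified sketch, not this project's training recipe.

```python
import torch

E4M3_MAX = 448.0  # largest finite magnitude representable in float8_e4m3fn

def fp8_quantize(t: torch.Tensor):
    """Per-tensor dynamic scaling into FP8 E4M3 (sketch)."""
    scale = t.abs().max().clamp(min=1e-12) / E4M3_MAX
    q = (t / scale).to(torch.float8_e4m3fn)
    return q, scale

def fp8_dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

x = torch.randn(1024) * 3.0
q, s = fp8_quantize(x)
x_hat = fp8_dequantize(q, s)
print("max abs error:", (x - x_hat).abs().max().item())
```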

Phase 3: Support Larger Scale Data & Distributed Training

📊 Data

  • Build OpenSeek-Zero dataset
  • Build OpenSeek-RL dataset
  • Build OpenSeek-SFT dataset
  • Construct Long-CoT-Forward synthetic dataset and verify effects

🔄 Training

  • Produce OpenSeek-Small-Zero
  • Produce OpenSeek-Small-SFT
  • Produce OpenSeek-Small-RL
  • Complete hyperparameter experiments for OpenSeek-Mid
  • Complete full training of OpenSeek-Mid on OpenSeek-PT-8T

💻 System

  • Support DualPipe pipeline parallelism
  • Further optimize computation-communication overlap and memory usage

Phase 4: Upgrade Multi-chip Support & Open Source Release

📊 Data

  • Release official version of OpenSeek series datasets
  • Construct Long-CoT-RAG synthetic dataset and verify effects

🔄 Training

  • Produce OpenSeek-Mid-Zero
  • Produce OpenSeek-Mid-SFT
  • Produce OpenSeek-Mid-RL

💻 System

  • Adapt training and precision alignment for different chips
  • Implement customized parallel and optimization strategies for specific chips

📜 License Agreement

  • Code is licensed under Apache 2.0
  • Model weights are licensed under Apache 2.0
  • Data is licensed under CC BY-SA 4.0

Note: Full reproduction requires at least 8 H100 GPUs, and it is recommended to use the SLURM cluster management system. Datasets need to be applied for or generated independently, and some sensitive data is not included in the open source package.
