NascentCore/3k

3K

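Introduction

The 3K platform (三千平台) is a cloud-native platform for training and serving large models. It is named after its 3 core metrics:

  • 1000+ GPUs: supports computing clusters of 1,000 A100 GPUs or equivalent
  • 100B+ parameters: supports training and inference of large (Transformer) models with 100 billion parameters
  • 1000+ hours: supports more than 1,000 hours of large-model training without human intervention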

Glossary

Acronym   Meaning
1g        1 GPU
1h1g      1 node, 1 GPU
1h8g      1 node, 8 GPUs
2h8g      2 nodes, 16 GPUs (8 per node)
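
The acronyms above follow a regular pattern (`<nodes>h<GPUs per node>g`, or just `<GPUs>g` for a single node), so they can be expanded mechanically. The following is a minimal sketch of that reading; `parse_shape` is a hypothetical helper name, not something shipped in the repo, and the per-node interpretation is inferred from the `2h8g` row.

```shell
# Expand a cluster-shape acronym such as "2h8g".
# Assumption: "<N>h<M>g" means N nodes with M GPUs each, and a bare
# "<M>g" means a single node with M GPUs.
# Prints "<nodes> <gpus-per-node> <total-gpus>", or returns 1 on bad input.
parse_shape() {
  shape=$1
  case "$shape" in
    *h*g) nodes=${shape%%h*}; rest=${shape#*h}; gpus=${rest%g} ;;
    *g)   nodes=1; gpus=${shape%g} ;;
    *)    echo "unrecognized shape: $shape" >&2; return 1 ;;
  esac
  # Accept only positive decimal numbers without leading zeros.
  for n in "$nodes" "$gpus"; do
    case "$n" in
      ''|0*|*[!0-9]*) echo "unrecognized shape: $shape" >&2; return 1 ;;
    esac
  done
  echo "$nodes $gpus $((nodes * gpus))"
}
```

For example, `parse_shape 2h8g` prints `2 8 16`, and `parse_shape 1g` prints `1 1 1`.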

SuperLinter

tools/super_linter.sh <file-or-directory-to-be-linted>

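User manual

  • 算想云: a serverless cloud service for large-model development, training, fine-tuning, and inference, aimed at individual large-model developers and small and medium-sized companies building large-model applications
  • 算力源: for GPU cluster owners, that is, enterprises that rent out their own GPU clusters through 算想云
  • 算想三千: a privately deployed large-model development software platform, aimed at enterprise large-model teams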

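Installation SLO

  • Within 1 hour: the 3K platform itself (excluding large-model data assets) is installed within 1 hour of starting the installation. Acceptance criterion: IB test jobs and bert jobs can be started.
  • Within 3 hours: the LLaMA2-7B data assets (model, container images, and datasets) are installed within 3 hours of starting the installation. Acceptance criterion: LLaMA2-7B pre-training, fine-tuning, and inference demos can be started.
  • Within 24 hours: the data assets of 10 mainstream open-source large models are installed within 24 hours of starting the installation. Acceptance criterion: fine-tuning and inference demos of any of these models can be started.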

Readme for Development

Look for the README.md under each directory for the purpose of the code in that directory and other details.

Code organization history

  1. In the very beginning, CPodOperator lived in a separate GitHub repo. Because its skeleton code is generated by Kubebuilder, it did not fit naturally alongside the other code of the 3k platform.
  2. At that time, PortalSync was placed together with CPodOperator for easier development, to avoid dependencies across Go module packages.
  3. Then CPodOperator was merged into the 3k repo for monorepo management. That resulted in awkward coupling between CPodOperator and PortalSync: the two are independent by nature, but PortalSync is a sub-package of CPodOperator.

TODO: Use Go 1.18 multi-module workspaces to refactor the code file layout.
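
As a rough illustration of what the workspace TODO could look like, the sketch below writes a `go.work` file that lists each module as a sibling. The module directories `./cpodoperator` and `./portalsync` in the usage line are hypothetical; the real layout of the repo may differ, and `write_go_work` is an illustrative helper rather than an existing tool.

```shell
# Write a Go 1.18 workspace file listing the given module directories.
# Usage: write_go_work <repo-root> <module-dir>...
# Running "go work init <module-dir>..." in the repo root produces an
# equivalent file; this helper only makes the resulting format explicit.
write_go_work() {
  root=$1
  shift
  {
    printf 'go 1.18\n\nuse (\n'
    for mod in "$@"; do
      printf '\t%s\n' "$mod"
    done
    printf ')\n'
  } > "$root/go.work"
}
```

A call such as `write_go_work . ./cpodoperator ./portalsync` would let both modules be built from the repo root while each keeps its own `go.mod`, which is one way to undo the sub-package coupling described above.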

Common items

  • Make a minimal (shallow, single-branch) clone of the repo:
    git clone --depth=1 --branch=main --single-branch \
      git@github.com:NascentCore/3k.git
    
  • Set up GOPROXY by running the following in your terminal. This allows downloading Go packages through a proxy in China.
    go env -w GO111MODULE=on
    go env -w GOPROXY=https://goproxy.cn,direct
    
  • Pull requests need to be checked with tools/lint.sh before being submitted for review.
    tools/lint.sh
    
  • Init submodule:
    git submodule update --init --recursive
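
One way to make the lint check hard to forget is to call it from a Git hook. The sketch below is only an illustration: `lint_gate` and the `LINT_CMD` override are hypothetical names, and it assumes tools/lint.sh exits with a non-zero status when it finds problems.

```shell
# Run the lint script and stop when it fails.
# LINT_CMD is a hypothetical override, mainly useful for testing;
# by default the repo's tools/lint.sh is used.
lint_gate() {
  lint_cmd=${LINT_CMD:-tools/lint.sh}
  if ! "$lint_cmd"; then
    echo "lint failed: fix the findings before opening a pull request" >&2
    return 1
  fi
  echo "lint passed"
}
```

Sourcing this from `.git/hooks/pre-push` and ending the hook with `lint_gate || exit 1` would block a push whenever the lint script reports an error.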
    

Notes

  • Add // nolint:<linter name> to disable a check of golangci-lint, for example: // nolint:unused.
  • Run super-linter locally:
    # `--workdir /tmp/lint` is needed per
    # https://github.com/super-linter/super-linter/issues/4495
    docker run --rm --env-file .github/super_linter.env \
      -e USE_FIND_ALGORITHM=true -e RUN_LOCAL=true \
      -v "$(pwd)/.github:/tmp/lint/.github" \
      -v "$(pwd)/.git:/tmp/lint/.git" \
      -v "$(pwd)/<code-path>:/tmp/lint/<code-path>" \
      --workdir /tmp/lint \
      super-linter
    

Pip mirror

python3 -m pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple

File and dir naming convention

  • It is OK to break these conventions
  • The most important rule is to stay consistent with the dominant convention in the existing codebase

Use '-' to separate file and dir name components, as in foo-bar/baz-tik-tok
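
To see where a tree deviates from the '-' convention, a small `find` wrapper is enough. This is an advisory sketch, not a repo tool: `find_non_dash_names` is a hypothetical name, and it skips Go test files because the Go toolchain requires the `_test.go` suffix. Since breaking the convention is allowed, the output is a list to review rather than a list of errors.

```shell
# List files and directories whose names contain '_' instead of '-'.
# Skips the .git directory and Go test files (*_test.go).
# Usage: find_non_dash_names [dir]   (defaults to the current directory)
find_non_dash_names() {
  find "${1:-.}" -name .git -prune -o \
    -name '*_*' ! -name '*_test.go' -print
}
```

Running `find_non_dash_names tools` would, for instance, report tools/super_linter.sh.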