This project brings AIGC to Kubernetes via a cloud-native stateless design.
You can download the models from Hugging Face:
- BF16 model for AMX optimization
# Llama-2-7b-chat-hf-sharded-bf16
cd models
git lfs install
git clone https://huggingface.co/Trelis/Llama-2-7b-chat-hf-sharded-bf16
- INT8 model for VNNI optimization
# vicuna-7b-v1.3
cd models
git lfs install
git clone https://huggingface.co/lmsys/vicuna-7b-v1.3
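After the clones finish, a quick sanity check that both model directories are in place (paths assumed from the clone commands above, run from the repository root):
# both directories should exist and contain the model weight files
ls ./models/
du -sh ./models/Llama-2-7b-chat-hf-sharded-bf16 ./models/vicuna-7b-v1.3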
Build the container image:
./container-build.sh -c cnagc-fastchat
By default it uses bluewish/ as the registry. To use your own registry, such as <xxxx.com>/, pass the -r <xxxx.com>/ option:
./container-build.sh -c cnagc-fastchat -r "<xxxx.com>/"
If you do not want to build the container from scratch, you can pull the prebuilt image instead:
docker pull bluewish/cnagc-fastchat:v2.2.0-cpu
# Run the chat without specifying an ISA (script default)
./docker-runchat.sh -m ./models/vicuna-7b-v1.3
# Run with AVX512_VNNI for INT8 model
./docker-runchat.sh -m ./models/vicuna-7b-v1.3 -i avx512_vnni
# Run with AMX for BF16 model
./docker-runchat.sh -m ./models/Llama-2-7b-chat-hf-sharded-bf16 -i amx
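Before picking the -i option, you can verify which instruction sets the host CPU actually supports (a generic Linux check, not part of the project's scripts):
# amx_bf16 / amx_int8 indicate AMX support; avx512_vnni indicates VNNI support
grep -oE 'amx_bf16|amx_int8|avx512_vnni' /proc/cpuinfo | sort -u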
Instead of running the chat in a single container, you can also run the controller, UI, model workers and API server as separate containers. A typical deployment on a host with IP 10.0.0.100 looks like this:

+--------------------------+       +------------------------+
|        UI Server         |       |   OpenAI API Server    |
|  http://10.0.0.100:9000  |       | http://10.0.0.100:8000 |
+--------------------------+       +------------------------+
             |                                  |
             |                                  |
            \|/                                \|/
                +----------------------------+
                |     Controller Server      |
                |  http://10.0.0.100:21001   |
                +----------------------------+
            /|\                                /|\
             |                                  |
             |                                  |
             |                                  |
+-------------------------+       +----------------------------------------+
|     Model Worker #1     |       |            Model Worker #2             |
| http://10.0.0.100:21002 |       | http://10.0.0.100:21003                |
| Model: vicuna-7b-v1.3   |       | Model: Llama-2-7b-chat-hf-sharded-bf16 |
| ISA: AVX512_VNNI        |       | ISA: AMX                               |
+-------------------------+       +----------------------------------------+
- Get the IP address of the host that runs the containers via ip a, for example 10.0.0.100.
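If you want to script the steps below, the host IP can be captured into a shell variable (a small helper; verify it matches the address shown by ip a):
# picks the first address reported by the system, e.g. 10.0.0.100
export HOST_IP=$(hostname -I | awk '{print $1}')
echo $HOST_IP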
- Run the controller container
./docker-runchat.sh -t controller
By default the controller serves at localhost:21001, or 10.0.0.100:21001 if the host IP address is 10.0.0.100.
- Run the UI server
# specify the controller service address; it should be the same host IP address
export CONTROLLER_SVC=10.0.0.100
# specify the controller service port; 21001 is the default
export CONTROLLER_PORT=21001
./docker-runchat.sh -t ui
By default the UI is served at http://localhost:9000, or http://10.0.0.100:9000 if the host IP address is 10.0.0.100. You can open it in a browser.
- Register the model inference worker
# specify the controller service address; it should be the same host IP address
export CONTROLLER_SVC=10.0.0.100
# specify the controller service port; 21001 is the default
export CONTROLLER_PORT=21001
# specify the model worker address; it should be the same host IP address
export MODEL_WORKER_SVC=10.0.0.100
# specify the model worker port; the default is 21002. When registering a second
# model worker, choose a new port such as 21003, 21004, 21005, etc.
export MODEL_WORKER_PORT=21002
./docker-runchat.sh -t model -m ./models/vicuna-7b-v1.3/ -i avx2
./docker-runchat.sh -t model -m ./models/vicuna-7b-v1.3/ -i avx512_vnni
./docker-runchat.sh -t model -m ./models/Llama-2-7b-chat-hf-sharded-bf16/ -i amx
- Run the OpenAI API server
# specify the controller service address; it should be the same host IP address
export CONTROLLER_SVC=10.0.0.100
# specify the controller service port; 21001 is the default
export CONTROLLER_PORT=21001
./docker-runchat.sh -t apiserver
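Once the API server is up, listing the models through the OpenAI-compatible endpoint is a simple sanity check that also confirms the workers registered with the controller (port 8000 assumed, matching the examples below):
# should return the model names registered by the workers above
curl http://10.0.0.100:8000/v1/models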
After running the above services via docker, use one of the following approaches to communicate with them:
- Approach 1: Just open http://localhost:9000 or http://10.0.0.100:9000 in your web browser and play with it.
- Approach 2: Use the OpenAI API with curl (a chat-endpoint variant is shown after Approach 3):
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vicuna-7b-v1.3-avx2",
    "prompt": "Once upon a time",
    "max_tokens": 41,
    "temperature": 0.5
  }'
- Approach 3: Use the OpenAI API from Python code:
Install the openai Python package:
pip install --upgrade openai
Python code is as follows:
import openai

openai.api_key = "EMPTY"
openai.base_url = "http://localhost:8000/v1/"

model = "vicuna-7b-v1.3-avx2"
prompt = "Once upon a time"

# create a completion
completion = openai.completions.create(model=model, prompt=prompt, max_tokens=64)
# print the completion
print(prompt + completion.choices[0].text)

# create a chat completion
completion = openai.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Hello! What is your name?"}]
)
# print the completion
print(completion.choices[0].message.content)
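The chat endpoint used in the Python example above can also be exercised with curl, mirroring Approach 2 (the model name is assumed to match the worker registered earlier):
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vicuna-7b-v1.3-avx2",
    "messages": [{"role": "user", "content": "Hello! What is your name?"}]
  }'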
Build the container image for the Kubernetes deployment:
./container-build.sh -c cnagc-fastchat-k8s
Several image tags are available; the default one is latest.
Please download the required LLM model on the inference server in advance, and then edit the model path in the corresponding worker deployment file:
- AMX worker: deployment/cse-aigc-worker-amx.yaml
- Non-AMX worker: deployment/cse-aigc-worker-non.yaml
volumes:
  - name: model-path
    hostPath:
      # edit the model path
      path: /home/smgaigc/Downloads/cloud-native-aigc-pipeline-main/models/Llama-2-7b-chat-hf-sharded-bf16
      type: Directory
Also update the node name for each worker so the pod is scheduled onto the node that holds the downloaded model:
nodeSelector:
  kubernetes.io/hostname: smgaigc-ivory2-0
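The kubernetes.io/hostname value must match an actual node name in your cluster; you can list the node names like this:
# node names appear in the NAME column
kubectl get nodes -o wide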
Deploy FastChat, Kepler-exporter and Kubernetes-dashboard:
cd deployment
# -k expects the directory that contains the kustomization.yaml file
kubectl apply -k .
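After applying the manifests, you can watch the pods come up (the exact namespaces depend on the manifests, so -A covers all of them):
# wait until the FastChat, Kepler and dashboard pods are Running
kubectl get pods -A -o wide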
Deploy the Prometheus operator:
Follow the kube-prometheus setup guide. The git clone step is not needed, since kube-prometheus is already included as a sub-module of the cse-cnagc repo.