
Medusa-Bridge

This software enables you to join multiple local LLM API servers to the KoboldAI Horde as a Scribe worker performing distributed text generation. Obtain an API Key here to accumulate kudos, the virtual currency of the Horde, which can be used for tasks such as image generation and interrogation.

It is the successor to LlamaCpp-Horde-Bridge, rewritten in NodeJS. If upgrading, note that the names of some configuration arguments have changed.

Throughput-enhancing Features:

  • Multi-threaded processing: generate multiple jobs in parallel
  • Asynchronous job submission: pop a new job as soon as the previous one has finished generating, and submit the result in the background (see the sketch after this list)
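
The pattern behind both features, as a minimal sketch (not the actual Medusa-Bridge source; popJob, generate and submitResult are hypothetical stand-ins for the Horde pop, local inference, and Horde submit calls):

// Each "thread" runs an independent loop: pop -> generate -> submit.
// The submit is not awaited, so the next pop starts immediately.
async function popJob() { /* fetch a pending job from the Horde */ }
async function generate(job) { /* call the local LLM REST server */ }
async function submitResult(job, text) { /* report the generation back */ }

async function workerLoop(id) {
  while (true) {
    const job = await popJob();
    if (!job) { await new Promise(r => setTimeout(r, 2000)); continue; }
    const text = await generate(job);
    // Fire-and-forget: submission runs in the background while we pop again.
    submitResult(job, text).catch(err => console.error(`thread ${id}:`, err));
  }
}

// threads > 1 simply starts several independent loops in parallel.
const threads = 2;
for (let i = 0; i < threads; i++) workerLoop(i);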

Supported inference REST API servers:

  • llama.cpp server
  • koboldcpp
  • vllm
  • sglang
  • aphrodite

See below for example configurations for each engine.

Installation

Medusa-Bridge requires NodeJS v16; installation via nvm is recommended.

Execute npm ci to install dependencies.
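
For example, assuming nvm is already installed:

nvm install 16
nvm use 16
npm ci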

Configuration

Run node index.js to see the default configuration:

┌───────────────────┬────────────────────────────────┐
│      (index)      │             Values             │
├───────────────────┼────────────────────────────────┤
│    clusterUrl     │   'https://stablehorde.net'    │
│    workerName     │ 'Automated Instance #73671757' │
│      apiKey       │          '0000000000'          │
│ priorityUsernames │                                │
│     serverUrl     │    'http://localhost:8000'     │
│   serverEngine    │              null              │
│       model       │              null              │
│        ctx        │              null              │
│     maxLength     │             '512'              │
│      threads      │              '1'               │
└───────────────────┴────────────────────────────────┘

Run node index.js --help to see command-line equivalents and descriptions for each option:

Usage: index [options]

Options:
  -f, --config-file <file>              Load config from .json file
  -c, --cluster-url <url>               Set the Horde cluster URL (default: "https://stablehorde.net")
  -w, --worker-name <name>              Set the Horde worker name (default: "Automated Instance #31551726")
  -a, --api-key <key>                   Set the Horde API key (default: "0000000000")
  -p, --priority-usernames <usernames>  Set priority usernames, comma-separated (default: [])
  -s, --server-url <url>                Set the REST Server URL (default: "http://localhost:8000")
  -e, --server-engine <engine>          Set the REST Server API type (default: null)
  -m, --model <model>                   Set the model name (default: null)
  -x, --ctx <ctx>                       Set the context length (default: null)
  -l, --max-length <length>             Set the max generation length (default: "512")
  -t, --threads <threads>               Number of parallel threads (default: "1")
  -h, --help                            display help for command

The -f / --config-file option allows you to group configuration into a named JSON file while still allowing command-line overrides.
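
For example, with the settings stored in a file named worker.json (the filename is arbitrary), the thread count can still be overridden per run:

node index.js -f worker.json --threads 2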

Server Engine Sample

llama.cpp server

Why llama.cpp and not koboldcpp?

See this reddit post: using this trick, older Pascal GPUs (GTX 10x0, P40, K80) are almost twice as fast, particularly at long contexts.

Compile llama.cpp with make LLAMA_CUBLAS=1 LLAMA_CUDA_FORCE_MMQ=1 to get a Pascal-optimized server binary.

Example usage

Example server command: ./server ~/models/openhermes-2.5-mistral-7b.Q5_0.gguf -ngl 99 -c 4096

Example configuration file:

{
    "apiKey": "<your api key>",
    "workerName": "<your worker name>",
    "serverEngine": "llamacpp",
    "serverUrl": "http://localhost:8000",
    "model": "llamacpp/openhermes-2.5-mistral-7b.Q5_0",
    "ctx": 4096
}
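
With the llama.cpp server running and the configuration saved to, for example, llamacpp.json (the filename is arbitrary), start the bridge with:

node index.js -f llamacpp.json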

koboldcpp

Example server command: ./koboldcpp-linux-x64 ~/models/openhermes-2.5-mistral-7b.Q5_0.gguf --usecublas 0 mmq --gpulayers 99 --context-size 4096 --quiet

Example configuration file:

{
    "apiKey": "<your api key>",
    "workerName": "<your worker name>",
    "serverEngine": "koboldcpp",
    "serverUrl": "http://localhost:5001",
    "model": "koboldcpp/openhermes-2.5-mistral-7b.Q5_0",
    "ctx": 4096
}

vllm

TODO

sglang

TODO

aphrodite

python3 -m aphrodite.endpoints.kobold.api_server --model TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ --max-length 512 --max-model-len 8192
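
No example configuration file is provided yet; the following is a sketch by analogy with the engines above, assuming the engine identifier is aphrodite and that the KoboldAI-compatible endpoint listens on localhost:5000 (verify both against your setup):

{
    "apiKey": "<your api key>",
    "workerName": "<your worker name>",
    "serverEngine": "aphrodite",
    "serverUrl": "http://localhost:5000",
    "model": "aphrodite/OpenHermes-2.5-Mistral-7B-GPTQ",
    "ctx": 8192
}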
