Empower Functions

Empower Functions is a family of LLMs (large language models) that offer GPT-4-level capabilities for real-world "tool using" use cases, with full compatibility support so they can be served as a drop-in replacement.

Live Demo · Huggingface Repo · Website · Discord

Update

New: Empower Functions v1.1. We have just launched v1.1 of the Empower Functions family. The updated family has been fine-tuned from Llama3.1 using an enhanced curated dataset, and it has achieved state-of-the-art performance on the Berkeley Function Calling leaderboard:

image (captured on Sep 10, 2024)

What are real-world "tool using" use cases?

"Tool using" refers to the ability of LLMs to interact with external APIs by recognizing when a function needs to be called and then generating JSON containing the necessary arguments based on user inputs. This capability is essential for building conversational agents and applications that convert natural language into API calls, facilitating tasks such as weather inquiries, data extraction, and interactions with knowledge bases.

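To make the idea concrete, here is a minimal sketch of what happens on the application side once a model has emitted a function name and JSON arguments. The `get_current_weather` implementation, the `TOOL_REGISTRY` dictionary, and the `dispatch_tool_call` helper are hypothetical names used for illustration only; they are not part of the Empower package.

```python
import json

# Hypothetical local tool; a real implementation would call a weather API.
def get_current_weather(location: str) -> dict:
    return {"location": location, "temperature_c": 18, "condition": "foggy"}

# Hypothetical registry mapping tool names (as declared to the model) to callables.
TOOL_REGISTRY = {"get_current_weather": get_current_weather}

def dispatch_tool_call(tool_call: dict) -> dict:
    """Run one tool call shaped like {"name": ..., "arguments": "<JSON string>"}."""
    name = tool_call["name"]
    if name not in TOOL_REGISTRY:
        raise KeyError(f"Model requested an unknown tool: {name}")
    # The arguments arrive as a JSON string and must be parsed before use.
    arguments = json.loads(tool_call["arguments"])
    return TOOL_REGISTRY[name](**arguments)

# What a model might emit for "What's the weather in San Francisco?"
model_output = {
    "name": "get_current_weather",
    "arguments": '{"location": "San Francisco, CA"}',
}
result = dispatch_tool_call(model_output)
print(result)
```

The model only produces the name and arguments; executing the function and handling unknown names stays the responsibility of the calling application.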
Real-world use cases, particularly those involving conversational agents, often introduce complex requirements for LLMs. Models must be capable of retrieving context from multiple rounds of conversation (multi-turn), choosing between using tools and engaging in standard dialogue ('auto' mode), and asking for clarification if any parameters are missing (clarification). Furthermore, they should integrate responses with tool outputs in a streaming fashion. Finally, when multiple tools are required to complete a task, models should efficiently execute multiple functions either in parallel (parallel calling) or sequentially with dependencies (sequential calling).
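The sketch below illustrates how an application might handle these cases for a single assistant turn, assuming OpenAI-style message dictionaries. A turn with several `tool_calls` entries corresponds to parallel calling, and a turn without any is treated as a regular reply, which may be a clarification question. The `handle_assistant_message` helper and the stand-in weather function are hypothetical.

```python
import json

# Hypothetical stand-in tool used only for this illustration.
def get_current_weather(location: str) -> dict:
    return {"location": location, "temperature_c": 20}

TOOLS = {"get_current_weather": get_current_weather}

def handle_assistant_message(message: dict, history: list) -> str:
    """Append the assistant turn to history and execute any requested tools.

    Returns "tools_executed" if tool results were appended (the caller should
    query the model again), or "reply" if the turn is plain conversation.
    """
    history.append(message)
    tool_calls = message.get("tool_calls") or []
    if not tool_calls:
        return "reply"  # regular dialogue or a clarification question
    for call in tool_calls:  # several entries in one turn = parallel calling
        function = TOOLS[call["function"]["name"]]
        arguments = json.loads(call["function"]["arguments"])
        history.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(function(**arguments)),
        })
    return "tools_executed"

history = [{"role": "user", "content": "Weather in San Francisco and Los Angeles?"}]
parallel_turn = {
    "role": "assistant",
    "content": None,
    "tool_calls": [
        {"id": "call_0", "type": "function",
         "function": {"name": "get_current_weather",
                      "arguments": '{"location": "San Francisco, CA"}'}},
        {"id": "call_1", "type": "function",
         "function": {"name": "get_current_weather",
                      "arguments": '{"location": "Los Angeles, CA"}'}},
    ],
}
status = handle_assistant_message(parallel_turn, history)
print(status, len(history))
```

Sequential calling follows the same loop: after the tool results are appended, the model is queried again and may request a further call that depends on the earlier results.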

For example, the screenshot below shows how the model is used in a medical center coordinator bot. You can explore this further in our live demo.

image

Family of Models

| Model | Specs | Links | Notes |
| --- | --- | --- | --- |
| llama3-empower-functions-small | 128k context, based on Llama3.1 8B | model, gguf | Most cost-effective, locally runnable |
| llama3-empower-functions-large | 128k context, based on Llama3.1 70B | model | Best accuracy |

Hardware Requirement

We have tested the family of models in the following setups:

  • empower-functions-small: fp16 on 1xA100 40G; GGUF and 4-bit GGUF on a MacBook M2 Pro with 32G RAM. At minimum, the 4-bit GGUF version requires 7.56G RAM.
  • empower-functions-large: fp16 on 4xA100 80G
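As a rough sanity check on these setups, the sketch below estimates the memory needed for model weights alone from the parameter count and the bits per weight. This is a back-of-the-envelope calculation under the assumption of round 8B and 70B parameter counts; it ignores the KV cache, activations, and runtime buffers, so real usage is higher.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

small_fp16 = weight_memory_gb(8e9, 16)    # 8B parameters at 16 bits each
large_fp16 = weight_memory_gb(70e9, 16)   # 70B parameters at 16 bits each
small_4bit = weight_memory_gb(8e9, 4)     # 8B parameters at an idealized 4 bits each

print(f"small fp16: ~{small_fp16:.0f} GB of weights")
print(f"large fp16: ~{large_fp16:.0f} GB of weights")
print(f"small 4-bit: ~{small_4bit:.0f} GB of weights")
```

Under these assumptions the small model's fp16 weights fit on a single 40G A100, and the large model's fp16 weights fit within the combined 320G of four 80G A100s. The 7.56G figure reported above for the 4-bit GGUF is larger than the weights-only estimate, which is expected once context and runtime overhead are included.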

How to Use?

Running Locally

Only the llama3-empower-functions-small model can be run locally. To use other models, please use our API.

Local running is supported through the empower_functions pip package. Make sure you install it first by running pip install empower-functions.

If you encounter errors like RuntimeError: Failed to load shared library, (mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64')), please re-install the llama-cpp-python package by running CMAKE_ARGS="-DCMAKE_OSX_ARCHITECTURES=arm64 -DCMAKE_APPLE_SILICON_PROCESSOR=arm64 -DLLAMA_METAL=on" pip install --upgrade --verbose --force-reinstall --no-cache-dir llama-cpp-python

Running a Local OpenAI Compatible Server

We leverage the llama-cpp-python project to run the model locally. To start a local OpenAI compatible server, you'll need to follow the steps below:

  1. Download the GGUF model from our huggingface repo
  2. Run the command python -m empower_functions.server --model <path to GGUF model> --chat_format empower-functions

You should see the following output when the server is ready:

INFO: Uvicorn running on http://localhost:8000 (Press CTRL+C to quit)

Then you can use the OpenAI SDK to connect to the server. See below for a basic example:

import openai
import json

client = openai.OpenAI(
    base_url = "http://localhost:8000/v1",
    api_key = "YOUR_API_KEY"
)

messages = [
    {"role": "user", "content": "What's the weather in San Francisco?"}
]

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g., San Francisco, CA"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

chat_completion = client.chat.completions.create(
    model="does_not_matter",
    messages=messages,
    tools=tools,
    temperature=0,
    tool_choice="auto"
)

print(chat_completion)

Running in a Python Environment

You can call the model directly in your Python environment through the llama-cpp-python package, using the chat handler provided in the empower_functions package. See below for a basic example. For a more detailed example, please refer to the python script.

import json
from empower_functions import EmpowerFunctionsCompletionHandler
from llama_cpp.llama_tokenizer import LlamaHFTokenizer
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="empower-dev/llama3-empower-functions-small-gguf",
    filename="ggml-model-Q4_K_M.gguf",
    chat_format="llama-3",
    chat_handler=EmpowerFunctionsCompletionHandler(),
    tokenizer=LlamaHFTokenizer.from_pretrained("empower-dev/llama3-empower-functions-small-gguf"),
    n_gpu_layers=0
)

# You can then use the llm object to chat with the model
messages = [
    {"role": "user", "content": "What's the weather in San Francisco?"}
]

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g., San Francisco, CA"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

result = llm.create_chat_completion(
    messages=messages,
    tools=tools,
    tool_choice="auto",
    max_tokens=128
)
print(json.dumps(result["choices"][0], indent=2))

Running in Windows with CUDA

Installing the CUDA-enabled build of llama-cpp-python will take a while, and it will overwrite the standard llama-cpp-python module with the CUDA-enabled one.

  • Then run the server with the virtual environment active, using a command like this: python -m empower_functions.server --model C:\Github\empower-functions-gpu\models\ggml-model-Q4_K_M.gguf --chat_format empower-functions --port 8001 --n_ctx 8196 --n_gpu_layers 20

Replace the path with the location where the model is saved on your computer, set n_ctx to the desired context size, and set n_gpu_layers to the number of layers to offload to the GPU.

Using Empower API

The Empower platform offers an API that is fully compatible with the OpenAI API, allowing you to use the OpenAI SDK directly. See below for a basic example; more details can be found here.

Currently, streaming and JSON mode are only available in the Empower API.

from openai import OpenAI

client = OpenAI(
    base_url="https://app.empower.dev/api/v1",
    api_key="YOUR_API_KEY"
)

response = client.chat.completions.create(
    model="empower-functions",
    messages=[{"role": "user",
               "content": "What's the weather in San Francisco and Los Angeles in Celsius?"}],
    temperature=0,
    tools=[{
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA",
                    },
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    }],
)
response_message = response.choices[0].message.tool_calls
print(response_message)
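Since streaming is available through the API, a client typically needs to stitch streamed tool-call fragments back together before parsing the arguments. The sketch below assumes the stream follows the OpenAI chunk format, where each delta may carry `tool_calls` fragments identified by an `index`; that the Empower API emits exactly this shape is an assumption based on its stated OpenAI compatibility. The `accumulate_tool_calls` helper is hypothetical and works on plain dictionaries so it can be run without a network connection.

```python
import json
from collections import defaultdict

def accumulate_tool_calls(deltas):
    """Merge streamed tool-call fragments into complete calls, ordered by index."""
    calls = defaultdict(lambda: {"id": None, "name": "", "arguments": ""})
    for delta in deltas:
        for fragment in delta.get("tool_calls") or []:
            call = calls[fragment["index"]]
            if fragment.get("id"):
                call["id"] = fragment["id"]
            function = fragment.get("function") or {}
            # Name and argument text may arrive in pieces; concatenate them.
            call["name"] += function.get("name") or ""
            call["arguments"] += function.get("arguments") or ""
    return [calls[index] for index in sorted(calls)]

# Simulated deltas standing in for chunk.choices[0].delta of a real stream.
simulated_deltas = [
    {"tool_calls": [{"index": 0, "id": "call_0",
                     "function": {"name": "get_current_weather", "arguments": ""}}]},
    {"tool_calls": [{"index": 0, "function": {"arguments": '{"location": "San '}}]},
    {"tool_calls": [{"index": 0, "function": {"arguments": 'Francisco, CA"}'}}]},
    {"tool_calls": [{"index": 1, "id": "call_1",
                     "function": {"name": "get_current_weather",
                                  "arguments": '{"location": "Los Angeles, CA"}'}}]},
    {"content": None},
]

for call in accumulate_tool_calls(simulated_deltas):
    print(call["id"], call["name"], json.loads(call["arguments"]))
```

With the OpenAI SDK, the same function could be fed `chunk.choices[0].delta.model_dump()` for each chunk of a request made with `stream=True`.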

Prompt Raw Model

The Empower Functions model family has been tuned to natively produce JSON. We provide utilities in our Python package to convert OpenAI-formatted messages into prompts for the raw model. See below for a basic example; more details can be found here.

from transformers import AutoModelForCausalLM, AutoTokenizer
from prompt import prompt_messages

device = "cuda"

model_path = 'empower-dev/empower-functions-small'
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_path)

functions = [
    {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state, e.g. San Francisco, CA",
                },
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location"],
        },
    }
]

messages = [
    {'role': 'user', 'content': 'What\'s the weather in San Francisco and Los Angeles in Celsius?'},
]

messages = prompt_messages(messages, functions)
model_inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt").to(model.device)

generated_ids = model.generate(model_inputs, max_new_tokens=128)
decoded = tokenizer.batch_decode(generated_ids)

print(decoded[0])

Training Approach

Empower's function models are fine-tuned based on state-of-the-art OSS models. We divided the training into two phases.

First, we perform SFT (supervised fine-tuning) using over 100k rows of hand-curated, high-quality conversations involving function calling. These conversations cover different scenarios such as single-turn, multi-turn, and parallel calling. Specifically, the model is trained to use its beginning tokens to signal whether it is calling functions or returning regular conversation (using dedicated tags). It then returns function calls as JSON or conversation as usual, which makes streaming integration very straightforward. SFT gives the model a very strong foundation covering various scenarios for general use cases.
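The sketch below shows why a leading tag makes streaming simple: once the first few tokens are read, the client knows whether to forward text immediately or to buffer JSON. The tag strings `<f>` and `<c>` are hypothetical placeholders chosen for this illustration, not the model's confirmed tags, and the `route_stream` helper is likewise an assumption.

```python
import json

FUNCTION_TAG = "<f>"       # hypothetical placeholder tag for function calls
CONVERSATION_TAG = "<c>"   # hypothetical placeholder tag for regular conversation

def route_stream(tokens):
    """Yield ("text", chunk) events for conversation, or one ("tool_calls", parsed) event."""
    buffer = ""
    mode = None
    for token in tokens:
        buffer += token
        if mode is None:
            if buffer.startswith(CONVERSATION_TAG):
                mode = "conversation"
                buffer = buffer[len(CONVERSATION_TAG):]
            elif buffer.startswith(FUNCTION_TAG):
                mode = "function"
                buffer = buffer[len(FUNCTION_TAG):]
            else:
                continue  # the leading tag is not complete yet
        if mode == "conversation" and buffer:
            yield ("text", buffer)  # conversation text can be forwarded right away
            buffer = ""
    if mode == "function":
        # Function calls are buffered and parsed once the stream ends.
        yield ("tool_calls", json.loads(buffer))
    elif mode is None and buffer:
        yield ("text", buffer)  # no recognizable tag: fall back to plain text

chat_events = list(route_stream(["<", "c", ">", "Hello", " there"]))
call_events = list(route_stream([
    "<f>",
    '[{"name": "get_current_weather",',
    ' "arguments": {"location": "San Francisco, CA"}}]',
]))
print(chat_events)
print(call_events)
```

The design choice illustrated here is that the decision is made from the prefix alone, so no lookahead over the full response is needed before starting to stream conversation text.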

Next, we apply DPO (Direct Preference Optimization) for trickier scenarios where SFT is less effective. For instance, when function specifications include examples for arguments, we want to prevent the model from hallucinating argument values from these examples. We have found DPO to be very effective in correcting such misbehavior with a relatively small amount of data.
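To illustrate the kind of data this phase relies on, the sketch below builds one preference pair for the scenario described above, plus a small check that flags argument values copied from the specification's examples. The pair layout (`prompt`, `chosen`, `rejected`) follows a common DPO convention and is an assumption, as is the `hallucinated_from_examples` helper; neither is taken from Empower's actual training data or tooling.

```python
import json

tool_spec = {
    "name": "get_current_weather",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {
                "type": "string",
                "description": "The city and state, e.g. San Francisco, CA",
            }
        },
        "required": ["location"],
    },
}

# Hypothetical preference pair: the user never mentions a city, so the preferred
# response asks for clarification instead of reusing the example from the spec.
preference_pair = {
    "prompt": [{"role": "user", "content": "What's the weather like today?"}],
    "chosen": {"role": "assistant", "content": "Which city would you like the weather for?"},
    "rejected": {"role": "assistant", "tool_call": {
        "name": "get_current_weather",
        "arguments": {"location": "San Francisco, CA"},
    }},
}

def hallucinated_from_examples(arguments: dict, user_text: str, spec: dict) -> list:
    """Return argument names whose string values appear in the spec but not in the user text."""
    spec_text = json.dumps(spec)
    return [
        name for name, value in arguments.items()
        if isinstance(value, str)
        and value in spec_text
        and value.lower() not in user_text.lower()
    ]

flagged = hallucinated_from_examples(
    preference_pair["rejected"]["tool_call"]["arguments"],
    preference_pair["prompt"][0]["content"],
    tool_spec,
)
print(flagged)
```

A check like this could help mine candidate rejected responses, though curating the preferred response still requires judgment about what the user actually asked for.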

Finally, we are committed to continuously optimizing the model for better quality across a wider range of use cases and scenarios :) We can further fine-tune the model based on your specific needs. Please contact us if you have any use-case-specific requirements!

Evaluation

We evaluate our models against the Berkeley Function Calling benchmark. Both the 8B and 70B versions have achieved state-of-the-art performance for their respective sizes:

image (captured on Sep 10, 2024)