
# Example: Improving Data Extraction (NER) by Fine-Tuning a Llama 3 Model

## Background

Named Entity Recognition (NER) is the process of identifying and categorizing named entities in text into predefined categories such as person, organization, location, and date. NER is a fundamental task in natural language processing (NLP) and is widely used in applications such as information extraction, question answering, and machine translation.

Once upon a time, this was done with rule-based systems or special-purpose models. Given the progress in foundation models, most practitioners would use an LLM for this task today, especially with recent advances in structured decoding and the JSON modes offered by most inference providers.

Here, we present a stylized example of an NER system that uses TensorZero JSON functions to extract named entities from text. We build on the CoNLL++ dataset and prior work from Predibase for the problem setting. Each example in the dataset includes a short segment of text and instructs the model to produce a JSON object of the named entities in the input. We provide the output schema to TensorZero at `config/functions/extract_entities/output_schema.json`. In our problem setting, we consider any output that fails to validate against the schema to be incorrect.
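For intuition, here's a minimal sketch of what such an output schema might look like. The actual file at `config/functions/extract_entities/output_schema.json` is authoritative; the categories below are an assumption based on CoNLL's person/organization/location/miscellaneous convention.

```json
{
  "type": "object",
  "properties": {
    "person": { "type": "array", "items": { "type": "string" } },
    "organization": { "type": "array", "items": { "type": "string" } },
    "location": { "type": "array", "items": { "type": "string" } },
    "miscellaneous": { "type": "array", "items": { "type": "string" } }
  },
  "required": ["person", "organization", "location", "miscellaneous"],
  "additionalProperties": false
}
```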

We'll show that a fine-tuned Llama 3.1 8B model can outperform GPT-4o on this task with a small amount of training data, while being served by Fireworks at a fraction of the cost and latency.

## Setup

### TensorZero

We've provided the TensorZero configuration files for this example in the `config` directory. See `tensorzero.toml` for the main configuration details.

To get started, create a `.env` file with your OpenAI API key (`OPENAI_API_KEY`) and Fireworks API key (`FIREWORKS_API_KEY`) and run the following command. Docker Compose will launch the TensorZero Gateway and a test ClickHouse database. Set `CLICKHOUSE_URL=http://localhost:8123/tensorzero` in the shell your notebook will run in.

```bash
docker compose up
```
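For reference, the `.env` file only needs the two keys (placeholder values shown):

```bash
# .env — replace the placeholders with your real keys
OPENAI_API_KEY=...
FIREWORKS_API_KEY=...
```

and the ClickHouse URL can be set in the notebook's shell like so:

```bash
export CLICKHOUSE_URL=http://localhost:8123/tensorzero
```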

### Python Environment

#### Using uv (Recommended)

```bash
uv venv  # Create a new virtual environment
uv pip sync requirements.txt  # Install the dependencies
```

#### Using pip

We recommend using Python 3.10+ and a virtual environment.

```bash
pip install -r requirements.txt
```

## Running the Example

You can run the example in the `conll.ipynb` notebook. Make sure you've installed the dependencies from `requirements.txt`. The notebook should not require any changes to run and will automatically connect to the TensorZero Gateway you started.

The notebook first attempts to solve the NER task using the `extract_entities` JSON function, randomly sampling between GPT-4o and vanilla Llama 3.1 8B for each inference. We then evaluate each output with two metrics, exact match and Jaccard similarity, and send both metrics to TensorZero as feedback so that it can learn from the results.
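As a rough sketch, an inference-plus-feedback loop with the TensorZero Python client looks something like the following. The gateway address, the metric names (`exact_match`, `jaccard_similarity`), and the gold-label format are assumptions for illustration; the notebook and `tensorzero.toml` define the real ones, and the client API may differ slightly across versions.

```python
from tensorzero import TensorZeroGateway

# Sketch only: the address, metric names, and gold-label format are assumptions.
with TensorZeroGateway("http://localhost:3000") as client:
    response = client.inference(
        function_name="extract_entities",
        input={"messages": [{"role": "user", "content": "EU rejects German call to boycott British lamb."}]},
    )
    predicted = response.output.parsed  # None if the output failed to validate against the schema

    # Flatten predictions and gold labels into sets of (category, entity) pairs.
    gold = {("organization", "EU"), ("miscellaneous", "German"), ("miscellaneous", "British")}
    pred = set() if predicted is None else {
        (category, entity) for category, entities in predicted.items() for entity in entities
    }

    exact_match = pred == gold
    jaccard = len(pred & gold) / len(pred | gold) if pred | gold else 1.0

    # Attach both metrics to this inference so TensorZero can learn from them.
    client.feedback(metric_name="exact_match", inference_id=response.inference_id, value=exact_match)
    client.feedback(metric_name="jaccard_similarity", inference_id=response.inference_id, value=jaccard)
```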

Afterwards, we run an evaluation on a subset of the test set (the same subset for each variant) to get a clear picture of each variant's performance. These inferences are performed with an explicit variant specified and with dryrun set to true to avoid storing the data and contaminating the training set.
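Concretely, pinning a variant and skipping storage looks roughly like this (another sketch; the variant name here is a placeholder):

```python
# Sketch only: evaluate a specific variant without writing to ClickHouse.
response = client.inference(
    function_name="extract_entities",
    variant_name="gpt_4o",  # placeholder; pin the variant being evaluated
    input={"messages": [{"role": "user", "content": "EU rejects German call to boycott British lamb."}]},
    dryrun=True,  # don't store this inference, so it can't leak into training data
)
```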

## Improving the NER System

At this point, your ClickHouse database will include inferences in a structured format along with feedback on how they went. You can now use TensorZero recipes to learn from this experience and produce better variants of the NER system. You might notice that the best-performing LLM is GPT-4o from OpenAI (not surprising!).

However, we offer a recipe in `recipes/supervised_fine_tuning/metrics/fireworks/` that can be used with very small amounts of data to fine-tune a Llama 3.1 8B model to outperform GPT-4o at a fraction of the cost and latency! At the conclusion of that notebook, you should see a few blocks to add to `tensorzero.toml` that update the system to use the new model and the corresponding variant.
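For illustration, the generated blocks look something like the following. All names here are hypothetical; the recipe notebook produces the real values, including the Fireworks model ID of your fine-tuned deployment.

```toml
# Hypothetical sketch — the recipe notebook generates the actual names and model ID.
[models.llama_3_1_8b_ner_fine_tuned]
routing = ["fireworks"]

[models.llama_3_1_8b_ner_fine_tuned.providers.fireworks]
type = "fireworks"
model_name = "accounts/your-account/models/your-fine-tuned-model"

[functions.extract_entities.variants.llama_3_1_8b_fine_tuned]
type = "chat_completion"
model = "llama_3_1_8b_ner_fine_tuned"
weight = 1  # positive weight so the gateway actually samples this variant
```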

You can also easily experiment with other recipes, models, or prompts you think might be better, or combinations thereof, by editing the configuration.

## Experimenting with Improved Variants

Once you've generated one or more improved variants (and, critically, given them some positive weight), you should restart the TensorZero Gateway with the new configuration:

```bash
docker compose up
```

You can then re-run the test set evaluation in the `conll.ipynb` notebook to see how the new variants perform.

From a single fine-tuning run with just 100-200 examples, we see the Llama 3.1 8B model greatly outperform GPT-4o on this task!