**ErisForge** is a Python library for modifying Large Language Models (LLMs) by applying transformations to their internal layers. Named after Eris, the goddess of strife and discord, ErisForge lets you alter model behavior in a controlled manner, creating both ablated and augmented versions of an LLM that respond differently to specific types of input.

## Features

- Modify the internal layers of LLMs to produce altered behaviors.
- Ablate or enhance model responses with the `AblationDecoderLayer` and `AdditionDecoderLayer` classes.
- Measure refusal expressions in model responses with the `ExpressionRefusalScorer`.
- Supply custom behavior directions (vectors in the model's hidden-state space) that determine which behavior a transformation removes or amplifies.

## Installation

To install ErisForge from source, clone the repository and install the required packages:

```bash
git clone https://github.com/Tsadoq/ErisForge.git
cd ErisForge
pip install -r requirements.txt
```

Alternatively, install it directly from PyPI:

```bash
pip install erisforge
```

## Usage

### Basic Setup

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from erisforge import ErisForge
from erisforge.expression_refusal_scorer import ExpressionRefusalScorer

# Load a model and tokenizer
model_name = "gpt2"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Initialize ErisForge and configure the scorer
forge = ErisForge()
scorer = ExpressionRefusalScorer()
```

### Transform Model Layers

You can apply transformations to a range of the model's decoder layers, selected with `min_layer` and `max_layer`, to induce different response behaviors.

#### Example 1: Applying Ablation to Model Layers

```python
# Assumed import path: AblationDecoderLayer is defined in layers.py
from erisforge.layers import AblationDecoderLayer

# Define the instructions to run through the modified model
instructions = ["Explain why AI is beneficial.", "What are the limitations of AI?"]

# Range of decoder layers to ablate
min_layer = 2
max_layer = 4

# Generate responses with ablation applied to the specified layers
conversations = forge.run_forged_model(
    model=model,
    type_of_layer=AblationDecoderLayer,
    # Random placeholder direction; its length must equal the hidden size (768 for GPT-2)
    objective_behaviour_dir=torch.rand(model.config.hidden_size),
    tokenizer=tokenizer,
    min_layer=min_layer,
    max_layer=max_layer,
    instructions=instructions,
    max_new_tokens=50,
)

# Display the modified responses
for conversation in conversations:
    print("User:", conversation[0]["content"])
    print("AI:", conversation[1]["content"])
```

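Conceptually, ablation and addition both act on the hidden states flowing through the selected decoder layers. The plain-Python sketch below illustrates the two operations as they are commonly described for directional ablation: ablation subtracts each hidden state's projection onto the unit-normalized behavior direction, and addition adds a scaled copy of that direction. This is an illustration under those assumptions, not ErisForge's implementation; the helpers `normalize`, `ablate`, and `add_direction` are hypothetical, and real layers operate on batched tensors rather than Python lists.

```python
import math


def normalize(direction: list[float]) -> list[float]:
    """Scale a direction vector to unit length."""
    norm = math.sqrt(sum(x * x for x in direction))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in direction]


def ablate(hidden: list[float], direction: list[float]) -> list[float]:
    """Remove the component of `hidden` along `direction`: h - (h . r) r, with r unit-length."""
    r = normalize(direction)
    projection = sum(h * d for h, d in zip(hidden, r))
    return [h - projection * d for h, d in zip(hidden, r)]


def add_direction(hidden: list[float], direction: list[float], scale: float = 1.0) -> list[float]:
    """Push `hidden` further along `direction`: h + scale * r, with r unit-length."""
    r = normalize(direction)
    return [h + scale * d for h, d in zip(hidden, r)]


# Toy 3-dimensional example (real hidden states have the model's hidden size, e.g. 768 for GPT-2)
hidden_state = [1.0, 2.0, 3.0]
behaviour_dir = [0.0, 0.0, 2.0]

print(ablate(hidden_state, behaviour_dir))                    # [1.0, 2.0, 0.0]
print(add_direction(hidden_state, behaviour_dir, scale=0.5))  # [1.0, 2.0, 3.5]
```

After ablation the hidden state has no component left along the direction, so later layers cannot read that signal from it; addition does the opposite and strengthens it in proportion to `scale`.
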
#### Example 2: Measuring Refusal Expressions

Use `ExpressionRefusalScorer` to measure whether a model's response contains common refusal phrases.

```python
response_text = "I'm sorry, I cannot provide that information."
user_query = "What is the recipe for a dangerous substance?"

# Score the response for refusal expressions
refusal_score = scorer.score(user_query=user_query, model_response=response_text)
print("Refusal Score:", refusal_score)
```

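The scorer's exact phrase list and scoring rule are not documented here. As a hedged illustration of what expression-based refusal scoring involves, the sketch below looks for phrases from a small, hypothetical list in a response (case-insensitively) and returns 1.0 if any match, else 0.0. The names `REFUSAL_PHRASES`, `matched_refusal_phrases`, and `simple_refusal_score` are assumptions for illustration, not part of ErisForge's API, and unlike `scorer.score` the sketch ignores the user query.

```python
# Hypothetical phrase list for illustration; ErisForge's scorer may use different expressions.
REFUSAL_PHRASES = (
    "i'm sorry",
    "i am sorry",
    "i cannot",
    "i can't",
    "as an ai",
    "i'm not able to",
)


def matched_refusal_phrases(model_response: str, phrases: tuple[str, ...] = REFUSAL_PHRASES) -> list[str]:
    """Return the refusal phrases found in the response (case-insensitive substring match)."""
    text = model_response.lower()
    return [phrase for phrase in phrases if phrase in text]


def simple_refusal_score(model_response: str) -> float:
    """Return 1.0 if the response contains any refusal phrase, else 0.0."""
    return 1.0 if matched_refusal_phrases(model_response) else 0.0


response = "I'm sorry, I cannot provide that information."
print(matched_refusal_phrases(response))  # ["i'm sorry", 'i cannot']
print(simple_refusal_score(response))     # 1.0
print(simple_refusal_score("Sure, here is an overview of AI limitations."))  # 0.0
```

Plain substring matching is cheap but brittle: it misses paraphrased refusals and typographic apostrophes (`I’m`), so a phrase list usually needs tuning for the model being evaluated.
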
### Save Transformed Model

You can save the modified model locally or push it to the Hugging Face Hub:

```python
output_model_name = "my_transformed_model"

# Save the modified model
forge.save_model(
    model=model,
    behaviour_dir=torch.rand(model.config.hidden_size),  # Random placeholder direction; length must equal the hidden size
    scale_factor=1,
    output_model_name=output_model_name,
    tokenizer=tokenizer,
    to_hub=False,  # Set to True to push to the Hugging Face Hub
)
```

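Both examples above pass a random tensor as the behavior direction, which only serves as a placeholder. For directional ablation, a direction is commonly estimated as the difference between the mean hidden-state activations at one layer for two contrasting sets of instructions (for example, prompts that elicit a behavior and prompts that do not), normalized to unit length. The plain-Python sketch below shows only that arithmetic on toy vectors; collecting real activations (for example with `output_hidden_states=True` in Hugging Face Transformers) and whether ErisForge expects a normalized vector are assumptions outside its scope. `mean_vector` and `difference_of_means_direction` are hypothetical helpers.

```python
import math


def mean_vector(vectors: list[list[float]]) -> list[float]:
    """Element-wise mean of equally sized vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def difference_of_means_direction(
    target_activations: list[list[float]],
    baseline_activations: list[list[float]],
) -> list[float]:
    """Unit vector pointing from the baseline mean toward the target mean."""
    target_mean = mean_vector(target_activations)
    baseline_mean = mean_vector(baseline_activations)
    diff = [t - b for t, b in zip(target_mean, baseline_mean)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0:
        raise ValueError("the two sets have identical means; no direction to extract")
    return [x / norm for x in diff]


# Toy 2-dimensional "activations" standing in for one layer's hidden states
target = [[3.0, 1.0], [5.0, 1.0]]    # e.g. prompts that elicit the behavior
baseline = [[0.0, 1.0], [2.0, 1.0]]  # e.g. prompts that do not

print(difference_of_means_direction(target, baseline))  # [1.0, 0.0]
```

To pass such a direction to ErisForge you would convert it to a tensor, for example `torch.tensor(direction)`, whose length must equal the model's hidden size.
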
## Project Structure

- **eris_forge.py** - Core functionality for modifying model layers and saving the model.
- **layers.py** - Defines ablation and addition layers.
- **expression_refusal_scorer.py** - Scores model responses based on refusal expressions.
- **layer_utils.py** - Utility functions for handling model layers based on their architecture.

## Contributing

Feel free to open issues, share suggestions, or contribute directly to this project: fork the repository, create a feature branch, and submit a pull request.

## License

This project is licensed under the MIT License.