Qwen Arabic Fine-tuning Project

This project fine-tunes the Qwen2-1.5B model for Arabic language tasks using Quantized LoRA (QLoRA).

Qwen-Arabic Evaluation on ArabicMMLU

This section reports an evaluation of the Qwen-Arabic language model (1.5B parameters) on the ArabicMMLU benchmark. The model is parameter-efficient: it reaches 42.3% average accuracy, about 58% of GPT-4's score, with 1.5B parameters.

Model Overview

Qwen-Arabic is a 1.5B-parameter language model fine-tuned for Arabic language tasks. It is based on Qwen2-1.5B and fine-tuned with QLoRA (Quantized Low-Rank Adaptation).

Performance Results

Overall Performance

Average Accuracy: 42.3%
Best Category: Social Science (46.1%)
Most Challenging: Arabic Language (37.8%)

Category-wise Performance

| Category | Accuracy (%) |
|---|---|
| STEM | 42.2 |
| Social Science | 46.1 |
| Humanities | 41.8 |
| Arabic Language | 37.8 |
| Other | 42.9 |
| Average | 42.3 |
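
The best and most challenging categories listed under Overall Performance can be re-derived from the table. The following is a minimal Python sketch with the accuracies copied from the table; the variable names are illustrative and are not taken from the project's scripts.

```python
# Category accuracies (%) copied from the table above.
category_accuracy = {
    "STEM": 42.2,
    "Social Science": 46.1,
    "Humanities": 41.8,
    "Arabic Language": 37.8,
    "Other": 42.9,
}

# Highest- and lowest-scoring categories.
best = max(category_accuracy, key=category_accuracy.get)
hardest = min(category_accuracy, key=category_accuracy.get)

# Plain (unweighted) mean over the five categories.
unweighted_mean = sum(category_accuracy.values()) / len(category_accuracy)

print(f"Best category: {best} ({category_accuracy[best]}%)")
print(f"Most challenging: {hardest} ({category_accuracy[hardest]}%)")
print(f"Unweighted mean of categories: {unweighted_mean:.2f}%")
```

The unweighted mean comes out slightly below the reported 42.3% average. That would be consistent with an average taken over individual questions rather than over categories, but the README does not state which method was used.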

Efficiency Analysis

Performance per billion parameters: 28.20 accuracy points (42.3% / 1.5B)
About 389x the accuracy per billion parameters of GPT-4, taking GPT-4's size as roughly 1000B parameters
Achieves 58.3% of GPT-4's average accuracy with only 0.15% of the parameters

Comparison with Other Models

| Model | Parameters | Average Accuracy | Efficiency Score |
|---|---|---|---|
| GPT-4 | ~1000B | 72.5% | 0.072 |
| Jais-chat | 30B | 62.3% | 2.077 |
| AceGPT-chat | 13B | 52.6% | 4.046 |
| Qwen-Arabic | 1.5B | 42.3% | 28.200 |
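
The Efficiency Score column and the figures in Efficiency Analysis appear to be accuracy divided by parameter count in billions. The sketch below recomputes them under that assumption; the function and variable names are illustrative, and GPT-4's size is the table's approximate value.

```python
# (parameters in billions, average accuracy in %) copied from the table above.
# GPT-4's parameter count is the table's approximate figure (~1000B).
models = {
    "GPT-4": (1000.0, 72.5),
    "Jais-chat": (30.0, 62.3),
    "AceGPT-chat": (13.0, 52.6),
    "Qwen-Arabic": (1.5, 42.3),
}


def efficiency_score(params_b: float, accuracy: float) -> float:
    """Accuracy points per billion parameters."""
    return accuracy / params_b


scores = {name: efficiency_score(p, a) for name, (p, a) in models.items()}

qwen_params, qwen_acc = models["Qwen-Arabic"]
gpt4_params, gpt4_acc = models["GPT-4"]

relative_efficiency = scores["Qwen-Arabic"] / scores["GPT-4"]  # ratio of scores
relative_accuracy = 100 * qwen_acc / gpt4_acc                  # % of GPT-4 accuracy
relative_size = 100 * qwen_params / gpt4_params                # % of GPT-4 parameters

for name, score in scores.items():
    print(f"{name:12s} {score:7.3f}")
print(f"{relative_efficiency:.1f}x, {relative_accuracy:.1f}%, {relative_size:.2f}%")
```

Because the ratio depends on an approximate GPT-4 parameter count, the relative-efficiency figure is best read as a rough estimate rather than a precise measurement.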

Prerequisites

Ubuntu (or similar Linux distribution)
Python 3.10
CUDA-compatible GPU with at least 4GB VRAM
At least 12GB system RAM
Ollama installed and configured
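
Some of these prerequisites can be checked before setup. The sketch below uses only the Python standard library; `check_prerequisites` is a hypothetical helper, not part of this repository, and the presence of `nvidia-smi` on `PATH` is only a rough proxy for a working GPU driver.

```python
import os
import shutil
import sys


def check_prerequisites() -> dict:
    """Return a name -> bool map of basic environment checks."""
    checks = {
        # The README targets Python 3.10.
        "python_3_10": sys.version_info[:2] == (3, 10),
        # The Ollama CLI should be on PATH.
        "ollama_on_path": shutil.which("ollama") is not None,
        # nvidia-smi on PATH suggests an NVIDIA driver is installed.
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
    }
    try:
        # Total physical RAM in bytes (available on Linux).
        ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
        # Reported RAM is often slightly below the nominal size,
        # so compare against 12 * 10**9 bytes rather than 12 GiB.
        checks["ram_at_least_12gb"] = ram_bytes >= 12 * 10**9
    except (ValueError, OSError, AttributeError):
        checks["ram_at_least_12gb"] = False  # could not be determined
    return checks


for name, ok in check_prerequisites().items():
    print(f"{name}: {'ok' if ok else 'missing or not detected'}")
```

GPU memory is not covered here, since it needs a CUDA-aware library such as PyTorch.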

Setup

Clone this repository:

```
git clone https://github.com/prakash-aryan/qwen-arabic-project.git
cd qwen-arabic-project
```

Create and activate a virtual environment:

```
python3.10 -m venv qwen_env
source qwen_env/bin/activate
```

Install the required packages:

```
pip install --upgrade pip
pip install -r requirements.txt
```

Install PyTorch with CUDA support:

```
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```
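
After installing PyTorch, it can help to confirm that it sees the GPU and that the card meets the 4GB VRAM prerequisite. The following is a small sketch; `describe_cuda` is a hypothetical helper, and it returns a message instead of failing when PyTorch is absent.

```python
import importlib.util


def describe_cuda() -> str:
    """Summarize whether PyTorch can see a CUDA GPU."""
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed in this environment"
    import torch

    if not torch.cuda.is_available():
        return f"PyTorch {torch.__version__} installed, but no CUDA device is visible"
    # Properties of the first visible GPU.
    props = torch.cuda.get_device_properties(0)
    vram_gib = props.total_memory / 2**30
    return f"PyTorch {torch.__version__}, GPU: {props.name}, {vram_gib:.1f} GiB VRAM"


print(describe_cuda())
```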

Project Structure

```
qwen-arabic-project/
├── data/
│   └── arabic_instruction_dataset/
├── models/
├── results/
├── src/
│   ├── compare_qwen_models.py
│   ├── evaluate_arabic_model.py
│   ├── finetune_qwen.py
│   ├── get_datasets.py
│   ├── load_and_merge_model.py
│   ├── preprocess_datasets.py
│   └── validate_dataset.py
├── tools/
│   └── llama-quantize
├── requirements.txt
├── run_pipeline.sh
├── Modelfile
└── README.md
```

Usage

Download and prepare datasets:
```
python src/get_datasets.py
```
Preprocess and combine datasets:
```
python src/preprocess_datasets.py
```
Validate the dataset:
```
python src/validate_dataset.py
```

Fine-tune the model:

```
python src/finetune_qwen.py \
  --data_path ./data/arabic_instruction_dataset \
  --output_dir ./models/qwen2_arabic_finetuned \
  --num_epochs 3 \
  --batch_size 1 \
  --gradient_accumulation_steps 16 \
  --learning_rate 2e-5
```

Load and merge the fine-tuned model:
```
python src/load_and_merge_model.py
```

Convert to GGUF format:

```
python src/convert_hf_to_gguf.py ./models/qwen2_arabic_merged_full --outfile ./models/qwen_arabic_merged_full.gguf
```

Quantize the model:

```
./tools/llama-quantize ./models/qwen_arabic_merged_full.gguf ./models/qwen_arabic_merged_full_q4_k_m.gguf q4_k_m
```

Create Ollama model:

```
ollama create qwen-arabic-custom -f Modelfile
```
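
Once the model has been created, it can be queried through Ollama's local HTTP API. The sketch below assumes an Ollama server running on its default port (11434) and the model name used above; `build_request` and `generate` are hypothetical helper names, not part of this repository.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(prompt: str, model: str = "qwen-arabic-custom") -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "qwen-arabic-custom") -> str:
    """Send the prompt and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=120) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]


# Example (needs a running Ollama server with the model created as above):
# print(generate("ما هي عاصمة مصر؟"))
```

Setting `"stream": False` asks the server for a single JSON object, which keeps the client code short; streaming responses would need to be read line by line.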

Evaluate the model:
```
python src/evaluate_arabic_model.py
```
Compare models:
```
python src/compare_qwen_models.py
```

Running the Full Pipeline

To run the entire pipeline from data preparation to model evaluation, use the provided shell script:

```
chmod +x run_pipeline.sh
./run_pipeline.sh
```
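
The contents of `run_pipeline.sh` are not shown in this README. As an illustration only, the steps from the Usage section could be chained in Python as follows, stopping at the first failure; `PIPELINE` and `run_pipeline` are hypothetical names, and the script in the repository may differ.

```python
import shlex
import subprocess

# Steps copied from the Usage section, in order.
PIPELINE = [
    "python src/get_datasets.py",
    "python src/preprocess_datasets.py",
    "python src/validate_dataset.py",
    "python src/finetune_qwen.py --data_path ./data/arabic_instruction_dataset "
    "--output_dir ./models/qwen2_arabic_finetuned --num_epochs 3 --batch_size 1 "
    "--gradient_accumulation_steps 16 --learning_rate 2e-5",
    "python src/load_and_merge_model.py",
    "python src/convert_hf_to_gguf.py ./models/qwen2_arabic_merged_full "
    "--outfile ./models/qwen_arabic_merged_full.gguf",
    "./tools/llama-quantize ./models/qwen_arabic_merged_full.gguf "
    "./models/qwen_arabic_merged_full_q4_k_m.gguf q4_k_m",
    "ollama create qwen-arabic-custom -f Modelfile",
    "python src/evaluate_arabic_model.py",
    "python src/compare_qwen_models.py",
]


def run_pipeline(steps=PIPELINE, dry_run=True):
    """Run each step in order, stopping at the first failure.

    With dry_run=True the commands are only parsed and returned, not executed.
    """
    parsed = [shlex.split(step) for step in steps]
    if not dry_run:
        for argv in parsed:
            print("+", " ".join(argv))
            subprocess.run(argv, check=True)  # raises CalledProcessError on failure
    return parsed
```

Calling `run_pipeline()` returns the parsed commands without running anything; `run_pipeline(dry_run=False)` executes them from the repository root.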

Notes

Ensure you have sufficient disk space for the datasets and model files.
The fine-tuning process can take several hours to days, depending on your hardware.
Monitor GPU memory usage during fine-tuning and adjust batch size or gradient accumulation steps if necessary.
Make sure to have Ollama installed for the model creation and evaluation steps.

Troubleshooting

If you encounter CUDA out-of-memory errors, try reducing the batch size or increasing gradient accumulation steps.
For any other issues, please check the error logs or open an issue in the GitHub repository.
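
The batch-size advice above relies on the usual relationship between batch size and gradient accumulation: on a single GPU, the number of examples per optimizer step is their product. The sketch below assumes `finetune_qwen.py` follows that convention; the helper names are illustrative.

```python
def effective_batch_size(batch_size: int, gradient_accumulation_steps: int) -> int:
    """Number of examples contributing to each optimizer step (single GPU)."""
    return batch_size * gradient_accumulation_steps


def rebalance(batch_size: int, gradient_accumulation_steps: int, new_batch_size: int) -> int:
    """Accumulation steps needed to keep the effective batch size unchanged."""
    target = effective_batch_size(batch_size, gradient_accumulation_steps)
    if target % new_batch_size != 0:
        raise ValueError("new_batch_size must divide the effective batch size")
    return target // new_batch_size


# Settings from the fine-tuning command in Usage: batch size 1, 16 accumulation steps.
print(effective_batch_size(1, 16))

# Hypothetical starting point of batch size 4 with 4 accumulation steps,
# reduced to batch size 1 while keeping the same effective batch size.
print(rebalance(4, 4, new_batch_size=1))
```

Lowering the batch size reduces peak activation memory, while raising the accumulation steps by the same factor keeps the optimization behaviour roughly comparable at the cost of longer steps.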

License

This project is licensed under the GNU General Public License v3.0 (GPL-3.0).

This means:

You can use, modify, and distribute this software.
If you distribute modified versions, you must also distribute them under the GPL-3.0.
You must include the original copyright notice and the license text.
You must disclose your source code when you distribute the software.
There's no warranty for this free software.

For more details, see the LICENSE file in this repository or the full text of the GNU GPL v3.0.

Acknowledgements

This project uses the following main libraries and tools:

Transformers by Hugging Face
PyTorch
PEFT (Parameter-Efficient Fine-Tuning)
Ollama
GGUF (for model conversion)
