This project fine-tunes the Qwen2-1.5B model for Arabic language tasks using Quantized LoRA (QLoRA).
This README also reports an evaluation of the Qwen-Arabic language model (1.5B parameters) on the ArabicMMLU benchmark. At 42.3% average accuracy, the model trails much larger models in absolute terms but delivers far more accuracy per parameter across the knowledge domains tested.
Qwen-Arabic is a 1.5B parameter language model fine-tuned for Arabic language tasks. It is based on the Qwen architecture and optimized using QLoRA (Quantized Low-Rank Adaptation) techniques.
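As a rough illustration of what QLoRA involves, the sketch below loads a base model with 4-bit quantized weights and attaches trainable low-rank adapters using Transformers, bitsandbytes, and PEFT. The model ID `Qwen/Qwen2-1.5B`, the rank, alpha, dropout, and target modules are assumptions chosen for illustration; they are not taken from `src/finetune_qwen.py`, which may configure things differently.

```python
def build_qlora_model(model_id: str = "Qwen/Qwen2-1.5B"):
    """Load a causal LM in 4-bit and wrap it with LoRA adapters (illustrative sketch)."""
    # Imports are local so this file can be imported without the GPU stack installed.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    # "Quantized": the frozen base weights are stored in 4-bit NF4.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.float16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map="auto"
    )
    model = prepare_model_for_kbit_training(model)

    # "Low-rank adaptation": only small adapter matrices are trained.
    # All values below are assumed, not the project's actual hyperparameters.
    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        bias="none",
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )
    return get_peft_model(model, lora_config)
```

Calling `build_qlora_model()` downloads the base weights and needs a CUDA GPU with bitsandbytes installed; the returned model's `print_trainable_parameters()` method shows how small the trainable fraction is.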
- Average Accuracy: 42.3%
- Best Category: Social Science (46.1%)
- Most Challenging: Arabic Language (37.8%)
Category | Accuracy (%) |
---|---|
STEM | 42.2 |
Social Science | 46.1 |
Humanities | 41.8 |
Arabic Language | 37.8 |
Other | 42.9 |
Average | 42.3 |
- Performance per billion parameters: 28.20 accuracy points (42.3 / 1.5)
- 389.0x more parameter-efficient than GPT-4, taking GPT-4 at roughly 1000B parameters
- Achieves 58.3% of GPT-4's average accuracy with only 0.15% of the parameters
Model | Parameters | Average Accuracy | Efficiency Score |
---|---|---|---|
GPT-4 | ~1000B | 72.5% | 0.072 |
Jais-chat | 30B | 62.3% | 2.077 |
AceGPT-chat | 13B | 52.6% | 4.046 |
Qwen-Arabic | 1.5B | 42.3% | 28.200 |
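The efficiency score in the table appears to be average accuracy divided by parameter count in billions. The short sketch below recomputes the scores and the GPT-4 comparisons from the table's own numbers; the variable and function names are illustrative, and the GPT-4 parameter count is the table's rough estimate rather than a published figure.

```python
# (model, parameters in billions, average accuracy in %) copied from the table above.
MODELS = [
    ("GPT-4", 1000.0, 72.5),  # parameter count is the table's rough estimate
    ("Jais-chat", 30.0, 62.3),
    ("AceGPT-chat", 13.0, 52.6),
    ("Qwen-Arabic", 1.5, 42.3),
]


def efficiency_score(accuracy_pct: float, params_billions: float) -> float:
    """Accuracy points per billion parameters."""
    return accuracy_pct / params_billions


scores = {name: efficiency_score(acc, params) for name, params, acc in MODELS}

# Ratios quoted in the efficiency bullets.
relative_efficiency = scores["Qwen-Arabic"] / scores["GPT-4"]
relative_accuracy = 42.3 / 72.5
parameter_fraction = 1.5 / 1000.0

for name, score in scores.items():
    print(f"{name:12s} {score:8.3f}")
print(f"{relative_efficiency:.1f}x more efficient, "
      f"{relative_accuracy:.1%} of the accuracy, "
      f"{parameter_fraction:.2%} of the parameters")
```

Because the score divides by parameter count, it rewards small models heavily; it is best read alongside the absolute accuracy column rather than on its own.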
- Ubuntu (or similar Linux distribution)
- Python 3.10
- CUDA-compatible GPU with at least 4GB VRAM
- At least 12GB system RAM
- Ollama installed and configured
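Before installing, it can help to check the prerequisites programmatically. The following is a minimal sketch using only the standard library; the function name and result keys are made up for this example, and the RAM check assumes a Linux `/proc/meminfo`.

```python
import shutil
import sys


def check_prerequisites(min_ram_gb: float = 12.0) -> dict:
    """Return one flag per prerequisite; None means "could not determine"."""
    results = {
        "python_3_10": sys.version_info[:2] == (3, 10),
        "ollama_on_path": shutil.which("ollama") is not None,
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
        "enough_ram": None,
    }
    try:
        # /proc/meminfo is Linux-specific; MemTotal is reported in kB.
        with open("/proc/meminfo") as meminfo:
            for line in meminfo:
                if line.startswith("MemTotal:"):
                    total_gb = int(line.split()[1]) / (1024 ** 2)
                    results["enough_ram"] = total_gb >= min_ram_gb
                    break
    except OSError:
        pass  # not Linux, or /proc is unavailable
    return results


for name, ok in check_prerequisites().items():
    print(f"{name}: {ok}")
```

The kernel reserves some memory, so a machine sold as 12GB may report slightly less; lowering `min_ram_gb` a little avoids a false alarm in that case.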
- Clone this repository:

      git clone https://github.com/prakash-aryan/qwen-arabic-project.git
      cd qwen-arabic-project

- Create and activate a virtual environment:

      python3.10 -m venv qwen_env
      source qwen_env/bin/activate

- Install the required packages:

      pip install --upgrade pip
      pip install -r requirements.txt

- Install PyTorch with CUDA support:

      pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
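Once the installation finishes, a quick check that PyTorch can see the GPU saves a failed fine-tuning run later. This is a small sketch, not part of the repository; `describe_cuda` is a hypothetical helper name.

```python
import importlib.util


def describe_cuda() -> str:
    """Summarise whether PyTorch is installed and can use a CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed"
    import torch

    if not torch.cuda.is_available():
        return f"torch {torch.__version__}: CUDA not available"
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / (1024 ** 3)
    return f"torch {torch.__version__}: {props.name}, {vram_gb:.1f} GB VRAM"


print(describe_cuda())
```

If the output says CUDA is not available, the usual suspects are a CPU-only PyTorch wheel or a driver that does not support the CUDA 11.8 build installed above.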
qwen-arabic-project/
├── data/
│   └── arabic_instruction_dataset/
├── models/
├── results/
├── src/
│   ├── compare_qwen_models.py
│   ├── evaluate_arabic_model.py
│   ├── finetune_qwen.py
│   ├── get_datasets.py
│   ├── load_and_merge_model.py
│   ├── preprocess_datasets.py
│   └── validate_dataset.py
├── tools/
│   └── llama-quantize
├── requirements.txt
├── run_pipeline.sh
├── Modelfile
└── README.md
- Download and prepare datasets:

      python src/get_datasets.py

- Preprocess and combine datasets:

      python src/preprocess_datasets.py

- Validate the dataset:

      python src/validate_dataset.py

- Fine-tune the model:

      python src/finetune_qwen.py \
        --data_path ./data/arabic_instruction_dataset \
        --output_dir ./models/qwen2_arabic_finetuned \
        --num_epochs 3 \
        --batch_size 1 \
        --gradient_accumulation_steps 16 \
        --learning_rate 2e-5

- Load and merge the fine-tuned model:

      python src/load_and_merge_model.py

- Convert to GGUF format:

      python src/convert_hf_to_gguf.py ./models/qwen2_arabic_merged_full --outfile ./models/qwen_arabic_merged_full.gguf

- Quantize the model:

      ./tools/llama-quantize ./models/qwen_arabic_merged_full.gguf ./models/qwen_arabic_merged_full_q4_k_m.gguf q4_k_m

- Create Ollama model:

      ollama create qwen-arabic-custom -f Modelfile

- Evaluate the model:

      python src/evaluate_arabic_model.py

- Compare models:

      python src/compare_qwen_models.py
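After `ollama create` succeeds, the model can also be queried directly, which is a quick sanity check before running the evaluation scripts. The sketch below targets Ollama's local REST endpoint at its default port and uses only the standard library; the helper names are illustrative and unrelated to how `src/evaluate_arabic_model.py` talks to the model.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_generate_request(prompt: str, model: str = "qwen-arabic-custom") -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "qwen-arabic-custom", timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text (requires a running Ollama server)."""
    request = build_generate_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["response"]
```

With the Ollama server running, `print(generate("ما هي عاصمة مصر؟"))` should return a short Arabic answer to "What is the capital of Egypt?".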
To run the entire pipeline from data preparation to model evaluation, use the provided shell script:
chmod +x run_pipeline.sh
./run_pipeline.sh
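The contents of `run_pipeline.sh` are not reproduced here. As a rough equivalent, the sketch below runs the manual steps in order and stops at the first failure; the command list is copied from the steps above, the script itself may differ, and `run_pipeline` and `dry_run` are names invented for this example.

```python
import shlex
import subprocess

# Commands copied from the manual steps; run_pipeline.sh itself may differ.
PIPELINE_STEPS = [
    "python src/get_datasets.py",
    "python src/preprocess_datasets.py",
    "python src/validate_dataset.py",
    "python src/finetune_qwen.py --data_path ./data/arabic_instruction_dataset "
    "--output_dir ./models/qwen2_arabic_finetuned --num_epochs 3 --batch_size 1 "
    "--gradient_accumulation_steps 16 --learning_rate 2e-5",
    "python src/load_and_merge_model.py",
    "python src/convert_hf_to_gguf.py ./models/qwen2_arabic_merged_full "
    "--outfile ./models/qwen_arabic_merged_full.gguf",
    "./tools/llama-quantize ./models/qwen_arabic_merged_full.gguf "
    "./models/qwen_arabic_merged_full_q4_k_m.gguf q4_k_m",
    "ollama create qwen-arabic-custom -f Modelfile",
    "python src/evaluate_arabic_model.py",
    "python src/compare_qwen_models.py",
]


def run_pipeline(steps=PIPELINE_STEPS, dry_run=True):
    """Handle each step in order; return the argument lists that were processed."""
    handled = []
    for command in steps:
        argv = shlex.split(command)
        print("$", command)
        if not dry_run:
            # check=True raises CalledProcessError, which stops the pipeline.
            subprocess.run(argv, check=True)
        handled.append(argv)
    return handled
```

Calling `run_pipeline()` only prints the commands; `run_pipeline(dry_run=False)` executes them from the repository root with the virtual environment active.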
- Ensure you have sufficient disk space for the datasets and model files.
- The fine-tuning process can take several hours to days, depending on your hardware.
- Monitor GPU memory usage during fine-tuning and adjust batch size or gradient accumulation steps if necessary.
- Make sure to have Ollama installed for the model creation and evaluation steps.
- If you encounter CUDA out-of-memory errors, try reducing the batch size or increasing gradient accumulation steps.
- For any other issues, please check the error logs or open an issue in the GitHub repository.
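The memory advice above can be sketched in code: read GPU memory from `nvidia-smi`, and when memory is tight, trade per-step batch size for gradient accumulation so the effective batch size (their product) stays the same. The helper names are hypothetical; the `nvidia-smi` query flags are the standard CSV query options.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_csv(output: str):
    """Parse lines like '3500, 4096' into (used_mib, total_mib) tuples, one per GPU."""
    readings = []
    for line in output.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        readings.append((used, total))
    return readings


def gpu_memory():
    """Query nvidia-smi once; raises if the tool is missing or fails."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_csv(result.stdout)


def trade_batch_for_accumulation(batch_size: int, grad_accum_steps: int):
    """Halve the batch size and double accumulation, keeping their product fixed."""
    if batch_size < 2 or batch_size % 2:
        raise ValueError("batch size cannot be halved further")
    return batch_size // 2, grad_accum_steps * 2


# Example: 4 x 4 and 2 x 8 both give an effective batch size of 16.
print(trade_batch_for_accumulation(4, 4))
```

The fine-tuning command in this README already uses `--batch_size 1` with 16 accumulation steps, an effective batch size of 16, so there is nothing left to halve at that setting.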
This project is licensed under the GNU General Public License v3.0 (GPL-3.0).
This means:
- You can use, modify, and distribute this software.
- If you distribute modified versions, you must also distribute them under the GPL-3.0.
- You must include the original copyright notice and the license text.
- You must disclose your source code when you distribute the software.
- There's no warranty for this free software.
For more details, see the LICENSE file in this repository or the GNU GPL v3.0 text at https://www.gnu.org/licenses/gpl-3.0.html.
This project uses the following main libraries and tools:
- Transformers by Hugging Face
- PyTorch
- PEFT (Parameter-Efficient Fine-Tuning)
- Ollama
- GGUF (for model conversion)