DeBERTa-ELL: Automated Proficiency Assessment for English Language Learners

Overview

DeBERTa-ELL is a natural language processing (NLP) project that uses the DeBERTa (Decoding-enhanced BERT with Disentangled Attention) model to automatically assess the language proficiency of high school English Language Learners (ELLs) from their essays. It aims to give educators and researchers in second language acquisition and assessment a reliable, efficient, and scalable scoring tool.

Features

- Uses the DeBERTa model for text analysis
- Assesses six aspects of language proficiency:
  - Cohesion
  - Syntax
  - Vocabulary
  - Phraseology
  - Grammar
  - Conventions
- Implements multi-label stratified k-fold cross-validation for robust model evaluation
- Supports both training and inference modes
- Includes data preprocessing and augmentation techniques
- Provides detailed logging and model checkpointing
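
To illustrate the idea behind multi-label stratified k-fold splitting, here is a minimal pure-Python sketch. It is a simplified greedy stand-in, not the iterative stratification algorithm itself and not necessarily what this repository does (packages such as iterative-stratification provide a `MultilabelStratifiedKFold` class for real use). The function name `greedy_multilabel_folds`, the lowercase target names, and the default of four splits (matching the four folds reported under Performance) are assumptions made for the example.

```python
from collections import defaultdict

# Assumed column names for the six proficiency targets.
TARGETS = ["cohesion", "syntax", "vocabulary", "phraseology", "grammar", "conventions"]


def greedy_multilabel_folds(rows, n_splits=4):
    """Assign each row (a dict mapping target name -> score) to a fold index.

    Simplified greedy heuristic: each row goes to the fold that has so far
    seen the fewest rows sharing its (target, score) pairs, with ties broken
    by fold size and then by fold index.
    """
    # counts[f][(target, score)] = rows in fold f that have this score for this target
    counts = [defaultdict(int) for _ in range(n_splits)]
    sizes = [0] * n_splits
    assignment = []
    for row in rows:
        keys = [(t, row[t]) for t in TARGETS]
        best = min(
            range(n_splits),
            key=lambda f: (sum(counts[f][k] for k in keys), sizes[f], f),
        )
        for k in keys:
            counts[best][k] += 1
        sizes[best] += 1
        assignment.append(best)
    return assignment


if __name__ == "__main__":
    # Toy data: essays alternate between uniformly low and uniformly high scores.
    toy = [{t: (2.0 if i % 2 == 0 else 4.0) for t in TARGETS} for i in range(8)]
    print(greedy_multilabel_folds(toy))
```

With the toy data above, every fold ends up with one low-scoring and one high-scoring essay, which is the balance that stratification is meant to preserve across all targets at once.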

Requirements

Python 3.10+
PyTorch 2.3+
Transformers 4.37+

For a complete list of dependencies, please refer to the requirements.txt file.

Installation

Clone this repository:

```
git clone https://github.com/arnavs04/deberta-ell.git
cd deberta-ell
```

Create a virtual environment (optional but recommended):

```
python -m venv venv
source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
```

Install the required packages:
```
pip install -r requirements.txt
```

Usage

Data Preparation

The training data is already in the data/feedback-prize-english-language-learning/ directory.

Training

To train the model, run:

```
python train.py
```

You can modify the hyperparameters in the configs.py file.

Inference

To run inference on new data:

```
python inference.py
```

You can modify the hyperparameters in the configs.py file.

Model Architecture

This project uses the DeBERTa-v3-base model as the backbone for essay analysis. The model is fine-tuned on the task of multi-aspect proficiency assessment, with a custom head for multi-label regression.
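
As a rough sketch of what a custom head for multi-label regression can look like, the block below applies masked mean pooling to the backbone's token embeddings and maps the pooled vector to six scores. The class name `MeanPoolingRegressionHead`, the choice of mean pooling, and the single linear layer are assumptions for illustration; the head actually used in this repository may differ. The hidden size of 768 matches DeBERTa-v3-base.

```python
import torch
from torch import nn


class MeanPoolingRegressionHead(nn.Module):
    """Hypothetical head: masked mean pooling followed by one linear layer."""

    def __init__(self, hidden_size: int = 768, num_targets: int = 6):
        super().__init__()
        self.fc = nn.Linear(hidden_size, num_targets)

    def forward(self, last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # last_hidden_state: (batch, seq_len, hidden); attention_mask: (batch, seq_len)
        mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
        summed = (last_hidden_state * mask).sum(dim=1)  # sum over real tokens only
        counts = mask.sum(dim=1).clamp(min=1e-9)        # avoid division by zero
        pooled = summed / counts                        # (batch, hidden)
        return self.fc(pooled)                          # (batch, num_targets)


if __name__ == "__main__":
    head = MeanPoolingRegressionHead()
    hidden = torch.randn(2, 5, 768)  # stand-in for backbone output
    mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
    print(head(hidden, mask).shape)
```

In a full model, `last_hidden_state` would come from the fine-tuned backbone's output, and the six outputs would correspond to cohesion, syntax, vocabulary, phraseology, grammar, and conventions.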

Performance

The model was trained and validated with Smooth L1 loss, and the final evaluation used the Mean Column-wise Root Mean Square Error (MCRMSE). The scores for each fold are summarized below:

| Fold    | MCRMSE |
|---------|--------|
| 0       | 0.4493 |
| 1       | 0.4576 |
| 2       | 0.4663 |
| 3       | 0.4529 |
| Overall | 0.4566 |
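
For readers unfamiliar with the two quantities named above, the following pure-Python sketch shows how they are commonly defined. The function names `smooth_l1` and `mcrmse` are chosen for this example, and `beta=1.0` is an assumed default (the same default as PyTorch's `SmoothL1Loss`); the value used in this project is not stated here.

```python
import math


def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss for one prediction: quadratic near zero, linear beyond beta."""
    diff = abs(pred - target)
    if diff < beta:
        return 0.5 * diff * diff / beta
    return diff - 0.5 * beta


def mcrmse(y_true, y_pred):
    """Mean column-wise RMSE: compute the RMSE per target column, then average."""
    n_rows = len(y_true)
    n_cols = len(y_true[0])
    col_rmse = []
    for j in range(n_cols):
        mse = sum((y_true[i][j] - y_pred[i][j]) ** 2 for i in range(n_rows)) / n_rows
        col_rmse.append(math.sqrt(mse))
    return sum(col_rmse) / n_cols


if __name__ == "__main__":
    # Two essays, two target columns: the first column is predicted exactly,
    # the second is off by one point on each essay.
    y_true = [[3.0, 3.0], [3.0, 3.0]]
    y_pred = [[3.0, 4.0], [3.0, 2.0]]
    print(mcrmse(y_true, y_pred))  # 0.5
```

Because MCRMSE averages the per-column errors, a model that is weak on a single aspect (for example, conventions) is penalized even if the other aspects are predicted well.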

Contributing

Contributions to improve DeBERTa-ELL are welcome. Feel free to submit issues, fork the repository, and send pull requests.

Citation

If you use this code for your research, please cite our project:

@software{DeBERTa_ELL2024,
  author = {Arnav Samal},
  title = {DeBERTa-ELL: Automated Proficiency Assessment for English Language Learners},
  year = {2024},
  url = {https://github.com/arnavs04/deberta-ell.git}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

For any queries, please open an issue or contact [email protected].
