PII Masker is an advanced open-source tool that protects your sensitive data using state-of-the-art AI, powered by DeBERTa-v3
PII Masker detects and masks Personally Identifiable Information (PII) in text using a model built on DeBERTa-v3. It is intended for data-sensitive workflows such as handling customer data, performing data analysis, or supporting compliance with privacy regulations.
When handling sensitive information, it's crucial to use tools that not only perform well but also ensure compliance and protect privacy. Here's why PII Masker stands out:
- High Precision: Utilizes DeBERTa-v3 for accurate detection and masking of various PII types.
- Compliance Friendly: Designed to help organizations meet privacy laws and regulations.
- Flexible Integration: Offers easy integration with existing systems through a simple Python API.
- 🔒 Comprehensive Protection: Identifies and masks multiple PII types including names, addresses, phone numbers, and more
- 🚀 High Performance: Powered by DeBERTa-v3 with 1024 token support for processing longer documents
- 🎯 Precision Focused: Advanced NLP model fine-tuned specifically for PII detection
- 📊 Structured Output: Get both masked text and structured PII dictionary
- 🔄 Easy Integration: Simple Python API for seamless integration into your workflow
- Clone the repository:
    git clone https://github.com/yourusername/pii-masker.git
    cd pii-masker
- Install dependencies:
    pip install -r requirements.txt
- Download the model:
    # Option 1: Manual download
    # Visit: https://huggingface.co/collections/hydroxai/pii-models-674649fea0de7ab99ed11347
    # Place files in: pii-masker/output_model/deberta3base_1024/
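Before running the quick start, it can help to confirm that the downloaded files landed where the steps above say to place them. The following is a minimal sketch that assumes only the directory path given above. The helper name `model_dir_ready` is hypothetical and not part of PII Masker, and it does not check for specific file names because none are listed here.

```python
from pathlib import Path


def model_dir_ready(repo_root: str = ".") -> bool:
    """Return True if the model directory exists and holds at least one file.

    Hypothetical helper, not part of PII Masker. The relative path comes from
    the download step above; expected file names are not assumed.
    """
    model_dir = Path(repo_root) / "output_model" / "deberta3base_1024"
    return model_dir.is_dir() and any(p.is_file() for p in model_dir.iterdir())


if __name__ == "__main__":
    # Run from inside the pii-masker directory.
    if model_dir_ready():
        print("Model files found.")
    else:
        print("Model files missing: place them in output_model/deberta3base_1024/")
```

Checking for a non-empty directory rather than named files keeps the sketch valid whichever model from the collection is downloaded.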
- Change to the pii-masker directory:
    cd pii-masker
- Use the following code to get started:
    from model import PIIMasker

    # Initialize the PIIMasker
    masker = PIIMasker()

    # Mask PII in your text
    text = "John Doe lives at 1234 Elm St."
    masked_text, pii_dict = masker.mask_pii(text)

    print(masked_text)  # Output: "[NAME] lives at [ADDRESS]"
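To mask several texts in one pass, the call above can be wrapped in a small loop. This sketch assumes only what the quick start shows: `mask_pii(text)` returns a `(masked_text, pii_dict)` pair. The helper `mask_many` and the `EchoMasker` stand-in are hypothetical. The stand-in exists only so the snippet runs without the model, and it does not detect anything.

```python
from typing import Any, Iterable, List, Tuple


def mask_many(masker: Any, texts: Iterable[str]) -> List[Tuple[str, dict]]:
    """Call masker.mask_pii on each text and collect (masked_text, pii_dict) pairs."""
    results = []
    for text in texts:
        masked_text, pii_dict = masker.mask_pii(text)
        results.append((masked_text, pii_dict))
    return results


class EchoMasker:
    """Hypothetical stand-in with the same call shape as PIIMasker.mask_pii."""

    def mask_pii(self, text: str) -> Tuple[str, dict]:
        # No detection happens here: the text is returned unchanged.
        return text, {}


# With the real model, replace EchoMasker() with:
#   from model import PIIMasker
#   masker = PIIMasker()
masker = EchoMasker()
for masked_text, pii_dict in mask_many(masker, ["John Doe lives at 1234 Elm St.", "No PII here."]):
    print(masked_text, pii_dict)
```

Creating the masker once and reusing it for every text avoids reloading the model for each call.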
PII Masker processes text in five stages, powered by DeBERTa-v3:
- Tokenization → Smart text splitting for optimal processing
- Model Inference → AI-powered PII detection
- Entity Recognition → Precise identification of sensitive data
- Masking → Secure replacement of PII with placeholders
- Data Extraction → Structured output for further processing
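The masking and data-extraction steps above can be illustrated without the model. The sketch below is not PII Masker's implementation: two simple regular expressions stand in for the DeBERTa-v3 inference and entity-recognition steps, and the function names, the labels (`EMAIL`, `PHONE`), and the dictionary layout are assumptions made for illustration. It shows how detected character spans can be turned into masked text plus a structured dictionary.

```python
import re
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, label)

# Stand-in detector (assumption): the real tool uses a fine-tuned model instead.
_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def detect_spans(text: str) -> List[Span]:
    """Return labeled character spans, sorted by position."""
    spans = [
        (m.start(), m.end(), label)
        for label, pattern in _PATTERNS.items()
        for m in pattern.finditer(text)
    ]
    return sorted(spans)


def mask_spans(text: str, spans: List[Span]) -> Tuple[str, Dict[str, List[str]]]:
    """Replace each span with a [LABEL] placeholder and collect the originals.

    Spans are assumed not to overlap.
    """
    pii: Dict[str, List[str]] = {}
    masked = text
    # Work right to left so offsets of earlier spans stay valid.
    for start, end, label in sorted(spans, reverse=True):
        pii.setdefault(label, []).insert(0, text[start:end])
        masked = masked[:start] + f"[{label}]" + masked[end:]
    return masked, pii


text = "Contact jane@example.com or 555-123-4567."
masked_text, pii_dict = mask_spans(text, detect_spans(text))
print(masked_text)  # Contact [EMAIL] or [PHONE].
print(pii_dict)
```

Keeping detection and masking as separate functions mirrors the pipeline stages: any detector that yields `(start, end, label)` spans could be swapped in without changing the masking step.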
We are excited to announce a significant addition to the PII Masker project: a new model with a different approach from DeBERTa. Here are the details:
- 🌟 Model Link: hydroxai/pii_model_longtransfomer_version
- Model detail: train_pii_longtransformer.ipynb
- 🔧 Performance Improvement: This new model implementation has resulted in approximately a 4% improvement in performance compared to the previous DeBERTa-v3 model. The combination of Longformer's extended sequence length (4096 tokens) and the Bi-LSTM head enhances the sequential context understanding, making PII detection more accurate and reliable.
We are committed to continuously enhancing PII Masker to meet evolving data privacy needs. Over the next two weeks, we plan to expand the scope of PII detection to include text and video data, ensuring comprehensive coverage for sensitive information across multiple media formats.
- Text Data:
  - Improved detection of PII in longer and more complex documents.
  - Support for additional entity types, such as financial information and medical records.
- Video Data:
  - Integration of OCR (Optical Character Recognition) for extracting text from video frames.
  - Advanced video frame analysis to identify and mask PII directly in video content.
These updates aim to make PII Masker more versatile, covering broader use cases while maintaining the precision and reliability our users trust. Stay tuned for more details in our upcoming releases!
Contributions make the open-source community an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
- Fork the Project
- Create your Feature Branch
- Commit your Changes (git commit -m 'Add some AmazingFeature')
- Push to the Branch (git push origin feature/AmazingFeature)
- Open a Pull Request
Special thanks to:
- Microsoft for the DeBERTa model
- Hugging Face for model hosting and transformers library
- Zilliz for their support and Milvus, the vector database powering our solution
- All our contributors and supporters
Made with ❤️ for the privacy-conscious developer community