
PII Masker is an open-source tool for protecting sensitive data by automatically detecting and masking PII using advanced AI, powered by DeBERTa-v3. It provides high-precision detection, scalable performance, and a simple Python API for seamless integration into workflows, ensuring privacy compliance in various industries.


HydroXai/pii-masker


PII Masker is an advanced open-source tool that protects your sensitive data using state-of-the-art AI, powered by DeBERTa-v3

License: MIT • Python: 3.8+ • Milvus • Hugging Face

Features • Installation • Quick Start • How It Works • Contributing

PII Masker is an open-source tool that protects sensitive data by detecting and masking Personally Identifiable Information (PII) with a model built on DeBERTa-v3. It is designed for data-sensitive workflows such as handling customer data, performing data analysis, or meeting privacy regulations, and it aims to be robust and scalable in each of them.

Why Choose PII Masker?

When handling sensitive information, it's crucial to use tools that not only perform well but also ensure compliance and protect privacy. Here's why PII Masker stands out:

  • High Precision: Utilizes DeBERTa-v3 for accurate detection and masking of various PII types.
  • Compliance Friendly: Designed to help organizations meet privacy laws and regulations.
  • Flexible Integration: Offers easy integration with existing systems through a simple Python API.

✨ Key Features

  • 🔒 Comprehensive Protection: Identifies and masks multiple PII types including names, addresses, phone numbers, and more
  • 🚀 High Performance: Powered by DeBERTa-v3 with support for inputs of up to 1024 tokens, so longer documents can be processed
  • 🎯 Precision Focused: Advanced NLP model fine-tuned specifically for PII detection
  • 📊 Structured Output: Returns both the masked text and a structured PII dictionary
  • 🔄 Easy Integration: Simple Python API for seamless integration into your workflow

📦 Installation

  1. Clone the repository:
git clone https://github.com/HydroXai/pii-masker.git
cd pii-masker
  2. Install dependencies:
pip install -r requirements.txt
  3. Download the model:
# Option 1: Manual download
# Visit: https://huggingface.co/collections/hydroxai/pii-models-674649fea0de7ab99ed11347
# Place files in: pii-masker/output_model/deberta3base_1024/
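Before moving on, it can help to confirm that the manual download landed in the expected place. The following is a minimal sketch, not part of the repository: the helper name model_dir_ready is hypothetical, and it only checks that the directory exists and contains at least one file, since the exact file names the model needs are not listed here.

```python
from pathlib import Path

# Path taken from the download step above, relative to the directory
# that contains the cloned repository.
MODEL_DIR = Path("pii-masker/output_model/deberta3base_1024")


def model_dir_ready(model_dir: Path = MODEL_DIR) -> bool:
    """Return True if model_dir exists and holds at least one file.

    Hypothetical helper: it does not validate which files are present.
    """
    return model_dir.is_dir() and any(p.is_file() for p in model_dir.iterdir())


if __name__ == "__main__":
    status = "looks populated" if model_dir_ready() else "is missing or empty"
    print(f"{MODEL_DIR} {status}")
```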

🚀 Quick Start

  1. Change to the pii-masker directory:
    cd pii-masker
    
  2. Use the following code to get started:
    from model import PIIMasker
     
    # Initialize the PIIMasker
    masker = PIIMasker()
     
    # Mask PII in your text
    text = "John Doe lives at 1234 Elm St."
    masked_text, pii_dict = masker.mask_pii(text)
     
    print(masked_text)
    # Output: "[NAME] lives at [ADDRESS]"
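To process several texts, the same call can be wrapped in a small loop. The sketch below assumes only what the Quick Start shows, namely that mask_pii(text) returns a (masked_text, pii_dict) pair. The Masker protocol and the mask_many helper are hypothetical names introduced for illustration, not part of the repository.

```python
from typing import Any, Iterable, List, Protocol, Tuple


class Masker(Protocol):
    """Anything exposing mask_pii(text) -> (masked_text, pii_dict)."""

    def mask_pii(self, text: str) -> Tuple[str, Any]: ...


def mask_many(masker: Masker, texts: Iterable[str]) -> List[Tuple[str, Any]]:
    """Apply masker.mask_pii to each text and collect the results in order."""
    results = []
    for text in texts:
        masked_text, pii_dict = masker.mask_pii(text)
        results.append((masked_text, pii_dict))
    return results
```

With the repository set up as described, this would be called as mask_many(PIIMasker(), ["first text", "second text"]). Passing the masker in as an argument means the model is loaded once and reused for every text.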

🔍 How It Works

PII Masker employs a sophisticated pipeline powered by DeBERTa-v3:

  1. Tokenization → Smart text splitting for optimal processing
  2. Model Inference → AI-powered PII detection
  3. Entity Recognition → Precise identification of sensitive data
  4. Masking → Secure replacement of PII with placeholders
  5. Data Extraction → Structured output for further processing
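The last two steps, masking and data extraction, can be illustrated without the model. The sketch below is not the repository's implementation. It swaps the DeBERTa-v3 detector (steps 1 to 3) for two simple regular expressions so that it runs on its own, and it assumes a dictionary that maps each label to the list of matched strings. The names detect_spans and mask_spans and the PHONE and EMAIL labels are hypothetical.

```python
import re
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, label)

# Stand-in for steps 1-3: regexes instead of model inference.
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def detect_spans(text: str) -> List[Span]:
    """Return character spans of detected entities, sorted by position."""
    spans = [(m.start(), m.end(), "PHONE") for m in PHONE_RE.finditer(text)]
    spans += [(m.start(), m.end(), "EMAIL") for m in EMAIL_RE.finditer(text)]
    return sorted(spans)


def mask_spans(text: str, spans: List[Span]) -> Tuple[str, Dict[str, List[str]]]:
    """Step 4: replace each span with [LABEL]. Step 5: collect the originals."""
    pieces: List[str] = []
    found: Dict[str, List[str]] = {}
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            continue  # skip a span that overlaps one already masked
        pieces.append(text[cursor:start])
        pieces.append(f"[{label}]")
        found.setdefault(label, []).append(text[start:end])
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), found


if __name__ == "__main__":
    sample = "Call 555-123-4567 or mail a@b.com today."
    masked, pii = mask_spans(sample, detect_spans(sample))
    print(masked)
    print(pii)
```

Working from character offsets rather than tokens keeps the replacement step independent of the tokenizer, which is one common way to structure such a pipeline.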

🆕 Latest Updates

We have added a second model to the PII Masker project. It takes a different approach from DeBERTa, pairing Longformer with a Bi-LSTM head. Here are the details:

  • 🌟 Model link: hydroxai/pii_model_longtransfomer_version

  • Model details: train_pii_longtransformer.ipynb

  • 🔧 Performance improvement: The new model performs approximately 4% better than the previous DeBERTa-v3 model. Longformer's extended sequence length (4096 tokens), combined with the Bi-LSTM head, improves sequential context understanding, which makes PII detection more accurate and reliable.

🛠️ Advanced Usage

Check out our detailed examples:

🗓️ Future Updates

We are committed to continuously enhancing PII Masker to meet evolving data privacy needs. Over the next two weeks, we plan to expand the scope of PII detection to include text and video data, ensuring comprehensive coverage for sensitive information across multiple media formats.

Planned Features:

  • Text Data:

    • Improved detection of PII in longer and more complex documents.
    • Support for additional entity types, such as financial information and medical records.
  • Video Data:

    • Integration of OCR (Optical Character Recognition) for extracting text from video frames.
    • Advanced video frame analysis to identify and mask PII directly in video content.

These updates aim to make PII Masker more versatile, covering broader use cases while maintaining the precision and reliability our users trust. Stay tuned for more details in our upcoming releases!

🤝 Contributing

Contributions make the open-source community an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

🙏 Acknowledgments

Special thanks to:

  • Microsoft for the DeBERTa model
  • Hugging Face for model hosting and transformers library
  • Zilliz for their support and Milvus, the vector database powering our solution
  • All our contributors and supporters

Made with ❤️ for the privacy-conscious developer community
