kkodoo/privacy (forked from tensorflow/privacy)

Library for training machine learning models with privacy for training data
Learning private models with multiple teachers

This repository contains code to create a setup for learning privacy-preserving student models by transferring knowledge from an ensemble of teachers trained on disjoint subsets of the data for which privacy guarantees are to be provided.

Knowledge acquired by teachers is transferred to the student in a differentially private manner by noisily aggregating the teacher decisions before feeding them to the student during training.

The paper describing the approach is arXiv:1610.05755

Dependencies

This model uses TensorFlow to perform the numerical computations associated with machine learning models, as well as common Python libraries such as numpy, scipy, and six. Instructions for installing these can be found in their respective documentation.

How to run

This repository supports the MNIST and SVHN datasets. The following instructions are given for MNIST but can easily be adapted by replacing the flag --dataset=mnist with --dataset=svhn. There are two steps: teacher training and student training. Data is downloaded automatically when you start teacher training.

The following is a two-step process: first, we train an ensemble of teacher models; second, we train a student using predictions made by this ensemble.

Training the teachers: first, run train_teachers.py with at least three flags specifying (1) the number of teachers, (2) the ID of the teacher you are training within this ensemble, and (3) the dataset to train on. For instance, to train teacher number 10 in an ensemble of 100 teachers for MNIST, use the following command:

python train_teachers.py --nb_teachers=100 --teacher_id=10 --dataset=mnist

The optional flags train_dir and data_dir specify, respectively, where model checkpoints and temporary data (like the dataset) should be saved. The flag max_steps (default: 3000) controls the length of training. See train_teachers.py and deep_cnn.py for the available flags and their descriptions.

Training the student: once all teachers are trained (e.g., teachers with IDs 0 to 99 for nb_teachers=100), we are ready to train the student. The student is trained by labeling some of the test data with predictions from the teachers. The predictions are aggregated by counting the votes for each class across the ensemble of teachers, adding Laplacian noise to these vote counts, and assigning the sample the label with the maximum noisy vote count. This is implemented in the noisy_max function in aggregation.py. To train the student, use the following command:

python train_student.py --nb_teachers=100 --dataset=mnist --stdnt_share=5000
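The noisy aggregation at the heart of student training can be sketched as follows. This is a minimal illustration of the mechanism described above, not the repository's noisy_max implementation; the function signature and the lap_scale parameter name are assumptions here.

```python
import numpy as np

def noisy_max(votes, lap_scale, rng=None):
    """Return the label whose Laplace-noised vote count is largest.

    votes: 1-D array of per-class vote counts from the teacher ensemble.
    lap_scale: scale of the Laplace noise added to each count.
    """
    rng = rng or np.random.default_rng()
    noisy_counts = votes + rng.laplace(loc=0.0, scale=lap_scale, size=votes.shape)
    return int(np.argmax(noisy_counts))

# 250 teachers voting over 10 classes: class 3 wins by a wide margin,
# so even after adding noise of scale 20 the returned label is almost surely 3.
votes = np.zeros(10)
votes[3] = 245.0
votes[7] = 5.0
label = noisy_max(votes, lap_scale=20.0, rng=np.random.default_rng(0))
```

Larger noise scales give stronger privacy but make the aggregated labels less reliable, which is the trade-off the privacy analysis below quantifies.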

The flag --stdnt_share=5000 indicates that the student may use the first 5000 samples of the dataset's test subset as unlabeled training points (they will be labeled using the teacher predictions). The remaining samples are used to evaluate the student's accuracy, which is displayed upon completion of training.
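The effect of --stdnt_share can be sketched with a stand-in array in place of the real test images (the 10,000-example size matches MNIST's test set; all names below are illustrative):

```python
import numpy as np

# Stand-in for the 10,000-example MNIST test set; --stdnt_share=5000
# splits it into unlabeled student training points and a held-out
# evaluation set, as described above.
stdnt_share = 5000
test_data = np.arange(10000)

stdnt_train = test_data[:stdnt_share]  # to be labeled by noisy teacher votes
stdnt_eval = test_data[stdnt_share:]   # used to report the student's accuracy
```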

Using semi-supervised GANs to train the student

In the paper, we describe how to train the student in a semi-supervised fashion using Generative Adversarial Networks. This can be reproduced for MNIST by cloning the improved-gan repository, adding its mnist_svhn_cifar10 directory to your PATH variable, and then running the shell script train_student_mnist_250_lap_20_count_50_epochs_600.sh:

export PATH="/path/to/improved-gan/mnist_svhn_cifar10":$PATH
sh train_student_mnist_250_lap_20_count_50_epochs_600.sh

Alternative deeper convolutional architecture

Note that a deeper convolutional model is available. Both the default and the deeper model graphs are defined in deep_cnn.py, by the functions inference and inference_deeper respectively. Use the flag --deeper=true when launching train_teachers.py and train_student.py to switch to the deeper model.

Privacy analysis

In the paper, we detail how data-dependent differential privacy bounds can be computed to estimate the cost of training the student. To reproduce the bounds given in the paper, we include the labels predicted by our two teacher ensembles (MNIST and SVHN). You can run the privacy analysis for each dataset with the following commands:

python analysis.py --counts_file=mnist_250_teachers_labels.npy --indices_file=mnist_250_teachers_100_indices_used_by_student.npy

python analysis.py --counts_file=svhn_250_teachers_labels.npy --max_examples=1000 --delta=1e-6
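For intuition about what such an analysis bounds, here is a naive, data-independent estimate using basic composition of the Laplace mechanism: answering one noisy-max query with Laplace noise of scale 1/gamma on the vote histogram is (2*gamma)-differentially private, and basic composition simply sums the budgets over queries. This sketch, including the gamma parameterization, is only an assumption-laden baseline; the data-dependent bounds computed by analysis.py are much tighter.

```python
def naive_composition_epsilon(n_queries, gamma):
    """Loose, data-independent privacy bound for noisy-max aggregation.

    Each query with Laplace noise Lap(1/gamma) on the vote counts is
    (2*gamma)-DP, and basic composition over n_queries sums the budgets.
    """
    return 2.0 * gamma * n_queries

# 100 student queries at gamma = 0.05 give a naive budget of epsilon = 10.
eps = naive_composition_epsilon(n_queries=100, gamma=0.05)
```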

To expedite experimentation with the privacy analysis of student training, analysis.py is configured to download the labels produced by 250 teacher models for MNIST and SVHN when you run the two commands above. These 250 teacher models were trained using the following commands, where XXX takes values between 0 and 249:

python train_teachers.py --nb_teachers=250 --teacher_id=XXX --dataset=mnist
python train_teachers.py --nb_teachers=250 --teacher_id=XXX --dataset=svhn
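Launching 250 training jobs by hand is tedious, so a small helper can generate the command lines from the pattern above. This is only a string-building sketch (it does not launch the jobs), and the helper name is hypothetical:

```python
def teacher_commands(dataset, nb_teachers=250):
    """Build one train_teachers.py command line per teacher ID."""
    return [
        f"python train_teachers.py --nb_teachers={nb_teachers} "
        f"--teacher_id={i} --dataset={dataset}"
        for i in range(nb_teachers)
    ]

# One command per teacher ID, 0 through 249, for the MNIST ensemble.
cmds = teacher_commands("mnist")
```

The resulting strings can then be dispatched to whatever job scheduler is available.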

Note that these labels may also be used in lieu of the ensemble_preds function in train_student.py to compare the performance of alternative student model architectures and learning techniques. This facilitates future work by removing the need to train the MNIST and SVHN teacher ensembles when proposing new student training approaches.
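For example, a comparison run could aggregate the released vote counts without noise. The array below stands in for the downloaded .npy file; its [n_samples, n_classes] shape is an assumption about the released format:

```python
import numpy as np

# Stand-in for np.load("mnist_250_teachers_labels.npy"): one row of
# per-class vote counts per student sample (shape assumed to be
# [n_samples, n_classes]).
counts = np.array([[240, 5, 5],
                   [10, 230, 10]])

# Noiseless aggregation: take the plurality vote for each sample.
labels = counts.argmax(axis=1)
```

Swapping this in for the live teacher ensemble lets new student architectures be benchmarked against identical aggregated labels.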

Contact

To ask questions, please email [email protected] or open an issue on the tensorflow/models issues tracker. Please assign issues to @npapernot.

TensorFlow Privacy

This repository contains the source code for TensorFlow Privacy, a Python library that includes implementations of TensorFlow optimizers for training machine learning models with differential privacy. The library comes with tutorials and analysis tools for computing the privacy guarantees provided.

The TensorFlow Privacy library is under continual development and always welcomes contributions. In particular, help resolving the currently open issues is appreciated.

Setting up TensorFlow Privacy

Dependencies

This library uses TensorFlow to define machine learning models, so installing TensorFlow is a prerequisite; see the TensorFlow installation documentation. For better performance, it is also recommended to install TensorFlow with GPU support (detailed instructions are available in the same documentation).

In addition to TensorFlow and its dependencies, other prerequisites are:

  • scipy >= 0.17

  • mpmath (for testing)

Installing TensorFlow Privacy

First, clone this GitHub repository into a directory of your choice:

git clone https://github.com/tensorflow/privacy

You can then install the local package in "editable" mode in order to add it to your PYTHONPATH:

cd privacy
pip install -e .

If you'd like to make contributions, we recommend first forking the repository and then cloning your fork rather than cloning this repository directly.

Contributing

Contributions are welcome! Bug fixes and new features can be initiated through GitHub pull requests. To speed up the code review process, we ask that:

  • When making code contributions to TensorFlow Privacy, you follow the PEP8-with-two-spaces coding style (the same one used by TensorFlow) in your pull requests. In most cases, this can be done by running autopep8 -i --indent-size 2 <file> on the files you have edited.

  • When making your first pull request, you sign the Google CLA.

  • You do not add git submodules; we do not accept pull requests that add them because of the maintenance problems they cause.

Tutorials directory

To help you get started with the functionalities provided by this library, the tutorials/ folder comes with scripts demonstrating how to use the library features.

NOTE: the tutorials are maintained carefully. However, they are not considered part of the API and can change at any time without warning. You should not write third-party code that imports the tutorials and expect the interface to remain stable.

Research directory

This folder contains code to reproduce results from research papers related to privacy in machine learning. It is not maintained as carefully as the tutorials directory, but rather intended as a convenient archive.

Remarks

The content of this repository supersedes the corresponding folder in the tensorflow/models repository.

Contacts

If you have any questions that cannot be addressed by raising an issue, feel free to contact:

  • Galen Andrew (@galenmandrew)
  • Steve Chien (@schien1729)
  • Nicolas Papernot (@npapernot)

Copyright

Copyright 2019 Google LLC
