The idea behind sampling is that we want to draw samples from a probability distribution, but we only have the equation of the distribution's Probability Density Function (PDF), rather than an algorithm for actually drawing samples.
Rejection sampling is one of the simplest methods for doing this. Rather than directly drawing samples from the target distribution $p$ (which is exactly what we do not know how to do), we draw samples from a proposal distribution $q$ that we can easily sample from, and whose density lies at or above the target's everywhere, i.e. $q(x) \geq p(x)$. The procedure is:

- Draw a sample from the proposal distribution, $x \sim q$.
- Choose a point $y$ uniformly at random in the interval $[0, q(x)]$.
- If $y < p(x)$, then accept $x$ as a sample. Otherwise, reject $x$ and start over from step 1.
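To make these steps concrete, here is a minimal sketch in Python, assuming a standard-normal target and a $N(0, 2^2)$ proposal whose density is scaled by a factor of 2 so that it lies above the target everywhere (all function names here are just for illustration):

import numpy as np

def target_pdf(x):
    # standard normal density
    return np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)

def proposal_pdf(x):
    # 2 * N(x; 0, 2^2), which is >= target_pdf(x) for every x
    return 2 * np.exp(-x ** 2 / 8.0) / np.sqrt(2 * np.pi * 4)

def draw_one(rng):
    while True:
        x = rng.normal(0, 2)                 # step 1: sample from the proposal
        y = rng.uniform(0, proposal_pdf(x))  # step 2: uniform height in [0, q(x)]
        if y < target_pdf(x):                # step 3: accept, otherwise start over
            return x

rng = np.random.default_rng(0)
samples = [draw_one(rng) for _ in range(1000)]

Roughly half of the proposals are accepted here, because the area under the scaled proposal is twice the area under the target.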
TODO: Give more of an intuition for how this works.
There are many other sampling methods to choose from, but the design patterns behind their implementations are all fairly similar, so we will focus on rejection sampling here.
The file sampler.py
contains the basic code for implementing a
sampler. On initialization, it takes functions to sample from the
proposal distribution and to compute the log-PDF values for both the
proposal and target distributions.
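While sampler.py itself is not reproduced here, its constructor presumably looks something like the following sketch; the argument and attribute names are my own and may not match the actual file:

class RejectionSampler(object):

    def __init__(self, propose_func, proposal_logpdf, target_logpdf):
        # function that draws a single sample from the proposal distribution
        self.propose_func = propose_func
        # function computing the log-PDF of the proposal distribution at a point
        self.proposal_logpdf = proposal_logpdf
        # function computing the (possibly unnormalized) log-PDF of the target
        self.target_logpdf = target_logpdf
        # storage for accepted samples, filled in by sample()
        self.samples = None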
TODO: discuss the design decision of having RejectionSampler take the functions, and then requiring users to instantiate it directly, rather than subclassing it and having users write the functions as methods of the subclass.
TODO: include discussion about using import numpy as np
rather than
import numpy
.
TODO: include discussion about variable names (descriptive vs math?)
Why the log-PDF? When working with sampling methods, it is almost always a good idea to work in "log-space", meaning that your functions should always return log probabilities rather than probabilities. This is because probabilities can get very small very quickly, resulting in underflow errors.
To motivate this, consider that probabilities must range between 0 and
1 (inclusive). NumPy has a useful function, finfo
, that will tell us
the limits of floating point values for our system. For example, on a
64-bit machine, we see that the smallest usable positive number (given
by tiny
) is:
>>> import numpy as np
>>> np.finfo(float).tiny
2.2250738585072014e-308
While that may seem very small, it is not unusual to encounter probabilities of this magnitude, or even smaller! Moreover, it is a common operation to multiply probabilities, yet if we try to do this with very small probabilities, we encounter underflow problems:
>>> tiny = np.finfo(float).tiny
>>> tiny * tiny
0.0
However, taking the log can help alleviate this issue for two reasons.
First, "log-space" ranges from min
value returned by finfo
to zero. Yet,
the log of the smallest positive value is much larger than that! So,
we have greatly expaned our range of representable numbers:
>>> np.finfo(float).min
-1.7976931348623157e+308
>>> np.log(tiny)
-708.39641853226408
Second, we can perform multiplication in log-space using addition, because of the identity $\log(x \cdot y) = \log(x) + \log(y)$:
>>> np.log(tiny) + np.log(tiny)
-1416.7928370645282
Of course, this solution is not a magic bullet. If we need to bring the number out of log-space (for example, to add probabilities, rather than multiply them), then we are back to the issue of underflow:
>>> tiny*tiny
0.0
>>> np.exp(np.log(tiny) + np.log(tiny))
0.0
The RejectionSampler
class uses two methods to draw samples:
sample
and draw
. The first of these, sample
, is the main outer
sampling loop: it takes one argument n
, which is the number of
desired samples, allocates the storage for these samples, and calls
draw
a total of n
times to collect these samples. The draw
function is the actual sampling logic, following the three steps
described in the previous section.
Note that in draw, we need to exponentiate the log-PDF values returned by the proposal and target functions, since the accept/reject comparison operates on actual probabilities. If the proposed value is rejected, we continue to try a different value of $x$ until one is accepted.
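Putting the description together, a hypothetical version of these two methods might look like the following; the real sampler.py may differ in its details:

import numpy as np

class RejectionSampler(object):
    # (constructor as sketched earlier)

    def sample(self, n):
        # allocate storage for n samples, then fill it by calling draw n times
        self.samples = np.empty(n)
        for i in range(n):
            self.samples[i] = self.draw()
        return self.samples

    def draw(self):
        while True:
            x = self.propose_func()              # step 1: propose a value
            q = np.exp(self.proposal_logpdf(x))  # exponentiate out of log-space
            p = np.exp(self.target_logpdf(x))
            y = np.random.uniform(0, q)          # step 2: uniform height in [0, q(x)]
            if y < p:                            # step 3: accept, or try again
                return x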
We also include a plot method, which lets us visualize the proposal distribution, the target distribution, and a histogram of the samples. Methods like this are important because they give us an easy way to glance at the samples and make sure they actually match the target distribution.
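As an illustration, a standalone helper with the same purpose might look like this, assuming a sampler object with the attributes sketched above and log-PDF functions that accept NumPy arrays (the actual plot method in sampler.py may be organized differently):

import numpy as np
import matplotlib.pyplot as plt

def plot_sampler(sampler, xmin=-10, xmax=10):
    x = np.linspace(xmin, xmax, 1000)
    # exponentiate the log-PDFs to plot actual densities
    plt.plot(x, np.exp(sampler.proposal_logpdf(x)), label="proposal")
    plt.plot(x, np.exp(sampler.target_logpdf(x)), label="target")
    # normalized histogram of the accepted samples
    plt.hist(sampler.samples, bins=50, density=True, alpha=0.5, label="samples")
    plt.legend()
    plt.xlabel("x")
    plt.show()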
The
IPython notebook
uses RejectionSampler
for the specific application of sampling from
a mixture of Gaussians. A Gaussian distribution is given by the
following equation:

$$p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

where $\mu$ is the mean and $\sigma^2$ is the variance. A mixture of Gaussians is simply a weighted sum of such densities, with the weights summing to one.
Note: in practice, sampling from a Gaussian distribution is fairly easy, and most statistics packages come with methods for doing so. We are using a mixture of Gaussians here mostly just for illustration purposes.
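For reference, a hypothetical setup along those lines might look like the following; the notebook's actual parameters and function names may differ. The proposal is $N(0, 3^2)$, and its log-PDF is shifted by $\log 2$ so that the resulting envelope lies above the target density everywhere:

import numpy as np

def norm_logpdf(x, mu, sigma):
    # log of the Gaussian density N(x; mu, sigma^2)
    return -0.5 * np.log(2 * np.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def target_logpdf(x):
    # log of 0.5*N(x; -1, 1) + 0.5*N(x; 1, 1), computed entirely in log-space
    return np.logaddexp(np.log(0.5) + norm_logpdf(x, -1, 1),
                        np.log(0.5) + norm_logpdf(x, 1, 1))

def proposal_logpdf(x):
    return np.log(2) + norm_logpdf(x, 0, 3)

def propose_func():
    return np.random.normal(0, 3)

# using the RejectionSampler interface sketched earlier
sampler = RejectionSampler(propose_func, proposal_logpdf, target_logpdf)
samples = sampler.sample(10000)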
TODO: include multi-dimensional example
TODO: include discussion about where to include visualization code
There are several ways that the basic sampler described here could be extended. Depending on the application, some of these extensions might be crucially important.
The sampling loop could run multiple calls to draw
in parallel. Some
samplers (not rejection sampling) depend on samples being drawn
sequentially, though, so this won't always be applicable.
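One rough way to do this for rejection sampling, assuming a top-level draw_batch function (so that it can be pickled) and the draw_one sketch from earlier, is to give each worker its own seeded generator; otherwise every process would produce identical "random" samples:

import numpy as np
from concurrent.futures import ProcessPoolExecutor

def draw_batch(seed, n):
    # each worker gets its own RNG, seeded differently
    rng = np.random.default_rng(seed)
    return [draw_one(rng) for _ in range(n)]

def sample_parallel(n, workers=4):
    per_worker = n // workers
    with ProcessPoolExecutor(max_workers=workers) as pool:
        batches = pool.map(draw_batch, range(workers), [per_worker] * workers)
    return np.concatenate([np.asarray(b) for b in batches])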
Function calls in Python have a high overhead. However, based on the current implementation, we require at least seven Python calls per sample (more, if the proposal and target functions themselves make Python calls). Even for simple target distributions, this overhead adds up quickly when drawing many samples.
One solution is to rewrite the code for draw
(and possibly even
sample
) using an extension language like Cython or numba, or even a
lower level language like C or FORTRAN.
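As a sketch of the numba route, the inner loop below hard-codes the standard-normal target and scaled $N(0, 2^2)$ envelope from earlier, trading away the flexibility of passing in arbitrary Python functions in exchange for a compiled loop:

import numpy as np
from numba import njit

@njit
def draw_fast(n):
    samples = np.empty(n)
    for i in range(n):
        while True:
            x = np.random.normal(0.0, 2.0)                      # propose
            envelope = 2 * np.exp(-x * x / 8.0) / np.sqrt(2 * np.pi * 4)
            y = np.random.uniform(0.0, envelope)                # uniform height
            if y < np.exp(-0.5 * x * x) / np.sqrt(2 * np.pi):   # accept/reject
                samples[i] = x
                break
    return samples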