Quantile Function of Mixture Distributions in Python
Finite mixtures are just convex combinations of two or more component distributions. Given weights $w_1, \dots, w_n$ and component distributions $P_i$ or densities $p_i$, the distribution and density functions are easy to compute, $$ F(x) = \sum_{i=1}^n w_i P_i(x) $$ $$ f(x) = \sum_{i=1}^n w_i p_i(x) $$ but the inverse distribution $F^{-1}(p)$ or quantile function is not so straightforward. Often, there is no closed-form solution. But fortunately this can be calculated easily by just solving $F(x) = p$ numerically for $x$ (for such families such as Gaussian mixtures, there are better ways to do this).
The following Python code does this and provides an interface similar to other the scipy.stats
probability distributions.
import functools
import numpy as np
from scipy.optimize import root_scalar
def _vectorize_float(f):
vectorized = np.vectorize(f, otypes=[float], signature="(),()->()")
@functools.wraps(f)
def wrapper(*args):
return vectorized(*args)
return wrapper
class MixtureDistribution:
def __init__(self, distributions, weights):
self._distributions = list(distributions)
self._weights = list(weights)
if not (all(w >= 0 for w in self._weights) and sum(self._weights) == 1):
raise ValueError("Invalid weight vector.")
if len(self._distributions) != len(self._weights):
raise ValueError("Mixtures and weights must have the same length.")
if len(self._distributions) < 2:
raise ValueError("Must have at least two component distributions.")
@_vectorize_float
def pdf(self, x):
return sum(w * d.pdf(x) for w, d in zip(self._weights, self._distributions))
@_vectorize_float
def cdf(self, x):
return sum(w * d.cdf(x) for w, d in zip(self._weights, self._distributions))
@_vectorize_float
def ppf(self, p):
bracket = [min(dist.ppf(p) for dist in self._distributions),
max(dist.ppf(p) for dist in self._distributions)]
r = root_scalar(
f=lambda x: self.cdf(x) - p,
fprime=self.pdf,
bracket=bracket,
x0=0
)
assert r.converged
return r.root
The percentage-point function inverts the cdf using the standard root_scalar
function. As a cdf is monotonic, the bracketing interval must be bounded by the mininum and maximum of the quantitle function across all the component distributions. (The _vectorize_float
bit isn't conceptually important: this just allows my functions to be called with numpy arrays; for some reason, numpy's vectorize
function cannot be used as a decorator.)
Then we can create and call the mixture distribution as follows,
from scipy.stats import norm
m = MixtureDistribution([norm(-2), norm(-2)], [1/2, 1/2])
m.cdf(0) == 0.5
>> True
which makes sense.