Stochastic Extension for DuckDB by Query.Farm
The stochastic
extension adds comprehensive statistical distribution functions to DuckDB, enabling advanced statistical analysis, probability calculations, and random sampling directly within SQL queries.
stochastic
is a DuckDB Community Extension.
You can install and use it in DuckDB SQL:
INSTALL stochastic FROM community;
LOAD stochastic;
Statistical distributions are mathematical functions that describe the probability of different outcomes in a dataset. They are fundamental to statistics, data science, machine learning, and scientific computing. This extension provides functions to:
- Calculate probability density and mass functions (PDF/PMF)
- Compute cumulative distribution functions (CDF)
- Generate quantiles (inverse CDF)
- Sample random values from distributions
- Access distribution properties (mean, variance, etc.)
The extension supports a comprehensive set of probability distributions:
- Beta -
dist_beta_*
functions - Cauchy -
dist_cauchy_*
functions - Chi-squared -
dist_chi_squared_*
functions - Exponential -
dist_exponential_*
functions - Extreme Value -
dist_extreme_value_*
functions - Fisher F -
dist_fisher_f_*
functions - Gamma -
dist_gamma_*
functions - Log-normal -
dist_lognormal_*
functions - Logistic -
dist_logistic_*
functions - Normal (Gaussian) -
dist_normal_*
functions - Pareto -
dist_pareto_*
functions - Rayleigh -
dist_rayleigh_*
functions - Student's t -
dist_students_t_*
functions - Uniform (Real) -
dist_uniform_real_*
functions - Weibull -
dist_weibull_*
functions
- Bernoulli -
dist_bernoulli_*
functions - Binomial -
dist_binomial_*
functions - Negative Binomial -
dist_negative_binomial_*
functions - Poisson -
dist_poisson_*
functions - Uniform (Integer) -
dist_uniform_int_*
functions
Each distribution provides the following function types:
dist_{distribution}_sample(params...)
- Generate random samples
dist_{distribution}_pdf(params..., x)
- Probability density functiondist_{distribution}_log_pdf(params..., x)
- Log probability density function
dist_{distribution}_cdf(params..., x)
- Cumulative distribution functiondist_{distribution}_log_cdf(params..., x)
- Log cumulative distribution functiondist_{distribution}_cdf_complement(params..., x)
- Survival function (1 - CDF)dist_{distribution}_log_cdf_complement(params..., x)
- Log survival function
dist_{distribution}_quantile(params..., p)
- Quantile function (inverse CDF)dist_{distribution}_quantile_complement(params..., p)
- Complementary quantile function
dist_{distribution}_hazard(params..., x)
- Hazard functiondist_{distribution}_chf(params..., x)
- Cumulative hazard function
dist_{distribution}_kurtosis_excess(params...)
- Excess kurtosisdist_{distribution}_kurtosis(params...)
- Kurtosisdist_{distribution}_mean(params...)
- Expected valuedist_{distribution}_median(params...)
- Median (50th percentile)dist_{distribution}_mode(params...)
- Mode (most likely value)dist_{distribution}_range(params...)
- Support rangedist_{distribution}_skewness(params...)
- Skewnessdist_{distribution}_stddev(params...)
- Standard deviationdist_{distribution}_support(params...)
- Distribution supportdist_{distribution}_variance(params...)
- Variance
-- Generate random samples from N(0, 1)
SELECT dist_normal_sample(0.0, 1.0) AS random_value;
-- Calculate PDF at x = 0.5 for N(0, 1)
SELECT dist_normal_pdf(0.0, 1.0, 0.5) AS density;
-- Calculate CDF (probability that X ≤ 1.96)
SELECT dist_normal_cdf(0.0, 1.0, 1.96) AS probability;
-- Find 95th percentile
SELECT dist_normal_quantile(0.0, 1.0, 0.95) AS percentile_95;
-- Get distribution properties
SELECT
dist_normal_mean(0.0, 1.0) AS mean,
dist_normal_variance(0.0, 1.0) AS variance,
dist_normal_skewness(0.0, 1.0) AS skewness;
-- Probability mass function for 10 trials, p=0.3
SELECT dist_binomial_pdf(10, 0.3, 7) AS prob_exactly_7;
-- Cumulative probability (≤ 5 successes)
SELECT dist_binomial_cdf(10, 0.3, 5) AS prob_at_most_5;
-- Generate random binomial samples
SELECT dist_binomial_sample(10, 0.3) AS random_successes;
-- Generate synthetic dataset
CREATE TABLE synthetic_data AS
SELECT
i,
dist_normal_sample(100, 15) AS height_cm,
dist_normal_sample(70, 10) AS weight_kg,
dist_binomial_sample(1, 0.5) AS gender -- 0 or 1
FROM range(1000) t(i);
-- Calculate z-scores
SELECT
height_cm,
(height_cm - dist_normal_mean(100, 15)) / dist_normal_stddev(100, 15) AS height_zscore
FROM synthetic_data;
-- Probability calculations
SELECT
weight_kg,
dist_normal_cdf(70, 10, weight_kg) AS percentile
FROM synthetic_data
LIMIT 10;
Common Task: Determine if there's a statistically significant difference between conversion rates.
Relevant Functions: dist_normal_cdf
, dist_normal_cdf_complement
, dist_normal_pdf
Common Task: Calculate Value at Risk (VaR) for portfolio management.
Relevant Functions: dist_normal_sample
, dist_normal_quantile
, dist_normal_cdf
Common Task: Monitor manufacturing processes and detect out-of-control conditions.
Relevant Functions: dist_normal_sample
, dist_normal_cdf
, dist_normal_pdf
Common Task: Build prediction intervals for forecasting models.
Relevant Functions: dist_normal_quantile
, dist_normal_cdf
, dist_normal_sample
Common Task: Model customer lifetime value with uncertainty quantification.
Relevant Functions: dist_normal_sample
, dist_exponential_sample
, dist_normal_quantile
, dist_normal_cdf
Common Task: Detect anomalies in time series data using statistical methods.
Relevant Functions: dist_normal_pdf
, dist_normal_cdf
, dist_normal_cdf_complement
Common Task: Run Monte Carlo simulations for risk analysis, optimization, or modeling.
Relevant Functions: dist_normal_sample
, dist_uniform_real_sample
, dist_gamma_sample
, dist_beta_sample
Common Task: Perform statistical hypothesis tests (t-tests, chi-square tests, etc.).
Relevant Functions: dist_students_t_cdf
, dist_chi_squared_cdf
, dist_normal_cdf
, dist_fisher_f_cdf
Common Task: Implement Bayesian statistical models and posterior analysis.
Relevant Functions: dist_beta_pdf
, dist_gamma_pdf
, dist_normal_pdf
, dist_beta_sample
Common Task: Analyze time-to-event data in medical research or reliability engineering.
Relevant Functions: dist_exponential_pdf
, dist_weibull_pdf
, dist_gamma_pdf
, dist_exponential_cdf
- No Data Movement: Analysis happens where your data lives
- SQL Familiarity: Use existing SQL skills instead of learning specialized libraries
- Performance: Columnar processing with vectorized statistical operations
- Integration: Works seamlessly with existing BI tools and SQL workflows
- Real-time: Analyze streaming data without export/import cycles
Statistical operations are vectorized and optimized for DuckDB's columnar engine.
All distribution functions include comprehensive parameter validation:
-- This will throw an error: standard deviation must be > 0
SELECT dist_normal_pdf(0.0, -1.0, 0.5);
-- Error: normal: Standard deviation must be > 0 was: -1.000000
-- This will throw an error: probability must be between 0 and 1
SELECT dist_binomial_pdf(10, 1.5, 5);
-- Error: binomial: Probability must be between 0 and 1 was: 1.500000
MIT Licensed