Computational Lebesgue Integration

Introduction

The Lebesgue integral, \(\int_{\mathcal{X}} f(x) \, \mu(\text{d}x)\), makes few structural assumptions on the domain of the integrand, \(\mathcal{X}\). This high level of generality is useful, for example, when performing Bayesian inference over combinatorial spaces. As a specific use case, consider Bayesian phylogenetic inference, where \(\mathcal{X}\) is the space of phylogenetic trees.

Are there generic tools to approach the problem of computing Lebesgue integrals? By generic, we mean that they should apply to all situations where the Lebesgue integral is defined.

Surprisingly, the answer is yes. The generic setup is too abstract to permit concrete algorithms, but it does permit the elaboration and analysis of “meta-algorithms.”

The general story of these meta-algorithms starts in the same way as Lebesgue’s original idea: exploiting the fact that while the domain of \(f\) is unstructured, its image is \(\mathbb{R}\).

But in contrast to the theory of Lebesgue integration, which is a settled field, Computational Lebesgue Integration is in active development.

This page is a growing bibliography of Computational Lebesgue Integration methods. Create a pull request if you would like to add missing entries. To keep the length of this page manageable, we only include methods that (1) apply to any type of target distribution (e.g., we do not include methods that assume real-valued targets), and (2) have reasonable scalability to large problems.

Fundamentals

Annealing

Figure: an example of annealed distributions.

Many methods described here anneal the distribution of interest \(\pi(\text{d}x) \propto \exp(\ell(x)) \pi_0(\text{d}x)\), i.e., consider the path \(\pi_\beta(\text{d}x) \propto \exp(\beta \ell(x)) \pi_0(\text{d}x)\). Given the direct connection between \(\beta\) and the inverse temperature in statistical mechanics, the notion of annealing appears right from the very beginning of the Monte Carlo literature, in Equation (2) of Metropolis et al. (1953). It is used as a tool detached from its statistical mechanics interpretation as early as Pincus (1970).
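In code, the annealing path amounts to a one-parameter interpolation of log-densities. A minimal sketch, with illustrative names and toy densities that are not from any particular package:

```python
import math

def annealed_log_density(beta, ell, log_ref, x):
    """Unnormalized log-density of pi_beta: beta * ell(x) + log pi_0(x).

    At beta = 0 this reduces to the tractable reference pi_0;
    at beta = 1 it is the target pi (up to normalization)."""
    return beta * ell(x) + log_ref(x)

# Toy example: standard normal reference and a quadratic log-likelihood.
log_ref = lambda x: -0.5 * x * x - 0.5 * math.log(2 * math.pi)
ell = lambda x: -((x - 3.0) ** 2)

print(annealed_log_density(0.5, ell, log_ref, 1.0))
```

Sweeping `beta` from 0 to 1 traces out the path \(\pi_\beta\) used by all the methods below.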

Meta-algorithms

Figure: swap structure of a PT algorithm.

Two large families of meta-algorithms are able to take a basic, inefficient MCMC algorithm and transform it into a potentially much faster one while preserving consistency guarantees.

The first family is designed via MCMC on augmented distributions:

Figure: SMC particle genealogy.
  • Parallel tempering (PT): Charles J. Geyer (1991)
  • Simulated tempering (ST): Charles J. Geyer and Thompson (1995)
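The key move in this first family is a swap of states between chains at adjacent inverse temperatures, accepted with a Metropolis ratio that involves only the log-likelihood. A minimal sketch (the function name and signature are ours, not from a specific library):

```python
import math

def pt_swap_accept_prob(beta_lo, beta_hi, ell_lo, ell_hi):
    """Metropolis acceptance probability for swapping the states of two
    chains at inverse temperatures beta_lo < beta_hi.

    ell_lo and ell_hi are the log-likelihoods ell(x) of the states
    currently held by the beta_lo chain and the beta_hi chain."""
    log_ratio = (beta_hi - beta_lo) * (ell_lo - ell_hi)
    return min(1.0, math.exp(log_ratio))

# A swap that moves a higher-likelihood state toward beta = 1 is always accepted:
print(pt_swap_accept_prob(0.5, 1.0, 4.0, 1.0))  # 1.0
```

Because the acceptance probability depends on \(x\) only through \(\ell(x)\), swaps work unchanged on black-box state spaces.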

The second family uses importance sampling on augmented distributions combined with resampling:

  • Annealed importance sampling (AIS): Neal (2001)
  • Annealed Sequential Monte Carlo (ASMC): Del Moral, Doucet, and Jasra (2006)
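A single AIS particle alternates incremental importance weighting with MCMC moves along the schedule. A self-contained sketch under a toy Gaussian setup; the reference sampler and random-walk kernel are illustrative stand-ins for whatever the user supplies:

```python
import math, random

def ais_log_weight(betas, ell, sample_reference, transition):
    """One AIS particle: returns (final_state, log_importance_weight).

    betas is an increasing schedule 0 = beta_0 < ... < beta_T = 1;
    transition(x, beta) is any MCMC kernel leaving pi_beta invariant."""
    x = sample_reference()
    log_w = 0.0
    for b_prev, b_next in zip(betas[:-1], betas[1:]):
        log_w += (b_next - b_prev) * ell(x)  # incremental importance weight
        x = transition(x, b_next)            # move under pi_{b_next}
    return x, log_w

# Toy setup: standard normal reference, ell(x) = -(x - 1)^2 / 2, and a
# random-walk Metropolis kernel targeting pi_beta.
def make_rw_kernel(ell, log_ref, step=0.5):
    def transition(x, beta):
        y = x + random.gauss(0.0, step)
        log_a = beta * ell(y) + log_ref(y) - beta * ell(x) - log_ref(x)
        return y if math.log(random.random()) < log_a else x
    return transition

log_ref = lambda x: -0.5 * x * x
ell = lambda x: -0.5 * (x - 1.0) ** 2
betas = [i / 10 for i in range(11)]
kernel = make_rw_kernel(ell, log_ref)
x, log_w = ais_log_weight(betas, ell, lambda: random.gauss(0.0, 1.0), kernel)
```

Averaging `exp(log_w)` over many particles gives an unbiased estimate of the normalization constant \(\int \exp(\ell(x)) \, \pi_0(\text{d}x)\); ASMC additionally resamples particles according to these weights.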

Normalization constants

To obtain estimates of the normalization constant from the output of the above algorithms, several estimators are available, for example:

  • The stepping stone estimator, a product of importance sampling estimators; see Bennett (1976) and Xie et al. (2011).
  • Bridge sampling (Gelman and Meng 1998).
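The stepping stone identity writes the normalization constant as a telescoping product of ratios, each estimated by importance sampling from the previous annealed distribution. A sketch, with sample generation left to the user (names are ours):

```python
import math

def stepping_stone_log_z(betas, ell_samples):
    """Stepping stone estimate of log Z, the log normalization constant
    ratio between pi_1 and pi_0.

    ell_samples[t] holds log-likelihood values ell(x) for samples
    x ~ pi_{beta_t}, for each beta in the schedule except the last."""
    log_z = 0.0
    for (b_prev, b_next), ells in zip(zip(betas[:-1], betas[1:]), ell_samples):
        d = b_next - b_prev
        # log-mean-exp of d * ell over the samples, for numerical stability
        m = max(d * e for e in ells)
        log_z += m + math.log(sum(math.exp(d * e - m) for e in ells) / len(ells))
    return log_z
```

Each factor only requires samples from the annealed distributions, which PT, ST, AIS, and ASMC all produce as a by-product.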

Non-reversible variants

The variable \(\beta\) lies in \([0, 1]\), and hence facilitates the design of non-reversible algorithms even when \(\mathcal{X}\) is a black-box state space.

  • In parallel tempering:
    • First occurrence of odd-even swaps: Okabe et al. (2001)
    • Non-Reversible Parallel Tempering (NRPT): Syed et al. (2022)
  • Non-reversible simulated tempering (NRST)
    • Sakai and Hukushima (2016)
    • Biron-Lattes, Campbell, and Bouchard-Côté (2024)

Methodological extensions

Adaptive schedules

In challenging problems, using an adaptive discretization of \(\beta \in [0, 1]\) is critical for good performance.

  • For stochastic gradient methods, see Atchadé, Roberts, and Rosenthal (2011) and Miasojedow, Moulines, and Vihola (2013).
  • Round based: Syed et al. (2022).
  • See the subsection “Round-based method” of Bouchard-Côté et al. (2024) for an explanation of how this implicitly pre-conditions the adaptation optimization.
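A round-based update in this spirit can be sketched as follows: estimate the cumulative swap rejection along the current grid, then place the new grid points at equal increments of it. This is a simplified rendering of the idea, not the exact published algorithm:

```python
def update_schedule(betas, rejection_rates):
    """One round-based schedule update: rejection_rates[i] is the observed
    swap rejection rate between chains i and i+1 on the grid `betas`.
    Returns a new grid that equalizes rejection across segments."""
    # Cumulative rejection ("communication barrier" estimate) at grid points
    cum = [0.0]
    for r in rejection_rates:
        cum.append(cum[-1] + r)
    total = cum[-1]
    n = len(betas)
    new_betas = [0.0]
    for k in range(1, n - 1):
        target = total * k / (n - 1)
        # Find the segment containing `target` and interpolate linearly
        i = next(j for j in range(len(rejection_rates)) if cum[j + 1] >= target)
        frac = (target - cum[i]) / (cum[i + 1] - cum[i]) if cum[i + 1] > cum[i] else 0.0
        new_betas.append(betas[i] + frac * (betas[i + 1] - betas[i]))
    new_betas.append(1.0)
    return new_betas
```

When the rejection rates are already uniform, the schedule is a fixed point of this update; otherwise grid points migrate toward the high-rejection regions of \([0, 1]\).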

Generalized distribution paths

Most of the methods described so far also apply when the annealing distributions are generalized into a path of distributions. Paths that are geodesic can lead to substantial savings.

  • For AIS/ASMC, see:
    • Gelman and Meng (1998)
    • Grosse, Maddison, and Salakhutdinov (2013)
    • Brekelmans et al. (2020)
  • For PT, see Syed et al. (2021)

Variational blends

Instead of a fixed prior \(\pi_0\) as a tractable reference, a fitted variational distribution \(q_\phi\) can also be used, either on its own or in conjunction with the prior.

  • In the context of path sampling:
    • Lefebvre et al. (2009)
    • Wang, Jones, and Meng (2022)
  • For PT:
    • Paquet, Winther, and Opper (2009)
    • Cameron and Pettitt (2014)
    • Surjanovic et al. (2022)

Convergence of adaptive methods

Theoretical frameworks to analyze the convergence of the above adaptive methods (schedule, end points, paths):

  • For stochastic gradient methods, see, e.g., Andrieu and Thoms (2008) for the general toolbox and Miasojedow, Moulines, and Vihola (2013) for analysis in the case of PT specifically.
  • For round based schemes, see Chimisov, Latuszynski, and Roberts (2018) for general tools, and Surjanovic et al. (2022) for analysis in the case of variational PT specifically.

Software

Software implementations of the above meta-algorithms (PT, ST, AIS, ASMC) that support arbitrary data types. We do not include packages specialized to real-valued targets, nor software that uses algorithms that do not scale to large problems (e.g., universal PPLs that only provide algorithms such as non-Markovian SMC or lightweight Metropolis-Hastings).

References

Andrieu, Christophe, and Johannes Thoms. 2008. “A Tutorial on Adaptive MCMC.” Statistics and Computing 18 (4): 343–73. https://doi.org/10.1007/s11222-008-9110-y.
Atchadé, Yves F., Gareth O. Roberts, and Jeffrey S. Rosenthal. 2011. “Towards Optimal Scaling of Metropolis-Coupled Markov Chain Monte Carlo.” Statistics and Computing 21 (4): 555–68. https://doi.org/10.1007/s11222-010-9192-1.
Bennett, Charles H. 1976. “Efficient Estimation of Free Energy Differences from Monte Carlo Data.” Journal of Computational Physics 22 (2): 245–68. https://doi.org/10.1016/0021-9991(76)90078-4.
Biron-Lattes, Miguel, Trevor Campbell, and Alexandre Bouchard-Côté. 2024. “Automatic Regenerative Simulation via Non-Reversible Simulated Tempering.” Journal of the American Statistical Association, 1–13. https://doi.org/10.1080/01621459.2024.2335587.
Bouchard-Côté, Alexandre, Trevor Campbell, Geoff Pleiss, and Nikola Surjanovic. 2024. “MCMC-Driven Learning.” In Handbook of Markov Chain Monte Carlo. Vol. (Accepted).
Bouchard-Côté, Alexandre, Kevin Chern, Davor Cubranic, Sahand Hosseini, Justin Hume, Matteo Lepur, Zihui Ouyang, and Giorgio Sgarbi. 2022. “Blang: Probabilistic Programming for Combinatorial Spaces.” Journal of Statistical Software 103: 1–98.
Brekelmans, Rob, Vaden Masrani, Frank Wood, Greg Ver Steeg, and Aram Galstyan. 2020. “All in the Exponential Family: Bregman Duality in Thermodynamic Variational Inference.” In Proceedings of the 37th International Conference on Machine Learning, 1111–22. PMLR. https://proceedings.mlr.press/v119/brekelmans20a.html.
Cameron, Ewan, and Anthony Pettitt. 2014. “Recursive Pathways to Marginal Likelihood Estimation with Prior-Sensitivity Analysis.” Statistical Science 29 (3): 397–419. https://www.jstor.org/stable/43288518.
Chimisov, Cyril, Krzysztof Latuszynski, and Gareth Roberts. 2018. “Adapting the Gibbs Sampler.” arXiv:1801.09299 [Stat], January. http://arxiv.org/abs/1801.09299.
Del Moral, Pierre, Arnaud Doucet, and Ajay Jasra. 2006. “Sequential Monte Carlo Samplers.” Journal of the Royal Statistical Society. Series B (Statistical Methodology) 68 (3): 411–36. http://www.jstor.org/stable/3879283.
Ge, Hong, Kai Xu, and Zoubin Ghahramani. 2018. “Turing: A Language for Flexible Probabilistic Inference.” In International Conference on Artificial Intelligence and Statistics, 1682–90. PMLR. https://proceedings.mlr.press/v84/ge18b.html.
Gelman, Andrew, and Xiao-Li Meng. 1998. “Simulating Normalizing Constants: From Importance Sampling to Bridge Sampling to Path Sampling.” Statistical Science 13 (2): 163–85. https://doi.org/10.1214/ss/1028905934.
Geyer, Charles J. 1991. “Markov Chain Monte Carlo Maximum Likelihood.” Interface Proceedings.
Geyer, Charles J., and Elizabeth A. Thompson. 1995. “Annealing Markov Chain Monte Carlo with Applications to Ancestral Inference.” Journal of the American Statistical Association 90 (431): 909–20. https://doi.org/10.2307/2291325.
Grosse, Roger B, Chris J Maddison, and Russ R Salakhutdinov. 2013. “Annealing Between Distributions by Averaging Moments.” In Advances in Neural Information Processing Systems 26, edited by C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, 2769–77. http://papers.nips.cc/paper/4879-annealing-between-distributions-by-averaging-moments.pdf.
Lefebvre, Geneviève, Russell Steele, Alain C. Vandal, Sridar Narayanan, and Douglas L. Arnold. 2009. “Path Sampling to Compute Integrated Likelihoods: An Adaptive Approach.” Journal of Computational and Graphical Statistics 18 (2): 415–37. https://doi.org/10.1198/jcgs.2009.07019.
Metropolis, Nicholas, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller. 1953. “Equation of State Calculations by Fast Computing Machines.” The Journal of Chemical Physics 21 (6): 1087–92. https://doi.org/10.1063/1.1699114.
Miasojedow, Błażej, Eric Moulines, and Matti Vihola. 2013. “An Adaptive Parallel Tempering Algorithm.” Journal of Computational and Graphical Statistics 22 (3): 649–64. https://doi.org/10.1080/10618600.2013.778779.
Neal, Radford M. 2001. “Annealed Importance Sampling.” Statistics and Computing 11 (2): 125–39. https://doi.org/10.1023/A:1008923215028.
Okabe, Tsuneyasu, Masaaki Kawata, Yuko Okamoto, and Masuhiro Mikami. 2001. “Replica-Exchange Monte Carlo Method for the Isobaric-Isothermal Ensemble.” Chemical Physics Letters 335 (5-6): 435–39. https://doi.org/10.1016/S0009-2614(01)00055-0.
Paquet, Ulrich, Ole Winther, and Manfred Opper. 2009. “Perturbation Corrections in Approximate Inference: Mixture Modelling Applications.” Journal of Machine Learning Research 10 (43): 1263–1304. http://jmlr.org/papers/v10/paquet09a.html.
Pincus, Martin. 1970. “A Monte Carlo Method for the Approximate Solution of Certain Types of Constrained Optimization Problems.” Operations Research 18 (6): 1225–28. https://www.jstor.org/stable/169420.
Pleydell, David. 2021. DRJP/nimbleAPT: V1.0.4. Zenodo. https://doi.org/10.5281/zenodo.5013688.
Sakai, Yuji, and Koji Hukushima. 2016. “Irreversible Simulated Tempering.” Journal of the Physical Society of Japan 85 (10): 104002. https://doi.org/10.7566/JPSJ.85.104002.
Surjanovic, Nikola, Miguel Biron-Lattes, Paul Tiede, Saifuddin Syed, Trevor Campbell, and Alexandre Bouchard-Côté. 2023. “Pigeons.jl: Distributed Sampling From Intractable Distributions.” arXiv. https://doi.org/10.48550/arXiv.2308.09769.
Surjanovic, Nikola, Saifuddin Syed, Alexandre Bouchard-Côté, and Trevor Campbell. 2022. “Parallel Tempering With a Variational Reference.” In Advances in Neural Information Processing Systems 35 (NeurIPS), 35:565–77.
Syed, Saifuddin, Alexandre Bouchard-Côté, George Deligiannidis, and Arnaud Doucet. 2022. “Non-Reversible Parallel Tempering: A Scalable Highly Parallel MCMC Scheme.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 84 (2): 321–50. https://doi.org/10.1111/rssb.12464.
Syed, Saifuddin, Vittorio Romaniello, Trevor Campbell, and Alexandre Bouchard-Côté. 2021. “Parallel Tempering on Optimized Paths.” In International Conference on Machine Learning (ICML), 139:10033–42.
Wang, Lazhi, David E. Jones, and Xiao-Li Meng. 2022. “Warp Bridge Sampling: The Next Generation.” Journal of the American Statistical Association 117 (538): 835–51. https://doi.org/10.1080/01621459.2020.1825447.
Xie, Wangang, Paul O. Lewis, Yu Fan, Lynn Kuo, and Ming-Hui Chen. 2011. “Improving Marginal Likelihood Estimation for Bayesian Phylogenetic Model Selection.” Systematic Biology 60 (2): 150–60. https://doi.org/10.1093/sysbio/syq085.