Math Notes

Study notes on probability, information theory, random processes, and Bayesian inference.

Probability & Statistics

Foundations, estimation, dependence, and convergence.

Selective study notes; not externally reviewed. Some derivations and implementations still need checking.

10 items

Measure Theory & Random Variables

Measurable spaces, probability measures, pushforward measures, densities, and importance sampling

Notation: density form vs. measure-theoretic form

Density notation, measure-theoretic notation, and the places where the choice changes the statement

Named Distributions

How Bernoulli, Poisson, Gaussian, Cauchy, chi-square, t, F, conjugate priors, and heavy-tail laws are related

Modes of Convergence

Almost sure, in probability, in distribution, in L^p — the implication lattice, counterexamples as sample paths, Markov/Chebyshev/Chernoff bounds, and MCT/DCT/Fatou

Calculus of Variations

First variations, Euler-Lagrange residuals, curve relaxation, and the brachistochrone race

Sufficient Statistics

Fisher–Neyman factorization, the fiber picture, Rao–Blackwell variance collapse, and a categorical diagram showing factorization, variance, and Fisher information as one commuting square

The Exponential Family

The canonical form, the naming, and the statistical-physics log-partition story; derivatives of A giving the cumulants (mean, variance) of T(X); canonical links (logit, log) behind GLMs; and an interactive picker stepping through six standard members

Fisher Information

Likelihood geometry, score functions, Fisher information, exponential families, log-partition, Jeffreys and max-entropy priors, and Bayesian updates

Hypothesis Testing

Type-I error, type-II error, power, and decision thresholds through the classic overlapping-distributions diagram

Distance Correlation

Distance-based dependence tests, partial distance correlation, and cases Pearson r misses

Paper notes →

Information Theory

Surprise, divergence, and the geometry of distributions.

Self-study notes; not externally reviewed. Some derivations and implementations still need checking.

4 items

Entropy & Mutual Information

Average surprise, conditional entropy, shared information, binary distributions, and noisy channels

KL Divergence

Directed distribution mismatch, support errors, and the difference between forward and reverse KL

Optimal Transport

Wasserstein distance and the transport plan: Sinkhorn iteration, and why OT gives gradients where KL divergence does not

Information Geometry

Probability distributions as a manifold: the probability simplex, Fisher-Rao geodesics, dual flatness, and e- vs m-projections

Random Processes

Stochastic processes in time, frequency, and function space.

Topics guided by (but not strictly following) Prof. Ercan Kuruoğlu's Fall 2025 Random Processes course at Tsinghua SIGS.

Selective study notes; not externally reviewed. Some derivations and implementations still need checking.

5 items

Poisson Processes

Rare events in time: exponential waits, Poisson counts, equivalent definitions, thinning, splitting, and process diagnostics

Markov Chains

Discrete-time and continuous-time chains, stationary distributions, mixing, recurrence, periodicity, and CTMC rates

LTI Systems on Random Inputs

Convolution, correlation propagation, spectra, AR(1) as a leaky integrator, and stationarity through linear filters

Power Spectral Density

Autocorrelation, spectra, LTI shaping, and periodograms for stationary random processes

Gaussian Processes for Regression

Priors over functions, kernels, posterior conditioning, marginal likelihood, 2-D regression, and acquisition functions

Bayesian Inference

Approximating intractable posteriors by sampling and by optimization.

Material draws on Prof. Ercan Kuruoğlu's Spring 2026 Bayesian Inference and Monte Carlo Simulation course at Tsinghua SIGS.

Selective study notes; not externally reviewed. Some use measure-theoretic notation alongside density notation. Some derivations and implementations still need checking.

12 items

Choosing a Prior

Principles of prior selection: use real prior information when you have it; otherwise group invariance, max entropy, or Jeffreys — and how the three routes disagree near boundaries

Conjugate Priors & the Exponential Family

Why some prior–likelihood pairs update in closed form, hyperparameters as pseudo-counts, worked Beta/Normal/Gamma examples, and a table of standard pairs

Posterior Summaries & Bayes Risk

Squared, absolute, and zero-one loss pick out the posterior mean, median, and mode — three views of the same posterior, only one of which ignores everything but the peak

Hierarchical Bayes

Two-level Normal–Normal model, the posterior formula for borrowing strength across groups, empirical-Bayes fitting of the between-group variance, and the connection to ridge regression

Bayesian Regression: Penalties as Priors

OLS, ridge, LASSO, and best-subset selection as MAP under four noise/prior pairs — and why the shape of the prior near zero determines whether the estimator shrinks, selects, or both

Bayesian Graphical Models

DAG factorization, d-separation, explaining away, Dirichlet-multinomial CPT learning, and structure scoring

Hidden Markov Models

HMM sampling, forward-backward filtering and smoothing, log-domain messages, and Viterbi versus marginal MAP paths

Monte Carlo & MCMC

Rejection, importance sampling, Metropolis-Hastings, Gibbs, RJMCMC, simulated annealing, and when to use each method on a static target

Kalman & Particle Filters

Sequential inference of a hidden state from noisy observations: the Kalman filter for linear-Gaussian models, EKF/UKF for local linearization, and the particle filter for fully nonlinear, non-Gaussian state-space models

Free Energy & Variational Inference

The free-energy/ELBO identity and how it turns posterior approximation into optimization

Variational Bayes for Gaussian Mixtures

CAVI for a 2-D Gaussian mixture with Normal–Wishart and Dirichlet priors, showing component ellipses, automatic pruning of unused components, and the ELBO trace

Bayesian Neural Networks

Weight posteriors, predictive function ensembles, Laplace approximation, evidence, Occam's hill, and prior mismatch