Bayesian Regression

Every loss is a likelihood. Every regularizer is a prior. Least squares, ridge, LASSO, and best-subset selection are MAP estimates under four different probabilistic models.

1. The dictionary

The MAP estimate maximises the posterior, or equivalently minimises its negative log:

$$ \hat\beta_{\text{MAP}} \;=\; \arg\max_\beta\;\log p(y\mid X,\beta) + \log p(\beta) \;=\; \arg\min_\beta\;\bigl[\,-\log p(y\mid X,\beta)\,\bigr] + \bigl[\,-\log p(\beta)\,\bigr]. $$

The first bracket is a data-fit loss; the second is a regularizer. Read the equation in reverse and every penalised-likelihood method you already know turns into a Bayesian model:

| Frequentist piece | Bayesian piece |
| --- | --- |
| loss $\ell(y, X\beta)$ | $-\log p(y\mid X,\beta)$ (likelihood) |
| regularizer $\lambda\,R(\beta)$ | $-\log p(\beta)$ (prior) |
| regularization strength $\lambda$ | ratio of noise variance to prior variance |
| $\arg\min$ (loss + penalty) | MAP estimate |

That is the entire trick. The four sections below just instantiate it for specific choices of likelihood and prior.

2. Least squares ⇔ MLE with Gaussian noise

The model:

$$ y_i \;=\; x_i^\top\beta + \varepsilon_i,\qquad \varepsilon_i \sim \mathcal N(0,\sigma^2). $$

The likelihood factorises across observations:

$$ p(y\mid X,\beta) \;\propto\; \exp\!\left(-\frac{1}{2\sigma^2}\sum_i (y_i - x_i^\top\beta)^2\right). $$

Taking the negative log and dropping additive constants leaves $\tfrac{1}{2\sigma^2}\sum_i (y_i - x_i^\top\beta)^2$. Maximum likelihood is therefore ordinary least squares:

$$ \hat\beta_{\text{OLS}} \;=\; \arg\min_\beta\;\sum_i (y_i - x_i^\top\beta)^2. $$

Why squared error? Because the Gaussian log-density is quadratic. A residual twice as large costs four times as much in negative log-likelihood, so the estimator pushes hardest against the biggest residuals. Switch the noise distribution to Laplace ($\varepsilon_i \sim \text{Laplace}(0,b)$) and the same derivation gives least absolute deviations, i.e. median regression, a staple of robust statistics. Switch to Student-$t$ and you get a robust regression that down-weights outliers.
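A quick numerical check of this equivalence, as a minimal sketch: the synthetic data, the coefficient values, and the function names `gaussian_nll` and `gaussian_nll_grad` are all illustrative assumptions, not part of the text. It confirms that the least-squares solution is a stationary point of the Gaussian negative log-likelihood, and, for an intercept-only model, that squared loss is minimised at the mean while absolute (Laplace) loss is minimised at the median.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.5, -2.0, 0.5])   # illustrative values, not from the text
sigma = 0.7                               # assumed known noise scale
y = X @ beta_true + sigma * rng.normal(size=n)

def gaussian_nll(beta):
    """Negative log-likelihood under N(0, sigma^2) noise, additive constants dropped."""
    r = y - X @ beta
    return 0.5 * (r @ r) / sigma**2

def gaussian_nll_grad(beta):
    """Gradient of gaussian_nll with respect to beta."""
    return -(X.T @ (y - X @ beta)) / sigma**2

# Ordinary least squares via a standard solver.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# The OLS solution should zero the gradient of the Gaussian NLL.
grad_norm = np.linalg.norm(gaussian_nll_grad(beta_ols))
print("gradient norm of the NLL at the OLS solution:", grad_norm)

# Intercept-only model y_i = c + eps_i, minimised by grid search over c.
c_grid = np.linspace(y.min(), y.max(), 2001)
sq_loss = ((y[:, None] - c_grid[None, :]) ** 2).sum(axis=0)    # Gaussian noise
abs_loss = np.abs(y[:, None] - c_grid[None, :]).sum(axis=0)    # Laplace noise
c_gauss = c_grid[np.argmin(sq_loss)]
c_laplace = c_grid[np.argmin(abs_loss)]
print("squared-loss minimiser :", c_gauss, " sample mean  :", y.mean())
print("absolute-loss minimiser:", c_laplace, " sample median:", np.median(y))
```

The noise scale `sigma` rescales the objective but does not move its minimiser, which is why it drops out of the OLS formula.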

3. L2 regularizer ⇔ Gaussian prior (ridge)

Place an independent Gaussian prior on each coefficient:

$$ \beta_j \;\sim\; \mathcal N(0, \tau^2) \qquad\Longrightarrow\qquad p(\beta) \;\propto\; \exp\!\left(-\frac{1}{2\tau^2}\sum_j \beta_j^2\right). $$

The negative log posterior, again up to constants, is the ridge objective:

$$ \hat\beta_{\text{ridge}} \;=\; \arg\min_\beta\;\sum_i (y_i - x_i^\top\beta)^2 \;+\; \lambda\sum_j \beta_j^2, \qquad \lambda \;=\; \frac{\sigma^2}{\tau^2}. $$

The regularization strength is the ratio of noise variance to prior variance. Tight prior (small $\tau^2$) means strong shrinkage; vague prior (large $\tau^2$) means $\lambda\to 0$ and ridge collapses back to OLS.

Why ridge shrinks but doesn't select. The Gaussian log-density $-\beta_j^2/2\tau^2$ is differentiable and flat at zero: its slope there is $0$. A small coefficient incurs vanishingly small penalty, so the optimum sits at the place where the likelihood gradient balances the prior gradient. That balance point is generically nonzero. The estimator pulls every coefficient toward zero but stops short of the axis.
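The ridge correspondence can be sketched numerically. This is an illustration under assumptions: the data are synthetic, and `ridge_map` and `neg_log_posterior_grad` are hypothetical helper names. The sketch solves the ridge normal equations with $\lambda = \sigma^2/\tau^2$ and checks that the result zeroes the gradient of the negative log posterior, that a vague prior recovers OLS, and that a tight prior shrinks every coefficient without setting any to exactly zero.

```python
import numpy as np

def ridge_map(X, y, sigma2, tau2):
    """MAP estimate under Gaussian noise (variance sigma2) and a N(0, tau2) prior."""
    lam = sigma2 / tau2                      # regularization strength from the text
    p = X.shape[1]
    beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    return beta, lam

def neg_log_posterior_grad(beta, X, y, sigma2, tau2):
    """Gradient of RSS / (2 sigma2) + sum(beta^2) / (2 tau2)."""
    return -(X.T @ (y - X @ beta)) / sigma2 + beta / tau2

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(size=50)   # illustrative data
sigma2 = 1.0                                                     # assumed known

beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_tight, lam_tight = ridge_map(X, y, sigma2, tau2=0.01)   # tight prior, lam = 100
beta_vague, lam_vague = ridge_map(X, y, sigma2, tau2=1e8)    # vague prior, lam = 1e-8

grad_tight = neg_log_posterior_grad(beta_tight, X, y, sigma2, 0.01)
print("lambda (tight, vague):", lam_tight, lam_vague)
print("OLS   :", beta_ols)
print("tight :", beta_tight)
print("vague :", beta_vague)
print("gradient norm at the tight-prior MAP:", np.linalg.norm(grad_tight))
```

Even with the tight prior, the fourth coefficient (whose true value is zero here) comes out small but not exactly zero, which is the "shrinks but doesn't select" behaviour.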

4. L1 regularizer ⇔ Laplace prior (LASSO)

Swap the Gaussian prior for a Laplace prior: same symmetric shape, but with $|\beta_j|$ in the exponent rather than $\beta_j^2$:

$$ \beta_j \;\sim\; \text{Laplace}(0, b), \qquad p(\beta_j) \;\propto\; \exp\!\left(-\frac{|\beta_j|}{b}\right). $$

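The negative log posterior is $\tfrac{1}{2\sigma^2}\sum_i (y_i - x_i^\top\beta)^2 + \tfrac{1}{b}\sum_j |\beta_j|$ up to constants. Multiplying through by $2\sigma^2$, the MAP objective becomes the LASSO:

$$ \hat\beta_{\text{LASSO}} \;=\; \arg\min_\beta\;\sum_i (y_i - x_i^\top\beta)^2 \;+\; \lambda\sum_j |\beta_j|, \qquad \lambda \;=\; \frac{2\sigma^2}{b}. $$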

Why LASSO selects. The Laplace log-density $-|\beta_j|/b$ is not differentiable at zero: its left and right slopes are $+1/b$ and $-1/b$. That kink means there is a finite range of likelihood gradients for which the prior pull at $\beta_j = 0$ can dominate from either side, and the optimum sits exactly at zero. The Laplace prior also has heavier tails than a Gaussian of comparable scale, so once a coefficient escapes the kink, the prior tolerates large values better than ridge would. The net effect is shrinkage plus selection: some coefficients are pushed to exactly $0$, the rest are only mildly biased.
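To see the kink produce exact zeros, here is a minimal proximal-gradient (ISTA) sketch for the LASSO objective written above. ISTA is one standard solver choice, not something the text prescribes; the data, the value of `lam`, and the names `soft_threshold` and `lasso_ista` are assumptions made for illustration.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * |.|: shrink toward zero, exact zero inside [-t, t]."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=5000):
    """Minimise sum_i (y_i - x_i^T beta)^2 + lam * sum_j |beta_j| by proximal gradient."""
    # The gradient of the residual sum of squares is Lipschitz with constant
    # 2 * (largest singular value of X)^2, so 1 / L is a safe step size.
    L = 2.0 * np.linalg.norm(X, 2) ** 2
    step = 1.0 / L
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = -2.0 * X.T @ (y - X @ beta)
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

rng = np.random.default_rng(2)
n, p = 100, 6
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 0.0, 0.0])   # illustrative sparse truth
y = X @ beta_true + 0.5 * rng.normal(size=n)

lam = 60.0                                               # illustrative strength
beta_lasso = lasso_ista(X, y, lam)
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Optimality check: at the solution, |2 x_j^T r| <= lam for every coordinate.
kkt = np.abs(2.0 * X.T @ (y - X @ beta_lasso))
print("OLS  :", np.round(beta_ols, 3))
print("LASSO:", np.round(beta_lasso, 3))
print("max |2 x_j^T r|:", kkt.max(), " lambda:", lam)
```

The soft-threshold step is where selection happens: any coordinate whose gradient step lands inside the dead zone is set to exactly `0.0`, not merely to a small number.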

5. L0 regularizer ⇔ sparsity prior (best subset)

The $\ell_0$ "norm" counts nonzero coefficients:

$$ \|\beta\|_0 \;=\; \#\{\,j : \beta_j \neq 0\,\}. $$

A penalty $\lambda \|\beta\|_0$ corresponds to a spike-and-slab prior:

$$ z_j \sim \text{Bernoulli}(\pi), \qquad \beta_j \mid z_j = 0 \;=\; 0, \qquad \beta_j \mid z_j = 1 \;\sim\; q(\beta_j), $$

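where $q$ is some diffuse "slab" distribution on the nonzero coefficients. Each included coefficient ($z_j = 1$) pays $\log\bigl((1-\pi)/\pi\bigr)$ in negative log-prior relative to an excluded one ($z_j = 0$), a positive fee whenever $\pi < 1/2$; the magnitude of an included coefficient is essentially free under the slab. Treating that flat per-variable fee, rescaled by the noise variance, as $\lambda$ gives the $\ell_0$-penalised objective: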

$$ \hat\beta_{\ell_0} \;=\; \arg\min_\beta\;\sum_i (y_i - x_i^\top\beta)^2 \;+\; \lambda\,\|\beta\|_0. $$

Why L0 is literal model selection, and why we usually don't solve it. The L0 penalty charges a flat fee per included variable and is indifferent to how large the included coefficients get. That is the Bayesian statement of "choose which variables matter, not how big they are." The cost is computational: exact minimisation by brute force means searching over $2^p$ subsets, and the problem is NP-hard in general. LASSO is the convex relaxation: the tightest convex penalty that still encourages sparsity, and the reason L1 became the default in practice.
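For small $p$ the exhaustive search is feasible, and a brute-force sketch makes the "flat fee per variable" concrete. Everything here is illustrative: the data, the value of `lam`, and the function name `best_subset` are assumptions, and the loop visits all $2^p$ subsets, so it is only usable for a handful of variables.

```python
import itertools
import numpy as np

def best_subset(X, y, lam):
    """Minimise RSS + lam * (number of included variables) by exhaustive search."""
    p = X.shape[1]
    best_obj = float(y @ y)            # empty model: beta = 0, no fee
    best_beta = np.zeros(p)
    for k in range(1, p + 1):
        for subset in itertools.combinations(range(p), k):
            cols = list(subset)
            # Unpenalised least squares on the chosen columns: no shrinkage when in.
            coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            r = y - X[:, cols] @ coef
            obj = float(r @ r) + lam * k   # k equals ||beta||_0 for generic data
            if obj < best_obj:
                best_obj = obj
                best_beta = np.zeros(p)
                best_beta[cols] = coef
    return best_beta, best_obj

rng = np.random.default_rng(3)
n, p = 100, 6
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 0.0, 0.0])   # illustrative sparse truth
y = X @ beta_true + 0.5 * rng.normal(size=n)

beta_l0, obj_l0 = best_subset(X, y, lam=5.0)
support = np.flatnonzero(beta_l0)
refit, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
print("selected variables:", support)
print("best-subset coefficients:", np.round(beta_l0, 3))
```

The coefficients on the selected variables coincide with a plain least-squares refit on that subset, which is the "no shrinkage when in" behaviour.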

6. Three estimators side by side

Slide $\beta_{\text{OLS}}$ and watch the three estimators respond. In one dimension each estimator minimises $(\beta - \beta_{\text{OLS}})^2$ plus its penalty. The top panel plots the negative log posterior (the thing being minimised); the bottom panel plots the resulting $\hat\beta$ as a function of $\beta_{\text{OLS}}$. Ridge is a straight line through the origin with slope $1/(1+\lambda)$: proportional shrinkage. LASSO is a soft threshold: a flat dead zone of total width $\lambda$ centred on the origin (exact zero whenever $|\beta_{\text{OLS}}| \le \lambda/2$), then a line of slope $1$ shifted toward zero by $\lambda/2$. L0 is a hard threshold: identity outside $\pm\sqrt\lambda$, exact zero inside.

Figure 1 · Shrink, soft-threshold, and hard-threshold side by side. Legend: ridge (L2 / Gaussian prior), LASSO (L1 / Laplace prior), L0 (spike-and-slab), $\beta_{\text{OLS}}$ reference. Slider defaults: $\beta_{\text{OLS}} = 1.60$, $\lambda = 0.80$.
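The three closed forms can be checked against a brute-force minimisation of the one-dimensional objective $(\beta - \beta_{\text{OLS}})^2 + \text{penalty}$. This is a sketch under that assumed objective; the function names and the grid are illustrative, and the inputs reuse the figure's default values $\beta_{\text{OLS}} = 1.60$, $\lambda = 0.80$.

```python
import numpy as np

def ridge_1d(b_ols, lam):
    """Proportional shrinkage: slope 1 / (1 + lam)."""
    return b_ols / (1.0 + lam)

def lasso_1d(b_ols, lam):
    """Soft threshold: zero for |b_ols| <= lam / 2, otherwise shifted by lam / 2."""
    return np.sign(b_ols) * np.maximum(np.abs(b_ols) - lam / 2.0, 0.0)

def l0_1d(b_ols, lam):
    """Hard threshold: identity when b_ols^2 > lam, exact zero otherwise."""
    return np.where(np.square(b_ols) > lam, b_ols, 0.0)

def brute_force(b_ols, lam, penalty):
    """Grid minimiser of (beta - b_ols)^2 + lam * penalty(beta)."""
    half = np.linspace(0.0, 4.0, 40001)
    grid = np.concatenate([-half[:0:-1], half])   # symmetric grid containing exact 0.0
    obj = (grid - b_ols) ** 2 + lam * penalty(grid)
    return grid[np.argmin(obj)]

penalties = {
    "ridge": (ridge_1d, lambda b: b ** 2),
    "lasso": (lasso_1d, np.abs),
    "l0": (l0_1d, lambda b: (b != 0).astype(float)),
}

lam = 0.80
results = {}
for b_ols in (1.60, 0.30):
    for name, (closed_form, penalty) in penalties.items():
        exact = float(closed_form(b_ols, lam))
        numeric = float(brute_force(b_ols, lam, penalty))
        results[(name, b_ols)] = (exact, numeric)
        print(f"{name:5s} b_ols={b_ols:.2f}  closed form={exact:.4f}  grid={numeric:.4f}")
```

At $\beta_{\text{OLS}} = 0.30$ both thresholding rules return exactly zero, since $0.30 \le \lambda/2$ and $0.30^2 < \lambda$, while ridge still returns a small nonzero value.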

Things to notice as you drag:

7. The four-row dictionary

| Estimator | Loss + penalty | Noise model | Prior on $\beta_j$ | Behaviour at $\beta_j = 0$ |
| --- | --- | --- | --- | --- |
| OLS | $\sum(y_i - x_i^\top\beta)^2$ | Gaussian noise | flat (improper) | no shrinkage |
| ridge | $\;\;+\;\lambda\sum\beta_j^2$ | Gaussian noise | $\mathcal N(0,\tau^2)$ | smooth shrinkage, never exactly $0$ |
| LASSO | $\;\;+\;\lambda\sum\lvert\beta_j\rvert$ | Gaussian noise | $\text{Laplace}(0,b)$ | kink at $0$ → exact zeros + shrinkage |
| best subset | $\;\;+\;\lambda\lVert\beta\rVert_0$ | Gaussian noise | spike-and-slab | discrete in/out, no shrinkage when in |

8. LASSO lands on axes because diamond constraints have corners

The 1-D picture above shows soft- vs hard-thresholding cleanly, but it doesn't yet show which coefficient LASSO zeros out when there's more than one to choose from. For that, the canonical picture is the 2-D constraint-region view. Write each estimator as a constrained problem:

$$ \min_\beta\;\sum_i (y_i - x_i^\top\beta)^2 \quad\text{subject to}\quad \|\beta\|_q \le s, \qquad q \in \{0, 1, 2\}. $$

The OLS objective's level sets are ellipses centred at $\beta_{\text{OLS}}$. The constrained estimate is the point where the smallest such ellipse first touches the feasible region. Three regions, three touch geometries:

Figure 2 · OLS ellipses meet L0 (axes), L1 (diamond), L2 (disk). Legend: L2 disk (ridge), L1 diamond (LASSO), L0 axes (best subset), draggable $\beta_{\text{OLS}}$. Slider defaults: constraint radius $r = 1.00$, OLS correlation $0.00$.
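The touch geometry can be sketched in code for the simplest case. The sketch assumes zero correlation with equal scaling, so the level sets are circles and the constrained estimate is the Euclidean projection of $\beta_{\text{OLS}}$ onto the feasible region. The function names are illustrative, the L1 projection uses the standard sort-based algorithm, and the L0 case takes the budget to be one nonzero coefficient, so the feasible region is the pair of axes.

```python
import numpy as np

def project_l2_disk(v, r):
    """Nearest point in {b : ||b||_2 <= r}: rescale if outside the disk."""
    norm = np.linalg.norm(v)
    return v.copy() if norm <= r else v * (r / norm)

def project_l1_diamond(v, r):
    """Nearest point in {b : ||b||_1 <= r}, via the sort-based projection."""
    if np.abs(v).sum() <= r:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]                  # magnitudes, descending
    css = np.cumsum(u)
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u * k > css - r)[0][-1]      # last index that stays active
    theta = (css[rho] - r) / (rho + 1)            # common shift applied to magnitudes
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def project_l0_axes(v):
    """Nearest point with at most one nonzero entry: keep the largest magnitude."""
    out = np.zeros_like(v)
    j = np.argmax(np.abs(v))
    out[j] = v[j]
    return out

beta_ols = np.array([1.6, 0.3])   # illustrative OLS point
r = 1.0                           # constraint radius, as in the figure default

b_l2 = project_l2_disk(beta_ols, r)
b_l1 = project_l1_diamond(beta_ols, r)
b_l0 = project_l0_axes(beta_ols)
print("L2 disk   :", b_l2)   # both coordinates shrunk, neither exactly zero
print("L1 diamond:", b_l1)   # lands on a corner of the diamond
print("L0 axes   :", b_l0)   # keeps the larger coordinate unshrunk
```

With correlated predictors the level sets become tilted ellipses and the projection is no longer Euclidean, so this sketch covers only the zero-correlation setting.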

Drag the OLS point around. Things to notice:

9. The shape-near-zero intuition

The whole story is in the picture of the log-prior near the origin. Three curves, three behaviours:

Once you see the four estimators as MAP under four priors, several familiar facts stop looking arbitrary:

10. MAP discards posterior spread

MAP collapses the posterior to its mode, which means it ignores everything else the posterior was telling you. Two specific costs are worth flagging:

What next