Bayesian Regression

Every loss is a likelihood. Every regularizer is a prior. Least squares, ridge, LASSO, and best-subset selection are MAP estimates under four different probabilistic models.

1. The dictionary

The MAP estimate maximises the posterior, or equivalently minimises its negative log:

$$ \hat\beta_{\text{MAP}} \;=\; \arg\max_\beta\;\log p(y\mid X,\beta) + \log p(\beta) \;=\; \arg\min_\beta\;\bigl[\,-\log p(y\mid X,\beta)\,\bigr] + \bigl[\,-\log p(\beta)\,\bigr]. $$

The first bracket is a data-fit loss; the second is a regularizer. Read the equation in reverse and every penalised-likelihood method you already know turns into a Bayesian model:

Frequentist piece	Bayesian piece
loss $\;\ell(y, X\beta)$	$-\log p(y\mid X,\beta)$ (likelihood)
regularizer $\;\lambda\,R(\beta)$	$-\log p(\beta)$ (prior)
regularization strength $\lambda$	ratio of noise variance to prior variance
$\arg\min$ (loss + penalty)	MAP estimate

That is the entire trick. The sections below just instantiate it for specific choices of likelihood and prior.

2. Worked case: ridge is Gaussian MAP

Use Gaussian observation noise and an independent Gaussian prior on each coefficient:

$$ \begin{aligned} y_i &= x_i^\top\beta + \varepsilon_i,\qquad \varepsilon_i \sim \mathcal N(0,\sigma^2),\\ \beta_j &\sim \mathcal N(0, \tau^2). \end{aligned} $$

The Gaussian log likelihood gives squared error. The Gaussian log prior gives an $\ell_2$ penalty. Dropping constants and multiplying by $2\sigma^2$ leaves the ridge objective:

$$ \hat\beta_{\text{ridge}} \;=\; \arg\min_\beta\;\sum_i (y_i - x_i^\top\beta)^2 \;+\; \lambda\sum_j \beta_j^2, \qquad \lambda \;=\; \frac{\sigma^2}{\tau^2}. $$
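Spelled out, the step from the model to this objective is short (treating $\sigma^2$ and $\tau^2$ as known). Up to an additive constant that does not depend on $\beta$, the negative log posterior is

$$ -\log p(y\mid X,\beta) - \log p(\beta) \;=\; \frac{1}{2\sigma^2}\sum_i (y_i - x_i^\top\beta)^2 \;+\; \frac{1}{2\tau^2}\sum_j \beta_j^2 \;+\; \text{const}. $$

Multiplying through by $2\sigma^2 > 0$ does not move the minimiser and gives

$$ \sum_i (y_i - x_i^\top\beta)^2 \;+\; \frac{\sigma^2}{\tau^2}\sum_j \beta_j^2 \;+\; \text{const}, $$

which matches the ridge objective term by term once $\lambda = \sigma^2/\tau^2$.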

The regularization strength is the ratio of noise variance to prior variance. Tight prior (small $\tau^2$) means strong shrinkage; vague prior (large $\tau^2$) means $\lambda\to 0$ and ridge collapses back to OLS. The other three rows in the dictionary below follow by changing either the likelihood (the loss) or the prior shape near zero (the penalty).

Why ridge shrinks but doesn't select. The Gaussian log-density $-\beta_j^2/(2\tau^2)$ is differentiable and flat at zero: its slope there is $0$. A small coefficient incurs a vanishingly small penalty, so the optimum sits where the likelihood gradient balances the prior gradient. That balance point is generically nonzero. The estimator pulls every coefficient toward zero but stops short of the axis.
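A minimal numerical sketch of this balance, assuming NumPy and synthetic data (the sizes, seed, $\sigma$, $\tau$, and true coefficients below are illustrative choices, not values from the text). It computes the Gaussian MAP via the standard ridge closed form $(X^\top X + \lambda I)^{-1}X^\top y$ and checks that the likelihood and prior gradients cancel there:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: illustrative values only.
n, p = 50, 3
sigma, tau = 0.5, 1.0                      # noise sd and prior sd
X = rng.normal(size=(n, p))
beta_true = np.array([1.5, 0.0, -0.7])
y = X @ beta_true + sigma * rng.normal(size=n)

lam = sigma**2 / tau**2                    # lambda = sigma^2 / tau^2

# Ridge / Gaussian-MAP closed form and the OLS solution for comparison.
beta_map = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Gradients of the two negative-log terms, evaluated at the MAP.
grad_nll = -(X.T @ (y - X @ beta_map)) / sigma**2   # from the likelihood
grad_prior = beta_map / tau**2                      # from the prior

print("OLS:", beta_ols)
print("MAP:", beta_map)
print("gradient sum at MAP:", grad_nll + grad_prior)
```

The gradient sum should be zero up to floating-point error, and the MAP coefficients should be shrunk relative to OLS without any of them being exactly zero, including the one whose true value is zero.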

3. Three estimators side by side

Slide $\beta_{\text{OLS}}$ and watch the three estimators respond. The top panel plots the negative log posterior (the thing being minimised), with a small inset showing the prior penalty shape; the bottom panel plots the resulting $\hat\beta$ as a function of $\beta_{\text{OLS}}$. The thresholds below correspond to the 1-D objective $(\beta - \beta_{\text{OLS}})^2 + \lambda R(\beta)$. Ridge is a straight line through the origin with slope $1/(1+\lambda)$: proportional shrinkage. LASSO is a soft threshold: a flat dead zone of total width $\lambda$ centred on the origin (every $|\beta_{\text{OLS}}| \le \lambda/2$ maps to zero), then a line of slope $1$ shifted toward zero by $\lambda/2$. L0 is a hard threshold: identity outside $\pm\sqrt\lambda$, exact zero inside.

Figure 1 · Shrink, soft-threshold, and hard-threshold side by side

Legend: ridge (L2 / Gaussian prior), LASSO (L1 / Laplace prior), L0 (spike-and-slab), $\beta_{\text{OLS}}$ reference. Sliders: $\beta_{\text{OLS}}$ (shown at 1.60), $\lambda$ (shown at 0.80).

Things to notice as you drag:

For small $\beta_{\text{OLS}}$ (close to zero), LASSO and L0 both snap to exactly zero, while ridge stays close to but never reaches the axis. That's the difference between shrinkage and selection.
For large $\beta_{\text{OLS}}$, L0 returns $\beta_{\text{OLS}}$ unmodified: once a variable is "in", L0 doesn't bias it. Ridge and LASSO still shrink it, ridge by a multiplicative factor, LASSO by a constant offset.
Increase $\lambda$. Ridge's slope flattens uniformly. LASSO's dead zone widens. L0's gap jumps outward.
On the top panel, ridge's minimum sits in a smooth valley; LASSO's minimum sits in a V-shaped valley; L0's "minimum" can sit at the discrete notch at $\beta = 0$, visible as a separate dot dropping below the parabola whenever the L0 estimate has selected out.
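The three 1-D rules in this section can be sketched in a few lines. This is an illustration, assuming the objective convention $(\beta - \beta_{\text{OLS}})^2 + \lambda R(\beta)$, which is the one that reproduces the slope $1/(1+\lambda)$ and the thresholds $\lambda/2$ and $\sqrt\lambda$ quoted above; the function names are hypothetical.

```python
import math

def ridge_1d(b_ols: float, lam: float) -> float:
    """Minimiser of (b - b_ols)^2 + lam * b^2: proportional shrinkage."""
    return b_ols / (1.0 + lam)

def lasso_1d(b_ols: float, lam: float) -> float:
    """Minimiser of (b - b_ols)^2 + lam * |b|: soft threshold at lam / 2."""
    return math.copysign(max(abs(b_ols) - lam / 2.0, 0.0), b_ols)

def l0_1d(b_ols: float, lam: float) -> float:
    """Minimiser of (b - b_ols)^2 + lam * [b != 0]: hard threshold at sqrt(lam)."""
    # Keeping b_ols costs lam; dropping it costs b_ols^2.
    return b_ols if b_ols * b_ols > lam else 0.0

lam = 0.8
for b_ols in (0.3, 1.6):
    print(b_ols, ridge_1d(b_ols, lam), lasso_1d(b_ols, lam), l0_1d(b_ols, lam))
```

At $\beta_{\text{OLS}} = 1.6$, $\lambda = 0.8$ the three estimates are roughly $0.89$, $1.20$, and $1.60$; at $\beta_{\text{OLS}} = 0.3$ ridge returns about $0.17$ while LASSO and L0 both return exactly zero.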

4. The four-row dictionary

Estimator	Loss + penalty	Noise model	Prior on $\beta_j$	Behaviour at $\beta_j = 0$
OLS	$\sum(y_i - x_i^\top\beta)^2$	Gaussian noise	flat (improper)	no shrinkage
ridge	$\;\;+\;\lambda\sum\beta_j^2$	Gaussian noise	$\mathcal N(0,\tau^2)$	smooth shrinkage, never exactly $0$
LASSO	$\;\;+\;\lambda\sum|\beta_j|$	Gaussian noise	$\text{Laplace}(0,b)$	kink at $0$ → exact zeros + shrinkage
best subset	$\;\;+\;\lambda\|\beta\|_0$	Gaussian noise	spike-and-slab	discrete in/out, no shrinkage when in
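The LASSO row follows by the same steps as the ridge case. As a worked sketch, assume the Laplace prior is parametrised by a scale $b$, with density $p(\beta_j) = \frac{1}{2b}\exp(-|\beta_j|/b)$ (reading the table's $\text{Laplace}(0,b)$ as location $0$, scale $b$, which is an assumption). Then

$$ -\log p(\beta) = \frac{1}{b}\sum_j |\beta_j| + \text{const}, \qquad 2\sigma^2\Bigl[\frac{1}{2\sigma^2}\sum_i (y_i - x_i^\top\beta)^2 + \frac{1}{b}\sum_j|\beta_j|\Bigr] = \sum_i (y_i - x_i^\top\beta)^2 + \frac{2\sigma^2}{b}\sum_j|\beta_j|, $$

so under this parametrisation $\lambda = 2\sigma^2/b$: a tighter Laplace prior (smaller $b$) means a larger penalty, just as a smaller $\tau^2$ does for ridge.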

5. LASSO lands on axes because diamond constraints have corners

The 1-D picture above shows soft- vs hard-thresholding cleanly, but it doesn't yet show which coefficient LASSO zeros out when there's more than one to choose from. For that, the canonical picture is the 2-D constraint-region view. Write each estimator as a constrained problem:

$$ \min_\beta\;\sum_i (y_i - x_i^\top\beta)^2 \quad\text{subject to}\quad \|\beta\|_q \le r, \qquad q \in \{0, 1, 2\}. $$

The OLS objective's level sets are ellipses centred at $\beta_{\text{OLS}}$. The MAP estimate is the point where the smallest such ellipse first touches the feasible region (if $\beta_{\text{OLS}}$ already lies inside the region, the estimate is $\beta_{\text{OLS}}$ itself). Three regions, three touch geometries:

Figure 2 · OLS ellipses meet L0 (axes), L1 (diamond), L2 (disk)

Legend: L2 disk (ridge), L1 diamond (LASSO), L0 axes (best subset), $\beta_{\text{OLS}}$ (draggable). Sliders: constraint radius $r$ (shown at 1.00), OLS correlation (shown at 0.00).

Drag the OLS point around. Things to notice:

The L2 disk is smooth, so the touch point is generically not on either axis. Ridge shrinks but rarely zeros a coefficient.
The L1 diamond has corners on the axes. For a broad range of $\beta_{\text{OLS}}$ directions, the smallest touching ellipse meets the diamond at one of those corners. At a corner, one coordinate is exactly $0$. That is the geometric reason LASSO produces sparse solutions: many of the positions in which OLS can sit make the closest feasible point a corner.
L0 confines the MAP to the axes themselves (one nonzero coordinate at most, for small $r$). With circular level sets, the MAP is the projection of $\beta_{\text{OLS}}$ onto whichever axis lies nearer.
Crank the OLS correlation. The ellipses tilt at 45°. The L1 diamond's corners still attract the touch point. LASSO keeps selecting under OLS anisotropy, although which coefficient it keeps can change with the tilt.
Shrink $r$ (tighten the constraint, equivalent to increasing $\lambda$). All three MAPs move toward the origin; the L1 MAP typically lands on an axis once $r$ is small enough, whereas the L2 MAP generically stays off the axes for every $r > 0$.
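The three touch geometries can be reproduced numerically in the simplest setting. The sketch below assumes zero OLS correlation and equal curvature in both directions, so the level sets are circles and each constrained estimate is just the Euclidean projection of $\beta_{\text{OLS}}$ onto the region. The function names are hypothetical, and the L1 routine is the standard sort-based projection onto an $\ell_1$ ball rather than anything specific to this page.

```python
import math

def project_l2(b, r):
    """Euclidean projection onto the disk ||b||_2 <= r."""
    norm = math.hypot(*b)
    return list(b) if norm <= r else [r * v / norm for v in b]

def project_l1(b, r):
    """Euclidean projection onto the diamond ||b||_1 <= r (requires r > 0)."""
    if r <= 0:
        raise ValueError("r must be positive")
    if sum(abs(v) for v in b) <= r:
        return list(b)
    mags = sorted((abs(v) for v in b), reverse=True)
    cumsum, theta = 0.0, 0.0
    for k, m in enumerate(mags, start=1):
        cumsum += m
        t = (cumsum - r) / k
        if m - t > 0:
            theta = t  # the last k that passes this test sets the threshold
    # Soft-threshold every coordinate by theta.
    return [math.copysign(max(abs(v) - theta, 0.0), v) for v in b]

def project_l0(b, k=1):
    """Keep the k largest-magnitude coordinates and zero the rest."""
    keep = sorted(range(len(b)), key=lambda j: abs(b[j]), reverse=True)[:k]
    return [v if j in keep else 0.0 for j, v in enumerate(b)]

b_ols = [1.6, 0.3]
print("L2:", project_l2(b_ols, 1.0))
print("L1:", project_l1(b_ols, 1.0))
print("L0:", project_l0(b_ols))
```

For $\beta_{\text{OLS}} = (1.6, 0.3)$ and $r = 1$, the L1 projection lands on the corner $(1, 0)$, while the L2 projection keeps both coordinates nonzero. A point such as $(0.9, 0.8)$ projects onto an edge of the diamond instead, which is why the corner argument holds for a broad range of directions rather than all of them.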

6. The shape-near-zero intuition

The whole story is in the picture of the negative log-prior (the penalty) near the origin. Three curves, three behaviours:

Quadratic bowl (Gaussian): smooth, slope $0$ at the origin. The prior never insists on the axis, so the MAP sits near the axis but not on it. Shrinkage only.
V-shape (Laplace): kinked, slope $\pm 1/b$ at the origin. There's a finite range of data gradients for which the prior wins, so the MAP lands exactly on the axis. Shrinkage + selection.
Step (spike-and-slab): a discrete jump at the origin. In or out, no in-between. Selection only, no shrinkage of included coefficients.
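The kink argument can be made precise in one dimension. As a worked sketch, assume the 1-D LASSO objective $(\beta - \beta_{\text{OLS}})^2 + \lambda|\beta|$ (the convention consistent with the $\lambda/2$ soft threshold in Section 3). The optimality condition uses the subdifferential of $|\beta|$:

$$ 0 \;\in\; 2(\beta - \beta_{\text{OLS}}) + \lambda\,\partial|\beta|, \qquad \partial|\beta| = \begin{cases} \{\operatorname{sign}(\beta)\}, & \beta \ne 0,\\ [-1, 1], & \beta = 0. \end{cases} $$

At $\beta = 0$ this reads $2\beta_{\text{OLS}} \in [-\lambda, \lambda]$, so zero is optimal for the whole interval $|\beta_{\text{OLS}}| \le \lambda/2$. The same condition for the ridge penalty $\lambda\beta^2$ is $2(\beta - \beta_{\text{OLS}}) + 2\lambda\beta = 0$, which holds at $\beta = 0$ only when $\beta_{\text{OLS}} = 0$ exactly.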

Once you see the four estimators as MAP under four priors, several familiar facts stop looking arbitrary:

Cross-validating $\lambda$ plays the role of empirical Bayes for the prior scale. Both ask the data to choose $\sigma^2/\tau^2$, although cross-validation targets predictive error while empirical Bayes maximises the marginal likelihood.
The Bayesian ridge posterior is Gaussian, because the Gaussian prior is conjugate to the Gaussian likelihood (with $\sigma^2$ known), so the Bayesian treatment of ridge gives you not just a point estimate but full uncertainty. (See the conjugate-priors page.)
The Bayesian LASSO posterior is not Gaussian and not conjugate. The MAP is still tractable, but the full posterior needs MCMC or VI (see sampling, variational inference).
Elastic net ($\lambda_1\|\beta\|_1 + \lambda_2\|\beta\|_2^2$) is MAP under a prior proportional to the product of a Laplace and a Gaussian density, combining the selection of one with the grouping behaviour of the other.
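For reference, the Gaussian posterior behind the ridge case can be written out explicitly. This is the standard conjugate result, stated here under the assumption that $\sigma^2$ and $\tau^2$ are known; the symbols $\mu$ and $\Sigma$ are introduced only for this display:

$$ \beta \mid y, X \;\sim\; \mathcal N(\mu, \Sigma), \qquad \Sigma = \Bigl(\tfrac{1}{\sigma^2}X^\top X + \tfrac{1}{\tau^2}I\Bigr)^{-1} = \sigma^2\,(X^\top X + \lambda I)^{-1}, \qquad \mu = \tfrac{1}{\sigma^2}\Sigma X^\top y = (X^\top X + \lambda I)^{-1}X^\top y, $$

with $\lambda = \sigma^2/\tau^2$. The posterior mean $\mu$ coincides with $\hat\beta_{\text{ridge}}$ (mean and mode agree for a Gaussian), and the diagonal of $\Sigma$ gives the marginal posterior variance of each coefficient.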

7. MAP discards posterior spread

MAP collapses the posterior to its mode, which means it ignores everything else the posterior was telling you. Two specific costs are worth flagging:

No uncertainty quantification. The MAP point estimate doesn't tell you which coefficients the data really pinned down vs. which the prior largely chose. The full posterior does. See posterior summaries for why mode, mean, and median disagree.
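Reparametrization sensitivity. The mode moves under a smooth nonlinear change of variable, because the density picks up a Jacobian factor. The median of a scalar parameter is carried along by any monotone transformation, and the mean only by affine ones. So the LASSO "selection" of $\beta_j$ depends on the chosen parametrization.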
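A small check of how the three summaries behave under a nonlinear change of variable. The log-normal example is chosen purely for illustration and is not from the text: if $Y = \log X \sim \mathcal N(\mu, s^2)$, the summaries of $X$ have closed forms that can be compared with the summaries of $Y$ mapped back through $\exp$.

```python
import math

mu, s = 0.0, 1.0  # Y = log X ~ N(mu, s^2), so X is log-normal

# Closed-form summaries of X (log-normal) and of Y (normal).
mode_x = math.exp(mu - s**2)
median_x = math.exp(mu)
mean_x = math.exp(mu + s**2 / 2)
mode_y = median_y = mean_y = mu

def lognormal_pdf(x: float) -> float:
    """Density of X; the 1/x factor is the Jacobian of y = log x."""
    z = (math.log(x) - mu) / s
    return math.exp(-0.5 * z * z) / (x * s * math.sqrt(2 * math.pi))

# Numerical confirmation of the mode of X on a grid over (0, 5].
grid = [i / 1000 for i in range(1, 5001)]
mode_grid = max(grid, key=lognormal_pdf)

print("mode  : exp(mode_y)   =", math.exp(mode_y), " mode_x   =", mode_x, " grid =", mode_grid)
print("median: exp(median_y) =", math.exp(median_y), " median_x =", median_x)
print("mean  : exp(mean_y)   =", math.exp(mean_y), " mean_x   =", mean_x)
```

Only the median survives the round trip: mapping the mode or the mean of $Y$ through $\exp$ does not give the mode or the mean of $X$.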

What next

Hierarchy

Hierarchical Bayes

Ridge as a hierarchical model: treating $\tau^2$ as unknown, with a hyperprior or a marginal-likelihood estimate, gives the Bayesian counterpart of cross-validating $\lambda$.

Decision

Posterior Summaries & Bayes Risk

MAP is the Bayes estimator under zero-one loss. Squared and absolute losses pick the posterior mean and median instead.

Closed form

Conjugate Priors & the Exponential Family

The Gaussian–Gaussian pair behind ridge regression is conjugate; the Laplace–Gaussian pair behind LASSO is not.

Likelihood

Fisher Information

The likelihood-geometry view of the OLS objective and why the Gaussian noise model gives quadratic loss.