Bayesian Regression
1. The dictionary
The MAP estimate maximises the posterior, or equivalently minimises its negative log:
$$ \hat\beta_{\text{MAP}} \;=\; \arg\max_\beta\;\log p(y\mid X,\beta) + \log p(\beta) \;=\; \arg\min_\beta\;\bigl[\,-\log p(y\mid X,\beta)\,\bigr] + \bigl[\,-\log p(\beta)\,\bigr]. $$
The first bracket is a data-fit loss; the second is a regularizer. Read the equation in reverse and every penalised-likelihood method you already know turns into a Bayesian model:
| Frequentist piece | Bayesian piece |
|---|---|
| loss $\;\ell(y, X\beta)$ | $-\log p(y\mid X,\beta)$ (likelihood) |
| regularizer $\;\lambda\,R(\beta)$ | $-\log p(\beta)$ (prior) |
| regularization strength $\lambda$ | noise variance relative to prior scale (for ridge, $\sigma^2/\tau^2$) |
| $\arg\min$ (loss + penalty) | MAP estimate |
That is the entire trick. The four sections below just instantiate it for specific choices of likelihood and prior.
2. Least squares ⇔ MLE with Gaussian noise
The model:
$$ y_i \;=\; x_i^\top\beta + \varepsilon_i,\qquad \varepsilon_i \sim \mathcal N(0,\sigma^2). $$
The likelihood factorises across observations:
$$ p(y\mid X,\beta) \;\propto\; \exp\!\left(-\frac{1}{2\sigma^2}\sum_i (y_i - x_i^\top\beta)^2\right). $$
Taking the negative log and dropping additive constants leaves $\tfrac{1}{2\sigma^2}\sum_i (y_i - x_i^\top\beta)^2$. Maximum likelihood is therefore ordinary least squares:
$$ \hat\beta_{\text{OLS}} \;=\; \arg\min_\beta\;\sum_i (y_i - x_i^\top\beta)^2. $$
Why squared error? Because the Gaussian log-density is a quadratic. A residual twice as large costs four times as much in negative log-likelihood, so the estimator pushes hardest against the biggest residuals. Switch the noise distribution to Laplace ($\varepsilon_i \sim \text{Laplace}(0,b)$) and the same derivation gives least absolute deviations, the median regression of robust statistics. Switch to Student-$t$ and you get a robust regression that down-weights outliers.
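The equivalence can be checked numerically. The sketch below is illustrative only: the simulated data, the coefficient vector `beta_true`, and the noise scale `sigma` are all made up for the demonstration. It computes OLS with a generic least-squares solver and then confirms that the gradient of the Gaussian negative log-likelihood vanishes there.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.5, -2.0, 0.5])  # hypothetical ground truth
sigma = 0.7                              # assumed known noise scale
y = X @ beta_true + sigma * rng.normal(size=n)

def gaussian_nll(beta):
    """Negative log-likelihood of y under i.i.d. N(0, sigma^2) noise."""
    resid = y - X @ beta
    return 0.5 * n * np.log(2 * np.pi * sigma**2) + resid @ resid / (2 * sigma**2)

def gaussian_nll_grad(beta):
    """Gradient of the NLL: -X^T (y - X beta) / sigma^2."""
    return -X.T @ (y - X @ beta) / sigma**2

# Ordinary least squares, computed without reference to the likelihood.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# The NLL gradient is numerically zero at the OLS solution.
max_abs_grad = np.abs(gaussian_nll_grad(beta_ols)).max()

# Random perturbations of beta_ols only ever raise the NLL.
perturbed = beta_ols + 0.1 * rng.normal(size=(100, p))
nll_increase = np.array([gaussian_nll(b) - gaussian_nll(beta_ols) for b in perturbed])
print(max_abs_grad, nll_increase.min())
```

Note that `sigma` only rescales the objective; it does not change where the minimum sits, which is why OLS never needs to know the noise variance.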
3. L2 regularizer ⇔ Gaussian prior (ridge)
Place an independent Gaussian prior on each coefficient:
$$ \beta_j \;\sim\; \mathcal N(0, \tau^2) \qquad\Longrightarrow\qquad p(\beta) \;\propto\; \exp\!\left(-\frac{1}{2\tau^2}\sum_j \beta_j^2\right). $$
The negative log posterior, again up to constants, is the ridge objective:
$$ \hat\beta_{\text{ridge}} \;=\; \arg\min_\beta\;\sum_i (y_i - x_i^\top\beta)^2 \;+\; \lambda\sum_j \beta_j^2, \qquad \lambda \;=\; \frac{\sigma^2}{\tau^2}. $$
The regularization strength is the ratio of noise variance to prior variance. Tight prior (small $\tau^2$) means strong shrinkage; vague prior (large $\tau^2$) means $\lambda\to 0$ and ridge collapses back to OLS.
Why ridge shrinks but doesn't select. The Gaussian log-density $-\beta_j^2/2\tau^2$ is differentiable and flat at zero: its slope there is $0$. A small coefficient incurs vanishingly small penalty, so the optimum sits at the place where the likelihood gradient balances the prior gradient. That balance point is generically nonzero. The estimator pulls every coefficient toward zero but stops short of the axis.
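A minimal sketch of the ridge correspondence, assuming simulated data and hypothetical values for $\sigma^2$ and $\tau^2$: the closed-form ridge solution with $\lambda = \sigma^2/\tau^2$ is a stationary point of the negative log posterior, it is shorter than the OLS vector, and none of its entries is exactly zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 4
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.normal(size=n)  # made-up data
sigma2, tau2 = 1.0, 0.25  # assumed noise variance and prior variance

def ridge_map(X, y, sigma2, tau2):
    """Closed-form ridge estimate with lambda = sigma^2 / tau^2."""
    lam = sigma2 / tau2
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def neg_log_posterior_grad(beta, X, y, sigma2, tau2):
    """Gradient of RSS/(2 sigma^2) + ||beta||^2/(2 tau^2)."""
    return -X.T @ (y - X @ beta) / sigma2 + beta / tau2

beta_ridge = ridge_map(X, y, sigma2, tau2)
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_vague = ridge_map(X, y, sigma2, tau2=1e12)  # nearly flat prior

grad_at_ridge = neg_log_posterior_grad(beta_ridge, X, y, sigma2, tau2)
print(beta_ols)
print(beta_ridge)  # shrunk toward zero, but no entry is exactly zero
```

Here `beta_vague` recovers the OLS solution to numerical precision, which is the $\lambda \to 0$ limit described above.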
4. L1 regularizer ⇔ Laplace prior (LASSO)
Swap the Gaussian prior for a Laplace prior: same symmetric shape, but with $|\beta_j|$ in the exponent rather than $\beta_j^2$:
$$ \beta_j \;\sim\; \text{Laplace}(0, b), \qquad p(\beta_j) \;\propto\; \exp\!\left(-\frac{|\beta_j|}{b}\right). $$
The MAP objective becomes the LASSO:
$$ \hat\beta_{\text{LASSO}} \;=\; \arg\min_\beta\;\sum_i (y_i - x_i^\top\beta)^2 \;+\; \lambda\sum_j |\beta_j|, \qquad \lambda \;=\; \frac{2\sigma^2}{b}. $$
The factor of $2$ appears because the Gaussian exponent carries $1/(2\sigma^2)$ while the Laplace exponent carries $1/b$; multiplying the negative log posterior through by $2\sigma^2$ leaves $2\sigma^2/b$ in front of the penalty.
Why LASSO selects. The Laplace log-density $-|\beta_j|/b$ is not differentiable at zero: its left and right slopes are $+1/b$ and $-1/b$. That kink means there is a finite range of likelihood gradients for which the prior pull at $\beta_j = 0$ can dominate from either side, and the optimum sits exactly at zero. The Laplace prior also has heavier tails than a Gaussian of comparable scale, so once a coefficient escapes the kink, the prior tolerates large values better than ridge would. The net effect is shrinkage plus selection: some coefficients are pushed to exactly $0$, the rest are only mildly biased.
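The kink argument can be seen in a small solver. The following is a sketch of cyclic coordinate descent for the objective $\sum_i (y_i - x_i^\top\beta)^2 + \lambda\sum_j|\beta_j|$, one common way to solve the LASSO and not necessarily what any particular library does. The data, `beta_true`, and `lam` are hypothetical. Each coordinate update is a soft threshold, and the threshold is exactly the "finite range of gradients" in which the prior wins.

```python
import numpy as np

def soft_threshold(rho, t):
    """Shrink rho toward zero by t, returning exactly 0 when |rho| <= t."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for RSS + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(p):
            # Residual with feature j's current contribution added back in.
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j
            # Minimiser of col_sq*b^2 - 2*rho*b + lam*|b| over b.
            beta[j] = soft_threshold(rho, lam / 2) / col_sq[j]
    return beta

rng = np.random.default_rng(1)
n, p = 100, 5
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, 0.0, 0.0, -2.0, 0.0])  # hypothetical sparse truth
y = X @ beta_true + rng.normal(size=n)
lam = 100.0                                        # illustrative penalty

beta_lasso = lasso_cd(X, y, lam)
grad = 2 * X.T @ (y - X @ beta_lasso)  # minus the gradient of the RSS term
print(beta_lasso)
```

At the solution, every nonzero coefficient has `grad[j]` equal to `lam * sign(beta_j)`, while every zero coefficient has `|grad[j]| <= lam`: the data gradient was too weak to pull it off the axis.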
5. L0 regularizer ⇔ sparsity prior (best subset)
The $\ell_0$ "norm" counts nonzero coefficients:
$$ \|\beta\|_0 \;=\; \#\{\,j : \beta_j \neq 0\,\}. $$
A penalty $\lambda \|\beta\|_0$ corresponds to a spike-and-slab prior:
$$ z_j \sim \text{Bernoulli}(\pi), \qquad \beta_j \mid z_j = 0 \;=\; 0, \qquad \beta_j \mid z_j = 1 \;\sim\; q(\beta_j), $$
where $q$ is some diffuse "slab" distribution on the nonzero coefficients. Each $z_j = 1$ pays $\log\bigl((1-\pi)/\pi\bigr)$ relative to $z_j = 0$ in negative log-prior terms (a positive fee whenever $\pi < 1/2$); the magnitude of an included coefficient is essentially free under the slab. Treating the slab as approximately flat, the MAP objective is the best-subset problem:
$$ \hat\beta_{\ell_0} \;=\; \arg\min_\beta\;\sum_i (y_i - x_i^\top\beta)^2 \;+\; \lambda\,\|\beta\|_0. $$
Why L0 is literal model selection, and why we usually don't solve it. The L0 penalty charges a flat fee per included variable and is indifferent to how large the included coefficients get. That is the Bayesian statement of "choose which variables matter, not how big they are." The cost is computational: minimising $\ell_0$ requires searching over $2^p$ subsets, which is NP-hard in general. LASSO is the convex relaxation: the tightest convex penalty that still encourages sparsity, and the reason L1 became the default in practice.
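For small $p$ the exhaustive search is easy to write down. The sketch below is a brute-force illustration, not a practical algorithm; the simulated data and `lam` are assumptions. It enumerates every subset, fits OLS on it, and charges $\lambda$ per included variable.

```python
import itertools
import numpy as np

def best_subset(X, y, lam):
    """Exhaustive minimiser of RSS + lam * ||beta||_0 (cost grows as 2^p)."""
    n, p = X.shape
    best_obj, best_beta = y @ y, np.zeros(p)  # start from the empty model
    for k in range(1, p + 1):
        for subset in itertools.combinations(range(p), k):
            cols = list(subset)
            coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            resid = y - X[:, cols] @ coef
            obj = resid @ resid + lam * k      # flat fee per included variable
            if obj < best_obj:
                best_obj = obj
                best_beta = np.zeros(p)
                best_beta[cols] = coef
    return best_beta, best_obj

rng = np.random.default_rng(2)
n, p = 100, 5
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, 0.0, 0.0, -2.0, 0.0])  # hypothetical sparse truth
y = X @ beta_true + rng.normal(size=n)

beta_l0, obj_l0 = best_subset(X, y, lam=20.0)
support = np.flatnonzero(beta_l0)
print(support, beta_l0)
```

The coefficients on the selected support are plain OLS fits on those columns: once a variable is in, nothing shrinks it. The double loop visits $2^p - 1$ nonempty subsets, which is why this approach stops being feasible beyond a few dozen variables.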
6. Three estimators side by side
Slide $\beta_{\text{OLS}}$ and watch the three estimators respond. The top panel plots the negative log posterior (the thing being minimised); the bottom panel plots the resulting $\hat\beta$ as a function of $\beta_{\text{OLS}}$. For a single coefficient with objective $(\beta - \beta_{\text{OLS}})^2$ plus the penalty, ridge is a straight line through the origin with slope $1/(1+\lambda)$: proportional shrinkage. LASSO is a soft threshold: a flat dead zone of total width $\lambda$ centred on the origin ($|\beta_{\text{OLS}}| \le \lambda/2$), then a line of slope $1$ shifted toward zero by $\lambda/2$. L0 is a hard threshold: identity outside $\pm\sqrt\lambda$, exact zero inside.
Things to notice as you drag:
- For small $\beta_{\text{OLS}}$ (close to zero), LASSO and L0 both snap to exactly zero, while ridge stays close to but never reaches the axis. That's the difference between shrinkage and selection.
- For large $\beta_{\text{OLS}}$, L0 returns $\beta_{\text{OLS}}$ unmodified: once a variable is "in", L0 doesn't bias it. Ridge and LASSO still shrink it, ridge by a multiplicative factor, LASSO by a constant offset.
- Increase $\lambda$. Ridge's slope flattens uniformly. LASSO's dead zone widens. L0's gap jumps outward.
- On the top panel, ridge's minimum sits in a smooth valley; LASSO's minimum sits in a V-shaped valley; L0's "minimum" can sit at the discrete notch at $\beta = 0$, visible as a separate dot dropping below the parabola whenever the L0 estimate has selected out.
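The three response curves can be written as one-line functions. This sketch assumes the one-dimensional objective $(\beta - \beta_{\text{OLS}})^2 + \text{penalty}$, matching the slopes and thresholds quoted above; the function names are made up for illustration.

```python
import numpy as np

def ridge_1d(b_ols, lam):
    """Minimiser of (b - b_ols)^2 + lam * b^2: proportional shrinkage."""
    return b_ols / (1.0 + lam)

def lasso_1d(b_ols, lam):
    """Minimiser of (b - b_ols)^2 + lam * |b|: soft threshold at lam / 2."""
    return np.sign(b_ols) * np.maximum(np.abs(b_ols) - lam / 2.0, 0.0)

def l0_1d(b_ols, lam):
    """Minimiser of (b - b_ols)^2 + lam * [b != 0]: hard threshold at sqrt(lam)."""
    return np.where(np.square(b_ols) > lam, b_ols, 0.0)

b_grid = np.linspace(-3.0, 3.0, 13)
lam = 1.0
for b in b_grid:
    print(f"{b:+.2f}  ridge={ridge_1d(b, lam):+.3f}  "
          f"lasso={lasso_1d(b, lam):+.3f}  l0={float(l0_1d(b, lam)):+.3f}")
```

With `lam = 1.0`, an input of `2.0` maps to `1.0` under ridge, `1.5` under LASSO, and `2.0` under L0, while an input of `0.25` maps to `0.125`, `0.0`, and `0.0` respectively.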
7. The four-row dictionary
| Estimator | Loss + penalty | Noise model | Prior on $\beta_j$ | Behaviour at $\beta_j = 0$ |
|---|---|---|---|---|
| OLS | $\sum(y_i - x_i^\top\beta)^2$ | Gaussian noise | flat (improper) | no shrinkage |
| ridge | $\;\;+\;\lambda\sum\beta_j^2$ | Gaussian noise | $\mathcal N(0,\tau^2)$ | smooth shrinkage, never exactly $0$ |
| LASSO | $\;\;+\;\lambda\sum\lvert\beta_j\rvert$ | Gaussian noise | $\text{Laplace}(0,b)$ | kink at $0$ → exact zeros + shrinkage |
| best subset | $\;\;+\;\lambda\lVert\beta\rVert_0$ | Gaussian noise | spike-and-slab | discrete in/out, no shrinkage when in |
8. LASSO lands on axes because diamond constraints have corners
The 1-D picture above shows soft- vs hard-thresholding cleanly, but it doesn't yet show which coefficient LASSO zeros out when there's more than one to choose from. For that, the canonical picture is the 2-D constraint-region view. Write each estimator as a constrained problem:
$$ \min_\beta\;\sum_i (y_i - x_i^\top\beta)^2 \quad\text{subject to}\quad \|\beta\|_q \le r, \qquad q \in \{0, 1, 2\}. $$
The OLS objective's level sets are ellipses centred at $\beta_{\text{OLS}}$. The MAP estimate is the point where the smallest such ellipse touches the feasible region. Three regions, three touch geometries:
Drag the OLS point around. Things to notice:
- The L2 disk is smooth, so the touch point is generically not on either axis. Ridge shrinks but rarely zeros a coefficient.
- The L1 diamond has corners on the axes. For a broad range of $\beta_{\text{OLS}}$ directions, the smallest touching ellipse meets the diamond at one of those corners. At a corner, one coordinate is exactly $0$. That is the geometric reason LASSO produces sparse solutions: most directions in which OLS can sit make the closest feasible point a corner.
- L0 confines the MAP to the axes themselves (one nonzero coordinate at most, for small $r$). The closest point on an axis projects $\beta_{\text{OLS}}$ onto whichever axis lies nearer.
- Crank the OLS correlation. The ellipses tilt at 45°. The L1 diamond's corners still attract the touch point. LASSO's selection is robust to OLS anisotropy.
- Shrink $r$ (tighten the constraint, equivalent to increasing $\lambda$). All three MAPs move toward the origin; the L1 MAP lands on an axis at some finite $r$, whereas the L2 MAP generically reaches zero in a coordinate only at $r = 0$.
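The corner effect can be reproduced with a brute-force search over the boundary of each region. This is a rough numerical sketch, not a proper solver: it assumes the OLS point lies outside the feasible region (so the optimum is on the boundary), and the OLS point `b_ols`, the radius `r`, and the matrix `A` (standing in for $X^\top X$) are all hypothetical.

```python
import numpy as np

def l1_boundary(r, n=2001):
    """Points on the diamond |b1| + |b2| = r, corners included exactly."""
    t = np.linspace(0.0, 1.0, n)
    edge = r * np.stack([t, 1.0 - t], axis=1)      # first-quadrant edge
    signs = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]])
    return np.concatenate([edge * s for s in signs])

def l2_boundary(r, n=8001):
    """Points on the circle b1^2 + b2^2 = r^2."""
    theta = np.linspace(0.0, 2.0 * np.pi, n)
    return r * np.stack([np.cos(theta), np.sin(theta)], axis=1)

def touch_point(b_ols, A, boundary):
    """Boundary point minimising the quadratic (b - b_ols)^T A (b - b_ols)."""
    d = boundary - b_ols
    obj = np.einsum("ni,ij,nj->n", d, A, d)
    return boundary[np.argmin(obj)]

b_ols = np.array([2.0, 0.5])                        # hypothetical OLS point
r = 1.0
for rho in (0.0, 0.5):                              # 0.5 tilts the ellipses
    A = np.array([[1.0, rho], [rho, 1.0]])
    print(rho, touch_point(b_ols, A, l1_boundary(r)),
          touch_point(b_ols, A, l2_boundary(r)))
```

For both values of `rho` the diamond's touch point is the corner `(1, 0)`, with the second coordinate exactly zero, while the disk's touch point keeps both coordinates nonzero.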
9. The shape-near-zero intuition
The whole story is in the picture of the log-prior near the origin. Three curves, three behaviours:
- Quadratic bowl (Gaussian): smooth, slope $0$ at the origin. The prior's pull vanishes as a coefficient approaches zero, so the MAP stops short of the axis. Shrinkage only.
- V-shape (Laplace): kinked, slope $\pm 1/b$ at the origin. There's a finite range of data gradients for which the prior wins, so the MAP lands exactly on the axis. Shrinkage + selection.
- Step (spike-and-slab): a discrete jump at the origin. In or out, no in-between. Selection only, no shrinkage of included coefficients.
Once you see the four estimators as MAP under four priors, several familiar facts stop looking arbitrary:
- Cross-validating $\lambda$ plays the role of empirical Bayes for the prior scale. You're asking the data to choose $\sigma^2/\tau^2$, by predictive error rather than by marginal likelihood.
- The Bayes ridge posterior is Gaussian, because the Gaussian prior is conjugate to the Gaussian likelihood, so ridge gives you not just a point estimate but full uncertainty. (See the conjugate-priors page.)
- The Bayes LASSO posterior is not Gaussian and not conjugate. The MAP is still tractable, but the full posterior needs MCMC or VI (see sampling, variational inference).
- Elastic net ($\lambda_1\|\beta\|_1 + \lambda_2\|\beta\|_2^2$) is MAP under a product of Laplace and Gaussian priors, combining the selection of one with the grouping behaviour of the other.
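The conjugate ridge case is short enough to compute directly. The sketch below assumes known $\sigma^2$ and $\tau^2$ (in practice these would be estimated or given their own priors) and uses simulated data; it returns the Gaussian posterior's mean and covariance, and the mean coincides with the ridge estimate at $\lambda = \sigma^2/\tau^2$.

```python
import numpy as np

def ridge_posterior(X, y, sigma2, tau2):
    """Posterior N(mean, cov) for beta under N(0, tau2 I) prior, N(0, sigma2) noise."""
    p = X.shape[1]
    precision = X.T @ X / sigma2 + np.eye(p) / tau2
    cov = np.linalg.inv(precision)
    mean = cov @ (X.T @ y) / sigma2
    return mean, cov

rng = np.random.default_rng(3)
n, p = 50, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 0.0, -0.5]) + rng.normal(size=n)  # made-up data
sigma2, tau2 = 1.0, 0.5                                   # assumed known

post_mean, post_cov = ridge_posterior(X, y, sigma2, tau2)
post_sd = np.sqrt(np.diag(post_cov))

# Approximate 95% credible intervals, one row per coefficient.
intervals = np.stack([post_mean - 1.96 * post_sd, post_mean + 1.96 * post_sd], axis=1)

# The posterior mean (also the mode) equals the ridge point estimate.
lam = sigma2 / tau2
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(post_mean, beta_ridge)
print(intervals)
```

The intervals are what the point estimate alone cannot provide: they show which coefficients the data pinned down and which remain wide.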
10. MAP discards posterior spread
MAP collapses the posterior to its mode, which means it ignores everything else the posterior was telling you. Two specific costs are worth flagging:
- No uncertainty quantification. The MAP point estimate doesn't tell you which coefficients the data really pinned down vs. which the prior largely chose. The full posterior does. See posterior summaries for why mode, mean, and median disagree.
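- Reparametrization sensitivity. The mode moves under a smooth nonlinear change of variable, because the Jacobian reshapes the density; the median, by contrast, is carried along by any monotone map (the mean only by affine ones). So the LASSO "selection" of $\beta_j$ depends on the chosen parametrization.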
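A toy illustration of the reparametrization point, using a made-up one-dimensional Gaussian posterior on $\beta$ (the values of `mu` and `s` are arbitrary) and the change of variable $\gamma = e^{\beta}$. The density of $\gamma$ is log-normal, and its mode is located numerically on a grid.

```python
import numpy as np

mu, s = 1.0, 0.8          # hypothetical posterior: beta ~ N(mu, s^2)
mode_beta = mu            # mode and median of a Gaussian coincide at mu
median_beta = mu

# Density of gamma = exp(beta): the Gaussian density times the Jacobian 1/gamma.
gamma = np.linspace(1e-3, 20.0, 400001)
dens = np.exp(-(np.log(gamma) - mu) ** 2 / (2 * s**2)) / (gamma * s * np.sqrt(2 * np.pi))

# Mode of gamma: grid argmax of the transformed density.
mode_gamma = gamma[np.argmax(dens)]

# Median of gamma: first grid point where the cumulative mass reaches 1/2.
cdf = np.cumsum(dens) * (gamma[1] - gamma[0])
median_gamma = gamma[np.searchsorted(cdf, 0.5)]

print(np.exp(mode_beta), mode_gamma)      # these disagree
print(np.exp(median_beta), median_gamma)  # these agree up to grid error
```

The median of $\gamma$ is $e^{\mu}$, the image of the median of $\beta$, but the mode of $\gamma$ sits at $e^{\mu - s^2}$, well below the image of the mode of $\beta$. The same mechanism means a MAP estimate computed in one parametrization need not map to the MAP estimate in another.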