than the length of the dataset (sometimes referred to as the “large $p$, small $n$ problem”) or there is high correlation between the regressors, least squares estimates are very sensitive to random errors, and suffer from over-fitting and numeric instability. Elastic net regularization addresses these problems by adding coefficient penalties to the least squares objective:

$$\min_{\beta_0,\,\beta}\ \frac{1}{2N}\sum_{i=1}^{N}\bigl(y_i - \beta_0 - X_i'\beta\bigr)^{2} \;+\; \lambda\Bigl(\alpha\,\lVert\beta\rVert_{1,\omega} \;+\; \tfrac{1-\alpha}{2}\,\lVert\beta\rVert_{2,\omega}^{2}\Bigr) \tag{37.1}$$
where we have $N$ observations on the dependent variable and $p$ regressors, collected in the vector $y$ and the matrix $X$:

$$y = (y_1, y_2, \ldots, y_N)' \tag{37.2}$$

$$X = \begin{pmatrix} X_{11} & \cdots & X_{1p} \\ \vdots & & \vdots \\ X_{N1} & \cdots & X_{Np} \end{pmatrix} \tag{37.3}$$
with coefficients $\beta = (\beta_1, \beta_2, \ldots, \beta_p)'$, and where the $L_1$ norm is the $\omega$‑weighted sum ($\lVert\beta\rVert_{1,\omega}$) of the absolute values of the coefficient values,

$$\lVert\beta\rVert_{1,\omega} = \sum_{j=1}^{p} \omega_j\,\lvert\beta_j\rvert \tag{37.4}$$

and the $L_2$ norm is the $\omega$‑weighted sum of the squared values of the coefficient values,

$$\lVert\beta\rVert_{2,\omega}^{2} = \sum_{j=1}^{p} \omega_j\,\beta_j^{2} \tag{37.5}$$

for individual penalty weights $\omega_j \geq 0$ for all $j$.
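To make the pieces of the objective concrete, here is a minimal NumPy sketch that evaluates a penalty of this form for a given coefficient vector. The function name, the omission of the intercept, and the exact $1/(2N)$ scaling of the residual term are illustrative assumptions rather than a description of any particular implementation.

```python
import numpy as np

def elastic_net_objective(y, X, beta, lam, alpha, w=None):
    """Evaluate an elastic-net style objective: SSR term plus weighted
    L1 and L2 penalties mixed by alpha (intercept omitted for brevity)."""
    N = len(y)
    if w is None:
        w = np.ones(X.shape[1])          # individual penalty weights omega_j
    resid = y - X @ beta
    ssr_term = (resid @ resid) / (2.0 * N)
    l1 = np.sum(w * np.abs(beta))        # omega-weighted L1 norm, cf. Eq. (37.4)
    l2 = np.sum(w * beta ** 2)           # omega-weighted squared L2 norm, cf. Eq. (37.5)
    return ssr_term + lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2)
```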
In general, the objective in Equation (37.1) does not have a closed-form solution and requires iterative methods, but efficient algorithms are available for solving this problem (Friedman, Hastie, and Tibshirani 2010).
A key component of the penalty is the parameter $\lambda$, ($\lambda \geq 0$), which controls the magnitude of the overall coefficient penalties. Clearly, larger values of $\lambda$ are associated with a stronger coefficient penalty, and smaller values with a weaker penalty. When $\lambda = 0$, the objective reduces to that of standard least squares.
Rather than estimating at a single penalty value, it is common practice to estimate a path of models as $\lambda$ varies from high to low. Doing so requires a rule for constructing the list of candidate $\lambda$ values. Friedman, Hastie, and Tibshirani (2010) propose the following approach.
First, we find the smallest penalty value for which all of the estimated coefficients are 0. The resulting $\lambda_{\max}$ will be the largest value of $\lambda$ to be considered. Next, we use $\lambda_{\max}$ to determine $\lambda_{\min}$, the smallest value of $\lambda$ to be considered, as a predetermined fractional value of $\lambda_{\max}$.
Lastly, we construct a list of $M$ descending penalty values on a natural log scale from $\lambda_{\max}$ to $\lambda_{\min}$ so that $\lambda_m = \lambda_{\max}\,(\lambda_{\min}/\lambda_{\max})^{(m-1)/(M-1)}$, where $m = 1, 2, \ldots, M$.
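As an illustration, a log-spaced penalty grid of this kind might be constructed as follows. The formula used for $\lambda_{\max}$ (the usual coordinate-descent convention for centered data) and the minimum-fraction default are assumptions, not values taken from the text.

```python
import numpy as np

def lambda_path(X, y, alpha=1.0, n_lambdas=100, min_ratio=1e-4):
    """Descending penalty grid on a natural-log scale from lambda_max to lambda_min."""
    N = len(y)
    # smallest penalty at which every coefficient is zero (assumed convention)
    lam_max = np.max(np.abs(X.T @ y)) / (N * max(alpha, 1e-3))
    lam_min = min_ratio * lam_max          # predetermined fraction of lambda_max
    # equally spaced in log(lambda), i.e. a geometric sequence from high to low
    return np.exp(np.linspace(np.log(lam_max), np.log(lam_min), n_lambdas))
```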
Estimation then proceeds sequentially along the path, beginning with $\lambda_1 = \lambda_{\max}$ and starting coefficient values of 0, and continuing sequentially from $\lambda_1$ to $\lambda_M$ or until triggering a stopping rule. The procedure employs what the authors term “warm starts”, with coefficients obtained from estimating the $\lambda_{m-1}$ model used as starting values for estimating the $\lambda_m$ model.
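The warm-start strategy is straightforward to sketch with scikit-learn's ElasticNet estimator, which is used here purely as an illustrative stand-in (note that scikit-learn calls the overall penalty `alpha` and the mixing parameter `l1_ratio`).

```python
import numpy as np
from sklearn.linear_model import ElasticNet

def fit_path(X, y, lambdas, l1_ratio=0.5):
    """Estimate the model at each penalty value, reusing the previous fit's
    coefficients as starting values (warm starts)."""
    model = ElasticNet(l1_ratio=l1_ratio, warm_start=True, max_iter=10000)
    coefs = []
    for lam in lambdas:                 # lambdas assumed sorted from high to low
        model.set_params(alpha=lam)     # scikit-learn's "alpha" is the overall penalty
        model.fit(X, y)
        coefs.append(model.coef_.copy())
    return np.array(coefs)
```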
While $\lambda_{\max}$ is derived so that the coefficient values are all 0, an arbitrarily specified fractional value $\lambda_{\min}$ has no general interpretation. Further, estimation at a given $\lambda$ for $\lambda < \lambda_{\max}$ can be numerically taxing, might not converge, or may offer negligible additional value over prior estimates. To account for these possibilities, we may specify rules to truncate the $\lambda$-path of estimated models when encountering non-convergent or redundant results.
For example, we might end estimation at a given $\lambda$ if the $R^2$ exceeds some threshold, or the relative change in the SSR falls below some value. Similarly, if model parsimony is a priority, we might end estimation at a given $\lambda$ if the number of non-zero coefficients exceeds a specific value or a specified fraction of the sample size.
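A possible sketch of such truncation rules, intended to be called inside the path loop sketched above; the specific thresholds are placeholders.

```python
def should_stop(coefs, n_obs, max_nonzero_frac=0.5,
                ssr_prev=None, ssr=None, rel_tol=1e-5):
    """Truncate the lambda path when results become redundant or too dense:
    too many non-zero coefficients relative to the sample size, or an SSR
    that barely changes between consecutive penalty values."""
    too_many = (coefs != 0).sum() > max_nonzero_frac * n_obs
    stalled = (ssr_prev is not None and ssr is not None
               and abs(ssr_prev - ssr) <= rel_tol * max(ssr_prev, 1e-12))
    return too_many or stalled
```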
Once the path has been estimated, cross-validation may be used to select a single preferred penalty value from the complete path. Cross-validation involves partitioning the data into training and test sets, estimating over the entire $\lambda$ path using the training set, and using the coefficients and the test set to compute evaluation statistics. For a single training and test set pair, the penalty parameter associated with the best evaluation statistic is selected as $\lambda^{*}$. When there are multiple training and test sets, as in $K$-fold cross-validation, we average the statistics across the multiple sets, and the penalty parameter for the model with the best average value is selected as $\lambda^{*}$. Further, we may compute the standard error of the mean of the cross-validation evaluation statistics, and determine the penalty values $\lambda_{1se}$ and $\lambda_{2se}$ corresponding to the most penalized models with average values within one and two standard errors of the optimum.
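Here is a hedged sketch of $K$-fold cross-validation over a penalty path, including the usual one-standard-error selection; the estimator, fold count, and MSE criterion are illustrative choices rather than the procedure described in the text.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import KFold

def cv_select(X, y, lambdas, l1_ratio=0.5, n_folds=5):
    """Return the penalty with the best average test MSE and the largest
    penalty within one standard error of that optimum."""
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=0)
    mse = np.zeros((n_folds, len(lambdas)))
    for f, (train, test) in enumerate(kf.split(X)):
        model = ElasticNet(l1_ratio=l1_ratio, warm_start=True, max_iter=10000)
        for j, lam in enumerate(lambdas):          # high-to-low path, warm starts
            model.set_params(alpha=lam)
            model.fit(X[train], y[train])
            mse[f, j] = np.mean((y[test] - model.predict(X[test])) ** 2)
    avg = mse.mean(axis=0)
    se = mse.std(axis=0, ddof=1) / np.sqrt(n_folds)
    best = np.argmin(avg)
    # most penalized model within one standard error of the optimum
    within_1se = np.where(avg <= avg[best] + se[best])[0]
    lam_1se = lambdas[within_1se.min()]            # lambdas sorted high to low
    return lambdas[best], lam_1se
```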
The behavior of the estimator depends on the components of the coefficient penalty in the objective,

$$\lambda\sum_{j=1}^{p}\omega_j\Bigl(\alpha\,\lvert\beta_j\rvert + \tfrac{1-\alpha}{2}\,\beta_j^{2}\Bigr) \tag{37.6}$$

namely the coefficient penalty terms $\lVert\beta\rVert_{1,\omega}$ and $\lVert\beta\rVert_{2,\omega}^{2}$, the individual penalty weights $\omega_j$, and the mixing parameter $\alpha$.
The $L_2$ penalty term is quadratic with respect to the representative coefficient $\beta_j$. The derivatives of the $L_2$ norm are proportional to $\omega_j\beta_j$, so that the absolute value of the $j$-th derivative increases and decreases with the weighted value of $\beta_j$. Crucially, the derivative approaches 0 as $\beta_j$ approaches 0. In contrast, the $L_1$ penalty term is piecewise linear in the representative coefficient $\beta_j$, with a constant derivative of magnitude $\omega_j$ for all non-zero values of $\beta_j$.
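For concreteness, the derivatives being compared are (written for the weighted penalty terms as reconstructed above, so the constants should be read as indicative):

$$\frac{\partial}{\partial\beta_j}\Bigl[\tfrac{1}{2}\,\omega_j\beta_j^{2}\Bigr] = \omega_j\beta_j
\qquad\text{and}\qquad
\frac{\partial}{\partial\beta_j}\Bigl[\omega_j\,\lvert\beta_j\rvert\Bigr] = \omega_j\,\operatorname{sign}(\beta_j),\quad \beta_j \neq 0$$

so the $L_2$ contribution to the gradient vanishes smoothly at zero, while the $L_1$ contribution keeps magnitude $\omega_j$ however small the coefficient becomes, which is what allows coefficients to be set exactly to zero.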
The $\alpha$ mixing parameter, with $0 \leq \alpha \leq 1$, controls the relative importance of the $L_1$ and $L_2$ penalties in the overall objective. Larger values of $\alpha$ assign more importance to the $L_1$ penalty while smaller values assign more importance to the $L_2$ penalty. Models which feature the $L_1$ norm component tend to have more zero coefficients than models which feature only the $L_2$ penalty, as the derivatives of the $L_1$ norm remain constant as a coefficient approaches zero, while the derivatives of the $L_2$ norm for a given coefficient become negligible in the neighborhood of zero.
The special cases $\alpha = 1$ and $\alpha = 0$ are themselves well-known estimators. Setting $\alpha = 1$ yields a Lasso (Least Absolute Shrinkage and Selection Operator) model that contains only the $L_1$-norm penalty:

$$\min_{\beta_0,\,\beta}\ \frac{1}{2N}\sum_{i=1}^{N}\bigl(y_i - \beta_0 - X_i'\beta\bigr)^{2} + \lambda\,\lVert\beta\rVert_{1,\omega} \tag{37.7}$$

while setting $\alpha = 0$ produces a ridge regression specification with only the $L_2$-penalty:

$$\min_{\beta_0,\,\beta}\ \frac{1}{2N}\sum_{i=1}^{N}\bigl(y_i - \beta_0 - X_i'\beta\bigr)^{2} + \frac{\lambda}{2}\,\lVert\beta\rVert_{2,\omega}^{2} \tag{37.8}$$
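In scikit-learn terms (an illustrative mapping, not part of the text), these special cases correspond to the extreme values of the `l1_ratio` mixing parameter:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + rng.normal(size=100)

lasso = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X, y)  # L1 penalty only, cf. Eq. (37.7)
ridge = ElasticNet(alpha=0.1, l1_ratio=0.0).fit(X, y)  # L2 penalty only, cf. Eq. (37.8)
# (scikit-learn's Ridge estimator is usually preferred for a pure L2 penalty)
print(np.sum(lasso.coef_ == 0), np.sum(ridge.coef_ == 0))
```

With toy data like this, the Lasso fit typically sets some coefficients exactly to zero while the ridge fit shrinks all of them without zeroing any, illustrating the sparsity contrast discussed above.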
Unlike the general elastic net problem, the ridge specification has a closed-form solution. Writing the objective as

$$\frac{1}{2N}\,\bigl(y - \beta_0\iota - X\beta\bigr)'\bigl(y - \beta_0\iota - X\beta\bigr) + \frac{\lambda}{2}\,\beta'\Omega\,\beta \tag{37.9}$$

where $\beta_0$ corresponds to the unpenalized intercept, $\iota$ is a vector of ones, and $\Omega = \mathrm{diag}(\omega_1, \ldots, \omega_p)$. Differentiating the objective yields:

$$-\frac{1}{N}\,\iota'\bigl(y - \beta_0\iota - X\beta\bigr) = 0 \tag{37.10}$$

$$-\frac{1}{N}\,X'\bigl(y - \beta_0\iota - X\beta\bigr) + \lambda\,\Omega\,\beta = 0 \tag{37.11}$$

so that, for the mean-deviated data $\tilde{y}$ and $\tilde{X}$,

$$\hat{\beta} = \bigl(\tilde{X}'\tilde{X} + N\lambda\,\Omega\bigr)^{-1}\tilde{X}'\tilde{y} \tag{37.12}$$

with $\hat{\beta}_0 = \bar{y} - \bar{X}'\hat{\beta}$ and diagonal adjustment weights $N\lambda\omega_j$, where $j = 1, \ldots, p$. While this estimator is a form of ridge regression, it is important to recognize that the parameterization of the diagonal adjustment differs between elastic net ridge and traditional ridge regression, with the former using $N\lambda\omega_j$ and the latter using $\lambda$. Note also that the elastic net ridge solution above leaves the intercept coefficient unpenalized.
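A NumPy sketch of the two closed forms under the parameterizations reconstructed above ($N\lambda\omega_j$ on the diagonal for the elastic net form, $\lambda$ for traditional ridge); the scaling constants follow that reconstruction and should be treated as assumptions.

```python
import numpy as np

def elastic_net_ridge(X, y, lam, w=None):
    """Closed-form ridge solution with N*lambda*omega_j on the diagonal.
    Assumes y and the columns of X are centered so the intercept is handled
    separately (and left unpenalized)."""
    N, p = X.shape
    w = np.ones(p) if w is None else np.asarray(w)
    A = X.T @ X + N * lam * np.diag(w)
    return np.linalg.solve(A, X.T @ y)

def traditional_ridge(X, y, lam):
    """Traditional ridge regression: lambda (not N*lambda*omega_j) on the diagonal."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```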
We may specify individual penalty weights $\omega_j$ for $j = 1, \ldots, p$ to attenuate or accentuate the impact of the overall penalty $\lambda$ on individual coefficients. Setting $\omega_j = 0$ for any $j$ implies that $\beta_j$ is not penalized. We may, for example, express the intercept-excluding penalty in Equation (37.1) using $L_1$ and $L_2$ terms that sum over all $p + 1$ coefficients (including the intercept) with $\omega_0 = 0$:

$$\lambda\sum_{j=0}^{p}\omega_j\Bigl(\alpha\,\lvert\beta_j\rvert + \tfrac{1-\alpha}{2}\,\beta_j^{2}\Bigr) \tag{37.13}$$

where $\omega_j$ is a composite individual penalty weight associated with coefficient $\beta_j$.
The composite weights are formed by scaling the specified weights so that they sum to $p^{*}$, where $p^{*}$ is the number of non-zero $\omega_j$.
It is also worth considering how the penalty interacts with the scaling of the data. Recall first the effect of scaling on ordinary least squares. Scaling the dependent variable (dividing it by $\sigma_y$, say) prior to OLS produces coefficients that equal the unscaled estimates divided by $\sigma_y$ and an SSR equal to the unscaled SSR divided by $\sigma_y^{2}$. Scaling each regressor (dividing $X_j$ by $s_j$) prior to OLS produces coefficients equal to the unscaled estimates multiplied by $s_j$. The optimal SSR is unchanged. In either case the unscaled results are easily recovered, by multiplying the scaled coefficients by $\sigma_y$ (or dividing them by $s_j$) and multiplying the scaled SSR by $\sigma_y^{2}$. Matters are more complicated when the objective includes coefficient penalties.
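These OLS scaling facts are easy to verify numerically; here is a quick check using an arbitrary scale factor standing in for $\sigma_y$ (or $s_j$):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=200)
c = 10.0                                                # arbitrary scale factor

b, *_ = np.linalg.lstsq(X, y, rcond=None)               # unscaled OLS
b_y, *_ = np.linalg.lstsq(X, y / c, rcond=None)         # scale the dependent variable
b_x, *_ = np.linalg.lstsq(X / c, y, rcond=None)         # scale every regressor

ssr = np.sum((y - X @ b) ** 2)
ssr_y = np.sum((y / c - X @ b_y) ** 2)

print(np.allclose(b_y, b / c))          # coefficients divided by the scale
print(np.allclose(ssr_y, ssr / c ** 2)) # SSR divided by the squared scale
print(np.allclose(b_x, b * c))          # coefficients multiplied by the scale
```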
Now consider an elastic net model with scaled dependent variable $y/\sigma_y$ and corresponding coefficients $\beta^{*}$ and penalty $\lambda^{*}$. We will see whether imposing the restricted parametrization $\beta^{*} = \beta/\sigma_y$ alters the objective function. Substituting into the objective for the scaled data gives

$$\frac{1}{2N\sigma_y^{2}}\sum_{i=1}^{N}\bigl(y_i - \beta_0 - X_i'\beta\bigr)^{2} + \lambda^{*}\Bigl(\frac{\alpha}{\sigma_y}\,\lVert\beta\rVert_{1,\omega} + \frac{1-\alpha}{2\sigma_y^{2}}\,\lVert\beta\rVert_{2,\omega}^{2}\Bigr)$$

where $\lambda^{*}$ is the penalty associated with the scaled data objective. Notice that this objective using scaled coefficients is not, in general, a simple scaled version of the objective in Equation (37.1), since the $L_1$ term is scaled by $1/\sigma_y$ while the residual and $L_2$ terms are scaled by $1/\sigma_y^{2}$.
For Lasso models where $\alpha = 1$, the relative importance of the residual and the penalty portions of the objective is unchanged from Equation (37.1) if $\lambda^{*} = \lambda/\sigma_y$. Thus, the estimates for Lasso model coefficients using the scaled dependent variable will be scaled versions of the unscaled data coefficients obtained using an appropriately scaled penalty. If instead $\alpha = 0$, then the relative importance of the residual and the penalty portions of the objective is unchanged for $\lambda^{*} = \lambda$. Thus, for elastic net ridge regression models, the scaling of the dependent variable produces an equivalent model at the same penalty values. For general elastic net models with $0 < \alpha < 1$, however, it is not possible to find a $\lambda^{*}$ where the simple scaling of coefficients does not alter the relative importance of the residual and penalty components, so estimation of the scaled dependent variable model results in coefficients that differ by more than scale.
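The Lasso case can be checked numerically with scikit-learn's Lasso, whose objective uses the same $1/(2N)$ residual scaling assumed above; the data and penalty value here are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.5, 0.0, -0.8, 0.3]) + rng.normal(size=200)
c, lam = 5.0, 0.05                       # scale factor and penalty (arbitrary)

b_unscaled = Lasso(alpha=lam, max_iter=50000, tol=1e-8).fit(X, y).coef_
# scale the dependent variable and the penalty together
b_scaled = Lasso(alpha=lam / c, max_iter=50000, tol=1e-8).fit(X, y / c).coef_
print(np.allclose(b_scaled, b_unscaled / c, atol=1e-6))  # same model, rescaled coefficients
```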
Similarly, consider a model with scaled regressors $X_j/s_j$ and coefficients $\beta_j^{*}$, $j = 1, \ldots, p$, and corresponding penalty $\lambda^{*}$, and impose the restriction that the new coefficients are individually scaled versions of the original coefficients, $\beta_j^{*} = s_j\beta_j$. The objective for the scaled data becomes

$$\frac{1}{2N}\sum_{i=1}^{N}\bigl(y_i - \beta_0 - X_i'\beta\bigr)^{2} + \lambda^{*}\sum_{j=1}^{p}\omega_j\Bigl(\alpha\,s_j\,\lvert\beta_j\rvert + \frac{1-\alpha}{2}\,s_j^{2}\,\beta_j^{2}\Bigr) \tag{37.14}$$

Since no single penalty value $\lambda^{*}$ can offset the differing individual scale factors in both penalty terms, the scaled-regressor estimates will differ from the unscaled estimates by more than just the individual variable scales.
To summarize: scaling the dependent variable (say, by its standard deviation $\sigma_y$) produces ridge regression coefficients that are simple $1/\sigma_y$-scaled versions of those obtained using unscaled data, and for Lasso regression a similar scaled coefficient result holds provided the penalty parameter is also scaled accordingly. Scaling the dependent variable prior to general elastic net estimation ($0 < \alpha < 1$) produces scaled and unscaled coefficient estimates that differ by more than scale. Scaling individual regressors (say, by their standard deviations $s_j$) prior to estimation produces coefficient results that differ by more than the individual scales from those obtained using unscaled regressors.