As an example of using variable selection in EViews, we will estimate a model of hourly electricity spot prices. We will follow, in spirit, the analysis in Uniejewski, Nowotarski and Weron (2016), estimating a model with a large number of candidate variables to produce a one-day-ahead forecast of the spot price.
We have a workfile containing hourly data from 2015 to 2018 on the electricity spot price for a region in the United States, along with hourly electricity load data (how much electricity is used) and day-ahead load forecasts for the same region.
As the dependent variable we use the series TPRICE, which contains the deviation of the log of the spot price from the mean of the log of spot prices for the same hour.
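As a rough illustration of how a series like TPRICE could be constructed outside EViews, the following Python sketch subtracts the hour-of-day mean of log prices from each log price. The function name and the toy prices are made up for illustration, and we assume the mean is taken over the full sample for each hour of the day.

```python
import math
from collections import defaultdict


def hourly_log_deviation(hours, prices):
    """Return log(price) minus the mean of log(price) for the same hour of day."""
    logs = [math.log(p) for p in prices]
    sums = defaultdict(float)
    counts = defaultdict(int)
    for h, lp in zip(hours, logs):
        sums[h] += lp
        counts[h] += 1
    means = {h: sums[h] / counts[h] for h in sums}
    return [lp - means[h] for h, lp in zip(hours, logs)]


# Made-up prices: two days, observed at hour 0 and hour 12 on each day.
hours = [0, 12, 0, 12]
prices = [20.0, 40.0, 30.0, 50.0]
tprice = hourly_log_deviation(hours, prices)
```

By construction, the deviations for any given hour of the day sum to zero.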
We have a total of 107 search regressors:
3 days (72 hours) of lags of TPRICE (72)
a single one week (168 hour) lag of TPRICE (1)
the minimum TPRICE of the previous day, two days before and three days before (3)
the maximum TPRICE of the previous day, two days before and three days before (3)
the mean TPRICE of the previous day, two days before and three days before (3)
log of electricity load for current hour, same hour previous day and same hour previous week (3)
log of forecasted load for same hour next day (1)
day of week dummy variables (set to zero for weekday holidays) (7)
interaction between dummy variables and log of load (7)
interaction between dummy variables and TPRICE (7)
These search regressors are stored in the group SEARCHREGS.
The model also has an always-included constant.
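The regressor count can be checked with a short tally. This Python sketch simply sums the block sizes listed above; the labels are our own shorthand, and we assume seven load interactions (one per day-of-week dummy), which is what the stated total of 107 implies.

```python
# Sizes of the candidate regressor blocks listed above.
# Assumption: seven load interactions, one per day-of-week dummy.
blocks = {
    "lags 1-72 of TPRICE": 72,
    "168-hour lag of TPRICE": 1,
    "daily minimum of TPRICE, previous 3 days": 3,
    "daily maximum of TPRICE, previous 3 days": 3,
    "daily mean of TPRICE, previous 3 days": 3,
    "log load: current hour, previous day, previous week": 3,
    "log of day-ahead load forecast": 1,
    "day-of-week dummies": 7,
    "dummy x log load interactions": 7,
    "dummy x TPRICE interactions": 7,
}
total = sum(blocks.values())
print(total)  # 107
```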
For this example, we will model TPRICE at noon each day between January 1, 2015 and December 30, 2018, and will use those estimates to forecast TPRICE for December 31, 2018, a one-observation forecast.
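To make the sample definition concrete, the following Python sketch builds an hourly calendar for 2015 to 2018 and keeps only the noon observations. It is not EViews syntax; the variable names are illustrative, and the timestamps are naive, so daylight saving changes are ignored.

```python
from datetime import date, datetime, timedelta

# Hourly timestamps covering 2015-2018 (naive times; daylight saving is ignored).
start, stop = datetime(2015, 1, 1, 0), datetime(2018, 12, 31, 23)
n_hours = int((stop - start).total_seconds() // 3600) + 1
stamps = [start + timedelta(hours=h) for h in range(n_hours)]

# Estimation sample: noon observations from January 1, 2015 to December 30, 2018.
estimation = [t for t in stamps
              if t.hour == 12 and date(2015, 1, 1) <= t.date() <= date(2018, 12, 30)]

# Forecast sample: the single noon observation on December 31, 2018.
forecast = [t for t in stamps if t.hour == 12 and t.date() == date(2018, 12, 31)]

print(len(estimation), len(forecast))  # 1460 1
```

The 1,460 estimation observations are an upper bound: lagged regressors with missing values at the start of the sample reduce the number actually used.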
Stepwise Example
To begin, we’ll use a stepwise regression to select the most appropriate regressors. We click on Quick/Estimate Equation… to bring up the Equation Estimation dialog, and then change the Method dropdown to VARSEL-Variable selection and Stepwise Least Squares.
We enter “TPRICE C” in the first part of the Equation Specification, and then “SEARCHREGS” in the Search Regressors box.
We’ll leave the Selection method as Stepwise, and keep the options on the Options tab at their default settings.
Finally, we set the sample to the estimation dates we want, ensuring that only the noon observations are used.
Clicking OK performs the stepwise variable selection, and estimates the final model. The top portion of the output describes the selection process:
The final line warns that the final estimation results use a different sample than the one used during selection. Some of the search regressors contain NAs (in our case because of lags), so the affected observations are dropped from the selection process. Some of those regressors were not selected, however, so the observations they excluded can be re-included in the final estimation.
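The sample difference can be seen in a small Python sketch. The data and the helper name are made up; missing values are represented by None, and we suppose that only the short lag survives selection.

```python
def complete_rows(data, names):
    """Indices of rows where none of the named variables is missing (None)."""
    n = len(data[names[0]])
    return [i for i in range(n) if all(data[v][i] is not None for v in names)]


# Made-up data: a one-period lag and a three-period lag of the same series.
y = [1.0, 2.0, 3.0, 4.0, 5.0]
data = {
    "y": y,
    "lag1": [None] + y[:-1],
    "lag3": [None, None, None] + y[:-3],
}

# During selection every candidate must be available in each row.
selection_rows = complete_rows(data, ["y", "lag1", "lag3"])

# Suppose only lag1 is selected: rows lost to lag3 can be used again.
final_rows = complete_rows(data, ["y", "lag1"])

print(selection_rows, final_rows)  # [3, 4] [1, 2, 3, 4]
```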
The middle part of the output (partially shown here) displays the selected variables, their coefficients, standard errors, t-statistics and probability values.
In this case a total of 59 of the 107 search regressors were selected.
The bottom of the output (partially shown) describes the selection process the stepwise regression undertook.
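For readers who want to see the general idea behind forward stepwise selection, here is a deliberately simplified Python sketch. It is not the EViews algorithm: it only adds variables (no backward checks), and it uses an F-to-enter threshold of 4.0 as a stand-in assumption rather than the p-value stopping rules in the dialog. All function names and data are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def rss(cols, y):
    """Residual sum of squares from an OLS regression of y on the columns."""
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in cols]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * c[i] for b, c in zip(beta, cols))) ** 2
               for i, yi in enumerate(y))


def forward_select(y, always, candidates, f_enter=4.0):
    """Add candidates one at a time while the best one passes an F-to-enter test."""
    chosen, current = [], rss(always, y)
    while len(chosen) < len(candidates):
        base = always + [candidates[c] for c in chosen]
        trial = {name: rss(base + [col], y)
                 for name, col in candidates.items() if name not in chosen}
        best = min(trial, key=trial.get)
        dof = len(y) - len(base) - 1
        # Stop when F = (current - new) / (new / dof) falls below f_enter.
        if dof <= 0 or (current - trial[best]) * dof < f_enter * trial[best]:
            break
        chosen.append(best)
        current = trial[best]
    return chosen


# Made-up data: y depends on x1 and x2 but not on x3.
const = [1.0] * 12
x1 = [float(i) for i in range(12)]
x2 = [1.0 if i % 2 == 0 else -1.0 for i in range(12)]
x3 = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0, 3.0, 5.0, 8.0]
noise = [0.1, -0.2, 0.05, 0.15, -0.1, 0.0, -0.05, 0.2, -0.15, 0.1, -0.1, 0.0]
y = [2.0 + 1.5 * a - 3.0 * b + e for a, b, e in zip(x1, x2, noise)]

selected = forward_select(y, [const], {"x1": x1, "x2": x2, "x3": x3})
```

The constant is passed in `always`, mirroring the always-included constant in the example above; the first two variables to enter are the ones that actually drive `y`.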
Swapwise Example
For a second estimation of the model, we will use the swapwise algorithm. We again click on Quick/Estimate Equation… to bring up the Equation Estimation dialog, and fill in the variables and sample as before. We change the Selection method dropdown to Swapwise.
Switching to the Options tab, we switch to Min R-squared increment and instruct EViews to find the best 60 regressors (which seems a reasonable number given the stepwise procedure’s selection of 59).
The output from this equation is similar to that of the first—the top portion displays a summary of the estimation, the middle section provides the estimation results, and the bottom portion displays the selection process.
Auto-search/GETS Example
Our third estimation will use the Auto-search/GETS algorithm. On the Specification page, change the Selection method to Auto-Search/GETS, and then click on the Options tab to alter some of the default settings. We’ll change the criterion used to decide between the final candidate models to the Schwarz criterion using the Criterion dropdown. We also elect to use only the AR LM and PET tests as diagnostic tests during the path search.
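To illustrate how an information criterion can decide between candidate models, the following Python sketch computes the Schwarz criterion from a Gaussian log-likelihood and picks the model with the smallest value. The three candidate models and their residual sums of squares are invented for illustration, and the per-observation scaling is an assumption about the reporting convention.

```python
import math


def gaussian_loglik(rss, n):
    """Concentrated Gaussian log-likelihood of a linear regression."""
    return -0.5 * n * (1.0 + math.log(2.0 * math.pi) + math.log(rss / n))


def schwarz(rss, n, k):
    """Schwarz criterion per observation: -2*l/n + k*log(n)/n."""
    return -2.0 * gaussian_loglik(rss, n) / n + k * math.log(n) / n


# Hypothetical terminal models: (label, residual sum of squares, coefficients).
n = 1000
candidates = [("A", 100.0, 30), ("B", 101.0, 27), ("C", 120.0, 20)]

best = min(candidates, key=lambda m: schwarz(m[1], n, m[2]))
print(best[0])  # B
```

Model A fits slightly better than B, but the penalty on its three extra coefficients outweighs the gain, so the smaller model B is preferred.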
The output from this estimation is similar to the previous two; the upper portion describes the selection method, the middle displays the final model’s coefficient estimates and associated statistics, and the bottom provides brief detail on the selection process. In contrast to the stepwise procedure, Auto-search has selected only 27 regressors.
Lasso Example
In this example we will reproduce an example from Statistical Learning with Sparsity, by Trevor Hastie, Robert Tibshirani, and Martin Wainwright.
Section 2.2, Table 2.2 contains the results of a short analysis on a set of crime data. The dataset consists of six variables, with the dependent variable being the total overall reported crime rate per one million residents in fifty US cities. The independent variables are annual police funding in dollars per resident, percent of people 25 years and older with four years of high school, percent of 16- to 19-year-olds not in high school and not high school graduates, percent of 18- to 24-year-olds in college, and percent of people 25 years and older with at least four years of college.
For the first example we will replicate the leftmost side of Table 2.2. This is a simple OLS calculation, and using LS in EViews we get an exact match on the coefficients and standard errors. We can see that the FUNDING variable has the biggest influence on the overall crime rate:
For the second part we will fit a lasso model to the same data. The comparison is with the middle set of coefficients in Table 2.2. In EViews we choose the ENET procedure with the dependent variable, a constant, and all five independent variables. We select the lasso penalty and leave the lambda field blank so that EViews automatically generates a lambda sequence and performs cross-validation. We also standardize the independent variables and use 10-fold cross-validation. The resulting coefficients are not an exact match because of the randomness associated with cross-validation, but the relative magnitudes and signs are the same. They also match our intuition about lasso compared to OLS. The coefficients have shrunk toward zero, and in fact two of them are zero at the value of lambda that minimizes the cross-validation error. Again, the FUNDING variable has the greatest influence:
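The shrinkage behavior described above can be reproduced on toy data. The following Python sketch fits a lasso by coordinate descent with soft-thresholding on standardized regressors. It is an illustration of the technique, not the ENET implementation: the data are made up (not the crime data), lambda is fixed by hand rather than chosen by cross-validation, and the objective scaling (1/(2n) times the residual sum of squares plus lambda times the L1 norm) is an assumption.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam, returning exactly zero inside [-lam, lam]."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def standardize(col):
    """Center a column and scale it to unit (population) standard deviation."""
    n = len(col)
    mean = sum(col) / n
    sd = (sum((v - mean) ** 2 for v in col) / n) ** 0.5
    return [(v - mean) / sd for v in col]


def lasso(cols, y, lam, sweeps=200):
    """Coordinate descent for (1/(2n))*RSS + lam*sum(|b_j|), standardized columns."""
    n = len(y)
    ybar = sum(y) / n
    xs = [standardize(c) for c in cols]
    beta = [0.0] * len(xs)
    resid = [v - ybar for v in y]  # the intercept is handled by centering y
    for _ in range(sweeps):
        for j, xj in enumerate(xs):
            # Correlation of x_j with the partial residual that excludes x_j.
            rho = sum(x * r for x, r in zip(xj, resid)) / n + beta[j]
            new = soft_threshold(rho, lam)
            if new != beta[j]:
                step = new - beta[j]
                resid = [r - step * x for r, x in zip(resid, xj)]
                beta[j] = new
    return beta


# Made-up data: a strong regressor (x1) and a weak one (x2).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
y = [3.0 * a + 0.1 * b for a, b in zip(x1, x2)]

b_ols = lasso([x1, x2], y, lam=0.0)    # no penalty: least squares coefficients
b_lasso = lasso([x1, x2], y, lam=0.5)  # moderate penalty
```

With the penalty switched on, the coefficient on the strong regressor shrinks toward zero and the coefficient on the weak regressor is set exactly to zero, which is the same pattern seen in the crime data results.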
The last comparison is with the rightmost column in Table 2.2. These values are OLS applied to the variables selected (the nonzero coefficients) from the lasso fit. In EViews we choose the VARSEL procedure with the five independent variables as search variables. The ENET options are the same as those for the lasso analysis earlier. As expected, the two variables COLLEGE and COLLEGE4 that were zero in the lasso model are not included in the subsequent OLS regression. While the lasso shrank, or biased, the coefficients toward zero, the OLS fit expands, or de-biases, them away from zero. This results in a decrease in the variance of the final model, as you can see by comparing the standard errors in the variable selection model with those in the first OLS model:
We can examine some of the details behind the model selection process by going to View/Model Selection Summary and choosing Criteria Graph and Criteria Table. Here we can see the models ranked from best to worst in order of the AIC. Note that this selection process differs from the ranking of the best models in ENET's cross-validation, which is based on error measures such as mean squared error.
In View/Model Selection Summary we have also included some of the views from ENET. The selected models here, highlighted in red in the Coefficient Matrix and Summary Path and with the dotted line in the Lambda Coefficient Graph, are based on AIC rather than MSE.