Soci709 (formerly 209) Module 9n - MODEL BUILDING & SPECIFICATION
0. REFERENCES
ALSM5e pp. 343-383; ALSM4e pp. 327-360.
STATA [R] sw
Miller, A. J. 2002. Subset Selection in
Regression. 2nd edition. London: Chapman and Hall.
1. THE MODEL BUILDING/SPECIFICATION PROBLEM
IN RESEARCH CONTEXT
The problem of choosing the independent variables to include in the regression model depends on the design of the study. One can distinguish four types of research designs.
1. Controlled Experiments
Example: an agricultural experiment to assess
the effect on yield of a variety of corn of three levels of fertilizer
and three levels of watering.
Example: Shepard's mental-rotation experiments, in which the dependent variable is the amount of time taken by subjects to determine whether an object rotated by a certain angle is the same as a reference object.
Example: A psychological experiment relating
task performance to the amount of anxiety experienced by subjects.
In a controlled experiment the treatments
are randomly allocated to the experimental units. No other variables
need to be collected about the units. The regression model consists
of the dependent variable (e.g., yield) regressed on the predetermined
independent variables, or factors, corresponding to different experimental
treatments (e.g., amounts of fertilizer and of watering).
2. Controlled Experiments with Supplemental
Variables
In some experiments, characteristics of the experimental units are also measured; for human subjects, for instance, their age, sex, weight, and educational attainment. Because of randomization, these characteristics are not supposed to be related to the values of the independent variables (the factors), so their role is limited to helping reduce the error variance of the regression model. Including these variables in the regression model is not expected to change the effect of the factor(s). They are usually few, so they can all be included in the model and later discarded if they do not help in reducing the error variance.
Example: In Shepard's experiments the sex of the subject might be added to the model, as there may be an association between spatial visualization and sex; but including sex is not expected to affect the finding that recognition time is a function of the angle of rotation of the object.
3. Confirmatory Observational Studies
These are studies that use observational data
to test hypotheses derived from "theory", which means previous studies
in the field and new ideas and interpretations by the researcher.
Thus, the variables to collect and to include in the model belong to two categories:
- explanatory variables that previous studies have shown to affect the dependent variable
- new explanatory variables that the researcher believes also affect the dependent variable
Example: in the article by Scott South (1985) on factors explaining the divorce rate in the US, the author included two
measures of economic well-being (the unemployment rate and the rate of
growth of the GNP per capita) because he detected a consensus in the literature
on divorce that the divorce rate is lower during hard economic times because
individuals are more dependent on their spouses.
Thus, in confirmatory observational studies,
previous studies and the new theories the researcher wants to test guide
the choice of variables to include in the model, although there may be
some choices to be made among alternative indicators of the same theoretical
concept (e.g., unemployment versus GNP growth as an indicator of economic
conditions).
4. Exploratory Observational Studies
These are studies using observational data in
which a strong basis of previous knowledge about the phenomenon of interest
is lacking. Or they may be studies in which there is some knowledge
of the factors affecting the dependent variable but the goal is prediction
rather than an understanding of the phenomenon.
Example: a study of the rates of child abuse
among counties of North Carolina. There are many characteristics
of counties that can be obtained from government sources such as decennial
censuses, but only vague ideas about which socio-economic characteristics
of counties are expected to be associated with the rate of child abuse:
poverty? female-headed households? racial composition? ...
Example: a government agency collects data
on the sale prices of houses in a county and uses a regression model to
estimate the market value of homes in order to assess real estate taxes.
In developing the regression model the agency can choose among a number
of measured characteristics of the homes (heated area, size of land,
initial purchase price, age, number of bedrooms, etc...). The goal
is to find the subset of measured characteristics that best predicts the
market value of a home (as measured by sale price in recent transactions).
To reduce the number of independent variables to be included in the model in exploratory observational studies, and to some extent in confirmatory observational studies, computer-based approaches have been developed that are based on two general strategies:
- all-possible-regressions procedures identify "good" subsets of the pool of potential independent variables among all possible subsets of the variables, where "good" may be defined with respect to several criteria
- forward stepwise regression (and other automatic search procedures) searches for the "best" subset of independent variables without comparing all possible regressions
2. ALL-POSSIBLE-REGRESSIONS PROCEDURES
The all-possible-regressions procedure examines
all the 2P-1 possible subsets (of 1, 2, ..., P-1 variables)
of the pool of P-1 potential X variables and identifies a few "good" subsets
according to one of the criteria below. These criteria can also be
used outside the all-possible-regressions context, to compare two or more
regression models for the same dependent variable. (There are 2P-1
subsets because each of the P-1 variables can be either included or excluded
from a model.)
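As a quick illustration of the count, here is a minimal Python sketch (variable names invented) that enumerates every subset of a pool of P−1 = 3 potential predictors, including the intercept-only model:

```python
# Enumerate all 2^(P-1) subsets of P-1 = 3 potential X variables.
# Predictor names are made up for illustration.
from itertools import combinations

predictors = ["x1", "x2", "x3"]

subsets = []
for k in range(len(predictors) + 1):         # models with k = 0, 1, ..., P-1 variables
    subsets.extend(combinations(predictors, k))

print(len(subsets))                          # 2^3 = 8, counting the intercept-only model
```

In an actual all-possible-regressions run, each of these subsets would be fitted by OLS and scored on one of the criteria below.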
1. Rp² (or SSEp) Criterion
With the Rp² criterion (where the subscript p refers to the number of parameters in the model, i.e., p−1 X variables plus the intercept) subsets of the potential X variables for which the ordinary R-square is large are considered "good". Choosing the model with the largest R² is equivalent to choosing the model with the smallest SSE (since R² = 1 − SSE/SSTO and SSTO is constant across all models). The Rp² criterion is used to judge when to stop adding more variables, rather than to find the "best" model, since Rp² can never decrease when p increases.
2. Ra² (or MSEp) Criterion
The Ra² criterion compares models on the basis of the adjusted R-square, which adjusts for the number of independent variables included. It can be shown that Ra² = 1 − MSE/(SSTO/(n−1)), so that maximizing Ra² is equivalent to minimizing MSE.
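These quantities can be sketched in a few lines of numpy (the data are simulated for illustration; the fitting routine is ordinary least squares via `numpy.linalg.lstsq`):

```python
import numpy as np

def fit_metrics(X, y):
    """OLS fit; return R-square, adjusted R-square, and MSE.

    X is the n x (p-1) matrix of predictors (the intercept is added here).
    """
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])    # add intercept column
    p = Xd.shape[1]                          # p = number of parameters
    b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    e = y - Xd @ b                           # residuals
    sse = float(e @ e)
    ssto = float(((y - y.mean()) ** 2).sum())
    r2 = 1 - sse / ssto                      # R2 = 1 - SSE/SSTO
    mse = sse / (n - p)
    r2a = 1 - mse / (ssto / (n - 1))         # Ra2 = 1 - MSE/(SSTO/(n-1))
    return r2, r2a, mse

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = 1 + 2 * X[:, 0] + rng.normal(size=30)    # x2 is irrelevant by construction
r2, r2a, mse = fit_metrics(X, y)
print(r2, r2a, mse)
```

Note that Ra² is always below R² once the model contains at least one X variable, reflecting the adjustment for the number of parameters.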
3. (Optional) Mallows' Cp Criterion
Mallows' Cp criterion is based on the concept of the total mean squared error of the n fitted values for each subset regression model. It can be shown that the total mean squared error for all n fitted values ŷi is
Σi=1 to n [(E{ŷi} − μi)² + σ²{ŷi}] = Σi=1 to n (E{ŷi} − μi)² + Σi=1 to n σ²{ŷi}
where μi denotes the true mean response when the values of the Xk are those for the ith case. The total mean squared error is seen as composed of a squared bias component (E{ŷi} − μi)² and a variance component σ²{ŷi}.
The criterion measure Γp is the total mean squared error divided by the true error variance σ²
Γp = (1/σ²) [Σi=1 to n (E{ŷi} − μi)² + Σi=1 to n σ²{ŷi}]
Note that σ² is unknown. Assuming that the model that includes all P−1 potential X variables is such that MSE(X1, ..., XP−1) is an unbiased estimator of σ², it can be shown that Γp can be estimated as
Cp = SSEp/MSE(X1, ..., XP−1) − (n − 2p)
where SSEp (with lowercase p) is the SSE for the subset model with p−1 X variables and MSE(X1, ..., XP−1) (with capital P) is the MSE for the model with all P−1 X variables. It can be shown that when there is no bias in the subset model with p−1 X variables then
E{Cp} ≈ p (where ≈ stands for "is approximately equal to")
Thus when Cp values are plotted against p, unbiased models will fall near the line Cp = p.
The strategy with the Cp criterion is to identify models with
- a small Cp value
- a Cp value near p
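The Cp formula can be checked numerically with a small simulated example (data and variable roles are made up; here the subset {x1, x2} is the true model, so its Cp should fall near p):

```python
import numpy as np

def sse(cols, y):
    """SSE from an OLS fit of y on the given columns plus an intercept."""
    n = len(y)
    Xd = np.column_stack([np.ones(n)] + cols)
    b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    e = y - Xd @ b
    return float(e @ e)

rng = np.random.default_rng(1)
n = 40
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)                      # irrelevant by construction
y = 1 + 2 * x1 - x2 + rng.normal(size=n)

full_cols = [x1, x2, x3]                     # the P-1 = 3 potential predictors
P = len(full_cols) + 1
mse_full = sse(full_cols, y) / (n - P)       # MSE(X1, ..., X_{P-1})

# Cp for the unbiased subset {x1, x2}: p = 3 parameters (intercept, x1, x2)
p = 3
cp = sse([x1, x2], y) / mse_full - (n - 2 * p)
print(round(cp, 2))                          # for an unbiased subset, E{Cp} ~ p
```

One useful check: for the full model itself, Cp equals P exactly by construction, so the full model always lies on the Cp = p line.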
4. AIC and SBC Criteria
AIC (Akaike Information Criterion) is defined as
AICp = n ln(SSEp) − n ln(n) + 2p
SBC (Schwarz's Bayesian Criterion) is defined as
SBCp = n ln(SSEp) − n ln(n) + [ln(n)]p
For both criteria smaller values are better. Note that both criteria increase with SSE (poor model fit) and with p (number of parameters). Thus both criteria penalize models with many independent variables; SBC penalizes them more heavily than AIC as soon as ln(n) > 2, i.e., n ≥ 8.
(Note: SBC is the same criterion as BIC, the Bayesian Information Criterion; the two names are used interchangeably.)
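As a rough numerical check of the two formulas, here is a small Python sketch; the SSE values and sample size are made up for illustration:

```python
import math

def aic(sse_p, n, p):
    """AIC_p = n ln(SSE_p) - n ln(n) + 2p"""
    return n * math.log(sse_p) - n * math.log(n) + 2 * p

def sbc(sse_p, n, p):
    """SBC_p = n ln(SSE_p) - n ln(n) + [ln(n)] p"""
    return n * math.log(sse_p) - n * math.log(n) + math.log(n) * p

# Hypothetical SSEs for two nested models (made-up numbers):
n = 50
print(aic(120.0, n, 3), sbc(120.0, n, 3))    # smaller model
print(aic(118.0, n, 4), sbc(118.0, n, 4))    # one more variable, slightly lower SSE
```

In this made-up comparison the modest drop in SSE does not offset the penalty for the extra parameter, so both criteria prefer the smaller model; note that SBC's penalty per parameter, ln(50) ≈ 3.9, exceeds AIC's penalty of 2.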
5. PRESSp Criterion
The PREdiction Sum of Squares criterion and the SSEp criterion are analogous, with one difference, as seen in their formulas
SSEp = Σ(yi − ŷi)²
PRESSp = Σ(yi − ŷi(i))²
where the sums are for i = 1 to n. The difference is that, in PRESSp, yi is compared to its predicted value from a regression from which observation i was excluded.
(Optional note. ŷi(i) is the "deleted prediction," by analogy with the "deleted residual" discussed in connection with diagnostics for outliers and influential cases in Module 10. In fact, SSEp and PRESSp can also be written
SSEp = Σei²
PRESSp = Σdi²
that is, PRESSp is the sum of the squared deleted residuals (or external residuals) di.)
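A convenient fact is that PRESSp does not require n separate refits: the deleted residual can be obtained from the ordinary residual as di = ei/(1 − hii), where hii is the ith diagonal element of the hat matrix. A minimal numpy sketch with simulated data, checking the shortcut against the literal leave-one-out computation:

```python
import numpy as np

def press(X, y):
    """PRESS_p via the deleted-residual shortcut d_i = e_i / (1 - h_ii)."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])
    H = Xd @ np.linalg.inv(Xd.T @ Xd) @ Xd.T     # hat matrix
    e = y - H @ y                                # ordinary residuals e_i
    d = e / (1 - np.diag(H))                     # deleted residuals d_i
    return float((d ** 2).sum())

def press_literal(X, y):
    """Same quantity by literally refitting with case i left out."""
    n = len(y)
    total = 0.0
    for i in range(n):
        keep = [j for j in range(n) if j != i]
        Xd = np.column_stack([np.ones(n - 1), X[keep]])
        b, *_ = np.linalg.lstsq(Xd, y[keep], rcond=None)
        yhat_i = np.concatenate([[1.0], X[i]]) @ b   # deleted prediction
        total += (y[i] - yhat_i) ** 2
    return total

rng = np.random.default_rng(2)
X = rng.normal(size=(25, 2))
y = 1 + X[:, 0] + rng.normal(size=25)
print(press(X, y))       # matches press_literal(X, y) up to rounding
```

Because each yi is predicted from a model that never saw it, PRESSp is always at least as large as SSEp.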
3. FORWARD STEPWISE REGRESSION
The forward stepwise regression procedure is illustrated
by two examples.
- Exhibit: Stepwise regression for depression score (with age, female, educatn, l10inc, cath, jewi, none, married, drinks, goodhlth)
- Exhibit: Stepwise regression for the 1920-1970 time series of the US divorce rate (with unemp, flfprt, marumf, trend, brthrt, brtw1544, milperk)
The major weakness of forward stepwise regression
(and other automatic search procedures), compared to the all-possible-regressions
methods, is that the end result is a single "best" model. This model
may not be as desirable as other models missed by the procedure.
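The core of the procedure can be sketched in Python. This is a simplified forward-selection version with simulated data: at each step it adds the variable with the largest partial F statistic, stopping when no F exceeds the F-to-enter threshold. (Full forward stepwise, as in Stata's sw, also re-tests variables already in the model and can drop them; that step is omitted here.)

```python
import numpy as np

def sse(cols, y):
    """SSE from an OLS fit of y on the given columns plus an intercept."""
    n = len(y)
    Xd = np.column_stack([np.ones(n)] + cols) if cols else np.ones((n, 1))
    b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    e = y - Xd @ b
    return float(e @ e)

def forward_stepwise(X, y, f_enter=4.0):
    """Greedy forward selection on the partial F statistic."""
    n, k = X.shape
    selected, remaining = [], list(range(k))
    while remaining:
        cur = sse([X[:, j] for j in selected], y)
        best_j, best_f = None, f_enter
        for j in remaining:
            cand = sse([X[:, jj] for jj in selected + [j]], y)
            df = n - (len(selected) + 2)      # residual df of candidate model
            f = (cur - cand) / (cand / df)    # partial F for adding X_j
            if f > best_f:
                best_j, best_f = j, f
        if best_j is None:                    # no variable passes F-to-enter
            break
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))                 # columns 2 and 3 are pure noise
y = 2 + 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)
print(forward_stepwise(X, y))                 # the two real predictors should enter
```

The greediness is exactly the weakness noted above: once a variable is in, the path of models explored is constrained, so "good" subsets not on that path are never examined.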
4. MODEL VALIDATION
The problem of validating the model (choice of independent variables) arises
mostly for exploratory observational studies. Validation involves
checking the model against independent data. There are three approaches.
1. Comparison with theoretical expectations, other empirical evidence,
and simulation results
2. Collection of new data
Collection of new data to check the model is desirable but rarely feasible.
3. Splitting the data
Data are split randomly into two sets:
- the model-building or training sample, used to develop the model, and
- the validation or prediction set, used to validate the model
The cross-validation procedure consists in re-estimating candidate models developed from the model-building sample with the validation sample and (1) comparing the values of the estimated regression coefficients, and (2) calculating the MSPR (mean squared prediction error) as
MSPR = Σi=1 to n* (yi − ŷi)² / n*
where
yi is the response for the ith case in the validation sample
ŷi is the predicted value for the ith case in the validation sample, based on the candidate model estimated from the model-building sample
n* is the number of cases in the validation sample.
The candidate model is validated to the extent that the values of MSPR and MSE for the training sample regression are close. (It is not entirely clear to me how to decide what "close" means; see ALSM5e p. 374. Furthermore ALSM5e decide to drop a variable from the model because its coefficient is negative, contrary to theoretical expectation; but this coefficient is non-significant, so its sign should not matter. It would be better to drop that variable on the ground that it is non-significant.)
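The split-sample calculation can be sketched as follows (the data are simulated; in practice the two halves come from randomly splitting the observed cases):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
X = rng.normal(size=(n, 2))
y = 1 + 2 * X[:, 0] + X[:, 1] + rng.normal(size=n)

# Random split into a model-building (training) set and a validation set
idx = rng.permutation(n)
train, valid = idx[:100], idx[100:]

# Fit the candidate model on the training sample only
Xt = np.column_stack([np.ones(len(train)), X[train]])
b, *_ = np.linalg.lstsq(Xt, y[train], rcond=None)

# Training MSE (p = 3 parameters here, so n - p in the denominator)
e = y[train] - Xt @ b
mse = float(e @ e) / (len(train) - 3)

# MSPR: squared prediction errors on the validation sample, divided by n*
Xv = np.column_stack([np.ones(len(valid)), X[valid]])
mspr = float(((y[valid] - Xv @ b) ** 2).sum()) / len(valid)

print(mse, mspr)    # validated to the extent that MSPR and MSE are close
```

Since the model here is correctly specified by construction, MSPR and the training MSE both estimate the true error variance and should come out close; a badly overfitted candidate model would instead show MSPR well above MSE.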
Last modified 27 Mar 2006