Module 15 - MODEL BUILDING & SPECIFICATION
1. THE MODEL BUILDING/SPECIFICATION PROBLEM IN RESEARCH CONTEXT
The problem of choosing the variables to include
in the regression model is not as open-ended as it seems. Depending
on the design of the research, the variables to include may be predetermined,
or there may be strong guidelines for choosing variables. One can
distinguish 4 types of research designs.
1. Controlled Experiments
EX: an agricultural experiment to assess the effect of three levels
of fertilizer and three levels of watering on the yield of a variety of corn.
In a controlled experiment the "treatments"
are randomly allocated to the experimental units. No other variables
are collected about the units. The regression model consists of the
dependent variable (yield) regressed on the predetermined independent variables,
or factors (amount of fertilizer and of watering).
2. Controlled Experiments with Supplemental
Variables
In some experiments, characteristics of the experimental
units are also measured: for instance, for human subjects, their age,
sex, weight, and educational attainment. Because of randomization, these
characteristics are not supposed to be related to the values of the independent
variables (the factors), so their role is only to help reduce the error variance
of the regression model. They are usually few, so they can all be
included in the model and later discarded if they do not help in reducing
error variance.
3. Confirmatory Observational Studies
These are studies that use observational data
to test hypotheses derived from "theory", which means previous studies
in the field and new ideas and interpretations by the researcher.
Thus, the variables to collect and to include in the model belong to 2
categories:
- explanatory variables that previous studies have shown to affect the dependent variable
- new explanatory variables that the researcher believes also affect the dependent variable
EX: in the article by Scott South (1985) on factors
explaining the divorce rate in the US, the author included 2 measures of
economic well-being (the unemployment rate and the rate of growth of the
GNP per capita) because he detected a consensus in the literature on divorce
that the divorce rate is lower during hard economic times because individuals
are more dependent on their spouses.
Thus, in confirmatory observational studies,
previous studies and the new theories the researcher wants to test guide
the choice of variables to include in the model, although there may be
some choices to be made among alternative indicators of the same theoretical
concept (e.g., unemployment versus GNP growth as an indicator of economic
conditions).
4. Exploratory Observational Studies
These are studies using observational data in
which a strong basis of previous knowledge about the phenomenon of interest
is lacking. Alternatively, they may be studies in which there is some knowledge
of the factors affecting the dependent variable, but the goal is prediction
rather than an understanding of the phenomenon.
EX: a study of the rates of child abuse among
counties of North Carolina. There are many characteristics of counties
that can be obtained from government sources such as decennial censuses,
but only vague ideas about which characteristics of counties are expected
to be associated with the rate of child abuse.
EX: government agencies use regression models
to estimate the value of your house in order to calculate the real estate
tax that you pay. When they develop these models they choose among
all measurable characteristics of your house (heated area, size of land,
initial purchase price, age, number of bedrooms, etc.) the subset that
best predicts the value of the house (as measured by the sale price in recent
transactions).
To reduce the number of independent variables
to be included in the model in exploratory observational studies, and to
some extent in confirmatory observational studies, computer-based approaches
have been developed that are based on 2 general strategies:
- all-possible-regressions procedures identify "good" subsets of the pool of potential independent variables among all possible subsets of the variables, where "good" may be defined with respect to several criteria
- forward stepwise regression (and other automatic search procedures) searches for the "best" subset of independent variables without comparing all possible regressions
2. ALL-POSSIBLE-REGRESSIONS PROCEDURES
The all-possible-regressions procedure examines
all possible subsets (of 1, 2, ..., P-1 variables) of the pool of P-1 potential
X variables and identifies a few "good" subsets according to one of the
following criteria:
1. the $R_p^2$ (or $SSE_p$) Criterion
With the $R_p^2$ criterion (where the subscript p refers to the
number of parameters in the model, i.e., p-1 X variables plus the constant),
subsets of the potential X variables for which the ordinary R-square is large are
considered "good". Choosing the model with the largest $R^2$
is equivalent to choosing the model with the smallest SSE, since $R^2
= 1 - SSE/SSTO$ and SSTO is constant across all models.
The $R_p^2$ criterion
is used to judge when to stop adding more variables rather than to find
the "best" model, since $R_p^2$ can never decrease when
p increases.
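To make the search concrete, here is a minimal sketch in Python/numpy (not
the course's SYSTAT session; data and variable names are hypothetical) that
computes the ordinary $R^2$ for every nonempty subset of a pool of potential
X variables:

    from itertools import combinations
    import numpy as np

    def r_squared(X_sub, y):
        """Ordinary R^2 = 1 - SSE/SSTO for a least-squares fit with a constant."""
        n = len(y)
        Xd = np.column_stack([np.ones(n), X_sub])        # design matrix with intercept
        beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
        resid = y - Xd @ beta
        sse = resid @ resid
        ssto = ((y - y.mean()) ** 2).sum()               # SSTO is constant across subsets
        return 1.0 - sse / ssto

    def all_possible_r2(X, y):
        """R^2 for every nonempty subset of the columns of X: 2^(P-1) - 1 models."""
        n_vars = X.shape[1]
        return {cols: r_squared(X[:, list(cols)], y)
                for k in range(1, n_vars + 1)
                for cols in combinations(range(n_vars), k)}

Because the number of subsets doubles with each additional potential X
variable, this brute-force search is only practical for a modest pool of
predictors.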
2. the $R_a^2$ (or $MSE_p$) Criterion
The $R_a^2$ criterion
uses the adjusted R-square, which adjusts for the number of independent
variables included, to compare models. It can be shown (NKNW 8.4,
p. 339) that $R_a^2 = 1 - MSE/(SSTO/(n-1))$, so
that maximizing $R_a^2$ is equivalent to minimizing
MSE.
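The equivalence can be verified from the definition of the adjusted R-square
and the fact that MSE = SSE/(n-p):

$$R_a^2 = 1 - \frac{n-1}{n-p}\,\frac{SSE}{SSTO} = 1 - \frac{SSE/(n-p)}{SSTO/(n-1)} = 1 - \frac{MSE}{SSTO/(n-1)}$$

Since SSTO/(n-1) is the same for every subset model, the subset that
maximizes $R_a^2$ is the one that minimizes MSE.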
3. the $C_p$ Criterion
The $C_p$ criterion is based on
the concept of the total mean squared error of the n fitted values for each
subset regression model. It can be shown that the total mean squared
error for all n fitted values $\hat{Y}_i$ is

$$\sum_{i=1}^{n} \left( E\{\hat{Y}_i\} - \mu_i \right)^2 + \sum_{i=1}^{n} \sigma^2\{\hat{Y}_i\}$$

which is seen to be composed of a bias component and a variance component.
The criterion measure $\Gamma_p$ is the total mean squared error divided
by the true error variance $\sigma^2$:

$$\Gamma_p = \frac{1}{\sigma^2} \left[ \sum_{i=1}^{n} \left( E\{\hat{Y}_i\} - \mu_i \right)^2 + \sum_{i=1}^{n} \sigma^2\{\hat{Y}_i\} \right]$$

Note that $\sigma^2$ is unknown. Assuming that the model that includes
all P-1 potential X variables is such that $MSE(X_1, \ldots, X_{P-1})$ is an
unbiased estimator of $\sigma^2$, it can be shown that $\Gamma_p$
can be estimated as

$$C_p = \frac{SSE_p}{MSE(X_1, \ldots, X_{P-1})} - (n - 2p)$$

where $SSE_p$ is the SSE for the model with p-1 X variables and
$MSE(X_1, \ldots, X_{P-1})$ is the MSE for the model with all P-1 X variables.
It can be shown that when there is no bias in the model with p-1 X variables, then

$$E\{C_p\} \approx p$$

Thus, when $C_p$ values are plotted against p, unbiased models will fall
near the line $C_p = p$.
The strategy with the $C_p$ criterion is to identify models with:
- a small $C_p$ value
- a $C_p$ value near p
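As a small illustration (a sketch, not SYSTAT output), $C_p$ can be computed
directly from the SSE of a candidate model and the MSE of the full model,
following the formula above:

    def mallows_cp(sse_p, mse_full, n, p):
        """Cp = SSEp / MSE(X1, ..., XP-1) - (n - 2p), where p counts the
        parameters of the candidate model (p-1 X variables plus the constant)."""
        return sse_p / mse_full - (n - 2 * p)

A candidate model with little bias should give a value close to p; values
far above p signal substantial bias.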
4. the $PRESS_p$ Criterion
The PREdiction Sum of Squares criterion and the
$SSE_p$ criterion are analogous, with one difference, as seen in their formulas:

$$SSE_p = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 \qquad PRESS_p = \sum_{i=1}^{n} (Y_i - \hat{Y}_{i(i)})^2$$

The difference is that in $PRESS_p$, $Y_i$ is compared to its predicted
value from a regression from which observation i was excluded, so that
$\hat{Y}_{i(i)}$ is the "deleted predictor" (by analogy with the "deleted
residual" of regression diagnostics). In fact, $SSE_p$ and
$PRESS_p$ can also be written

$$SSE_p = \sum_{i=1}^{n} e_i^2 \qquad PRESS_p = \sum_{i=1}^{n} d_i^2$$

that is, $PRESS_p$ is the sum of the squared external residuals (or deleted
residuals) $d_i$ discussed in Module 10 on diagnostics for outliers and
influential cases.
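Because the deleted residual can be obtained from the ordinary residual and
the leverage as $d_i = e_i/(1 - h_{ii})$ (Module 10), $PRESS_p$ does not
require refitting the model n times. A minimal numpy sketch (hypothetical
data; X is assumed to already include the constant column):

    import numpy as np

    def press(X, y):
        """PRESSp = sum of squared deleted residuals d_i = e_i / (1 - h_ii)."""
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        e = y - X @ beta                                  # ordinary residuals e_i
        h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)     # leverages h_ii
        d = e / (1.0 - h)                                 # deleted residuals d_i
        return (d ** 2).sum()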
3. FORWARD STEPWISE REGRESSION
The forward stepwise regression procedure will be illustrated in class by
two examples (a Python sketch of the underlying algorithm follows the examples):
- the depression model with the Afifi & Clark data with calculated variables (survey2b.syd). For this example start with >model total=constant+age+female+educatn+l10inc+cath+jewi+none+married+drinks+goodhlth then enter >start/forward then >step ... >step until the program says Nothing to do! Then enter >stop to obtain the detailed final regression.
- the divorce rate model with the 1920-1970 time series for the US (ignoring the problem of autocorrelated errors for the sake of this example). For this example start with >model divmf=constant+unemp+flfprt+marumf+trend+brthrt+brtw1544+milperk then enter >start/forward then >step ... >step until the program says Nothing to do! Then enter >stop to obtain the detailed final regression.
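For those curious about what >start/forward and >step are doing, here is a
minimal forward-selection sketch in Python (not the SYSTAT implementation;
the F-to-enter threshold of 4.0 is an arbitrary illustrative choice). At
each step the variable that most reduces SSE enters, until no remaining
variable passes the threshold, the analogue of "Nothing to do!":

    import numpy as np

    def sse(cols, X, y):
        """SSE of the least-squares fit of y on a constant plus X[:, cols]."""
        n = len(y)
        Xd = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
        beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
        r = y - Xd @ beta
        return r @ r

    def forward_stepwise(X, y, f_to_enter=4.0):
        n = len(y)
        remaining, chosen = list(range(X.shape[1])), []
        current_sse = sse([], X, y)               # intercept-only model: SSE = SSTO
        while remaining:
            new_sse, j = min((sse(chosen + [j], X, y), j) for j in remaining)
            p = len(chosen) + 2                   # parameters if variable j enters
            f_stat = (current_sse - new_sse) / (new_sse / (n - p))
            if f_stat < f_to_enter:               # "Nothing to do!"
                break
            chosen.append(j)
            remaining.remove(j)
            current_sse = new_sse
        return chosen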
The major weakness of forward stepwise regression
(and other automatic search procedures), compared to the all-possible-regressions
methods, is that the end result is a single "best" model. This model
may not be as desirable as other models missed by the procedure.
Last modified 2 May 2000