Soci709 (formerly 209) Module 9n - MODEL BUILDING & SPECIFICATION

0.  REFERENCES

ALSM5e pp. 343-383; ALSM4e pp. 327-360.
STATA [R] sw
Miller, A. J. 2002. Subset Selection in Regression.  2nd edition.  London: Chapman and Hall.

1.  THE MODEL BUILDING/SPECIFICATION PROBLEM IN RESEARCH CONTEXT

The problem of choosing the independent variables to include in the regression model depends on the design of the study.  One can distinguish 4 types of research designs.

1.  Controlled Experiments

Example: an agricultural experiment to assess the effect on yield of a variety of corn of three levels of fertilizer and three levels of watering.
Example: Shepard's experiments in which the dependent variable is the amount of time taken by subjects to determine whether an object rotated by a certain angle is the same as a reference object.
Example: A psychological experiment relating task performance to the amount of anxiety experienced by subjects.
In a controlled experiment the treatments are randomly allocated to the experimental units.  No other variables need to be collected about the units.  The regression model consists of the dependent variable (e.g., yield) regressed on the predetermined independent variables, or factors, corresponding to different experimental treatments (e.g., amounts of fertilizer and of watering).

2.  Controlled Experiments with Supplemental Variables

In some experiments, characteristics of the experimental units are measured: for instance, for human subjects, their age, sex, weight, and educational attainment.  As these characteristics (because of randomization) are not supposed to be related to the values of the independent variables (the factors), their role is limited to helping reduce the error variance of the regression model.  Including these variables in the regression model is not expected to change the estimated effect of the factor(s).  They are usually few, so they can all be included in the model and later discarded if they do not help in reducing error variance.
Example: In Shepard's experiments the sex of the subject might be added to the model, as there may be an association between spatial visualization and sex; but including sex is not expected to affect the finding that recognition time is a function of the angle of rotation of the object.

3.  Confirmatory Observational Studies

These are studies that use observational data to test hypotheses derived from "theory", which means previous studies in the field and new ideas and interpretations by the researcher.  Thus, the variables to collect and to include in the model belong to two categories: those suggested by previous studies and those suggested by the researcher's new hypotheses.  Example: in the article by Scott South (1985) on factors explaining the divorce rate in the US, the author included two measures of economic well-being (the unemployment rate and the rate of growth of GNP per capita) because he detected a consensus in the literature on divorce that the divorce rate is lower during hard economic times, because individuals are more dependent on their spouses.
Thus, in confirmatory observational studies, previous studies and the new theories the researcher wants to test guide the choice of variables to include in the model, although there may be some choices to be made among alternative indicators of the same theoretical concept (e.g., unemployment versus GNP growth as an indicator of economic conditions).

4.  Exploratory Observational Studies

These are studies using observational data in which a strong basis of previous knowledge about the phenomenon of interest is lacking.  Or they may be studies in which there is some knowledge of the factors affecting the dependent variable but the goal is prediction rather than an understanding of the phenomenon.
Example: a study of the rates of child abuse among counties of North Carolina.  There are many characteristics of counties that can be obtained from government sources such as decennial censuses, but only vague ideas about which socio-economic characteristics of counties are expected to be associated with the rate of child abuse: poverty?, female-headed household?, race composition?, ...
Example: a government agency collects data on the sale prices of houses in a county and uses a regression model to estimate the market value of homes in order to assess real estate taxes.  In developing the regression model the agency can choose among a number of measured characteristics of the  homes (heated area, size of land, initial purchase price, age, number of bedrooms, etc...).  The goal is to find the subset of measured characteristics that best predicts the market value of a home (as measured by sale price in recent transactions).
To reduce the number of independent variables to be included in the model in exploratory observational studies, and to some extent in confirmatory observational studies, computer-based approaches have been developed that are based on two general strategies: evaluating all possible regressions, and searching for a good model stepwise.

2.  ALL-POSSIBLE-REGRESSIONS PROCEDURES

The all-possible-regressions procedure examines all the 2^(P-1) − 1 possible subsets (of 1, 2, ..., P−1 variables) of the pool of P−1 potential X variables and identifies a few "good" subsets according to one of the criteria below.  These criteria can also be used outside the all-possible-regressions context, to compare two or more regression models for the same dependent variable.  (There are 2^(P-1) − 1 subsets because each of the P−1 variables can be either included or excluded from a model, which yields 2^(P-1) combinations, minus one for the model with no X variables.)
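To make the enumeration concrete, here is a minimal sketch in Python (the course software is Stata, so this is for illustration only) that enumerates all 2^(P-1) − 1 nonempty subsets of a pool of P−1 = 3 simulated X variables and computes the ordinary R² of each subset regression; the data and variable names are hypothetical.

```python
from itertools import combinations

import numpy as np

# Hypothetical data: n = 30 cases, a pool of P-1 = 3 potential X variables.
rng = np.random.default_rng(0)
n, n_pool = 30, 3
X = rng.normal(size=(n, n_pool))
y = 2.0 + 1.5 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)

def r_squared(y, X_sub):
    """Ordinary R-square of an OLS regression with intercept."""
    Xd = np.column_stack([np.ones(len(y)), X_sub])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    sse = np.sum((y - Xd @ beta) ** 2)
    ssto = np.sum((y - y.mean()) ** 2)
    return 1.0 - sse / ssto

# All nonempty subsets of the pool: 2^3 - 1 = 7 candidate models.
subsets = [c for k in range(1, n_pool + 1)
           for c in combinations(range(n_pool), k)]
scores = {c: r_squared(y, X[:, list(c)]) for c in subsets}
```

Because R² can never decrease as variables are added, the full pool always scores highest; the criterion is therefore used to see where additional variables stop paying off.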

1.  Rp² (or SSEp) Criterion

With the Rp² criterion (where the subscript p refers to the number of parameters in the model, i.e., p − 1 X variables plus the intercept) subsets of the potential X variables for which the ordinary R-square is large are considered "good".  Choosing the model with largest R² is equivalent to choosing the model with smallest SSE (since R² = 1 − SSE/SSTO and SSTO is constant across all models).  The Rp² criterion is used to judge when to stop adding more variables rather than to find the "best" model, since Rp² can never decrease when p increases.

2.  Ra² (or MSEp) Criterion

The Ra² criterion compares models on the basis of the adjusted R-square, which adjusts for the number of independent variables included.  It can be shown that Ra² = 1 − MSE/(SSTO/(n−1)), so that maximizing Ra² is equivalent to minimizing MSE.
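The equivalence between maximizing Ra² and minimizing MSE can be checked numerically; the following Python snippet (hypothetical simulated data, shown for illustration since the course software is Stata) computes the adjusted R-square both from the ordinary R² and directly from MSE and SSTO/(n−1).

```python
import numpy as np

# Hypothetical data: n = 40 cases, 2 X variables.
rng = np.random.default_rng(1)
n = 40
X = rng.normal(size=(n, 2))
y = 1.0 + X[:, 0] + rng.normal(size=n)

Xd = np.column_stack([np.ones(n), X])      # design matrix with intercept
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
p = Xd.shape[1]                            # number of parameters
sse = np.sum((y - Xd @ beta) ** 2)
ssto = np.sum((y - y.mean()) ** 2)
mse = sse / (n - p)

r2 = 1.0 - sse / ssto
r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - p)   # textbook adjustment of R2
r2_adj_via_mse = 1.0 - mse / (ssto / (n - 1))   # Ra2 = 1 - MSE/(SSTO/(n-1))
```

The two expressions agree term by term, which is why ranking models by Ra² and ranking them by MSE give the same ordering.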

3.  (Optional) Mallows' Cp Criterion

Mallows' Cp criterion is based on the concept of the total mean squared error of the n fitted values for each subset regression model.  It can be shown that the total mean squared error for all n fitted values ŷi is
Σi=1 to n [(E{ŷi} − μi)² + σ²{ŷi}] = Σi=1 to n (E{ŷi} − μi)² + Σi=1 to n σ²{ŷi}
where μi denotes the true mean response when the values of the Xk are those for the ith case.  The total mean squared error is seen as composed of a squared bias component (E{ŷi} − μi)² and a variance component σ²{ŷi}.
The criterion measure Γp is the total mean squared error divided by the true error variance σ²
Γp = (1/σ²) [Σi=1 to n (E{ŷi} − μi)² + Σi=1 to n σ²{ŷi}]
Note that σ² is unknown.  Assuming that the model that includes all P−1 potential X variables is such that MSE(X1, ..., XP−1) is an unbiased estimator of σ², it can be shown that Γp can be estimated as
Cp = SSEp/MSE(X1, ..., XP−1) − (n − 2p)
where SSEp (with lowercase p) is the SSE for the subset model with p−1 X variables and MSE(X1, ..., XP−1) (with capital P) is the MSE for the model with all P−1 X variables.  It can be shown that when there is no bias in the subset model with p−1 X variables then
E{Cp} ≈ p (where ≈ stands for "is approximately equal to")
Thus when Cp values are plotted against p, unbiased models will fall near the line Cp = p.
The strategy with the Cp criterion is to identify models with a small value of Cp that falls near the line Cp = p.
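A small Python sketch of the Cp computation (illustrative only; the data are simulated and the names hypothetical).  One useful check: by construction, Cp for the full model with all P−1 variables equals P exactly.

```python
import numpy as np

# Hypothetical data: pool of 3 X variables, but only the first truly matters.
rng = np.random.default_rng(2)
n = 50
X = rng.normal(size=(n, 3))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=n)

def fit(cols):
    """Return (SSE, p) for the subset model using the given columns."""
    Xd = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return np.sum((y - Xd @ beta) ** 2), Xd.shape[1]

sse_full, P = fit([0, 1, 2])          # full model: P = 4 parameters
mse_full = sse_full / (n - P)         # assumed unbiased estimate of sigma^2

def mallows_cp(cols):
    sse_p, p = fit(cols)
    return sse_p / mse_full - (n - 2 * p)
```

For the full model, Cp = (n − P) − (n − 2P) = P, so the full model always falls exactly on the line Cp = p; subset models are judged by how far above that line they sit.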

4.  AIC and SBC Criteria

AIC (Akaike Information Criterion) is defined as
AICp = n ln(SSEp) - n ln(n) + 2p
SBC (Schwarz's Bayesian Criterion) is defined as
SBCp = n ln(SSEp) - n ln(n) + [ln(n)]p
For both criteria smaller values are better.  Note that both criteria increase with SSE (poor model fit) and with p (number of independent variables).  Thus both criteria penalize models with many independent variables.
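The two penalties can be compared directly; a short Python illustration (formulas as given above, with made-up values of n and SSE):

```python
import numpy as np

def aic(sse, n, p):
    """AICp = n ln(SSEp) - n ln(n) + 2p."""
    return n * np.log(sse) - n * np.log(n) + 2 * p

def sbc(sse, n, p):
    """SBCp = n ln(SSEp) - n ln(n) + ln(n) p."""
    return n * np.log(sse) - n * np.log(n) + np.log(n) * p

# With the fit (SSE) held fixed, adding one parameter costs 2 under AIC
# but ln(n) under SBC, so SBC penalizes extra variables more strongly
# whenever ln(n) > 2, i.e., for n > 7 (since e^2 is about 7.39).
n, sse = 100, 250.0
aic_cost = aic(sse, n, 5) - aic(sse, n, 4)   # = 2
sbc_cost = sbc(sse, n, 5) - sbc(sse, n, 4)   # = ln(100), about 4.6
```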
(SBC is also known as BIC, the Bayesian Information Criterion; the two names refer to the same formula.)

5.  PRESSp Criterion

The PREdiction Sum of Squares criterion and the SSEp criterion are analogous, with one difference, as seen in their formulas
SSEp = Σ(yi − ŷi)²
PRESSp = Σ(yi − ŷi(i))²
where the sums are for i = 1 to n.  The difference is that, in PRESSp, yi is compared to its predicted value from a regression from which observation i was excluded.
(Optional note.  ŷi(i) is the "deleted prediction" by analogy with the "deleted residual" discussed in connection with diagnostics for outliers and influential cases in Module 10.  In fact, SSEp and PRESSp can also be written
SSEp = Σei²
PRESSp = Σdi²
that is, PRESSp is the sum of the squared external residuals (or deleted residuals) di.)
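PRESSp looks expensive (n separate regressions), but the deleted residual can be obtained from the full-sample fit alone as di = ei/(1 − hii), where hii is the ith leverage.  The sketch below (Python, simulated data, for illustration) verifies that this shortcut matches the explicit leave-one-out computation.

```python
import numpy as np

# Hypothetical data: n = 25 cases, 2 X variables.
rng = np.random.default_rng(3)
n = 25
X = rng.normal(size=(n, 2))
y = 1.0 + X[:, 0] + rng.normal(size=n)
Xd = np.column_stack([np.ones(n), X])

# Explicit computation: refit with each observation i left out.
press_loo = 0.0
for i in range(n):
    keep = np.arange(n) != i
    beta_i, *_ = np.linalg.lstsq(Xd[keep], y[keep], rcond=None)
    press_loo += (y[i] - Xd[i] @ beta_i) ** 2

# Shortcut: d_i = e_i / (1 - h_ii), computed from the full-sample fit.
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
e = y - Xd @ beta                                   # ordinary residuals e_i
h = np.diag(Xd @ np.linalg.inv(Xd.T @ Xd) @ Xd.T)   # leverages h_ii
press_hat = np.sum((e / (1.0 - h)) ** 2)
```

Since 0 < 1 − hii < 1, each squared deleted residual is at least as large as the corresponding squared ordinary residual, so PRESSp ≥ SSEp for the same model.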

3.  FORWARD STEPWISE REGRESSION

The forward stepwise regression procedure is illustrated by two examples. The major weakness of forward stepwise regression (and other automatic search procedures), compared to the all-possible-regressions methods, is that the end result is a single "best" model.  This model may not be as desirable as other models missed by the procedure.
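A bare-bones sketch of forward stepwise selection in Python (the actual course procedure is Stata's sw; the F-to-enter threshold of 4.0 and the simulated data here are illustrative assumptions only): at each step the candidate variable with the largest partial F statistic is added, until no candidate passes the threshold.

```python
import numpy as np

# Hypothetical data: 4 candidate X variables, of which X0 and X2 matter.
rng = np.random.default_rng(4)
n = 60
X = rng.normal(size=(n, 4))
y = 1.0 + 3.0 * X[:, 0] + 2.0 * X[:, 2] + rng.normal(size=n)

def sse_of(cols):
    """SSE of the OLS regression of y on the given columns plus intercept."""
    Xd = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return np.sum((y - Xd @ beta) ** 2)

F_TO_ENTER = 4.0          # illustrative threshold, not a Stata default
selected = []
while True:
    candidates = [c for c in range(X.shape[1]) if c not in selected]
    if not candidates:
        break
    sse_now = sse_of(selected)
    # Partial F for each candidate: (drop in SSE) / MSE of the larger model.
    best, best_F = None, 0.0
    for c in candidates:
        sse_new = sse_of(selected + [c])
        df = n - (len(selected) + 2)      # params: intercept + selected + c
        F = (sse_now - sse_new) / (sse_new / df)
        if F > best_F:
            best, best_F = c, F
    if best is None or best_F < F_TO_ENTER:
        break                             # no candidate passes: stop
    selected.append(best)
```

Note that the procedure returns a single path through the model space; as stated above, it can miss subsets that an all-possible-regressions search would flag as good.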

4.  MODEL VALIDATION

The problem of validating the model (choice of independent variables) arises mostly for exploratory observational studies.  Validation involves checking the model against independent data.  There are three approaches.

1.  Comparison with theoretical expectations, other empirical evidence, and simulation results

2.  Collection of new data

Collection of new data to check the model is desirable but rarely feasible.

3.  Splitting the data

Data are split randomly into two sets:
  1. the model-building or training sample, used to develop the model, and
  2. the validation or prediction set, used to validate the model
The cross-validation procedure consists in estimating candidate models developed from the model-building sample with the validation sample and (1) comparing the values of the estimated regression coefficients, and (2) calculating the MSPR (mean squared prediction error) as
MSPR = (Σi=1 to n* (yi − ŷi)²)/n*
where
yi is the response for the ith case in the validation sample
ŷi is the predicted value based on the candidate model for the ith case in the validation sample
n* is the number of cases in the validation sample.
The candidate model is validated to the extent that the values of MSPR and of MSE for the training sample regression are close.  (It is not entirely clear to me how to decide what "close" means; see ALSM5e p. 374.  Furthermore, ALSM5e decides to drop a variable from the model because its coefficient is negative, contrary to theoretical expectation; but this coefficient is non-significant, so its sign should not matter.  It would be better to drop that variable on the grounds that it is non-significant.)
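The data-splitting procedure can be sketched as follows (Python, simulated data; the 50/50 split and the sample sizes are arbitrary choices for illustration):

```python
import numpy as np

# Hypothetical data: n = 200 cases, 3 X variables, 2 with real effects.
rng = np.random.default_rng(5)
n = 200
X = rng.normal(size=(n, 3))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)

# Random split into a model-building (training) half and a validation half.
idx = rng.permutation(n)
train, valid = idx[:100], idx[100:]

# Fit the candidate model on the training sample only.
Xd = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(Xd[train], y[train], rcond=None)

resid_train = y[train] - Xd[train] @ beta
mse_train = np.sum(resid_train ** 2) / (len(train) - Xd.shape[1])

# MSPR: mean squared prediction error over the validation sample.
yhat_valid = Xd[valid] @ beta
mspr = np.sum((y[valid] - yhat_valid) ** 2) / len(valid)
```

The candidate model is supported to the extent that mspr and mse_train come out close; a much larger MSPR suggests the model was overfit to the training half.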



Last modified 27 Mar 2006