SOCI209 - Module 5 - Multiple Regression & the General Linear Model

1.  Need for Models With More Than One Independent Variable

1.  Motivations for Multiple Regression Analysis

The two principal motivations for models with more than one independent variable are (1) to improve the prediction of the dependent variable by using several predictors jointly, and (2) to support causal claims by controlling for potential confounding variables.  The second motivation is very important for scientific applications of regression analysis.  It is discussed further in the next section.

2.  Supporting a Causal Statement by Eliminating Alternative Hypotheses

Theories about social phenomena are made up of causal statements.  A causal statement or law is "a statement or proposition in a theory which says that there exist environments ... in which a change in the value of one variable is associated with a change in the value of another variable and can produce this change without any change in other variables in the environment" (Stinchcombe, Arthur.  1968.  Constructing Social Theories, p. 31).  Such a causal statement can be represented schematically as
X -----> Y
where Y is the dependent variable and X the independent variable.  One of the requirements to support or refute a causal theory is to ascertain nonspuriousness.  This means eliminating the possibility that one or more other variables (say Z and W) affect both X and Y and thereby produce an apparent association between X and Y that is spuriously attributed to a causal influence of X on Y.  The mechanism of spurious association is shown in the following picture:
Multiple regression analysis can be used to ascertain nonspuriousness by adding variables explicitly to the regression model to eliminate alternative hypotheses on the source of the relationship between X and Y.  The effectiveness of this strategy depends on the design of the study (experimental versus observational).
With observational data the task of ascertaining nonspuriousness remains open-ended, as it is never possible to prove that all potential sources of spuriousness have been controlled.  In the context of regression analysis, spuriousness is called specification bias.  Specification bias is a more general and continuous notion than spuriousness.  The idea is that if a regression model of Y on X excludes a variable that is both associated with X and a cause of Y (the model is then called misspecified) the estimated association of Y with X will be inflated (or, conversely, deflated) relative to its true value.  The regression estimator, in a sense, falsely "attributes" to X a causal influence that is in reality due to the omitted variable(s).

2.  Yule's Study of Pauperism (1899)

1.  The First Modern Multiple Regression Analysis

The use of multiple regression analysis as a means of controlling for possible confounding factors that may spuriously produce an apparent relationship between two variables was first proposed by G. Udny Yule in the late 1890s.  In a pathbreaking 1899 paper, "An Investigation into the Causes of Changes in Pauperism in England, chiefly in the last Two Intercensal Decades," Yule investigated the effect of a change in the ratio of the poor receiving relief outside as opposed to inside the poorhouses (the "out-relief ratio") on change in pauperism (the poverty rate) in British unions.  (Unions are British administrative units.)  This was a hot topic of policy debate in Great Britain at the time.  Charles Booth had argued that increasing the proportion of the poor receiving relief outside the poorhouses did not increase pauperism in a union.  Using correlation coefficients (a then entirely novel technique that had just been developed by his colleague and mentor Karl Pearson), Yule had discovered that there was a strong bivariate association between change in the out-relief ratio and change in pauperism, contrary to Booth's impression.
In the 1899 paper Yule uses multiple regression to confirm the relationship between pauperism and out-relief by controlling for other possible causes of the apparent association, specifically change in the proportion of the old (to control for the greater incidence of poverty among the elderly) and change in population (using population increase in a union as an indicator of prosperity).  The 1899 paper is the first published use of multiple regression analysis.  It is hard to improve on Yule's description of the logic of the method.  (On this episode in the history of statistics see Stigler 1986, pp. 345-361.)
Yule's argument is that (using modern notation) in the estimated simple regression model
^Y = b0 + b1X1
where ^Y is change in pauperism and X1 is change in the proportion of out-relief, the association between Y and X1 measured by b1 confounds the direct effect of X1 on Y with the common association of both Y and X1 with other variables ("economic and social changes") that are not explicitly included in the model.
By contrast, in the multiple regression model
^Y = b0 + b1X1 + b2X2 + b3X3
where X2 measures change in proportion of the elderly and X3 measures change in population, the (partial) regression coefficient b1 now measures the estimated effect of X1 on Y when the other factors included in the model are kept constant.  The possibility that b1 contains a spurious component due to the joint association of Y with X1, X2, and X3 is now excluded, since X2 and X3 are explicitly included in the model (i.e., "controlled").

2.  Applying the Elaboration Model to Yule's Data

The technique of "testing" the coefficient of a variable X1 for spuriousness by introducing into the model additional variables X2, X3, etc., that measure potential confounding factors is called the elaboration model.  It is an effective and widely used strategy for conducting a regression analysis and for presenting the results.  Thus the standard tabular presentation of regression results (see Appendix B) is often based on the elaboration model.
These points are illustrated by a replication of Yule's (1899) analysis for 32 unions in the London metropolitan area.  The dependent variable is change in pauperism (labeled PAUP).  The predictor of interest is change in the out-relief ratio (OUTRATIO).  The control variables are change in the proportion of the old (PROPOLD) and change in population (POP).  The following exhibits show Yule's (1899) original results and the replication using the elaboration model strategy, in which each control variable is added to the model in turn.
The first model, a simple regression of PAUP on OUTRATIO, shows a significant positive effect of OUTRATIO on PAUP.  In the second model, which includes PROPOLD in addition to OUTRATIO, the coefficient of OUTRATIO remains positive and highly significant.  PROPOLD also has a positive effect on PAUP, significant at the .05 level, suggesting that an increasing proportion of old people in a union is associated with increasing pauperism when OUTRATIO is kept constant.  In the third model POP is introduced as a third independent variable.  OUTRATIO remains positive and highly significant, and POP has a negative and highly significant effect on PAUP (suggesting that metropolitan unions with growing populations had decreasing rates of pauperism), but the effect of PROPOLD has now vanished (i.e., it has become non-significant).  A non-significant regression coefficient implies that the corresponding independent variable can be safely dropped from the model.
The fourth and final model is often called a trimmed model.  It is estimated after removing non-significant variables from the full model (PROPOLD in this case).  (When the full model contains several non-significant variables, one should test the joint significance of these variables before removing them to estimate the trimmed model.  See Module 8.)
From the point of view of the elaboration model, OUTRATIO comes out of it with flying colors, since the original positive association with PAUP has remained large and significant despite the introduction of the "test variables" PROPOLD and POP.  Thus the analysis has shown that the effect of OUTRATIO on PAUP is not a spurious association due to the common association of PAUP and OUTRATIO with either PROPOLD or POP.  However, it is never possible to exclude the possibility that some other factor, unsuspected and/or unmeasured, may be generating a spurious effect of OUTRATIO on PAUP.  As Yule (1899:251) concludes:
There is still a certain chance of error depending on the number of factors correlated both with pauperism and with proportion of out-relief which have been omitted, but obviously this chance of error will be much smaller than before.
The following table shows how the results of a regression analysis can be presented in a table in a way that emphasizes the elaboration model logic of the analysis.  In fact this is often the way regression results are presented in professional publications.  (Although the analysis can be simplified by introducing test variables in groups rather than singly; thus Model 2 in the table below might be omitted to save space.)
 
Table 1.  Unstandardized Regression Coefficients for Models of Change in Pauperism on Selected Independent Variables: 32 London Metropolitan Unions, 1871-1881 (t-ratios in parentheses)

Independent variable                    Model 1      Model 2      Model 3      Model 4
Constant                               31.089***    -27.822      63.188*      69.659***
                                       (5.840)      (-1.132)     (2.328)      (9.065)
Change in proportion of out-relief       .765***      .718***      .752***      .756***
                                       (4.045)      (4.075)      (5.572)      (5.736)
Change in proportion of the old           --           .606*        .056          --
                                                     (2.446)       (.249)
Change in population                      --            --        -.311***     -.320***
                                                                 (-4.648)     (-5.730)
R2                                       .353          .464         .697         .697
Adjusted R2                              .331          .427         .665         .676

Note:  * p < .05  ** p < .01  *** p < .001  (2-tailed tests)

3.  The Mechanism of Specification Bias aka Spuriousness

The mechanism of spuriousness, aka specification bias, is presented graphically in the context of the D-score example (see Appendix A).  The algebra of specification bias is sketched below.  Although spuriousness often creates the appearance of a significant effect where none exists in reality, spuriousness may also create the appearance of no effect where there is an effect in reality.
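The original exhibit with the algebra is not reproduced here; the following sketch of the standard omitted-variable result, written in the notation of this module, conveys the idea (it is a generic textbook derivation, not Yule's own).  Suppose the true model is
Y = b0 + b1X1 + b2X2 + e
but the fitted model omits X2 and regresses Y on X1 alone.  The expected value of the simple-regression slope is then
E{b1(simple)} = b1 + b2 d21
where d21 = S(Xi1 - X1.)(Xi2 - X2.)/S(Xi1 - X1.)2 is the slope of the auxiliary regression of X2 on X1.  The bias term b2 d21 vanishes only if X2 has no effect on Y (b2 = 0) or is unassociated with X1 (d21 = 0).  Depending on the signs of b2 and d21, the bias can inflate, deflate, or even cancel the true effect b1, as in the alcohol example below.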
Example: As discussed in an article in Scientific American (February 2003), it is now known that drinking alcohol lowers the risk of coronary heart disease by reducing the deposit of plaque in the arteries.  For a long time the beneficial effect of alcohol in reducing the risk of disease was overlooked because alcohol consumption is associated with smoking, which increases the risk of coronary heart disease.  In early studies (which did not properly control for smoking behavior) the effect of alcohol consumption was non-significant, because the negative (beneficial) direct effect on risk was cancelled out by the positive (detrimental) component corresponding to the product of the positive correlation between alcohol consumption and smoking times the positive effect of smoking on risk.  Thus the non-significant bivariate association of alcohol consumption with risk of coronary heart disease was a spurious non-effect.

3.  The Multiple Regression Model in General

1.  Multiple Regression Model with p - 1 Independent Variables

The multiple linear regression model with p - 1 independent variables can be written
Yi = b0 + b1Xi1 + b2Xi2 + ... + bp-1Xi,p-1 + ei     i = 1, ..., n
where
Yi is the response for the ith case
Xi1, Xi2, ..., Xi,p-1 are the values of the p - 1 independent variables for the ith case, assumed to be known constants
b0, b1, ..., bp-1 are parameters
ei are independent errors ~ N(0, s2)
(The independent variables are indexed 1 to p - 1 so that, counting the implicit column of 1s associated with the intercept b0, the total number of X columns is equal to p.)
The interpretation of the parameters is
  1. b0, the Y intercept, indicates the mean of the distribution of Y when X1 = X2 = ... = Xp-1 = 0
  2. bk (k = 1, 2, ..., p - 1) indicates the change in the mean response E{Y} (measured in Y units) when Xk increases by one unit while all the other independent variables remain constant
  3. s2 is the common variance of the distribution of Y
The  bk are sometimes called partial regression coefficients, but more often just regression coefficients, or unstandardized regression coefficients (to distinguish them from standardized coefficients discussed below.)  Mathematically, bk corresponds to the partial derivative of the response function with respect to Xk
dE{Y}/dXk = bk
Defining y and e as before, b = [b0 b1 ... bp-1], and

          | 1    X11    X12    ...    X1,p-1 |
   X  =   | 1    X21    X22    ...    X2,p-1 |
          | ...  ...    ...    ...    ...    |
          | 1    Xn1    Xn2    ...    Xn,p-1 |

the regression model for the entire data set can be written

y = Xb + e
Since in the model the errors e have mean 0, it follows that the random vector y has expectation
E{y} = E{Xb + e} = Xb
and the variance-covariance matrix of Y is the same as that of e, so that
s2{y} = E{(y - E{y})(y - E{y})'} = E{ee'} = s2I
E{y} = Xb is called the response function.  The response function can also be written long hand
E{y} = b0 + b1X1  + b2X2 + ...  + bp-1Xp-1
When the X's all represent distinct predictors (rather than powers or products of other X's), the model is called the first-order model with p - 1 variables.
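To make the matrix form concrete, the following minimal Python/numpy sketch (not part of the original notes) builds a design matrix X with a leading column of 1s and generates observations from a first-order model with two predictors; the sample size, coefficient values, and error standard deviation are invented for illustration.

import numpy as np

rng = np.random.default_rng(0)
n = 32                                      # number of cases (illustrative)
x1 = rng.normal(50, 10, size=n)             # hypothetical predictor X1
x2 = rng.normal(100, 20, size=n)            # hypothetical predictor X2

X = np.column_stack([np.ones(n), x1, x2])   # n x p design matrix (p = 3)
beta = np.array([70.0, 0.75, -0.30])        # illustrative parameter vector b = [b0, b1, b2]
e = rng.normal(0, 10, size=n)               # independent N(0, s2) errors

y = X @ beta + e                            # y = Xb + e
print(X.shape, y.shape)                     # (32, 3) (32,)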

2.  Geometry of the First Order Multiple Regression Model

The response function (also called regression function or response surface) defines a hyperplane in p-dimensional space.  When there are only 2 predictor variables (besides the constant) the response surface is a plane.

Example:  In the trimmed model of change in pauperism estimated from the Yule data (Model 4) the response function E{Y} is a function of two variables, with estimated response function

estimated E{Y} = ^Y = 69.659 + 0.756X1 - 0.320X2
where Y is PAUP (change in pauperism), X1 is OUTRATIO (change in proportion of out-relief), and X2 is POP (change in population).
b1 = 0.756 means that, irrespective of the value of X2, increasing X1 by 1 percentage point increases the estimated mean of Y by 0.756 percentage points.  The parameter b2 is interpreted similarly.
In a first-order model such as this the effect of a variable does not depend on the values of the other variables.  The effects are therefore called additive or non-interactive.  The response function is a plane.  For example, if X2 = 150 it follows that
estimated E{Y} = 69.659 + 0.756X1 - (0.320)(150) = 21.659 + 0.756X1
which is a straight line.  For any given value of X2 the value of Y as a function of X1 corresponds to a straight line with constant slope .756.  Likewise, for any given value of X1 the relation between Y and X2 is a straight line with constant slope -.320.  When there are more than 2 independent variables (in addition to the constant) the regression function is a hyperplane and can no longer be visualized in 3-dimensional space.

3.  (Optional) Alternative Geometry for First-Order Multiple Regression Model

There is an alternative geometry for multiple regression that represents the problem in n-dimensional space, where n is the number of observations.  Then the vector y of observations on the dependent variable and each vector xk of observations on an independent variable correspond to points in that n-dimensional space.  In that representation, OLS estimates the perpendicular projection of the vector y on the subspace "spanned" by the vectors xk.

4.  Elements of the Regression Model

1.  Example - Full Model (Model 3) For the Yule Data

To illustrate a typical multiple regression analysis we use the example of Yule's full model
PAUP = b0 + b1OUTRATIO + b2PROPOLD + b3POP + ei
The variables are defined as
(y) PAUP, Change in pauperism
(x1) OUTRATIO, Change in proportion of out-relief
(x2) PROPOLD, Change in proportion of the old
(x3) POP, Change in population

2.  Correlation Matrix and Splom

The simple correlation coefficients among variables in the multiple regression model are often presented in the form of a matrix.
The correlations can also be presented graphically in a corresponding scatterplot matrix, or splom.  As presented in the next exhibit, the dependent variable (PAUP) is listed last, so the correlations involving it appear together on the bottom row of the splom, with each panel showing the dependent variable on the vertical axis.  The splom uses the HALF option so that only one panel is shown for each correlation, to reduce the visual clutter.

3.  Estimated Regression Function ^y

The estimated regression function for the multiple regression model with p - 1 variables is
^y = b0 + b1x1 + ... + bp-1xp-1
where b0, b1, ..., bp - 1 are estimated as the solution of the ordinary least squares normal equations
X'Xb = X'Y
or
b = (X'X)-1X'Y
as derived in Module 4.
The variance-covariance matrix of b is estimated as
s2{b} = MSE(X'X)-1
The standard error of each estimated coefficient bk is the square root of the corresponding diagonal element of s2{b}, so that s{b0} is in position (1,1), s{b1} in position (2,2), ..., and s{bp-1} in position (p,p).
On the standard multiple regression printout the estimated coefficients bk are presented together with the estimated standard errors s{bk} and the t-ratios t* = bk/s{bk} (discussed below).
Example: Results in Table 2 show that, keeping the other variables in the model constant, the estimated coefficient for OUTRATIO is 0.752, so that an increase of 1 unit in OUTRATIO is associated with an increase of 0.752 units in PAUP.  The standard error of the coefficient of OUTRATIO is 0.135, and the t-ratio is given as 0.752/0.135 = 5.572.  (Significance of the coefficient is discussed below.)
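The following numpy sketch (an illustration, not the course's SYSTAT or Stata workflow) applies the formulas above -- b = (X'X)-1X'Y, s2{b} = MSE(X'X)-1, and t* = bk/s{bk} -- to simulated data with invented coefficient values.

import numpy as np

rng = np.random.default_rng(1)
n, p = 32, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])   # constant + 3 predictors
y = X @ np.array([60.0, 0.75, 0.05, -0.30]) + rng.normal(0, 9, size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)     # solves the normal equations X'Xb = X'y
resid = y - X @ b
mse = resid @ resid / (n - p)             # MSE = SSE/(n - p)
cov_b = mse * np.linalg.inv(X.T @ X)      # estimated variance-covariance matrix s2{b}
se_b = np.sqrt(np.diag(cov_b))            # standard errors s{bk}
t_ratios = b / se_b                       # t-ratios t* = bk/s{bk}

print(np.round(b, 3))
print(np.round(se_b, 3))
print(np.round(t_ratios, 2))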

Table 2.  SYSTAT Regression Printout for Yule's Full Model (Model 3)

Dep Var: PAUP   N: 32   Multiple R: 0.835   Squared multiple R: 0.697
Adjusted squared multiple R: 0.665   Standard error of estimate: 9.547

Effect         Coefficient    Std Error     Std Coef Tolerance     t   P(2 Tail)
CONSTANT           63.188       27.144        0.000      .       2.328    0.027
OUTRATIO          0.752        0.135        0.584     0.985    5.572    0.000
PROPOLD           0.056        0.223        0.031     0.711    0.249    0.805
POP              -0.311        0.067       -0.570     0.719   -4.648    0.000

Analysis of Variance
Source             Sum-of-Squares   df  Mean-Square     F-ratio       P
Regression              5875.320     3     1958.440      21.488       0.000
Residual                2551.899    28       91.139

-------------------------------------------------------------------------------
*** WARNING ***
Case           15 has large leverage   (Leverage =        0.424)
Case           30 is an outlier        (Studentized Residual =        3.618)

Durbin-Watson D Statistic          2.344
First Order Autocorrelation       -0.177


4.  Analysis of Variance (ANOVA)

1.  Fitted Values ^Yi
The fitted values ^yi are defined in a way analogous to simple regression as
^yi = b0 + b1xi1 + ... + bp-1xi,p-1
or
^y = Xb
where ^y is an n x 1 vector of fitted values.  Note that ^yi is a single number associated with each case, regardless of the number p - 1 of independent variables in the model.
2.  Sums of Squares
As shown in Module 4, the sums of squares are defined identically in simple and multiple regression, as
SSTO = S(Yi - Y.)2
SSE = S(Yi - ^Yi)2
SSR = S(^Yi - Y.)2
with the relation
SSTO = SSR + SSE
3.  Degrees of Freedom
As shown in Module 4, the degrees of freedom (df) associated with the various sums of squares are
SSTO has n - 1 df; 1 df is lost because the sample mean is estimated from the data (same as before)
SSE has n - p df; the n residuals ei = Yi - ^Yi are calculated using p parameters b0, b1, ..., bp-1 estimated from the data
SSR has p - 1 df; there are p estimated parameters b0, b1, ..., bp-1 used to calculate the ^Yi, minus 1 df associated with a constraint on the sum of the fitted values (see Module 4 and NWW p. 604)
4.  Mean Squares
Mean squares are sums of squares divided by their respective degrees of freedom (df).
In particular, MSE = SSE/(n - p) is again the estimate of s2, the common variance of e and of Y.
5.  ANOVA Table
Analysis of variance results are summarized in an ANOVA table analogous to the one for simple regression.  Table 3a shows the general format of the ANOVA table and Table 3b shows the table for Yule's Model 3 example.
 
Table 3a.  General Format of ANOVA Table for Multiple Regression

Source of variation    SS                    df       MS                   F Ratio
Regression             SSR = S(^Yi - Y.)2    p - 1    MSR = SSR/(p - 1)    F* = MSR/MSE
Error                  SSE = S(Yi - ^Yi)2    n - p    MSE = SSE/(n - p)
Total                  SSTO = S(Yi - Y.)2    n - 1    sY2 = SSTO/(n - 1)

Table 3b.  ANOVA Table for Yule's Model 3 Example

Source of variation    SS                 df    MS                F Ratio
Regression             SSR = 5875.320      3    MSR = 1958.440    F* = 21.488
Error                  SSE = 2551.899     28    MSE = 91.139
Total                  SSTO = 8427.219    31    sY2 = 271.846

Table 3a and Table 3b also show the calculation of the F-ratio or F-statistic F* = MSR/MSE.  The interpretation of F* is discussed below.
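A minimal numpy sketch (simulated data with invented coefficients, not Yule's) that reproduces the ANOVA decomposition from the definitions above:

import numpy as np

rng = np.random.default_rng(2)
n, p = 32, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([60.0, 0.75, 0.05, -0.30]) + rng.normal(0, 9, size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
yhat = X @ b                               # fitted values ^y = Xb

ssto = np.sum((y - y.mean()) ** 2)         # SSTO, n - 1 df
sse = np.sum((y - yhat) ** 2)              # SSE,  n - p df
ssr = np.sum((yhat - y.mean()) ** 2)       # SSR,  p - 1 df
assert np.isclose(ssto, ssr + sse)         # SSTO = SSR + SSE

msr = ssr / (p - 1)                        # MSR
mse = sse / (n - p)                        # MSE, the estimate of s2
f_star = msr / mse                         # F* = MSR/MSE
print(round(f_star, 3))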

5.  Coefficient of Multiple Determination R2

1.  Coefficient of Multiple Determination R2
The coefficient of multiple determination R2 is defined analogously to the simple regression r2 as
R2 = SSR/SSTO = 1 - (SSE/SSTO)
where
0 <= R2 <= 1
Example: in Yule's Model 3
R2 = SSR/SSTO = 5875.320/8427.219 = 0.697
as shown on the printout of Table 2.
2.  Coefficient of Multiple Correlation
The coefficient of multiple correlation R is the positive square root of R2
R = +(R2)1/2
so that R is always positive (0 <= R <= 1).
Q - Why is R always positive in the multiple regression context, while the simple correlation r can vary between -1 and +1?
Example: in Yule's Model 3, R = (0.697)1/2 = 0.835.

R can also be interpreted as the correlation of y with the fitted value ^y.

3.  Adjusted R-Square Ra2
The adjusted coefficient of multiple determination Ra2 adjusts for the number of independent variables in the model (to correct the tendency of R2 to always increase when independent variables are added to the model).  It is calculated as
R2a = 1 - ((n-1)/(n-p))(SSE/SSTO) = 1 - MSE/(SSTO/(n - 1))
R2a can be interpreted as 1 minus the ratio of the variance of the errors (MSE) to the variance of y, SSTO/(n-1).
Example: In Yule's Model 3 the adjusted r-square R2a is
1 - ((32 - 1)/(32 - 4))(2551.899/8427.219) = .665
as contrasted with the ordinary (unadjusted) R2 = .697

5.  Inference for Entire Model - F Test for Regression Relation

The F test for regression relation (aka screening test) tests the existence of a relation between the dependent variable and the entire set of independent variables.  The test involves the hypothesis setup
H0: b1= b2 = ... = bp-1= 0
H1: Not all bk = 0  k = 1, 2,..., p - 1
The test statistic is (same as for simple linear regression)
F* = MSR/MSE
which is distributed as F(p - 1, n - p), with numerator and denominator df corresponding to those of MSR and MSE, respectively.

Using the P-value method, calculate the P-value P{F(p - 1; n - p) > F*}.
Choose a significance level a.
Then the decision rule is

if P-value < a conclude H1 (not all coefficients = 0 so there is a significant statistical relation)
if P-value >= a conclude H0 (there is no significant statistical relation)
Using the decision theory method, choose a significance level a.
Calculate the critical value F(1 - a; p - 1, n - p).
Then the decision rule is
if F* <= F(1 - a; p - 1, n - p), conclude H0
if F* > F(1 - a; p - 1, n - p), conclude H1
Example: In Yule's Model 3
F* = 1958.440/91.139 = 21.488
with p - 1 = 3 and n - p = 28 df (see Table 3b).
Using the P-value method, P{F(3, 28) > 21.488} = .000000.  Choose a = .05.  Since P-value = .000000 < .05 = a, conclude H1, that not all regression coefficients are 0.
Using the decision theory method, choose a = .05.  Find F(0.95; 3, 28) = 2.947.  Since F* = 21.488 > 2.947, conclude H1, that not all regression coefficients are 0 with this method also.
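Both decision methods can be carried out with any F-distribution routine; here is a short scipy.stats sketch using the Model 3 numbers.

from scipy import stats

f_star = 21.488                         # F* = MSR/MSE from Table 3b
df1, df2 = 3, 28                        # p - 1 and n - p

p_value = stats.f.sf(f_star, df1, df2)  # P{F(3, 28) > F*}; prints as 0.000000 at 6 decimals
f_crit = stats.f.ppf(0.95, df1, df2)    # critical value F(0.95; 3, 28), about 2.95

print(p_value)
print(round(f_crit, 3))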

6.  Inference for Individual Regression Coefficients

Statistical inference on individual regression coefficients bk is carried out in the same way as for simple regression, except that the t tests are now based on the Student t distribution with n - p df (corresponding to the n - p df associated with MSE), instead of the n - 2 df of the simple regression model.

1.  Hypothesis Tests for bk

1. Two-Sided Tests
The most common tests concerning bk involve the null hypothesis that bk = 0.
The alternatives are
H0: bk = 0
H1: bk <> 0
The test statistic is
t* = bk/s{bk}
where s{bk} is the estimated standard deviation of bk.
When bk = 0, t* ~ t(n - p).

Example: Test that the coefficient of OUTRATIO is different from 0.  The hypotheses are

H0: b1 = 0
H1: b1 <> 0
The test statistic (aka "t ratio") is
t* = b1/s{b1} = 0.752/0.135 = 5.572 (provided on printout under "T")
When b1= 0, t* is distributed as t(n - p) = t(28).
Using the P-value method, find the 2-tailed P-value P{|t(28)| > |5.572|} = (2)P{t(28) > 5.572} = 0.000006.
Choose significance level a = .05.
Since P-value = 0.000006 < 0.05 = a, conclude H1, that b1 <> 0.
Using the decision theory method, choose significance level, say a = 0.05.  The critical value t(0.975; 28) = 2.048.
Since |t*| = |5.572| > 2.048, conclude H1, that b1 <> 0, by this method also.
2.  One-Sided Tests
One-sided tests for a coefficient bk are carried out by dividing the 2-sided P-value by 2 (provided the estimate falls in the direction specified by H1), as before.
Example: Test that the coefficient of OUTRATIO is positive.  The hypotheses are
H0: b1 <= 0
H1: b1 > 0
Using the P-value method, find the 1-tailed P-value P{t(28) > 5.572} = 0.000006/2 = 0.000003.
Thus conclude H1, that b1 > 0.
Thus a 1-sided test is "easier" (more likely to yield a significant result) than a 2-sided test, as before.
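The two-sided and one-sided P-values and the critical value can be obtained from any t-distribution routine; a short scipy.stats sketch with the OUTRATIO numbers:

from scipy import stats

t_star = 5.572                                  # t* = b1/s{b1} for OUTRATIO
df = 28                                         # n - p

p_two_sided = 2 * stats.t.sf(abs(t_star), df)   # P{|t(28)| > 5.572}, about 0.000006
p_one_sided = stats.t.sf(t_star, df)            # P{t(28) > 5.572}, about 0.000003
t_crit = stats.t.ppf(0.975, df)                 # t(0.975; 28), about 2.048

print(round(p_two_sided, 6), round(p_one_sided, 6), round(t_crit, 3))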
 

2.  Confidence Interval for bk

1.  Construction of CI for bk
The 1 - a confidence limits for a coefficient bk of a multiple regression model are given by
bk -/+ t(1 - a/2; n - p)s{bk}
where s{bk} is the estimated standard deviation of bk and is provided on the standard regression printout next to bk under the Std Error heading.

Example:  For Yule's Model 3, calculate a 95% CI for the coefficient of OUTRATIO (x1).  The ingredients are

b1 = 0.752; s{b1} = 0.135; n = 32; p = 4; a = .05
Calculate n - p = 28 and t(0.975, 28) = 2.048.  Thus the confidence limits are
L =  0.752 - (2.048)(0.135) = 0.475
U =  0.752 + (2.048)(0.135) = 1.029
In other words one can say that with 95% confidence
0.475 <= b1 <= 1.029
One can say that, with 95% confidence, the increase in PAUP associated with an increase of 1 unit in OUTRATIO is between 0.475 and 1.029 percentage points.
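A short scipy.stats sketch of this calculation, following the formula above:

from scipy import stats

b1, se_b1 = 0.752, 0.135                     # coefficient and standard error of OUTRATIO (Model 3)
n, p, alpha = 32, 4, 0.05

t_crit = stats.t.ppf(1 - alpha / 2, n - p)   # t(0.975; 28), about 2.048
lower = b1 - t_crit * se_b1                  # about 0.475
upper = b1 + t_crit * se_b1                  # about 1.029
print(round(lower, 3), round(upper, 3))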
2.  Equivalence of CI and 2-sided Test
The (1-a) CI for bk and 2-sided hypothesis test on bk are equivalent in the sense that if the (1-a) CI for bk does not include 0, bk is significant at the a-level in a 2-sided test.

7.  CI for E{Yh}

It is often important to estimate the mean response E{Yh} for given values of the independent variables.
The values of the independent variables for which E{Yh} is to be estimated are denoted
Xh1, Xh2, ..., Xh, p - 1
(This set of values of the X variables may or may not correspond to one of the cases in the data set.)
The estimator of E{Yh} is
^Yh = b0 + b1Xh1 + b2Xh2 + ... + bp - 1Xh, p - 1
The 1 - a confidence limits for the mean response E{Yh} are then given by
^Yh  -/+ t(1 - a/2; n - p)s{^Yh}
where s{^Yh} is the estimated standard deviation of ^Yh.
The standard error s{^Yh} of ^Yh is estimated as (Module 4)
s{^Yh} = (MSE(Xh'(X'X)-1Xh))1/2
s{^Yh} can be obtained from a statistical program using the technique explained in the next example.

Example: In Yule's Model 3 one can obtain the predicted value ^Yh for PAUP and its estimated standard error s{^Yh} by adding to the data set a "dummy" case with the chosen Xhk values for the independent variables, and a missing value for the dependent variable.  (This is only necessary if the combination of values in Xh does not correspond to an existing case in the data set.)  To do this using SYSTAT, go to the data window and add a case (row) to the data set with PAUP = ., OUTRATIO = 20, PROPOLD = 100, POP = 100.  The ID number for the new case is 33.  Then run the regression model and save the residuals.  Open the file of residuals.  The desired quantities are given for case 33 as

^Yh = ESTIMATE = 52.716
s{^Yh} = SEPRED = 2.196
STATA commands are
predict yhat, xb
predict syhat, stdp
Choosing a = 0.05, the 95% confidence limits for E{Yh} are then calculated as
L = 52.716 - (2.048)(2.196) = 48.219
U = 52.716 + (2.048)(2.196) = 57.213
where 2.048 is t(0.975; 28).
One can then say that, for a metropolitan union with these values of the independent variables, the mean change in pauperism is estimated to be between 48.219 and 57.213 with 95% confidence.
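The same interval can be computed directly from the matrix formula s{^Yh} = (MSE(Xh'(X'X)-1Xh))1/2.  The numpy sketch below uses simulated data and an invented Xh vector, since it is meant only to show the mechanics.

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, p = 32, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([60.0, 0.75, 0.05, -0.30]) + rng.normal(0, 9, size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
mse = np.sum((y - X @ b) ** 2) / (n - p)

xh = np.array([1.0, 0.5, -1.0, 0.3])                       # chosen Xh values (leading 1 for the constant)
yhat_h = xh @ b                                            # ^Yh
se_mean = np.sqrt(mse * xh @ np.linalg.inv(X.T @ X) @ xh)  # s{^Yh}

t_crit = stats.t.ppf(0.975, n - p)                         # t(0.975; n - p)
print(round(yhat_h - t_crit * se_mean, 3),                 # lower 95% limit for E{Yh}
      round(yhat_h + t_crit * se_mean, 3))                 # upper 95% limit for E{Yh}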

8.  Prediction Interval for Yh(new)

Given a new observation with values Xh of the independent variables, the predicted value Yh(new) is estimated as ^Yh, the same as for the mean response.  But the variance s2{pred} of the prediction error is different.  The expression for s2{pred} combines the sampling variance of the mean response, estimated as s2{^Yh}, and the variance of individual observations around the mean response, estimated as MSE, so that
s2{pred} = MSE + s2{^Yh} = MSE + MSE Xh'(X'X)-1Xh
Thus the standard error s{pred} is obtained as
s{pred} = (MSE + s2{^Yh})1/2 = (MSE + MSE Xh'(X'X)-1Xh)1/2
STATA command is
predict spred, stdf
Then the 1 - a prediction interval for Yh(new) corresponding to Xh is
^Yh +/- t(1 - a/2; n - p) s{pred}
Example: For Yule's Model 3, calculate a 95% prediction interval for PAUP for a new union with the same combination of Xh values as in the previous section.  Thus ^Yh = 52.716, the same as above.  s2{pred} is estimated as
s2{pred} = MSE + s2{^Yh} = 91.139 + (2.196)2 = 95.961
so that
s{pred} = (95.961)1/2 = 9.796


With a = 0.05, the 95% prediction limits for Yh(new) are then calculated as

L = 52.716 - (2.048)(9.796) = 32.654
U = 52.716 + (2.048)(9.796) = 72.778
where 2.048 is t(0.975; 28).
Note how much wider the prediction interval for Yh(new) is (32.654, 72.778) compared to the interval for ^Yh (48.219, 57.213).
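A sketch of the prediction-interval arithmetic with the numbers given above (scipy.stats supplies the t quantile):

import math
from scipy import stats

mse = 91.139                                # MSE from the ANOVA table for Model 3
yhat_h = 52.716                             # ^Yh from the previous section
se_mean = 2.196                             # s{^Yh} (SEPRED)

se_pred = math.sqrt(mse + se_mean ** 2)     # s{pred}, about 9.796
t_crit = stats.t.ppf(0.975, 28)             # t(0.975; 28), about 2.048
print(round(yhat_h - t_crit * se_pred, 2),  # about 32.65
      round(yhat_h + t_crit * se_pred, 2))  # about 72.78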

(See NKNW p. 235 for inference in predicting the mean of m new observations or predicting g new observations with the Bonferroni approach.)

9.  Other Elements of the Multiple Regression Printout

Two additional elements of the standard regression output become relevant in the multiple-regression context.

1.  Standardized Regression Coefficients

The standardized regression coefficient  bk* is  calculated as:
bk*  =  bk(s(Xk)/s(Y))
where s(Xk) and s(Y) denote the sample standard deviations of Xk and Y, respectively.
Thus the standardized coefficient bk* is calculated as the original (unstandardized) regression coefficient bk multiplied by the ratio of the standard deviation of Xk to the standard deviation of Y.
Conversely, one can recover the unstandardized coefficient from the standardized one as
bk  =   bk*(s(Y)/s(Xk))
The standardized coefficient bk* measures the change in Y, in standard deviation units of Y, associated with an increase of one standard deviation in Xk, the other variables being kept constant.
Standardized coefficients permit comparisons of the relative strengths of the effects of different independent variables, measured in different metrics (= units).
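A numpy sketch (simulated data with invented coefficients) showing the conversion from unstandardized to standardized coefficients:

import numpy as np

rng = np.random.default_rng(4)
n = 32
X = np.column_stack([np.ones(n),
                     rng.normal(50, 10, size=n),     # hypothetical predictor X1
                     rng.normal(100, 20, size=n)])   # hypothetical predictor X2
y = X @ np.array([70.0, 0.75, -0.30]) + rng.normal(0, 9, size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)                # unstandardized coefficients

# bk* = bk * s(Xk)/s(Y): rescale each slope by the ratio of sample standard deviations
b_star = b[1:] * X[:, 1:].std(axis=0, ddof=1) / y.std(ddof=1)
print(np.round(b[1:], 3), np.round(b_star, 3))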

Example:  The SYSTAT output for Yule's Model 3 (Table 2) lists the standardized coefficients in the column headed Std Coef as
 

OUTRATIO .584
PROPOLD .031
POP -.570

The coefficient of OUTRATIO means that a change of one standard deviation unit in OUTRATIO is associated with a change of .584 standard deviations of PAUP.  The other coefficients are interpreted similarly.  The coefficients show that the effects of OUTRATIO and POP are strong and of comparable magnitude, although they are in opposite directions (.584 and -.570) and that the effect of PROPOLD is negligible (.031).

The following exhibit discusses alternative standardizations of regression coefficients.

2.  Tolerance or Variance Inflation Factor

The standard multiple regression output often provides a diagnostic measure of the collinearity of a predictor with the other predictors in the model, either the tolerance (TOL) or the variance inflation factor (VIF).
1.  Tolerance (TOL)
TOL = 1 - Rk2
where Rk2 is the R-square of the regression of Xk on the other p - 2 predictors in the model plus a constant.  TOL can vary between 0 and 1; a common rule of thumb is that
TOL < .1
is an indication that collinearity may unduly influence the results.
2.  Variance Inflation Factor
VIF = (TOL)-1 = (1 - Rk2)-1
The variance inflation factor is the inverse of the tolerance.  Large values of VIF therefore indicate a high level of collinearity.
The corresponding rule of thumb is that
 VIF > 10
is an indication that collinearity may unduly influence the results.
Collinearity is discussed further in Module 11.

Example: In the SYSTAT output for Yule's Model 3 (Table 2), TOL values are given in the column headed Tolerance.  TOL values are .985, .711, and .719 for OUTRATIO, PROPOLD, and POP, respectively.  The smallest TOL value is thus well above the 0.1 cutoff, so one concludes there is no collinearity problem in this regression model.  The same conclusion is obtained considering the corresponding values of VIF (calculated as 1/TOL): 1.015, 1.406, and 1.391, which are well below the cutoff of 10.
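TOL and VIF can also be computed from the auxiliary regressions of each predictor on the others; here is a numpy sketch with simulated (deliberately correlated) predictors.

import numpy as np

rng = np.random.default_rng(5)
n = 32
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)             # made correlated with x1 on purpose
x3 = rng.normal(size=n)
predictors = np.column_stack([x1, x2, x3])

for k in range(predictors.shape[1]):
    xk = predictors[:, k]
    others = np.delete(predictors, k, axis=1)
    Z = np.column_stack([np.ones(n), others])  # constant + the other predictors
    coef = np.linalg.solve(Z.T @ Z, Z.T @ xk)  # auxiliary regression of Xk on the others
    r2_k = 1 - np.sum((xk - Z @ coef) ** 2) / np.sum((xk - xk.mean()) ** 2)
    tol = 1 - r2_k                             # TOL = 1 - Rk2
    vif = 1 / tol                              # VIF = 1/TOL
    print(f"X{k + 1}: TOL = {tol:.3f}  VIF = {vif:.3f}")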

10.  The General Linear Model

The term general linear model is used for multiple regression models that include variables other than first powers of different predictors.  The X variables can also represent polynomial terms, categorical (dummy) variables, and mathematical transformations of other variables.  We look at these options in the next modules.

Appendix A.  An Example of Spurious Association: The D-Score Data

The D-score data (Koopmans 1987) illustrate how a spurious association can be elucidated using multiple regression analysis.
A test of cognitive development is administered to a sample of 12 children with ages ranging from 3 to 10.  The cognitive development score is called D-score.  The simple regression of D-score on sex is carried out.  Sex is represented by the variable BOY (coded Boy 1, Girl 0).  The regression reveals a significant positive effect of BOY on D-score: boys score significantly higher than girls (P-value = 0.039).

Table A1.  Simple Regression Analysis of the D-Score Data Set

Example from Koopmans, Lambert.  1987.  Introduction to Contemporary Statistical Methods.  (2d edition.)  PWS-Kent.  Pp. 554-557.

Data
 Case number          OBS       DSCORE          AGE          BOY         BOY$
        1            1.000        8.610        3.330        0.000 G
        2            2.000        9.400        3.250        0.000 G
        3            3.000        9.860        3.920        0.000 G
        4            4.000        9.910        3.500        0.000 G
        5            5.000       10.530        4.330        1.000 B
        6            6.000       10.610        4.920        0.000 G
        7            7.000       10.590        6.080        1.000 B
        8            8.000       13.280        7.420        1.000 B
        9            9.000       12.760        8.330        1.000 B
       10           10.000       13.440        8.000        0.000 G
       11           11.000       14.270        9.250        1.000 B
       12           12.000       14.130       10.750        1.000 B

Pearson Correlation Matrix

                    DSCORE          AGE          BOY
 DSCORE              1.000
 AGE                 0.957        1.000
 BOY                 0.600        0.647        1.000

Simple Linear Regression

Dep Var: DSCORE   N: 12   Multiple R: 0.600   Squared multiple R: 0.360
Adjusted squared multiple R: 0.296   Standard error of estimate: 1.671

Effect         Coefficient    Std Error     Std Coef Tolerance     t   P(2 Tail)
CONSTANT            10.305        0.682        0.000      .      15.109    0.000
BOY                  2.288        0.965        0.600     1.000    2.372    0.039

Analysis of Variance
Source             Sum-of-Squares   df  Mean-Square     F-ratio       P
Regression                15.709     1       15.709       5.629       0.039
Residual                  27.910    10        2.791

-------------------------------------------------------------------------------
*** WARNING ***
Case           10 is an outlier        (Studentized Residual =        2.566)

Durbin-Watson D Statistic          1.183
First Order Autocorrelation        0.315


However, a symbolic plot of D-score against age, using symbols to identify sex (B = Boy, G = Girl), reveals a systematic pattern.

Q - What is the pattern in the following figure?


A multiple regression analysis was then carried out, with D-score as the dependent variable and both BOY and AGE as independent variables.
The results are shown in Table A2.  This time the effect of BOY becomes non-significant (P-value is 0.799); the effect of AGE on D-score is strongly significant.  One concludes that the significant effect of sex (represented by the variable BOY) in the first regression was spurious.  It was a consequence of the (accidental) association in the sample between age and sex, i.e., the tendency (visible in the scatterplot) for boys to be older than girls, combined with the strong effect of age on D-score.  Introducing ("controlling for") age in the model has eliminated the spurious effect of sex on cognitive development.

Table  A2.  Multiple Regression of D-Score on BOY and AGE

Dep Var: DSCORE   N: 12   Multiple R: 0.958   Squared multiple R: 0.917
Adjusted squared multiple R: 0.899   Standard error of estimate: 0.634

Effect         Coefficient    Std Error     Std Coef Tolerance     t   P(2 Tail)
CONSTANT             6.927        0.506        0.000      .      13.697    0.000
BOY                 -0.126        0.480       -0.033     0.581   -0.262    0.799
AGE                  0.753        0.097        0.979     0.581    7.775    0.000

Analysis of Variance
Source             Sum-of-Squares   df  Mean-Square     F-ratio       P
Regression                40.002     2       20.001      49.765       0.000
Residual                   3.617     9        0.402

-------------------------------------------------------------------------------
Durbin-Watson D Statistic          2.277
First Order Autocorrelation       -0.313


Appendix B.  Standard Tabular Presentation of Regression Results

1.  Standard Presentation

The standard journal presentation of multiple regression results is aimed in part at facilitating the elaboration model by examining the effect of introducing a new "test" variable in the model.
The following table presents the results of the regression analysis of the D-score data in standard tabular format.
 
Table B1.  Unstandardized Regression Coefficients of Cognitive Development (D-score) on Sex and Age for 12 Children Aged 3 to 10 (t Ratios in Parentheses)

Independent variable      Model 1      Model 2
Constant                 10.305***     6.927***
                         (15.109)     (13.697)
Boy (boy=1, girl=0)       2.288*        -.126
                          (2.372)       (-.262)
Age (years)                 --           .753***
                                        (7.775)
R2                         .360          .917
Adjusted R2                .296          .899

Note:  * p < .05  ** p < .01  *** p < .001  (2-tailed tests)

2.  Suggestions on Preparing Tables of Regression Results

The following guidelines help in preparing tables of results acceptable to most professional journals.

Appendix C.  Multiple Regression in Practice

Instructions for carrying out multiple regression with a variety of options are provided in the following exhibits; a minimal Python sketch of the same kind of analysis is also given below.
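The original exhibits (SYSTAT and Stata instructions) are not reproduced here.  As a stand-in, the following Python sketch shows one common way to fit a multiple regression model with statsmodels; the data frame is a placeholder with made-up values and would be replaced by the actual data set.

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(6)
df = pd.DataFrame({                      # placeholder data, not Yule's actual values
    "PAUP":     rng.normal(50, 16, 32),
    "OUTRATIO": rng.normal(60, 13, 32),
    "PROPOLD":  rng.normal(100, 9, 32),
    "POP":      rng.normal(110, 30, 32),
})

X = sm.add_constant(df[["OUTRATIO", "PROPOLD", "POP"]])   # add the column of 1s
fit = sm.OLS(df["PAUP"], X).fit()                         # ordinary least squares
print(fit.summary())        # coefficients, standard errors, t ratios, R2, F test, etc.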

Last modified 6 Mar 2006