Algebra of Specification Bias
Suppose that the correctly specified regression model is

    y = X1 β1 + X2 β2 + e

where X1 includes p1 variables (columns) and X2 includes p2 variables (columns). For simplicity, assume that a constant term is included in X1.
Suppose y is regressed on X1 without including X2 (i.e., the model is misspecified). The OLS estimator of β1, call it b1, is

    b1 = (X1'X1)^{-1} X1'y
       = (X1'X1)^{-1} X1'(X1 β1 + X2 β2 + e)    (replacing y by its value in the correctly specified model)
       = β1 + (X1'X1)^{-1} X1'X2 β2 + (X1'X1)^{-1} X1'e    (multiplying out and simplifying, since (X1'X1)^{-1} X1'X1 = I)

Taking expectations, because E{e} = 0 and X1 is a constant matrix,

    E{b1} = β1 + (X1'X1)^{-1} X1'X2 β2

Thus b1 is biased by the term (X1'X1)^{-1} X1'X2 β2.
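The bias formula can be checked numerically. The sketch below simulates data from a correctly specified model, fits the misspecified regression of y on X1 alone, and compares the realized estimation error with the algebraic bias term; the variable names mirror the text, and the particular coefficient values and sample size are made up for illustration.

```python
# Numerical check of the omitted-variable bias formula.
# Coefficient values, sample size, and the X1-X2 correlation are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Correctly specified model: y = X1 beta1 + X2 beta2 + e
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])       # constant + one regressor
X2 = (0.5 * X1[:, 1] + rng.normal(size=n)).reshape(-1, 1)    # omitted variable, correlated with X1
beta1 = np.array([1.0, 2.0])
beta2 = np.array([3.0])
y = X1 @ beta1 + X2 @ beta2 + rng.normal(size=n)

# Misspecified regression: y on X1 only
b1 = np.linalg.solve(X1.T @ X1, X1.T @ y)

# Bias predicted by the algebra: (X1'X1)^{-1} X1'X2 beta2
predicted_bias = np.linalg.solve(X1.T @ X1, X1.T @ X2) @ beta2

print(b1 - beta1)        # realized estimation error
print(predicted_bias)    # algebraic bias term; the two agree closely for large n
```

With a large sample the noise term (X1'X1)^{-1} X1'e is near zero, so the realized error is almost entirely the bias term.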
What is the nature of this bias? Suppose for simplicity that the omitted matrix X2 consists of a single variable (i.e., X2 has one column). Then (X1'X1)^{-1} X1'X2 is equal to the vector, call it b12, of estimated coefficients from the regression of the omitted variable X2 on the variables in X1. Thus

    E{b1} = β1 + b12 β2

so that the bias equals the product of β2 (the effect of X2 on y in the true regression model) and b12, the estimated coefficients of the regression of X2 on X1.
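The same algebra also holds exactly in sample: the coefficients from the short regression of y on X1 equal the X1 coefficients from the full regression plus b12 times the full regression's coefficient on X2. A minimal sketch with simulated data (the numbers are illustrative assumptions):

```python
# Exact in-sample decomposition: short-regression coefficients equal
# full-regression coefficients on X1 plus b12 times the coefficient on X2.
import numpy as np

rng = np.random.default_rng(1)
n = 500
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])   # constant + one variable
x2 = 0.8 * X1[:, 1] + rng.normal(size=n)                 # single omitted variable
y = X1 @ np.array([1.0, 2.0]) + 3.0 * x2 + rng.normal(size=n)

X = np.column_stack([X1, x2])                            # full design matrix
coef_full = np.linalg.solve(X.T @ X, X.T @ y)            # [coeffs on X1, coeff on x2]
b1_short = np.linalg.solve(X1.T @ X1, X1.T @ y)          # misspecified regression
b12 = np.linalg.solve(X1.T @ X1, X1.T @ x2)              # auxiliary regression of X2 on X1

# The identity holds to machine precision, not just in expectation
print(np.allclose(b1_short, coef_full[:2] + b12 * coef_full[2]))
```

The expectation result in the text is the population version of this finite-sample identity.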
One way to interpret the bias is to say that omitting X2 has caused the estimated effect of X1 to absorb the indirect path X1 -> X2 -> y, corresponding to the product b12 β2. The direction of the bias depends on the signs of the coefficients: when b12 and β2 are both positive, the bias is also positive, so that b1 overestimates β1.
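The sign pattern can be illustrated by flipping the sign of β2 while holding the X1-X2 relationship (and hence b12 > 0) fixed; all numbers below are made up for the sketch.

```python
# With b12 > 0, the sign of the bias tracks the sign of beta2.
# All coefficient values and the sample size are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)
n = 20_000
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
x2 = 1.0 * X1[:, 1] + rng.normal(size=n)   # positively related to X1, so b12 > 0

def short_slope(beta2):
    # True model: y = 1 + 2*x1 + beta2*x2 + e, estimated without x2
    y = 1.0 + 2.0 * X1[:, 1] + beta2 * x2 + rng.normal(size=n)
    return np.linalg.solve(X1.T @ X1, X1.T @ y)[1]

print(short_slope(+3.0))   # well above the true slope of 2 (upward bias)
print(short_slope(-3.0))   # well below the true slope of 2 (downward bias)
```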
Last modified 20 Feb 2002