Algebra of Specification Bias

Suppose that the correctly specified regression model is
$y = X_1 b_1 + X_2 b_2 + e$
where $X_1$ contains $p_1$ variables (columns) and $X_2$ contains $p_2$ variables (columns).  For simplicity, assume that a constant term is included in $X_1$.
Suppose $y$ is regressed on $X_1$ alone, omitting $X_2$ (i.e., the model is misspecified).  The OLS estimator of $b_1$, call it $\hat{b}_1$, is
$\hat{b}_1 = (X_1'X_1)^{-1}X_1'y$
$\hat{b}_1 = (X_1'X_1)^{-1}X_1'(X_1 b_1 + X_2 b_2 + e)$  (substituting for $y$ from the correctly specified model)
$\hat{b}_1 = b_1 + (X_1'X_1)^{-1}X_1'X_2 b_2 + (X_1'X_1)^{-1}X_1'e$  (multiplying out and simplifying, since $(X_1'X_1)^{-1}X_1'X_1 = I$)
$E\{\hat{b}_1\} = b_1 + (X_1'X_1)^{-1}X_1'X_2 b_2$  (taking expectations, because $E\{e\} = 0$ and $X_1$ is a constant matrix)
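To make the derivation concrete, here is a minimal simulation sketch in Python/NumPy (the design matrices, coefficient values, and variable names are illustrative assumptions, not part of the derivation).  It averages the misspecified estimate $\hat{b}_1$ over repeated draws of $e$ and compares the result with the theoretical value $b_1 + (X_1'X_1)^{-1}X_1'X_2 b_2$.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p1 = 200, 2                                 # p1 columns in X1, including the constant
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
X2 = (0.5 * X1[:, 1] + rng.normal(size=n)).reshape(-1, 1)  # omitted variable, correlated with X1

b1 = np.array([1.0, 2.0])                      # assumed true coefficients on X1
b2 = np.array([3.0])                           # assumed true coefficient on the omitted X2

# Theoretical bias term: (X1'X1)^{-1} X1'X2 b2
bias = np.linalg.solve(X1.T @ X1, X1.T @ X2) @ b2

# Monte Carlo: average the misspecified OLS estimate over many draws of e
reps = 5000
estimates = np.empty((reps, p1))
for r in range(reps):
    e = rng.normal(size=n)
    y = X1 @ b1 + X2 @ b2 + e
    estimates[r] = np.linalg.lstsq(X1, y, rcond=None)[0]  # regress y on X1 only

print("simulated E{b1_hat}:", estimates.mean(axis=0))
print("b1 + bias (theory): ", b1 + bias)
```

The two printed vectors agree up to simulation noise, which is exactly the content of the expectation derived above.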
Thus $\hat{b}_1$ is biased by the amount $(X_1'X_1)^{-1}X_1'X_2 b_2$.  What is the nature of this bias?  Suppose for simplicity that the omitted matrix $X_2$ consists of a single variable (i.e., $X_2$ has one column).  Then $(X_1'X_1)^{-1}X_1'X_2$ is the vector, call it $b_{12}$, of estimated coefficients from the regression of the omitted variable $X_2$ on the variables in $X_1$.  Thus
$E\{\hat{b}_1\} = b_1 + b_{12} b_2$
so that the bias equals the product of $b_2$ (the effect of $X_2$ on $y$ in the true regression model) and $b_{12}$, the estimated coefficients from the regression of $X_2$ on $X_1$.
One way to interpret the bias is to say that omitting $X_2$ causes the estimated effect of $X_1$ to absorb the indirect path $X_1 \rightarrow X_2 \rightarrow y$, corresponding to the product $b_{12} b_2$.  The direction of the bias depends on the signs of the coefficients.  When the element of $b_{12}$ for a given regressor and $b_2$ are both positive, the bias on that regressor's coefficient is positive, so that $\hat{b}_1$ overestimates $b_1$.
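The auxiliary-regression interpretation can be checked directly.  The short Python/NumPy sketch below (again with assumed, illustrative coefficient values) regresses the omitted variable on $X_1$ to obtain $b_{12}$ and confirms that $b_{12} b_2$ reproduces the bias term; since the slope element of $b_{12}$ and $b_2$ are both positive by construction, the bias on the slope is positive.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])   # constant plus one regressor
x2 = 0.8 * X1[:, 1] + rng.normal(size=n)                 # omitted variable; its slope on X1 is positive
b2 = 3.0                                                 # assumed effect of x2 on y, also positive

# Auxiliary regression of the omitted variable on X1 yields b12
b12 = np.linalg.lstsq(X1, x2, rcond=None)[0]

# Direct computation of (X1'X1)^{-1} X1'x2 * b2 -- identical to b12 * b2
direct = np.linalg.solve(X1.T @ X1, X1.T @ x2) * b2

print("b12 * b2:", b12 * b2)    # bias on (intercept, slope)
print("direct:  ", direct)      # same numbers
# The slope component is positive here, so the misspecified slope
# estimate would overstate the true coefficient.
```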


Last modified 20 Feb 2002