Algebra of Specification Bias

Suppose that the correctly specified regression model is
$y = X_1 b_1 + X_2 b_2 + e$
where $X_1$ contains $p_1$ variables (columns) and $X_2$ contains $p_2$ variables (columns).  For simplicity, assume that a constant term is included in $X_1$.
Suppose $y$ is regressed on $X_1$ alone, omitting $X_2$ (i.e., the model is misspecified).  The OLS estimator of $b_1$, call it $\hat{b}_1$, is
$\hat{b}_1 = (X_1'X_1)^{-1}X_1'y$
$\hat{b}_1 = (X_1'X_1)^{-1}X_1'(X_1 b_1 + X_2 b_2 + e)$  (substituting for $y$ from the correctly specified model)
$\hat{b}_1 = b_1 + (X_1'X_1)^{-1}X_1'X_2 b_2 + (X_1'X_1)^{-1}X_1'e$  (multiplying out and simplifying, since $(X_1'X_1)^{-1}X_1'X_1 = I$)
$E\{\hat{b}_1\} = b_1 + (X_1'X_1)^{-1}X_1'X_2 b_2$  (taking expectations, because $E\{e\} = 0$ and $X_1$ is a constant matrix)
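To make the derivation concrete, here is a minimal simulation sketch in Python/NumPy (the design matrices, coefficient values, and variable names are illustrative assumptions, not part of the derivation).  It averages the misspecified estimate $\hat{b}_1$ over repeated draws of $e$ and compares the result with the theoretical value $b_1 + (X_1'X_1)^{-1}X_1'X_2 b_2$.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p1 = 200, 2                                 # p1 columns in X1, including the constant
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
X2 = (0.5 * X1[:, 1] + rng.normal(size=n)).reshape(-1, 1)  # omitted variable, correlated with X1

b1 = np.array([1.0, 2.0])                      # assumed true coefficients on X1
b2 = np.array([3.0])                           # assumed true coefficient on the omitted X2

# Theoretical bias term: (X1'X1)^{-1} X1'X2 b2
bias = np.linalg.solve(X1.T @ X1, X1.T @ X2) @ b2

# Monte Carlo: average the misspecified OLS estimate over many draws of e
reps = 5000
estimates = np.empty((reps, p1))
for r in range(reps):
    e = rng.normal(size=n)
    y = X1 @ b1 + X2 @ b2 + e
    estimates[r] = np.linalg.lstsq(X1, y, rcond=None)[0]  # regress y on X1 only

print("simulated E{b1_hat}:", estimates.mean(axis=0))
print("b1 + bias (theory): ", b1 + bias)
```

The two printed vectors agree up to simulation noise, which is exactly the content of the expectation derived above.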
Thus $\hat{b}_1$ is biased by the amount $(X_1'X_1)^{-1}X_1'X_2 b_2$.  What is the nature of this bias?  Suppose for simplicity that the omitted matrix $X_2$ consists of a single variable (i.e., $X_2$ has one column).  Then $(X_1'X_1)^{-1}X_1'X_2$ is the vector, call it $b_{12}$, of estimated coefficients from the regression of the omitted variable $X_2$ on the variables in $X_1$.  Thus
$E\{\hat{b}_1\} = b_1 + b_{12} b_2$
so that the bias equals the product of $b_2$ (the effect of $X_2$ on $y$ in the true regression model) and $b_{12}$, the estimated coefficients from the regression of $X_2$ on $X_1$.
One way to interpret the bias is to say that omitting $X_2$ causes the estimated effect of $X_1$ to absorb the indirect path $X_1 \rightarrow X_2 \rightarrow y$, corresponding to the product $b_{12} b_2$.  The direction of the bias depends on the signs of the coefficients.  When the element of $b_{12}$ for a given regressor and $b_2$ are both positive, the bias on that regressor's coefficient is positive, so that $\hat{b}_1$ overestimates $b_1$.
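The auxiliary-regression interpretation can be checked directly.  The short Python/NumPy sketch below (again with assumed, illustrative coefficient values) regresses the omitted variable on $X_1$ to obtain $b_{12}$ and confirms that $b_{12} b_2$ reproduces the bias term; since the slope element of $b_{12}$ and $b_2$ are both positive by construction, the bias on the slope is positive.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])   # constant plus one regressor
x2 = 0.8 * X1[:, 1] + rng.normal(size=n)                 # omitted variable; its slope on X1 is positive
b2 = 3.0                                                 # assumed effect of x2 on y, also positive

# Auxiliary regression of the omitted variable on X1 yields b12
b12 = np.linalg.lstsq(X1, x2, rcond=None)[0]

# Direct computation of (X1'X1)^{-1} X1'x2 * b2 -- identical to b12 * b2
direct = np.linalg.solve(X1.T @ X1, X1.T @ x2) * b2

print("b12 * b2:", b12 * b2)    # bias on (intercept, slope)
print("direct:  ", direct)      # same numbers
# The slope component is positive here, so the misspecified slope
# estimate would overstate the true coefficient.
```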


Last modified 20 Feb 2002