T2. Endogeneity

TOC

1. Definition & Influence
1.1 Endogeneity
1.2 Influence of endogeneity
2. Sources of Endogeneity
2.1 Omitted variable bias (OVB)
2.2 Wrong Functional Form
2.3 Measurement Error
2.4 Simultaneity

1. Definition & Influence

1.1 Endogeneity

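An endogenous variable is a regressor $x$ that is correlated with the error term $u$, that is, $\mathrm{Cov}(x, u) \neq 0$.

An exogenous variable is a regressor $x$ that is uncorrelated with the error term $u$, that is, $\mathrm{Cov}(x, u) = 0$.

Endogeneity: the correlation between $x$ and $u$ implies that the ceteris paribus assumption does not hold, because a change in $x$ is accompanied by a change in the unobserved factors collected in $u$. Ceteris paribus is a Latin phrase meaning “all other things being equal”.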

1.2 Influence of endogeneity

When $\mathrm{Cov}(x, u) \neq 0$, the consequence is that the OLS estimator is biased and inconsistent.

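For the simple linear regression

$$y = \beta_0 + \beta_1 x + u,$$

the OLS slope satisfies

$$\hat{\beta}_1 \xrightarrow{p} \beta_1 + \frac{\mathrm{Cov}(x, u)}{\mathrm{Var}(x)}.$$

If $\mathrm{Cov}(x, u) > 0$, the effect is as shown in the figure below.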

The red/solid line is the true population regression line. The blue/dotted line is the fitted OLS line. Because the errors are positively correlated with the regressor, the fitted OLS line is steeper than the true line: positive bias.
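The positive bias described above can be checked with a small simulation. This is a minimal sketch, not part of the original notes: the parameter values (`beta0=1`, `beta1=2`, `rho=0.5`), the normal draws, and the helper names `ols_slope` and `simulate_endogeneity` are all illustrative assumptions. The regressor is built so that its covariance with the error is `rho` and its variance is `1 + rho**2`, so the OLS slope should settle near `beta1 + rho / (1 + rho**2)` rather than `beta1`.

```python
import random

def ols_slope(x, y):
    """Slope from OLS of y on x with an intercept: sample Cov(x, y) / Var(x)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def simulate_endogeneity(n=50_000, beta0=1.0, beta1=2.0, rho=0.5, seed=0):
    """Simulate y = beta0 + beta1 * x + u where x is correlated with u."""
    rng = random.Random(seed)
    x, y = [], []
    for _ in range(n):
        z = rng.gauss(0, 1)      # part of x that is unrelated to the error
        u = rng.gauss(0, 1)      # error term
        xi = z + rho * u         # x loads on u, so Cov(x, u) = rho > 0
        x.append(xi)
        y.append(beta0 + beta1 * xi + u)
    return ols_slope(x, y)

slope = simulate_endogeneity()
# Theoretical limit under these assumptions: 2 + 0.5 / 1.25 = 2.4 (true slope is 2)
print(f"OLS slope: {slope:.3f}")
```

Setting `rho=0` in the call removes the correlation, and the estimated slope should then be close to the true value of 2.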

2. Sources of Endogeneity

The main sources of endogeneity include omitted variable bias (OVB), wrong functional form, measurement error, simultaneous causality (simultaneity), and sample selection.

2.1 Omitted variable bias (OVB)

2.1.1 Definition

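Suppose the true model is

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u, \qquad \mathrm{Cov}(x_1, u) = \mathrm{Cov}(x_2, u) = 0.$$

When $x_2$ is omitted, we have

$$y = \beta_0 + \beta_1 x_1 + v, \qquad v = \beta_2 x_2 + u.$$

Now

$$\mathrm{Cov}(x_1, v) = \beta_2 \, \mathrm{Cov}(x_1, x_2) \neq 0$$

if $\beta_2 \neq 0$ and $\mathrm{Cov}(x_1, x_2) \neq 0$. The OLS slope of the short regression then converges to

$$\hat{\beta}_1 \xrightarrow{p} \beta_1 + \beta_2 \frac{\mathrm{Cov}(x_1, x_2)}{\mathrm{Var}(x_1)}.$$

The intuitive reason is that, in addition to its direct effect $\beta_1$, $x_1$ has an apparent indirect effect as a consequence of acting as a proxy for the missing $x_2$. The strength of the proxy effect depends on two factors: the strength of the effect of $x_2$ on $y$, which is given by $\beta_2$, and the ability of $x_1$ to mimic $x_2$, i.e. $\mathrm{Cov}(x_1, x_2) / \mathrm{Var}(x_1)$.

For example:

when $\beta_2 \, \mathrm{Cov}(x_1, x_2) > 0$, $\hat{\beta}_1$ has a positive bias;

when $\beta_2 \, \mathrm{Cov}(x_1, x_2) < 0$, $\hat{\beta}_1$ has a negative bias.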
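The size of the omitted variable bias can be illustrated numerically. The sketch below is an assumption-laden example, not taken from the notes: it generates an included regressor `x1` that partly mimics an omitted regressor `x2`, with illustrative coefficients `beta1=2`, `beta2=3`, and a loading of 0.6, and then regresses `y` on `x1` alone. Under these assumptions the covariance between the two regressors is 0.6 and the variance of `x1` is 1.36, so the short-regression slope should be close to `2 + 3 * 0.6 / 1.36`.

```python
import random

def ols_slope(x, y):
    """Slope from OLS of y on x with an intercept: sample Cov(x, y) / Var(x)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def simulate_ovb(n=50_000, beta1=2.0, beta2=3.0, loading=0.6, seed=0):
    """Return the slope of y on x1 when the correlated regressor x2 is left out."""
    rng = random.Random(seed)
    x1, y = [], []
    for _ in range(n):
        x2 = rng.gauss(0, 1)                      # omitted variable
        x1i = loading * x2 + rng.gauss(0, 1)      # included variable, correlated with x2
        u = rng.gauss(0, 1)                       # error, unrelated to x1 and x2
        x1.append(x1i)
        y.append(1.0 + beta1 * x1i + beta2 * x2 + u)
    return ols_slope(x1, y)

short_slope = simulate_ovb()
# Predicted limit: beta1 + beta2 * Cov(x1, x2) / Var(x1)
predicted = 2.0 + 3.0 * 0.6 / (0.6 ** 2 + 1.0)
print(f"short regression slope: {short_slope:.3f}, predicted: {predicted:.3f}")
```

Because both `beta2` and the loading are positive here, the bias is positive; flipping the sign of either one in the call should push the estimate below 2 instead.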

2.1.2 Solutions to OVB

If the variable can be measured, include it as an additional regressor in multiple regression

Possibly, use panel data in which each entity (individual) is observed more than once

If the variable cannot be measured, use instrumental variable (IV) regression

If the variable cannot be measured, use a proxy variable (another variable that is correlated with the omitted variable but can be measured and is easily accessible)

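Good proxy variables should satisfy the following conditions. In the model $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u$ with omitted variable $x_2$ and proxy $z$, write

$$x_2 = \delta_0 + \delta_1 z + r, \qquad \mathrm{Cov}(z, u) = 0, \qquad \mathrm{Cov}(x_1, r) = \mathrm{Cov}(z, r) = 0,$$

then

$$y = (\beta_0 + \beta_2 \delta_0) + \beta_1 x_1 + \beta_2 \delta_1 z + (u + \beta_2 r),$$

so that regressing $y$ on $x_1$ and $z$ gives a consistent estimate of $\beta_1$.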

2.2 Wrong Functional Form

2.2.1 Definition

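Wrong functional form arises if the functional form used in the regression is incorrect. For example, suppose the true relationship between $y$ and $x$ is

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + u.$$

If we run the regression

$$y = \beta_0 + \beta_1 x + v,$$

then

$$v = \beta_2 x^2 + u$$

and

$$\mathrm{Cov}(x, v) = \beta_2 \, \mathrm{Cov}(x, x^2) \neq 0$$

whenever $\beta_2 \neq 0$ and $x$ is correlated with $x^2$.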

2.2.2 Testing

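To test whether there are omitted nonlinear terms, we can follow the steps below:

Regress $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + u$

Test whether $\beta_2 = \beta_3 = 0$ (for example with an F-test). If the hypothesis is not rejected, there is no evidence of omitted nonlinear terms. Otherwise, there is.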
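One way to carry out such a test is an F-test that compares a linear fit with a fit that adds squared and cubed terms. The sketch below is a minimal illustration under assumptions that are not in the notes: the simulated data, the choice of $x^2$ and $x^3$ as the added terms, and the helper names `solve`, `ssr`, and `f_stat_nonlinear` are all hypothetical. It computes $F = \frac{(SSR_r - SSR_u)/q}{SSR_u/(n-k)}$ with $q = 2$ restrictions and $k = 4$ coefficients in the unrestricted model.

```python
import random

def solve(A, b):
    """Solve the linear system A z = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * m for a, m in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ssr(X, y):
    """Sum of squared residuals from OLS of y on the columns of X."""
    k = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    beta = solve(XtX, Xty)
    return sum((yi - sum(b * v for b, v in zip(beta, row))) ** 2
               for row, yi in zip(X, y))

def f_stat_nonlinear(x, y):
    """F statistic for H0: the coefficients on x^2 and x^3 are both zero."""
    n, q, k = len(x), 2, 4
    ssr_r = ssr([[1.0, xi] for xi in x], y)                      # restricted: linear
    ssr_u = ssr([[1.0, xi, xi ** 2, xi ** 3] for xi in x], y)    # unrestricted: cubic
    return ((ssr_r - ssr_u) / q) / (ssr_u / (n - k))

rng = random.Random(0)
x = [rng.uniform(-2, 2) for _ in range(2000)]
y_lin = [1 + 2 * xi + rng.gauss(0, 1) for xi in x]                    # truly linear
y_quad = [1 + 2 * xi + 1.5 * xi ** 2 + rng.gauss(0, 1) for xi in x]   # omitted x^2
f_lin, f_quad = f_stat_nonlinear(x, y_lin), f_stat_nonlinear(x, y_quad)
print(f"F (linear truth): {f_lin:.2f}, F (quadratic truth): {f_quad:.2f}")
```

For a large sample the 5% critical value of an F distribution with 2 numerator degrees of freedom is roughly 3.0, so a statistic far above that points to omitted nonlinear terms.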

2.2.3 Solutions to functional form misspecification

For a continuous dependent variable: use “appropriate” nonlinear specifications in $x$ (logarithms, interactions, etc.)

For a discrete (e.g. binary) dependent variable: an extension of multiple regression methods is needed (“probit” or “logit” analysis for binary dependent variables)

Other nonparametric econometric methods

2.3 Measurement Error

2.3.1 Definition

In practice, economic data often contain measurement error, for several reasons:

Data entry errors in administrative data

Recollection errors in surveys (e.g. when did you start your current job?)

Ambiguous questions (e.g. what was your income last year?)

Intentionally false responses in surveys (e.g. what is the current value of your financial assets?)

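Assume the model we want to estimate is

$$y = \beta_0 + \beta_1 x^* + u, \qquad \mathrm{Cov}(x^*, u) = 0,$$

but we can only access the measurement $x$, which differs from the true value $x^*$ by an error $e$, i.e. $x = x^* + e$. It is natural to assume (classical measurement error):

$$E(e) = 0, \qquad \mathrm{Cov}(x^*, e) = 0, \qquad \mathrm{Cov}(u, e) = 0.$$

Then

$$y = \beta_0 + \beta_1 x + (u - \beta_1 e), \qquad \mathrm{Cov}(x, u - \beta_1 e) = -\beta_1 \sigma_e^2 \neq 0,$$

and the OLS estimate of $\beta_1$ satisfies

$$\hat{\beta}_1 \xrightarrow{p} \beta_1 - \beta_1 \frac{\sigma_e^2}{\sigma_{x^*}^2 + \sigma_e^2} = \beta_1 \frac{\sigma_{x^*}^2}{\sigma_{x^*}^2 + \sigma_e^2}.$$

This bias is called attenuation bias, the bias towards zero (in the limit, the absolute value of the estimated coefficient is smaller than that of the true coefficient):

When $\beta_1 < 0$, the OLS estimator is biased upward (positive bias, the estimate tends to be larger, i.e. closer to zero)

When $\beta_1 > 0$, the OLS estimator is biased downward (negative bias, the estimate tends to be smaller, i.e. closer to zero)

The explanation for the bias towards zero is that we are trying to use the association between the observed $x$ and $y$ to capture the strength of the causal link between the true $x^*$ and $y$. However, due to the presence of the noise $e$, the association is a dampened measure (one with a smaller absolute value) of the causal link.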
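Attenuation bias is easy to see in a simulation. The following is a minimal sketch under classical measurement-error assumptions; the parameter values (true slope 2, unit variance for both the true regressor and the measurement noise) and the function names are illustrative choices, not from the notes. With equal signal and noise variances, the shrink factor is $1/(1+1) = 0.5$, so the slope estimated from the noisy regressor should be close to 1 rather than 2.

```python
import random

def ols_slope(x, y):
    """Slope from OLS of y on x with an intercept: sample Cov(x, y) / Var(x)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def simulate_attenuation(n=50_000, beta1=2.0, sd_noise=1.0, seed=0):
    """Return (slope using the true regressor, slope using the noisy regressor)."""
    rng = random.Random(seed)
    x_true, x_obs, y = [], [], []
    for _ in range(n):
        xs = rng.gauss(0, 1)                   # true regressor, variance 1
        e = rng.gauss(0, sd_noise)             # measurement error, independent of xs
        u = rng.gauss(0, 1)                    # regression error
        x_true.append(xs)
        x_obs.append(xs + e)                   # what the researcher actually observes
        y.append(1.0 + beta1 * xs + u)
    return ols_slope(x_true, y), ols_slope(x_obs, y)

slope_true, slope_noisy = simulate_attenuation()
# Predicted limit with noise: beta1 * 1 / (1 + sd_noise**2) = 2 * 0.5 = 1.0
print(f"true regressor: {slope_true:.3f}, noisy regressor: {slope_noisy:.3f}")
```

Increasing `sd_noise` should shrink the noisy estimate further towards zero, while the estimate based on the true regressor stays near 2.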

2.3.2 Solutions

Obtain better data

Develop a specific model of the measurement error process (this is only possible if a lot is known about the nature of the measurement error)

Instrumental variable (IV) regression

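Supplement:

When there is noise in $y$, that is, we can only access the measurement $y = y^* + w$ of the true value $y^* = \beta_0 + \beta_1 x + u$, where $w$ is a random error uncorrelated with $x$ and $u$, then

$$y = \beta_0 + \beta_1 x + (u + w),$$

and $\hat{\beta}_1$ is still a consistent and unbiased estimator of $\beta_1$, but it has a larger variance (recall that, under homoskedasticity, $\mathrm{Var}(\hat{\beta}_1) = \sigma^2 / \sum_i (x_i - \bar{x})^2$, where the error variance $\sigma^2$ is now $\sigma_u^2 + \sigma_w^2$ instead of $\sigma_u^2$)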
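The contrast with measurement error in the regressor can be illustrated by a small Monte Carlo experiment. This is a sketch under illustrative assumptions that are not in the notes (true slope 2, sample size 50, 2000 replications, noise in the dependent variable with standard deviation 2, and hypothetical function names). It estimates the slope repeatedly with and without noise in the dependent variable and compares the mean and the variance of the estimates across replications.

```python
import random
from statistics import mean, pvariance

def ols_slope(x, y):
    """Slope from OLS of y on x with an intercept: sample Cov(x, y) / Var(x)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def monte_carlo(reps=2000, n=50, beta1=2.0, sd_w=2.0, seed=0):
    """Slope estimates across replications, without and with noise in y."""
    rng = random.Random(seed)
    clean, noisy = [], []
    for _ in range(reps):
        x = [rng.gauss(0, 1) for _ in range(n)]
        y_true = [1.0 + beta1 * xi + rng.gauss(0, 1) for xi in x]   # error u only
        y_obs = [yi + rng.gauss(0, sd_w) for yi in y_true]          # adds noise w
        clean.append(ols_slope(x, y_true))
        noisy.append(ols_slope(x, y_obs))
    return clean, noisy

clean, noisy = monte_carlo()
print(f"mean slope: clean {mean(clean):.3f}, noisy {mean(noisy):.3f}")
print(f"variance:   clean {pvariance(clean):.4f}, noisy {pvariance(noisy):.4f}")
```

Both averages should sit near the true slope of 2, while the spread of the estimates based on the noisy dependent variable should be several times larger, since the error variance rises from 1 to 1 + 4 under these settings.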

2.4 Simultaneity

2.4.1 Definition

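Endogeneity can also arise in structural models, for example the supply and demand model.

There are two variables: quantity $Q$ and price $P$.

(D): $Q^d = \alpha_0 + \alpha_1 P + u$, $\alpha_1 < 0$

(S): $Q^s = \beta_0 + \beta_1 P + v$, $\beta_1 > 0$

In market equilibrium, $Q^d = Q^s = Q$. In addition, we assume that price and quantity are endogenous and determined simultaneously, while the shocks $u$ and $v$ have mean zero, variances $\sigma_u^2$ and $\sigma_v^2$, and $\mathrm{Cov}(u, v) = 0$.

From the market equilibrium condition, we have

$$\alpha_0 + \alpha_1 P + u = \beta_0 + \beta_1 P + v,$$

thus

$$P = \frac{\beta_0 - \alpha_0}{\alpha_1 - \beta_1} + \frac{v - u}{\alpha_1 - \beta_1},$$

and thus

$$\mathrm{Cov}(P, u) = -\frac{\sigma_u^2}{\alpha_1 - \beta_1} \neq 0, \qquad \mathrm{Cov}(P, v) = \frac{\sigma_v^2}{\alpha_1 - \beta_1} \neq 0.$$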

We can explain the endogeneity from another perspective:

The shocks $u$ and $v$ can be regarded as inputs to a (market) system, and price $P$ and quantity $Q$ as its outputs. In general, each output will be correlated with both $u$ and $v$.

Furthermore, if we run the regression

$$Q = \gamma_0 + \gamma_1 P + \varepsilon$$

using market data, then we get something that is a mix of the supply and demand curves. $\hat{\gamma}_1$ tends to lie between the demand slope $\alpha_1$ and the supply slope $\beta_1$.

$P$ has two effects on $Q$:

For producers, a larger $P$ causes $Q$ to increase

For consumers, a larger $P$ causes $Q$ to decrease

2.4.2 Example

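Assume

$$Q^d = \alpha_1 P + u, \qquad Q^s = \beta_1 P + v, \qquad E(u) = E(v) = 0, \quad \mathrm{Var}(u) = \sigma_u^2, \quad \mathrm{Var}(v) = \sigma_v^2, \quad \mathrm{Cov}(u, v) = 0.$$

Regress $Q = \gamma_1 P + \varepsilon$, prove that

$$\hat{\gamma}_1 \xrightarrow{p} \frac{\alpha_1 \sigma_v^2 + \beta_1 \sigma_u^2}{\sigma_u^2 + \sigma_v^2}.$$

According to market equilibrium, $Q^d = Q^s$, that is $\alpha_1 P + u = \beta_1 P + v$. Therefore, $P = \dfrac{v - u}{\alpha_1 - \beta_1}$ and $Q = \alpha_1 P + u = \dfrac{\alpha_1 v - \beta_1 u}{\alpha_1 - \beta_1}$.

since

$$\mathrm{Cov}(P, Q) = \frac{\alpha_1 \sigma_v^2 + \beta_1 \sigma_u^2}{(\alpha_1 - \beta_1)^2}, \qquad \mathrm{Var}(P) = \frac{\sigma_u^2 + \sigma_v^2}{(\alpha_1 - \beta_1)^2},$$

therefore

$$\hat{\gamma}_1 \xrightarrow{p} \frac{\mathrm{Cov}(P, Q)}{\mathrm{Var}(P)} = \frac{\alpha_1 \sigma_v^2 + \beta_1 \sigma_u^2}{\sigma_u^2 + \sigma_v^2},$$

which is a weighted average of the demand slope $\alpha_1$ and the supply slope $\beta_1$.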
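The result can be checked by simulating equilibrium data from a supply and demand system with no intercepts and uncorrelated shocks. The sketch below is illustrative only: the slopes (demand slope −1, supply slope 1), the shock standard deviations (1 for demand, 2 for supply), and the function names are assumptions, not values from the notes. Under these settings the variance-weighted average of the two slopes is $(-1 \cdot 4 + 1 \cdot 1)/(1 + 4) = -0.6$, which is neither the demand slope nor the supply slope.

```python
import random

def ols_slope(x, y):
    """Slope from OLS of y on x with an intercept: sample Cov(x, y) / Var(x)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def simulate_market(n=50_000, a1=-1.0, b1=1.0, sd_u=1.0, sd_v=2.0, seed=0):
    """Regress equilibrium quantity on equilibrium price.

    Demand: Q = a1 * P + u, supply: Q = b1 * P + v, with independent shocks.
    """
    rng = random.Random(seed)
    prices, quantities = [], []
    for _ in range(n):
        u = rng.gauss(0, sd_u)            # demand shock
        v = rng.gauss(0, sd_v)            # supply shock
        p = (v - u) / (a1 - b1)           # price that clears the market
        q = a1 * p + u                    # equilibrium quantity (on both curves)
        prices.append(p)
        quantities.append(q)
    return ols_slope(prices, quantities)

gamma_hat = simulate_market()
# Weighted average of the slopes: (a1 * sd_v**2 + b1 * sd_u**2) / (sd_u**2 + sd_v**2)
predicted = (-1.0 * 4.0 + 1.0 * 1.0) / (1.0 + 4.0)
print(f"estimated slope: {gamma_hat:.3f}, predicted: {predicted:.3f}")
```

Making the supply shocks much more volatile than the demand shocks should move the estimate towards the demand slope, since the supply curve then shifts along a comparatively stable demand curve.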