EC 320 - Introduction to Econometrics
2025
We estimate because we cannot measure everything
Suppose we want to know the average height of the US population.
We cannot measure everyone, but we can collect a sample of heights.
How can we use these data to estimate the average height of the population?
That is what we will learn to do
Let’s define some concepts first:
Estimand
Quantity that is to be estimated in a statistical analysis
Estimator
A rule (or formula) for estimating an unknown population parameter given a sample of data
Estimate
A specific numerical value that we obtain from the sample data by applying the estimator
Suppose we want to know the average height of the population in the US
So then we can identify our Estimand, Estimator, and Estimate
Estimand: The population mean \((\mu)\)
Estimator: The sample mean \((\bar{X})\)
\[ \bar{X} = \dfrac{1}{n} \sum_{i=1}^{n} X_{i} \]
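As a minimal sketch, here is what applying this estimator to a made-up sample of heights looks like in Python (the numbers are purely illustrative):

```python
import numpy as np

# Hypothetical sample of heights in inches (made-up numbers, for illustration only)
heights = np.array([64.2, 70.1, 67.5, 62.8, 71.3, 68.0, 65.9, 69.4])

# Estimator: the sample mean, X_bar = (1/n) * sum(X_i)
# Estimate: the specific number we get by applying it to this sample
x_bar = heights.sum() / len(heights)   # equivalent to heights.mean()
print(f"Estimate of the population mean height: {x_bar:.2f} inches")
```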
There are many ways to estimate things and they all have their benefits and costs.
Imagine we want to estimate an unknown parameter \(\mu\), and we know the distributions of three competing estimators.
Which one should we use?
We ask: What properties make an estimator reliable?
Answer (1): Unbiasedness
On average, does the estimator tend toward the correct value?
Formally: Does the mean of the estimator’s distribution equal the parameter it estimates?
\[ \text{Bias}_{\mu} (\hat{\mu}) = E[\hat{\mu}] - \mu \]
Unbiased estimator: \(E\left[ \hat{\mu} \right] = \mu\)
Biased estimator: \(E\left[ \hat{\mu} \right] \neq \mu\)
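One way to see unbiasedness is by simulation: draw many samples from a population with a known mean and average the estimates. A sketch, with made-up values for \(\mu\), \(\sigma\), and the sample size:

```python
import numpy as np

rng = np.random.default_rng(320)
mu, sigma, n = 67.0, 3.0, 50          # made-up population parameters and sample size

# Draw many samples and apply the estimator (the sample mean) to each one
sample_means = rng.normal(mu, sigma, size=(10_000, n)).mean(axis=1)

# If the estimator is unbiased, the average of the estimates should sit near mu
print(f"Average of the estimates: {sample_means.mean():.3f}  (true mu = {mu})")
print(f"Simulated bias:           {sample_means.mean() - mu:+.3f}")
```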
We ask: What properties make an estimator reliable?
Answer (2): Efficiency (Low Variance)
The central tendencies (means) of competing distributions are not the only things that matter. We also care about the variance of an estimator.
\[ Var(\hat{\mu}) = E \left[ (\hat{\mu} - E[\hat{\mu}])^{2} \right] \]
Lower-variance estimators produce estimates that stay closer to their expected value in each sample
Think of low variance as precision \(\rightarrow\) tighter estimates
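To see the variance comparison, we can simulate two estimators that are both unbiased for \(\mu\): the sample mean and, as a deliberately crude alternative for illustration, a single observation. Again, the population values below are made up:

```python
import numpy as np

rng = np.random.default_rng(320)
mu, sigma, n = 67.0, 3.0, 50                  # made-up population values
samples = rng.normal(mu, sigma, size=(10_000, n))

est_mean = samples.mean(axis=1)               # estimator 1: the sample mean
est_single = samples[:, 0]                    # estimator 2: just the first observation

# Both are unbiased for mu, but the sample mean has much lower variance
print(f"Variance of the sample mean:      {est_mean.var():.3f}")   # roughly sigma^2 / n
print(f"Variance of a single observation: {est_single.var():.3f}") # roughly sigma^2
```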
Much like everything, there are tradeoffs from gaining one thing over another.
Should we be willing to take a bit of bias to reduce the variance?
In economics/causal inference, we emphasize unbiasedness
In addition to the sample mean, there are other unbiased estimators we will often use
The sample variance, \(S_{X}^{2}\), is an unbiased estimator of the population variance
\[ S_{X}^{2} = \dfrac{1}{n - 1} \sum_{i=1}^{n} (X_{i} - \bar{X})^{2}. \]
The sample covariance, \(S_{XY}\), is an unbiased estimator of the population covariance
\[ S_{XY} = \dfrac{1}{n-1} \sum_{i=1}^{n} (X_{i} - \bar{X})(Y_{i} - \bar{Y}). \]
The sample correlation, \(r_{XY}\), is an estimator of the population correlation coefficient
\[ r_{XY} = \dfrac{S_{XY}}{\sqrt{S_{X}^{2}}\sqrt{S_{Y}^{2}}}. \]
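These sample formulas are easy to compute directly. A minimal Python sketch on made-up data; note that the \(n - 1\) divisor corresponds to ddof=1 in NumPy:

```python
import numpy as np

# Made-up paired data, for illustration only
x = np.array([1.0, 5.0, 2.0, 4.0, 3.0])
y = np.array([1.0, 2.0, 1.0, 3.0, 2.0])
n = len(x)

s2_x = ((x - x.mean()) ** 2).sum() / (n - 1)                # sample variance of x
s2_y = ((y - y.mean()) ** 2).sum() / (n - 1)                # sample variance of y
s_xy = ((x - x.mean()) * (y - y.mean())).sum() / (n - 1)    # sample covariance
r_xy = s_xy / (np.sqrt(s2_x) * np.sqrt(s2_y))               # sample correlation

# NumPy's built-ins use the same n - 1 divisor (ddof=1 for the variance)
print(s2_x, np.var(x, ddof=1))
print(s_xy, np.cov(x, y)[0, 1])
print(r_xy, np.corrcoef(x, y)[0, 1])
```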
Before we continue, let's cover some important rules we will need when we derive OLS results in the near future:
Summations \((\sum)\) follow certain rules that we cannot violate and that are important to keep in mind:
\[\sum_{i=1}^{n} x_{i} = x_{1} + x_{2} + \cdots + x_{n}\]
Let \(x\) be the set \(\{1, 5, 2\}\); then, using our summation rule, we have:
\[ \sum_{i} x_{i} = 1 + 5 + 2 = 8. \]
\[\sum_{i} (x_{i} + y_{i}) = \sum_{i} x_{i} + \sum_{i} y_{i}\]
Let \(x: \{1,5,2\}\) and \(y: \{1,2,1\}\), then using our summation rule we have:
\[\begin{align*} \sum_{i} (x_{i} + y_{i}) &= x_{1} + y_{1} + x_{2} + y_{2} + x_{3} + y_{3} \\ &= x_{1} + x_{2} + x_{3} + y_{1} + y_{2} + y_{3} \\ &= 1 + 5 + 2 + 1 + 2 + 1 \\ &= 12 \end{align*}\]
\[\sum_{i} x_{i} y_{i} \neq \sum_{i} x_{i} \sum_{i} y_{i}\]
If we expand both sides of \(\sum_{i} x_{i} y_{i} \neq \sum_{i} x_{i} \sum_{i} y_{i}\), we get:
\[\begin{align*} x_{1}y_{1} + x_{2}y_{2} + x_{3}y_{3} \neq (x_{1} + x_{2} + x_{3})(y_{1} + y_{2} + y_{3}) \end{align*}\]
I’ll leave it to you to use the above numbers to show this holds
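If you want to check your work numerically, a quick Python sketch using the same numbers:

```python
x = [1, 5, 2]
y = [1, 2, 1]

# Summation distributes over addition ...
print(sum(xi + yi for xi, yi in zip(x, y)), sum(x) + sum(y))   # 12 12

# ... but not over multiplication
print(sum(xi * yi for xi, yi in zip(x, y)), sum(x) * sum(y))   # 13 32
```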
We will spend the rest of the course exploring how to use Ordinary Least Squares (OLS) to fit a linear model like:
\[ y_{i} = \beta_{0} + \beta_{1}x_{i} + u_{i}, \]
That is, if we hypothesize that some random variable \(Y\) depends on another random variable \(X\) and that the relationship between them is linear, then \(\beta_{0}\) and \(\beta_{1}\) are the parameters that describe the nature of that relationship.
Given a sample of \(X\) and \(Y\), we will derive unbiased estimators for the intercept \(\beta_{0}\) and slope \(\beta_{1}\). Those estimators help us combine observations of \(X\) and \(Y\) to estimate underlying relationships between them.
We can estimate the effect of \(X\) on \(Y\) by estimating the model:
\[ y_{i} = \beta_{0} + \beta_{1}x_{i} + u_{i}, \]
\(y_i\) is the dependent variable
\(x_i\) is the independent variable (continuous)
\(\beta_0\) is the intercept parameter. \(E\left[ {y_i | x_i=0} \right] = \beta_0\)
\(\beta_1\) is the slope parameter, which under the correct causal setting represents the marginal effect of \(x_i\) on \(y_i\). \(\frac{\partial y_i}{\partial x_i} = \beta_1\)
\(u_i\) is an Error Term including all other (omitted) factors affecting \(y_i\).
\(u_{i}\) is quite special
Consider the data generating process of the variable \(y_{i}\).
Some error will exist in every model; no model is perfect.
Error is the price we are willing to accept for a simplified model.
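One way to picture a data generating process with an error term is to simulate one. A sketch, where the "true" parameters and the error distribution are made-up choices:

```python
import numpy as np

rng = np.random.default_rng(320)
n = 100
beta_0, beta_1 = 2.0, 0.5            # hypothetical "true" parameters
x = rng.uniform(0, 10, size=n)       # independent variable
u = rng.normal(0, 1, size=n)         # error term: everything else that moves y
y = beta_0 + beta_1 * x + u          # the data generating process for y_i
```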
Five items contribute to the existence of the disturbance term:
1. Omission of independent variables
2. Aggregation of variables
3. Model misspecification
4. Functional misspecification
5. Measurement error
Using an estimator with data on \(x_{i}\) and \(y_{i}\), we can estimate a fitted regression line:
\[ \hat{y}_{i} = \hat{\beta}_{0} + \hat{\beta}_{1}x_{i} \]
This procedure produces misses, known as residuals: \(\hat{u}_{i} = y_{i} - \hat{y}_{i}\)
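As a minimal sketch, here is how fitted values and residuals come out of a candidate pair of estimates (both the data and the guesses below are made up):

```python
import numpy as np

# Tiny made-up sample, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.3, 3.1, 3.2, 4.8, 4.9])

b0_hat, b1_hat = 1.5, 0.7        # hypothetical guesses, not necessarily the best choices
y_hat = b0_hat + b1_hat * x      # fitted values
u_hat = y - y_hat                # residuals: the "misses"
print(u_hat)
```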
Let’s look at an example of how this works
Does the number of on-campus police officers affect campus crime rates? If so, by how much?
Always plot your data first
The scatter plot suggests that a weak positive relationship exists
But correlation does not imply causation
Let's estimate a statistical model
We express the relationship between a dependent variable and an independent variable as linear:
\[ {\text{Crime}_i} = \beta_0 + \beta_1 \text{Police}_i + u_i. \]
\(\beta_0\) is the intercept or constant.
\(\beta_1\) is the slope coefficient.
\(u_i\) is an error term or disturbance term.
The intercept tells us the expected value of \(\text{Crime}_i\) when \(\text{Police}_i = 0\).
\[ \text{Crime}_i = {\color{#BF616A} \beta_{0}} + \beta_1\text{Police}_i + u_i \]
Usually not the focus of an analysis.
The slope coefficient tells us the expected change in \(\text{Crime}_i\) when \(\text{Police}_i\) increases by one.
\[ \text{Crime}_i = \beta_0 + {\color{#BF616A} \beta_1} \text{Police}_i + u_i \]
“A one-unit increase in \(\text{Police}_i\) is associated with a \(\color{#BF616A}{\beta_1}\)-unit increase in \(\text{Crime}_i\).”
Interpretation of this parameter is crucial
Under certain (strong) assumptions, \(\color{#BF616A}{\beta_1}\) is the effect of \(X_i\) on \(Y_i\).
The error term reminds us that \(\text{Police}_i\) does not perfectly explain \(\text{Crime}_i\).
\[ \text{Crime}_i = \beta_0 + \beta_1\text{Police}_i + {\color{#BF616A} u_i} \]
Represents all other factors that explain \(\text{Crime}_i\).
How might we apply the simple linear regression model to our question about the effect of on-campus police on campus crime?
\[ \text{Crime}_i = \beta_0 + \beta_1\text{Police}_i + u_i. \]
\(\beta_0\) and \(\beta_1\) are the unobserved population parameters we want.
We estimate them.
\(\hat{\beta_0}\) and \(\hat{\beta_1}\) generate predictions of \(\text{Crime}_i\) called \(\widehat{\text{Crime}_i}\).
We call the predictions of the dependent variable fitted values.
So, the question becomes: how do I pick \(\hat{\beta_0}\) and \(\hat{\beta_1}\)?
Let’s take some guesses: \(\hat{\beta_0} = 60\) and \(\hat{\beta}_{1} = -7\)
Let’s take some guesses: \(\hat{\beta_0} = 30\) and \(\hat{\beta}_{1} = 0\)
Let’s take some guesses: \(\hat{\beta_0} = 15.6\) and \(\hat{\beta}_{1} = 7.94\)
Using \(\hat{\beta}_{0}\) and \(\hat{\beta}_{1}\) to make \(\hat{y}_{i}\) generates misses.
Each guess, \(\hat{\beta}_{0} = 60\), \(30\), or \(15.6\), produces a different fitted line and different misses.
What if we picked an estimator that minimizes the residuals?
Why not minimize
\[ \sum_{i=1}^{n} \hat{u}_{i}^{2} \]
so that the estimator makes fewer big misses?
This objective, the residual sum of squares (RSS), is convenient because squared residuals are never negative, so positive and negative misses cannot cancel out.
RSS also gives bigger penalties to bigger residuals
We could test thousands of guesses of \(\beta_0\) and \(\beta_1\) and pick the pair that has the smallest RSS
We could painstakingly do that, and eventually figure out which one fits best.
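A brute-force sketch of that guessing idea, using made-up data and an arbitrary grid of candidate values:

```python
import numpy as np

# Made-up data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.0, 3.4, 4.9, 5.2, 6.8])

best_b0, best_b1, best_rss = None, None, np.inf
for b0 in np.linspace(-5, 5, 201):          # candidate intercepts (arbitrary grid)
    for b1 in np.linspace(-2, 2, 201):      # candidate slopes (arbitrary grid)
        rss = ((y - b0 - b1 * x) ** 2).sum()
        if rss < best_rss:
            best_b0, best_b1, best_rss = b0, b1, rss

print(f"Best grid guess: b0 = {best_b0:.2f}, b1 = {best_b1:.2f}, RSS = {best_rss:.2f}")
```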
Or… We could just do a little math
The OLS Estimator chooses the parameters \(\hat{\beta}_{0}\) and \(\hat{\beta}_{1}\) that minimize the Residual Sum of Squares (RSS)
\[ \min_{\hat{\beta}_{0},\hat{\beta}_{1}} \sum_{i=1}^{n} \hat{u}_{i}^{2} \]
This is why we call the estimator ordinary least squares
Recall that residuals are given by \(y_{i} - \hat{y}_{i}\) and that:
\[ \hat{y}_{i} = \hat{\beta}_{0} + \hat{\beta}_{1} x_{i} \]
Then
\[ \hat{u}_{i} = y_{i} - \hat{\beta}_{0} - \hat{\beta}_{1} x_{i} \]
We can find the choices \(\hat{\beta}_{0}\) and \(\hat{\beta}_{1}\) that minimize the sum of squared residuals using calculus
A minimization problem is an optimization problem: we find the point at which the objective has a slope of zero with respect to each of our choices
To begin, let’s properly write out our minimization problem:
\[ \min_{\hat{\beta}_{0}, \hat{\beta}_{1}} \;\; \sum_{i} \hat{u}_{i}^{2} \]
\[ \min_{\hat{\beta}_{0}, \hat{\beta}_{1}} \; \sum_{i} (y_{i} - \hat{y}_{i})^{2} \]
\[ \min_{\hat{\beta}_{0}, \hat{\beta}_{1}} \; \sum_{i} (y_{i} - \hat{\beta}_{0} - \hat{\beta}_{1}x_{i}) (y_{i} - \hat{\beta}_{0} - \hat{\beta}_{1}x_{i}) \]
The calculus amounts to taking the partial derivatives of this objective with respect to \(\hat{\beta}_{0}\) and \(\hat{\beta}_{1}\).
It is simple math, just a lot of it:
\[\begin{align*} \min_{\hat{\beta}_{0}, \hat{\beta}_{1}} &\; \sum_{i} y_{i}^{2} - \hat{\beta}_{0}y_{i} - \hat{\beta}_{1}x_{i}y_{i} - \hat{\beta}_{0}y_{i} + \hat{\beta}_{0}^{2} + \hat{\beta}_{0}\hat{\beta}_{1}x_{i} - \hat{\beta}_{1}x_{i}y_{i} + \hat{\beta}_{0}\hat{\beta}_{1}x_{i} + \hat{\beta}_{1}^{2}x_{i}^{2} \\ \min_{\hat{\beta}_{0}, \hat{\beta}_{1}} &\; \sum_{i} y_{i}^{2} - 2 \hat{\beta}_{0}y_{i} + \hat{\beta}_{0}^{2} - 2 \hat{\beta}_{1}x_{i}y_{i} + 2\hat{\beta}_{0}\hat{\beta}_{1}x_{i} + \hat{\beta}_{1}^{2}x_{i}^{2} \end{align*}\]
Then, we take partial derivatives with respect to our choices \(\hat{\beta}_{0}\) and \(\hat{\beta}_{1}\) to figure out the best values.
These are called First Order Conditions (FOCs)
To find our choices, we find the partial derivative and set it equal to 0
For our intercept \(\hat{\beta}_{0}\):
\[\begin{align*} \dfrac{\partial \sum_{i} \hat{u}_{i}^{2}}{\partial \hat{\beta}_{0}} &= 0 \\ \sum_{i} \left( -2y_{i} + 2\hat{\beta}_{0} + 2\hat{\beta}_{1}x_{i} \right) &= 0 \end{align*}\]
For our slope \(\hat{\beta}_{1}\):
\[\begin{align*} \dfrac{\partial \sum_{i} \hat{u}_{i}^{2}}{\partial \hat{\beta}_{1}} &= 0 \\ \sum_{i} \left( -2x_{i}y_{i} + 2\hat{\beta}_{0}x_{i} + 2\hat{\beta}_{1}x_{i}^{2} \right) &= 0 \end{align*}\]
\[ \sum_{i} \left( -2y_{i} + 2\hat{\beta}_{0} + 2\hat{\beta}_{1}x_{i} \right) = 0 \]
Our task is to solve the above for \(\hat{\beta}_{0}\):
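A sketch of the algebra: divide through by \(-2\), use the fact that \(\sum_{i} \hat{\beta}_{0} = n\hat{\beta}_{0}\), and then divide by \(n\):
\[\begin{align*} \sum_{i} y_{i} - n\hat{\beta}_{0} - \hat{\beta}_{1}\sum_{i} x_{i} &= 0 \\ \hat{\beta}_{0} &= \frac{1}{n}\sum_{i} y_{i} - \hat{\beta}_{1}\,\frac{1}{n}\sum_{i} x_{i} = \bar{y} - \hat{\beta}_{1}\bar{x} \end{align*}\]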
\[ \sum_{i} \left( -2x_{i}y_{i} + 2\hat{\beta}_{0}x_{i} + 2\hat{\beta}_{1}x_{i}^{2} \right) = 0 \]
Our task is to solve the above for \(\hat{\beta}_{1}\):
Intercept
\[ \hat{\beta}_{0} = \bar{y} - \hat{\beta}_{1}\bar{x} \]
Slope Coefficient
\[ \hat{\beta}_{1} = \dfrac{ \sum_{i=1}^{n} (y_{i} - \bar{y})(x_{i} - \bar{x}) }{ \sum_{i=1}^{n} (x_{i} - \bar{x})^{2} } \]
These may look slightly different from my derivation. Part of your assignment is to bridge the gap.
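These formulas translate directly into code. A minimal Python sketch on made-up data, cross-checked against NumPy's least-squares fit:

```python
import numpy as np

# Made-up data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.0, 3.4, 4.9, 5.2, 6.8])

b1_hat = ((y - y.mean()) * (x - x.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0_hat = y.mean() - b1_hat * x.mean()

# Cross-check against NumPy's degree-1 least-squares fit (returns slope, then intercept)
b1_np, b0_np = np.polyfit(x, y, deg=1)
print(b0_hat, b1_hat)
print(b0_np, b1_np)
```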
There are two stages in interpreting a regression equation:
1. Translating the regression estimates into words
2. Deciding whether this interpretation should be taken at face value
Both stages are important, but for now, we will focus on the first
Let’s revisit our crime example
Using the OLS formulas, we get \(\hat{\beta}_{0} = 18.41\) and \(\hat{\beta}_{1} = 1.76\)
How do I interpret \(\hat{\beta}_{0} = 18.41\) and \(\hat{\beta}_{1} = 1.76\)?
The general interpretation of the intercept is the estimated value of \(y_{i}\) when \(x_{i} = 0\)
And the general interpretation of the slope parameter is the estimated change in \(y_{i}\) for a one-unit (marginal) increase in \(x_{i}\)
First, it is important to understand the units:
\(\widehat{\text{Crime}}_{i}\) is measured as a crime rate, the number of crimes per 1,000 students on campus
\(\text{Police}_{i}\) is also measured as a rate, the number of police officers per 1,000 students on campus
Using OLS gives us the fitted line
\[ \widehat{\text{Crime}_i} = \hat{\beta}_0 + \hat{\beta}_1\text{Police}_i. \]
What does \(\hat{\beta_0}\) = \(18.41\) tell us? Without any police on campus, the crime rate is \(18.41\) per 1,000 people on campus
What does \(\hat{\beta_1}\) = \(1.76\) tell us? For each additional police officer per 1,000, there is an associated increase in the crime rate by \(1.76\) crimes per 1,000 people on campus.
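For example, plugging a hypothetical campus with 2 police officers per 1,000 students into the fitted line gives a predicted crime rate of
\[ \widehat{\text{Crime}}_i = 18.41 + 1.76 \times 2 = 21.93 \]
crimes per 1,000 people on campus.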
Does this mean that police cause crime? Probably not.
This is where the second stage of interpretation comes in: deciding whether the estimates should be taken at face value. It becomes your job to bring reason to the values.
EC320, Lecture 02 | Estimators