Learning R: Tibbles and lm()

EC 320 - Introduction to Econometrics

Jose Rojas-Fallas

2025

Preview

In this lecture, you will:

Learn about holding data in tibbles (tidy tables)
Use those tibbles to run regressions using the function lm()

Tibbles

What is a tibble?

Tibbles are the tidyverse version of a spreadsheet

Data is still being held in vectors (column vectors specifically), but the rows of a tibble also hold meaning.

The rows are the observations while the columns are the variables

Let’s look at an example

Ex. Daily Weather

Let’s write down each day’s high temp, low temp, and rainfall. In words the data is:

Jan 01, 2023: We had a high of \(46^{\circ}\), a low of \(37^{\circ}\) and \(0.07\) in. of rain
Jan 02, 2023: We had a high of \(46^{\circ}\), a low of \(35^{\circ}\) and \(0.00\) in. of rain
Jan 03, 2023: We had a high of \(47^{\circ}\), a low of \(34^{\circ}\) and \(0.08\) in. of rain

What should the observations (rows) be?
- Each day we went outside and observed the weather. So each day should have its own row.
What are the variables (columns) we observe?
- The date, the high temp, the low temp, rainfall

Ex. Daily Weather

So we want our tibble to look like:

Date	High Temp	Low Temp	Rainfall
1/1/23	46	37	0.07
1/2/23	46	35	0.00
1/3/23	47	34	0.08

The mantra “observations as rows, variables as columns” is what we call the tidy data format

There are tons of ways you could format your data, but the tidyverse is built to work with this one.

Luckily, it turns out to be very expressive

Ex. Daily Weather

Here’s the code to construct our tibble:

library(tidyverse)  # loads tibble() and the %>% pipe used later

tibble(
    date = as.Date(c("2023-01-01", "2023-01-02", "2023-01-03")),
    high_temp = c(46, 46, 47),
    low_temp = c(37, 35, 34),
    rainfall = c(0.07, 0.00, 0.08)
)

# A tibble: 3 × 4
  date       high_temp low_temp rainfall
  <date>         <dbl>    <dbl>    <dbl>
1 2023-01-01        46       37     0.07
2 2023-01-02        46       35     0   
3 2023-01-03        47       34     0.08

How it works:

Use the function tibble()
tibble() takes named vectors (here created with c()) that become the variable columns
Each variable column has a name, written to the left of the = sign
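
As a minimal sketch of the steps above, the tibble can be stored under a name and then inspected. The name weather is an assumption chosen for illustration, and the values follow the weather description given in words earlier.

```r
library(tibble)

# Store the tibble under a name (`weather` is a hypothetical name)
weather <- tibble(
    date = as.Date(c("2023-01-01", "2023-01-02", "2023-01-03")),
    high_temp = c(46, 46, 47),
    low_temp = c(37, 35, 34),
    rainfall = c(0.07, 0.00, 0.08)
)

weather$high_temp   # one variable (column), returned as a plain vector
weather[2, ]        # one observation (row): Jan 02
dim(weather)        # number of rows, then number of columns
```

Pulling out a column with $ gives back the same vector that was passed to tibble(), which is one way to see that the data is still held in column vectors.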

2 Rules for Tibbles

1. Each Column Must be Named

If you try to define a column without giving it a name, tibble() generates one for you (try it out yourself)
When naming variables, avoid spaces between words (use an underscore instead, as in high_temp)
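
A short sketch of the naming rule, assuming only the tibble package. The object names unnamed, spaced, and clean are hypothetical and used only for this illustration.

```r
library(tibble)

# Unnamed column: tibble() generates a name for it
unnamed <- tibble(c(46, 46, 47))
names(unnamed)

# A name containing a space is allowed, but it must be wrapped
# in backticks every time it is used
spaced <- tibble(`high temp` = c(46, 46, 47))
spaced$`high temp`

# An underscore avoids the backticks entirely
clean <- tibble(high_temp = c(46, 46, 47))
clean$high_temp
```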

2. Each Column Must Have the Same Number of Rows

If you try to define a column that is shorter than the others, tibble() will throw an error.
- Exception: If you define a column with only one element, tibble() will repeat it to make it the same length as the other columns.
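
The length rule and its exception can be checked with a sketch like the following. The column year and the object names are assumptions for illustration; tryCatch() is used only so the error message can be printed without stopping the script.

```r
library(tibble)

# A column with one element is repeated to match the other columns
recycled <- tibble(
    high_temp = c(46, 46, 47),
    year = 2023
)
recycled

# A column of length two cannot be matched to a column of length three,
# so tibble() throws an error; here we catch it and keep its message
result <- tryCatch(
    tibble(high_temp = c(46, 46, 47), low_temp = c(37, 35)),
    error = function(e) conditionMessage(e)
)
result
```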

Data Types

There are 3 types of data that you will come across

Cross-sectional
Time Series
Panel

Importantly, we cannot model different data types the same way.

In this class we build models of exclusively cross-sectional data.

Cross-Sectional Data

Cross-sectional Data is data about many individuals at (around) the same time. Here, “individuals” might refer to people, or households, companies, cities, states, countries, etc.

tibble(
    name = c("Kris", "Kourtney", "Kim"),
    study_time = c("< 2hrs", "2-5hrs", "< 2hrs"),
    final_grade = c(69.4, 89.7, 66.3)
)

# A tibble: 3 × 3
  name     study_time final_grade
  <chr>    <chr>            <dbl>
1 Kris     < 2hrs            69.4
2 Kourtney 2-5hrs            89.7
3 Kim      < 2hrs            66.3

The name “cross-sectional” comes from the fact that the data is a cross-section of some population. It is a snapshot in time of a sample of individuals.

Time Series Data

Time Series Data is data that follows one specific individual (or household, company, city, state, country, etc.) over time. For example, one student’s quiz grades over the course of a term form a time series:

tibble(
    assignment = c("Quiz 01", "Quiz 02", "Quiz 03", "Quiz 04", "Quiz 05"),
    study_hours = c(4,3,2,3,8),
    grade = c(75, 74, 69, 77, 89)
)

# A tibble: 5 × 3
  assignment study_hours grade
  <chr>            <dbl> <dbl>
1 Quiz 01              4    75
2 Quiz 02              3    74
3 Quiz 03              2    69
4 Quiz 04              3    77
5 Quiz 05              8    89

Panel Data

Panel Data describes many individuals over many periods of time. For example, many students’ scores over the course of a term would be panel data:

tibble(
    name = rep(c("Kris", "Kourtney", "Kim"), each = 3),
    assignment = rep(c("Quiz 01", "Quiz 02", "Quiz 03"), times = 3),
    grade = c(75, 78, 66, 95, 97, 90, 62, 66, 54)
)

# A tibble: 9 × 3
  name     assignment grade
  <chr>    <chr>      <dbl>
1 Kris     Quiz 01       75
2 Kris     Quiz 02       78
3 Kris     Quiz 03       66
4 Kourtney Quiz 01       95
5 Kourtney Quiz 02       97
6 Kourtney Quiz 03       90
7 Kim      Quiz 01       62
8 Kim      Quiz 02       66
9 Kim      Quiz 03       54
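
One way to see how the three data types relate is to slice the panel. This is only a sketch: the object names grades, kris_only, and quiz1_only are hypothetical, and base R subsetting is used in place of any particular tidyverse verb.

```r
library(tibble)

grades <- tibble(
    name = rep(c("Kris", "Kourtney", "Kim"), each = 3),
    assignment = rep(c("Quiz 01", "Quiz 02", "Quiz 03"), times = 3),
    grade = c(75, 78, 66, 95, 97, 90, 62, 66, 54)
)

length(unique(grades$name))        # number of individuals
length(unique(grades$assignment))  # number of time periods

# Keeping one individual across all periods leaves a time series
kris_only <- grades[grades$name == "Kris", ]

# Keeping one period across all individuals leaves a cross-section
quiz1_only <- grades[grades$assignment == "Quiz 01", ]
```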

Regressions Using lm()

lm()

Now we will use the function lm() to estimate a linear model

lm() takes 2 important arguments:

A formula created using the tilde symbol ~
Data (a tibble)

lm() outputs an lm object:

It stores information about the regression it just ran (coefficients, residuals, fitted values, and more)
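
As a sketch of the two arguments, the formula and the data can be passed to lm() directly, without a pipe. The names df and model are assumptions for illustration, and the small dataset is made up of three points.

```r
library(tibble)

df <- tibble(
    x = 1:3,
    y = c(4, 2, 1)
)

# First argument: the formula (outcome ~ explanatory variable)
# Second argument: the tibble that holds those variables
model <- lm(y ~ x, data = df)

class(model)   # the object returned by lm() has class "lm"
names(model)   # the pieces of information stored inside it
```

Saving the result under a name makes it possible to reuse the same regression several times without re-running it.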

Ex. Simple Regression

We create the following dataset and pipe it into lm() to run the regression

tibble(
    x = 1:3,
    y = c(4, 2, 1)
) %>%
    lm(y ~ x, data = .)


Call:
lm(formula = y ~ x, data = .)

Coefficients:
(Intercept)            x  
      5.333       -1.500

Note that we get our standard regression coefficients \(\hat{\beta}_{0}\) and \(\hat{\beta}_{1}\)
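
To connect these numbers to the usual OLS formulas, \(\hat{\beta}_{1}\) is the sample covariance of x and y divided by the sample variance of x, and \(\hat{\beta}_{0} = \bar{y} - \hat{\beta}_{1}\bar{x}\). The sketch below computes both by hand and compares them with lm(); the names b0, b1, and fit are hypothetical.

```r
x <- 1:3
y <- c(4, 2, 1)

# Slope: sample covariance of x and y over sample variance of x
b1 <- cov(x, y) / var(x)

# Intercept: mean of y minus slope times mean of x
b0 <- mean(y) - b1 * mean(x)

c(intercept = b0, slope = b1)

# Compare with the coefficients that lm() reports
fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(b0, b1))
```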

We can also unpack the lm object in lots of different ways.
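
A few of those ways, sketched with base R extractor functions. The names df and model are assumptions for illustration.

```r
library(tibble)

df <- tibble(x = 1:3, y = c(4, 2, 1))
model <- lm(y ~ x, data = df)

coef(model)                # estimated intercept and slope
nobs(model)                # number of observations used
summary(model)$r.squared   # R-squared of the regression
summary(model)             # full regression table
```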

Ex. Simple Regression

Let’s get the residuals using the residuals() function

tibble(
    x = 1:3,
    y = c(4, 2, 1)
) %>% 
    lm(y ~ x, data = .) %>%
    residuals()

         1          2          3 
 0.1666667 -0.3333333  0.1666667

We can also get the fitted values using fitted.values()

tibble(
    x = 1:3,
    y = c(4, 2, 1)
) %>%
    lm(y ~ x, data = .) %>%
    fitted.values()

        1         2         3 
3.8333333 2.3333333 0.8333333
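
Fitted values and residuals are two halves of the same data: each observed y equals its fitted value plus its residual. A quick check, with hypothetical names df, model, y_hat, and e_hat:

```r
library(tibble)

df <- tibble(x = 1:3, y = c(4, 2, 1))
model <- lm(y ~ x, data = df)

y_hat <- fitted.values(model)
e_hat <- residuals(model)

# Observed y = fitted value + residual, observation by observation
all.equal(unname(y_hat + e_hat), df$y)

# With an intercept in the model, the residuals sum to zero (up to rounding)
sum(e_hat)
```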

Practice: Download “Worksheet 02” From the Site

This worksheet will help you learn coding by doing. You will:

Construct your own tibble
Assign it to a variable name in your environment using <-
Use view() to open it in a separate tab
Find dimensions of the data
Find variable names
Add new observations
Run a regression and practice interpreting results
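
The steps in the list above might look roughly like this. It is a sketch, not the worksheet solution: the name study is hypothetical, the first five observations reuse the quiz data from the time series example, and the added sixth observation is made up for illustration.

```r
library(tibble)

# Construct a tibble and assign it to a name with <-
study <- tibble(
    hours = c(4, 3, 2, 3, 8),
    grade = c(75, 74, 69, 77, 89)
)

# view(study)   # opens the data in a separate tab (interactive sessions only)

dim(study)      # dimensions: rows, then columns
names(study)    # variable names

# Add a new observation (these values are invented for the example)
study <- add_row(study, hours = 5, grade = 80)

# Run a regression of grade on hours and look at the coefficients
model <- lm(grade ~ hours, data = study)
coef(model)
```

The slope on hours is read as the estimated change in grade associated with one more hour of study.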