I created this dashboard to look at estimating data points between distinct measurement periods. Sometimes you have data that is measured annually, like US Census population figures. If your other data is quarterly, monthly, or more frequent, you can interpolate the annual series to match. This is a comparison of different techniques for doing that.
I live in Miami, so I wanted to take a look at a collection of different ways to estimate the continuous population from the annual figures. One key comparison is between a linear fit, probably the most common tool, and cubic splines. A cubic spline that interpolates the data passes through every observed point, so it will always have the lowest in-sample error, but at the cost of poor out-of-sample estimates. Natural cubic splines, which are constrained to be linear at the ends of the sample, behave better there, but forecasting is another discussion.
Models Explored
- Linear Fit: A straightforward model assuming a constant rate of growth.
- Squared Linear Fit: A 2nd order polynomial model that accounts for some acceleration or deceleration in the growth trend.
- Polynomial Linear Fit: A 3rd order polynomial model, which might perform better with some other data sets than this one.
- Log-Linear Transformation: A model useful when the growth rate is proportional to the population size.
- Box-Cox Transformation: A flexible model that stabilizes variance and can normalize the data.
- Cubic Spline: A sophisticated interpolation method that fits piecewise cubic polynomials between data points.
- Natural Cubic Spline: The same piecewise cubic interpolation between data points, with the added constraint that the fit is linear at the beginning and end of the sample set.
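As a rough illustration of how the regression-style models in this list can be compared, the sketch below fits the linear, squared, 3rd order polynomial, and log-linear models with NumPy and reports the in-sample RMSE of each. The population values are made up for illustration and are not Census data, and the names `rmse` and `results` are hypothetical. Box-Cox and the spline models are left out to keep it short.

```python
import numpy as np

# Hypothetical annual population figures (illustrative only, not Census data).
years = np.array([2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022], dtype=float)
population = np.array(
    [430_000, 436_500, 441_200, 447_900, 452_300, 449_800, 455_100, 461_000],
    dtype=float,
)

# Measure time from the first year so the polynomial fits stay well conditioned.
t = years - years[0]


def rmse(actual, fitted):
    """Root mean squared error between observed and fitted values."""
    return float(np.sqrt(np.mean((actual - fitted) ** 2)))


results = {}

# Polynomial fits of order 1 (linear), 2 (squared), and 3.
for name, degree in [("linear", 1), ("squared", 2), ("polynomial (3rd)", 3)]:
    coeffs = np.polyfit(t, population, deg=degree)
    results[name] = rmse(population, np.polyval(coeffs, t))

# Log-linear: fit a straight line to log(population), then transform back.
log_coeffs = np.polyfit(t, np.log(population), deg=1)
results["log-linear"] = rmse(population, np.exp(np.polyval(log_coeffs, t)))

for name, err in results.items():
    print(f"{name:>17}: in-sample RMSE = {err:,.0f}")
```

Because each higher-order polynomial contains the lower-order one as a special case, its in-sample error can only stay the same or fall, which is one reason in-sample error alone is a weak basis for choosing between these models.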
Official data, such as that from the U.S. Census Bureau, is typically released annually. While simpler models like linear regression can capture an overall trend, they often fail to represent the nuanced, non-linear fluctuations that occur between these yearly data points.
Model Formulations
Linear Model
$ P(t) = \beta_0 + \sum_{i} \beta_i x_i + \epsilon $
Where:
$P(t) \text{ is the population at a specific time}$
$\beta_0 \text{ is the baseline population at time 0 or the start of our data set}$
$\beta_i \text{ is the weight for each data element being used to estimate } P(t)$
$x_i \text{ is each data element in the dataset}$
$\epsilon \text{ is any error between the estimate and actual value}$
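A minimal sketch of this model with a single data element, time, is shown below. It assumes elapsed years as the only regressor and uses made-up population values (not Census data). The variable names `X`, `beta_0`, `beta_1`, and `t_monthly` are illustrative choices, not part of the dashboard.

```python
import numpy as np

# Hypothetical annual observations (illustrative only, not Census data).
years = np.array([2018, 2019, 2020, 2021, 2022], dtype=float)
population = np.array([447_900, 452_300, 449_800, 455_100, 461_000], dtype=float)

# Time measured from the start of the data set, so beta_0 is the baseline.
t = years - years[0]

# Design matrix: a column of ones for beta_0 and a column of t for beta_1.
X = np.column_stack([np.ones_like(t), t])

# Ordinary least squares estimate of the weights.
beta, *_ = np.linalg.lstsq(X, population, rcond=None)
beta_0, beta_1 = beta

# epsilon: the gap between each observed value and its fitted value.
residuals = population - X @ beta

# Evaluate the fitted line on a monthly grid between the annual points.
t_monthly = np.linspace(t[0], t[-1], 12 * int(t[-1] - t[0]) + 1)
monthly_estimate = beta_0 + beta_1 * t_monthly

print(f"baseline = {beta_0:,.0f}, growth per year = {beta_1:,.0f}")
```

The monthly grid is where the interpolation happens: the line is fit to annual data and then evaluated at the in-between dates.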
Cubic Spline Model
The data is fit with a series of piecewise third-order polynomials.
$\sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \int_{a}^{b} f''(t)^2 dt $
Where:
$f \text{ is the fitted spline function}$
$\sum_{i=1}^{n} (y_i - f(x_i))^2 \text{ is the residual sum of squared errors} $
$\lambda \text{ is the weight on the roughness penalty, the integrated squared second derivative of the function} $
Within each interval between data points, the spline can also be written as a cubic polynomial in time: $P_i(t) = a_i + b_i t + c_i t^2 + d_i t^3 $
Where:
$P_i(t) \text{ is the population estimate in the ith period}$
$a_i,b_i,c_i,d_i \text{ are the weights for the cubic polynomial}$
$t,t^2,t^3 \text{ are the powers of time that make up the cubic polynomial}$
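The piecewise form can be sketched with SciPy's `CubicSpline`, assuming SciPy is available. The population values below are made up for illustration (not Census data). Note that SciPy stores each segment's weights in powers of $(t - t_i)$, the distance from the start of the segment, rather than in powers of $t$ itself, so its coefficients are a shifted version of $a_i, b_i, c_i, d_i$.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Hypothetical annual observations (illustrative only, not Census data).
years = np.array([2018, 2019, 2020, 2021, 2022], dtype=float)
population = np.array([447_900, 452_300, 449_800, 455_100, 461_000], dtype=float)
t = years - years[0]

# Default end condition ("not-a-knot") versus the natural end condition.
default_spline = CubicSpline(t, population)
natural_spline = CubicSpline(t, population, bc_type="natural")

# An interpolating spline reproduces every observed point.
in_sample_error = np.max(np.abs(default_spline(t) - population))

# Monthly estimates between the annual figures.
t_monthly = np.linspace(t[0], t[-1], 12 * int(t[-1] - t[0]) + 1)
monthly_default = default_spline(t_monthly)
monthly_natural = natural_spline(t_monthly)

# The natural spline has zero second derivative at both ends of the sample.
end_curvature = natural_spline(t[[0, -1]], 2)

# Per-segment weights: c[k, i] multiplies (t - t_i) ** (3 - k) on segment i.
d_i, c_i, b_i, a_i = natural_spline.c

print(f"max in-sample error: {in_sample_error:.6f}")
print(f"number of cubic segments: {a_i.size}")
```

The two splines agree at the annual points and differ most near the ends of the sample, which is where the choice of end condition matters.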