The Least Squares Regression

For scattered data, you may wish to draw a line of best fit. A straight-line regression has the form y = mx + c. We will explore how to determine the parameters m and c. Least squares regression is not restricted to straight lines; it can be applied to polynomials of any order. We will also look at a quadratic regression example, which has the form y = a + bx + cx².

Important Notation

To simplify notation, we can use S instead of repeatedly writing out Σ notation:
Sxx = \(Σ_{i=1}^{n} x_{i} x_{i} - {Σ_{i=1}^{n}x_{i}Σ_{i=1}^{n}x_{i}\over n}\) ;
Syy = \(Σ_{i=1}^{n} y_{i} y_{i} - {Σ_{i=1}^{n}y_{i}Σ_{i=1}^{n}y_{i}\over n}\) ;
Sxy = \(Σ_{i=1}^{n} x_{i} y_{i} - {Σ_{i=1}^{n}x_{i}Σ_{i=1}^{n}y_{i}\over n}\) ;
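
As a minimal sketch of how these three sums could be computed, the Python snippet below evaluates Sxx, Syy and Sxy directly from the definitions above. The function name `s_sums` and the sample data are illustrative assumptions, not part of the original text.

```python
def s_sums(xs, ys):
    """Return (Sxx, Syy, Sxy) for paired samples xs and ys."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("xs and ys must be non-empty and of equal length")
    sum_x = sum(xs)
    sum_y = sum(ys)
    # Each S value is a sum of products minus (product of sums) / n.
    sxx = sum(x * x for x in xs) - sum_x * sum_x / n
    syy = sum(y * y for y in ys) - sum_y * sum_y / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    return sxx, syy, sxy


# Made-up data, used only to exercise the function.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
sxx, syy, sxy = s_sums(xs, ys)
print(sxx, syy, sxy)
```

For this made-up data set the sums work out by hand to Sxx = 10, Syy = 6 and Sxy = 6.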

Determining m and c

The error is the sum of the squared residuals. A residual is the difference between the true y value and the value estimated by the least squares regression line. The residuals are squared so that a positive residual and a negative residual cannot cancel each other out. The aim is to minimize the error, which means finding the values of m and c at which the error is a minimum. For a function of one variable, we use differentiation to find the minimum. For a function of two or more variables, we use partial differentiation: it works like ordinary differentiation, except that the other independent variables are treated as constants. Setting both partial derivatives of the error to zero gives two equations for m and c.
\( E = (y_{1} - (mx_{1}+c))^{2} + (y_{2} - (mx_{2}+c))^{2} + (y_{3} - (mx_{3}+c))^{2} + ... \)
\( {∂E\over ∂c} = -2(y_{1} - (mx_{1}+c)) -2(y_{2} - (mx_{2}+c)) -2(y_{3} - (mx_{3}+c)) - ... = 0\)
\( -{1\over 2}{∂E\over ∂c} = (y_{1} - (mx_{1}+c)) + (y_{2} - (mx_{2}+c)) +(y_{3} - (mx_{3}+c)) + ... = 0\)
\( -{1\over 2}{∂E\over ∂c} = Σ_{i=1}^{n} y_{i} - mΣ_{i=1}^{n} x_{i} -nc = 0\)
Dividing by n and rearranging, where x̄ and ȳ are the means of the x and y values:
c = ȳ - mx̄
\( {∂E\over ∂m} = -2x_{1}(y_{1} - (mx_{1}+c)) -2x_{2}(y_{2} - (mx_{2}+c)) -2x_{3}(y_{3} - (mx_{3}+c)) - ... = 0\)
\( -{1\over 2}{∂E\over ∂m} = x_{1}(y_{1} - (mx_{1}+c)) + x_{2}(y_{2} - (mx_{2}+c)) +x_{3}(y_{3} - (mx_{3}+c)) + ... = 0\)
\(Σ_{i=1}^{n} x_{i} y_{i} - mΣ_{i=1}^{n}x_{i}x_{i} -cΣ_{i=1}^{n}x_{i} = 0 \)
To find the parameter m, we substitute the expression for c.
\(Σ_{i=1}^{n} x_{i} y_{i} - mΣ_{i=1}^{n}x_{i}x_{i} -[ȳ - mx̄]Σ_{i=1}^{n}x_{i} = 0 \)
\(Σ_{i=1}^{n} x_{i} y_{i} - mΣ_{i=1}^{n}x_{i}x_{i} -Σ_{i=1}^{n} {y_{i}\over n} Σ_{i=1}^{n}x_{i}+mΣ_{i=1}^{n}{x_{i}\over n}Σ_{i=1}^{n}x_{i} = 0 \)
\( m = {Σ_{i=1}^{n} x_{i} y_{i} - {Σ_{i=1}^{n}x_{i}Σ_{i=1}^{n}y_{i}\over n} \over Σ_{i=1}^{n} x_{i} x_{i} - {Σ_{i=1}^{n}x_{i}Σ_{i=1}^{n}x_{i} \over n} }\)
\( m = {S_{xy}\over S_{xx}}\)
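
The result above can be sketched in a few lines of Python: compute Sxy and Sxx, take their ratio for m, then recover c from the means. The function name `fit_line` and the sample data are assumptions made for illustration only.

```python
def fit_line(xs, ys):
    """Return (m, c) for the least squares line y = m*x + c."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two points and equal-length inputs")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sxx = sum(x * x for x in xs) - sum_x * sum_x / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    if sxx == 0:
        # All x values are identical, so the slope is undefined.
        raise ValueError("Sxx is zero; cannot fit a non-vertical line")
    m = sxy / sxx                    # m = Sxy / Sxx
    c = sum_y / n - m * sum_x / n    # c = y_mean - m * x_mean
    return m, c


# Made-up data, used only to exercise the function.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
m, c = fit_line(xs, ys)
print(f"y = {m:.3f}x + {c:.3f}")
```

With this made-up data, Sxy = 6 and Sxx = 10, so the fitted line is approximately y = 0.6x + 2.2.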

The regression line is the line of best fit through the points. Each red vertical line is a residual: the difference between the true data point and the estimated point on the regression line.

Residuals

The residual is the true value minus the estimate, where \(y_{e}\) denotes the estimated value on the regression line.
\(r_{i} = y_{i}- y_{e}\)
\( r_{i} = y_{i}- [mx_{i} + c]\)
We can express c in terms of m.
\( r_{i} = y_{i}- [mx_{i} + ȳ-mx̄]\)
Then we can expand the expression for \(r_{i} \)
\( r_{i} = y_{i}- mx_{i} - ȳ+ mx̄\)
\(Σ_{i=1}^{n} r_{i} = Σ_{i=1}^{n}y_{i} - m Σ_{i=1}^{n}x_{i} -Σ_{i=1}^{n}ȳ + mΣ_{i=1}^{n}x̄\)
\(Σ_{i=1}^{n} r_{i} = nȳ - mnx̄ -nȳ+mnx̄\)
\(Σ_{i=1}^{n} r_{i} = 0\)
The sum of all the residuals is 0. It follows that the mean of the residuals is also 0.
\(Σ_{i=1}^{n} {r_{i}\over n} = 0\)
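
This property can be checked numerically. The sketch below fits a line to some made-up data and sums the residuals; the helper name `line_residuals` is hypothetical, and because of floating-point rounding the sum is only expected to be zero to within a small tolerance.

```python
import math


def line_residuals(xs, ys):
    """Fit y = m*x + c by least squares and return the list of residuals."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sxx = sum(x * x for x in xs) - sum_x * sum_x / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    m = sxy / sxx
    c = sum_y / n - m * sum_x / n
    # r_i = y_i - (m * x_i + c)
    return [y - (m * x + c) for x, y in zip(xs, ys)]


# Made-up data, used only to exercise the function.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
rs = line_residuals(xs, ys)
total = sum(rs)
mean_residual = total / len(rs)
print(math.isclose(total, 0.0, abs_tol=1e-9))
```

Individual residuals are generally non-zero; only their sum, and therefore their mean, vanishes.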

Least squares for Quadratics

Least squares regression is not limited to straight lines. If the curve is expected to be quadratic, the regression curve has the following form:
\(y = a + bx + cx^{2}\)
\( E = (y_{1} - (a+bx_{1}+cx_{1}^2))^{2} + (y_{2} - (a+bx_{2}+cx_{2}^2))^{2} + (y_{3} - (a+bx_{3}+cx_{3}^2))^{2} + ... \)
\( {∂E\over ∂c} = -2x_{1}^2(y_{1} - (a+bx_{1}+cx_{1}^2)) -2x_{2}^2(y_{2} - (a+bx_{2}+cx_{2}^2)) -2x_{3}^2(y_{3} - (a+bx_{3}+cx_{3}^2)) - ... = 0\)
\( -{1\over 2}{∂E\over ∂c} = Σ_{i}x_{i}^2y_{i} - Σ_{i}[ax_{i}^2 +bx_{i}^3+cx_{i}^4] = 0\)
\( {∂E\over ∂b} = -2x_{1}(y_{1} - (a+bx_{1}+cx_{1}^2)) -2x_{2}(y_{2} - (a+bx_{2}+cx_{2}^2)) -2x_{3}(y_{3} - (a+bx_{3}+cx_{3}^2)) - ... = 0 \)
\( -{1\over 2}{∂E\over ∂b} = Σ_{i}x_{i}y_{i} - Σ_{i}[ax_{i} +bx_{i}^2+cx_{i}^3] = 0\)
\( {∂E\over ∂a} = -2(y_{1} - (a+bx_{1}+cx_{1}^2)) -2(y_{2} - (a+bx_{2}+cx_{2}^2)) -2(y_{3} - (a+bx_{3}+cx_{3}^2)) - ... =0 \)
\( -{1\over 2}{∂E\over ∂a} = Σ_{i}y_{i} - Σ_{i} [a+bx_{i}+cx_{i}^2] = 0\)
We wish to solve the following system of equations.
\( aΣ_{i}x_{i}^2 +bΣ_{i}x_{i}^3+cΣ_{i}x_{i}^4 = Σ_{i}x_{i}^2y_{i}\)
\( aΣ_{i}x_{i} +bΣ_{i}x_{i}^2+cΣ_{i}x_{i}^3 = Σ_{i}x_{i}y_{i}\)
\( an +bΣ_{i}x_{i}+cΣ_{i}x_{i}^2 = Σ_{i}y_{i}\)
For finding the coefficients of polynomial regression of order 2 or greater it is much more convenient to use matrices.
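
As one possible illustration of the matrix approach, the sketch below writes the three equations as a 3×3 linear system in the unknowns (a, b, c) and solves it with Gaussian elimination. It uses only the Python standard library; the function names `solve_linear_system` and `fit_quadratic` and the sample data are assumptions made for this example. The rows are listed in the order a, b, c, which is the reverse of the order shown above but describes the same system.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Build the augmented matrix [matrix | rhs] as a fresh copy.
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("system is singular or nearly singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Eliminate this column from every row below the pivot row.
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution.
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution


def fit_quadratic(xs, ys):
    """Return (a, b, c) for the least squares curve y = a + b*x + c*x**2."""
    if len(xs) < 3 or len(xs) != len(ys):
        raise ValueError("need at least three points and equal-length inputs")
    # power_sums[p] is the sum of x**p; power_sums[0] equals n.
    power_sums = [sum(x ** p for x in xs) for p in range(5)]
    # moment_sums[p] is the sum of x**p * y.
    moment_sums = [sum((x ** p) * y for x, y in zip(xs, ys)) for p in range(3)]
    matrix = [
        [power_sums[0], power_sums[1], power_sums[2]],
        [power_sums[1], power_sums[2], power_sums[3]],
        [power_sums[2], power_sums[3], power_sums[4]],
    ]
    a, b, c = solve_linear_system(matrix, moment_sums)
    return a, b, c


# Made-up data lying exactly on y = 1 + 2x + 3x^2.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [9.0, 2.0, 1.0, 6.0, 17.0]
a, b, c = fit_quadratic(xs, ys)
print(a, b, c)
```

Because the sample points lie exactly on a quadratic, the fit should recover a ≈ 1, b ≈ 2 and c ≈ 3 up to rounding. For higher-order polynomials or large data sets, a dedicated linear algebra library is likely a better choice than this hand-written solver.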

Key Takeaways

The least squares regression line is the line that minimizes the sum of the squared residuals over the data points. A residual is the difference between the true y value and the value estimated by the line. The method is not limited to straight lines; it can also be used for quadratics and other polynomials. For polynomials of order 2 or higher, it is much more convenient to use matrices to find the parameters.