Have you ever wondered why some people are really good at sports or attain high grades in Mathematics? Maybe there is a link between the number of hours they practice, the quality of sleep, family support, and other circumstances.

There are many reasons! But it is interesting to see which factor has the strongest correlation with their success, and which has no relation to it at all. Hours spent playing Sudoku may not help with getting good grades in Statistics.

In real-life fields such as economics and biology, variables are often related to one another. Statisticians explore whether one variable tends to influence another: observations of the two variables are paired and analysed together to establish the nature and the strength of the relationship.

Introduction to Two-Variable Data Statistics

Two-variable data arises when we compare two data sets to see how they are linked, for example the number of hours studied and exam results. Data can be quantitative, such as height or shoe size, or categorical, such as race, gender or nationality.

Representing Two Categorical Variables

You can cross-tabulate two categorical variables, for example hair color and gender, in a two-way table.

	Male	Female	Total
Blonde	4	8	12
Brunette	3	6	9
Total	7	14	21
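A two-way table like the one above can be built by counting how often each pair of categories occurs. The sketch below is a minimal illustration using only the Python standard library; the function name `two_way_table` is made up for this example, and the observations are simply rebuilt from the counts in the table.

```python
from collections import Counter

def two_way_table(pairs):
    """Count how often each (row category, column category) pair occurs."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    # Counter returns 0 for pairs that never occur, so every cell is filled.
    table = {r: {c: counts[(r, c)] for c in cols} for r in rows}
    row_totals = {r: sum(table[r].values()) for r in rows}
    col_totals = {c: sum(table[r][c] for r in rows) for c in cols}
    return table, row_totals, col_totals

# Observations rebuilt from the counts in the table above.
observations = (
    [("Blonde", "Male")] * 4 + [("Blonde", "Female")] * 8
    + [("Brunette", "Male")] * 3 + [("Brunette", "Female")] * 6
)

table, row_totals, col_totals = two_way_table(observations)
print(table)
print(row_totals)
print(col_totals)
```

The row totals and column totals both add up to the grand total of 21, which is a quick check that no observation has been lost.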

Representing Two Quantitative Variables

You can use a scatter plot to show the relationship between two quantitative variables. A clear relationship shows up as a pattern in the points. If the points are spread evenly over all 4 quadrants around the mean point \((\bar{x},\bar{y})\), you may conclude that there is no relationship.

Linear Correlation

When two variables are strongly correlated, there is a relationship between them; for example, students who do well in Mathematics tend to do well in Physics. Two variables can be positively correlated, negatively correlated, or have no correlation.

The correlation coefficient r measures the direction and strength of the linear relationship between two variables.

\(r = {\mathbb{Cov}[X,Y] \over \sqrt{\mathbb{Var}[X]}\sqrt{\mathbb{Var}[Y]}}\)

r is a value between -1 and 1. When r is positive, the data is positively correlated. When r is negative, the data is negatively correlated. If r is 0, there is no linear correlation.
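The formula for r can be evaluated directly from a list of paired observations. The following is a minimal sketch, assuming the covariance and variance are the population forms (dividing by n); the choice of divisor does not affect r because it cancels. The function names and the hours/scores data are invented for illustration only.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Population covariance: mean of the products of deviations."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / len(xs)

def variance(xs):
    # The variance is the covariance of a variable with itself.
    return covariance(xs, xs)

def correlation(xs, ys):
    """Correlation coefficient r = Cov[X,Y] / (sqrt(Var[X]) * sqrt(Var[Y]))."""
    return covariance(xs, ys) / (sqrt(variance(xs)) * sqrt(variance(ys)))

# Made-up illustrative data: hours studied and exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 70, 72]

r = correlation(hours, scores)
print(round(r, 3))
```

For this made-up data r is close to 1, which matches the strong upward trend in the scores. Note that the formula is undefined if either variable is constant, because its variance is then 0.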

Positive correlation

For positive correlation, most of the data is in the first or third quadrant. An example would be a Mathematics test result and a Physics test result. The centre of the 4 quadrants is \((\bar{x},\bar{y})\).

Positive correlation: data lies mainly in the first and third quadrants.

\begin{array}{|c|c|c|c|c|} \hline \text{Quadrant} & x_{i}-\bar{x} & y_{i}-\bar{y} & [x_{i}-\bar{x}][y_{i}-\bar{y}] & \Sigma[x_{i}-\bar{x}][y_{i}-\bar{y}]\\ \hline 1 & \text{+ve} & \text{+ve} & \text{+ve} & \text{+ve} \\ \hline 2 & & & & \\ \hline 3 & \text{-ve} & \text{-ve} & \text{+ve} & \text{+ve} \\ \hline 4 & & & & \\ \hline \end{array}

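Covariance measures how two variables vary together. When most points lie in the first and third quadrants, the products of the deviations are mostly positive, so their sum is positive.

\(\mathbb{Cov}[X,Y] = {\Sigma [x_{i}-\bar{x}][y_{i}-\bar{y}] \over n} > 0 \)

The covariance is positive, hence the correlation coefficient is also positive.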

\begin{array}{|c|c|c|c|c|} \hline \text{Quadrant} & x_{i}-\bar{x} & y_{i}-\bar{y} & [x_{i}-\bar{x}][y_{i}-\bar{y}] & \Sigma[x_{i}-\bar{x}][y_{i}-\bar{y}]\\ \hline 1 & & & & \\ \hline 2 & \text{-ve} & \text{+ve} & \text{-ve} & \text{-ve} \\ \hline 3 & & & & \\ \hline 4 & \text{+ve} & \text{-ve} & \text{-ve} & \text{-ve} \\ \hline \end{array}

For negative correlation, most of the data is in the second or fourth quadrant.

\(\mathbb{Cov}[X,Y] = {\Sigma [x_{i}-\bar{x}][y_{i}-\bar{y}] \over n} < 0 \)

The covariance is negative, hence the correlation coefficient is also negative.

For no correlation, the data is evenly distributed in all four quadrants. An example of no correlation could be a Mathematics test result and the time to complete a 5 km race.

No correlation: data is evenly distributed in all 4 quadrants.

There should be no relationship between the two variables.

\begin{array}{|c|c|c|c|c|} \hline \text{Quadrant} & x_{i}-\bar{x} & y_{i}-\bar{y} & [x_{i}-\bar{x}][y_{i}-\bar{y}] & \Sigma[x_{i}-\bar{x}][y_{i}-\bar{y}]\\ \hline 1 &\text{+ve} & \text{+ve} & \text{+ve} & \text{+ve} \\ \hline 2 & \text{-ve} & \text{+ve} & \text{-ve} & \text{-ve} \\ \hline 3 &\text{-ve} & \text{-ve} & \text{+ve} & \text{+ve} \\ \hline 4 & \text{+ve} & \text{-ve} & \text{-ve} & \text{-ve} \\ \hline \end{array}

Since the points are evenly distributed in all quadrants, the positive and negative values in the last column cancel each other out.

\(\mathbb{Cov}[X,Y] = {\Sigma [x_{i}-\bar{x}][y_{i}-\bar{y}] \over n} = 0 \)

The covariance is 0, hence the correlation coefficient is also 0. Another example would be maths results and height: these variables have no relationship, so for real data the covariance would be close to 0.
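The quadrant argument can be checked numerically by adding up the products of deviations separately for each quadrant. The sketch below assumes the quadrants are numbered anticlockwise from the top right, with the mean point \((\bar{x},\bar{y})\) as the centre; the function name and both data sets are made up for illustration.

```python
def quadrant_contributions(xs, ys):
    """Sum (x - x_bar)(y - y_bar) per quadrant and return the covariance."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    totals = {1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}
    for x, y in zip(xs, ys):
        dx, dy = x - x_bar, y - y_bar
        if dx == 0 or dy == 0:
            continue  # point lies on an axis through the mean: product is 0
        if dx > 0 and dy > 0:
            quadrant = 1
        elif dx < 0 and dy > 0:
            quadrant = 2
        elif dx < 0 and dy < 0:
            quadrant = 3
        else:
            quadrant = 4
        totals[quadrant] += dx * dy
    cov = sum(totals.values()) / n
    return totals, cov

# Made-up data with an upward trend: quadrants 1 and 3 dominate.
trend_totals, trend_cov = quadrant_contributions(
    [1, 2, 3, 4, 5, 6], [2, 1, 4, 3, 6, 5]
)
print(trend_totals, trend_cov)

# Made-up data spread evenly over all four quadrants: contributions cancel.
even_totals, even_cov = quadrant_contributions([1, 1, 3, 3], [1, 3, 1, 3])
print(even_totals, even_cov)
```

Quadrants 1 and 3 can only contribute positive amounts and quadrants 2 and 4 only negative amounts, so the sign of the covariance depends on which pair of quadrants outweighs the other.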

Linear Regression Models

Data may be influenced by other factors. For example, an exam result could be influenced by the number of hours studied. Linear regression models assume there is a straight-line relationship between the variables.

Linear regression model: the line of best fit through the data points.

Residuals

A residual is the difference between the true y value and the estimate from the regression line. A point that is on the line of best fit would have a residual of 0. The residuals are indicated by the vertical red lines on the scatter plots.
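As a small illustration, the residual of each point can be computed once a line y = mx + c has been chosen. In this sketch the line y = 2x + 1 and the three points are made up, and `residuals` is a hypothetical helper name.

```python
def residuals(xs, ys, m, c):
    """Residual of each point: true y minus the estimate m*x + c."""
    return [y - (m * x + c) for x, y in zip(xs, ys)]

# Made-up points compared against the made-up line y = 2x + 1.
xs = [1, 2, 3]
ys = [3, 6, 6]

res = residuals(xs, ys, m=2, c=1)
print(res)
```

The first point lies on the line, so its residual is 0. The second point lies above the line (positive residual) and the third lies below it (negative residual).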

Least Squares Regression

For scattered data, you may wish to draw a line of best fit. A straight line regression will be in the form of y = mx + c. We will explore how we can determine the parameters m and c.

The error E is the sum of the squared residuals. Each residual is the difference between the true y value and the estimate from the least squares regression line. The residuals are squared so that a positive residual and a negative residual cannot cancel each other out.
\( E = (y_{1} - (mx_{1}+c))^{2} + (y_{2} - (mx_{2}+c))^{2} + (y_{3} - (mx_{3}+c))^{2} + ... \)

To find the parameter values that minimise the sum of the squared residuals, differentiate E with respect to m and with respect to c, and set each derivative equal to 0. For a straight line the parameter values are:

\(c = \bar{y} - m\bar{x}\)

\( m = {\Sigma [x_{i}-\bar{x}][y_{i}-\bar{y}] \over\Sigma [x_{i}-\bar{x}]^{2}}\)
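The two formulas for m and c translate directly into code. The following is a minimal sketch in plain Python, not a definitive implementation; the function names and the hours/scores data are invented for illustration.

```python
def least_squares_line(xs, ys):
    """Return (m, c) for the least squares line y = m*x + c."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("all x values are equal, so the slope is undefined")
    m = s_xy / s_xx
    c = y_bar - m * x_bar
    return m, c

def sum_squared_residuals(xs, ys, m, c):
    """The error E: sum of (y - (m*x + c))^2 over all points."""
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))

# Made-up illustrative data: hours studied and exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 70, 72]

m, c = least_squares_line(hours, scores)
best_error = sum_squared_residuals(hours, scores, m, c)
print(m, c, best_error)
```

Nudging either parameter away from the fitted values, for example using m + 0.1 or c - 0.5, gives a larger value of E, which is what it means for the line to be the least squares fit.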

Departures from Linearity

A straight line graph may not suit the data. You can use other polynomials such as quadratics, cubics etc. The aim is to find a function so that the sum of residuals squared is a minimum.

If the curve is expected to be a quadratic, then the regression curve would be the following:

\(y = a + bx + cx^{2}\)

\( E = (y_{1} - (a+bx_{1}+cx_{1}^2))^{2} + (y_{2} - (a+bx_{2}+cx_{2}^2))^{2} + (y_{3} - (a+bx_{3}+cx_{3}^2))^{2} + ... \)

You will need to differentiate E with respect to each of a, b and c and equate each derivative to 0 in order to find the values of the parameters which minimise the sum of the residuals squared.
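Setting the three derivatives to 0 gives three linear equations in a, b and c (often called the normal equations), which can be solved like any other linear system. The sketch below is one way to do this in plain Python, using Gaussian elimination; the function names are invented for illustration, and the data is made up so that it lies exactly on y = 1 + 2x + 3x^2. In practice a library routine such as `numpy.polyfit` would usually be used instead.

```python
def solve_linear_system(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    aug = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("system is singular or nearly singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    # Back substitution.
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][k] * solution[k] for k in range(r + 1, n))
        solution[r] = (aug[r][n] - tail) / aug[r][r]
    return solution

def fit_quadratic(xs, ys):
    """Return (a, b, c) minimising the sum of (y - (a + b*x + c*x^2))^2."""
    # Sums of powers of x and of y times powers of x.
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    # Row i comes from setting the derivative for the i-th parameter to 0.
    normal_matrix = [[s[i + j] for j in range(3)] for i in range(3)]
    a, b, c = solve_linear_system(normal_matrix, t)
    return a, b, c

# Made-up data lying exactly on y = 1 + 2x + 3x^2.
xs = [-2, -1, 0, 1, 2]
ys = [9, 2, 1, 6, 17]

a, b, c = fit_quadratic(xs, ys)
print(a, b, c)
```

Because the made-up points lie exactly on a quadratic, the fit recovers its coefficients and every residual is 0 (up to rounding). With real, scattered data the residuals would not all vanish, but their squared sum would still be as small as any quadratic can make it.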

Two-Variable Data - Key takeaways

The covariance of independent variables is 0.
Independent variables have a correlation of 0.
A pair of data can have positive, negative or no correlation.
The correlation coefficient r takes values from -1 to 1.