Data is only useful when it has meaning. It becomes significant if we compare it to similar data sets.


If we are comparing Grade 12 students sitting an AP Mathematics exam, we may be interested in which class has the highest mark. Perhaps the class is filled with geniuses, or maybe the class has a good teacher. We would also like to know the spread of the exam results.


Do all the students in a class have a similar ability or is the class mixed-ability? Perhaps the school groups students into sets according to ability.


How do we determine if a student should be in that set? Here we are looking at extreme values (outliers).


If a student is far beyond the ability of the class, they may need to move up a set. This is an example of analyzing one-variable data.

Discrete or Continuous Data

Data can be divided into two types: discrete or continuous.


Data that can only be a specified integer value is discrete.


For example, the number of children in a household is considered discrete data, since the number can only be an integer. Another example is height recorded to the nearest meter: since decimals are not accepted, each recorded value can only be an integer.


When representing discrete data, we tabulate the frequency for specific values.


Eleven adults on the street were asked how many children they had in their household and the following answers were noted down:

\[1, 2, 3, 3, 2, 1, 1, 4, 2, 4, 2.\]

This set of data is composed of integer values, so this data is discrete. We can organize this data in the form of a table of frequencies.


Number of children in a household Frequency
1 3
2 4
3 2
4 2
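As a minimal sketch, the tally above can be reproduced in Python with the standard library's `collections.Counter`. The variable names are illustrative and not part of the original example.

```python
from collections import Counter

# Survey answers from the example above: children per household.
children = [1, 2, 3, 3, 2, 1, 1, 4, 2, 4, 2]

# Counter maps each distinct value to the number of times it occurs.
frequency = Counter(children)

# Print one row per value, in ascending order, like the frequency table.
for value in sorted(frequency):
    print(value, frequency[value])
```

Because the data is discrete, each distinct value gets its own row.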


In contrast to discrete data, continuous data is data that can take any value inside an interval.


Measures such as weight, or time to complete a lap are examples of continuous data.


For this type of data, you record the frequency for intervals.


Twelve people were asked to write their full names as fast as they could. The times, in seconds, obtained were the following:
\[1.2, 1.4, 1.9, 2.0, 2.0, 3.7, 3.9, 4.2, 4.2, 6.0, 6.5, 7.6.\]

These times were organized in a table of frequencies, divided into two-second intervals.




Time (seconds) Frequency
\(0≤t<2\) 3
\(2≤t<4\) 4
\(4≤t<6\) 2
\(6≤t<8\) 3
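For continuous data, the counting step can be sketched in Python as follows. The helper `interval_frequencies` is a hypothetical name introduced here for illustration. It assumes non-negative values and intervals of the form \(k \cdot w ≤ t < (k+1) \cdot w\), where \(w\) is the interval width.

```python
# Times (in seconds) from the example above.
times = [1.2, 1.4, 1.9, 2.0, 2.0, 3.7, 3.9, 4.2, 4.2, 6.0, 6.5, 7.6]


def interval_frequencies(values, width, n_intervals):
    """Count how many values fall in each interval [k*width, (k+1)*width)."""
    counts = [0] * n_intervals
    for v in values:
        k = int(v // width)  # index of the interval that contains v
        counts[k] += 1
    return counts


width = 2  # two-second intervals
counts = interval_frequencies(times, width, 4)
for k, c in enumerate(counts):
    print(f"{k * width} <= t < {(k + 1) * width}: {c}")
```

A value that sits exactly on a boundary, such as 2.0, is counted in the interval that starts at that boundary.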


Measures of Central Tendency

The measures of central tendency describe what is typical or characteristic of a variable. The mode, median, and mean are the three measures of central tendency. These averages summarize the data. It is important to know when each measure is most useful and to be aware of its disadvantages.

Mean

The mean is the most commonly used measure of the center of a data set.

Though it is commonly used for continuous data, it works equally well for discrete data.


The computed value of the mean is often not one of the values in the data set. Nevertheless, because every observation is included in its calculation, the mean does not over-represent any single observation.


The mean for a data set is given by the formula below:

\[ \bar{x}= {Σ x_{i} \over n},\]

where

    \(Σ x_{i}\) is the sum of all values or observations of \(x\), and 

    \(n\) is the total number of observations.


The major problem with the mean is that it is affected by extreme values (outliers), which could be erroneous. The mean may therefore not represent the majority of the data.


1. Find the mean for \(x = \{6, 9, 12, 12, 13, 16, 17\} \).

\[\bar{x}= {Σ x_{i} \over n} = {6+9+12+12+13+16+17\over 7} = {85 \over 7}\]

Therefore, the mean is approximately \(12.14\).


2. Find the mean for \(x = \{1, 1, 1, 1, 11\} \).

\[\bar{x}= {Σ x_{i} \over n} = {1+1+1+1+11\over 5} = {15 \over 5}\]

Therefore, the mean is exactly \(3\).

Even though most of the data is 1, the outlier has increased the mean.
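The two calculations above can be checked with a short Python sketch. The function name `mean` is chosen here for illustration.

```python
def mean(values):
    """Sum of all observations divided by the number of observations."""
    return sum(values) / len(values)


print(mean([6, 9, 12, 12, 13, 16, 17]))  # about 12.14
print(mean([1, 1, 1, 1, 11]))            # 3.0, pulled up by the outlier 11
```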

Median

The median is the data point in the middle, when a data set is arranged in order of magnitude.


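When determining the median, it is important to take into account the total number of data points in a data set: if the number is odd, the median is the single middle value; if it is even, the median is the mean of the two middle values.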



1. To find the median of the data set \(\{5, 9, 1, 3, 8\}\), first rearrange the data from lowest to highest, then pick the middle value, because the number of data points is odd.


Thus, \(1, 3, 5, 8, 9\). Since \(5\) occurs in the middle of the data set, it is the median.


2. To find the median of the data set \(\{5, 9, 1, 3, 8, 7\}\), first rearrange the data from lowest to highest. Because the number of data points is even, pick the two middle values, add them, and divide the result by 2.


Thus, \(1, 3, 5, 7, 8, 9\).


The values \(5\) and \(7\) occur in the middle of this data set, so

\[ {5+7 \over 2}={12 \over 2}=6.\]

The median of this data set is \(6\).


Outliers have little or no impact on the median.

If you have the ordered data \(5, 5, 5, 5, 10\), here the outlier is \(10\), but the median will still be \(5\). This contrasts with the mean, which has to take into consideration the outliers. 
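A minimal Python sketch of the median procedure, covering both the odd and the even case, might look like this. The function name `median` is illustrative.

```python
def median(values):
    """Middle value of the ordered data (mean of the two middle values if even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of points: a single middle value exists.
        return ordered[mid]
    # Even number of points: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([5, 9, 1, 3, 8]))     # 5
print(median([5, 9, 1, 3, 8, 7]))  # 6.0
print(median([5, 5, 5, 5, 10]))    # 5, unaffected by the outlier 10
```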

Mode

The mode is the most frequently occurring data point in a data set.


Here's an example.

In the data set \(\{1, 1, 1, 3, 5\}\), \(1\) occurs most often and is hence the mode of the distribution.


The mode is often used for categorical variables and rarely used for continuous variables. When the most frequently occurring data point is also far away from the rest of the data set (that is, it is an outlier), the mode is not a good measure of the center.


In the data set \(\{0.5, 1, 2, 3, 4, 19, 19\}\), the mode is \(19\), but this is far from the other data items.


Likewise, when a data set has two modes (it is bimodal), the mode is not an appropriate measure of central tendency.
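One way to find the mode in Python is `statistics.multimode` from the standard library (available from Python 3.8), which returns every value tied for the highest frequency. The third data set below is a made-up bimodal example, not taken from the text.

```python
from statistics import multimode

# multimode returns a list, so bimodal data shows up as two entries.
print(multimode([1, 1, 1, 3, 5]))            # [1]
print(multimode([0.5, 1, 2, 3, 4, 19, 19]))  # [19]
print(multimode([1, 1, 2, 2, 3]))            # [1, 2] (hypothetical bimodal data)
```

A result with more than one entry is a hint that the mode alone does not summarize the center well.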

Measures of Spread

Measures of spread describe the variability within a data set.


A measure of spread (also called a measure of dispersion) is frequently used together with a measure of central tendency such as the mean or median. It shows, for example, how well the mean represents the data.

Some measures of spread are the range, the variance, and the standard deviation.

Range

The range is obtained by finding the difference between the highest value and the lowest value.


Find the range of \(2, 1, 7, 8, 6\).

The highest value is 8 and the lowest value is 1, so the range is \(8 - 1 = 7\).
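In Python, the same calculation is a one-line sketch using the built-in `max` and `min`. The name `value_range` is illustrative.

```python
def value_range(values):
    """Difference between the highest and the lowest value."""
    return max(values) - min(values)


print(value_range([2, 1, 7, 8, 6]))  # 7
```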

Variance and Standard Deviation

The variance is the average of the squared deviations. A deviation is the difference between a data point and the population mean.

\(\mathbb{Var}[X]= {Σ (x_{i}-\mu)^2\over n}\)

The deviations are squared so that the positive deviations do not cancel out the negative deviations. Without squaring, the deviations would always add up to zero, which would give the impression that there is no spread:

\(Σ (x_{i}-\mu) = 0\)

That is, the sum of all deviations from the population mean is equal to 0.

Find the variance of the data set \(\{1, 3\}\).

Here there are just two points. The population mean is \(\mu = 2\).

\(\mathbb{Var}[X]= {Σ (x_{i}-\mu)^2\over n}\)

\(\mathbb{Var}[X] = {(1-2)^2+(3-2)^2\over 2}\)

\(\mathbb{Var}[X] = {1+1\over 2} = 1\)
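The population variance formula can be sketched in Python as below. The function name `population_variance` is illustrative. The sketch also checks that the unsquared deviations cancel out.

```python
def population_variance(values):
    """Average of the squared deviations from the population mean."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


data = [1, 3]
mu = sum(data) / len(data)

# Unsquared deviations: the positive and negative ones cancel.
deviations = [x - mu for x in data]
print(sum(deviations))

# Squaring first keeps every deviation positive before averaging.
print(population_variance(data))
```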


Although the variance measures spread, it is hard to interpret on its own because it is expressed in squared units. Its main use is in determining the standard deviation of a distribution, which gives a more interpretable measure.

For example, if the variance of a distribution is 32,000, this only indicates that the spread is large but tells us little more about the distribution. Taking the square root of the variance gives the standard deviation, which is in the same units as the data and describes how far data points typically lie from the mean of the data set.


The variance of a sample is computed using the formula:

\(s^{2}= {Σ (x_{i}-\bar{x})^2\over n-1}\)

where \(x_{i}\) represents each data point, \(\bar{x}\) is the sample mean, \(n\) is the sample size, and \(Σ\) denotes summation.


We divide by \(n-1\) rather than \(n\) because we are using a sample rather than the entire population. A sample is unlikely to include all the extreme values in the population, so dividing by \(n\) would tend to underestimate the population variance. Dividing by \(n-1\) increases the estimate slightly, which makes it a better estimate of the population variance.


Given \(X = (12, 13, 24, 24, 25, 34)\), find the sample variance and standard deviation.


Solution:

\(s^{2}= {Σ (x_{i}-\bar{x})^2\over n-1}\)


First, find the mean:

\(\bar{x} = { 12+13+24+24+25+34 \over 6 } = 22 \)

Second, find the variance: subtract the mean from each observation, square each difference, add the squares, and divide by the degrees of freedom, \(n-1\):

\(s^{2}= {Σ (x_{i}-\bar{x})^2\over n-1}\)


\(s^{2}= {(12-22)^2+(13-22)^2+(24-22)^2+(24-22)^2+(25-22)^2+(34-22)^2\over 5}\)

\(s^{2}= {100+81+4+4+9+144\over 5} = 68.4\)

Third, find the standard deviation by taking the square root of the variance:

\(s = \sqrt{68.4} \approx 8.27\)

Therefore, the standard deviation is approximately 8.27.
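The three steps of this solution can be sketched in Python as follows. The function name `sample_variance` is illustrative.

```python
import math


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


data = [12, 13, 24, 24, 25, 34]
s2 = sample_variance(data)  # sample variance
s = math.sqrt(s2)           # sample standard deviation

print(round(s2, 2), round(s, 2))  # 68.4 8.27
```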

Position of a Term in a Distribution

To determine the position of a term in a distribution, the data set must first be arranged in ascending order. When the ordered data is divided into four equal parts, the dividing values are called quartiles. Each part contains 25% of the entire data set, that is, 25 out of every 100 observations.


The position of the first quartile is given by \({N\over 4}\).

The position of the second quartile (the median) is given by \({N\over 2}\).

The position of the third quartile is given by \({3N\over 4}\).

Here \(N\) represents the total number of elements in the given data set.

For each quartile, if the position is a non-integer, we round up and take the term at that position. Otherwise, we take the value halfway between the term at that position and the next term.

Find the second quartile of the following data: \(1, 3, 4, 7, 9\).

To find the second quartile we compute \({N\over 2} = {5\over 2} = 2.5\).

Since this is a non-integer, we round up to the 3rd term. The 3rd term is 4, so the second quartile is 4.

On the other hand, consider the data \(1, 3, 4, 7, 9, 10\).

To find the second quartile we compute \({N\over 2} = {6\over 2} = 3\).

Since 3 is an integer, we take the value halfway between the 3rd and 4th terms, which is \({4+7 \over 2} = 5.5\).
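The rounding rule used in these two examples can be sketched in Python as below. The function name `quartile` is hypothetical. Note that this follows the convention described in this text; statistical libraries often use other interpolation conventions and may return slightly different quartiles.

```python
def quartile(values, k):
    """Return the k-th quartile (k = 1, 2 or 3) using the rule in the text."""
    ordered = sorted(values)
    n = len(ordered)
    # The position k*N/4 is split into a whole part and a remainder.
    whole, remainder = divmod(k * n, 4)
    if remainder:
        # Non-integer position: round up to term number whole + 1,
        # which is index `whole` in a zero-based list.
        return ordered[whole]
    # Integer position: halfway between that term and the next one.
    return (ordered[whole - 1] + ordered[whole]) / 2


print(quartile([1, 3, 4, 7, 9], 2))      # 4
print(quartile([1, 3, 4, 7, 9, 10], 2))  # 5.5
```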

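When a data set is divided into 10 equal parts, the dividing values are referred to as deciles. The position of the first decile is given by \({N\over 10}\), and more generally the position of the \(k\)th decile is given by \({kN\over 10}\).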


We can use the quartiles to find the interquartile range. This measures the spread of the data set.

IQR = Q3 - Q1



Given the data set in the table below:

Student Weight (kg)
Isaac 75
Mira 80
Terry 81
Sam 90
Mark 60


  1. Find the mean.

\( \bar{x}= {Σ x_{i} \over n} = {75+ 80 + 81+ 90+ 60\over 5} = {386 \over 5}\)

Therefore, the mean is 77.2 kg.


  2. Find the median.

Solution: 

Rearranging in ascending order gives 60, 75, 80, 81, 90.


Therefore, the median is the middle (3rd) value, which is 80.


  3. Find the interquartile range.

Solution: 

Position of the first quartile = \({N \over 4} = {5 \over 4} = 1.25\)

Since 1.25 is a non-integer, we round up to the 2nd term, which is 75.

Position of the third quartile = \({3N \over 4} = {15 \over 4} = 3.75\)


Since 3.75 is a non-integer, we round up to the 4th term, which is 81.

IQR = Q3 - Q1

IQR = 81 - 75 = 6


  4. Find the standard deviation.

Solution: 

Compute the variance and take the square root of the result.


\(s^{2}= {Σ (x_{i}-\bar{x})^2\over n-1}\)

\(s^{2}= {(75-77.2)^2+(80-77.2)^2+(81-77.2)^2+(90-77.2)^2+(60-77.2)^2\over 5-1}\)

\(s^{2}= {4.84+7.84+14.44+163.84+295.84\over 4} = {486.8 \over 4}\)

\(s^{2}= 121.7\)

The standard deviation of the sample is \(s = \sqrt{121.7} \approx 11.03\).
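As a cross-check, the mean, median, and sample standard deviation of the weights can be computed with Python's standard `statistics` module. Here `statistics.variance` and `statistics.stdev` divide by \(n-1\), matching the sample formulas used above. The variable names are illustrative.

```python
import statistics

# Weights (kg) of Isaac, Mira, Terry, Sam and Mark.
weights = [75, 80, 81, 90, 60]

mean = statistics.mean(weights)
median = statistics.median(weights)
sample_variance = statistics.variance(weights)  # divides by n - 1
sample_sd = statistics.stdev(weights)           # square root of the sample variance

print(mean, median)
print(round(sample_variance, 2), round(sample_sd, 2))
```

The interquartile range is left out of this sketch because the module's quantile function uses a different convention from the rounding rule in this text.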

In order to find the deciles, we use exactly the same procedure as calculating the quartiles.

For example, let us find the 2nd decile for the ordered data

2,4,5,6,8,9,10,11,13,14,15,17,18,19,22,23,25,30,31,32


The position of the first decile is \({N\over 10} = {20\over 10} = 2\).

The position of the second decile is \({2N\over 10} = {2(20)\over 10} = 4\).


Since this is a whole number, we take the value halfway between the 4th and 5th terms, which is \({6+8 \over 2} = 7\).
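Since deciles follow the same procedure as quartiles, the rule can be sketched once in Python for any number of equal parts. The function `quantile_by_rule` and its parameters `k` and `parts` are hypothetical names introduced here. The sketch assumes \(0 < k < \text{parts}\).

```python
def quantile_by_rule(values, k, parts):
    """k-th dividing value when the ordered data is split into `parts` parts."""
    ordered = sorted(values)
    n = len(ordered)
    # The position k*N/parts is split into a whole part and a remainder.
    whole, remainder = divmod(k * n, parts)
    if remainder:
        # Non-integer position: round up to the next term.
        return ordered[whole]
    # Integer position: halfway between that term and the next one.
    return (ordered[whole - 1] + ordered[whole]) / 2


data = [2, 4, 5, 6, 8, 9, 10, 11, 13, 14,
        15, 17, 18, 19, 22, 23, 25, 30, 31, 32]

print(quantile_by_rule(data, 2, 10))  # 7.0, the 2nd decile
```

Calling the same function with `parts=4` gives the quartiles under the same rule.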



Representation of Data

Graphs and diagrams are a great way to represent the data so that we can compare the measure of central tendency and also the spread of the data. One-variable data can be represented using stem and leaf diagrams, and bar charts. Histograms are used for continuous data. The nature of the data or the objective of the analysis may influence the type of graph to select. For example, box plots are mostly used when one wants to check for outliers (extreme values) in a given dataset or to know the skewness of the data.


One-Variable Data - Key takeaways

  • When observations are gathered on a single attribute or characteristic, it is termed “single variable data”.

  • When finding the quartiles or deciles of a data set, the data must first be ordered.

  • To summarize the central tendency and spread of single variable data, we use statistical measures such as the mean, variance, range, and quartiles.

  • Box plots, histograms, and pie charts, along with the other diagrams mentioned above, are common ways to represent single variable data.

