
= 13C Box and Whisker Plots =


** Five Figure Summary **

A ** five figure summary ** (or five point summary) is a list containing, in order:
 * minimum : the lowest value
 * Q1 : the first quartile
 * median : the middle value (or second quartile)
 * Q3 : the third quartile
 * maximum : the largest value

How to find the quartiles was explained in the section about Interquartile Range.


** Example 1 **

Find the five figure summary for the following data: 24, 32, 19, 32, 16, 28, 30, 18, 21


** Solution: **

Put in ascending order: 16, 18, 19, 21, 24, 28, 30, 32, 32

 * min = 16
 * n = 9, so the median is the 5th value, since (9 + 1)/2 = 5 : median = 24
 * The first quartile is the middle of the lower half (16, 18, 19, 21) : Q1 = (18 + 19)/2 = 18.5
 * The third quartile is the middle of the upper half (28, 30, 32, 32) : Q3 = (30 + 32)/2 = 31
 * max = 32

** Hence the five figure summary is : 16, 18.5, 24, 31, 32 **
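
The steps in Example 1 can be sketched in Python. This is only a minimal sketch, assuming the quartile method used above (when n is odd, the median is left out of both halves). The names `median` and `five_figure_summary` are illustrative and do not come from the original page. Other quartile conventions can give slightly different values.

```python
def median(values):
    """Middle value of an already sorted list (mean of the two middle values if n is even)."""
    n = len(values)
    mid = n // 2
    if n % 2 == 1:
        return values[mid]
    return (values[mid - 1] + values[mid]) / 2


def five_figure_summary(data):
    """Return (min, Q1, median, Q3, max); assumes at least two data values."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]           # values below the median position
    upper = ordered[half + n % 2:]   # values above it (skips the median when n is odd)
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])


print(five_figure_summary([24, 32, 19, 32, 16, 28, 30, 18, 21]))
# (16, 18.5, 24, 31.0, 32)
```

The printed values match the summary found by hand: 16, 18.5, 24, 31, 32.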

**Box and Whisker Plot** (or boxplot)


 * A box and whisker plot (boxplot) is a graph showing the five figure summary.
 * It is a simple and powerful way to show the centre and spread of the data


 * The boxplot must be drawn over a regular (evenly spaced) scale
 * The box (rectangle) spans the interquartile range : from Q1 to Q3
 * The median is shown with a vertical line inside the box
 * The whiskers (horizontal lines) run from the box out to the minimum and the maximum, showing the first and last quarters of the data


 * Each section of the boxplot represents one quarter (25%) of the data.


** Example 2 **

Draw a box and whisker plot of the data with a five figure summary: 16, 18.5, 24, 31, 32


** Solution: **




 * This boxplot has very short whiskers compared to the size of the box
 * That tells us that the central 50% of the data is quite spread out compared to the first quarter and the last quarter.
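
To put numbers on that comparison, the width of each section of the boxplot can be computed from the five figure summary. This is a small sketch only, and `section_widths` is a hypothetical helper name, not something from the original page.

```python
def section_widths(summary):
    """Widths of the four quarters: left whisker, two parts of the box, right whisker."""
    minimum, q1, med, q3, maximum = summary
    return [q1 - minimum, med - q1, q3 - med, maximum - q3]


widths = section_widths((16, 18.5, 24, 31, 32))
print(widths)                  # [2.5, 5.5, 7, 1]
print(widths[1] + widths[2])   # length of the box (the IQR): 12.5
```

The whiskers are 2.5 and 1 units long, while the box is 12.5 units long, even though each of the four sections holds one quarter of the data.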


** Outliers (Extreme Values) **


 * An ** outlier ** is an extreme value : one that lies a significant distance from the rest of the data
 * An outlier can occur naturally or it could result from a measurement error
 * Outliers in raw data should be examined carefully
 * If an outlier is a measurement error, it may be removed from the data set
 * If it is not a measurement error, sometimes we analyse the data excluding the outlier and then mention the outlier specifically
 * On a boxplot, an outlier that is included in a whisker makes that whisker appear very long
 * Instead, the outlier should be marked with a small cross
 * The whisker is then shortened so that it ends at the next largest (or smallest) value
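
The shortening of the whiskers can be sketched as follows. This is a sketch under stated assumptions: `whisker_ends` is a hypothetical helper, the values to treat as outliers are decided beforehand, and the data list is made up purely for illustration.

```python
def whisker_ends(data, outliers):
    """Ends of the whiskers once the flagged outliers are set aside."""
    remaining = [x for x in data if x not in outliers]
    return min(remaining), max(remaining)


# Made-up values for illustration only (not data from this page).
data = [15, 22, 26, 31, 35, 40, 44, 49, 54, 70]
print(whisker_ends(data, outliers=[70]))   # (15, 54)
```

Here the right whisker stops at 54, the next largest value, and 70 would be drawn separately as a small cross.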


** Example 3 **

Comment on the information that can be gained from the following boxplot


** Solution: **
 * min = 15
 * Q1 = 26
 * median = 35
 * Q3 = 44
 * max = 54 (excluding the outlier)
 * there is an outlier at 70


 * Excluding the outlier, the data is fairly evenly spread across the range of 39 (from 15 to 54)
 * The interquartile range is IQR = 44 – 26 = 18


** Identifying Outliers **


 * There is no single, widely accepted rule for deciding when a value is an outlier.
 * Sometimes it simply comes down to the judgement of the person analysing the data


 * One rule that is sometimes used is to establish the following limits:
 * lower limit : Q1 – 1.5 × IQR
 * upper limit : Q3 + 1.5 × IQR
 * anything outside of those limits is an outlier

** In the above example: **
 * Q1 = 26
 * Q3 = 44
 * IQR = 18
 * Lower limit = 26 – 1.5 × 18 = –1
 * Upper limit = 44 + 1.5 × 18 = 71
 * anything below –1 or above 71 would be an outlier (using this rule)
 * 70 is __not__ above the upper limit
 * So, using this rule, 70 would __not__ be an outlier
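
The 1.5 × IQR rule above can be checked with a short Python sketch. The function names `outlier_limits` and `is_outlier` and the parameter `k` are assumptions made for illustration. As noted above, this is only one of several rules in use.

```python
def outlier_limits(q1, q3, k=1.5):
    """Lower and upper limits: Q1 - k * IQR and Q3 + k * IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def is_outlier(value, q1, q3, k=1.5):
    """True if the value lies strictly outside the limits."""
    lower, upper = outlier_limits(q1, q3, k)
    return value < lower or value > upper


print(outlier_limits(26, 44))   # (-1.0, 71.0)
print(is_outlier(70, 26, 44))   # False
```

With Q1 = 26 and Q3 = 44 the limits are –1 and 71, so 70 is not flagged by this rule, in agreement with the working above.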