Summarising information
In the third article in our series on medical statistics, Wai-Ching Leung discusses the best way to condense different types of data
In my previous articles I noted that we can use statistics when events are not entirely predictable and depend on a large number of factors.1 2 Statistics can also be used to test hypotheses and to estimate how likely it is that something will happen from information gathered from a large number of observations.
Why summarise?
Effective techniques to summarise a large amount of information are essential for several reasons. Firstly, they help us to make sense of, and get a feel for, the information we have. Graphs and charts often prove useful. Secondly, they may help us to formulate the hypothesis we wish to test. For example, if we have the heights of a large number of men and women and we want to show that, overall, men are taller than women, we must have a method of calculating "average" height (averaging is an example of summarising information from a large number of observations). It would be ideal if we could summarise the information both accurately and simply. In practice, however, a compromise is often necessary. Information is often summarised at the expense of accuracy.
It is important to note that there are three types of data, which differ in the kind of information they provide and in the ways they can be summarised and analysed.
Fig 1: Bar chart of eye colour
Types of data
The three different types of data are categorical, ranked, and interval data. You may have seen them called by different names in older textbooks so these names are given in brackets.
Categorical (nominal) data
Observations are grouped by name into categories, but they are not graded in any way. In other words, it is not possible to say one category is higher or lower than another category. This type of data contains the least information. Examples include eye colour, nationality, and blood group. The simplest type of categorical data consists of only two categories (that is, binary data), such as male and female sex.
Suppose we carry out a survey of 200 students and find that the numbers of students with blue, brown, and green eyes are 120, 60, and 20, respectively. We can summarise this type of data by drawing a bar chart (figure 1) or a pie chart (figure 2). The bar chart gives a visual comparison of the number of students in each category. Note that the categories can be arranged in any order. The pie chart shows at a glance the percentage of students in each category.
We might say that blue eyes are the commonest category. In other words, the "mode" is blue. However, it is clearly not meaningful to say what the "average" eye colour is.
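As a rough illustration, the eye colour survey can be summarised in a few lines of Python. This is only a sketch using the counts quoted above; the variable names are chosen for this example and are not part of the article.

```python
from collections import Counter

# Survey counts from the text: 200 students grouped by eye colour.
eye_colour_counts = Counter({"blue": 120, "brown": 60, "green": 20})

total = sum(eye_colour_counts.values())

# Percentage of students in each category (what the pie chart displays).
percentages = {
    colour: 100 * count / total
    for colour, count in eye_colour_counts.items()
}

# The mode is simply the category with the largest count.
mode_colour, mode_count = eye_colour_counts.most_common(1)[0]

print(percentages)
print("Mode:", mode_colour)
```

No mean or median is computed here because the categories have no order or numerical value.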
Fig 2: Pie chart of eye colour
Ranked (ordinal) data
These are similar to categorical data, except that the categories can be put in a certain order. Examples include social classes and A level grades.
Suppose in a sixth form class of 40 students, the results of their A level biology exam are: grade A (8 students), grade B (14 students), grade C (12 students), grade D (4 students), and grade E (2 students). Again, we can represent these results using a bar chart (figure 3). Note that we cannot change the order of the categories, although we could put them in reverse order.
We can say that the results range from grade A to grade E and grade B is the commonest grade--the mode is grade B. If we line the 40 students up with the best students at the front, the two students in the middle, that is, the 20th and 21st students, score grade B. In other words, the "median" is also grade B.
We cannot, however, calculate the "mean," or average, value because we cannot assume that the differences between consecutive grades are the same. For example, the difference between grades A and B is not necessarily the same as the difference between grades B and C.
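The "line the students up" description of the median can be sketched in Python as follows. The grade counts are those given in the text; the names GRADE_ORDER, lined_up, and middle_pair are illustrative assumptions.

```python
# Grades in rank order, best first. For ranked data the order matters.
GRADE_ORDER = ["A", "B", "C", "D", "E"]

# A level biology results for the class of 40 students.
grade_counts = {"A": 8, "B": 14, "C": 12, "D": 4, "E": 2}

# Line the students up with the best grades at the front.
lined_up = [grade for grade in GRADE_ORDER for _ in range(grade_counts[grade])]

n = len(lined_up)

# With an even number of students the two in the middle are the
# 20th and 21st, which are indices 19 and 20 when counting from zero.
middle_pair = (lined_up[n // 2 - 1], lined_up[n // 2])

# The mode is the grade with the most students.
mode_grade = max(grade_counts, key=grade_counts.get)

print("Middle two students:", middle_pair)
print("Mode:", mode_grade)
```

Because both middle students hold the same grade, the median can be read off directly without any arithmetic on the grades themselves.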
Fig 3: A level results
Interval (scale) data
These are data usually represented by continuous numbers. Examples are temperature, weight, height, and systolic blood pressure. The key point is that, unlike categorical or ranked data, the difference between two observations is measurable. For example, the difference between 10 kg and 20 kg is precisely the same as the difference between 20 kg and 30 kg. Interval data give us more information than either categorical or ranked data.
Suppose you measure the weight of nine students in your class, and you obtain the following results (to the nearest kg) and put them in ascending order:
45, 47, 49, 50, 50, 50, 51, 53, 55
We can summarise these observations in the same way we did for ranked data. For example, we could draw a histogram similar to the bar chart in figure 1 to give us an idea about the distribution of the weights. We have placed the numbers in ranked order and the middle number (the 5th number, the median) is 50 kg. In addition, we can work out the mean weight. Most students know that this is calculated by totalling all the numbers and dividing the result by the number of observations. In this case, the total is 450. So the mean is 450/9=50 kg.
In this example, the median and the mean are identical. This is because the data are perfectly symmetrical about 50 kg. You will see that the observation 49 kg is balanced by the observation 51 kg, 45 kg by 55 kg, and so on.
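The arithmetic for the nine weights can be checked with a short Python sketch. It follows the steps described in the text; the symmetry check is an illustrative addition and the variable names are assumptions made for this example.

```python
# Weights of the nine students from the text (kg), in ascending order.
weights = [45, 47, 49, 50, 50, 50, 51, 53, 55]

n = len(weights)

# Mean: total of all observations divided by their number (450 / 9).
mean_weight = sum(weights) / n

# Median: the middle (5th) value. Taking a single middle value like
# this is valid only because n is odd and the list is already sorted.
median_weight = weights[n // 2]

# Symmetry check: each value below the median is balanced by one above
# it, so every mirrored pair averages to the median.
is_symmetrical = all(
    (weights[i] + weights[-1 - i]) / 2 == median_weight
    for i in range(n // 2)
)

print(mean_weight, median_weight, is_symmetrical)
```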
Asymmetrical data
Data are not always symmetrical. Suppose a teacher takes a group of four A level students out for a trip. You record the ages of the teacher and the four students and arrange them in ascending order:
17, 18, 18, 18, 54
The median age is 18. The mean age is 125/5=25.
In this example, although the ages of four out of five people are 18 or less, the mean age is 25. The reason is that the age distribution is not symmetrical because the teacher is much older than the students. In such cases, the median is less affected than the mean by the value of a few unusual observations. But neither the median nor the mean alone gives an accurate summary of all the data available.
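To see how strongly one unusual observation pulls the mean, the sketch below uses Python's standard statistics module on the ages from the text, with and without the teacher. Dropping the teacher is an illustrative step added here, not something the article does.

```python
import statistics

# Ages of the four students and their teacher, in ascending order.
ages = [17, 18, 18, 18, 54]

mean_all = statistics.mean(ages)      # 125 / 5
median_all = statistics.median(ages)  # middle (3rd) value

# Illustrative comparison: the same summary for the students only.
student_ages = ages[:-1]
mean_students = statistics.mean(student_ages)
median_students = statistics.median(student_ages)

print("With teacher:    mean", mean_all, "median", median_all)
print("Without teacher: mean", mean_students, "median", median_students)
```

The median stays at 18 in both cases, whereas the mean moves from 17.75 to 25 when the teacher is included.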
Measuring spread of data by interquartile range
Even if the distribution is symmetrical, the mean alone does not accurately summarise all available information. For example, the age distributions of the following two groups of nine people are symmetrical but clearly very different. Yet the mean and median ages of both groups are 18.
Group A: 16, 17, 17, 18, 18, 18, 19, 19, 20
Group B: 2, 4, 8, 14, 18, 22, 28, 32, 34
It is clear that the spread of ages in group B is much wider than in group A. So, a measure of the spread of the observations together with the median might give a better summary of the data.
One simple method of expressing the spread of data is the interquartile range, or the difference between the highest and lowest value of the "middle half" of the observations (between the first quarter and third quarter of the observations). Imagine the observations lined up from the lowest to the highest value. Divide the data into four equal sets. Then, find the value below which a quarter of all observations lie (the lower quartile) and the value above which a quarter of all observations lie (the upper quartile).
In a group of nine observations, the third, fifth, and seventh observations conveniently divide the data into four equal groups. In group A, the lower quartile (third observation) is therefore 17. The upper quartile is the 7th observation and is therefore 19. Therefore, the interquartile range in group A is 19-17=2. In group B, the lower quartile (third observation) is 8. Similarly, the upper quartile (7th observation) is 28. Hence, the interquartile range in group B is 28-8=20. We can compare groups A and B by using the box and whisker plots (figure 4). The thick lines in the middle represent the medians. The red areas represent the interquartile ranges. The whiskers represent the lowest and highest values in each group.
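The quartiles for the two groups can be reproduced with Python's statistics.quantiles. Several conventions for computing quartiles exist and they can give different answers; the sketch below assumes the "inclusive" method, which for nine observations picks out the third, fifth, and seventh values as the text does. The function name quartile_summary is hypothetical.

```python
import statistics

# Ages of the two groups of nine people from the text.
group_a = [16, 17, 17, 18, 18, 18, 19, 19, 20]
group_b = [2, 4, 8, 14, 18, 22, 28, 32, 34]


def quartile_summary(ages):
    """Return the lower quartile, median, upper quartile, and IQR.

    Assumes the 'inclusive' quartile convention, which matches the
    third/fifth/seventh observation rule for nine values. Other
    conventions (including the module's default) may differ.
    """
    lower, median, upper = statistics.quantiles(ages, n=4, method="inclusive")
    return {
        "lower": lower,
        "median": median,
        "upper": upper,
        "iqr": upper - lower,
    }


summary_a = quartile_summary(group_a)
summary_b = quartile_summary(group_b)

print("Group A:", summary_a)
print("Group B:", summary_b)
```

Both groups share a median of 18, but the interquartile ranges (2 and 20) make the difference in spread explicit.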
Fig 4: Comparison of two groups in a box and whisker plot
Summarising interval data simply and accurately
Although the median and interquartile range together describe a set of observations quite well, they do not give the entire picture. For example, even if we know that the median age is 18 and the interquartile range is 26, it is not possible to work out the proportion of people aged between, say, 13 and 23.
We will be able to summarise the data even more accurately and simply if we make some reasonable assumptions. Many biological characteristics, such as height, weight, and IQ, are very close to the bell shaped normal curve, sometimes called the Gaussian distribution, as shown in figure 5. It has been found that this bell shaped curve fits very closely to a rather complex mathematical formula. You don't need to know it, but we will see that this formula proves very useful. The standard deviation is a measure of the spread of the observations. It is not difficult to calculate the standard deviation from your data, and it is not essential to memorise the formula.
The formula is
$$s = \sqrt{\frac{\sum (x - m)^2}{n}}$$
where s is the standard deviation, m is the mean, and n is the number of observations. Although the equivalent formula below looks complex, it is much easier to apply:
$$s = \sqrt{\frac{\sum x^2}{n} - m^2}$$
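As a check that the two forms of the standard deviation formula agree, the sketch below applies both to the nine student weights used earlier in the article. Reusing that data set here is an assumption made for illustration. Both formulas divide by n, which corresponds to statistics.pstdev in Python's standard library; statistics.stdev divides by n - 1 instead and would give a slightly larger value.

```python
import math
import statistics

# Weights of the nine students from the earlier example (kg).
weights = [45, 47, 49, 50, 50, 50, 51, 53, 55]

n = len(weights)
m = sum(weights) / n  # the mean, 50 kg

# First form: square root of the mean squared deviation from the mean.
sd_definition = math.sqrt(sum((x - m) ** 2 for x in weights) / n)

# Second form: square root of (mean of the squares minus square of the mean).
sd_shortcut = math.sqrt(sum(x ** 2 for x in weights) / n - m ** 2)

# Library equivalent that also divides by n.
sd_library = statistics.pstdev(weights)

print(round(sd_definition, 3), round(sd_shortcut, 3), round(sd_library, 3))
```

The squared deviations sum to 70, so each approach gives the square root of 70/9, roughly 2.79 kg.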
In this example (figure 5), the mean is 0 and the standard deviation is 1.
Fig 5: Example of normal distribution
Now, since this curve fits very closely to a known mathematical formula, one advantage of assuming that our observations are normally distributed is that we can summarise all our data almost perfectly by specifying the mean and the standard deviation alone.
For example, we know from the formula for the normal distribution that about 68% of all observations fall between -1 and +1 standard deviations from the mean, and about 95% of all observations fall between -2 and +2 standard deviations from the mean. We can work out the proportion of observations between any two given values simply by looking up a statistical table. It is, however, important to check to make sure that our assumption is reasonable. Although sophisticated statistical techniques are available to check this, a rough and simple way to check whether your observations are approximately normally distributed is to draw a histogram. Another advantage of assuming the observations are normally distributed is that it allows us to analyse our data more easily.
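Instead of a printed statistical table, the proportions quoted above can be looked up with Python's statistics.NormalDist. This is a minimal sketch; the helper name proportion_between is hypothetical, and the mean of 18 and standard deviation of 5 in the last line are made-up values used only to show the call.

```python
from statistics import NormalDist


def proportion_between(low, high, mean=0.0, sd=1.0):
    """Proportion of a normal distribution lying between low and high."""
    dist = NormalDist(mu=mean, sigma=sd)
    return dist.cdf(high) - dist.cdf(low)


# Standard normal curve (mean 0, standard deviation 1), as in figure 5.
within_1_sd = proportion_between(-1, 1)
within_2_sd = proportion_between(-2, 2)

# Hypothetical example: ages assumed normal with mean 18 and SD 5.
# The interval 13 to 23 is then exactly one SD either side of the mean.
aged_13_to_23 = proportion_between(13, 23, mean=18, sd=5)

print(f"Within 1 SD: {within_1_sd:.1%}")
print(f"Within 2 SD: {within_2_sd:.1%}")
print(f"Aged 13 to 23 (hypothetical): {aged_13_to_23:.1%}")
```

The result is only as good as the assumption of normality, which is why the histogram check described above still matters.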
Glossary
Graphical representation of information
Pie chart--A chart showing the proportion of observations assigned to each category (showing "how the cake is divided"). Useful for categorical data
Bar chart--A chart with vertical (sometimes horizontal) bars showing the numbers of observations in each category. Useful for both categorical and ranked data
Histogram--A chart with vertical bars showing the numbers of observations in each "band" of values (for example, age bands of <20, 20-40, 40-60, 60-80, >80). Useful for interval data
Box and whisker plot--A chart showing the median, the interquartile range, and the highest and lowest observations
Measurement of averages
Mode--The category or value that occurs most frequently
Median--The category or value below which half of all observations lie
Mean--Sum of all observations divided by the number of observations
Measurement of spread
Interquartile range--Difference between the value below which a quarter of all observations lie and the value above which a quarter of all observations lie
Standard deviation--Useful for measuring degree of spread in a normal distribution. About 68% of observations fall within 1 standard deviation of the mean and about 95% fall within 2 standard deviations
Distribution
Normal (Gaussian) distribution--A bell shaped curve that the distributions of many biological characteristics fit closely
Wai-Ching Leung, locum general practitioner, Norwich
Email: wai_chingleung@hotmail.com
studentBMJ 2002;10:303-352 September ISSN 0966-6494
1. Leung W-C. Why and when do we need medical statistics. studentBMJ 2002;10:227. (July.) www.studentbmj.com/back_issues/0702/education/227.html
2. Leung W-C. Measuring chances. studentBMJ 2002;10:268. (August.) www.studentbmj.com/current_issue/education/268.html