Chapter 10

Some Data Handling Techniques

There are two main ways in which researchers benefit from being able to handle data. The first is that grouping and analysing can provide new insights into a topic and indicate areas where further investigation would be beneficial. The second is relevant to anyone who wishes to communicate their results to others. Numbers have a definite meaning that is missing from subjective terminology such as ‘many’, ‘plenty’ or ‘a few’, which people can understand in different ways. Good history needs evidence and numbers can provide certainty in a way that vague or imprecise expressions do not.

In the investigation and data collection phases of a project, researchers may develop a gut feeling about what the study will reveal. This is not always confirmed when the data is checked. It is surprising how often ‘almost everything’ turns out to be a measurement such as 80 per cent, 7/10 or 3/4. These are all high proportions that will give plenty of insight into the topic but a substantial minority of data does not have this attribute. If the other 20 per cent, 3/10 or 1/4 also share similar characteristics, concentrating on ‘almost everything’ probably means overlooking pertinent issues.

As well as providing a degree of exactitude, numbers are a useful way of presenting information visually. In some situations, the charts and graphs that can be produced from them convey information more clearly than a piece of written text.

Some people are not comfortable with using numbers but historical investigations are not like scientific and engineering subjects where calculating a precise number may be the difference between the success and failure of a project. Historical research does not involve computing or assigning a definite number to anything and, in most cases, striving for absolute precision would add nothing to informed understanding of the past. There is no benefit in trying to find out whether 23 per cent, 24 per cent or 26 per cent of a data set conform to a particular parameter. What is significant is that around a quarter of the items being studied have this feature and the questions this knowledge could trigger. For example, was this quarter the smallest category in the analysis or was it the highest individual proportion of several other elements? What were those other elements? What might have caused the differences between them?

Confidence with just a few techniques can open up many new lines of enquiry about the past. Initially, the most useful ones for a historical researcher to be able to calculate and interpret are percentages and averages.

Percentages

Percentages, denoted by the sign %, express any number as a proportion of 100. This is a useful way of comparing numbers. Without this it would be difficult to say whether 29/83 is greater or less than 34/97 and, as a consequence, to reach meaningful conclusions.

In order to express a percentage, two numbers are needed. One is the total number of values in the data set – the denominator. The other is the total number of values in the data set that are relevant – the numerator.

To convert a number to a percentage, divide the numerator by the denominator and multiply the answer by 100. It is given by the formula:

percentage = (numerator / denominator) x 100

Example

29/83 boys and 34/97 girls from a school passed the 11+ examination. Did the boys or girls perform better?

Boys: (29 / 83) x 100 = 34.9%

Girls: (34 / 97) x 100 = 35.1%

Converting both results to percentages demonstrates that the boys and girls performed almost the same.

As shown in this example, most calculations of percentages do not work out to a whole number and a researcher has to decide how many decimal places to use.

Example

When considering the exam performance of boys and girls, it is reasonable to round both to 35%.

If the data related to the school’s exam performance in two consecutive years, and the rate was being tracked over several years, rounding to a whole number might not be sufficiently sensitive. When change is being studied, it is usually necessary to work to one, or perhaps two, decimal places and state the results as 34.9% and 35.1%.
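The calculation and the rounding decision can be sketched in a few lines of Python. This is only an illustration: the function name `percentage` and its `places` parameter are assumptions made for the example, not part of any standard method.

```python
def percentage(numerator, denominator, places=1):
    """Express numerator as a percentage of denominator, rounded to `places` decimals."""
    return round(numerator / denominator * 100, places)

boys = percentage(29, 83)    # 11+ passes among the boys
girls = percentage(34, 97)   # 11+ passes among the girls
print(boys, girls)

# Rounding to whole numbers hides the small difference between the two groups.
print(percentage(29, 83, places=0), percentage(34, 97, places=0))
```

Working to one decimal place keeps the two results distinct, whereas rounding to a whole number gives 35 for both.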

Benefits of Percentages

•   They make it easy to compare numbers.

•   They make it easy to quantify how much something has changed.

•   They make it easy to understand the relative importance when there are several elements represented in a data set.

•   They are widely understood which makes them useful for communicating findings.

Disadvantages of Percentages

•   They may not be meaningful when there are only a small number of values in the data set.

•   It is essential to be clear about what is being measured and use the right numerator and denominator.

Example

Suppose in three consecutive years the price of a dress was £2.60, £2.75 and £2.95.
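One way these prices might be used to show why the choice of denominator matters is to measure each rise against the price it started from. The sketch below is an assumption about how the figures could be analysed, and the function name `percentage_change` is hypothetical.

```python
prices = [2.60, 2.75, 2.95]  # dress price in pounds, three consecutive years

def percentage_change(old, new):
    """Change from old to new, as a percentage of the old value (the denominator)."""
    return round((new - old) / old * 100, 1)

year_1_to_2 = percentage_change(prices[0], prices[1])
year_2_to_3 = percentage_change(prices[1], prices[2])
overall = percentage_change(prices[0], prices[2])
print(year_1_to_2, year_2_to_3, overall)
```

The two yearly rises do not add up to the overall rise, because the second is measured against £2.75 and the overall figure against £2.60.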

Averages

The average is a figure that is most representative of all the values in a set of data. There are different methods of determining this figure and they usually produce different results. Irrespective of the method used, better insight is available when there are plenty of values in the data set. If there are only a few values, it is possible to perform the calculations but the result may not reveal anything about the subject. In historical research, understanding what any answer means, in the context of the data being averaged and the research topic, is as important as calculating a figure.

Mean Average

The mean average is the most widely used and is what many people understand an average to be.

To calculate the mean average, total all the values in the data set and divide the answer by the number of values in the data set.

Example

The numbers of prisoners jailed at a Victorian Court House each month from January to June were 8, 3, 24, 7, 5 and 4.

8 + 3 + 24 + 7 + 5 + 4 = 51

The number of terms in the set is 6

The mean average is 51 / 6 = 8.5

As in this example, the mean average is often a term that is not found in the data set. It can be a fraction, even if all the individual terms are whole numbers. When this happens, it is necessary to consider whether the item being averaged can be split in this manner when using the information.
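For readers who keep their data on a computer, the same calculation can be sketched in Python. This is a minimal illustration; the variable names are chosen for the example only.

```python
from statistics import mean

# Prisoners jailed each month, January to June (figures from the example above)
jailed = [8, 3, 24, 7, 5, 4]

total = sum(jailed)      # add up all the values
count = len(jailed)      # number of values in the data set
print(total / count)     # the mean average
print(mean(jailed))      # the standard library gives the same figure
```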

Example

The above calculation reveals that an average of 8.5 prisoners were jailed each month. In reality prisoners only exist as whole persons. When presenting findings a researcher would have to use appropriate terminology to describe the position and not refer to half of an actual person being jailed.

If the figures related to a fine imposed, or the length of a jail sentence, it would be reasonable to refer to £8.50 or 8½ weeks.

The value of the mean average is in the questions it generates. The data about convicted prisoners raises plenty of lines of enquiry to pursue.

Example

An obvious point is that 8.5 is higher than all but one of the terms in the data set. This should trigger a question about whether the data is correct. It would be sensible to check that no error has been made with 24 by going back to the source of the figure, or even confirming it with another source.

Understanding why there was such an extreme value in March is important and checking the newspaper reports should provide an answer. There may have been a serious incident that month that resulted in a large number of people being taken to court, or perhaps an additional policeman had been employed. Alternatively, the answer could lie in the practices of the age. In Victorian times, special courts, known as the Assizes, were held in many towns twice a year to deal with serious crimes. If the Assizes were held in March it could explain the higher number.

Whatever the reason (or reasons), this investigation would probably split into two elements, the high number in March and the more usual position. If it was established that twenty of the prisoners were convicted at the Assizes and four at the local court, it would be reasonable to recalculate the mean average to cover the local court only. This would give an indication of the likely number of petty offenders being convicted in any month.
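The recalculation can be sketched as follows, assuming the hypothetical split described above (twenty March convictions at the Assizes, four at the local court).

```python
jailed = [8, 3, 24, 7, 5, 4]        # January to June
assizes_in_march = 20               # hypothetical split suggested in the text

local_court = jailed.copy()
local_court[2] -= assizes_in_march  # March is the third month (index 2)

local_mean = sum(local_court) / len(local_court)
print(local_court, round(local_mean, 2))
```

With the Assizes cases set aside, the mean falls from 8.5 to a little over 5 prisoners a month, which is much closer to the other monthly figures.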

Benefits of the Mean Average

•   It is straightforward to calculate.

•   It uses all the values in a data set, balancing the high values against the low ones to produce a single representative figure.

•   When a large number of values are being studied it can be difficult to keep them all in mind. Using one representative number can be helpful in understanding and explaining findings.

•   It is widely used in more advanced statistical calculations.

Disadvantages of the Mean Average

•   It can produce a result that is not possible for the topic being investigated.

•   If there is an extremely high or low value in the data set (sometimes termed an outlier) the result may give a false impression of the topic being investigated.

Median Average

The median average is the middle value in a set of data and splits it into two equal parts. When there is an odd number of values there is one unambiguous median average. When there is an even number, there are two adjacent values from which the median is calculated.

To calculate the median average, list the data in order, either from lowest to highest or from highest to lowest.

When there is an odd number of values in the list, divide the number of terms by 2 and add 0.5 to the answer. This gives the position of the middle value. Count to this position in the list to find the median item.

When there is an even number of values in the list, divide the number of values by 2. Count to this position in the list. The median average is the mean average of this item and the one that follows it.

Example

The following are the weekly wages of 6 Victorian coal miners, expressed in shillings.

19, 21, 23, 24, 26, 70.

To find the middle point, divide the number of values, 6, by 2 = 3

The third value highest to lowest is 24. The next value is 23.

The third value lowest to highest is 23. The next value is 24.

The difference between 24 and 23 = 1

1 divided by 2 = 0.5

24 – 0.5 = 23.5

23 + 0.5 = 23.5

The median average is 23.5

The median average, like the mean, can be a number that is not found within the data set. It can also be a fraction, even though all the values are whole numbers.

The relevance of the median average to a researcher can be demonstrated by comparing it with the mean average, which is (19+21+23+24+26+70)/6 = 30.5

At this pit, the very high salary of the site manager, 70 shillings, produces a result where five of the six employees earn less than the mean average wage. The median gives a much more representative position and allows a researcher to draw more meaningful conclusions about the workers’ probable spending power and standard of living.
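The counting procedure for the median, and the comparison with the mean, can be sketched in Python. The function name `median_average` is an assumption made for this illustration.

```python
def median_average(values):
    """Median found by counting to the middle position, as described above."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        position = int(n / 2 + 0.5)      # odd: divide by 2 and add 0.5
        return ordered[position - 1]     # positions count from 1, indexes from 0
    position = n // 2                    # even: divide by 2
    # mean of the item at that position and the one that follows it
    return (ordered[position - 1] + ordered[position]) / 2

wages = [19, 21, 23, 24, 26, 70]         # weekly wages in shillings
print(median_average(wages))             # midway between 23 and 24
print(sum(wages) / len(wages))           # the mean, pulled upwards by the 70
```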

Benefits of the Median Average

•   By ignoring extremes and focussing on a middle point, the median average is useful for identifying a typical situation.

•   It can produce a more realistic answer than the mean when the data contains an extreme value.

Disadvantages of the Median Average

•   It can produce a result that is not possible for the topic being investigated.

•   The more extreme values in the data set may be overlooked. In historical research, the less typical results may contain important information.

A data set can be split into other numbers of equal parts. Quartiles (four), deciles (ten) and percentiles (a hundred) can sometimes be used in advanced investigations.

Mode Average

The mode average is the number that occurs most often in a set of data. This means that it will always be a value that is in the set.

To calculate the mode average, group similar items together.

Example

The numbers of occupants in a row of Victorian workers’ cottages are:

2,8,4,3,2,4,6,4,2,3

There are two methods for finding the mode.

Either list the items from lowest to highest (or highest to lowest)

2,2,2,3,3,4,4,4,6,8

Or list each item as a group.

2, 2, 2

8

4,4,4

3,3

6

Counting each group reveals which item is repeated most often, in this case 2 and 4.
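The grouping and counting can also be sketched in Python. This is a minimal illustration that treats every value sharing the highest count as a mode.

```python
from collections import Counter

occupants = [2, 8, 4, 3, 2, 4, 6, 4, 2, 3]

counts = Counter(occupants)          # how many cottages have each occupancy
highest = max(counts.values())       # the largest group size
modes = sorted(value for value, n in counts.items() if n == highest)
print(counts)
print(modes)
```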

Sometimes, as in the above example, there is not one single mode average. Statisticians disagree about whether, in this situation, there is no mode average or whether there are two. For a historical investigation, treat the data set as one that has two mode averages and study both. It is likely that the data was influenced by more than one issue. Even when a data set does produce a single mode, it can be revealing to work out the second and perhaps third most frequent items, to check what the similarities and differences between them are.

Example

The number of people living in the workers’ cottages should trigger some questions. The popular view of the living conditions of Victorian workers is of large families crammed into tiny dwellings. In this row, most homes were occupied by either 2 or 4 people and the next most frequent occupancy was 3. Although the cottage with 8 people may have been overcrowded and the one with 6 cramped, it does not appear that overcrowding was the norm. Why was occupancy of these cottages low? Was it typical for the time and place?

Benefits of the Mode Average

•   It pinpoints the usual position. This enables a researcher not to place too much emphasis on atypical examples.

•   When two modes are not adjacent, or the first and second most frequent values are well separated, the data may be affected by more than one factor. This knowledge can help to focus an investigation.

•   The mode can be used to explore data that is not expressed in numbers. Grouping and counting all the items in a set of data may reveal information that is not readily apparent from reading a list.

Example

A horticultural researcher who was interested in fashions in plants discovered that the favourite flowers of twelve Edwardian gardeners were:

Tulip, rose, sweet pea, carnation, lily, rose, lily, sweet pea, tulip, rose, sweet pea, rose.

This can be grouped:

Flower       Number of gardeners
Tulip        2
Rose         4
Carnation    1
Lily         2
Sweet pea    3

It is now obvious which are the most and least frequent items in the list.
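The same grouping works for data that is not numerical. A minimal Python sketch, using the gardeners' answers above, might look like this.

```python
from collections import Counter

favourites = ["tulip", "rose", "sweet pea", "carnation", "lily", "rose",
              "lily", "sweet pea", "tulip", "rose", "sweet pea", "rose"]

tally = Counter(favourites)
for flower, number in tally.most_common():   # most frequent first
    print(f"{flower:<10}{number}")
```

Listing the groups in order of frequency puts the mode, rose, at the top and the least popular choice, carnation, at the bottom.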

Disadvantages of the Mode Average

•   There are situations when it is unlikely to produce valuable insights, for example, if several categories have similar numbers and one happens to have an extra entry. If a sample is small it may be worth adding some more examples where possible. If this does not alter the position, pursuing this line of investigation may not be worthwhile.

Putting an Investigation into Practice

The following demonstrates how a few straightforward calculations on a data set, and some thought about the results, can identify relevant patterns, and indicate where further research would be appropriate.

Example

Suppose 11 successful libel claims resulted in the following awards of damages:

£0.01, £75, £100, £100, £100, £100, £125, £200, £200, £200, £1000.

The mean average is £200 to the nearest penny:

(£0.01+£75+£100+£100+£100+£100+£125+£200+£200+£200+£1000) / 11 = £2200.01 / 11

The median average (6th number in series, from top and bottom) is £100

The mode average (4 instances) is £100.

Comments:

In this case the mean average does not provide good insight. It is twice as high as the median and mode averages and the same as the second highest value, £200. It has also been affected by an extremely high and an extremely low value. It would be misleading to contend that £200 was the typical amount that a claimant could expect to receive in damages.

The median and the mode averages are each £100. As the median represents the middle value and the mode represents the most frequent, this suggests that the 6 cases where damages range from £75 to £125 are likely to be typical and may have features in common.

The second most frequent award was £200, which was given in 3 cases. These cases could be checked for any similarities between them and also any ways in which they differed from those where the damages were £75–£125.

The lowest and the highest results can be seen to be outliers, just by looking at the data. Discovering why these two cases had the outcomes they did is a valuable exercise. It may simply be that the person was lucky or unlucky. However, courts sometimes awarded a very small amount of compensation for a claim that was correct in law, but offended the moral standards of the age. Similarly a jury might be very generous with damages if the claimant had suffered exceptional harm, or if the defendant had behaved very badly. Understanding non-typical results may reveal something about the individual case, or about wider social values.

The mode average, £100, which has 4 occurrences, could be expressed as a percentage of the data set of 11. This is 36.4%. The next most common outcome, £200, has 3 occurrences and makes up 27.3% of the data set. Although 11 is too small a sample to draw general conclusions from, if there are common features in the two groups and distinguishing ones between them, it may indicate points to be alert to in further research.
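The figures in this investigation can be checked with a short Python sketch. It assumes Python 3.8 or later, which provides `multimode` in the standard `statistics` module; the variable names are illustrative only.

```python
from statistics import mean, median, multimode

damages = [0.01, 75, 100, 100, 100, 100, 125, 200, 200, 200, 1000]  # pounds

print(round(mean(damages), 2))    # mean average
print(median(damages))            # median: 6th of the 11 ordered values
print(multimode(damages))         # most frequent award(s)

# Share of the 11 cases made up by the two most common awards
share_100 = round(damages.count(100) / len(damages) * 100, 1)
share_200 = round(damages.count(200) / len(damages) * 100, 1)
print(share_100, share_200)
```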

More Advanced Techniques

Averages and percentages will enable a researcher to begin to investigate or quantify their data. For some studies this level of analysis will be sufficient.

As confidence with handling and interpreting numbers grows, a researcher may wish to carry out other investigations.

Frequency

Working out how often several different outcomes occur will provide much more information about a subject than looking at just the one or two most popular. It also enables results to be shown in a graphical or tabular format, which may help in presenting them to other people. Expressing each category also as a percentage enables the researcher to remain aware of the relative proportion of each frequency to the whole.

Example

The following numbers represent the amount of money, in pounds (£), spent by ten customers on food and drink in a hostelry on a bank holiday during the 1920s.

1, 1, 1

2, 2

3

5, 5

7

8

This can be converted into a frequency chart or table:

Amount    Number of Customers (frequency)    Percentage %
£1        3                                  30
£2        2                                  20
£3        1                                  10
£5        2                                  20
£7        1                                  10
£8        1                                  10
Total     10                                 100

Once the data is in this form, it should trigger questions. Were the customers who spent £5, £7 or £8 in a group and those who spent less lone travellers? Were the customers who spent the lower amounts too poor to afford anything else, or were they in a hurry, or did they have just drinks rather than food? What do the reasons indicate about the type of person who frequented the hostelry? Why did no-one spend £4 or £6? How did the trade on a bank holiday differ from that on days that were not holidays? How did the takings compare with those for the hostelry across the road? It may not be possible to pursue each line of enquiry but the answers to some may reveal new insights.

Finding the frequency means developing work that has been done to discover the mode average. When creating any frequency chart or table, it is sensible to total the frequency column as this will guard against missing something, or counting an item twice.
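A frequency table of this kind, including the check on the total, can be sketched in Python. The layout of the printed columns is an assumption made for the illustration.

```python
from collections import Counter

spend = [1, 1, 1, 2, 2, 3, 5, 5, 7, 8]    # pounds spent by each customer

frequency = Counter(spend)
# One row per amount: (amount, number of customers, percentage of all customers)
table = [(amount, n, n / len(spend) * 100) for amount, n in sorted(frequency.items())]

for amount, n, pct in table:
    print(f"£{amount:<6}{n:>3}{pct:>6.0f}")

# Totalling the frequency column guards against a missed or double-counted item.
total = sum(n for _, n, _ in table)
print(f"{'Total':<7}{total:>3}{sum(pct for _, _, pct in table):>6.0f}")
```

If the total does not equal the number of customers in the original list, something has been missed or counted twice.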

Sometimes there are a very large number of individual frequencies and it may be appropriate to group them in bands.

Example

The amount spent in the hostelry could be expressed as:

Amount       Number of Customers (frequency)    Percentage %
£0.01–£2     5                                  50
£2.01–£4     1                                  10
£4.01–£6     2                                  20
£6.01–£8     2                                  20
Total        10                                 100
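Grouping into bands can be sketched in Python as below. The band limits are those used above; the variable names are hypothetical.

```python
spend = [1, 1, 1, 2, 2, 3, 5, 5, 7, 8]                # pounds spent by each customer
bands = [(0.01, 2), (2.01, 4), (4.01, 6), (6.01, 8)]  # lower and upper limit of each band

banded = {}
for low, high in bands:
    # count the customers whose spending falls within this band
    count = sum(1 for amount in spend if low <= amount <= high)
    banded[(low, high)] = count
    print(f"£{low}–£{high}: {count} customers, {count / len(spend):.0%}")

# Every customer should fall into exactly one band.
assert sum(banded.values()) == len(spend)
```

The choice of band limits is a matter of judgement, and different limits could give a different impression of the same data.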

Moving Averages

Moving averages can be used when analysing the same piece of information that is collected at different times, such as the highest daily temperature or the closing price of a commodity on a stock exchange. Moving averages need enough items to demonstrate an unambiguous trend, so they are not useful with small data sets.

To calculate a moving average, decide how many periods to average. There are no rules about this; it is a matter of judgement in each case. Then add together that number of values, starting with the first, and calculate their mean average. Continue by repeating the process but beginning with the second, and then the third value, until all the values have been used.

Example

The following numbers represent the number of pots per batch that were broken whilst they were being fired in a Staffordshire kiln in 1902.

12, 9, 15, 6, 10, 23, 9, 11

For a three-period moving average, calculate

(12+9+15)/3, (9+15+6)/3, (15+6+10)/3, (6+10+23)/3, (10+23+9)/3, (23+9+11)/3

The moving average is 12, 10, 10.33, 13, 14, 14.33

This has smoothed the potentially distorting effect of 6 and 23. It suggests that the number of damaged pots is increasing, though the data set is small. In practice more data would need to be studied in order to reach a sound conclusion.

In a moving average, there are always fewer resultant values than original data. The initial and final values in the data set do not have an average to set against them. If one is required, find the comparable information for the periods immediately prior to or after the data set to establish the missing terms.
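The moving average calculation can be sketched in Python. The function name `moving_average` and the rounding to two decimal places are assumptions made for this illustration.

```python
def moving_average(values, period):
    """Mean of each run of `period` consecutive values, rounded to 2 decimals."""
    return [
        round(sum(values[i:i + period]) / period, 2)
        for i in range(len(values) - period + 1)
    ]

broken_pots = [12, 9, 15, 6, 10, 23, 9, 11]   # breakages per batch
print(moving_average(broken_pots, 3))
```

Eight original values produce six averages, which shows why a moving average always has fewer terms than the data it is drawn from.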

Benefits of Moving Averages

•   They are useful when investigating change over a period of time because they help to smooth peaks and troughs in the data. This will enable the underlying position to be established.

•   They can be helpful when analysing data that is subject to regular seasonal fluctuation.

•   They are a useful way of presenting fluctuating data when a trend has been proven.

Disadvantages of Moving Averages

•   They may mask extreme values. The reason for any extreme may be relevant to understanding the topic.

•   They are not appropriate for small data sets.

•   The moving average may not reveal anything significant.

Other Statistical Measures

There are many other statistical techniques available to a researcher but these are beyond the scope of this book. Using some of these techniques may be necessary if working with an incomplete data set or to determine the degree of accuracy provided by a calculation. Statistics can be used to identify the number of items to include in a sample, whether the conclusions revealed by a study of a sample are representative of the whole population from which the sample was drawn, or whether an identified difference is significant rather than caused by chance.

Conclusion

For the historian, numbers are a tool for understanding an aspect of the past, not an end in themselves. Not all analyses will yield meaningful results and being able to identify which data to investigate is more important than carrying out the calculations. Used perceptively, data analysis can offer insight, provide evidence, identify where more research would add depth and even reveal where no additional understanding would be gained.

Data analysis will enable a researcher to communicate the results of their investigation in an objective rather than a subjective manner. In studies where numbers are not presented, it is helpful to know that any assertions made in a piece of work can be backed up with appropriate evidence, if the conclusions are challenged.