Computer Laboratory

Course pages 2013–14

Experimental Methods

An introduction to data analysis using Excel

Many simple statistical analyses can be performed using Microsoft Excel.

Preliminaries

For some statistical tests in Excel, it is necessary to load the Analysis ToolPak:

  • Office button - Excel Options - Add-Ins - Go...
  • Select Analysis ToolPak and click OK (may need to install)

Examples of different techniques will be illustrated with some data from Chapter 4 of Lazar (Table 4.11). It could be entered in Excel as follows:

A B C D
1 Lazar et al
2 Table 4.11
3
4 Participant Standard Prediction Speech
5 1 245 246 178
6 2 236 213 289
7 3 321 265 222
8 4 212 189 189
9 5 267 201 245
10 6 334 197 311
11 7 287 289 267
12 8 259 224 197
13 Mean
14 SD

Warning: The same set of data will be analysed here as if it has been collected in several different ways. Any real data could only be analysed in ways that were appropriate for the design of the experiment.

Visual inspection

Inspect the data by grouping it into bins and plotting the resulting histogram.

Augment the previously entered data with a set of bins sizes by typing 175 into cell f5, 200 into f6 and so on up to 350 in f12. Then:

  • Go to the Data tab and click on Data Analysis
  • Select Histogram and click OK
  • Select Input Range: (b5:b12 for this example)
  • Select Bin Range: (f5:f12 for this example)
    If you omit this, Excel will choose defaults that span the range of values.
  • Select Output Range: (the top left cell: h4 for this example)
  • Tick Chart Output
  • Click OK

The charts are easier to read if the heading and legend are removed.

Repeat this for the Prediction and Speech timings (putting the results at h15 and h16 for this example). Note that there are no values in any of the charts below the bottom bin value of 175 or above the top value of 350.

Superimpose normal distribution

chart showing bins and superimposed normal distribution

Most of the statistical tests rely on the underlying population data having a normal distribution. This can be checked visually by superimposing a normal distribution curve over the histogram.

Calculate the mean and standard deviation of each data series using the built-in Excel functions:

  • AVERAGE(range)
  • STDEV(range)

putting the results in b13:d14.

Calculate the bin values for a normal distribution with the same mean, standard deviation and number of samples using:

  • NORMDIST(value,mean,sd,TRUE)

putting the results in r5:r12. It is also necessary to invent extra values at either end of the range in r4 and r13. Calculate the differences multiplied by the number of samples in s5:s13.

Add these as a line on the chart:

  • Right click on a histogram bar and select Select Data...
  • Click Add
  • Give the series a name like Normal, select the values just calculated and click OK twice to give a second set of histogram bars.
  • Right click on one of the new bars and select Change Series Chart Type...
  • Select Line and click OK.
  • Right click on the line and select Format Data Series...
  • Under Line Style tick Smoothed line and click Close

Repeat this for the other two measures. Excel scales the data to fill the chart space, but it would be better if the three graphs had the same ranges on their axes to allow comparison. Select the vertical axis in the first chart, right click and use Format Axis... to fix the maximum value at 4.

There are not really sufficient samples to be definite, but it seems plausible to treat the underlying distributions as normal.

Investigating relationships

Scatter plot and regression of time with experience

Each participant's number of years' experience using computers can be copied from Lazar Table 4.20, and then the correlation with the standard data entry time investigated as follows:

  • Enter the data (12,6,3,19,16,5,8,11) in cells a5:a12 and change the heading in a4 to Experience
  • Select a4:b12
  • Go to the Insert tab and click on Scatter with only markers
  • Right click on the y axis and select Format Axis...
  • Fix the minimum value at something like 150 and the maximum at something like 350
  • Right click on a data point and select Add Trendline...
  • Tick the Display Equation and Display R-squared boxes

Observe that the slope of the least-squares line is negative, and that the squared Pearson correlation coefficient R2 is 0.5222.

Box plots

Box plots of the three distributions

The distributions can be compared using box plots:

  • For each of the three data series, calculate the mean minus the standard deviation, the maximum value, the minimum value, and the mean plus the standard deviation, putting the results in b17:d20.
  • Replicate the headings from b4:d4 in b16:d16.
  • Select the table including the row and column headings, and insert an Open-High-Low-Close stock chart selecting the Switch Row/Column option.
  • Move and resize the chart somewhere convenient, say in a22:f35, and remove the legend from the right.
  • The high-low lines are clearer with diamond end points.

A box covering the range from mean-sd to mean+sd will show approximately the 15th and 85th percentiles. Use mean±2sd/3 for an approximation to quartiles.

Z-value transformation

It is often convenient to normalise data by subtracting its mean and dividing by its standard deviation to give a Z-value. Excel has a built-in function for this:

  • STANDARDIZE(value,mean,sd)

The standardized values should have a mean of 0 and a standard deviation of 1.

t test

The graphs and box plots suggest that the task completion time for predictive entry is less than for standard entry. This can be checked with a t test:

  • Go to the Data tab and click on Data Analysis
  • Select t-Test: Two-Sample Assuming Unequal Variances and click OK
  • Select Variable 1 Range: (b4:b12 for this example)
  • Select Variable 2 Range: (c4:c12 for this example)
  • Select Labels in First Row
  • Select Output Range: (the top left cell: h36 is convenient for this example)
  • Click OK

Observe that t Stat is 2.16878 and P(T<=t) one-tail is 0.023906. This would be reported as:

An independent-samples t test showed a significant effect of the predictive software against the standard input on the task completion time (t(14) = 2.17, p < 0.05).

If the data had been collected in a paired-samples experiment, the analysis would use t-Test: Paired Two Sample for Means. In this case, t Stat is 2.631264 and P(T<=t) one tail is 0.016925. This would be reported as:

A paired-samples t test showed a significant effect of the predictive software against the standard input on the task completion time (t(7) = 2.6, p < 0.05).

The graphs also suggest that there is little difference between predictive and speech-based entry. A t test shows that t Stat is -0.43246 and P(T<=t) one-tail is 0.33625. This would be reported as:

An independent-samples t test showed no significant difference between the speech and predictive software on the task completion time (t(13) = 0.432, p > 0.05).

One-way ANOVA

Analysis of variance compares the means of two or more groups. (If there are only two groups, this is the same as a t test.)

  • Data - Data Analysis
  • Select Anova: Single factor and click OK
  • Select Input Range: (b4:d12 for this example)
  • Select Grouped by Columns
  • Select Labels in First Row
  • Select Output Range: (the top left cell: h51 is convenient for this example)
  • Click OK

Observe that F is 2.173781 and P-value is 0.138667. This would be reported as:

A one-way ANOVA test showed no significant effect of the three input methods on the task completion time (F(2,21) = 2.17, p > 0.05).