Computer Laboratory

Ian Davies

Tadas: Data Analysis

Using R

Download the data (to make sure we're working from the same data!)

Load the data:

	data = read.table("C:\\tadas_data.csv", sep=",", header=TRUE)

Change the users, conditions and trials into factors (categories)

	data$condition = factor(data$condition)
	data$trial = factor(data$trial)
	data$user = factor(data$user)

Just for fun (and to see a very pretty normal distribution), plot a histogram of the errors

	hist(data$error)

Take the absolute value of the errors (so that we get a difference in means rather than just a difference in variance)

	data$error = abs(data$error)

Now plot a histogram of the absolute error

	hist(data$error)

See that the distribution is no longer normal, so we can't do a standard analysis of variance (which assumes a normal distribution). Draw a box-and-whisker plot to see the variation between users

	plot(data$user, data$error)

Clearly there is a huge variation, so we're going to do a within-subjects analysis. To overcome the non-normally-distributed data problem, we'll just rank the errors within subjects (i.e. the smallest error is ranked 1 and the largest is ranked up to 48, depending on how many ties there are). First split up the data so that each user can be ranked separately

	user_groups = split(data, data$user)

Now add an extra "rank" column to each group, ranking the errors

	user_groups$`0`$rank = rank(user_groups$`0`$error)
	user_groups$`1`$rank = rank(user_groups$`1`$error)
	user_groups$`2`$rank = rank(user_groups$`2`$error)
	user_groups$`3`$rank = rank(user_groups$`3`$error)
	user_groups$`4`$rank = rank(user_groups$`4`$error)
	user_groups$`5`$rank = rank(user_groups$`5`$error)
	user_groups$`6`$rank = rank(user_groups$`6`$error)
	user_groups$`7`$rank = rank(user_groups$`7`$error)
	user_groups$`8`$rank = rank(user_groups$`8`$error)
	user_groups$`9`$rank = rank(user_groups$`9`$error)
	user_groups$`10`$rank = rank(user_groups$`10`$error)
	user_groups$`11`$rank = rank(user_groups$`11`$error)

(Yes I'm sure there's a quicker way of doing this!)
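
For what it's worth, I think base R's ave() does the whole rank-within-user step in one line, skipping the split/unsplit dance; I haven't checked that it matches the version above exactly, but it should:

	# rank the errors within each user, keeping rows in their original order
	data$rank = ave(data$error, data$user, FUN = rank)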

Now merge the groups back together

	rank_data = unsplit(user_groups, data$user)

To find out whether there's a significant difference between the conditions, do an analysis of variance on the conditions and ranks. rank ~ condition means "Rank is dependent on condition"

	summary(aov(rank_data$rank ~ rank_data$condition))

The output is fairly confusing, but basically the single * you get at the end of the condition line means that there is a significant effect between two or more conditions with p < 0.05. Now we need to do pairwise comparisons to find out which conditions differ significantly from the others. First split up the data again, this time by condition.

	condition_groups = split(rank_data, rank_data$condition)

Now create all possible pairwise concatenations of the condition groups.

	rank_conditions01 = c(condition_groups$`0`$rank, condition_groups$`1`$rank)
	rank_conditions02 = c(condition_groups$`0`$rank, condition_groups$`2`$rank)
	rank_conditions03 = c(condition_groups$`0`$rank, condition_groups$`3`$rank)
	rank_conditions12 = c(condition_groups$`1`$rank, condition_groups$`2`$rank)
	rank_conditions13 = c(condition_groups$`1`$rank, condition_groups$`3`$rank)
	rank_conditions23 = c(condition_groups$`2`$rank, condition_groups$`3`$rank)

We now have individual vectors, each one just containing the rankings for two conditions. The first 144 elements are for the first condition, the next 144 elements are for the second condition. We create another vector that labels these two halves of the array appropriately. If this bit doesn't make sense, let me know, it might be easier to explain over the phone.

	condition_labels = factor(floor(0:287 / 144))
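
(If the floor-division trick looks opaque, rep() should build exactly the same 288-element factor:)

	# 144 zeros followed by 144 ones, same labels as above
	condition_labels = factor(rep(0:1, each = 144))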

Now do individual AOVs on these data sets to see where the statistically significant differences are.

	summary(aov(rank_conditions01 ~ condition_labels))
	summary(aov(rank_conditions02 ~ condition_labels))
	summary(aov(rank_conditions03 ~ condition_labels))
	summary(aov(rank_conditions12 ~ condition_labels))
	summary(aov(rank_conditions13 ~ condition_labels))
	summary(aov(rank_conditions23 ~ condition_labels))
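
(If typing out all six pairs gets tedious, a loop over combn() should run the same comparisons; this is just a sketch, and the names pairs, a, b and labels are made up here.)

	# sketch: the same pairwise AOV for every pair of conditions
	pairs = combn(levels(rank_data$condition), 2)
	for (i in 1:ncol(pairs)) {
	  a = condition_groups[[pairs[1, i]]]$rank
	  b = condition_groups[[pairs[2, i]]]$rank
	  labels = factor(rep(0:1, c(length(a), length(b))))
	  cat("Conditions", pairs[, i], "\n")
	  print(summary(aov(c(a, b) ~ labels)))
	}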

And, as if by magic, we find that there are significant differences between condition 0 and all the others, but no differences among the others. Phew. There is one last thing to show, and that's the direction of the effect. We just need to show that the mean ranking was higher (= bigger error) for condition zero. This is easy.

	condition_0_mean_rank = mean(condition_groups$`0`$rank)
	condition_1_mean_rank = mean(condition_groups$`1`$rank)
	condition_2_mean_rank = mean(condition_groups$`2`$rank)
	condition_3_mean_rank = mean(condition_groups$`3`$rank)

Now just observe that condition_0_mean_rank is higher than the others, as we hoped.
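
(And if you'd rather get all four means in one call, tapply() should give the same numbers, indexed by condition:)

	# mean rank per condition in one go
	tapply(rank_data$rank, rank_data$condition, mean)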