Computer Laboratory

Course pages 2013–14

Experimental Methods

Design of experiments

What is research?

  • Not just diligent search to collect information about a topic
  • Development and revision of theories
  • Enquiry and examination using experiments aimed at the discovery or interpretation of facts
    • Originating in observations or experience
    • Capable of being verified or disproved by observation or experiment
  • Practical application of theories

Observations and inferences

  • Descriptive
    • X is happening
    • Observations, field studies, focus groups, interviews
  • Relational
    • X is related to Y
    • Observations, field studies, surveys
  • Experimental
    • X is responsible for Y
    • Controlled experiments

Theory and hypothesis

  • A theory can be very broad
    • In movement tasks, the movement time increases as the movement distance increases and the size of the target decreases. The movement time is linearly related to the logarithm of the ratio of the movement distance to the width of the target. (Fitts' Law)
  • A concrete research hypothesis lays the foundation for an experiment and can be the basis for testing statistical significance
    • Fitts' Law predicts navigation times successfully for a mouse and for an eye tracker.
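
As an illustration of how the theory yields testable predictions, the sketch below uses the original formulation of Fitts' Law, MT = a + b log2(2D/W). The constants a and b are hypothetical values chosen for the example; in practice they are fitted to measurements for each input device.

```python
import math


def fitts_movement_time(distance, width, a, b):
    """Predicted movement time MT = a + b * log2(2 * distance / width)."""
    if distance <= 0 or width <= 0:
        raise ValueError("distance and width must be positive")
    index_of_difficulty = math.log2(2 * distance / width)  # measured in bits
    return a + b * index_of_difficulty


# Hypothetical constants: intercept a = 0.1 s, slope b = 0.15 s per bit.
near_large = fitts_movement_time(distance=100, width=50, a=0.1, b=0.15)
far_small = fitts_movement_time(distance=400, width=25, a=0.1, b=0.15)

# A distant, small target is predicted to take longer than a near, large one.
print(near_large, far_small)
```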

Phrasing a hypothesis

  • Hypothesis should be testable
  • Strength
    • Are pop-up menus any good?
    • Are pop-up menus better than pull-down menus?
    • Are pop-up menus faster than pull-down menus?
    • Is the time taken for an experienced user to invoke a command using a pop-up menu less than the time taken using a pull-down menu?

The scientific method

  • Formulate hypothesis
  • Design experiment
  • Test with pilot, revising design if necessary
  • Run experiment and collect data
  • Analyse data
  • Draw conclusions
  • Start again with a revised hypothesis if necessary

Null and alternative hypotheses

  • Null hypothesis – H0
    • Treatment has no effect
    • Any difference in measurements can be explained by random variation resulting from experimental procedure
  • Alternative hypothesis – H1
    • Treatment has an effect
    • Difference in measurements is unlikely to be explained by random variation
  • Experiment and statistical analysis determine whether to accept or reject the null hypothesis

Comparing pull-down and pop-up menus

  • Speed
    • H0: There is no difference between the times taken to select an item using pull-down and pop-up menus
    • H1: There is a difference between the times taken to select an item using pull-down and pop-up menus
  • Satisfaction
    • H0: There is no difference between user satisfaction selecting an item using pull-down and pop-up menus
    • H1: There is a difference between user satisfaction selecting an item using pull-down and pop-up menus
  • Two different measures – timing and questionnaire
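
One possible way to test the speed hypotheses is an exact permutation test on the difference in mean selection times. This is only a sketch: the choice of test is an assumption (the notes do not prescribe one), and the timing data below are invented for illustration.

```python
from itertools import combinations
from statistics import mean


def exact_permutation_p_value(group_a, group_b):
    """Two-sided exact permutation test on the difference in means.

    Under H0 the group labels are arbitrary, so every way of splitting the
    pooled measurements into two groups of the original sizes is equally likely.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)
    observed = abs(mean(group_a) - mean(group_b))
    extreme = 0
    splits = 0
    for indices in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in indices)
        difference = abs(sum_a / n_a - (total - sum_a) / n_b)
        # Small tolerance guards against floating-point rounding.
        if difference >= observed - 1e-12:
            extreme += 1
        splits += 1
    return extreme / splits


# Hypothetical selection times in seconds for five participants per menu type.
pull_down = [2.1, 2.4, 2.3, 2.6, 2.5]
pop_up = [1.6, 1.9, 1.8, 1.7, 2.0]

p_value = exact_permutation_p_value(pull_down, pop_up)
print(p_value)  # reject H0 at the chosen level if p_value is below alpha
```

Enumerating every split is only practical for small groups; with larger samples a random subset of splits or a parametric test would normally be used instead.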

Errors

              H1 true                          H0 true
Reject H0     True positive decision           False positive decision
                                               (Type I error)
              Probability 1-β (power)          Probability α
Accept H0     False negative decision          True negative decision
              (Type II error)
              Probability β                    Probability 1-α (confidence)

  • Aim for α < 0.05 (95% confidence level) or α < 0.01 (99% confidence level)
  • Aim for β < 0.20 for power greater than 80% (correctly rejecting the null hypothesis)
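
The targets for α and β can be turned into a rough number of participants once an expected effect size is assumed. The sketch below uses the normal approximation for a two-sided comparison of two independent group means, with the effect expressed as a standardised difference (difference in means divided by the standard deviation). Both the approximation and the effect sizes are assumptions for illustration, not part of the notes.

```python
import math
from statistics import NormalDist


def participants_per_group(effect_size, alpha=0.05, beta=0.20):
    """Approximate group size for a two-sided, two-group comparison.

    Normal approximation: n = 2 * ((z_(1 - alpha/2) + z_(1 - beta)) / d) ** 2,
    where d is the standardised effect size.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(1 - beta)
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)


# Hypothetical medium and large standardised effects.
medium = participants_per_group(0.5)
large = participants_per_group(1.0)
print(medium, large)
```

Smaller expected effects need many more participants, which is one reason to run a pilot before fixing the sample size.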

Measures of accuracy

  • Precision = TP ÷ (TP + FP)
    • probability of a detected positive being true
  • Recall = TP ÷ (TP + FN) = 1 - β
    • probability of true positive being detected
    • Power or Sensitivity
  • Specificity = TN ÷ (FP + TN) = 1 - α
    • probability of true negative being detected
  • Accuracy = (TP + TN) ÷ (TP + TN + FP + FN)
  • F1 score = 2 × Precision × Recall ÷ (Precision + Recall)
  • Fλ = (1 + λ²) × Precision × Recall ÷ (λ² × Precision + Recall)
    • recall λ times more important than precision
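
The measures above can be computed directly from the four counts of a confusion matrix. In this minimal sketch the function name and the example counts are hypothetical.

```python
def accuracy_measures(tp, fp, fn, tn, lam=1.0):
    """Compute the measures listed above from confusion-matrix counts.

    lam is the weight in the F-lambda score; lam = 1 gives the F1 score.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (fp + tn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f_lambda = (1 + lam**2) * precision * recall / (lam**2 * precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "accuracy": accuracy,
        "f_lambda": f_lambda,
    }


# Hypothetical counts: 40 true positives, 10 false positives,
# 20 false negatives and 30 true negatives.
measures = accuracy_measures(tp=40, fp=10, fn=20, tn=30)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```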

Receiver operating characteristic

  • Plots recall (1-β) against false positive rate (α)
  • Area under the curve (A') is the probability that a classifier will rank a random positive instance higher than a random negative one
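
The ranking interpretation of the area under the curve can be checked by comparing every positive instance with every negative one. The sketch below assumes that higher scores indicate the positive class and counts ties as half a win, which is a common convention rather than something stated in the notes; the scores are invented.

```python
def area_under_roc(positive_scores, negative_scores):
    """Fraction of (positive, negative) pairs in which the positive ranks higher."""
    wins = 0.0
    for positive in positive_scores:
        for negative in negative_scores:
            if positive > negative:
                wins += 1.0
            elif positive == negative:
                wins += 0.5  # assumed convention: a tie counts as half
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical classifier scores for three positive and three negative instances.
auc = area_under_roc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
print(auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.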

Controlling errors

  • Control groups with no treatment
    • Randomisation
  • Single and double blinding
    • Unconscious bias from well-intentioned evaluators
    • Placebos
  • Learning and order effects
  • Confounding
    • Two different factors give rise to the same effect
    • e.g. age and experience may both contribute to ability

Fair testing

  • Take care to match ratios of samples to population size in different strata
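
Matching sample ratios to the population can be sketched as proportional allocation across strata. The example below rounds with the largest-remainder method, which is one reasonable choice rather than a prescribed one; the strata and their sizes are invented.

```python
def proportional_allocation(strata_sizes, sample_size):
    """Split sample_size across strata in proportion to their population sizes."""
    total = sum(strata_sizes.values())
    exact = {name: sample_size * size / total for name, size in strata_sizes.items()}
    allocation = {name: int(share) for name, share in exact.items()}
    # Hand out the places lost to rounding down, largest remainder first.
    remaining = sample_size - sum(allocation.values())
    by_remainder = sorted(
        exact, key=lambda name: exact[name] - allocation[name], reverse=True
    )
    for name in by_remainder[:remaining]:
        allocation[name] += 1
    return allocation


# Hypothetical population strata.
population = {"undergraduate": 600, "postgraduate": 300, "staff": 100}
quota = proportional_allocation(population, sample_size=20)
print(quota)
```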

Validity

  • External validity
    • The extent to which results can be generalised to other people in other situations
    • Requires representative participants and representative environment
  • Internal validity
    • The extent to which effects observed can be attributed to the test conditions
    • Differences caused by conditions and variance caused by participants

Increasing validity

  • Relaxing the test environment and experimental procedures to mimic the real world is likely to introduce uncontrolled variation from sources such as distractions or secondary tasks
  • Pose several narrow (testable) questions whose outcomes together cover the broader (untestable) question
    • A technique that is faster, is more accurate, is easier to learn and is easier to remember is generally better
  • Answers to testable and untestable questions are usually correlated
  • Comparative evaluations are more informative than user studies that only identify strengths and weaknesses of a single technique

Variables

  • Independent variables
    • Different treatments being compared
    • Controlled by the experimenter
    • e.g. type of menu
  • Dependent variables
    • Effects being observed
    • Measured during the experiment
    • e.g. time taken to select an item and user satisfaction

Qualitative variables

  • Nominal or categorical
    • e.g. pull-down or pop-up menu
  • Ordinal or ranked
    • e.g. computer experience: < 1 year, 1-5 years, 5-15 years, > 15 years
    • Likert scale
      • e.g. pull-down was better: strongly disagree, disagree, neutral, agree, strongly agree
      • Even number of answers excludes neutral: strongly disagree, disagree, slightly disagree, slightly agree, agree, strongly agree
      • Balanced, so not: poor, average, good, very good, excellent

Quantitative variables

  • Discrete or continuous
  • Interval
    • Equally spaced
    • e.g. temperature
  • Ratio
    • Zero based
    • e.g. time taken to select an item, number of errors made

Other variables

  • Control variables
    • Factors that may influence a dependent variable but are not under investigation – so control them
    • Improve internal validity at the expense of external validity
  • Random variables
    • Factors that are allowed to vary randomly
    • Improve external validity at the expense of internal validity
  • Confounding variable
    • Factor that varies systematically with an independent variable
      • e.g. prior experience with a particular technique

Design

  • Between subjects
    • Each participant is exposed to a single condition
    • No risk of learning and skill transfer so no need to counterbalance
    • Variance is not controlled
  • Within subjects or repeated measures
    • Each participant is exposed to all conditions
    • Variance is controlled
    • More demanding for participants and need to counterbalance
  • Mixed design or split plot
    • Balance load on participants and control of variance

[Figure: Fisher balanced Latin square]

Counterbalancing

  • Participants' performance may improve with practice during a repeated measures experiment
  • Counterbalance the order of presenting conditions
  • Latin square has each condition appearing once in each row and in each column
  • Balanced Latin square has each condition preceding and following each other condition equal numbers of times
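
A balanced Latin square for an even number of conditions can be built with a standard construction: the first row is 0, 1, n-1, 2, n-2, ... and each later row adds one (modulo n) to the row above. In the sketch below conditions are numbered from 0, and "preceding and following" is taken to mean immediate neighbours in the presentation order. For an odd number of conditions a single square cannot be balanced in this sense, so the function rejects odd n.

```python
def balanced_latin_square(n):
    """Return n presentation orders (rows) over conditions 0..n-1, for even n."""
    if n < 2 or n % 2 != 0:
        raise ValueError("this construction needs an even number of conditions")
    # First row interleaves low and high condition numbers: 0, 1, n-1, 2, n-2, ...
    first_row = [0]
    low, high = 1, n - 1
    while len(first_row) < n:
        first_row.append(low)
        low += 1
        if len(first_row) < n:
            first_row.append(high)
            high -= 1
    # Each subsequent row shifts every condition number by one, modulo n.
    return [[(condition + shift) % n for condition in first_row] for shift in range(n)]


# Presentation orders for four conditions; assign participants to rows in rotation.
for order in balanced_latin_square(4):
    print(order)
```

With four conditions, each of the twelve ordered pairs of distinct conditions appears as immediate neighbours exactly once, so the number of participants should be a multiple of four.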

Succinct design statement

  • '3×2 repeated measures design'
  • An experiment with two different factors, having three levels for the first and two levels for the second
    • e.g. Three different text entry systems and two different tasks
  • Factorial design tests all participants on all six conditions
  • Mixed design might test half the participants on all three systems for one task, and the other half on all three systems for the other task
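
The conditions in a 3×2 design are the combinations of the levels of the two factors. The sketch below enumerates them and shows one possible split for the mixed design described above; the system and task names are placeholders, and alternating tasks by participant index is an assumption made for the example.

```python
from itertools import product

# Placeholder levels for the two factors.
systems = ["system A", "system B", "system C"]
tasks = ["task 1", "task 2"]

# Factorial design: every participant sees all six conditions.
conditions = list(product(systems, tasks))


def mixed_design_conditions(participant_index):
    """Mixed design: each participant sees all three systems for one task only."""
    task = tasks[participant_index % len(tasks)]  # alternate tasks between participants
    return [(system, task) for system in systems]


print(conditions)
print(mixed_design_conditions(0))
print(mixed_design_conditions(1))
```

The order in which each participant meets the conditions would still need counterbalancing.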

Participants

  • 'Participants' preferable to 'subjects'
  • Distinguish from users of the resulting system
  • Report recruitment procedure and selection criteria
  • Report relevant demographic information
    • Number of participants
    • Age (range, mean, standard deviation)
    • Balance of sexes
    • Prior experience

Ethical procedures

  • Appropriate experimental design
    • Test with a pilot
  • Recruitment of participants
  • Informed consent with signed form
  • Briefing with full written instructions for participant and experimenter
  • Treatment of participants
    • Dignity and respect, right to withdraw without penalty
  • Debriefing interview and further explanation
  • Data retention subject to Data Protection Act
  • Incentives and compensation

Ethical problems

  • Participants who are minors or have disabilities
  • Experiments that are likely to cause physical or mental distress or embarrassment (in any case, consent forms should make it clear that participants can withdraw at any time)
  • Experiments involving deception or emotional manipulation
  • Keeping data in any form that would allow individuals to be identified
  • Auditing compensation
  • Medical experiments
  • Experiments on non-human species

Ethical approval

  • Laboratory Research Ethics Committee
  • Application
    • Description of the experiment
    • Consent form
    • Questionnaires
    • Details of remuneration

Reporting experiments

  • Method
    • Participants
    • Apparatus
    • Procedure
    • Design
  • Results
  • Discussion
