Jones, 2019
t-tests
This workshop will cover three types of t-test: the one-sample, two-sample and paired t-tests. It
will cover when to use each test, the assumptions of each test, how to perform the test and
how to accurately report each test. Work through each t-test, making sure to understand
what each line of code does. In some places, code is not provided as you have covered this
already e.g. reading in your data and producing histograms. It is vital that you understand and
can perform these statistical tests yourselves, as you will be alone when undertaking the exam
and will be expected to perform statistical tests.
As you will have discovered, there are many ways of doing the same thing in R e.g. reading in
data and subsetting, so there is no ‘right’ answer. Please just use the method that works for
you and which you understand. However, there are specific ways of reporting your results
and performing these tests. For more information on t-tests, please refer to sections 9.4-9.6
of the core text book.
For this workshop, I have highlighted code in grey and things you have to do in red. It is also
important you read through the background material too, as this may be examined.
This workshop is based on material by Childs (2018).
One sample t-test
When do we use one-sample t-test?
The one-sample t-test is the simplest of statistical tests. It is used in situations where we have
a sample of a numeric variable from a population, and we need to compare the population
mean to a particular value. The one-sample t-test uses information in the sample to evaluate
whether the population mean is likely to be different from this value. The expected value
might be something predicted from theory, or some other prespecified value we are
interested in. Here are a couple of examples:
• We have a theoretical model of foraging behaviour that predicts an animal should
leave a food patch after 10 minutes. If we have data on the actual time spent by 25
animals observed foraging in the patch, we could test whether the mean foraging time
is significantly different from the prediction using a one-sample t-test.
• We are monitoring sea pollution and have a series of water samples from a beach. We
wish to test whether the mean density of faecal coliforms (bacteria indicative of
sewage discharge) for the beach can be regarded as greater than the legislated limit.
A one-sample t-test will enable us to test whether the mean value for the beach as a
whole exceeds this limit.
How does the one-sample t-test work?
Imagine we have taken a sample of a variable (called ‘X’) and we want to evaluate whether
the mean is different from some number. Here’s an example of what these data might look
like, assuming a sample size of 50 was used:
Figure 1: Example of data used in a one-sample t-test
The red line shows the sample mean, and the blue line shows the expected value (this is 10,
so this example could correspond to the foraging study mentioned above). The observed
sample mean is about one unit larger than the expected value. The question is, how do we
decide whether the population mean is really different from the expected value? Perhaps the
difference between the observed and expected value is due to sampling variation. Here’s how
a frequentist tackles the question:
• We have to first set up an appropriate null hypothesis, i.e. an hypothesis of ‘no effect’
or ‘no difference’. The null hypothesis in this instance is that the population mean is
equal to the expected value.
• We then have to work out what the sampling distribution of the mean looks like under
this null hypothesis. This is the null distribution. We use the null distribution to assess
how likely the observed result is under the null hypothesis.
The new idea is that now we will make an extra assumption. The key assumption of the one-sample t-test is that the variable is normally distributed in the population. The distribution above looks roughly bell-shaped, so it seems plausible that it was drawn from a normal distribution.
Now, because we’re prepared to make the normality assumption, the whole process of
carrying out the statistical test is very simple. The consequence of the normality assumption
is that the null distribution will have a known mathematical form: it is related to the t-distribution. We can use this knowledge to construct the test of statistical significance. But instead of using the whole sample, as we did with bootstrapping, we only need three pieces of information to construct the test: the sample size, the sample variance, and the sample mean. No resampling of data is involved.
So how does a one-sample t-test actually work? It is carried out as follows:
Step 1. Calculate the mean. That’s simple enough. This is our ‘best guess’ of the unknown
population mean. However, its role in the one-sample t-test is to allow us to construct a test
statistic in the next step.
Step 2. Estimate the standard error of the sample mean. This gives us an idea of how much
sampling variation we expect to observe. The standard error doesn’t depend on the true value
of the mean, so the standard error of the sample mean is also the standard error of any mean
under any particular null hypothesis. This step boils down to applying a simple formula
involving the sample size and the standard deviation of the sample:
SE = √(s² / n)

…where s² is the square of the standard deviation (the sample variance) and n is the sample size. The standard error of the mean gets smaller as the sample size grows or the sample variance shrinks.
Step 3. Calculate a ‘test statistic’ from the sample mean and standard error. We calculate this by dividing the difference between the sample mean (step 1) and the expected value by the estimated standard error (step 2):

t = (sample mean − expected value) / SE

Why is this useful? If our normality assumption is reasonable, this test statistic follows a t-distribution. This is guaranteed by the normality assumption. So this particular test statistic is also a t-statistic. That’s why we label it t. This knowledge leads to the final step…
Step 4. Compare the t-statistic to the theoretical predictions of the t-distribution to assess
the statistical significance of the difference between observed and expected value. We
calculate the probability that we would have observed a difference with a magnitude as large
as, or larger than, the observed difference, if the null hypothesis were true. That’s the p-value
for the test.
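To make the four steps concrete, here is a minimal sketch of the calculations in base R. The data are hypothetical (invented foraging times, for illustration only); t.test(x, mu = 10) reproduces the same numbers:

```r
# Hypothetical sample of foraging times (minutes); expected value from theory is 10
x <- c(11.2, 9.8, 12.1, 10.5, 11.7, 9.9, 12.4, 10.8, 11.1, 10.6)
expected <- 10

# Step 1: the sample mean
x_bar <- mean(x)

# Step 2: the standard error of the mean, sqrt(s^2 / n)
se <- sqrt(var(x) / length(x))

# Step 3: the t-statistic
t_stat <- (x_bar - expected) / se

# Step 4: two-tailed p-value from the t-distribution with n - 1 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = length(x) - 1)
```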
We could step through the actual calculations involved in these steps in detail, using R to help
us, but there’s no need to do this. We can let R handle everything for us. But first, we should
review the assumptions of the one-sample t-test.
Assumptions of the one-sample t-test
There are a number of assumptions that need to be met in order for a one-sample t-test to
be valid. Some of these are more important than others. We’ll start with the most important
and work down the list in order of importance:
1. Independence. People tend to forget about this one. We’ll discuss the idea of
independence later when we consider principles of experimental design. For now, we
just need to state why the assumption matters: if the data are not independent, the p-values generated by the one-sample t-test will be smaller than they should be.
2. Measurement scale. The variable being analysed should be measured on an interval
or ratio scale, i.e. it should be a numeric variable. It doesn’t make much sense to apply
a one-sample t-test to a variable that isn’t measured on one of these scales.
3. Normality. The one-sample t-test will only produce completely reliable p-values if the
variable is normally distributed in the population. This assumption is less important
than many people think. The t-test is fairly robust to mild departures from normality
when the sample size is small, and when the sample size is large the normality
assumption matters even less.
We don’t have the time to properly explain why the normality assumption is not too
important for large samples, but we will at least state the reason: it is a consequence of
something called the ‘central limit theorem’.
How do we evaluate these assumptions? The first two are really aspects of experimental
design, i.e. we can only evaluate them by thinking carefully about how the data were gathered
and what was measured. What about the 3rd assumption? One way to evaluate the normality
assumption is by plotting the sample distribution using something like a histogram or a dot
plot. If the sample size is small, and the sample looks approximately normal when we visualise
its distribution, then it is probably fine to use the t-test. If we have a large sample we don’t
need to worry much about moderate departures from normality. It’s hard to define what
constitutes a ‘large’ sample, but 100s of observations would often be safe.
Carrying out a one-sample t-test in R
We’re going to use the plant morph example to learn how to carry out a one-sample t-test in
R. The data were ‘collected’ to compare the mean dry weight of purple and green morphs.
This question can’t be tackled with a one-sample t-test. Instead, let’s pretend that we have
unearthed a report from 30 years ago that found the mean size of purple morphs to be 710
grams. We want to evaluate whether the mean size of purple plants in the contemporary
population is different from this expectation, because we think they may have adapted to
local conditions.
Read the data in MORPH_DATA.CSV into an R data frame, giving it the name morph.data. We
only need the purple morph data for this example, so we need to subset the data to get hold
of only the purple plants. Call this morph.purple.
Next, we need to explore the data. Plot a histogram of the purple morph weight data to check the normality assumption. You learnt how to do this in the data visualisation workshop.
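As noted above, there is more than one valid way to do this. One possible approach (a sketch, assuming MORPH_DATA.CSV is in your working directory and that purple rows are labelled "Purple" in the Colour column) is:

```r
# Read the data and keep only the purple morph
morph.data <- read.csv("MORPH_DATA.CSV")
morph.purple <- subset(morph.data, Colour == "Purple")

# Histogram of purple plant weights to check the normality assumption
hist(morph.purple$Weight, xlab = "Dry weight (g)", main = "Purple morph")
```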
Carrying out the test
It is fairly straightforward to carry out a one-sample t-test in R. The function we use is
called t.test (no surprises there). We read the data into a data frame called morph.data, then subsetted it to keep only the purple plants (morph.purple). This data frame has two columns: Weight contains the dry weight biomass
of purple plants, and Colour is an index variable that indicates which sample (plant morph) an
observation belongs to. We don’t need the Colour column at this point.
Here’s the R code to carry out a one-sample t-test:
t.test(morph.purple$Weight, mu = 710)
We have suppressed the output because we want to first focus on how to use the t.test function.
We have to assign two arguments to control what the function does:
1. The first argument (morph.purple$Weight) is simply a numeric vector containing the
sample values. We can’t give t.test a data frame when doing a one-sample test.
Instead, we have to pull out the column we’re interested in using the $ operator, in this case Weight.
2. The second argument (called mu) sets the expected value we want to compare the
mean to, so mu = 710 tells the function to compare the mean to a value of 710. This
can be any value we like, depending on the question we’re asking.
That’s it for setting up the test. Let’s take a look at the output:
The first line tells us what kind of t-test we used. This says: One Sample t-test. So we know
that we used the one-sample t-test. The next line reminds us about the data. This says: data: morph.purple$Weight, which is R-speak for ‘we compared the mean of the Weight variable to an expected value’. Which value? This is given later.
The third line of text is the most important. This says: t = 3.1811, df = 76, p-value = 0.002125.
The first part of this, t = 3.1811, is the test statistic, i.e. the value of the t-statistic. The second
part, df = 76, summarises the ‘degrees of freedom’. This is essentially a measure of how much
power our statistical test has (see the box below). The third part, p-value = 0.002125, is the
all-important p-value.
The p-value indicates that there is a statistically significant difference between the mean dry
weight biomass and the expected value of 710 g (p is less than 0.05). Because the p-value is
less than 0.01 but greater than 0.001, we report this as ‘p < 0.01’. Read through
the hypotheses and p-values document for more information here.
Don’t ignore the fourth line of text (alternative hypothesis: true mean is not equal to 710).
This reminds us what the alternative to the null hypothesis is (H1). It tells us what expected
value was used in the test (710).
The next two lines show us the ‘95% confidence interval’ for the difference between the
means. We don’t really need this information, but we can think of this interval as a rough
summary of the likely values of the true mean. In reality, a confidence interval is more
complicated than that.
The last few lines summarise the sample mean. This is useful to see the mean of your data and the direction of the test, i.e. is the mean of the purple morph data higher or lower than 710?
Summarising the result
Having obtained the result we need to write the conclusion. Remember, we are testing a
scientific hypothesis, so always go back to the original question to write the conclusion. In this
case the appropriate conclusion is:
The mean dry weight biomass of purple plants is significantly different from the expectation
of 710 grams (t = 3.18, d.f. = 76, p < 0.01).
This is a concise and unambiguous statement in response to our initial question. The
statement indicates not just the result of the statistical test, but also which value was used in
the comparison. It is sometimes appropriate to give the values of the sample mean in the
conclusion:
The mean dry weight biomass of purple plants (767 grams) is significantly different from the
expectation of 710 grams (t = 3.18, d.f. = 76, p < 0.01).
Notice that we include details of the test in the conclusion. However, keep in mind that when
writing scientific reports, the end result of any statistical test should be a conclusion like the
one above. Simply writing t = 3.18 or p < 0.01 is not an adequate conclusion.
There are a number of common questions that arise when presenting t-test results:
1. What do I do if t is negative? Don’t worry. A t-statistic can come out negative or
positive, it simply depends on which order the two samples are entered into the
analysis. Since it is just the absolute value of t that determines the p-value, when
presenting the results, just ignore the minus sign and always give t as a positive
number.
2. How many significant figures for t? The t-statistic is conventionally given to 3
significant figures. This is because, in terms of the p-value generated, there is almost
no difference between, say, t = 3.1811 and t = 3.18.
3. Upper or lower case? The t-statistic should always be written as lower case when
writing it in a report (as in the conclusions above). Similarly, d.f. and p are always best
as lower case. Some statistics we encounter later are written in upper case but, even
with these, d.f. and p should be lower case.
4. How should I present p? There are various conventions in use for presenting p-values.
Have a look at the hypotheses and p values document. Learn them! It’s not possible
to understand scientific papers or prepare reports properly without knowing these
conventions.
Two-sample t-test
When do we use a two-sample t-test?
The two-sample t-test is used to compare the means of a numeric variable sampled from two
independent populations. The aim of a two-sample t-test is to evaluate whether the mean of this variable differs between the two populations. Here are two examples:
• We’re studying how the dietary habits of Scandinavian eagle-owls vary among
seasons. We suspect that the dietary value of a prey item is different in the winter and
summer. To evaluate this prediction, we measure the size of Norway rat skulls in the
pellets of eagle-owls in summer and winter, and then compare the mean size of rat
skulls in each season using a two-sample t-test.
• We’re interested in the effect of a drug on weight loss in separate groups of patients.
Participants were given either drug A or drug B, and after two weeks they were weighed. They were not weighed before the treatments, so we simply have one mean weight for each group of participants: those that took drug A and those that took drug B. We would then compare the mean weights of participants in the two groups using a two-sample t-test.
How does the two-sample t-test work?
Imagine that we have taken a sample of a variable (called ‘X’) from two populations, labelled
‘A’ and ‘B’. Here’s an example of how these data might look if we took a sample of 50 items
from each population:
Figure 2: Example of data used in a two-sample t-test
The two distributions overlap quite a lot. However, this particular observation isn’t all that
relevant here. We’re not interested in the raw values of ‘X’ in the two samples. It’s the
difference between the means that matters. The red lines are the mean of each sample—
sample B obviously has a larger mean than sample A. The question is, how do we decide
whether this difference is ‘real’, or purely a result of sampling variation?
Using a frequentist approach, we tackle this question by first setting up the appropriate null
hypothesis. The null hypothesis here is that there is no difference between the population
means. We then have to work out what the ‘null distribution’ looks like. Here, this is the sampling distribution of the differences between sample means under the null hypothesis. Once we
have the null distribution worked out we can calculate a p-value.
The important difference is that we have to make an extra assumption to use the two-sample t-test. We have to assume the variable is normally distributed in each population. If
this assumption is valid, then the null distribution will have a known form, which is closely
related to the t-distribution.
We only need to use a few pieces of information to carry out a two-sample t-test. These are
basically the same quantities needed to construct the one-sample t-test, except now there
are two samples involved. We need the sample sizes of A and B, the sample variances of A
and B, and the estimated difference between the sample means. That’s it.
How does it actually work? The two-sample t-test is carried out as follows:
Step 1. Calculate the two sample means, then calculate the difference between these
estimates. This estimate is our ‘best guess’ of the true difference between means. As with the
one-sample test, its role in the two-sample t-test is to allow us to construct a test statistic.
Step 2. Estimate the standard error of the difference between the sample means under the
null hypothesis of no difference. This gives us an idea of how much sampling variation we
expect to observe in the estimated difference, if there were actually no difference between
the means.
There are a number of different options for estimating this standard error. Each one makes a
different assumption about the variability of the two populations. Whichever choice we
make, the calculation always boils down to a simple formula involving the sample sizes and
sample variances. The standard error gets smaller when the sample sizes grow, or when the
sample variances shrink. That’s the important point really.
Step 3. Once we have estimated the difference between sample means and its standard error,
we can calculate the test statistic. This is a type of t-statistic, which we calculate by dividing
the difference between sample means (from step 1) by the estimated standard error of the
difference (from step 2):

t = (difference between sample means) / SE of the difference
This t-statistic is guaranteed to follow a t-distribution if the normality assumption is met. This
knowledge leads to the final step…
Step 4. Compare the test statistic to the theoretical predictions of the t-distribution to assess
the statistical significance of the observed difference. That is, we calculate the probability that
we would have observed a difference between means with a magnitude as large as, or larger
than, the observed difference, if the null hypothesis were true. That’s the p-value for the test.
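As a sketch of these steps in base R (the two samples below are hypothetical, for illustration only), the Welch version of the calculation looks like this:

```r
# Two hypothetical samples
a <- c(10.1, 11.4, 9.7, 10.9, 11.8, 10.3)
b <- c(12.0, 11.1, 12.8, 11.6, 13.1, 12.3)

# Step 1: difference between the sample means
diff_means <- mean(a) - mean(b)

# Step 2: standard error of the difference (Welch form)
se_diff <- sqrt(var(a) / length(a) + var(b) / length(b))

# Step 3: the t-statistic
t_stat <- diff_means / se_diff

# Step 4 compares t_stat to a t-distribution;
# t.test(a, b) carries out all four steps (including the Welch degrees of freedom)
```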
We could step through the various calculations involved in these steps, but there isn’t much to be gained by doing this. The formulae for the standard two-sample t-test and its variants are summarised on the t-test Wikipedia page. There’s no need to learn these; we’re going to let R handle it again.
Let’s review the assumptions of the two-sample t-test first…
Assumptions of the two-sample t-test
There are several assumptions that need to be met for a two-sample t-test to be valid. These
are basically the same assumptions that matter for the one-sample version. We start with the
most important and work down the list in decreasing order of importance:
1. Independence. Remember what we said in our discussion of the one-sample t-test. If
the data are not independent, the p-values generated by the test will be too small,
and even mild non-independence can be a serious problem. The same is true of the
two-sample t-test.
2. Measurement scale. The variable that we are working with should be measured on an
interval or ratio scale. It makes little sense to apply a two-sample t-test to a variable
that isn’t measured on one of these scales.
3. Normality. The two-sample t-test will produce exact p-values if the variable is
normally distributed in both populations. Just as with the one-sample version, the
two-sample t-test is fairly robust to mild departures from normality when the sample
sizes are small, and this assumption matters even less when the sample sizes are large.
How do we evaluate the first two assumptions? As with the one-sample test, these are
aspects of experimental design—we can only evaluate them by thinking about the data. The
normality assumption may be checked by plotting the distribution of each sample. The
simplest way to do this is with histograms or dot plots. Note that we have to examine the
distribution of each sample, not the combined distribution of both samples. If both samples
look approximately normal then it’s fine to proceed with the two-sample t-test, and if we
have a large sample we don’t need to worry too much about moderate departures from
normality.
What about the equal variance assumption?
It is sometimes said that when applying a two-sample t-test the variance (i.e. the dispersion)
of each sample must be the same, or at least quite similar. This would be true if we were using the original version of Student’s two-sample t-test. However, R doesn’t use this version of the test by default. R uses the ‘Welch’ version of the two-sample t-test. The Welch two-sample t-test does not rely on the equal variance assumption. As long as we stick with this version of the t-test, the equal variance assumption isn’t one we need to worry about.
Is there ever any reason not to use the Welch two-sample t-test? The alternative is to use the
original Student’s t-test. This version of the test is a little more powerful than Welch’s version,
in the sense that it is more likely to detect a difference in means. However, the increase in
statistical power is really quite small when the sample sizes of each group are similar, and the
original test is only correct when the population variances are identical. Since we can never
prove the ‘equal variance’ assumption—we can only ever reject it—it is generally safer to just
use the Welch two-sample t-test.
One last warning. Student’s two-sample t-test assumes the variances of the populations are
identical. It is the population variances, not the sample variances, that matter. There are
methods for comparing variances, and people sometimes suggest using these to select ‘the
right’ t-test. This is bad advice. For reasons just outlined, there’s little advantage to using
Student’s version of the test if the variances really are the same. What’s more, the process of
picking the test based on the results of another statistical test affects the reliability of the
resulting p-values.
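If you ever do need Student’s original version of the test, t.test can produce it via its var.equal argument; by default R runs the Welch version:

```r
# Welch two-sample t-test (the default, and usually the safer choice)
t.test(Weight ~ factor(Colour), data = morph.data)

# Student's two-sample t-test, which assumes equal population variances
t.test(Weight ~ factor(Colour), data = morph.data, var.equal = TRUE)
```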
Carrying out a two-sample t-test in R
We’ll work with the plant morph example again to learn how to carry out a two-sample t-test in R. We’ll use the test to evaluate whether or not the mean dry weight of purple plants
is different from that of green plants.
Read the data in MORPH_DATA.CSV into an R data frame, giving it the name morph.data. N.B. this should already be read in from the one-sample t-test.
Next, we need to explore the data. Plot histograms of the purple and green morph weight data separately to check the normality assumption. You should already have subsetted the purple data and produced a histogram of it. You learnt how to do this in the data visualisation workshop.
The sample sizes are 173 (green plants) and 77 (purple plants). These are good sized samples, so
hopefully the normality assumption isn’t a big deal here. Nonetheless, we still need to check
the distributional assumptions. There is nothing too ‘non-normal’ about the two samples so
it’s reasonable to assume they both came from normally distributed populations.
Carrying out the test
The function we need to carry out a two-sample t-test in R is the t.test function, i.e. the same
one we used for the one-sample test.
Remember, morph.data has two columns: Weight contains the dry weight biomass of each
plant, and Colour is an index variable that indicates which sample (plant morph) an
observation belongs to. Here’s the code to carry out the two-sample t-test:
t.test(Weight ~ factor(Colour), data=morph.data)
We have suppressed the output for now so that we can focus on how to use the function. We
have to assign two arguments:
1. The first argument is a formula. We know this because it includes a ‘tilde’ symbol: ~.
The variable name on the left of the ~ must be the variable whose mean we want to
compare (Weight). The variable on the right must be the indicator variable that says
which group each observation belongs to (Colour).
2. The second argument is the name of the data frame that contains the variables listed
in the formula. This should be morph.data as we want to look at both the green and
purple morphs.
Let’s take a look at the output:
The first part of the output reminds us what we did. The first line reminds us what kind of t-test we used. This says: Welch two-sample t-test, so we know that we used the Welch version
of the test that accounts for the possibility of unequal variance. The next line reminds us about
the data. This says: data: Weight by Colour, which is R-speak for ‘we compared the means of
the Weight variable, where the sample membership is defined by the values of
the Colour variable’.
The third line of text is the most important. This says: t = -2.7808, df = 140.69, p-value = 0.006165. The first part, t = -2.7808, is the test statistic (i.e. the value of the t-statistic). The second part, df = 140.69, summarises the ‘degrees of freedom’ (see the box below). The third
part, p-value = 0.006165, is the all-important p-value. This says there is a statistically
significant difference in the mean dry weight biomass of the two colour morphs,
because p<0.05. As the p-value is less than 0.01 but greater than 0.001, we would report this
as ‘p < 0.01’.
The fourth line of text (alternative hypothesis: true difference in means is not equal to 0)
simply reminds us of the alternative to the null hypothesis (H1). We can ignore this.
The next two lines show us the ‘95% confidence interval’ for the difference between the
means. Just as with the one sample t-test we can think of this interval as a rough summary of
the likely values of the true difference (again, a confidence interval is more complicated than
that in reality).
The last few lines summarise the sample means of each group. This is useful as it shows the
means for each of the groups i.e. you will know which is greater than the other.
Summarising the result
Having obtained the result, we need to report it. We should go back to the original question
to do this. In our example the appropriate summary is:
Mean dry weight biomass of purple and green plants differs significantly (Welch’s t = 2.78,
d.f. = 140.7, p < 0.01), with purple plants being the larger.
This is a concise and unambiguous statement in response to our initial question. The
statement indicates not just the result of the statistical test, but also which of the mean values
is the larger. Always indicate which mean is the largest. It is sometimes appropriate to also
give the values of the means:
The mean dry weight biomass of purple plants (767 grams) is significantly greater than that
of green plants (708 grams) (Welch’s t = 2.78, d.f. = 140.7, p < 0.01)
When we are writing scientific reports, the end result of any statistical test should be a
statement like the one above—simply writing t = 2.78 or p < 0.01 is not an adequate
conclusion!
Paired t-test
When do we use a paired-sample t-test?
We learned before how to use a two-sample t-test to compare means among two populations. However, there are situations in which data may naturally form pairs of non-independent observations: the first value in sample A is linked in some way to the first value in sample B, the second value in sample A is linked with the second value in sample B, and so on. This is known, unsurprisingly, as a paired-sample design. A common example of a paired-sample design is the situation where we have a set of organisms, and we record some measurement from each organism before and after an experimental treatment. For example,
if we were studying heart rate in relation to position (sitting vs. standing) we might measure
the heart rate of a number of people in both positions. In this case the heart rate of a
particular person when sitting is paired with the heart rate of the same person when standing.
In biology, we often have the problem that there is a great deal of variation between the items we’re studying (individual organisms, forest sites, etc.). There may be so much among-item variation that the effect of any difference among the situations we’re really interested in is obscured. A paired-sample design gives us a way to control for this variation. However,
we should not use a two-sample t-test when our data have this kind of structure. Let’s find
out why.
Why do we use a paired-sample design?
Consider the following. A drug company wishes to test two drugs for their effectiveness in
treating a rare illness in which glycolipids are poorly metabolised. An effective drug is one that
lowers glycolipid concentrations in patients. The company is only able to find 8 patients willing
to cooperate in the early trials of the two drugs. What’s more, the 8 patients vary in their age,
sex, body weight, severity of symptoms and other health problems.
One way to conduct an experiment that evaluates the effect of the new drug is to randomly
assign the 8 patients to one or other drug and monitor their performance. However, this kind
of design is very unlikely to detect a statistically significant difference between the treatments. This is because it provides very little replication, yet we can expect considerable variability from one person to another in the levels of glycolipid before any treatment is applied. This variability would lead to a large standard error in the difference between means.
A solution to this problem is to treat each patient with both drugs in turn and record the
glycolipid concentrations in the blood, for each patient, after a period taking each drug. One
arrangement would be for four patients to start with drug A and four with drug B, and then
after a suitable break from the treatments, they could be swapped over onto the other drug.
This would give us eight replicate observations on the effectiveness of each drug and we can
determine, for each patient, which drug is more effective.
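When we come to the analysis, a design like this is handled with a paired t-test, which works on the within-patient differences. As a sketch (the glycolipid values below are invented for illustration, with value i on drug A paired with value i on drug B for the same patient):

```r
# Hypothetical glycolipid concentrations for 8 patients, one value per drug,
# in patient order so that the two vectors pair up patient by patient
drug_a <- c(178, 165, 183, 170, 160, 175, 168, 180)
drug_b <- c(162, 150, 175, 158, 159, 160, 155, 165)

# Paired t-test on the within-patient differences
t.test(drug_a, drug_b, paired = TRUE)
```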
The experimental design, and one hypothetical outcome, is represented in the diagram
below…
Figure 3: Data from glycolipid study, showing paired design. Each patient is denoted by a
unique number.
Each patient is represented by a unique number (1-8). The order of the drugs in the plot does
not matter—it doesn’t mean that Drug A was tested before Drug B just because Drug A
appears first. Notice that there is a lot of variability in these data, both in the glycolipid levels
of each patient, and in the amount by which the drugs differ in their effects (e.g. the drugs
have roughly equal effects for patient 5, while drug B appears to be more effective for patient
2). What can also be inferred from this pattern is that although the glycolipid levels vary a
good deal between patients, Drug B seems to reduce glycolipid levels more than Drug A.
The advantage to using a paired-sample design in this case is clear if we look at the results we
might have obtained on the same patients, but where they have been divided into two groups
of four, giving one group Drug A and one group Drug B:
Figure 4: Data from glycolipid study, ignoring paired design.
The patients and their glycolipid levels are identical to those in the previous diagram, but only
patients 2, 3, 4 and 8 (selected at random) were given Drug A, while only patients 1, 5, 6, and
7 were given Drug B. The means of the two groups are different, with Drug B performing
better, but the associated standard error would also be large relative to this difference. A
two-sample t-test would certainly fail to identify a significant difference between the two
drugs.
So, it would be quite possible to end up with two groups where there was no clear difference
in the mean glycolipid levels between the two drug treatments even though Drug B seems to
be more effective in the majority of patients. What the pairing is doing is allowing us to factor
out (i.e. remove) the variation among individuals, and concentrate on the differences
between the two treatments. The result is a much more sensitive evaluation of the effect
we’re interested in.
The next question is, how do we go about analysing paired data in a way that properly
accounts for the structure in the data?
How do you carry out a t-test on paired-samples?
It should be clear why a paired-sample design might be useful, but how do we actually
construct the right test? The ‘trick’ is to work directly with the differences between pairs of
values. In the case of the glycolipid levels illustrated in the first diagram, we noted that there
was a greater decrease of glycolipids in 75% of patients using Drug B compared with Drug A.
If we calculate the actual differences (i.e. subtract the value for Drug A from the value for
Drug B) for each patient we might see something like…
-3.9 -4.3 2.5 -4.5 0.5 -3.9 -7.1 -2.6
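If you want to follow along in R, these eight differences can be summarised directly. The vector below simply copies the values quoted above:

```r
# The eight within-patient differences quoted above (Drug B minus Drug A)
diffs <- c(-3.9, -4.3, 2.5, -4.5, 0.5, -3.9, -7.1, -2.6)

sum(diffs > 0)  # number of positive differences: 2
mean(diffs)     # mean difference: -2.9125, i.e. roughly -2.9
```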
Notice that there are only two positive values in this sample of differences, one of which is
fairly close to 0. The mean difference is -2.9, i.e. on average, glycolipid levels are lower with
Drug B. Another way of stating this observation is that within subjects (patients), the mean
difference between drug B and drug A is negative. A paired-sample design focusses on the
within-subject (or more generally, within-item) change.
If, on the other hand, the two drugs had had similar effects then what would we expect to
see? We would expect no consistent difference in glycolipid levels between the Drug A and
Drug B treatments. Glycolipid levels are unlikely to remain exactly the same over time, but
there shouldn’t be any pattern to these changes with respect to the drug treatment: some
patients will show increases, some decreases, and some no change at all. The mean of the
differences in this case should be somewhere around zero (though sampling variation will
ensure it isn’t exactly equal to zero).
So, to carry out a t-test on paired-sample data we have to: 1) find the mean of the differences
for all the pairs and 2) evaluate whether this is significantly different from zero. We already
know how to do this! This is just an application of the one-sample t-test, where the expected
value (i.e. the null hypothesis) is 0. The thing to realise here is that although we started out
with two sets of values, what matters is the sample of differences between pairs, and the
population we’re interested in is a ‘population of differences’.
When used to analyse paired data in this way, the test is referred to as a paired-sample t-test.
This is not wrong, but it is important to remember that a paired-sample t-test is just a one-sample t-test applied to the sample of differences between pairs of associated observations.
A paired-sample t-test isn’t really a new kind of test. Instead, it is a one-sample t-test applied
to a new kind of situation.
Assumptions of the paired-sample t-test
The assumptions of a paired-sample t-test are no different from the one-sample t-test. After
all, they boil down to the same test! We just have to be aware of the target sample. The key
point to keep in mind is that it is the sample of differences that is important, not the original
data. There is no requirement for the original data to be drawn from a normal distribution
because the normality assumption applies to the differences. This is very useful, because even
where the original data seem to be drawn from a non-normal distribution, the differences
between pairs can often be acceptably normal. The differences do need to be measured on
an interval or ratio scale, but this is guaranteed if the original data are on one of these scales.
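One optional, quick numerical check of normality is the Shapiro-Wilk test (shapiro.test in base R), applied here to the example differences quoted earlier. This is a sketch, not part of the original workshop, and with only 8 values the test has very little power, so treat it only as a rough guide alongside a histogram:

```r
# Shapiro-Wilk normality check on the *differences* (not the raw data);
# diffs copies the example differences quoted earlier in the text
diffs <- c(-3.9, -4.3, 2.5, -4.5, 0.5, -3.9, -7.1, -2.6)
shapiro.test(diffs)  # a small p-value would suggest non-normal differences
```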
Carrying out a paired-sample t-test in R
R offers the option of a paired-sample t-test to save us the effort of calculating differences. It
calculates the differences between pairs for us and then carries out a one-sample test on
those differences. We’ll look at how to do it the ‘old fashioned’ way first—calculating the
differences ourselves and running a one-sample test—before using the short-cut method
provided by R.
Staying with the problem of trials of two drugs for controlling glycolipid levels, the serum
glycolipid concentration data from the trial illustrated above are stored in the
GLYCOLIPID.CSV file.
Download this file and place it in the working directory. Read GLYCOLIPID.csv into an R data
frame, giving it the name glycolipid.
As always, we should start by looking at the raw data by using:
summary(glycolipid)
There are several variables in this data set: Patient indexes the patient identity, Sex is the sex
of the patient (we don’t need this), Drug denotes the drug treatment, and Glycolipid is the
glycolipid level. The wide-format columns Glycolipid_A and Glycolipid_B hold the glycolipid
levels in response to Drug A and Drug B, respectively, for patients 1-8.
Next, we need to calculate the differences between each pair. We can do this by creating a
new variable, as you did in the data visualisation workshop.
Create a new variable called glycolipid_diffs in the glycolipid data frame by subtracting one
group from the other using the glycolipid_A and glycolipid_B columns.
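One possible way to do this is sketched below with a made-up stand-in data frame: the Drug A values are placeholders and the lowercase column names are assumptions (check the names in your own data), but the differences match those quoted earlier in the text:

```r
# Hypothetical stand-in for the glycolipid data frame: the Drug A values are
# made-up placeholders, but the differences match those quoted earlier
diffs <- c(-3.9, -4.3, 2.5, -4.5, 0.5, -3.9, -7.1, -2.6)
glycolipid <- data.frame(
  patient      = 1:8,
  glycolipid_A = c(15, 14, 11, 16, 12, 15, 19, 13),  # placeholder values
  glycolipid_B = c(15, 14, 11, 16, 12, 15, 19, 13) + diffs
)

# The step asked for above: a new column of within-patient differences
glycolipid$glycolipid_diffs <- glycolipid$glycolipid_B - glycolipid$glycolipid_A
```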
What you did was calculate the difference between the two glycolipid concentrations within
each patient. We stored the result of this calculation in a new variable
called glycolipid_diffs inside the glycolipid data frame. These differences are the data we’ll
use to carry out the paired-sample t-test.
We should try to check that the differences could plausibly have been drawn from a normal
distribution, though normality is quite hard to assess with only 8 observations. Create a
histogram of the differences data.
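A minimal sketch of such a histogram, using a stand-in vector in place of the glycolipid_diffs column (the exact column name in your data frame may differ):

```r
# Stand-in for glycolipid$glycolipid_diffs, copying the differences quoted earlier
diffs <- c(-3.9, -4.3, 2.5, -4.5, 0.5, -3.9, -7.1, -2.6)

hist(diffs,
     main = "Within-patient differences in glycolipid level",
     xlab = "Difference (Drug B - Drug A)")
```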
The data seem roughly normal, so let’s carry out a one-sample t-test on the calculated
differences. This test is easy to do in R:
t.test(glycolipid$glycolipid_diffs)
We don’t have to set the data argument to carry out a one-sample t-test on the differences.
We just passed along the numeric vector of differences extracted from glycolipid (using
the $ operator). What happened to the mu argument used to set up the null hypothesis?
Remember, the null hypothesis is that the population mean is zero. R assumes mu is 0 if
we don’t supply it, so there is no need to set it here.
The output is quite familiar… The first line reminds us what kind of test we did, and the second
line reminds us what data we used to carry out the test. The third line is the important one: t
= -2.6209, df = 7, p-value = 0.03436. This gives the t-statistic, the degrees of freedom, and
the all-important p-value associated with the test. The p-value tells us that the mean within-individual difference is significant at the p < 0.05 level.
We need to express these results in a clear sentence incorporating the relevant statistical
information:
Individual patients had significantly lower serum glycolipid concentrations when treated with
Drug B than when treated with Drug A (t = 2.62, d.f. = 7, p < 0.05).
There are a couple of things to point out in interpreting the result of such a test:
1. The sample of differences was used in the test, not the sample of paired observations.
This means the degrees of freedom for a paired-sample t-test are one less than the
number of differences (= number of pairs); not one, or two, less than the total number
of observations.
2. Since we have used a paired-sample design our conclusion stresses the fact that the
use of Drug B results in a lower glycolipid level in individual patients; it doesn’t say
that the use of Drug B resulted in lower glycolipid concentrations for everyone given
Drug B than for anyone given Drug A.
Using the paired = TRUE argument
R does have a built-in procedure for doing paired-sample t-tests. Now that we’ve done it the
hard way, let’s try carrying out the test using the built-in procedure. This looks very similar to
a two-sample t-test, except that we have to set the paired argument of the t.test function
to TRUE:
t.test(Glycolipid ~ Drug, data = glycolipid, paired = TRUE)
R takes care of the differencing for us, so now we can work with the original glycolipid data
rather than the glycolipid_diffs data frame constructed above. We won’t step through the
output because it should make sense by this point.
R makes it easy to do a paired-sample t-test. It really doesn’t matter which method we use to
carry out the test. Just don’t forget that a paired-sample t-test is only a one-sample test on
paired differences.
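To convince yourself the two approaches agree, here is a self-contained check using the example differences quoted earlier plus made-up Drug A values (labelled as such in the comments). The paired test and the one-sample test on the differences give identical results:

```r
# Made-up Drug A values; the B values are chosen so the within-patient
# differences match those quoted earlier in the text
diffs  <- c(-3.9, -4.3, 2.5, -4.5, 0.5, -3.9, -7.1, -2.6)
drug_A <- c(15, 14, 11, 16, 12, 15, 19, 13)
drug_B <- drug_A + diffs

paired_res  <- t.test(drug_B, drug_A, paired = TRUE)  # built-in paired test
onesamp_res <- t.test(drug_B - drug_A)                # one-sample test on differences

paired_res$statistic   # same t-statistic...
onesamp_res$statistic  # ...as this one
```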
Do it yourself!
The file spider.csv is on Canvas. The data here are from an experiment where researchers
removed a pedipalp from male spiders. In males these are enlarged and used in mating. The
researchers hypothesised that the pedipalp is a sexual handicap to the male spiders. They
measured the speed (in cm/s) of spiders before and after amputation.
Does amputation appear to have an effect on the speed of spiders?
Conduct the appropriate statistical test and provide a suitable figure.