Seeing triple (a short guide to experiment reproducibility)


With the reproducibility crisis in science showing no signs of abating, it’s never been more important to clearly communicate how rigorously your data were obtained. Here’s TIR’s short guide to technical replicates, biological replicates, independent experiments, and what they do and don’t tell you. 

When you’re doing experiments and, later on, preparing to publish them, it’s essential that you are clear – both to yourself and your readers – about how reproducible your conclusions are. However, there are many sources of variation in data, and demonstrating reproducibility requires time and care.

1. Technical replicates
Technical replicates are multiple identical experimental setups run in parallel, or multiple readings taken from a single experiment – enzyme assays and growth curves are good examples. At a chosen timepoint, you take measurements from each of those replicates and calculate the mean value. The more technical replicates you have, the more precise your estimate of the mean becomes. In other words, technical replicates control for the experimental variability for a given biological sample on a given day. However, no matter how many technical replicates you have – whether it’s 5 or 500 – your overall sample size is still 1 (n=1). All those measurements are averaged into a single mean value, and that’s your result from that one experiment. The number of technical replicates you use will usually be determined by how well-behaved the setup is and by simple convenience; calculating cumulative mean* values across technical replicates and seeing how quickly they converge is a good way of getting a sense of how much variation there is in your data (there’s a short sketch of this in the footnote below).

2. Biological replicates
Of course, what technical replicates can’t control for is how representative your sample is. There will always be biological variation between different samples, whether they’re mice, cell lines, or preparations of purified protein, and it’s likely that this will be the greatest source of variation in your data. You can do as many technical replicates as you like, but if your sample is abnormal then your mean value is not going to be representative of the population as a whole. Biological replicates address this. By taking measurements from different biological samples (different mice, different cell lines, different protein preparations), you get a sense of how reproducible your data are against the background of intrinsic biological variation.

3. Independent experiments
However, what biological replicates don’t necessarily control for is human error. You can set up multiple technical replicates for multiple biological samples, but if you forget to add the enzyme/drug, or mix up your tubes, or do any of the hundreds upon hundreds of things that can spoil an experiment without your realising, then your data won’t be right. No amount of technical or biological replicates can protect you from human error on the day. Human error is an unavoidable feature of lab work, and the best way to control for it is to do independent experiments. Only by carrying out multiple independent experiments on different days can you be confident that an effect, or lack of it, is likely to be genuine.

Blurring the lines
Ideally, you should quantify multiple technical replicates per experiment (with the exact number estimated by calculating cumulative means), from multiple biological replicates, on different days (independent experiments), in order to be as confident as possible in your data. However, there’s a trade-off between rigour and rate of progress. Obviously you want your data to be as ironclad as possible, but at the same time it’s worth asking how much time to invest in obtaining what might be a trivial conclusion.

In some circumstances, blurring the lines between biological replicates and independent experiments is an acceptable shortcut – i.e. assaying your biological replicates on separate days in order to kill two birds with one stone. Whether you take this shortcut or not (and remember, it is a shortcut, and a questionable one), it’s still good practice to define what you mean by “technical replicate”, “biological replicate” and “independent experiment” in your Materials & Methods section so that your readers are absolutely clear on what you did. If the data obtained in this way are highly variable, though, there’s no way of telling whether it’s human error, biological variability, or both – so be prepared to backtrack and do things rigorously if that’s the case.

In cell biology
In whole organism (e.g. mouse) work, the difference between technical replicates and biological replicates is pretty obvious – technical replicates are readings derived from a single mouse, biological replicates are readings derived from different mice. In cell biology it can be a bit harder to define things. For instance, if you’re doing an immunofluorescence experiment, what constitutes a technical replicate? Is it identical coverslips prepared in parallel, or are the hundreds or thousands of cells on a single coverslip all technical replicates, given that they’ve been exposed to the same labelling conditions but may respond slightly differently? If you’re dealing with a transgenic cell line, then biological replicates are different clones – but those clones could have been obtained from a single transfection or from several separate transfections (that trade-off between rigour and speed again).

As per the preceding section, the best thing is to state explicitly in your Materials & Methods section (in the statistics part) what definitions you are using. And don’t forget to provide detail on biological replicates and independent experiments in your figure legends!

Counting the ways
How many independent experiments should you do? As many as possible really, but three is an achievable minimum for cell biology work. The number will generally be determined along empirical lines, with certain assays/techniques being more amenable to higher sample sizes than others.

Statistical significance (an update for 2019!)
Statistical significance tests are outside TIR’s area of expertise, but luckily there is a truly excellent technical perspective published by MBoC in May 2019 that is explicitly targeted at cell biologists – you can find it (free download) HERE. Figure 2 of that article is a flow chart designed to help you find the right statistical test for the experiment you’ve performed.

Two key observations/recommendations from it: (1) proportions and percentages are categorical responses and therefore not numerical data, despite appearances. As such, not all significance tests – and in particular, not the t-test – are applicable; (2) standard deviation, and not standard error of the mean, should be used for error bars in charts.
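
To illustrate point (2), here’s a minimal Python sketch (the values are invented for illustration): the SD describes the spread of the data and stays roughly constant as you add replicates, whereas the SEM shrinks with the square root of n – which is why it can make error bars look deceptively small.

```python
import numpy as np

# Invented readings from six biological replicates (arbitrary units)
values = np.array([4.1, 5.3, 4.8, 6.0, 5.1, 4.6])

n = len(values)
sd = values.std(ddof=1)   # sample standard deviation: spread of the data
sem = sd / np.sqrt(n)     # standard error of the mean: shrinks as n grows

print(f"mean = {values.mean():.2f}, SD = {sd:.2f}, SEM = {sem:.2f}")
# For error bars, show SD (or a confidence interval) so readers see the real variability.
```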

As always, corrections and clarifications from readers are very welcome…

 

* Cumulative mean: calculating the cumulative mean is very simple. You take your first measurement, A. Then you take your second measurement, B. The mean value for your experiment is now (A+B)/2. With your third measurement (C) the mean value becomes (A+B+C)/3. And so on. These iterative calculations of the mean are the cumulative mean.

By plotting the cumulative mean values (A, (A+B)/2, (A+B+C)/3 etc) versus total number of observations (1, 2, 3) on a graph, you will see how your mean value for the experiment gradually converges towards a particular value. This value is your mean from the experiment.

A good rule of thumb is that your sample size for the experiment in question – i.e. the number of technical replicates – should be double the total number of observations required for the mean value to converge on a set point. Under circumstances of high experimental reproducibility, that means that your total number of technical replicates will be relatively low; however if there’s more variation in your setup then the mean will take longer to converge and you should be prepared to take a higher number of measurements. If no convergence is readily seen, this suggests that the level of variability is very high – possibly indicative of low-quality data.
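
As a rough sketch of how you might put the cumulative mean and this rule of thumb into code (the 2% convergence tolerance here is an arbitrary choice of mine, not part of the rule):

```python
import numpy as np

def cumulative_means(measurements):
    """Running mean after each successive measurement: A, (A+B)/2, (A+B+C)/3, ..."""
    m = np.asarray(measurements, dtype=float)
    return np.cumsum(m) / np.arange(1, len(m) + 1)

def suggested_replicates(measurements, tol=0.02):
    """Observation at which the cumulative mean stops shifting by more than
    `tol` (as a fraction of its value), doubled per the rule of thumb.
    Returns None if no convergence is seen (i.e. variability is high)."""
    cm = cumulative_means(measurements)
    for i in range(1, len(cm)):
        if abs(cm[i] - cm[i - 1]) <= tol * abs(cm[i - 1]):
            return 2 * (i + 1)
    return None

# Invented enzyme-assay readings (arbitrary units)
readings = [10.2, 9.8, 10.5, 10.1, 10.0, 10.1]
print(cumulative_means(readings).round(3))  # watch the running mean converge
print(suggested_replicates(readings))       # suggested number of technical replicates
```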

22 thoughts on “Seeing triple (a short guide to experiment reproducibility)”

  1. Thanks for this article. We all should be doing more to limit irreproducibility. I would like to include a short comment about your guide in Biofisica-Magazine (the news outlet of the Spanish Biophysics Society). Would that be OK with you? And can I reuse the figure from your post as the header of the comment in Biofisica-Magazine?


  2. This is super helpful… I was wondering, what happens in (bio)chemistry experiments? For instance, I work with enzymes that we produce ourselves, and we know that they are not very reproducible… every batch shows high variability in activity because the enzyme is very sensitive to many variables (purification, lyophilisation, storage, time, etc.). So we usually produce a lot of the enzyme and use that single batch for all of our experiments… but sometimes it runs out before we have finished the experiments, requiring more enzyme to be produced. In that case, when I’m measuring the activity of the enzyme, can I say that my n = 3 if I do 3 “independent experiments” with the same enzyme batch (meaning that I do each experiment on a different day and, in that case, the enzyme is considered as just another chemical reactant of the experiment – like pyruvate, etc.)? Thank you.


    1. Thanks, delighted that you found it helpful! And your question is a very good one, and very relevant. I was having a similar discussion not long ago concerning a single-particle cryo-EM study where the question was whether the biological replicates were the batches (purifications) or individual molecules used for averaging.

      The point about doing replicates – whether technical or biological – is to minimise error and account for variability. In most cases, biological replicates will be the source of the largest variability. Given that you have already identified the different purifications/batches of your enzyme as the source of the highest variability in your assays, the separate purifications are your biological replicates.

      What you propose doing is totally correct. If you do 3 separate experiments on different days, then your measured activity derives from 3 independent experiments and your n=3. That measured activity relates only to a single purification however, and so the question is: how representative is that purification of the enzyme’s real activity? To get a better sense of that, you will probably be better off obtaining an average from multiple purifications, where each purification is a separate biological replicate.

      I realise of course that it may not be possible to take such an approach. If that is the case, I would recommend being as transparent as possible in your description of the data and in your Materials & Methods: state that you did 3 independent experiments using only a single batch of enzyme, but that batch-to-batch variability does occur, and so the measured values should be treated with a degree of caution as they may not be truly representative. If possible, though, I would definitely recommend calculating activity using multiple batches – that way you have best accounted for the variability in your setup. Sound reasonable?


      1. Dear Brooke. Thanks for your prompt reply. Indeed, I’d love to be able to do a triplicate of the desired enzyme, starting from the production of the enzyme itself (e.g. using 3 different culture plates) and also keeping the 3 batches separate during purification. However, the limiting issues here are time (it takes about 2-3 weeks to get some 500-600 mg of the lyophilised enzyme when we are lucky) and money (lots of expensive resources go into enzyme production). So I think my best compromise will be to use the single batch and perform the triplicate independent experiments on different days whenever possible. Maybe if I need more enzyme then I’ll have to produce more, and could then compare the values of both batches, which is a good thing for my statistics but also means loads of extra work.


  3. Sure, that’s what I meant by “it may not be possible…”. It’s not always possible to have multiple biological replicates – in cell biology, when using a wild-type strain, it’s generally not possible to compare results with other independent wild isolates, for instance. In your case, just be as clear as possible in your reporting in terms of batches, independent experiments etc., and then it will be clear to readers that you are being transparent and have identified possible sources of variability in your data. Good luck! B.


    1. Dear Brooke

      I forgot to ask something: let’s say I have 3 batches of the same enzyme, produced in the same way but in different months. If I calculate their activity, would I still be able to consider each as an independent/biological replicate (n=3), or does the time factor not allow that?

      Looking forward to your opinion.

      Thanks,
      Edu


      1. Hi, sorry for the slow reply. I think that different batches can be considered separate biological replicates, even given the time difference. The question is whether you think the activity has been affected by the different storage times. As always, clarity about what you’ve done is the most important thing.


  4. Hi Brooke,

    A very informative article indeed. Thank you very much.

    I would like to bring a potential error to your kind attention.

    You have referred to an MBoC publication in the section ‘Statistical significance’. There you have stated, “standard deviation, and not the standard error of the mean, should be used for error bars in charts”, while in reality the publication advocates using confidence intervals over SE (“Using SEM reduces the size of error bars on graphs but obscures the variability. Using confidence intervals (see Box 2) is preferred to using SEM.”).

    While the publication puts forth some valid arguments against using SE for error bars, this article (http://jcb.rupress.org/content/jcb/177/1/7.full.pdf) explains how one could still use SE effectively to estimate the p-value.

    Kindly share your opinion on the above mentioned.

    Cheers!

    Anupam


    1. Hi Anupam,

      Thanks for taking the time to write – it’s always good (and helpful!) to get feedback.

      You are absolutely right that the MBoC article recommends using CI instead of SEM, and that the Vaux & co article from JCB (one of my favourites) provides good advice on using the SEM appropriately.

      The reasons for making my original recommendation, and why I would still stick with it, are as follows:

      1. Many scientists (and I include myself in this number) are not statistically literate and are more likely to misuse statistical tools or misunderstand what they show. CIs are much less straightforward to calculate than SD or SEM, so it is safest to advocate for a simple measure (i.e. SD). P values are more likely to be misused than appropriately employed, in my experience.

      2. The way that most biologists use error bars is consistent with SD (showing variability in the data) rather than SEM (showing accuracy of the estimation of the mean). The SD (a descriptive statistic) is therefore probably closer to what many scientists are trying to illustrate than inferential statistics like SEM and CI. In practice, SEM is more often used to minimise the perceived variability of the data than for its real statistical meaning.

      3. In addition (and as noted in the Vaux paper), a lot of biological experiments are done with very small sample sizes – often as few as 3, and rarely more than 10. For this reason, I think the most transparent depiction of such data is a dot plot (so you can see all the data), together with a median and SD.
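
      For what it’s worth, here’s a minimal matplotlib sketch of the kind of plot I mean (the values are invented for illustration):

      ```python
      import numpy as np
      import matplotlib.pyplot as plt

      # Invented measurements from a handful of independent experiments (arbitrary units)
      data = np.array([3.1, 4.0, 3.6, 3.3, 4.2, 3.8])

      plt.scatter(np.ones_like(data), data, color="grey", zorder=2)  # show every point
      plt.hlines(np.median(data), 0.9, 1.1, color="black")           # median bar
      plt.errorbar(1.0, data.mean(), yerr=data.std(ddof=1),          # SD error bar
                   fmt="none", ecolor="black", capsize=4)            # (SD is defined around the mean)
      plt.xticks([])
      plt.ylabel("activity (a.u.)")
      plt.show()
      ```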

      Does that sound reasonable?

      Cheers,
      Brooke


      1. Hi Brooke,

        Thank you for your elaborate response and sorry for the delay in replying.

        You are really being modest when you count yourself among the statistically illiterate 🙂 But yes, I agree that there is a dearth of statistical awareness among biologists in general.

        I agree with your point about the straightforwardness of calculating and interpreting the SD.

        However, doesn’t SEM also represent ‘variability’ – that of the sampling distribution of the sample mean? In that sense, both SD and SEM are similar, I feel. Moreover, I would rather say that SEM represents the ‘precision’ of the estimate of the mean instead of its ‘accuracy’. What are your thoughts?

        I second your other statements.

        Happy learning from this interaction 🙂

        Anupam


      2. Hi Anupam,
        Sorry for the slow reply.
        Soooo, much of the below is paraphrased from “Naked Statistics” by Charles Wheelan, which I can highly recommend if you’re not already familiar with it.

        – Any given experiment (with multiple technical and biological replicates) will give us a sample mean, which is an approximation of the true population mean.
        – If we do multiple experiments, each with multiple technical and biological replicates, we will end up with a set of sample means.
        – The sample means will themselves form a normal distribution around the true population mean.
        – Crucially, the underlying data do not have to be normally distributed for this to apply.
        – The more sample means we collect, the closer their distribution will be to normal; and the larger each sample, the tighter that distribution will be (because each sample mean has been estimated more precisely).
        – The power of a normal distribution is that we know what proportion of the observations will lie within 1 SD above or below the mean (68%), what proportion will lie within 2 SD of the mean (95%), and so on.
        – The standard error is the term that describes the dispersal of these sample means.
        – I.e. the standard deviation measures the dispersion in the sample; the standard error measures the dispersion of the sample means. The standard error is the standard deviation of the sample means (there’s a quick simulation of this after the list).

        – It’s therefore easy to see why the standard error is a richer readout on your data, because it provides an estimate of the variation in the sample means, and therefore an indication of how close you might be to the true population mean.

        – The thing to ask, though, is this: how many people use the standard error because it provides a measure of the dispersal of the sample means, and how many use it just because it gives a lower value than the standard deviation? In the biological sciences, I’d wager the latter is often more likely. In fact, I’d bet that a lot of the time the standard error is used, it’s (erroneously) not even being applied to means.

        – In the biological sciences, people with only a basic grasp of statistics (like myself) want to show where the centre of their data – obtained using multiple technical replicates, multiple biological replicates, and across multiple independent experiments – is, and how much variation there is around that point. Median and SD describe those things.
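
        To make that concrete, here’s a quick numpy simulation of the bullet points above (all numbers invented): draw many samples from a decidedly non-normal distribution and compare the spread of the sample means with the SEM calculated from a single sample.

        ```python
        import numpy as np

        rng = np.random.default_rng(0)

        n = 10                 # observations per "experiment"
        n_experiments = 5000   # number of repeated experiments

        # Underlying data are exponential, i.e. strongly non-normal
        samples = rng.exponential(scale=2.0, size=(n_experiments, n))
        sample_means = samples.mean(axis=1)

        # The sample means cluster approximately normally around the true mean (2.0)...
        print("mean of sample means:", round(sample_means.mean(), 3))

        # ...and their spread (SD of the sample means) matches the SEM from a single sample
        print("SD of sample means:  ", round(sample_means.std(ddof=1), 3))
        print("SEM from one sample: ", round(samples[0].std(ddof=1) / np.sqrt(n), 3))
        ```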

        Does that sound reasonable?

