2  When Should I Generate Data?

After you’ve got a plan for how you are going to analyse it.

“Data is like garbage. You’d better know what you are going to do with it before you collect it.”

- Mark Twain (attribution questionable)

If you have a specific, narrowly and well-defined hypothesis to test, and you will analyse the data yourself, having a thorough plan before you generate your data may be relatively straightforward. However, it is easy to accidentally gloss over some detail in an inexactly formulated analysis plan. I always advise that you produce some dummy data in the format that your experiment will generate, and use this to work through the steps of your analysis in its entirety. This will make it easier to spot any ambiguities or confusions in the planned analysis. In Chapter 4 (working with data) we discuss some approaches you can take to make this process easier.
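As a concrete illustration, here is a minimal sketch (in Python, assuming numpy and pandas are available) of producing dummy data in the shape a simple two-condition experiment might generate. All of the column names, group sizes, and effect sizes are placeholder assumptions, not values from any real experiment.

```python
# A minimal sketch of dummy data in the shape your experiment will produce.
# Column names, group sizes, and effect sizes are illustrative placeholders.
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)

n_per_group = 6  # hypothetical number of replicates per condition
dummy = pd.DataFrame({
    "sample_id": [f"S{i:02d}" for i in range(2 * n_per_group)],
    "condition": ["control"] * n_per_group + ["treated"] * n_per_group,
    "batch": ["A", "B"] * n_per_group,  # spread batches across conditions
    # a continuous readout with an assumed modest treatment effect
    "measurement": np.concatenate([
        rng.normal(loc=10.0, scale=1.5, size=n_per_group),  # control
        rng.normal(loc=11.0, scale=1.5, size=n_per_group),  # treated
    ]),
})

dummy.to_csv("dummy_experiment.csv", index=False)
print(dummy.head())
```

Working a table like this through every step of your planned analysis is usually enough to expose the ambiguities that a purely verbal plan hides.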

Test your analysis with dummy data that supports your hypothesis as well as dummy data that goes against it. If you play around a bit with your simulated data, this will give you a qualitative impression of the statistical power of your experiment. How strongly would your experimental data need to match your expectations for you to be able to see the results clearly in the statistics? Does this match up with the effect sizes that you saw in any preliminary data? Do you have all the controls/calibrations that you need, and are you making best use of them in your analysis? What can you tweak in your design to make it robust if your full dataset turns out noisier than your test data? In short, is your design actually up to the task of answering your question?
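One rough way to get that qualitative impression is power by simulation: repeatedly generate data under an effect size you believe is plausible and count how often a simple test detects it. A minimal sketch, assuming scipy is available; the effect size, variability, and alpha below are placeholders you should replace with values from your own preliminary data.

```python
# A rough power-by-simulation sketch: how often would a simple t-test
# detect an assumed effect size at an assumed sample size?
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)

def simulated_power(effect=1.0, sd=1.5, n=6, alpha=0.05, n_sim=5000):
    hits = 0
    for _ in range(n_sim):
        control = rng.normal(0.0, sd, n)
        treated = rng.normal(effect, sd, n)
        if stats.ttest_ind(control, treated).pvalue < alpha:
            hits += 1
    return hits / n_sim

for n in (3, 6, 12):
    print(f"n = {n:2d} per group -> approx power {simulated_power(n=n):.2f}")
```

Playing with the effect, sd, and n arguments is exactly the "how strongly would my data need to match my expectations" exercise described above.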

“The first principle is that you must not fool yourself and you are the easiest person to fool.”

- Richard P. Feynman

At this stage you should seek feedback on your plans from your colleagues; indeed, as we shall discuss in Chapter 5 (when to publish data), this might be a good time to think about publishing a registered report detailing your hypothesis(es), planned experimental protocols & analysis(es). Getting an outside perspective on your methods and design can be very helpful and lead to major improvements in your experiment. You may also get some bad takes. Even if some or all of the feedback proves of little value, merely going through the process of putting your work in a form that others can give feedback on at this stage is remarkably effective at helping you to catch errors in your own formulation of your ideas.

The advantage of getting someone external to look at your planned project, and not just your supervisor or someone else close to it, is that you get a ‘red team’: a concept popular in defence, cybersecurity, and other areas where the stakes for your models mapping well onto reality are high. Whilst a culture/system of red-teaming can be cultivated internally, people often feel a bit awkward about being brutally honest about a colleague’s work, an affliction often mitigated by the simple expedient of not being likely to bump into the other person at lunch the next day. The red team concept has been tried in academic contexts but has yet to see wide adoption.

Importantly, this planning should take place before you have committed experimental resources to testing your ideas. It is the responsibility of an ethical researcher to ensure that, when significant resources are going to be expended on an experiment, its design is sound and as close to optimal as is practically achievable. This is especially true in the context of scarce and valuable experimental resources like donated human tissues.

Why seek feedback at this time? At the grant stage, plans are often insufficiently concrete to benefit from this kind of object-level critique, and at the publication stage it is too late to change anything really important: you’ve already committed the resources. Grant scrutiny is for funders and publication scrutiny is for journals, so that they can protect their respective reputations if they attach their names to your work. Red-teaming and registered reports (Section 5.2) are for the researcher: they let you make best use of your own resources and demonstrate the integrity of your process.

2.1 When to speak to data analysts

“Further, science is a collaborative effort.”

- John Bardeen

If you are going to be collaborating with a bioinformatician/statistician or other analyst on the analysis of your data, you should speak to them before you pick up a pipette to generate your main dataset.¹

Bioinformaticians and statisticians frequently receive data to analyse with problems in the experimental design that cannot be properly addressed at the analysis phase. An example common in my experience is that of technical or batch effects perfectly confounded with the biological variables of interest to the experimenter. This is often readily resolved by using variants on a split-plot design or appropriate blocking. This can sometimes lead to a more logistically challenging experiment at the bench, but it often means that you need fewer repeats to robustly observe smaller effects. More importantly, fixing confounding issues can allow you to properly separate technical from biological sources of variation, something which cannot be fully addressed by attempting to correct for it statistically. It is important to spend the time thinking about these trade-offs during the design process.
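To make the confounding point concrete, here is a purely schematic sketch contrasting an allocation in which condition and processing batch vary together with one in which each batch contains both conditions. The sample numbers and labels are hypothetical.

```python
# A schematic contrast between a confounded allocation (all of one
# condition processed in one batch) and a blocked allocation (each batch
# contains both conditions). Sample counts and labels are hypothetical.
import itertools

samples = [f"S{i:02d}" for i in range(8)]

# Confounded: condition and batch vary together, so batch effects and
# treatment effects cannot be separated at analysis time.
confounded = [
    {"sample": s, "condition": c, "batch": b}
    for s, (c, b) in zip(samples, [("control", "day1")] * 4 + [("treated", "day2")] * 4)
]

# Blocked: each processing day contains both conditions, so the batch
# effect can be estimated independently of the treatment effect.
conditions = itertools.cycle(["control", "treated"])
blocked = [
    {"sample": s, "condition": next(conditions), "batch": f"day{(i // 4) + 1}"}
    for i, s in enumerate(samples)
]

for row in blocked:
    print(row)
```

The blocked allocation may be more awkward logistically, but it is the version from which technical and biological variation can actually be disentangled.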

A common symptom of this sort of design failure is over-reliance on batch and other effect corrections. It is now commonplace to see batch correction used in various ‘omics’ analyses essentially by design; this is almost always a bad idea. You should not assume that you can successfully correct for something in your analysis, you should especially not assume that you can correct for multiple things in your analysis, and you should really not assume that you can correct for multiple things when you have a small number of samples and are running a large number of tests. You are very unlikely to be able to perform a robust statistical analysis of your data if you have made these assumptions.

These correction methods are tools that should ideally only show up in observational studies or meta-studies combining results from multiple experiments. They should not be part of the design of a planned experiment unless there is no other recourse, and they should not be used in lieu of proper experimental design.

2.2 Exploratory analyses are not exempt from proper planning

“if it’s worth doing, it’s worth doing well”

- proverb

It is one thing to generate a dataset without knowing what it is going to show you; this is to be expected. It is quite another to generate a dataset without even knowing if it will be able to answer any of the questions that you are interested in. Calibration is also a valid reason to generate a dataset: what can we see, and how reliably, with a given method in a given system? Calibrating your method, performing exploratory analysis, and testing the resulting hypotheses are too often conflated into a single step. You don’t want to end up ‘betting the farm’ on one set of experimental parameters that you hope will be able to answer your question(s) without a high degree of confidence that you will see the effect(s) you are interested in. Large ‘calibration’ experiments can be a public service to a research community, as subsequent users of these methods are saved the bulk of the parameter-tuning work and can use simple controls to check that their implementation lines up with the reference. Such datasets can be expensive and a lot of work, but they often garner a lot of citations and good will.

People attempt, often due to resource limitations, to answer a biological question with poorly calibrated methods, with baseline variability that is insufficiently characterized, and, perhaps the most common mistake of all, with insufficient statistical power to properly test the question the experimenter wants to answer.

Exploratory and descriptive analyses are perfectly reasonable undertakings as hypothesis-generating exercises. It can be helpful to have at least one concrete, well-defined hypothesis to test with the data you plan to generate. Because of the nature of null hypothesis significance testing, an important requirement for correctly interpretable statistical results is to draw a clean separation between planned analyses testing specific hypotheses and exploratory analyses; we will return to this subject in Chapter 5 (when to publish data). It is important to have exacting clarity on what your study is setting out to achieve and what you can reasonably expect to measure with your experimental setup.

If, for example, you generate an RNA-seq dataset, vague goals like “I’m going to do differential expression analysis & functional enrichment analysis” are not a plan. Ask yourself:

  • What sort of effects do you want to be able to see?

  • Will the design you have planned have the power to see the sorts of effects you are interested in?

  • What FDR (False Discovery Rate) is acceptable for what you want to do with your results? (A crude simulation of these power and FDR questions is sketched after this list.)
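One way to start answering these questions is a crude simulation of the planned design. The sketch below assumes scipy and statsmodels are available for the t-tests and Benjamini-Hochberg correction, and every number in it (gene count, fraction of truly changed genes, effect size, sample size) is an assumption to be replaced with estimates from your own preliminary data.

```python
# A minimal sketch of asking "what would my planned design let me see?"
# in a many-test setting such as differential expression.
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(seed=3)

n_genes, frac_de, effect, sd, n = 2000, 0.05, 1.0, 1.0, 5
is_de = rng.random(n_genes) < frac_de  # which genes truly change

pvals = np.empty(n_genes)
for g in range(n_genes):
    shift = effect if is_de[g] else 0.0
    a = rng.normal(0.0, sd, n)      # control samples
    b = rng.normal(shift, sd, n)    # treated samples
    pvals[g] = stats.ttest_ind(a, b).pvalue

rejected, _, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
true_pos = np.sum(rejected & is_de)
false_pos = np.sum(rejected & ~is_de)
print(f"approx. power for true effects: {true_pos / max(is_de.sum(), 1):.2f}")
print(f"observed FDR among 'hits':      {false_pos / max(rejected.sum(), 1):.2f}")
```

If the simulated power is low even under effect sizes you consider optimistic, that is a strong hint the design is not up to the task.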

You might treat a large effect size and low p-value from an exploratory analysis that you did not plan as a sort of informal Bayesian evidence and modify your qualitative priors as a result. This can inform subsequent experiments, but the absolute value of a p-value from such an after-the-fact test is not meaningful; treating it as meaningful is HARKing (Hypothesizing After the Results are Known). This is not in and of itself a problem as an approach to thinking up new hypotheses, but misrepresenting a hypothesis arrived at after the fact as the one originally being tested is.

You can design for this by simulating data and running your analysis code before a single cell sees culture medium. Modern bioinformatic analysis and image-processing pipelines are long, complicated, and difficult to comprehend in their entirety. It is a very challenging, if not impossible, task to intuit what sort of changes in your input data may yield outputs with meaningful effect sizes and degrees of confidence.

One of the best ways to handle this is to design your analysis and feed it dummy/test data, be it simulated or from previous/preliminary experiments, and see what you can actually reliably detect in the output. This needn’t be the entire pipeline; in the case of sequencing results, for example, it can be a lot easier and more practical to start at the count matrix phase if you are looking for differences in expression/occupancy.
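For example, a minimal sketch of dummy input at the count-matrix stage might look like the following. The mean expression levels, dispersion, fold change, and sample numbers are all illustrative assumptions, and the resulting table is simply something you can feed to whatever downstream steps you actually plan to use.

```python
# A sketch of dummy input at the count-matrix stage: negative binomial
# counts with a handful of spiked-in fold changes. All parameters are
# illustrative assumptions.
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=4)

n_genes, n_ctrl, n_trt = 1000, 4, 4
base_mean = rng.lognormal(mean=4.0, sigma=1.0, size=n_genes)  # per-gene means
spiked = rng.choice(n_genes, size=50, replace=False)          # "true" DE genes
fold_change = np.ones(n_genes)
fold_change[spiked] = 2.0                                      # assumed 2-fold shift

def nb_counts(mean, dispersion=0.1):
    # numpy parameterizes the negative binomial by (n, p)
    n = 1.0 / dispersion
    p = n / (n + mean)
    return rng.negative_binomial(n, p)

counts = np.column_stack(
    [nb_counts(base_mean) for _ in range(n_ctrl)]
    + [nb_counts(base_mean * fold_change) for _ in range(n_trt)]
)
columns = [f"ctrl_{i}" for i in range(n_ctrl)] + [f"trt_{i}" for i in range(n_trt)]
count_matrix = pd.DataFrame(counts, columns=columns)
count_matrix.to_csv("dummy_counts.csv")  # feed this to your planned downstream analysis
```

Because you know which genes were spiked, you can check directly how many of them your planned downstream analysis actually recovers.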

This process is also very useful for being able to properly interpret your results. Play with the knobs on the black box of your complicated analysis and see if you can predict what happens to the dials at the other end; set the dials and see where the knobs had to be to produce that result. Does it pass high-level common-sense checks? Does it break, or perhaps more worryingly not break, when you set it to weird edge-case values, which often include 0, negative numbers, infinity, or missing values?
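A few crude probes of this sort can be scripted. The sketch below uses a hypothetical placeholder step, normalize_counts, to stand in for whichever part of your pipeline you want to stress, and simply reports whether weird inputs fail loudly or pass through silently.

```python
# Crude edge-case probes for a hypothetical pipeline step. The function
# normalize_counts is a stand-in for whichever step you want to stress;
# the point is that weird inputs should fail loudly, not silently.
import numpy as np

def normalize_counts(counts):
    # hypothetical placeholder: library-size normalisation
    totals = counts.sum(axis=0)
    if np.any(totals <= 0):
        raise ValueError("sample with zero or negative total counts")
    return counts / totals

edge_cases = {
    "all zeros": np.zeros((5, 3)),
    "negative values": np.array([[1.0, -2.0], [3.0, 4.0]]),
    "contains NaN": np.array([[1.0, np.nan], [3.0, 4.0]]),
    "huge values": np.full((5, 3), 1e300),
}

for name, case in edge_cases.items():
    try:
        result = normalize_counts(case)
        print(f"{name}: ran without error, output all finite: {np.isfinite(result).all()}")
    except Exception as err:
        print(f"{name}: raised {type(err).__name__}: {err}")
```

Note that the negative-value case runs "successfully" here; that kind of silent acceptance of nonsensical input is exactly what you want to discover before your real data arrives.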

This is important for the verifiability of your work (Hinsen 2018 [cito:citesAsAuthority] [cito:agreesWith]). If your work is reproducible but wrong, it’s nice that it’ll be a bit easier for someone to figure out why it’s wrong, but it will still be wrong. Putting the time into understanding and stressing your analysis process makes it easier to spot conceptual, logical, and technical errors in the final analysis. This is hard to do unless you have calibrated your intuitive expectations for how the system will behave when presented with a given input. Along the way you can develop the diagnostic tools, such as visualizations and summary statistics, that you need to spot sources of error.

When you have a pipeline that you are going to re-use, or that is of critical importance, it can be useful to construct more formal tests of it, borrowing concepts and sometimes tooling from integration and systems testing in software development. Good test coverage can be a valuable resource to help you catch mistakes when updating, extending, or modifying a pipeline. It’s easy to make inadvertent changes which have unexpected effects, especially when re-visiting old projects; tests and checklists can help you avoid this. This needn’t apply only to data analysis: similar principles hold in the wet lab, e.g. refreshing cell lines and reagents, re-validating mutants, etc.
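As a minimal illustration, such tests might look like the pytest-style sketch below, reusing the hypothetical normalize_counts step from the earlier sketch. In practice you would import the real function from your own pipeline module; it is repeated here only to keep the example self-contained, and the expected values are of course assumptions.

```python
# A minimal pytest-style sketch of formal tests for a pipeline step.
# Save as e.g. test_pipeline.py and run with `pytest`.
import numpy as np
import pytest

def normalize_counts(counts):
    # hypothetical step under test; normally imported from your pipeline
    totals = counts.sum(axis=0)
    if np.any(totals <= 0):
        raise ValueError("sample with zero or negative total counts")
    return counts / totals

def test_columns_sum_to_one_after_normalisation():
    counts = np.array([[10.0, 0.0], [30.0, 5.0]])
    assert np.allclose(normalize_counts(counts).sum(axis=0), 1.0)

def test_zero_total_sample_is_rejected():
    counts = np.array([[0.0, 1.0], [0.0, 2.0]])
    with pytest.raises(ValueError):
        normalize_counts(counts)

def test_known_input_gives_known_output():
    # a frozen "golden" case to catch unintended changes when the
    # pipeline is modified later
    assert np.allclose(normalize_counts(np.array([[2.0], [8.0]])),
                       np.array([[0.2], [0.8]]))
```

A small suite of "golden" cases like the last test is often the cheapest way to notice that a pipeline update has quietly changed your results.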

Hinsen, Konrad. 2018. “Verifiability in Computer-Aided Research: The Role of Digital Scientific Notations at the Human-Computer Interface.” PeerJ Computer Science 4 (July): e158. https://doi.org/10.7717/peerj-cs.158.

  1. You may find that your bioinformaticians and other analysts are surprised and even confused to be consulted at so early a stage in the process. This is likely because they are so used to being presented with the data as a fait accompli that they are not sure how to react when consulted at the proper time. You may need to be patient and encouraging with them in order to convince them that you are actually interested in their thoughts on your experimental design.↩︎