See also
OTU / denoising analysis pipeline
I recommend pooling samples, i.e. combining all reads for all samples, for generating OTUs and making an OTU table. This applies both to UPARSE (97% OTU clustering) and UNOISE (error-correction to obtain "ZOTUs", i.e. all biological sequences in the reads). The only exception is cross-talk, as explained next.
Improved amplicon abundance estimation and singleton detection
The most important motivation for pooling is that it enhances the abundance signal for correct sequences.
If samples are pooled, then a sequence that appears as a singleton in one sample may also appear in another sample and will therefore be retained and included in the OTU table. If singletons are discarded after pooling (as usually recommended in order to reduce spurious OTUs), then more low-abundance species will be retained compared with discarding singletons for each sample separately.
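The effect on singletons can be illustrated with a minimal sketch. The sample names and sequences below are made-up toy data, and `non_singletons` is a hypothetical helper rather than part of any tool; the point is only to compare discarding singletons per sample with discarding them after pooling.

```python
from collections import Counter

# Hypothetical toy data: reads per sample, barcodes already stripped.
samples = {
    "S1": ["ACGT", "ACGT", "GGCC", "TTAA"],
    "S2": ["ACGT", "GGCC", "CCGG"],
    "S3": ["ACGT", "ACGT", "TTAA"],
}

def non_singletons(reads):
    """Return the unique sequences that occur in at least two reads."""
    counts = Counter(reads)
    return {seq for seq, n in counts.items() if n >= 2}

# Strategy 1: discard singletons within each sample, then merge the survivors.
kept_per_sample = set()
for reads in samples.values():
    kept_per_sample |= non_singletons(reads)

# Strategy 2: pool all reads first, then discard singletons once.
pooled = [seq for reads in samples.values() for seq in reads]
kept_pooled = non_singletons(pooled)

print(sorted(kept_per_sample))  # only the sequence abundant within a sample
print(sorted(kept_pooled))      # also keeps sequences seen once in several samples
```

In this toy data, GGCC and TTAA are singletons in every sample where they occur, yet each is seen twice in the pooled reads, so pooling retains them while per-sample filtering discards them.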
Cross-talk detection
To detect cross-talk manually or by using the UNCROSS algorithm, it is important to include all reads from one sequencer run, even if the samples are from different environments. Each sequencing run should be analysed separately. If one run contains unrelated samples, e.g. if your sequencing center did one run with samples from several users, then you should get the reads for the other samples for cross-talk analysis but not for other analyses, as explained below.
OTU clustering and denoising
For OTU clustering and denoising, it is usually better to include reads from all related samples, e.g. all samples from a given environment, even if they were sequenced in more than one run or were sequenced together with other environments (e.g., a sequencing center did one run with samples from several users).
Comparing samples
Creating a single set of OTUs is the most natural and intuitive basis for sample comparison, e.g. using a beta diversity metric. If you create separate 97% OTUs for each sample, they are not directly comparable because the clustering will give different results in each sample even if they contain mostly the same biological sequences. With denoising, comparing OTUs from different samples is less of a problem because a biological sequence that is present in two samples should, in principle, be recovered from both. However, this doesn't always work correctly, and better results are obtained by pooling, as explained under "Error detection" below.
Chimera detection
The UPARSE-OTU and UNOISE algorithms both require that a chimera has lower read abundance than its parents. Chimeras are not detected if a parent has the same number or fewer reads. This most often happens with low-abundance parents, e.g. when a chimera and one of its parents are both present in exactly two reads. If samples are pooled, parent abundances usually increase because they are found in multiple samples, while chimeras are only rarely reproduced so will usually be found only in a single sample. Even if chimeras are reproduced, pooling will tend to increase both chimera and parent abundances, leading to a more accurate reflection of amplicon abundance so that parent abundances become greater than their chimeras. Conversely, pooling is highly unlikely to increase the abundance of a chimera relative to its parents. Pooling is therefore effective in reducing the number of spurious OTUs due to chimeras.
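The abundance condition can be sketched with toy numbers. This is not the UPARSE-OTU or UNOISE implementation; it checks only the abundance requirement stated above (chimera strictly less abundant than each parent). The labels P1, P2 and C, the read counts, and the helper `abundance_permits_detection` are all hypothetical.

```python
from collections import Counter

# Hypothetical read counts. P1 and P2 are parents, C is their chimera.
sample_a = Counter({"P1": 2, "P2": 5, "C": 2})  # chimera ties with parent P1
sample_b = Counter({"P1": 3, "P2": 4})          # chimera not reproduced here

def abundance_permits_detection(counts, chimera, parents):
    """True if the chimera has strictly fewer reads than every parent."""
    return all(counts[chimera] < counts[p] for p in parents)

# Pooling adds the counts of both samples for each sequence.
pooled = sample_a + sample_b

alone = abundance_permits_detection(sample_a, "C", ("P1", "P2"))
together = abundance_permits_detection(pooled, "C", ("P1", "P2"))
print(alone, together)
```

In sample A alone the chimera and P1 both have two reads, so the condition fails. After pooling, P1 has five reads and the chimera still has two, so the condition is met.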
Error detection
The UNOISE algorithm uses unique sequence abundances to detect bad reads. If a read (R) with low abundance is very similar to a read with much higher abundance (H), then R is probably a bad read with correct sequence H. This is most effective when all samples are pooled together to give the highest possible abundances for correct reads.
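A simplified illustration of this abundance-based reasoning follows. It is not the actual UNOISE rule: the cutoffs `max_diffs` and `min_skew`, the helper names, and the toy sequences are assumptions chosen only to show why pooling strengthens the signal.

```python
from collections import Counter

def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def probable_bad_reads(counts, max_diffs=1, min_skew=8):
    """Map each low-abundance sequence R to a similar, much more abundant H.

    max_diffs and min_skew are hypothetical cutoffs, not UNOISE parameters.
    """
    flagged = {}
    for r, r_size in counts.items():
        for h, h_size in counts.items():
            if h == r or len(h) != len(r):
                continue
            if hamming(r, h) <= max_diffs and h_size >= min_skew * r_size:
                flagged[r] = h
                break
    return flagged

# Hypothetical unique-sequence abundances in two samples.
sample_a = Counter({"ACGTACGT": 6, "ACGTACGA": 1})
sample_b = Counter({"ACGTACGT": 7})
pooled = sample_a + sample_b

print(probable_bad_reads(sample_a))  # skew of 6 is below the assumed cutoff
print(probable_bad_reads(pooled))    # skew of 13 exceeds the assumed cutoff
```

With sample A alone, the correct sequence is only six times more abundant than its one-mismatch neighbour, which is not enough under the assumed cutoff. Pooling raises the correct sequence to 13 reads, and the neighbour is then flagged as a probable bad read.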
When to pool in your pipeline
Samples should be combined after non-biological sequences such as barcodes have been stripped from the reads, and before dereplication. This is required so that dereplication reflects the abundances of unique biological sequences across all samples.
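The ordering can be sketched as follows. The fixed-length barcode, the toy reads, and the helpers `strip_barcode`, `pool` and `dereplicate` are hypothetical; a real pipeline would use its own demultiplexing and dereplication commands.

```python
from collections import Counter

BARCODE_LEN = 4  # hypothetical fixed-length barcode at the start of each read

# Hypothetical raw reads per sample: barcode followed by the biological sequence.
raw_reads = {
    "S1": ["AAAA" + "ACGTAC", "AAAA" + "ACGTAC", "AAAA" + "GGCCTT"],
    "S2": ["CCCC" + "ACGTAC", "CCCC" + "GGCCTT"],
}

def strip_barcode(read, barcode_len=BARCODE_LEN):
    """Remove the leading barcode so only the biological sequence remains."""
    return read[barcode_len:]

def pool(raw):
    """Step 1 and 2: strip barcodes, then combine reads from all samples.

    The sample label is kept with each read so an OTU table can be built later.
    """
    return [(sample, strip_barcode(read))
            for sample, reads in raw.items()
            for read in reads]

def dereplicate(pooled):
    """Step 3: count each unique sequence across all samples, most abundant first."""
    counts = Counter(seq for _, seq in pooled)
    return counts.most_common()

uniques = dereplicate(pool(raw_reads))
print(uniques)

# For contrast: dereplicating before the barcodes are stripped splits each
# biological sequence by sample, so its total abundance is not captured.
unstripped = Counter(read for reads in raw_reads.values() for read in reads)
print(len(unstripped))
```

Stripping first gives two unique sequences with pooled abundances of three and two reads. Leaving the barcodes in place yields four uniques, because each sequence is counted separately for each sample's barcode.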