USEARCH v12

fastx_uniques command


See also
search_exact command
fastx_uniques_persample command

Find the set of unique sequences in an input file, also called dereplication. Input is a FASTA or FASTQ file. Sequences are compared letter-by-letter and must be identical over the full length of both sequences (substrings do not match). Case is ignored, so an upper-case letter matches a lower-case letter.
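The matching rule above can be illustrated with a minimal Python sketch. This is not USEARCH's implementation; the function name dereplicate and the in-memory list of (label, sequence) pairs are assumptions made for illustration, and the tie-breaking order among equally abundant sequences is likewise assumed.

```python
def dereplicate(records):
    """Group (label, sequence) pairs by exact, case-insensitive, full-length identity.

    Returns a list of (first_label, first_sequence, abundance) tuples sorted by
    decreasing abundance. Hypothetical helper, not part of USEARCH.
    """
    clusters = {}  # upper-cased sequence -> [first_label, first_sequence, count]
    for label, seq in records:
        key = seq.upper()  # case is ignored when comparing
        if key in clusters:
            clusters[key][2] += 1
        else:
            clusters[key] = [label, seq, 1]
    # sorted() is stable, so ties keep the order of first occurrence (an assumption).
    return sorted((tuple(c) for c in clusters.values()), key=lambda c: -c[2])


records = [
    ("r1", "ACGT"),
    ("r2", "acgt"),  # identical to r1 once case is ignored
    ("r3", "ACG"),   # substring of r1, so it is a separate unique sequence
    ("r4", "ACGT"),
]
uniques = dereplicate(records)
```

Here r1, r2 and r4 collapse into one unique sequence with abundance 3, while r3 stays separate because substrings do not match.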

All 26 letters of the English alphabet are treated in the same way, so there is no concept of a biological alphabet or of wildcard matching (unless -strand both is used).

Multithreading is supported.

The -fastaout option specifies a FASTA output file for the unique sequences. Sequences are sorted by decreasing abundance.

The -fastqout option specifies a FASTQ output file for the unique sequences. Sequences are sorted by decreasing abundance.

The -tabbedout option specifies an output file in tabbed text format. The fields are: 1. input label, 2. output label (this is the input label of the first occurrence of the sequence, or the new label assigned to it if the -relabel option is used), 3. cluster number (zero-based, so 0 is the first unique sequence found, 1 is the second etc.), 4. member number in the cluster (zero-based), 5. input label of the first occurrence of the sequence (only if -relabel is specified).
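The field layout can be sketched in Python as follows. The sketch covers only the case where -relabel is not given, so field 5 is absent and field 2 is the label of the first occurrence. The function name tabbed_rows is hypothetical, and writing one row per input sequence in input order is an assumption, not documented behavior.

```python
def tabbed_rows(records):
    """Build tab-separated rows: input label, output label, cluster number, member number.

    Hypothetical helper covering the no-relabel case only.
    """
    clusters = {}  # upper-cased sequence -> [cluster_number, first_label, members_seen]
    rows = []
    for label, seq in records:
        # A new cluster gets the next zero-based cluster number.
        entry = clusters.setdefault(seq.upper(), [len(clusters), label, 0])
        cluster_number, first_label, member_number = entry
        rows.append("\t".join([label, first_label, str(cluster_number), str(member_number)]))
        entry[2] += 1  # next member of this cluster gets the next zero-based number
    return rows


rows = tabbed_rows([("r1", "ACGT"), ("r2", "acgt"), ("r3", "TTTT"), ("r4", "ACGT")])
```

In this example r3 starts cluster 1 as member 0, and r4 is member 2 of cluster 0.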

The -uc output file is supported, but not other standard output files.

The -sizeout option specifies that size annotations should be added to the output sequence labels.

The -relabel option specifies a string that is used to re-label the dereplicated sequences. An integer is appended to the label. E.g., -relabel Uniq will generate sequence labels Uniq1, Uniq2 etc. By default, the label of the first occurrence of the sequence is used.
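How -relabel and -sizeout combine can be sketched as below. Two things here are assumptions rather than statements from this page: that the appended integer follows the output order (decreasing abundance), and that the size annotation takes the form ;size=N; at the end of the label. The function name make_labels is hypothetical.

```python
def make_labels(uniques, relabel=None, sizeout=False):
    """Build output labels for unique sequences.

    uniques: list of (first_label, abundance), already sorted by decreasing abundance.
    Hypothetical helper; numbering order and annotation format are assumptions.
    """
    labels = []
    for i, (first_label, size) in enumerate(uniques, start=1):
        # With relabel, append a 1-based integer to the prefix; otherwise keep
        # the label of the first occurrence.
        label = f"{relabel}{i}" if relabel else first_label
        if sizeout:
            label += f";size={size};"  # assumed annotation format
        labels.append(label)
    return labels


labels = make_labels([("r1", 3), ("r7", 1)], relabel="Uniq", sizeout=True)
```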

The -minuniquesize option sets a minimum abundance. Unique sequences with a lower abundance are discarded. Default is 1, which means that all unique sequences are output.

The -topn N option specifies that only the first N sequences in order of decreasing abundance will be written to the output file.
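The effect of -minuniquesize and -topn can be sketched as a filter over the dereplicated set. The function name filter_uniques is hypothetical, and applying the abundance threshold before the top-N cut is an assumption about the order of operations.

```python
def filter_uniques(uniques, minuniquesize=1, topn=None):
    """Keep unique sequences with abundance >= minuniquesize, then the top N.

    uniques: list of (label, abundance) tuples in any order.
    Hypothetical helper, not part of USEARCH.
    """
    kept = [u for u in uniques if u[1] >= minuniquesize]
    kept.sort(key=lambda u: -u[1])  # decreasing abundance, stable for ties
    return kept if topn is None else kept[:topn]


filtered = filter_uniques([("a", 5), ("b", 1), ("c", 3), ("d", 2)], minuniquesize=2, topn=2)
```

With the default minuniquesize of 1 and no topn, every unique sequence is kept.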

Reverse-complemented matching for nucleotide sequences is supported by specifying -strand both.
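One common way to implement matching on both strands is to reduce each sequence to a strand-independent key, sketched below. Whether USEARCH works this way is not stated here; the function names are hypothetical, and the sketch handles only the letters A, C, G and T.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def strand_both_key(seq):
    """Key that is identical for a sequence and its reverse complement.

    Hypothetical helper: picks the lexicographically smaller of the two strands.
    """
    s = seq.upper()
    return min(s, reverse_complement(s))


same = strand_both_key("AACG") == strand_both_key("cgtt")
```

Using strand_both_key in place of the upper-cased sequence as the comparison key would group a read with its reverse complement.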

Example

usearch -fastx_uniques input.fasta -fastaout uniques.fasta -sizeout -relabel Uniq