The -id option is an accept option that specifies the minimum sequence identity of a hit. It is expressed as a fraction between 0.0 and 1.0, corresponding to 0% to 100%. It is supported by most search and clustering commands. Identity is the fraction of columns in an alignment that contain matching letters.
Example
usearch -cluster_fast reads.fasta -centroids c.fasta -id 0.90
Rules for wildcards and matching letters (version 8 and later)
Case is ignored when calculating identity, so an upper-case letter can match a lower-case letter (see Masking for a discussion of lower-case letters and indexing). Wildcards match, so for example in an amino acid alignment a column containing AX is an identity, and in a nucleotide alignment AN and AW are identities (because W is the IUPAC ambiguity symbol for A or T). Two wildcard letters match each other if they represent at least one identical residue, so for example NN matches in a nucleotide alignment, and MR matches in a nucleotide alignment (because both M and R include A). Identical letters always match, even if they are not part of a known alphabet. These rules for matching wildcards give an upper bound on the identity of the true sequences when wildcards are replaced by fully specified residues. Other rules are possible, e.g. always considering wildcards to be mismatches (which would give a lower bound), or ignoring columns containing wildcards. There is no one best rule for dealing with wildcards; all possible rules have advantages and disadvantages in different situations.
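The matching rules above can be sketched in a few lines of Python. This is an illustration only, not USEARCH's actual implementation: the function name letters_match and the table IUPAC_NT are hypothetical, the table holds the standard IUPAC nucleotide codes, and amino acid wildcards such as X are left out for brevity.

```python
# Standard IUPAC nucleotide codes, each mapped to the bases it can stand for.
# IUPAC_NT and letters_match are illustrative names, not part of USEARCH.
IUPAC_NT = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def letters_match(a: str, b: str) -> bool:
    """Return True if two nucleotide letters count as an identity."""
    # Case is ignored.
    a, b = a.upper(), b.upper()
    # Identical letters always match, even outside the known alphabet.
    if a == b:
        return True
    # Known letters match if they share at least one base.
    if a in IUPAC_NT and b in IUPAC_NT:
        return bool(set(IUPAC_NT[a]) & set(IUPAC_NT[b]))
    return False


print(letters_match("A", "W"))  # True: W is A or T
print(letters_match("M", "R"))  # True: both M and R include A
print(letters_match("R", "Y"))  # False: R is A or G, Y is C or T
```

Because any shared base counts as a match, this rule is the optimistic one: it yields the upper bound on identity described above.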
Identity in global alignments
In global alignments, columns containing terminal gaps are discarded before identity is calculated, while internal gaps always count as differences. The example below has a terminal gap of length 3 at the end of the alignment. Identity is therefore calculated over the remaining seven columns, which contain six matches, giving 6/7 = 0.86.
GATTACA---
||| |||
GATAACAATC
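As a rough check of this arithmetic, the following Python sketch computes identity from two aligned rows. It is not USEARCH code: the function name global_identity is hypothetical, gaps are assumed to be written as "-", and for simplicity letters are compared case-insensitively without the wildcard rules.

```python
def global_identity(row1: str, row2: str, gap: str = "-") -> float:
    """Identity of a pairwise global alignment, ignoring terminal gaps."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")

    def letter_span(row: str) -> tuple[int, int]:
        # Positions of the first and last non-gap letters in a row.
        idx = [i for i, ch in enumerate(row) if ch != gap]
        if not idx:
            raise ValueError("row contains no letters")
        return idx[0], idx[-1]

    s1, e1 = letter_span(row1)
    s2, e2 = letter_span(row2)
    # Columns outside this range hold terminal gaps and are discarded.
    start, end = max(s1, s2), min(e1, e2)
    columns = list(zip(row1[start:end + 1], row2[start:end + 1]))
    if not columns:
        raise ValueError("no columns left after removing terminal gaps")

    # Internal gap columns stay in the denominator and count as differences.
    matches = sum(
        1 for a, b in columns
        if gap not in (a, b) and a.upper() == b.upper()
    )
    return matches / len(columns)


print(round(global_identity("GATTACA---", "GATAACAATC"), 2))  # 0.86
```

An internal gap is treated differently: for the rows GAT-ACA and GATTACA all seven columns are kept, and the gap column counts as a difference, so the result is also 6/7.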
Fractional identity vs. percentage identity
To convert fractional identity to percentage identity, multiply by 100; to convert percentage to fraction, divide by 100. Since percentage identity is much more commonly used in practice, using fractional identity was a minor design mistake; it would have been better to use percentage. The historical reason is that the USEARCH code began with UCLUST, motivated as an attempt to improve on CD-HIT, and CD-HIT is one of the few programs to use fractional identity (its -c option). Note that CD-HIT uses a problematic non-standard definition of identity.