The USEARCH algorithm searches a database for high-identity hits to one or more database sequences ("targets"). USEARCH is used by the usearch_global and usearch_local commands and is used as a subroutine by cluster_fast and cluister_smallmem . This algorithm is fundamentally different from the UBLAST algorithm that is designed for low identity local searches.
Recommended identity ranges
USEARCH is effective at identities of
~50% and above for proteins and ~75% and above for nucleotides
.
Word counts and U-sorting
USEARCH exploits that fact that similar sequences tend to have several short words in common. These words have a fixed length k, and are sometimes called kmers. Unlike other programs that use kmer counting, USEARCH does not attempt to estimate sequence identity from the number of matching kmers. This is because identity correlates only approximately with the word count and does not give an accurate estimate, especially for lower identities. Instead, USEARCH uses the word count to prioritize the database search. Target sequences are compared to the query in order of decreasing unique word count (U). This is the "U" in USEARCH. If a hit above the
identity threshold
exists in the database, it is likely to be found close to the start of the U-sorted list of targets (the "U vector").
Accepts and rejects
An "accept" is a target sequence that is above the identity threshold so can be considered a hit. A "reject" is a target sequence that is compared to the query but falls below the threshold. If targets are compared to the query in U-sorted order, then:
(i) the first hit that is found is likely to be the best hit that exists in the database, or close to it, and
(ii) the more rejects that have occurred without a hit, the less likely it is that a hit exists.
This means that the search can be terminated early if (i) a hit is found, or (ii) several rejects have occurred, with only a small loss in sensitivity. The termination options -maxaccepts and -maxrejects determine when the search is stopped. This technique can give a dramatic improvement in speed, which is a decisive advantage for the very large sequence datasets that are increasingly common in biology.
Word index
USEARCH uses an index on words in the database for rapid calculation of the U vector for a given query.
Index options
allow fine tuning of speed, sensitivity and memory use. The index can be stored in a
UDB file
, which can be advantageous when a large database will be searched repeatedly. Alternatively, the index can be constructed on the fly, for example when a
database
is provided in FASTA format. Indexes are also used by
cluster_fast
and
cluister_smallmem
, which use USEARCH as a subroutine to match query sequences to a database of centroids of existing clusters. With clustering, the index is necessarily constructed on the fly since the database is updated as new clusters are identified.