SABmark - Sequence and structure Alignment Benchmark
SABmark is designed to assess the performance of both multiple and pairwise (protein) sequence alignment algorithms, and is extremely easy to use. To download it, go to the bottom of the page, or just view the manual. A short description of the database is given below, and will soon also be published in Bioinformatics.
Currently, the database contains two sets, each consisting of a number of subsets with related sequences. Its main features are:
The Twilight Zone set is divided into sequence groups that each represent a SCOP fold. Every pair of sequences within a group has a BLAST E-value of at least 1, calculated for a theoretical database size of 100 million residues. Sequence similarity is thus very low, between 0 and 25% identity, and a (traceable) common evolutionary origin cannot be established between most pairs even though their structures are (distantly) similar. This set therefore represents the worst-case scenario for sequence alignment, which unfortunately is also the most frequent one, as most related sequences share less than 25% identity.
The Superfamilies set consists of groups that each represent a SCOP superfamily, and therefore contain sequences with a (putative) common evolutionary origin. However, they share at most 50% identity, which is still challenging for any sequence alignment algorithm.
Frequently, alignments are performed to establish whether or not sequences are related. To benchmark this, a second version of both the Twilight Zone and the Superfamilies set is provided, in which a number of false positives, i.e. sequences not related to the original set, are added to each alignment problem.
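The identity and E-value thresholds quoted for the two sets can be made concrete with a short sketch. This is an illustration only, not the procedure used to build SABmark: the function names are hypothetical, percent identity is assumed to be counted over gap-free alignment columns (other conventions exist), and the E-value uses the textbook Karlin-Altschul estimate from a bit score, ignoring the effective-length corrections that BLAST itself applies.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity of two aligned (gapped) sequences.

    Assumption: the denominator is the number of columns in which both
    sequences have a residue; other definitions give different values.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    paired = [(x, y) for x, y in zip(aligned_a, aligned_b)
              if x != gap and y != gap]
    if not paired:
        return 0.0
    matches = sum(1 for x, y in paired if x == y)
    return 100.0 * matches / len(paired)


def evalue_from_bit_score(bit_score: float, query_len: int,
                          db_residues: int = 100_000_000) -> float:
    """Karlin-Altschul estimate E = m * n * 2**(-S'), S' being the bit score.

    db_residues defaults to the 100 million residues mentioned in the text.
    """
    return query_len * db_residues * 2.0 ** (-bit_score)


def fits_twilight_zone(identity: float, evalue: float) -> bool:
    """Hypothetical check: at most 25% identity and an E-value of at least 1."""
    return identity <= 25.0 and evalue >= 1.0


def fits_superfamilies(identity: float) -> bool:
    """Hypothetical check: at most 50% identity."""
    return identity <= 50.0


if __name__ == "__main__":
    ident = percent_identity("ACDE-G", "ACEEFG")
    evalue = evalue_from_bit_score(bit_score=30.0, query_len=100)
    print(ident, evalue)
    print(fits_twilight_zone(ident, evalue), fits_superfamilies(ident))
```

Fixing the database size, as the description does, keeps E-values comparable between pairs: for a given query length, the same bit score always maps to the same E-value regardless of which real database was searched.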
Download (PDBs separately):