SABmark - Sequence and structure Alignment Benchmark

SABmark is designed to assess the performance of both multiple and pairwise (protein) sequence alignment algorithms, and is extremely easy to use. To download it, go to the bottom of the page, or just view the manual. A short description of the database is given below, and will soon also be published in Bioinformatics.

Currently, the database contains two sets, each consisting of a number of subsets of related sequences. Its main features are:
- Covers the entire known fold space (SCOP classification), with subsets provided by the ASTRAL compendium
- All structures have high quality, with 100% resolved residues
- Structure alignments have been derived carefully, using both SOFI and CE, and Relaxed Transitive Alignment
- At most 25 sequences in each subset to avoid overrepresentation of large folds
- Automated running, archiving and scoring of programs through a few Perl scripts

The Twilight Zone set is divided into sequence groups that each represent a SCOP fold. All sequences within a group share a pairwise BLAST E-value of at least 1, for a theoretical database size of 100 million residues. Sequence similarity is thus very low, between 0 and 25% identity, and a (traceable) common evolutionary origin cannot be established between most pairs even though their structures are (distantly) similar. This set therefore represents the worst-case scenario for sequence alignment, which unfortunately is also the most frequent one, as most related sequences share less than 25% identity.
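The phrase "for a theoretical database size of 100 million residues" matters because an E-value depends on the size of the database searched. Under the Karlin-Altschul statistics used by BLAST, the expected number of chance hits grows roughly linearly with database size, so an E-value can be approximately rescaled to a reference size. The sketch below illustrates that rescaling in Python. The function names are hypothetical, the linear scaling ignores effective-length corrections, and this is not necessarily how the SABmark groups were actually built.

```python
REFERENCE_RESIDUES = 100_000_000  # theoretical database size quoted above


def rescale_evalue(evalue, searched_residues, reference_residues=REFERENCE_RESIDUES):
    """Approximately rescale an E-value to a reference database size.

    Assumes E is proportional to the number of residues searched
    (E = K * m * n * exp(-lambda * S), with n the database size).
    """
    if searched_residues <= 0 or reference_residues <= 0:
        raise ValueError("database sizes must be positive")
    return evalue * reference_residues / searched_residues


def is_twilight_pair(evalue, searched_residues):
    """True if the rescaled E-value is at least 1 (illustrative criterion)."""
    return rescale_evalue(evalue, searched_residues) >= 1.0
```

For example, an E-value of 0.01 obtained against a database of 1 million residues corresponds to roughly 1 at the reference size, so under these assumptions the pair would sit right at the threshold.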

The Superfamilies set consists of groups that each represent a SCOP superfamily, and therefore contain sequences with a (putative) common evolutionary origin. However, they share at most 50% identity, which is still challenging for any sequence alignment algorithm.
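Both sets are characterised by percent sequence identity. As a rough illustration, identity can be computed from a gapped pairwise alignment as sketched below. Several definitions of percent identity exist, differing in the denominator. This sketch assumes the number of columns in which both sequences have a residue, which is not necessarily the definition used for SABmark. The function name and the gap character are illustrative.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over columns where both sequences have a residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    aligned_columns = 0
    identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # skip columns containing a gap in either sequence
        aligned_columns += 1
        if x.upper() == y.upper():
            identical += 1
    if aligned_columns == 0:
        return 0.0
    return 100.0 * identical / aligned_columns
```

With this definition, the toy alignment `ACDE-G` / `ACQEFG` has five residue-residue columns, four of them identical, giving 80% identity, well above the ranges covered by either set.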

Alignments are frequently performed to establish whether or not sequences are related. To benchmark this, a second version of both the Twilight Zone and the Superfamilies set is provided, in which a number of false positives, i.e. sequences not related to the original set, are added to each alignment problem.

Database specifications:
- Current version: 1.65 (concurrent with PDB, SCOP and ASTRAL)
- Twilight Zone set (with false positives): 209 groups, 1740 (3280) sequences, 10667 (44056) related pairs
- Superfamilies set (with false positives): 425 groups, 3280 (6526) sequences, 19092 (79095) related pairs

Download (PDBs separately):
SABmark1.65.tar.gz (47 MB)
SABmark1.65_PDBs.tar.gz (73 MB)

Installation: unpack the archive(s) into a directory of your choice
Usage: see enclosed documentation (SABmark.pdf)
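
The installation step amounts to extracting the downloaded archives. A minimal sketch in Python is given below, for cases where a scripted setup is preferred over a command-line tar. The helper name `unpack_archive` and the target directory `sabmark` are hypothetical. The layout inside the archive is not assumed here, so the function simply returns the member names it found.

```python
import tarfile
from pathlib import Path
from typing import List


def unpack_archive(archive: str, target_dir: str) -> List[str]:
    """Extract a .tar.gz archive into target_dir and return its member names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions: refuse unsafe members (e.g. absolute paths)
            tar.extractall(path=target, filter="data")
        else:
            tar.extractall(path=target)
    return names


# Example (assumes the archives have been downloaded to the current directory):
#   unpack_archive("SABmark1.65.tar.gz", "sabmark")
#   unpack_archive("SABmark1.65_PDBs.tar.gz", "sabmark")
```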

Previous versions:
- SABmark1.63_Twilight_Zone.tar.gz, SABmark1.63_Superfamilies.tar.gz