Reference Pairwise and Multiple Structural Alignments

Evaluating sequence and structure alignment methods requires reference alignments, and the data-sets provided here serve that purpose. Some of these sets were used in “Comparative analysis of protein structure alignments” (Mayr et al. [1]), where we evaluated pairwise structure alignment methods according to their consistency and accuracy.

Since that publication we have updated the data-sets used in [1] and added new data-sets for multiple alignments.

There are three basic sets:

  • The RIPC set contains protein pairs exhibiting very difficult structural relations including repetitions, large InDels, circular permutations and conformational variability.
  • The SISY pairwise set contains protein pairs selected from the Sisyphus database, which provides structural alignments for proteins with non-trivial relationships (Andreeva et al. [3]).
  • The SISY multiple set contains protein families selected from the Sisyphus database.

The data-sets are up to date with PDB Nov. 2008, SCOP 1.73 and Sisyphus 1.3. We introduced an XML-based file format to specify the reference alignments. Since SCOP and Sisyphus may refer to older PDB entries, we mapped the chain IDs to PDB Nov. 2008. Additionally, we provide PDB-style files, which are referenced in the XML files. If you use the data-sets, please use the PDB files provided here. For details specific to a particular set, refer to the set-specific pages.

    The XML format is used for both pairwise and multiple alignments. Each alignment may in turn contain alternative solutions. Each alternative alignment is written in a row format. Below we show an excerpt of a case from the RIPC set:

    <?xml version="1.0"?>
    <multiple-alignment n="2" altalg="1">
      <description>
        <source>RIPC v 1.0</source>
        <aname>d1an9a1-d1npx_1</aname>
      </description>
      <members>
        <member>d1an9a1</member>
        <member>d1npx_1</member>
      </members>
      <alternative id="1" eqr="11">
        <mequivalences n="11">
           <row><meq>   6 :I:A</meq><meq>   6 :L: </meq></row>
           <row><meq>  37 :D:A</meq><meq>  33 :K: </meq></row>
           <row><meq>  47 :V:A</meq><meq>  41 :S: </meq></row>
           .
           .
           .
        </mequivalences>
      </alternative>
    </multiple-alignment>
    
    Entity              Meaning
    multiple-alignment  Contains the alignment of a given set of proteins. Alternative solutions may exist for the same set of proteins (see <alternative>). E.g. <multiple-alignment n="2" altalg="1"> is an alignment of two (n="2") structures with a single solution (altalg="1").
    description         Contains general information about the alignment.
    members             Lists the names of the proteins/domains used.
    alternative         Encloses one alternative alignment solution.
    mequivalences       The alignments are stored in a row format. The attribute n gives the number of rows. The example has 11 rows (n="11"), three of which are shown.
    row                 Each row consists of as many <meq> entities as there are members in the members section. The order from left to right corresponds to the top-to-bottom order of the molecules in the members section.
    meq                 Each <meq> contains three colon-separated fields that refer to PDB format ATOM/HETATM records. The fields are “resSeq+iCode”:“residue type”:“chainId”. E.g. <meq> 315 :G:A</meq> refers to a glycine residue at position 315 (with blank iCode) in chain A. The field resSeq+iCode corresponds exactly to columns 23-27 of PDB ATOM/HETATM records. In the provided data-sets only structurally equivalent residues are listed. If required, gaps can easily be coded as <meq>---------</meq>.
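
As a rough illustration of how the format described above might be read, the following Python sketch uses only the standard library. The names Residue, parse_meq, read_alignment and same_residue are made up for this example and are not part of the data-set distribution. The sketch assumes that a gap is written exactly as the nine-dash string mentioned above and that every other <meq> holds exactly three colon-separated fields.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple, Optional

GAP = "---------"  # assumption: gaps appear exactly as this nine-dash string


class Residue(NamedTuple):
    res_seq_icode: str  # five characters, as in columns 23-27 of ATOM/HETATM
    res_type: str       # residue type as written in the file
    chain_id: str       # chain identifier; a single blank if the chain is unnamed


def parse_meq(text: str) -> Optional[Residue]:
    """Split one <meq> entry into its three fields; return None for a gap."""
    if text.strip() == GAP:
        return None
    res_seq_icode, res_type, chain_id = text.split(":")
    return Residue(res_seq_icode, res_type, chain_id)


def read_alignment(xml_text: str):
    """Return (member names, {alternative id: list of rows})."""
    root = ET.fromstring(xml_text)
    members = [m.text for m in root.findall("./members/member")]
    alternatives = {}
    for alt in root.findall("alternative"):
        alternatives[alt.get("id")] = [
            [parse_meq(meq.text) for meq in row.findall("meq")]
            for row in alt.findall("./mequivalences/row")
        ]
    return members, alternatives


def same_residue(res: Residue, pdb_line: str) -> bool:
    """Check whether an ATOM/HETATM line belongs to the given residue.

    Column 22 holds the chain identifier and columns 23-27 hold
    resSeq+iCode (1-based columns, hence the 0-based slices below).
    """
    return (pdb_line.startswith(("ATOM", "HETATM"))
            and pdb_line[21:22] == res.chain_id
            and pdb_line[22:27] == res.res_seq_icode)


# The three rows shown in the RIPC excerpt (the remaining rows are omitted).
EXAMPLE = """<?xml version="1.0"?>
<multiple-alignment n="2" altalg="1">
  <members>
    <member>d1an9a1</member>
    <member>d1npx_1</member>
  </members>
  <alternative id="1" eqr="11">
    <mequivalences n="11">
      <row><meq>   6 :I:A</meq><meq>   6 :L: </meq></row>
      <row><meq>  37 :D:A</meq><meq>  33 :K: </meq></row>
      <row><meq>  47 :V:A</meq><meq>  41 :S: </meq></row>
    </mequivalences>
  </alternative>
</multiple-alignment>
"""

members, alternatives = read_alignment(EXAMPLE)
for row in alternatives["1"]:
    # Each row holds one entry per member, in the order of the members list.
    print(dict(zip(members, row)))
```

The fields are kept as raw strings, including blanks, so that resSeq+iCode can be compared character by character with columns 23-27 of the provided PDB files, as same_residue does.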

    We continue to improve and extend the data sets and these web pages. Changes to the data and new versions will be reported here. Feedback is highly appreciated. If you use the data sets, please cite [1].