A program to predict the consensus secondary structure of a set of unaligned RNA sequences
D. Bouthinon (db@lipn.univ-paris13.fr)
H. Soldano (soldano@lipn.univ-paris13.fr)
LIPN-UPRESA CNRS 7030
Université Paris 13, Institut Galilée 
99 Avenue Jean-Baptiste Clément 93430 Villetaneuse (France)

http://www-lipn.univ-paris13.fr

 

Introduction

We have designed a method based on a new representation of any RNA secondary structure as a set of structural relationships between the helices of the structure. We refer to this representation as a structural pattern. 
In a first step, we use thermodynamic parameters to select, for each sequence, the k best secondary structures according to energy minimization and we represent each of them using its corresponding structural pattern.
In a second step, we search for the repeated structural patterns, i.e. the largest structural patterns that recur across the sequences, where a pattern occurs in a sequence if it is included in at least one of the structural patterns associated with that sequence. Thanks to an efficient encoding of structural patterns, this search comes down to identifying the largest repeated word suffixes in a dictionary.
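As an illustration of the "largest repeated word suffixes" idea, here is a minimal Python sketch. It is a naive enumeration, not the efficient algorithm of the paper, and the words below are made-up stand-ins: the real encoding of structural patterns is described in the article and is not reproduced here. The function name largest_repeated_suffixes and the min_sequences parameter are assumptions introduced for this example.

```python
from collections import defaultdict

def largest_repeated_suffixes(words_per_sequence, min_sequences):
    """Naive search for the largest suffixes shared by enough sequences.

    words_per_sequence: one list of words per sequence (hypothetical encoding,
    each word standing for one structural pattern of that sequence).
    min_sequences: minimum number of distinct sequences a suffix must occur in.
    """
    support = defaultdict(set)  # suffix -> ids of the sequences containing it
    for seq_id, words in enumerate(words_per_sequence):
        for word in words:
            for start in range(len(word)):
                support[word[start:]].add(seq_id)
    repeated = {s for s, ids in support.items() if len(ids) >= min_sequences}
    # Keep a suffix only if it is not a proper suffix of a longer repeated one.
    return sorted(
        s for s in repeated
        if not any(t != s and t.endswith(s) for t in repeated)
    )

# Made-up words for three sequences (not real encoded structural patterns).
words = [
    ["xabc", "qq"],
    ["yabc", "zbc"],
    ["wbc"],
]
print(largest_repeated_suffixes(words, 3))  # ['bc']
print(largest_repeated_suffixes(words, 2))  # ['abc']
```

Lowering the required number of sequences lets longer suffixes qualify, which mirrors the role of the percentage parameter P described in the Run section.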
In a third step, we compute the plausibility of each repeated structural pattern by checking whether it occurs more frequently in the sequences under investigation than in random RNA sequences. We then assume that the consensus secondary structure corresponds to the repeated structural pattern with the highest plausibility.
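The comparison against random sequences can be pictured with the following Python sketch. The score used here (observed frequency minus frequency in the random model) is only an assumption chosen for illustration; the actual plausibility measure is defined in the article cited below. The pattern names and frequencies are invented, and most_plausible is a hypothetical helper.

```python
def most_plausible(observed, random_model):
    """Pick the pattern whose observed frequency most exceeds its random one.

    observed, random_model: dicts mapping a pattern name to a frequency in
    [0, 1]. The difference of frequencies is an illustrative score only,
    not the measure used by the program.
    """
    scores = {p: observed[p] - random_model.get(p, 0.0) for p in observed}
    best = max(scores, key=scores.get)
    return best, scores

# Invented frequencies for two hypothetical repeated patterns.
observed = {"A": 0.90, "B": 0.95}
random_model = {"A": 0.10, "B": 0.80}

best, scores = most_plausible(observed, random_model)
print(best)  # A
```

Pattern B is slightly more frequent in the sequences, but it is also common in random sequences, so under this illustrative score pattern A is preferred.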

 

For more details, see "A new method to predict the consensus secondary structure of a set of unaligned RNA sequences" (D. Bouthinon and H. Soldano, Bioinformatics, Vol 15 n°10, 1999, pages 785-798).

The method described above is implemented as C programs and C-shell scripts. Some limitations of the current implementation are listed below:
  1. Processing RNA sequences longer than 200 nucleotides is time-consuming (as with all combinatorial approaches based on a branch and bound algorithm)
  2. This release cannot handle consensus secondary structures made of more than 6 helices (a new version will overcome this limit)
  3. It is designed to run in a UNIX environment (tested on Solaris 2.5.1)

Download

Click here to download rna.tar. Then execute the command tar xvf rna.tar. This will create an rna directory that contains the software to install.

Install 

Run the script rnainstall, which is in the rna directory, then log out and log in again. Note that rnainstall modifies your .cshrc file.

Compile

Run the script compile, which is in the rna directory. The gcc compiler is required.

Run 

The program should be executed from the directory containing your sequences (as an example, the rna directory contains a data subdirectory that contains the file rnaSeq). The program takes RNA sequences as input and, as output, marks the occurrences of the candidate consensus structures in the sequences. It is made of 2 scripts, alea and rna, that you must run in this order (although they are located in the bin directory, you can run them from anywhere).
syntax alea N Mod < Seq
N is the number of RNA sequences taken in Seq (from the beginning)
Mod is the name of the output file that will contain the random model of the RNA sequences
Seq is the name of the input file containing the RNA sequences (see the file rnaSeq in the data directory for the format of the sequences)

example alea 100 rnaMod < rnaSeq

This creates the random model of 100 RNA sequences taken from the file rnaSeq. rnaMod is the name of the file that will contain the random model.
syntax rna N P Mod < Seq > Res
N (the same N as in alea) is the number of RNA sequences
P is the percentage of RNA sequences in which any candidate consensus structure must be present
Mod is the random model computed by alea
Seq (the same Seq as in alea) is the file containing the RNA sequences
Res is the name of the file that will contain the results

example rna 100 85 rnaMod < rnaSeq > rnaRes

This computes the candidate secondary structures that are present in at least 85% of the 100 RNA sequences contained in rnaSeq, and whose frequencies in the random model rnaMod are low. The results are in rnaRes (see the file rnaRes in the data directory for the format of the outputs).
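If you prefer to drive the two scripts from another program, the following Python sketch shows one way to reproduce the redirections used in the examples above. It assumes alea and rna are reachable through your PATH (as set up by rnainstall); run_filter and pipeline_commands are hypothetical helper names, not part of the distribution. As written, the block only prints the planned command lines.

```python
import subprocess

def run_filter(argv, input_path, output_path=None):
    """Run argv with stdin read from input_path and, if output_path is
    given, stdout written to output_path (like '< Seq > Res')."""
    with open(input_path, "rb") as fin:
        if output_path is None:
            subprocess.run(argv, stdin=fin, check=True)
        else:
            with open(output_path, "wb") as fout:
                subprocess.run(argv, stdin=fin, stdout=fout, check=True)

def pipeline_commands(n, p, mod, seq, res):
    """Return (argv, stdin file, stdout file) for alea, then rna."""
    return [
        (["alea", str(n), mod], seq, None),       # alea N Mod < Seq
        (["rna", str(n), str(p), mod], seq, res),  # rna N P Mod < Seq > Res
    ]

# Print the planned commands; nothing is executed here.
for argv, fin, fout in pipeline_commands(100, 85, "rnaMod", "rnaSeq", "rnaRes"):
    print(" ".join(argv), "<", fin, *([">", fout] if fout else []))
```

To actually execute the steps, call run_filter(argv, fin, fout) for each entry, in order, from the directory containing your sequences.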

 

rna is itself made of 2 scripts, sol and struct, that can be run separately (in this order):
syntax sol N < Seq > fSol
N (the same N as in alea) is the number of RNA sequences
Seq (the same Seq as in alea) is the file containing the RNA sequences
fSol is the name of the file that will contain the sets of possible helices

example sol 100 < rnaSeq > rnaSol

This computes the sets of possible helices for each of the 100 RNA sequences contained in rnaSeq. The results are in rnaSol.

syntax struct P Mod < fSol > Res
P is the percentage of RNA sequences that must contain any candidate consensus structure
Mod is the random model computed by alea
fSol (the same fSol as in sol) is the file containing the sets of possible helices of the RNA sequences used in sol (note that fSol also contains the RNA sequences)
Res is the name of the file that will contain the results

 

example struct 85 rnaMod < rnaSol > rnaRes
This computes the candidate secondary structures from the sets of possible helices contained in rnaSol that are present in at least 85% of the corresponding RNA sequences and whose frequencies in the random model rnaMod are low. The results are in rnaRes (see the file rnaRes in the data directory for the format of the outputs).
Running struct separately after sol is useful because struct is based on a fast algorithm, so you can try different values of the parameter P without recomputing the sets of possible helices.
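Trying several values of P on the same sol output could look like the Python sketch below. It assumes struct is reachable through your PATH and that rnaSol has already been produced by sol. The helper name sweep_struct, the dry_run flag and the output naming scheme (rnaRes.P70, rnaRes.P85, ...) are assumptions made for this example.

```python
import subprocess

def sweep_struct(percentages, mod, fsol, res_prefix, dry_run=True):
    """Plan (and optionally run) 'struct P Mod < fSol > Res' for each P.

    Output files are named '<res_prefix>.P<P>' (a hypothetical convention).
    With dry_run=True nothing is executed; the planned runs are returned.
    """
    planned = []
    for p in percentages:
        res = f"{res_prefix}.P{p}"
        argv = ["struct", str(p), mod]
        planned.append((argv, fsol, res))
        if not dry_run:
            with open(fsol, "rb") as fin, open(res, "wb") as fout:
                subprocess.run(argv, stdin=fin, stdout=fout, check=True)
    return planned

# Dry run: only print the command lines that would be executed.
for argv, fin, fout in sweep_struct([70, 85, 100], "rnaMod", "rnaSol", "rnaRes"):
    print(" ".join(argv), "<", fin, ">", fout)
```

Calling sweep_struct with dry_run=False runs struct once per value while reusing the same rnaSol file, so the slower sol step is performed only once.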