A program to predict the consensus secondary structure of a set of unaligned RNA sequences

D. Bouthinon (db@lipn.univ-paris13.fr)

H. Soldano (soldano@lipn.univ-paris13.fr)

LIPN-UPRESA CNRS 7030

Université Paris 13, Institut Galilée

99 Avenue Jean-Baptiste Clément 93430 Villetaneuse (France)

Introduction

We have designed a method based on a new representation of any RNA secondary structure as a set of structural relationships between the helices of the structure. We refer to this representation as a structural pattern.
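To make the idea of a structural pattern concrete, here is a minimal sketch in Python. It is not the encoding used by the program: the helix coordinates `(start5, end5, start3, end3)` and the relationship names `inside`, `after` and `other` are assumptions chosen for illustration only.

```python
from itertools import combinations

# Hypothetical helix encoding: (start5, end5, start3, end3), inclusive positions,
# with start5 <= end5 < start3 <= end3 (5' strand first, then 3' strand).

def relation(h1, h2):
    """Relationship of h2 with respect to h1, assuming h1 starts before h2."""
    a5s, a5e, a3s, a3e = h1
    b5s, b5e, b3s, b3e = h2
    if b5s > a3e:
        return "after"   # h2 lies entirely after h1
    if a5e < b5s and b3e < a3s:
        return "inside"  # h2 lies between the two strands of h1
    return "other"       # any other arrangement, e.g. crossing strands

def structural_pattern(helices):
    """Describe a structure as the set of pairwise relationships between its helices."""
    ordered = sorted(helices)  # order helices by the start of their 5' strand
    return {
        (i, j, relation(ordered[i], ordered[j]))
        for i, j in combinations(range(len(ordered)), 2)
    }

if __name__ == "__main__":
    helices = [(0, 3, 20, 23), (5, 7, 12, 14), (30, 33, 40, 43)]
    print(sorted(structural_pattern(helices)))
```

The point of such a representation is that two structures with different helix positions can still share the same set of relationships.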

In a first step, we use thermodynamic parameters to select, for each sequence, the k best secondary structures according to energy minimization and we represent each of them using its corresponding structural pattern.
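The selection criterion of this first step can be sketched as follows. This only shows "keep the k structures of lowest free energy"; the program itself enumerates structures with a branch and bound algorithm, which is not reproduced here. The `(energy, structure)` pairs and the function name `k_best` are assumptions made for the example.

```python
import heapq

def k_best(structures, k):
    """Return the k (energy, structure) pairs with the lowest free energy.

    `structures` is an iterable of (energy, structure) pairs; lower energy
    means a more stable structure.
    """
    return heapq.nsmallest(k, structures, key=lambda pair: pair[0])

if __name__ == "__main__":
    # Hypothetical energies for three candidate structures of one sequence.
    candidates = [(-3.2, "s1"), (-7.5, "s2"), (-1.0, "s3")]
    print(k_best(candidates, 2))
```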

In a second step, we search for the repeated structural patterns, i.e. the largest structural patterns that occur in at least one sequence, where a pattern occurs in a sequence if it is included in at least one of the structural patterns associated with that sequence. Thanks to an efficient encoding of structural patterns, this search comes down to identifying the largest repeated word suffixes in a dictionary.
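The search for the largest repeated suffixes can be illustrated with a naive sketch. Plain strings stand in for encoded structural patterns, since the actual encoding is not described here, and the parameter `min_sequences` (the number of sequences that must share a suffix) is an assumption of this example. The real program is expected to be far more efficient than this quadratic version.

```python
from collections import defaultdict

def repeated_suffixes(words_per_sequence, min_sequences):
    """Return the largest suffixes shared by at least `min_sequences` sequences.

    `words_per_sequence[i]` is the list of words (encoded patterns) of sequence i.
    """
    support = defaultdict(set)  # suffix -> ids of the sequences containing it
    for seq_id, words in enumerate(words_per_sequence):
        for word in words:
            for start in range(len(word)):
                support[word[start:]].add(seq_id)
    frequent = {s for s, ids in support.items() if len(ids) >= min_sequences}
    # Keep only the largest ones: drop a suffix when it is a proper suffix
    # of another frequent suffix.
    return {
        s for s in frequent
        if not any(t != s and t.endswith(s) for t in frequent)
    }

if __name__ == "__main__":
    words = [["abcd", "xcd"], ["zbcd"], ["qcd"]]
    print(repeated_suffixes(words, 2))
```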

In a third step, we compute the plausibility of each repeated structural pattern by checking if it occurs more frequently in the sequences under investigation than in random RNA sequences. We then suppose that the consensus secondary structure corresponds to the repeated structural pattern that displays the highest plausibility.
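As a rough illustration of this third step, the sketch below ranks patterns by how much more often they occur in the real sequences than in the random ones. The score used here (a simple difference of frequencies) is a stand-in chosen for the example, not the plausibility measure defined in the paper, and the dictionaries `observed` and `random_freq` are hypothetical inputs.

```python
def most_plausible(observed, random_freq):
    """Return the pattern whose observed frequency most exceeds its random frequency.

    `observed` and `random_freq` map a pattern to the fraction of sequences
    (real and shuffled, respectively) in which it occurs.
    """
    def excess(pattern):
        # Stand-in score: observed frequency minus frequency in random sequences.
        return observed[pattern] - random_freq.get(pattern, 0.0)
    return max(observed, key=excess)

if __name__ == "__main__":
    observed = {"A": 0.90, "B": 0.95}
    random_freq = {"A": 0.20, "B": 0.80}
    print(most_plausible(observed, random_freq))
```

In this toy example, pattern B is slightly more frequent than A in the real sequences but is also common in random sequences, so A is preferred.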

For more details see "A new method to predict the consensus secondary structure of a set of unaligned RNA sequences" (D. Bouthinon and H. Soldano, Bioinformatics, Vol. 15 no. 10, 1999, pages 785-798).

The method described above is implemented as C programs and C-shell scripts. Some limitations of the current implementation are listed below:

o Processing RNA sequences longer than 200 nucleotides is time-consuming (as with all combinatorial approaches based on a branch and bound algorithm)

o This release cannot handle a consensus secondary structure made of more than 6 helices (a new version will overcome this limit)

o It is designed to run in a UNIX environment (tested on Solaris 2.5.1)

Download

Click here to download rna.tar. Then execute the command tar xvf rna.tar. This will create an rna directory that contains the software to install.

Install

Run the script rnainstall, which is in the rna directory, then log out and log in again. Note that rnainstall modifies your .cshrc file.

Compile

Run the script compile, which is in the rna directory. This requires the gcc compiler.

Run

The program should be executed from the directory containing your sequences (as an example, the rna directory contains a data subdirectory that contains the file rnaSeq). The program takes RNA sequences (inputs) and marks the occurrences of the candidate consensus structures (outputs) in the sequences. It is made of 2 scripts, alea and rna, that you must run in this order (although they are located in the bin directory, you can run them from anywhere).

alea is used to create a random model of the RNA sequences: it takes a set of unaligned RNA sequences, randomly shuffles them and outputs a file containing the frequencies of all possible secondary structures (having at most 6 helices) of these random sequences.

syntax alea N Mod < Seq

o N is the number of RNA sequences read from Seq (starting from the beginning)

o Mod is the name of the output file that will contain the random model of the RNA sequences

o Seq is the name of the input file containing the RNA sequences (see the file rnaSeq in the data directory for the format of the sequences)

example alea 100 rnaMod < rnaSeq

This creates the random model of the 100 RNA sequences contained in the file rnaSeq. rnaMod is the name of the file that will contain the random model.
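The shuffling performed by alea can be pictured with the short sketch below. How alea actually shuffles is not specified here; this example assumes the simplest scheme, a random permutation of the nucleotides of each sequence, which keeps its length and base composition. The function name `shuffle_sequence` is hypothetical.

```python
import random

def shuffle_sequence(seq, rng):
    """Return a random permutation of the nucleotides of `seq`.

    Length and base composition are preserved; only the order changes.
    """
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so that the run is reproducible
    print(shuffle_sequence("GGGAAACCCUUU", rng))
```

Secondary structures found in such shuffled sequences give a baseline for what can be expected by chance from sequences of the same composition.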

rna locates the occurrences of candidate consensus secondary structures in the set of unaligned RNA sequences, using a random model of these sequences.

syntax rna N P Mod < Seq > Res

o N (the same N as in alea) is the number of RNA sequences

o P is the percentage of RNA sequences in which a candidate consensus structure must be present

o Mod is the random model computed by alea

o Seq (the same Seq as in alea) is the file containing the RNA sequences

o Res is the name of the file that will contain the results

example rna 100 85 rnaMod < rnaSeq > rnaRes

Compute the candidate secondary structures that are present in at least 85% of the 100 RNA sequences contained in rnaSeq, and whose frequencies in the random model rnaMod are low. The results are in rnaRes (see the file rnaRes in the data directory for the format of the outputs).
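The role of the parameters N and P can be illustrated by the small sketch below. It assumes that the required number of sequences is N * P / 100 rounded up; the exact rounding used by rna is not documented here, and the function names are hypothetical.

```python
def min_sequences(n, p):
    """Smallest number of sequences that represents at least p% of n sequences."""
    # Integer ceiling of n * p / 100 (assumed rounding, see the text above).
    return -((-n * p) // 100)

def is_candidate(n_containing, n, p):
    """True if a structure found in `n_containing` of `n` sequences reaches p%."""
    return n_containing >= min_sequences(n, p)

if __name__ == "__main__":
    print(min_sequences(100, 85))
    print(is_candidate(84, 100, 85))
```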

rna is itself made of 2 scripts, sol and struct, that can be run separately (in this order):

sol is used to form, for each RNA sequence of a set, the sets of (possible) helices having the best free energies. This computation is very time-consuming because it uses a branch and bound algorithm, so we advise against processing sequences longer than 200 nucleotides.

syntax sol N < Seq > fSol

o N (the same N as in alea) is the number of RNA sequences

o Seq (the same Seq as in alea) is the file containing the RNA sequences

o fSol is the name of the file that will contain the sets of possible helices

example sol 100 < rnaSeq > rnaSol

Compute the sets of possible helices for each of the 100 RNA sequences contained in rnaSeq. The results are in rnaSol.

struct is used to locate the occurrences of candidate consensus secondary structures, using the sets of possible helices computed for each RNA sequence and a random model of these sequences.

syntax struct P Mod < fSol > Res

o P is the percentage of RNA sequences that must contain any candidate consensus structure

o Mod is the random model computed by alea

o fSol (the same fSol as in sol) is the file containing the sets of possible helices of the RNA sequences used in sol (note that fSol also contains the RNA sequences)

o Res is the name of the file that will contain the results

example struct 85 rnaMod < rnaSol > rnaRes

Compute the candidate secondary structures from the sets of possible helices contained in rnaSol that are present in at least 85% of the corresponding RNA sequences and whose frequencies in the random model rnaMod are low. The results are in rnaRes (see the file rnaRes in the data directory for the format of the outputs).

Running struct separately after sol is useful because struct is based on a fast algorithm: once sol has produced the sets of possible helices, you can try different values of the parameter P without recomputing them.
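As an illustration, the sketch below prints one struct command line per value of P, following the syntax given above. It only builds the command strings and does not run anything. The output file names of the form rnaRes_P are a naming choice made for this example, not a convention of the software.

```python
def struct_commands(p_values, model, helices_file):
    """Build one struct command line per percentage in `p_values`.

    Each command follows the syntax: struct P Mod < fSol > Res
    The result file name rnaRes_<P> is a hypothetical naming scheme.
    """
    return [
        f"struct {p} {model} < {helices_file} > rnaRes_{p}"
        for p in p_values
    ]

if __name__ == "__main__":
    for command in struct_commands([75, 85, 95], "rnaMod", "rnaSol"):
        print(command)
```

The printed lines can then be pasted into a shell, so that sol is run only once on rnaSeq.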