A program to predict the consensus secondary structure of a set of
unaligned RNA sequences
LIPN-UPRESA CNRS 7030
Université Paris 13, Institut Galilée
99 Avenue Jean-Baptiste Clément, 93430 Villetaneuse (France)
http://www-lipn.univ-paris13.fr
Introduction
We have designed a method based on a new representation of any RNA
secondary structure as a set of structural relationships between the
helices of the structure. We refer to this representation as a
structural pattern.
In a first step, we use thermodynamic parameters to select, for each
sequence, the k best secondary structures according to energy
minimization and we represent each of them using its corresponding
structural pattern.
In a second step, we search for the repeated structural patterns, i.e.
the largest structural patterns that occur in at least one sequence,
i.e. that are included in at least one of the structural patterns
associated with each sequence. Thanks to an efficient encoding of
structural patterns, this search comes down to identifying the largest
repeated word suffixes in a dictionary.
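As a rough illustration of this second step, the sketch below treats each structural pattern as a plain string and looks for the longest suffixes that appear in at least one word of every sequence (the second half of the definition above). The string encoding, the function names and the sample words are all hypothetical; the actual encoding of structural patterns is the one described in the paper and is not reproduced here.

```python
def suffixes(word):
    """Return every non-empty suffix of a word."""
    return {word[i:] for i in range(len(word))}


def largest_repeated_suffixes(words_per_sequence):
    """Return the maximal suffixes found in at least one word of every sequence.

    words_per_sequence holds one list of words per RNA sequence; each word
    stands for one (hypothetically encoded) structural pattern.
    """
    if not words_per_sequence:
        return []
    # All suffixes available in each sequence, over its candidate words.
    per_sequence = [
        set().union(*(suffixes(w) for w in words))
        for words in words_per_sequence
    ]
    # Suffixes shared by every sequence.
    common = set.intersection(*per_sequence)
    # Keep only suffixes that are not a proper suffix of another shared one.
    return sorted(
        s for s in common
        if not any(t != s and t.endswith(s) for t in common)
    )


# Toy dictionary: three sequences, each with its candidate words.
toy = [["abcd", "xcd"], ["zbcd", "qd"], ["bcd"]]
print(largest_repeated_suffixes(toy))
```

This brute-force version enumerates every suffix, which is fine for a toy example but is not meant to reflect the efficiency of the real implementation.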
In a third step, we
compute the plausibility of each repeated structural pattern by
checking if it occurs more frequently in the sequences under
investigation than in random RNA sequences. We then assume that the
consensus secondary structure corresponds to the repeated structural
pattern that displays the highest plausibility.
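The selection in this third step can be pictured with a small sketch. The score used here (frequency in the real sequences minus frequency in the random sequences) is only an assumption made for illustration; the actual plausibility measure is defined in the paper. The pattern names and frequencies are invented sample values.

```python
def plausibility(observed_freq, random_freq):
    """Hypothetical score: excess frequency over the random sequences."""
    return observed_freq - random_freq


def pick_consensus(observed, random_model):
    """Return the pattern with the highest score, plus all scores.

    observed and random_model map a pattern name to its frequency (0..1)
    in the real sequences and in the shuffled sequences respectively.
    """
    scores = {
        pattern: plausibility(freq, random_model.get(pattern, 0.0))
        for pattern, freq in observed.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Invented frequencies: pattern_B is more frequent overall, but it is
# also common in random sequences, so pattern_A stands out more.
observed = {"pattern_A": 0.90, "pattern_B": 0.95}
random_model = {"pattern_A": 0.10, "pattern_B": 0.80}
best, scores = pick_consensus(observed, random_model)
print(best)
```

The point of the example is that a pattern present in almost every sequence is not necessarily the best candidate if shuffled sequences display it almost as often.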
For more details, see "A new method to predict the consensus secondary
structure of a set of unaligned RNA sequences" (D. Bouthinon and
H. Soldano, Bioinformatics, Vol. 15, no. 10, 1999, pages 785-798).
The method described above is implemented as C programs and C-shell
scripts. Some limitations of the current implementation are listed
below:
- Processing RNA sequences longer than 200 nucleotides is time
  consuming (as with all combinatorial approaches based on a branch
  and bound algorithm).
- This release cannot handle consensus secondary structures made of
  more than 6 helices (a new version will overcome this limit).
- It is designed to run in a UNIX environment (checked on Solaris
  2.5.1).
Install
Run the script rnainstall, which is in the rna directory, then log out
and log in again. Note that rnainstall modifies your .cshrc file.
Compile
Run the script compile, which is in the rna directory. You need the gcc
compiler.
Run
The program should be executed from the directory containing your
sequences (as an example, the rna directory contains a data
subdirectory that contains the file rnaSeq). The program takes RNA
sequences as input and marks the occurrences of the candidate consensus
structures in the sequences as output. It is made of 2 scripts, alea
and rna, that you must run in this order (although they are located in
the bin directory, you can run them from anywhere).
- alea is used to create a random model of the RNA sequences: it takes
  a set of unaligned RNA sequences, randomly shuffles them and outputs
  a file containing the frequencies of all possible secondary
  structures (having at most 6 helices) of these random sequences.
  syntax: alea N Mod < Seq
    o N is the number of RNA sequences taken from Seq (from the
      beginning)
    o Mod is the name of the output file that will contain the random
      model of the RNA sequences
    o Seq is the name of the input file containing the RNA sequences
      (see the file rnaSeq in the data directory for the format of the
      sequences)
  example: alea 100 rnaMod < rnaSeq
  This creates the random model of 100 RNA sequences contained in the
  file rnaSeq. rnaMod is the name of the file that will contain the
  random model.
- rna locates the occurrences of candidate consensus secondary
  structures of the set of unaligned RNA sequences, using a random
  model of these sequences.
  syntax: rna N P Mod < Seq > Res
    o N (the same N as in alea) is the number of RNA sequences
    o P is the percentage of RNA sequences in which any candidate
      consensus structure must be present
    o Mod is the random model computed by alea
    o Seq (the same Seq as in alea) is the file containing the RNA
      sequences
    o Res is the name of the file that will contain the results
  example: rna 100 85 rnaMod < rnaSeq > rnaRes
  This computes the candidate secondary structures that are present in
  at least 85% of the 100 RNA sequences contained in rnaSeq, and whose
  frequencies in the random model rnaMod are low. The results are in
  rnaRes (see the file rnaRes in the data directory for the format of
  the outputs).
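To make the idea of a random model more concrete, here is a minimal sketch of the shuffling that alea is described as performing before it counts structures. It assumes that each sequence is shuffled on its own, nucleotide by nucleotide, which keeps its length and base composition; the README does not say exactly how alea shuffles, so this is an assumption. The function names, the seed parameter and the sample sequences are hypothetical.

```python
import random


def shuffle_sequence(seq, rng):
    """Return a random permutation of the nucleotides of one sequence."""
    letters = list(seq)
    rng.shuffle(letters)  # in-place shuffle of the list of letters
    return "".join(letters)


def shuffle_all(seqs, seed=0):
    """Shuffle every sequence independently with a reproducible generator."""
    rng = random.Random(seed)
    return [shuffle_sequence(s, rng) for s in seqs]


# Invented sample sequences, not taken from the rnaSeq file.
seqs = ["GGGAAACCCUUU", "ACGUACGU"]
shuffled = shuffle_all(seqs, seed=1)
for original, randomized in zip(seqs, shuffled):
    print(original, "->", randomized)
```

Shuffled sequences of this kind keep the composition of the originals but lose any conserved structure, which is what makes them a usable baseline for structure frequencies.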
rna is itself made of 2 scripts, sol and struct, that can be run
separately (in this order):
- sol computes, for each RNA sequence of a set, the sets of (possible)
  helices having the best free energies. This is a very time-consuming
  computation based on a branch and bound algorithm, so we advise
  against processing sequences longer than 200 nucleotides.
  syntax: sol N < Seq > fSol
    o N (the same N as in alea) is the number of RNA sequences
    o Seq (the same Seq as in alea) is the file containing the RNA
      sequences
    o fSol is the name of the file that will contain the sets of
      possible helices
  example: sol 100 < rnaSeq > rnaSol
  This computes the sets of possible helices for each of the 100 RNA
  sequences contained in rnaSeq. The results are in rnaSol.
- struct is used to locate the occurrences of candidate consensus
  secondary structures, starting from the sets of possible helices of
  each RNA sequence and from a random model of these sequences.
  syntax: struct P Mod < fSol > Res
    o P is the percentage of RNA sequences that must contain any
      candidate consensus structure
    o Mod is the random model computed by alea
    o fSol (the same fSol as in sol) is the file containing the sets
      of possible helices of the RNA sequences used in sol (note that
      fSol also contains the RNA sequences)
    o Res is the name of the file that will contain the results
  example: struct 85 rnaMod < rnaSol > rnaRes
  This computes, from the sets of possible helices contained in rnaSol,
  the candidate secondary structures that are present in at least 85%
  of the corresponding RNA sequences and whose frequencies in the
  random model rnaMod are low. The results are in rnaRes (see the file
  rnaRes in the data directory for the format of the outputs).
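The effect of the parameter P can be checked with a little arithmetic. The sketch below converts a percentage into a minimum number of sequences and filters candidates accordingly. Rounding up to the next whole sequence is an assumption about how "at least P%" is interpreted, and the function names and occurrence counts are invented for the example.

```python
def min_sequences(n, p):
    """Smallest number of sequences that amounts to at least p% of n.

    Integer arithmetic is used to round up without floating point error.
    """
    return (n * p + 99) // 100


def keep_candidates(occurrences, n, p):
    """Keep the candidates found in at least p% of the n sequences.

    occurrences maps a candidate name to the number of sequences
    in which it occurs.
    """
    needed = min_sequences(n, p)
    return [name for name, count in occurrences.items() if count >= needed]


# Invented counts for N = 100 sequences and P = 85.
occurrences = {"candidate_1": 92, "candidate_2": 85, "candidate_3": 84}
print(min_sequences(100, 85))
print(keep_candidates(occurrences, 100, 85))
```

With N = 100 the threshold is simply P sequences; with a set whose size is not a multiple of 100, the rounding convention decides borderline cases, so it is worth checking the results for values of P close to the one you need.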
Running struct separately after sol is useful because struct is based
on a fast algorithm, so you can try different values for the parameter
P without recomputing the sets of possible helices.
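One possible way to organize such trials is to prepare one struct command line per value of P, all reading the same rnaSol file. The sketch below only builds and prints the command lines following the syntax given above; it does not run struct. The helper name and the result file naming scheme (rnaRes.85, rnaRes.90, ...) are hypothetical choices made for this example.

```python
def struct_commands(percentages, model="rnaMod", helices="rnaSol",
                    prefix="rnaRes"):
    """Build one 'struct P Mod < fSol > Res' command line per value of P.

    Each result goes to its own file, named prefix.P (a hypothetical
    naming scheme), so that earlier results are not overwritten.
    """
    return [
        f"struct {p} {model} < {helices} > {prefix}.{p}"
        for p in percentages
    ]


# Command lines for three values of P, reusing a single run of sol.
for line in struct_commands([80, 85, 90]):
    print(line)
```

The printed lines can be pasted into a shell once sol has produced rnaSol and alea has produced rnaMod.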