-
Pierre Nicodeme, CR-CNRS, University Paris13
-
"Statistics of words and motifs"
-
The amount of textual data increased exponentially during the last decades.
The discovery of the structure of DNA in the sixties and later large scale
sequencing provided very long sequences that are hard to analyze.
While string-matching tools based on automata (Knuth-Morris-Pratt, Boyer-Moore,
Aho-Corasick) gave the possibility to locate the query pattern within the
target string, it became important to study the statistics of word
or motif occurrences; this allows to find patterns with an exceptional
behavior with respect to the background distribution of the letters.
- We will present first pure combinatorial methods for counting occurrences
of words:
- The language decomposition method that has been introduced by
Guibas and Odlyzko and later extended by Regnier and Szpankowski and
by Bassino et al.
- The analytic inclusion-exclusion method from Goulden and Jackson
that allowed solving difficult open problems.
- We present next methods that use automata construction for computing
"motif statistics" and "hidden motif statistics".
Both approaches use generating functions and singularity analysis,
automatic asymptotics of the moments, and lead to asymptotic laws.