Pierre Nicodeme, CR-CNRS, University Paris13

"Statistics of words and motifs"

The amount of textual data increased exponentially during the last decades. The discovery of the structure of DNA in the sixties and later large scale sequencing provided very long sequences that are hard to analyze. While string-matching tools based on automata (Knuth-Morris-Pratt, Boyer-Moore, Aho-Corasick) gave the possibility to locate the query pattern within the target string, it became important to study the statistics of word or motif occurrences; this allows to find patterns with an exceptional behavior with respect to the background distribution of the letters.

We will present first pure combinatorial methods for counting occurrences of words:
1. The language decomposition method that has been introduced by Guibas and Odlyzko and later extended by Regnier and Szpankowski and by Bassino et al.
2. The analytic inclusion-exclusion method from Goulden and Jackson that allowed solving difficult open problems.
We present next methods that use automata construction for computing "motif statistics" and "hidden motif statistics".

Both approaches use generating functions and singularity analysis, automatic asymptotics of the moments, and lead to asymptotic laws.