Since the start of my PhD in October 2014, I have been working on my thesis, "Fouille de motifs et modélisation statistique pour l'extraction de connaissances textuelles", supervised by Thierry Charnois and co-supervised by Nadi Tomeh.


Recent Publications:



Exploration of Textual Sequential Patterns

Hedi-Théo Sahraoui, Pierre Holat, Peggy Cellier, Thierry Charnois, Sebastien Ferré

14th International Conference on Formal Concept Analysis (ICFCA 2017), Rennes, France, June 2017

Abstract: The extraction of regularities in texts is important for several natural language processing tasks. For instance, in information extraction, regularities make it possible to discover linguistic patterns [4] or to study the stylistics of authors [9]. When looking for those regularities, some specificities of textual data have to be taken into account: the sequentiality of the data (i.e., the order between words), the different levels of abstraction (i.e., words, lemmas, Part-Of-Speech (POS) tags) and specific constraints (e.g., “the regularities have to contain a verb”). SDMC (Sequential Data Mining under Constraints) [3, 2] is a sequential pattern mining tool that deals with all those requirements. From a text, the tool extracts regularities called sequential patterns, i.e., sequences of words, lemmas, and POS tags that frequently appear together in the text. In order to extract such patterns mixing different levels of abstraction, each word in the text is represented by itself but also by its lemma and its POS tags. In addition, SDMC allows constraints to be applied to filter the extracted patterns: widespread constraints in data mining like minimum frequency (support) but also text-specific constraints like “contains a verb”.
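As a rough illustration of the idea described in this abstract, here is a minimal Python sketch. It is not SDMC's actual implementation or interface: the names `occurs`, `support` and `mine`, the toy corpus, the tag set and the way the "contains a verb" constraint is checked (presence of the POS item `VERB`) are all assumptions made for the example. Each word is represented by itself, its lemma and its POS tag, and a pattern item may match any of the three levels.

```python
from typing import Dict, FrozenSet, List, Tuple

Token = Tuple[str, str, str]   # (word, lemma, POS tag): the three abstraction levels
Pattern = Tuple[str, ...]      # each item may match any level of a token


def occurs(pattern: Pattern, sentence: List[Token]) -> bool:
    """True if the pattern items match tokens of the sentence in order (gaps allowed)."""
    pos = 0
    for item in pattern:
        # Greedy leftmost match: skip tokens until one carries this item.
        while pos < len(sentence) and item not in sentence[pos]:
            pos += 1
        if pos == len(sentence):
            return False
        pos += 1
    return True


def support(pattern: Pattern, corpus: List[List[Token]]) -> int:
    """Number of sentences in which the pattern occurs."""
    return sum(occurs(pattern, s) for s in corpus)


def mine(corpus: List[List[Token]], min_support: int, max_len: int,
         verb_tags: FrozenSet[str] = frozenset({"VERB"})) -> Dict[Pattern, int]:
    """Level-wise mining of frequent patterns, then a 'contains a verb' filter."""
    items = sorted({x for s in corpus for tok in s for x in tok})
    frequent: Dict[Pattern, int] = {}
    frontier: List[Pattern] = [()]
    for _ in range(max_len):
        next_frontier = []
        for prefix in frontier:
            for item in items:
                cand = prefix + (item,)
                sup = support(cand, corpus)
                # Minimum support is anti-monotone: infrequent patterns are never extended.
                if sup >= min_support:
                    frequent[cand] = sup
                    next_frontier.append(cand)
        frontier = next_frontier
    # Hypothetical text-specific constraint: the pattern must contain a verb POS item.
    # It is not anti-monotone, so it is applied as a post-filter here.
    return {p: s for p, s in frequent.items() if any(i in verb_tags for i in p)}


corpus = [
    [("the", "the", "DET"), ("cat", "cat", "NOUN"), ("sleeps", "sleep", "VERB")],
    [("the", "the", "DET"), ("dogs", "dog", "NOUN"), ("sleep", "sleep", "VERB")],
    [("a", "a", "DET"), ("cat", "cat", "NOUN"), ("eats", "eat", "VERB")],
]
patterns = mine(corpus, min_support=2, max_len=3)
```

On this toy corpus, the result mixes levels of abstraction: for example, the pattern ("the", "NOUN", "VERB") combines a word with two POS tags, while ("DET", "NOUN") is frequent but discarded because it contains no verb item. The brute-force candidate generation is only meant for tiny inputs.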

SDMC



Weakly-supervised Symptom Recognition for Rare Diseases in Biomedical Text

Pierre Holat, Nadi Tomeh, Thierry Charnois, Delphine Battistelli, Marie-Christine Jaulent, Jean-Philippe Métivier

The 15th International Symposium on Intelligent Data Analysis (IDA 2016), Stockholm, Sweden, October 2016

Abstract: In this paper, we tackle the issue of symptom recognition for rare diseases in biomedical texts. Symptoms typically have a more complex and ambiguous structure than other biomedical named entities. Furthermore, existing resources are scarce and incomplete. Therefore, we propose a weakly-supervised framework based on a combination of two approaches: sequential pattern mining under constraints and sequence labeling. We use unannotated biomedical paper abstracts with dictionaries of rare diseases and symptoms to create our training data. Our experiments show that both approaches outperform simple projection of the dictionaries on text, and that their combination is beneficial. We also introduce a novel pattern mining constraint based on semantic similarity between words inside patterns.

Companion webpage (To appear)
Software will be available
Data will be available



Fouille de motifs et CRF pour la reconnaissance de symptômes dans les textes biomédicaux

Pierre Holat, Nadi Tomeh, Thierry Charnois, Delphine Battistelli, Marie-Christine Jaulent, Jean-Philippe Métivier

23e conférence sur le Traitement Automatique des Langues Naturelles (TALN 2016), Paris, France, July 2016

Abstract: In this paper, we address the extraction of symptom-type medical entities from biomedical texts. This task has received little attention in the literature and, to our knowledge, no annotated corpus exists for training a learning model. We propose two weakly supervised approaches to extract these entities. The first is based on pattern mining and introduces a new semantic similarity constraint. The second casts the task as a sequence labeling task using CRFs (conditional random fields). We describe experiments showing that the two approaches are complementary in terms of quantitative evaluation (recall and precision). We further show that combining them noticeably improves the results.

Companion webpage (soon)
Software is available
Data are available



Classification de texte enrichie à l’aide de motifs séquentiels

Pierre Holat, Nadi Tomeh, Thierry Charnois

22e conférence sur le Traitement Automatique des Langues Naturelles (TALN 2015), Caen, France, June 2015

Abstract: In text classification, most methods based on statistical classifiers use words, or combinations of contiguous words, as features. If more information is to be taken into account, the number of non-contiguous features grows exponentially. To cope with this growth, sequential pattern mining makes it possible to efficiently extract a reduced number of features that are both frequent and relevant, thanks to the use of constraints. In this paper, we compare the use of frequent patterns under constraints with the use of delta-free patterns as features. We show the advantages and drawbacks of each type of pattern.

Companion webpage
Software is available
Data are not available for free



Sequence Classification Based on Delta-Free Sequential Patterns

Pierre Holat, Marc Plantevit, Chedy Raïssi, Nadi Tomeh, Thierry Charnois and Bruno Crémilleux

14th IEEE Int. Conf. on Data Mining series (ICDM 2014), Shenzhen, China, December 2014.

Abstract: Sequential pattern mining is one of the most studied and challenging tasks in data mining. However, the extension of well-known methods from many other classical patterns to sequences is not a trivial task. In this paper we study the notion of delta-freeness for sequences. While this notion has been extensively discussed for itemsets, this work is the first to extend it to sequences. We define an efficient algorithm devoted to the extraction of delta-free sequential patterns. Furthermore, we show the advantage of delta-free sequences and highlight their importance when building sequence classifiers, and we show how they can be used to address the feature selection problem in statistical classifiers, as well as to build symbolic classifiers that optimize both accuracy and earliness of predictions.

Companion webpage
Software is available
Data are partially available