Research

Keywords: Knowledge graph, RDF, OWL, Rule mining, Data linkage

Rule mining

I am interested in mining logical rules in knowledge graphs (KG). These rules can be used to complete the KG, detect erroneous data, or uncover knowledge that is not explicitly stated in the ontology. We have developed systems that discover expressive rules involving constraints defined on numerical values (systems Regnum and CRA-Miner), or that focus on differential causal rules, which express that differences in treatments lead to differences in a studied characteristic (system DICARE-E and its variants).
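As a minimal illustrative sketch (not the algorithm of Regnum or CRA-Miner), the quality of one candidate rule with a numerical constraint can be measured by its support and confidence over a set of triples. The toy graph, the predicate names and the threshold below are all hypothetical.

```python
# Toy knowledge graph as (subject, predicate, object) triples.
# All entities and predicates are hypothetical, for illustration only.
TRIPLES = {
    ("alice", "age", 34), ("bob", "age", 16),
    ("carol", "age", 41), ("dave", "age", 15),
    ("alice", "hasLicense", True),
    ("carol", "hasLicense", True),
    ("dave", "hasLicense", True),
}

def rule_quality(triples, body_pred, threshold, head_pred):
    """Support and confidence of the rule
    body_pred(x, v) AND v >= threshold  =>  head_pred(x, True)."""
    # Subjects that satisfy the body (numerical constraint included).
    body = {s for s, p, o in triples if p == body_pred and o >= threshold}
    # Subjects for which the head is stated in the graph.
    head = {s for s, p, o in triples if p == head_pred and o is True}
    support = len(body & head)
    confidence = support / len(body) if body else 0.0
    return support, confidence

support, confidence = rule_quality(TRIPLES, "age", 18, "hasLicense")
print(support, confidence)
```

A rule miner would explore many candidate bodies and thresholds and keep the rules whose scores exceed chosen minimums; this sketch only scores a single, fixed rule.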

Data Linkage

I am also interested in data linkage approaches that exploit knowledge declared in the ontology. To combine and reason about data coming from different RDF data sources, semantic links are needed to connect resources. In particular, identity links make it possible to declare that two data items refer to the same real-world object, e.g. the same hotel or the same lab. Based on these links, it is possible to combine information about the same real-world entity.
We have defined several data linkage approaches. The first one was a logical approach named L2R, in which some ontology axioms and additional knowledge about the data sources are automatically translated into Horn rules and used to infer identity and non-identity links. We have also proposed a numerical approach named N2R, in which knowledge semantics is automatically translated into non-linear equations that are used to compute similarity scores for pairs of individuals.
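To give a flavour of rule-based inference of identity links, here is a minimal sketch that applies a single Horn rule, p(x, v) AND p(y, v) => sameAs(x, y), for a property p assumed to be declared inverse functional. It is not the L2R rule set; the facts, the identifiers and the "isbn" property are hypothetical.

```python
from itertools import combinations

# Hypothetical facts coming from two sources s1 and s2.
FACTS = [
    ("s1:book1", "isbn", "978-0"),
    ("s2:item9", "isbn", "978-0"),
    ("s2:item7", "isbn", "978-1"),
]

def infer_same_as(facts, inverse_functional):
    """Apply p(x, v) AND p(y, v) => sameAs(x, y) for each inverse functional p."""
    # Group subjects by (property, value).
    by_value = {}
    for subject, pred, value in facts:
        if pred in inverse_functional:
            by_value.setdefault((pred, value), set()).add(subject)
    # Every pair of subjects sharing a value yields one identity link.
    links = set()
    for subjects in by_value.values():
        for x, y in combinations(sorted(subjects), 2):
            links.add((x, y))
    return links

print(infer_same_as(FACTS, {"isbn"}))
```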
Discriminative properties (key axioms) that can be used for data linkage are difficult to determine, even for domain experts. We have therefore developed approaches that discover composite keys from RDF data sources. KD2R exploits data sources for which the Unique Name Assumption is stated. SAKey can discover keys when datasets may contain duplicates or erroneous property values. Vickey is an efficient approach that discovers conditional keys (keys that are valid for class expressions). We have also recently proposed an approach named RE-Miner that mines graph patterns that can be used as referential expressions in a data linkage task.
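The notion of a composite key that tolerates a few exceptions can be sketched as follows. This brute-force check is only loosely inspired by the idea of keys with exceptions and is not the published KD2R, SAKey or Vickey algorithm; the data, the property names and the max_exceptions parameter are hypothetical.

```python
from itertools import combinations

# Hypothetical instance descriptions: property -> set of values.
DATA = {
    "p1": {"firstName": {"Ann"}, "lastName": {"Lee"}, "birthYear": {1980}},
    "p2": {"firstName": {"Ann"}, "lastName": {"Roy"}, "birthYear": {1980}},
    "p3": {"firstName": {"Bob"}, "lastName": {"Lee"}, "birthYear": {1975}},
}

def count_exceptions(data, properties):
    """Count instances sharing a value with another instance on every property."""
    exceptions = set()
    for (a, desc_a), (b, desc_b) in combinations(data.items(), 2):
        if all(desc_a.get(p, set()) & desc_b.get(p, set()) for p in properties):
            exceptions.update((a, b))
    return len(exceptions)

def is_key(data, properties, max_exceptions=0):
    """True if the property set distinguishes all but max_exceptions instances."""
    return count_exceptions(data, properties) <= max_exceptions

print(is_key(DATA, ["firstName"]))
print(is_key(DATA, ["firstName", "lastName"]))
```

Checking every pair of instances is quadratic and enumerating every property set is exponential, which is why dedicated key discovery approaches rely on pruning strategies instead.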
We have also investigated how invalid sameAs links can be detected, either logically or by using the topological properties of the sameAs graph of the Linked Open Data cloud. We have also proposed a weaker, contextual representation of identity links that does not follow the strict OWL2 semantics of the sameAs construct.

Data integration - Semantic annotation of web documents

I have investigated ontology-based approaches that aim to semantically annotate web documents. I was interested in approaches that are guided by the syntactic structure of documents or parts of documents. In the setting of the e.dot project, we investigated the semantic annotation of tabular data, since annotation tools can benefit from the structure of the tables. The idea was to semantically annotate as much information as possible while giving users access to elements of the original table, in order to limit the errors due to wrong annotations (Xtab2SML).

In the SHIRI project, we have also exploited the syntactic structure of heterogeneous HTML documents in an annotation process. In the developed approaches, the HTML structure is used either to better rank the annotations when they appear in the most structured parts of the documents (SHIRI-Querying) or to provide semantic relations that are difficult to discover with lexico-syntactic patterns (REISA).

Formal Concept Analysis - Automatic construction of class hierarchies over XML data

I have also worked on approaches that help users access large amounts of XML data by clustering them into a small number of classes described at different levels of abstraction. Our basic representation was a Galois lattice, which is a well-defined and exhaustive representation of the classes embedded in a data set (approach ZOOM).
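As a minimal sketch of what a Galois lattice contains, the following brute-force code enumerates the formal concepts (pairs of a maximal object set and the attribute set they share) of a tiny binary context. It is not the ZOOM approach; the documents and attributes are hypothetical, and the enumeration is exponential in the number of attributes.

```python
from itertools import combinations

# Hypothetical formal context: object -> set of attributes.
CONTEXT = {
    "doc1": {"title", "author"},
    "doc2": {"title", "author", "price"},
    "doc3": {"title", "price"},
}

def extent(context, attrs):
    """Objects that have all the given attributes."""
    return frozenset(o for o, a in context.items() if attrs <= a)

def intent(context, objs):
    """Attributes shared by all the given objects."""
    all_attrs = set().union(*context.values())
    return frozenset(a for a in all_attrs if all(a in context[o] for o in objs))

def formal_concepts(context):
    """Enumerate (extent, intent) pairs by closing every attribute subset."""
    all_attrs = sorted(set().union(*context.values()))
    concepts = set()
    for size in range(len(all_attrs) + 1):
        for attrs in combinations(all_attrs, size):
            objs = extent(context, set(attrs))
            concepts.add((objs, intent(context, objs)))
    return concepts

for objs, attrs in sorted(formal_concepts(CONTEXT), key=lambda c: -len(c[0])):
    print(sorted(objs), sorted(attrs))
```

Ordering these concepts by inclusion of their extents gives the lattice, from the most general class (all documents) down to the most specific ones.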

Some Recent Publications

REGNUM: Generating Logical Rules with Numerical Predicates in Knowledge Graphs

Armita Khajeh Nassiri, Nathalie Pernelle, Fatiha Saïs (2023)
Proceedings of the Semantic Web - 20th International Conference, (ESWC, 2023)
Hersonissos, Crete, Greece, May 28 - June 1, p.856-867.

Discovering Causal Rules in Knowledge Graphs using Graph Embeddings.

Lucas Simonne, Nathalie Pernelle, Fatiha Saïs, Rallou Thomopoulos (2022)
International Joint Conference on Web Intelligence and Intelligent Agent Technology, (WIC-WI-IAT)
Niagara Falls, ON, Canada, November 17-20, p.856-867.

BECKEY: Understanding, comparing and discovering keys of different semantics in knowledge bases

Danai Symeonidou, Vincent Armant, Nathalie Pernelle (2020)
Knowledge Based Systems, vol 195, pp 105708


PhD Students

  • Hugo Attali (2023-?) : Representation learning on complex graphs with GNNs for knowledge graph denoising
  • Vincent Beugnet (2022-?) : Incremental classification in Knowledge graphs
  • Lucas Simonne (2019–2023) : Discovery of differential causal rules in Knowledge graphs
  • Armita Khajeh Nassiri (2020–2023) : Expressive Rule Discovery for Knowledge Graph Refinement
  • Thamer Mecharnia (2018-2022) : Semantic approaches for predicting the presence of asbestos in buildings: a probabilistic approach and a rule-based approach
  • Joe Raad (2015–2018) : Identity management in Knowledge graphs (first runner-up (accessit) for the AFIA best PhD thesis in AI, 2019)
  • Danai Symeonidou (2011–2014) : Automatic Key discovery for data linkage
  • Yassine Mrabet (2008-2012) : Hybrid approaches for semantic information retrieval: integration of knowledge bases and
  • Mouhamadou Thiam (2006–2010) : Semantic annotation of semi-structured documents for information retrieval
  • Fatiha Saïs (2004-2007) : Semantic data integration guided by an ontology

Projects

  • ERMES (2024-2028), ANR project (public health)
    Partners : LEPS (USPN), LIPN (RCLN and A3 teams), EconomiX, AgroParisTech, INRAE
    ERMES (Enfants Récepteurs et Messagers pour l'Éducation à la Santé) aims to characterise how food messages are transmitted between peers and the effects of peer education programmes on children's eating behaviour. In particular, we propose to explore the dietary messages that children receive and relay and to gain a better understanding of which profiles of children are most likely to be influenced by these messages. This interdisciplinary project will involve researchers in the social sciences and humanities (anthropology, sociology, education sciences), economics and computer science (knowledge representation, machine learning).
  • GenMific (2022-2025), EcosNord.
    Partners : Monterrey (Mexico), UNAM (Mexico), LIPN.
    Definition of methods that generate Micro-Fictions (collaboration with linguists).
  • DataForYou (2018-2020), SATT Saclay.
    Maturation of the LN2R and SAKEY software within the DataForYou start-up, with the aim of creating a GDPR compliance tool and a data anonymisation tool for local authorities.
  • Liones (2015-2018), Center for Data Science, Idex Paris Saclay
    Partners: INRAE/AgroParisTech (computer scientists and biologists)
    The aim of this interdisciplinary project was to model semantic links between multi-scale and multi-stage observations (topic: stabilisation of micro-organisms such as yeast). The resulting model can be used to reduce the environmental cost of the industrial processes involved.
  • Qualinca (2012-2016), ANR-Contint
    Partners: ABES (Agence Bibliographique de l’Enseignement Supérieur), INA (Institut National de l’Audiovisuel), LRI, LIRMM (Université Montpellier 2), LIG (Université de Grenoble)
    Qualinca (Quality and Interoperability of Large Catalogues) was a fundamental research project that aimed to develop algorithms for qualifying and improving the quality level of an existing documentary database.
  • EIT ICT labs - DataBridges activity (2011-2012)
    Partners: 17 European partners.
    The aim of the DataBridges activity "Data Integration in Digital Cities" was to build open and extensible systems for integrating heterogeneous data sources concerning digital cities.
  • HEDI (Heterogeneous Electronic Data Integration) (2009-2010)
    Partners: LRI, Thales Corporate Services
    The aim of this collaboration agreement was to define reconciliation tools for Thales Corporate Services, enabling descriptions of electronic components to be reconciled efficiently. The project involved 3 research professors from the IASI/LEO team at the University of Paris Sud (now Paris Saclay).
  • GEONTO (2008-2011), ANR project (Masse de Données et de Connaissances)
    Partners: LRI, COGIT (IGN, Institut Géographique National), LIUPPA (Université de Pau et des Pays de l’Adour), IRIT (Toulouse)
    The GEONTO project was concerned with the interoperability of heterogeneous data relating to geographical information.
  • Digiteo SHIRI (2007-2011)
    Partners: LRI, SUPELEC
    The aim of this project (Hybrid Integration System for Information Retrieval in heterogeneous resources) was to enable a user to access parts of HTML documents responding to a query formulated in the vocabulary described in a domain ontology.
  • PICSEL 3 (2005 – 2008), industrial contract
    Partners: LRI (Université Paris Sud), France Télécom R&D
    Ongoing collaboration of the LRI lab with France Télécom R&D through the series of PICSEL projects (Production d'Interface à base de Connaissances pour des Services En Ligne). The aim was to implement a generic, declarative environment for building semantic portals on top of distributed and heterogeneous data sources relating to the same application domain.
  • ACI MDD (2003 – 2006), ACI Masse de données project
    Partners: LRI Lab, LIP6, MOSTRARE (INRIA Lille)
    Definition of tools for information retrieval, information extraction and the classification of XML documents. I took part in the definition and development of a tool for discovering concept hierarchies from free text, making it possible to structure a set of XML documents.
  • e.DOT (2002-2005), RNTL project
    Partners: LRI, VERSO (INRIA Rocquencourt), MIA department (INRA), Xyleme company.
    The aim was to develop tools for building and maintaining a data warehouse relating to a specific field by automatically integrating information discovered on the Web. Chosen field of application: microbiological risk in food. Development of a model and an approach for the automatic semantic annotation of tabular data.
  • GAEL (1999 – 2001), RNRT pre-competitive project
    Partners: LRI, VERSO (INRIA Rocquencourt), MATCHVISION company.
    The aim was to design tools to help develop electronic catalogues on the web. The defined ZooM system, formally based on nested Galois lattices, makes it possible to group semi-structured data according to two different levels of abstraction.

Teaching

I have been co-head of the Computer Science department of the Galilée Institute of Université Sorbonne Paris Nord (USPN) since 2021.

Current Teaching

I teach at Université Sorbonne Paris Nord, in the Computer Science department of the Galilée Institute (Sup Galilée engineering school and bachelor's degree in computer science).
I teach courses related to programming languages (Python, C), databases, UML, semantic web, and knowledge representation.

Former teaching at Paris Saclay University

I taught from 2001 to 2019 at Paris Saclay University, mostly at the technological institute of Sceaux (databases, information systems, data analysis) and also in the Master of Data Sciences (semantic web).
I have also taught in China (USTHB University) and in Senegal (University of Thies).