Acquisition of knowledge from texts
Knowledge discovery remains a major challenge for interpreting, exploiting, and exploring corpora at the scale of the Semantic Web and the Web of Data. The team intends to continue its work on this axis by developing acquisition methods that draw not only on heterogeneous textual sources but also on ontologies and Semantic Web data, while allowing the extracted knowledge to be reused and adapted to new domains. The approaches developed combine linguistic analysis, the exploitation of existing knowledge, and machine learning and data mining techniques, with an emphasis on unsupervised, open-domain approaches.
Construction of knowledge bases from alignments between reference ontologies
Data is published on the Web using Semantic Web technologies with the aim of simplifying data discovery and addressing the integration of heterogeneous data. To integrate several datasets, putting them into an interoperable machine-readable format is not enough: for effective integration, separate datasets should be linked together through pivot data. Reference ontologies defining such pivot data are beginning to be published on the Linked Open Data (LOD) cloud or on the French Linked Data Web. They are nevertheless far from comprehensive, and new knowledge bases must still be built to cover specific use cases. The RCLN team aims to propose new methods for producing complex knowledge bases that integrate reference ontologies. The difficulty lies in establishing correspondences not blindly, but by taking into account the use cases and the associated information sources. The problem comes down to an N-ary mapping that concerns both semantic entities and source entities. This mapping will rely on ontology design patterns as well as anti-patterns: design patterns guide the acquisition of N-ary relations, while anti-patterns filter out candidate N-ary relations whose existence would introduce inconsistencies into the knowledge base.
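As an illustration of how anti-patterns could screen candidate relations, the sketch below applies a class-disjointness anti-pattern to hypothetical candidates. The class names, disjointness axioms, and relation instances are all invented for the example; they do not come from the team's actual ontologies or method.

```python
# Hypothetical sketch: filtering candidate N-ary relations with an
# anti-pattern. All names and axioms below are invented examples.

# Disjointness axioms as they might appear in reference ontologies:
DISJOINT = {("Person", "Organization"), ("Organization", "Person")}

# Candidate N-ary relations produced by an acquisition step; each
# argument is annotated with the class assigned to the entity.
candidates = [
    {"relation": "employs",
     "args": [("ACME", "Organization"), ("Alice", "Person")]},
    # Inconsistent: the same entity typed with two disjoint classes.
    {"relation": "employs",
     "args": [("Bob", "Organization"), ("Bob", "Person")]},
]

def violates_antipattern(candidate):
    """Anti-pattern: one entity receives two mutually disjoint classes."""
    types = {}
    for entity, cls in candidate["args"]:
        for seen in types.get(entity, set()):
            if (seen, cls) in DISJOINT:
                return True
        types.setdefault(entity, set()).add(cls)
    return False

# Keep only candidates that raise no inconsistency.
clean = [c for c in candidates if not violates_antipattern(c)]
```

Here the second candidate is discarded because asserting it would type the same entity with two disjoint classes, exactly the kind of inconsistency an anti-pattern is meant to flag.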
Available semantic resources often lack rich domain relationships, and relation extraction approaches are often tuned to a specific domain. The RCLN team addresses this issue through an unsupervised machine learning approach in which the information extraction process is guided by the corpus, the objective being to apply it to different specialized domains. Several types of information are available: the nature of the entities, the sequence of words between them, and their syntactic relations in the text. The RCLN team wishes to explore different combinations of this information by relying on sequential pattern mining techniques. The first experiments confirm that the pattern mining approach makes it possible to discover and identify new types of semantic relations. The goal is to automatically generate syntactic patterns capable of characterizing the different kinds of relations, which would ultimately allow them to be labeled.
The RCLN team also plans to combine this text mining work with the data mining work, in particular graph mining (graph abstraction), carried out in the A3 team. This involves both refining the semantic relations already discovered and extending them by discovering new types of links.
Resource extraction and linguistic variations
The team’s work on adapting data mining and machine learning methods has yielded promising results in corpus linguistics. In stylistics, we have shown that patterns characteristic of an author or a literary genre can be discovered by extracting emerging motifs. The aim is to extend the method by taking into account various clues, such as the topology of patterns in the corpus and different levels of annotation (lexical, syntactic, semantic, etc.). Building on the first work of the Néoveille project on the automatic detection of formal neologisms, the team plans to combine pattern mining approaches (to discover the typical sequences in which a lexical unit occurs) with distributional semantics approaches (to obtain a view of similar lexical units at a given period) in order to identify and characterize semantic neologisms. This work will make it possible to tackle the problem of word sense induction from corpora.
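One way to sketch the distributional side of semantic neologism detection is to compare a word's nearest neighbours in embedding spaces built from two periods: a low neighbour overlap hints at a semantic shift. The tiny hand-made vectors below are purely illustrative stand-ins for real diachronic embeddings.

```python
import math

# Invented 3-dimensional "embeddings" for two periods; in practice these
# would be learned from period-specific corpora.
period_1990 = {
    "virus": [0.9, 0.1, 0.0],
    "disease": [0.8, 0.2, 0.1],
    "infection": [0.85, 0.15, 0.05],
    "computer": [0.0, 0.9, 0.1],
}
period_2020 = {
    "virus": [0.1, 0.85, 0.1],   # usage has drifted toward computing
    "disease": [0.8, 0.2, 0.1],
    "infection": [0.85, 0.15, 0.05],
    "computer": [0.0, 0.9, 0.1],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def neighbours(word, space, k=2):
    """The k words most similar to `word` in a given embedding space."""
    others = [(w, cosine(space[word], v)) for w, v in space.items() if w != word]
    return {w for w, _ in sorted(others, key=lambda x: -x[1])[:k]}

def shift_score(word):
    """Jaccard distance between neighbour sets across the two periods;
    higher values suggest a candidate semantic neologism."""
    n1 = neighbours(word, period_1990)
    n2 = neighbours(word, period_2020)
    return 1 - len(n1 & n2) / len(n1 | n2)
```

With these toy vectors, a stable word like "disease" keeps the same neighbours across periods (score 0), while "virus" changes neighbourhood and receives a high score; the pattern mining side would then characterize the new contexts in which the shifted lexical unit occurs.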
One of the goals will be to produce FrameNet-style lexical-semantic resources from large corpora.