Semantic annotation and textual exploration

Semantic analysis occupies an important place in the work of the RCLN team. With the development of the web and semantic technologies, this analysis tends to be encoded as a semantic annotation anchored in the source text: analyzing a text then amounts to affixing a coherent set of annotations to it, the semantics of these annotations being given by an explicit and more or less rich extra-linguistic formal model (a taxonomy, an ontology, a data graph).
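As an illustration of what "annotation anchored in the source text" can mean in practice, here is a minimal sketch of a stand-off annotation: a character span in the text paired with a concept identifier drawn from an explicit model. The class name `Annotation`, the helper functions and the toy `TAXONOMY` are hypothetical and are not taken from the team's tools.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Annotation:
    start: int    # character offset where the annotated span begins
    end: int      # character offset just past the last character of the span
    concept: str  # identifier of a concept in the semantic model


def covered_text(text: str, ann: Annotation) -> str:
    """Return the passage of the source text that the annotation is anchored to."""
    return text[ann.start:ann.end]


def is_valid(text: str, ann: Annotation, model: Dict[str, Optional[str]]) -> bool:
    """An annotation is valid if its span lies inside the text
    and its concept is defined in the semantic model."""
    return 0 <= ann.start < ann.end <= len(text) and ann.concept in model


# Hypothetical toy taxonomy: each concept maps to its parent (None for a root).
TAXONOMY = {"Agent": None, "Organization": "Agent", "Document": None}

text = "The ministry publishes the decree."
annotations = [
    Annotation(4, 12, "Organization"),
    Annotation(27, 33, "Document"),
]

for ann in annotations:
    print(covered_text(text, ann), "->", ann.concept, is_valid(text, ann, TAXONOMY))
```

Keeping the annotations separate from the text, and the concepts separate from both, is one way to let the model (taxonomy, ontology or data graph) evolve without rewriting the corpus.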

Annotated corpora are used as laboratory data (to train or evaluate analyzers) but also serve in different types of applications.

Fully automatic semantic analysis processes are rare: human intervention is required to annotate training data, design the semantic model guiding the analysis, write extraction rules or correct the results. The quality and robustness of the proposed methods depend on a sound balance between automatic computation and human interpretation.

Controlling the quality of manual annotations

Annotated data is essential to many NLP applications. Whether producing annotations or reusing annotated data, it is always important to control annotation quality. We have proposed a method for managing annotation campaigns that takes into account the annotation cost, the volume of corpus to be annotated and the expected quality. This method is based on an analysis grid used to evaluate a priori the complexity of an annotation task [CI-6].

We also studied the “natural” or “usual” annotations found in folksonomies, whose resources are associated with labels.

Training semantic annotators

For simple annotation tasks (i.e., semantic labeling), annotation tools can be trained on corpora previously annotated by hand. However, since it is difficult to annotate large corpora systematically, approaches are needed that train the annotator with little data, even if this means correcting erroneous predictions afterwards.

To minimize the human effort required to write annotation rules or to annotate training data, we proposed a hybrid approach and an interactive system that lets the user work on both the information extraction rules and the learning examples [CO-14, CO-31]. The thesis associated with this work shows that learning on a reduced corpus, in particular with an active learning module for the intelligent selection of examples, considerably reduces learning time without degrading performance.
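The idea of selecting examples intelligently can be sketched with a common active learning heuristic, least-confidence uncertainty sampling: the examples on which the current model is least sure are sent to the human annotator first. This is only an illustrative assumption; the selection criterion actually used in [CO-14, CO-31] is not described here, and the function names and probability values below are hypothetical.

```python
from typing import Dict, List


def least_confidence(probs: List[float]) -> float:
    """Uncertainty score: 1 minus the probability of the most likely label."""
    return 1.0 - max(probs)


def select_for_annotation(predictions: Dict[str, List[float]], k: int) -> List[str]:
    """Return the ids of the k unlabeled examples the model is least sure about."""
    ranked = sorted(
        predictions,
        key=lambda ex_id: least_confidence(predictions[ex_id]),
        reverse=True,  # most uncertain first
    )
    return ranked[:k]


# Hypothetical label distributions predicted by the current annotator model.
predictions = {
    "sent-1": [0.98, 0.01, 0.01],
    "sent-2": [0.40, 0.35, 0.25],
    "sent-3": [0.70, 0.20, 0.10],
    "sent-4": [0.34, 0.33, 0.33],
}

to_annotate = select_for_annotation(predictions, k=2)
print(to_annotate)
```

In an interactive loop, the selected examples would be annotated or corrected by the user, added to the training data, and the model retrained before the next selection round.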

In parallel, we explored a method for predicting semantic annotations from a small training corpus by modeling the task as a segment-based statistical machine translation system [CO-39]. The approach, tested on regulatory texts with only a small volume of initial annotations, shows that it can effectively assist human experts in their annotation work.

Combining analysis and interpretation

Thinking about the interaction between automatic semantic analysis and interpretation by the human analyst is essential for deploying natural language processing methods in real-world settings.

We have proposed a method and a tool (semex) for annotating regulatory texts in order to integrate them into decision support systems [CI-5]. The aim is to select the passages of the texts that express rules and to formalize them, so that they can be implemented in decision systems, used to explain the decisions taken, checked for consistency within the resulting rule base and, if necessary, updated when the source texts evolve [TU-1].

The annotation of the rules of legal and regulatory texts is also useful for the analysis of legal sources and the interpretations that can be made of them [CI-44]. An abstract controlled language, hCL [CO-63], has been proposed as an annotation language and we have shown how natural language processing tools can help translate rules written in natural language into this controlled language. We are also studying the LegalRuleML standard as an alternative to hCL [CI-44].

As semantic annotation relies both on the analysis of textual data and on extra-linguistic knowledge, it is necessary to design a dynamic process linking knowledge revision and analysis. This is the objective of Ivan Garrido Marquez’s thesis on blog annotation.
