Syntactic and semantic analysis

Despite the progress made, text analysis continues to raise new challenges, owing to the volume and diversity of the texts to be analyzed, but also to the depth of analysis expected: beyond surface analysis, the RCLN team’s project is to perform deep syntactic analysis and to combine syntactic and semantic analysis.

Syntactic analysis and neural networks

Recent advances in neural networks and their use in natural language processing have revived an old debate: should the analysis of language productions rely on more or less implicit structures, or can we stick to surface observations?

The objective is to adapt neural networks to syntactic analysis. This is a challenge because the targeted outputs of syntactic analyses are generally more complex structures than simple label sequences (constituency or dependency trees, or even graphs). Several approaches will be studied. On the one hand, we will transform our algorithms (for example those of the Lorg parser, cf. section 3.4) by using neural networks to transform the input data or to score hypotheses during the construction of analysis structures. In this case, we will use recurrent or convolutional networks, which are widely used as feature extractors. New architectures based solely on dynamically computed distributions, known as attention-based architectures,

Another avenue consists in designing neural architectures suited to the prediction of tree structures, as has been done for the prediction of sequential structures [7, 4], but the difficulty remains the incorporation of such structures into dynamic programming parsing algorithms. The reverse research direction will also be considered. Rather than trying to predict tree structures derived from theoretical linguistic models, it seems more interesting to us, in line with the work of [9], to enrich the syntactic information extracted, which seems more relevant for higher-level applications (semantic analysis, translation). It is also possible to imagine an a posteriori reconstruction of structures, constrained by this sequential information if it is sufficiently precise, by capitalizing on the team's work on constrained optimization for NLP. Energy-based networks for structured prediction [2] also rest on this type of hypothesis.
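To make the coupling between learned scores and dynamic programming concrete, here is a minimal sketch, not the Lorg parser's algorithm. It runs a CKY-style search for the best binary bracketing of a sentence, where each span score comes from an arbitrary function. The function `best_bracketing` and the table `toy_scores` are hypothetical; in practice the scores would come from a neural scorer, which the lookup table merely stands in for.

```python
from typing import Callable, Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) with end exclusive


def best_bracketing(n: int, score: Callable[[int, int], float]) -> Tuple[float, List[Span]]:
    """Best binary bracketing of n >= 1 tokens under additive span scores.

    score(i, j) is assumed to be supplied by any model (here a toy table).
    """
    best: Dict[Span, float] = {}
    split: Dict[Span, int] = {}

    # Width-1 spans are the leaves of the tree.
    for i in range(n):
        best[(i, i + 1)] = score(i, i + 1)

    # Wider spans: pick the split point maximizing the sum of the two halves.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            k_best = max(range(i + 1, j), key=lambda k: best[(i, k)] + best[(k, j)])
            split[(i, j)] = k_best
            best[(i, j)] = score(i, j) + best[(i, k_best)] + best[(k_best, j)]

    # Follow the stored split points to recover the spans of the best tree.
    spans: List[Span] = []

    def collect(i: int, j: int) -> None:
        spans.append((i, j))
        if j - i > 1:
            k = split[(i, j)]
            collect(i, k)
            collect(k, j)

    collect(0, n)
    return best[(0, n)], spans


# Hypothetical scores standing in for a network's output on "the cat sleeps".
toy_scores = {(0, 2): 2.0, (1, 3): 0.5}
total, spans = best_bracketing(3, lambda i, j: toy_scores.get((i, j), 0.0))
print(total, spans)
```

The search stays exact only because the total score decomposes over spans. A scorer that conditions on the whole tree would break this decomposition, which is one way to read the difficulty mentioned above.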

From syntax to applications

Beyond the quest to extract as much syntactic information as possible from a text, it is worth examining what syntactic analysis can bring to applications.

In connection with the other research axes of the RCLN team, the first task is to identify the most useful syntactic information, bearing in mind that the answer probably varies with the targeted application (relation extraction, entity extraction, semantic analysis, etc.).

One can also try to apply syntactic analysis methods to other problems. The prism of combinatorial optimization has proven extremely useful for solving difficult parsing problems. Building on the experience acquired with optimization techniques through interactions with the AOC team, the RCLN team plans to extend their application, either to carry out different analyses jointly rather than sequentially (identification of linguistic units: tokens, multiword expressions, entities), or to use semantic models for which parsing is intrinsically complex, such as hyperedge replacement grammars (HRG) [3].

Syntax and distributional semantics

A complementary problem concerns the contribution of syntactic analysis to distributional semantics [1, 8]. This type of analysis is a promising research avenue for accessing the underlying semantics of languages, with numerous applications (induction of word senses and semantic networks, tracking the diachronic semantic evolution of lexical units, etc.; see section 2.3.3). Word embeddings [5] currently make it possible to detect semantically similar lexical units on the basis of shared contexts. But current methods can neither distinguish the different semantic relations underlying this notion of similarity (analogy, synonymy, hypernymy, hyponymy, antonymy, etc.) nor deal with polysemy.

In order to identify the semantic structure of the lexicon more precisely and to automatically produce usable resources, we intend to pursue several avenues already opened in previous work (see section 2.1.3). One of them will consist in studying how syntax can help identify features, automatically classify the results of distributional analysis, and identify the underlying semantic relations.
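As an illustration of syntax supplying distributional features, the sketch below builds context vectors from dependency triples rather than from neighbouring words, then compares words by cosine similarity. It is a toy example under stated assumptions, not the team's method: the triples, the relation labels (`nsubj`, `obj`) and the helper names `syntactic_contexts` and `cosine` are all hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

Triple = Tuple[str, str, str]  # (head, relation, dependent)

# Hypothetical parser output for a tiny corpus.
triples = [
    ("drinks", "nsubj", "cat"),
    ("drinks", "obj", "milk"),
    ("drinks", "nsubj", "dog"),
    ("drinks", "obj", "water"),
    ("sleeps", "nsubj", "cat"),
    ("sleeps", "nsubj", "dog"),
]


def syntactic_contexts(data: Iterable[Triple]) -> Dict[str, Counter]:
    """Map each word to a bag of (relation, neighbour) contexts."""
    vectors: Dict[str, Counter] = defaultdict(Counter)
    for head, rel, dep in data:
        vectors[dep][(rel, head)] += 1            # the dependent sees its head
        vectors[head][(rel + "-inv", dep)] += 1   # the head sees its dependent
    return vectors


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[c] * v[c] for c in u if c in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


vectors = syntactic_contexts(triples)
print(cosine(vectors["cat"], vectors["dog"]))   # same subject contexts
print(cosine(vectors["cat"], vectors["milk"]))  # subject vs. object of "drinks"
```

With plain word-window contexts, "cat" and "milk" would both count "drinks" as a shared context. Typing the context with its syntactic relation keeps them apart, which is the kind of finer-grained feature the paragraph above has in mind.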