2026-04-13 12:15:00
Salle B107, bâtiment B, Université de Villetaneuse
In the era of big data, information retrieval (IR) plays a central role in how information is accessed and consumed. Recent advances in Transformer-based neural models have substantially improved retrieval performance. Two major paradigms have emerged in this context: learned sparse retrieval, which represents texts using weighted vocabulary terms, and generative retrieval, which formulates retrieval as the generation of a document identifier. While both approaches have shown strong performance, they also exhibit important limitations. Sparse retrieval methods are often constrained by the fixed vocabulary of the underlying language model, which limits their adaptability, whereas generative retrieval methods rely on arbitrary document identifiers that tend to generalize poorly to unseen documents.
In this thesis, we explore how these two paradigms can be combined to obtain more efficient and more effective retrieval representations. Our core idea is to construct sparse retrieval vocabularies through learning rather than from predefined lexical tokens. We first propose REFERENTIAL and HotBERT to investigate the use of hierarchical structured identifiers as the retrieval vocabulary: their coarse-to-fine structure is designed to capture global semantics at higher levels and to progressively refine finer-grained distinctions at lower ones. While this representation proves expressive and effective, our analysis reveals that directly learning and optimizing hierarchical identifiers is challenging in practice. Motivated by this observation, we introduce SAE-SPLADE, a sparse retrieval framework built on sparse autoencoders (SAEs), an architecture that learns sparse, interpretable latent representations. By using SAE latents as the retrieval vocabulary, SAE-SPLADE removes the dependence on fixed token vocabularies and improves flexibility and representational capacity. Finally, recognizing the efficiency challenges of HotBERT, we propose a theoretically lossless token-pruning method for late-interaction models that reduces computation while preserving retrieval performance.