Monday, 13 February


Time: 12:30 - 13:30
Location: Room B107, Building B, Université de Villetaneuse
Title: Towards Detecting Pre-training Data Set Manipulations: the Need to Build Efficient Language Models
Speaker: Wissam Antoun
Description: The high compute cost required to train Large Language Models (LLMs) makes them available only to a handful of high-budget private institutions and countries. These institutions rarely document their training data or the data collection and filtering source code, which raises questions about potential vulnerabilities of the models trained on them. For example, one of the many ways to inject adversarial biases and tamper with training data is to produce machine-generated text carrying these biases and have it included in the training data, so robust detection of machine-generated text is becoming crucial. Answering these questions first requires efficient ways to iterate and train language models quickly. In this talk, I will present my work on pretraining language models for Arabic and French and share the lessons learned in designing and training efficient LLMs. In particular, I will discuss training AraBERT, AraELECTRA, and AraGPT2, currently the largest Transformer-based models for Arabic, as well as the AraGPT2 detector. I will also introduce CamemBERTa, a new sample-efficient language model for French and the first publicly available DeBERTa V3-based model outside of the original paper, which establishes a new state of the art for French on many tasks.

(Joint work with Benoît Sagot and Djamé Seddah, from Inria's ALMAnaCH project team.)
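
Since CamemBERTa is described as publicly available, here is a minimal sketch of loading such an encoder with the Hugging Face transformers library and extracting contextual embeddings. The Hub identifier almanach/camemberta-base is an assumption for illustration, not something confirmed by this announcement.

# Minimal sketch (not from the talk): load a French encoder such as CamemBERTa
# with Hugging Face transformers and extract contextual embeddings.
from transformers import AutoTokenizer, AutoModel
import torch

model_name = "almanach/camemberta-base"  # assumed Hub identifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

sentence = "Le modèle encode cette phrase en représentations contextuelles."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One vector per sub-word token; downstream tasks fine-tune on top of these.
print(outputs.last_hidden_state.shape)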