Automatic generation of scientific state of the art

Abstract

Tracking the evolution of a scientific topic from its origins to its current state-of-the-art is a difficult task that requires considerable manual effort from domain specialists and researchers. They must identify the topic by means of related keywords and key-phrases (semantic modeling of the topic), find the scientists closely associated with it (social modeling), and locate the literature that addresses it. We envision building a state-of-the-art by constructing a time-varying network of concepts and people, and using this network to study how concepts are formed and evolve over time, how researchers influence each other, and how their research interests vary over time. Starting from the socio-semantic network model of Roth and Cointet (2010), we will combine statistical, linguistic and knowledge-based methods to automatically build a time-evolving socio-semantic network from a corpus of scientific documents.

Tracking the evolution of a scientific topic from its origins to its current state-of-the-art is a heavy task for domain specialists and researchers. It involves delimiting the scope of the topic by identifying relevant keywords (semantic modeling), identifying the scientists associated with it (social modeling), and surveying the relevant literature. However, these networks of concepts and people evolve over time. Our objective is therefore to design models of dynamic socio-semantic networks based on a combined linguistic and statistical approach, starting from corpora of scientific documents and building on the model proposed by Roth and Cointet (2010).

Tracking the evolution of a scientific topic from its origins to its current state-of-the-art is a difficult task that requires a huge amount of effort from domain specialists and researchers. Such a task requires identifying the topic by means of related keywords and key-phrases (semantic modeling of a topic), finding the scientists closely related to that topic (social modeling), and collecting the literature produced to address it, all while taking into account how these elements evolve over time. We therefore assume that a scientific state-of-the-art represents the impact of a topic across time, society and cognition.

In our vision, building a state-of-the-art consists in constructing a time-varying network of concepts and people. This network would make it possible to see how concepts are formed, how they evolve over time, and how they establish relationships with other concepts. It would also show how researchers influence each other and how their research interests vary over time. We assume that such a network can be built from a historical corpus of scientific documents annotated with their time of publication (e.g. ACL Anthology, arXiv, CiteSeer). The fundamental questions that we want to answer are: What scientific ideas are present in such collections at a given moment in history? How did a specific idea evolve over time? How are these ideas related, and how do these relations evolve over time? All of this must be answered while tracking the evolution of the underlying social network of scientists.

The model we want to extend is the socio-semantic network of Roth and Cointet (2010), http://epubs.surrey.ac.uk/1563/1/fulltext.pdf. This model formalizes the relation between concepts, persons and time, but it does not take into account the relationships between concepts. We would like to extend it to represent relations between concepts (similarities, differences and ontological relations) and to capture how this conceptual network evolves over time. In addition, we will adapt the model to define and identify structural patterns in citation and collaboration graphs that are both (i) linked to key studies and (ii) sufficiently distinct from one another, in order to capture mainstream works without redundancy and without neglecting more peripheral subfields. In other words, the aim is to render the diversity of the literature rather than aggregates, a task well suited to a state-of-the-art but rather opposed to what is usually done in science studies. Joint qualitative-quantitative approaches will also be proposed to see how our methods may contribute to describing the development of specific disciplines, e.g. in Europe over the last century, relying on a comparative approach and on corpora able to assist historians of science in their endeavors.
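To make the intended extension more concrete, the following is a minimal sketch of a time-stamped network with a social layer, an author-concept layer, and the additional concept-concept layer discussed above. It is not the formalization of Roth and Cointet (2010): the class name, method names and toy data are hypothetical, and plain co-occurrence in a paper is used only as a stand-in for the richer relations (similarity, difference, ontological links) that the project aims to model.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class SocioSemanticNetwork:
    """Illustrative container for time-stamped links (hypothetical design)."""
    # year -> set of (author, author) collaboration edges
    social: dict = field(default_factory=lambda: defaultdict(set))
    # year -> set of (author, concept) usage edges
    usage: dict = field(default_factory=lambda: defaultdict(set))
    # year -> set of (concept, concept) edges: the proposed extension
    semantic: dict = field(default_factory=lambda: defaultdict(set))

    def add_paper(self, year, authors, concepts):
        """Record co-authorship, concept usage and concept co-occurrence."""
        authors, concepts = sorted(set(authors)), sorted(set(concepts))
        for i, a in enumerate(authors):
            for b in authors[i + 1:]:
                self.social[year].add((a, b))
            for c in concepts:
                self.usage[year].add((a, c))
        for i, c in enumerate(concepts):
            for d in concepts[i + 1:]:
                self.semantic[year].add((c, d))

    def concepts_of(self, author, year):
        """Concepts that an author used in a given year."""
        return {c for a, c in self.usage[year] if a == author}

    def concept_links(self, concept, year):
        """Concepts linked to `concept` in a given year."""
        return {
            d if c == concept else c
            for c, d in self.semantic[year]
            if concept in (c, d)
        }


# Toy data, invented for illustration only.
net = SocioSemanticNetwork()
net.add_paper(2001, ["alice", "bob"], ["topic model", "text mining"])
net.add_paper(2005, ["alice", "carol"], ["topic model", "time series"])
print(net.concepts_of("alice", 2005))
print(net.concept_links("topic model", 2001))
```

Indexing every layer by year keeps the three kinds of links comparable across time slices, which is the property the questions above rely on.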

Our main technological objective is to bring together a variety of statistical, linguistic and knowledge-based methods to build these time-evolving socio-semantic networks from a historical corpus of scientific documents.

One of the key issues to address is the identification of topics: naively, any combination of words may represent a topic. Possible solutions include applying text mining, NLP and machine learning methods to scientific papers in order to find recurrent patterns. Topic models such as LDA provide a good statistical framework to start with, although they do not model temporal relations between the discovered topics and do not take semantic relatedness between topics into account, since they infer topics only from surface word co-occurrence. This limitation of LDA leads to the second problem, which is determining the semantic relatedness between two topics: are two topics the same even though they are formulated differently (e.g. “text mining” and “text analytics”), or are they different (e.g. “information retrieval” and “information seeking”)? The topic label alone is clearly not enough to determine their similarity; instead, a detailed analysis of the related papers and authors will be needed, possibly using external resources such as knowledge bases (ontologies, WordNet, thesauri, dictionaries, etc.). Author-based similarity may be calculated using collaboration graphs, while paper-based similarity may rely on methods for textual similarity (for example, those developed in the context of the SemEval Semantic Textual Similarity task). Finally, it will be necessary to define a way to combine the different similarity scores calculated on the social and semantic layers.
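As an illustration of the last point, the sketch below combines a semantic score and a social score into a single topic similarity. It is only one possible scheme under simple assumptions: bag-of-words cosine similarity stands in for the semantic layer, Jaccard overlap of author sets stands in for the social layer, and the linear weight `alpha`, the function names and the toy topics are all hypothetical.

```python
import math
from collections import Counter


def cosine(text_a, text_b):
    """Cosine similarity between bag-of-words vectors of two texts."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard(set_a, set_b):
    """Share of common elements between two sets (0.0 if both are empty)."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def topic_similarity(topic_a, topic_b, alpha=0.5):
    """Weighted mix of semantic and social similarity (assumed scheme)."""
    semantic = cosine(topic_a["text"], topic_b["text"])
    social = jaccard(topic_a["authors"], topic_b["authors"])
    return alpha * semantic + (1 - alpha) * social


# Toy topics, invented for illustration only.
text_mining = {
    "text": "extracting patterns from large text collections",
    "authors": {"a1", "a2", "a3"},
}
text_analytics = {
    "text": "extracting patterns and insight from text collections",
    "authors": {"a2", "a3", "a4"},
}
print(topic_similarity(text_mining, text_analytics))
```

A linear combination is the simplest choice; how the weight should be set, or whether a learned combination works better, is exactly the open question raised above.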

A breakthrough in academic search technology will be achieved by automatically explaining the links between topics (e.g. in what ways do “intellectual freedom” and “intellectual property” differ? How do they relate?). To do this, on the one hand we will need to classify the links between papers, so that it becomes possible to determine automatically why one paper has been cited by another. On the other hand, it will be necessary to discover the semantics of each link by means of deep semantic parsing techniques (also known as “deep machine reading”; Etzioni et al. 2006, Gangemi et al. 2014), which make it possible to connect the topics within a larger semantic graph by exploiting linking events, situations, and resolvable concepts or entities.

A thorough study of the evolution of concepts and their neighborhoods will also be needed. Computing a concept neighborhood is not a trivial task: it amounts to determining at which semantic distance one concept is considered “too far” from another. Supervised and unsupervised approaches may be explored to address this issue. To address the temporal aspect, some extensions to LDA have been proposed, such as the Dynamic Topic Model (Blei and Lafferty, 2006) and the Topics over Time model (Wang and McCallum, 2006).
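The neighborhood question can be illustrated with a small sketch. It assumes that each concept is described, for a given period, by the set of documents in which it appears, that the semantic distance is the Jaccard distance between those sets, and that a fixed cutoff `max_distance` defines what is “too far”. All three choices, as well as the function names and the toy data, are hypothetical simplifications and not the method the project commits to.

```python
def jaccard_distance(docs_a, docs_b):
    """1 minus the Jaccard overlap of two document sets (1.0 if both empty)."""
    union = docs_a | docs_b
    return 1.0 - len(docs_a & docs_b) / len(union) if union else 1.0


def neighborhood(concept, index, max_distance=0.7):
    """Concepts whose distance to `concept` is at most `max_distance`."""
    target = index.get(concept, set())
    return {
        other
        for other, docs in index.items()
        if other != concept and jaccard_distance(target, docs) <= max_distance
    }


def neighborhood_drift(concept, index_before, index_after, max_distance=0.7):
    """Neighbors gained and lost by a concept between two periods."""
    before = neighborhood(concept, index_before, max_distance)
    after = neighborhood(concept, index_after, max_distance)
    return {"gained": after - before, "lost": before - after}


# Toy indexes (concept -> document ids), invented for illustration only.
period_1 = {"topic model": {1, 2, 3}, "text mining": {2, 3, 4}, "parsing": {5}}
period_2 = {"topic model": {6, 7, 8}, "text mining": {9}, "time series": {6, 7}}
print(neighborhood_drift("topic model", period_1, period_2))
```

Comparing neighborhoods between periods gives a crude view of how a concept's context shifts; a supervised or unsupervised method would replace the fixed cutoff used here.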

Finally, we aim to deploy the proposed model in a tool. This part of the work will need to address the presentation of the results and the generation of a textual state-of-the-art from the socio-semantic network.

Blei, David M., and John D. Lafferty. Dynamic topic models. Proceedings of the 23rd international conference on Machine learning. ACM, 2006.

Chavalarias, David; Cointet, Jean-Philippe. Phylomemetic Patterns in Science Evolution—The Rise and Fall of Scientific Fields PLoS ONE, Edited by Eduardo G. Altmann, vol. 8, issue 2, p. e54847

Oren Etzioni, Michele Banko and Michael J Cafarella. Machine Reading. AAAI Conference on Artificial Intelligence, 2006.

Osborne, F. and Motta, E. (2012) Mining Semantic Relations between Research Areas, International Semantic Web Conference, Boston, MA

Roth, Camille and Cointet, Jean-Philippe. Social and Semantic Coevolution in Knowledge Networks. Social Networks (Elsevier), 32(1):16-29, 2010.

Wang, Xuerui, and Andrew McCallum. Topics over time: a non-Markov continuous-time model of topical trends. Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2006.

This project is fully relevant to the call, in particular to societal challenge 3.7 Société de l'information et de la communication, sub-axis 3.7.1.2 Études numériques et technologies de l'intellect. The problem of tracking the evolution of a scientific topic over time, within society and with respect to similar topics lies at the transdisciplinary intersection of digital humanities and computational linguistics. By automatically generating a scientific state-of-the-art, we aim to produce models, methods and working tools that allow researchers and scientists to cope more effectively with the new epistemic challenges raised by the publication rhythm of Big Science.

The capability to automatically build a state-of-the-art for a given topic will allow researchers to improve their bibliographical work by obtaining the information they need more quickly and efficiently. It will relieve them of tedious tasks such as following links from one paper to another, and spare them from reading works that are not pertinent to their current research. Reviewers may also benefit from such a tool when assessing other researchers’ works, specifically their novelty and how they relate to the current state-of-the-art. In our vision, the project may pave the way to a new generation of tools for exploring existing literature in the digital era.

Furthermore, by choosing an industrial partner we aim to go beyond the scientific community and adapt the technological outcomes of this project to the real-world requirements of a startup company that is developing innovative solutions in the field of expertise finding.

Consortium description

LIPN Laboratoire d’Informatique de Paris Nord, Université Paris 13 UMR 7030

The work of the RCLN team at LIPN is centered on four major axes: semantic annotation and analysis, knowledge extraction and structuring, semantic information retrieval, and syntax-discourse integration. In the context of this project, LIPN will provide substantial skills in semantic similarity, topic identification and knowledge extraction. The team members are: Davide Buscaldi (coordinator), a researcher with a record of more than 60 internationally published papers and an expert in semantic similarity, word sense disambiguation and knowledge extraction; Aldo Gangemi, who has been involved in several projects funded by both the European Union and industrial companies on ontology engineering, the Semantic Web, knowledge management and cognitive linguistics; Jorge García Flores, who works on Web People Search, information extraction and semantic similarity; and Nadi Tomeh, an expert in statistical methods for NLP.

Selected publications:

Aldo Gangemi, Valentina Presutti, Diego Reforgiato Recupero. Frame-based detection of opinion holders and topics: a model and a tool. IEEE Computational Intelligence, 9(1), 2014
Buscaldi D., Le Roux J., García-Flores, Jorge J., Popescu A. LIPN-CORE: Semantic Text Similarity using n-grams, WordNet, Syntactic Analysis, ESA and Information Retrieval based Features. In: Second Joint Conference on Lexical and Computational Semantics (*SEM 2013), pp.162-168. ACL

INRIA Saclay

At Inria Saclay, the team of Laurent Romary is a key contributor to several domains directly related to the present proposal: scholarly information (EU project PEER, Cendari, European DARIAH eInfrastructures), document encoding (TEI), and language resource standards (ISO/TC 37/SC 4). The team has developed state-of-the-art key-phrase and concept extraction for technical and scientific content (SemEval 2010 winner for the relevant task), the best automatic patent prior art prototypes (CLEF IP winner in 2009 and 2010), and a state-of-the-art tool for extracting bibliographical information from scholarly and patent publications (Grobid). The participation of this partner will be supervised by Gregory Grefenstette, an internationally recognized researcher (25 keynotes and invited talks). Prior to joining Inria, he served as Chief Scientist of Exalead, the European leader in professional search, and was a key player in the 5-year project Quaero, a major R&D project in search technologies.

Selected publications:

Gregory Grefenstette, “Explorations in Automatic Thesaurus Creation”, Kluwer, 1994.
Patrice Lopez, Laurent Romary. HUMB: Automatic Key Term Extraction from Scientific Articles in GROBID. SemEval 2010 Workshop, Jul 2010, Uppsala, Sweden.
P. Lopez and L. Romary. 2010. GRISP: A Massive Multilingual Terminological Database for Scientific and Technical Domains. 7th international conference on Language Resources and Evaluation (LREC), Valletta, Malta
Laurent Romary. TEI and LMF crosswalks. Stefan Gradmann and Felix Sasaki. Digital Humanities: Wissenschaft vom Verstehen, Humboldt Universität zu Berlin, 2013

MoDyCo - UMR 7114 CNRS - Université Paris Ouest Nanterre La Défense

The MoDyCo lab at Paris Ouest Nanterre University has been an active research team in linguistic modeling, linguistic dynamics and corpus linguistics in recent years. MoDyCo partners will bring their expertise in the temporal analysis of texts, particularly of aspecto-temporal, enunciative and modal constituents. They will also contribute to the modeling of semantic calendars for temporal expressions, whose schemas are already used in the text navigation platform NaviText. This platform relies on a language for representing navigation knowledge in texts, which makes use of semantic annotations. Furthermore, MoDyCo members might contribute to the topic identification and text mining tasks by improving the temporal modeling of socio-semantic networks. MoDyCo's participation will be supervised by Delphine Battistelli, Professor of Computational Linguistics at Paris Ouest Nanterre University, and Mathilde de Saint-Léger, Research Engineer at CNRS.

Selected publications:

Battistelli D., Cori M., Minel J.-L., Teissèdre C. (2012), «Information Retrieval: Ranking Results according to Calendar Criteria», in Proceedings of IPMU'12 (14th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems), 9-13 July 2012, Catania, Italy.
Couto J., Minel J.-L. (2009), «Text Linguistics and Navigation: Questions about Text», Belgian Journal of Linguistics, p. 100-108, John Benjamins Publishing.
van Meter K., de Saint Leger M. (2009) «German & French Contemporary Sociology Compared: Text Analysis Of Congress Abstracts», Bulletin of Sociological Methodology/Bulletin de Methodologie Sociologique October 2009 vol. 104 no. 1 5-31

Centre Marc Bloch

This CNRS unit, based in Berlin, Germany, hosts an interdisciplinary Digital Humanities team of researchers with a dual profile in computer science (especially system modeling, NLP and pattern recognition, signal analysis) and social science (especially ICT sociology, social network analysis, public sphere studies). The team focuses on the socio-semantic dynamics of various social systems, such as scientific and digital communities, and works extensively on textual corpora. The participation of this partner will be supervised by Camille Roth, 34, Permanent Researcher in Computer Science at the CNRS since 2008, after having been Associate Professor in Sociology at the University of Toulouse. Team members are either computer scientists or modelers who have worked in social science fields, or social scientists who are acquainted with formal and ICT methods.

Selected publications:

Taramasco C., Cointet J.P., Roth C. (2010), “Academic team formation as evolving hypergraphs”, Scientometrics, 85(3):721-740.
Roth C., Cointet J.P (2010) “Social and semantic coevolution in knowledge networks”. Social networks, 32, pp:16–29 (Prize of the European Academy of Sociology).
Roth C., Wu J., Lozano S. (2012), “Assessing impact and quality from local dynamics of citation networks”, Journal of Informetrics, 6(1):111-120

ideXlab

ideXlab has created an intermediation platform to accelerate innovation for companies of all sizes in all industries. The platform connects companies with a global community of experts discovered “in real time” through a dedicated search engine. ideXlab's customers today are large industrial groups, but the company is gradually broadening its target to innovators of various kinds and sizes (start-ups, SMEs, laboratories, universities, specialized agencies, etc.). The coordinator will be Pierre Bonnard, who has over 15 years of experience in research and technical development in software, telecommunications and information processing. He has conducted projects for companies such as Usinor, Matra and Alcatel-Lucent, where he managed software and product development, and has authored 7 international patents. Nouha Omrane works at ideXlab as a Research and Development Engineer; her core interests are data mining and processing for Open Innovation platform development. Her thesis work was on the Semantic Web and the construction of ontologies and thesauri.

ideXlab will contribute to the project as a data provider, using the existing ideXlab platform to gather the documents that will form the historical corpus used to produce the state-of-the-art. It will also provide industrial use cases: users of the ideXlab platform may want to obtain a state-of-the-art document, produced in near real time, that summarizes the key concepts, dates, people, companies, etc. associated with a given question. ideXlab will be able to supply user requirements in terms of information content and data presentation.

With ideXlab joining as a partner, the maximum budget (700-799k€) can be requested.
