Automatic tag suggestion for blog posts

From wikiRcln
Revision of 23 September 2016 at 14:32 by Jgflores (23/9/2016)


This project proposes a method for annotating, classifying and indexing blog posts by discovering tags that fit the content of a post (automatic tag suggestion). The method builds on the Platanne semantic annotation platform and on a lighter version of Alpage's semantic parser. The scientific objective is to study the contribution of light, experimental semantic parsing to a concrete semantic task that is relatively easy to evaluate. The task consists of automatically proposing tags to the author of a post, based both on previously assigned tags and on a light semantic parsing of the current post. The technological objective is to extend the functionality of the UIMA-based Platanne annotation platform with this task and to integrate it, at least partially, with a light version of Alpage's semantic parser (based on the Deep-Sequoia annotation scheme).

Scientific goals

  1. To find out which level of semantic parsing is helpful for the automatic tag suggestion task.
  2. To measure the impact of light semantic parsing on a concrete, easy-to-evaluate semantic task.
  3. To produce an automatic tag suggestion method that integrates light syntactic and semantic analysis.

Technological goals

  1. To upgrade the current implementation of named entity and term annotation in Platanne for the tag suggestion task.
  2. To integrate into Platanne a UIMA type system adapted to a lighter version of Alpage's semantic parser.
  3. To start creating interfaces between LIPN's Platanne and Alpage's semantic parsing technology.



1. Platanne

  • Finish the implementation of the named entities module
  • Finish the implementation of the terminology module
  • First Platanne prototype for the Automatic tag suggestion task without integration with Alpage's semantic parser
  • Second Platanne prototype for the task integrated with Alpage's semantic parser

2. Light semantic parsing for tag suggestion

  • Determine the level of syntactic analysis and semantic parsing required by the task
  • Specify and implement a lighter version of Alpage's semantic parser into a Platanne compatible UIMA type system

3. Blog corpus building

Corpus characteristics

French corpus
  • An Excel file with some corpus statistics
  • 19 blogs, of which 3 are in a clean XML format (11M) and 18 are still to be cleaned
  • We don't have a word count yet to estimate the size of the blog corpus
  • Word count of all the cleaned corpus: (TO DO)
  • Size in Mb of the whole corpus: (TO DO)
English corpus
  • Ehab's blog corpus (~3Tb)

4. Evaluation

  • Set up a blog post writing campaign at LIPN
  • Implement an evaluation method A based on blog post writers choosing automatically proposed tags
  • Implement an evaluation method B based on the blog corpus


Corpus building

  1. Build a blog corpus. In this first iteration we will work only with WordPress blogs. [Laurent, Ivan, Jorge]
    • Deadline: March 20 2015 (delayed)
    • New deadline: May 6
      • Choose 20 blogs as a preliminary corpus (DONE)
      • Develop a method for scraping posts from the blogs (DONE)
      • Creation of a GitHub repository and a Slack channel for the corpus and the project code (DONE)
        • We need to identify titles, subtitles, keywords and categories within the UIMA framework [Iván, Laurent]
          • Deadline: April 3 (delayed)

Platanne development

  1. Set up a working version of Platanne with named entities and term annotation [Iván & Jorge]
    • Deadline: April 20 (delayed)
    • New Deadline: May 20 (DONE)
    • New deadline: July 1st (for named entities and terms working in Platanne with Erwane's code)
    • Tasks:
      1. To install a working version of Platanne locally without Eclipse (DONE) [Iván]
      2. Document the Eclipse-UIMA installation process on Platanne's wiki [Iván & Jorge]
      3. To extend Platanne to named entities [Iván & Jorge]

Experimental settings

  1. Pre-UIMA: machine learning for tag suggestion
    1. Category and tag learning for suggestion
    2. New categories and new tags
    • First proposition: July 1st
    • What's the difference between a tag and a category? The tag suggestion process could enrich the semantic model (for instance, by proposing a hierarchical category tree), which brings us into dynamic annotation.
    • Dynamicity cases
      • When the categories become hierarchies
      • When we have new tags for old posts
    • Could this method be used for tweet hashtag annotation?
  2. UIMA experimental settings: To develop a type system annotation model for blog posts [Jorge & Iván]
    • Deadline: May 15
      • First stage: to tag only with keywords
      • Second stage: to create a hierarchical category structure.
      • To develop a dynamic annotation method for blog tags production [Adeline, Laurent, Iván, Jorge]
  • Deadline for the first (very simple) tag learning experiment: before August 2015



  • Check Ehab blog corpus
  • Organize a hack meeting every two weeks
    • Next meetings:
      1. April 24 (14h Paris)
      2. May 7 (14h Paris)
      3. May 15 (14h Paris)
  • Create a Slack Channel


  • Adeline, François, Ivan and Jorge
  1. Tagging a blog post is not semantic annotation (you can randomly associate a word to a blog post)
  2. Categories are semantic annotation
  3. NE, term annotation are semantic annotation.
  4. Can we isolate a semantically relevant set of tags from blog corpus?
  5. Tag prediction or tag clustering?
  6. We should focus on the categories: try to predict new categories.
  7. When you have too many blogs in the same category, we could split it in order to find subcategories.
  8. We might discover that some of the tags are clearly related to the categories.
  9. If we can find the tag in the text, then we can predict that it is a content-related tag.
  10. Tag description: propose tags and discover categories
  11. To produce a named entity hierarchy
  12. We have two annotation levels: the end-user level (tags, categories) and the system level (NE, POS tags, terms)
  13. It might be necessary to include the UIMA type system in the semantic annotation discussion that Ivan is writing right now


  • Ivan, Jorge
  1. Is the problem of tag suggestion coherent with the subject of Ivan's PhD?
  2. We might start by:
    1. Suggesting a tag
    2. Checking whether the tag fits a category
    3. Checking whether the categorised tag helps in learning a new taxonomy


  • Iván, Jorge
  1. About the implementation for NE in Platanne.
  2. Ivan will calculate a Goodman-Kruskal correlation between tags and categories
  3. He proposes to train a classifier on the posts (each represented as a bag of words) in order to predict categories. The experiment can be planned in stages of growing complexity:
    1. Predict (and suggest) tags
    2. Find categories
    3. Find category structures (or even taxonomies of categories, which could be semantic system candidates)
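
As an illustration of the Goodman-Kruskal correlation mentioned in these notes, here is a minimal sketch of the Goodman-Kruskal lambda, which measures how much knowing one nominal variable (a tag) reduces the error in guessing another (a category). It assumes the data has been flattened into one (tag, category) pair per observation; that flattening, the function name and the toy data are assumptions for illustration, not Ivan's actual implementation.

```python
from collections import Counter, defaultdict


def goodman_kruskal_lambda(xs, ys):
    """Proportional reduction in error when predicting y from x (0.0 to 1.0)."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(ys)
    # Errors made when always guessing the most frequent y overall.
    baseline_errors = n - max(Counter(ys).values())
    if baseline_errors == 0:
        return 0.0  # y is constant: nothing to improve on
    # Errors made when guessing the most frequent y within each value of x.
    by_x = defaultdict(Counter)
    for x, y in zip(xs, ys):
        by_x[x][y] += 1
    conditional_errors = n - sum(max(c.values()) for c in by_x.values())
    return (baseline_errors - conditional_errors) / baseline_errors


# Hypothetical toy data: each position is one (tag, category) observation.
tags = ["recipe", "recipe", "law", "law"]
categories = ["cooking", "cooking", "legal", "legal"]
print(goodman_kruskal_lambda(tags, categories))
```

Lambda is asymmetric, so predicting categories from tags and tags from categories generally give different values; both directions may be worth reporting.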


  • Ivan, Adeline, François, Jorge
  1. Discussion about Adeline's document: three types of applications of dynamic semantic annotation
    1. Manual annotation of text (an evaluation campaign), where the annotation process takes some time to stabilize.
    2. Automatic annotation of a flow of documents, such as news, blog posts or tweets.
    3. Annotation of an already annotated corpus
  2. The blog annotation task must now be reformulated as a dynamic flow of documents in which the annotation system gets frequent updates.
  3. Ivan proposes to take into account the date of the blog post and its category classification. We could also take category discovery into account. The point is to introduce a dynamic element into the process: each time we discover a new category or a new tag, we try to re-annotate the past blog posts that did not consider this category or tag.
  4. We still have to relate the applied work we are doing to the formal formulation in Adeline's document
  5. Dynamicity is better introduced by partitioning the corpus into time slices, where new tags and categories from recent slices affect the annotation of past slices.
  6. TODO: We should concentrate on the blog suggestion application while keeping in mind the big picture from Adeline's paper: we're working on dynamic annotation, with a general framework that could be applied to a wide range of annotation processes (like manual annotation).
  7. TODO (Ivan): Write down the experimental planning of the task as a draft (and then we will add the results of the experiment and we have an article)
  8. TODO (Iván): To finish a draft on the experimental planning of the blog post task (and some formalization on dynamicity) before the summer break.
  9. TODO: We might submit a paper for TIA 2015 demos and posters call (Deadline: September 10)


  1. Iván presents his first experiments with blog data
  2. Discussion about category prediction based on keywords and the Goodman-Kruskal lambda
  3. Iván's results suggest that keyword prediction works better than word prediction
  4. Adeline explains the links between category prediction and dynamic annotation. The key to this process is re-annotation when the category taxonomy evolves
  5. Index and category construction, and its relation to building a table of contents (table des matières) and an index.
  6. Tag suggestion is a double prediction: new tags, and tags already found as keywords.
  7. Is it worth predicting tags from categories?
  8. Dynamic annotation (according to Adeline)
    1. The text changes
    2. The semantic model changes
    3. New links appear
  9. TODO (Iván): Base prediction system (really quick). Tag prediction and category prediction.
  10. Discussion about the corpus: it is good to have a heterogeneous corpus.
  11. We should first finish the work on prediction before starting the work on dynamicity.
  12. Adeline leaves
  13. We continue the discussion on whether or not to implement the prediction system within Platanne-UIMA. We then propose that Ivan continue in Python
  14. TODO (Iván): extend the scope of the corpus to cover more heterogeneous topics....
  15. KORBEN (a computing blog proposed by Laurent). Cooking blogs could be interesting.
  16. TODO (Iván): also complete the corpus with a legal blog (maitre-eolas,
  17. François proposes downloading a snapshot of the blogs now with the intention of processing them again at the end of the thesis, with the new posts, in order to work on dynamicity.
  18. Laurent proposes taking a close look at M-estimation in the Naive Bayes algorithm to improve performance
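
To make the M-estimation suggestion concrete, here is a minimal sketch of a multinomial Naive Bayes classifier whose word probabilities use the m-estimate P(w|c) = (n_wc + m·p) / (n_c + m), with a uniform prior p = 1/|V|. The class name, the uniform prior and the toy documents are assumptions for illustration; setting m = |V| recovers ordinary Laplace smoothing.

```python
import math
from collections import Counter, defaultdict


class MEstimateNaiveBayes:
    """Multinomial Naive Bayes with m-estimate smoothing (uniform prior assumed)."""

    def __init__(self, m=1.0):
        self.m = m

    def fit(self, docs, labels):
        # docs: list of token lists; labels: one category per document.
        self.vocab = {w for doc in docs for w in doc}
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc)
        self.total = {c: sum(wc.values()) for c, wc in self.word_counts.items()}
        return self

    def word_prob(self, word, c):
        prior = 1.0 / len(self.vocab)
        return (self.word_counts[c][word] + self.m * prior) / (self.total[c] + self.m)

    def predict(self, doc):
        n_docs = sum(self.class_counts.values())

        def score(c):
            log_prior = math.log(self.class_counts[c] / n_docs)
            # Words never seen in training are skipped.
            return log_prior + sum(
                math.log(self.word_prob(w, c)) for w in doc if w in self.vocab
            )

        return max(self.class_counts, key=score)


docs = [["tarte", "pommes", "four"], ["loi", "tribunal", "juge"], ["recette", "four", "beurre"]]
labels = ["cooking", "law", "cooking"]
model = MEstimateNaiveBayes(m=1.0).fit(docs, labels)
print(model.predict(["four", "beurre"]), model.predict(["juge"]))
```

Smaller values of m trust the observed counts more, which may matter when some categories have very few posts; the best value would have to be tuned on held-out data.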

25/09/2015 (thesis follow-up meeting)

  • Adeline says that blog annotation is a pertinent application field of semantic dynamic annotation
  • Iván looked for a state of the art on blog annotation
  • Adeline says we need a blog annotation system right now...
  • Iván has extended his corpus to video games, kitchen recipes and legal blogs.
  • Shall we connect Ivan's approach to WordPress?
  • Jorge presents SemEval tasks related to taxonomies. Adeline thinks that task 14 (semantic taxonomy enrichment) is closer to what we do, but we should consider carefully whether or not to participate in this task.
  • We consider that the main goal right now is to develop the blog annotation system as soon as possible, so that we can keep doing experiments.
  • Tracks:
  1. blog technical platform (functions), WordPress related
  2. data: blog collection
  3. functionalities that could enhance the blog platforms
  4. prediction system
  5. dynamic annotation functions
  • Ivan will produce a schedule for this task for 2015-2016.

6/10/2015 (thesis follow-up meeting)

  • Adeline, François, Jorge, Iván, Laurent
  1. Iván presents his new blog corpus
  2. The new corpus contains blog posts on three subjects: law, cooking recipes and video games
  3. We need to buy an English corpus from the labex project (
  4. Ivan now has a reasonable corpus in French and English
  5. Copyright problems with the French corpus: we should look into copyright. Can we use the blogs to produce a challenge corpus? (Contact the authors and check whether the blog platform allows the content to be published as a corpus.)
  6. Laurent needs to fill in the Labex reporting form and we will buy the corpus from this budget
  7. Is the data enough?
  8. We discuss the prediction experiments: predicting tags from words, predicting categories from tasks, etc.

16/10/2015 (thesis follow-up meeting)

  • Adeline, François, Ivan, Jorge, Laurent
  • (no notes)

30/10/2015 (thesis follow-up meeting)

  • Adeline, François, Iván, Jorge
  • Iván is still processing and analysing the augmented blog corpus
  • Iván presents a state of the art on tag suggestion tools and blog plugins
  • We talk about text annotation tools, like Open Calais and Yahoo Content Analysis
  • We discuss whether we should take them as a baseline or as a building block in our pipeline
  • Ivan's presentation focuses on entity, event and relationship annotation tools that could eventually be connected to blog annotation.
  • We discuss the dynamicity part: neither blog annotation nor semantic annotation is a trending research subject, so the main contribution of Ivan's work would most likely come from dynamicity.
  • Adeline stresses the importance of developing a blog annotation system
  • François proposes to compare the tools that Ivan presented in order to test which best fits our task
  • Challenge: is it possible to build an experimental sample of blog posts to work with?
  • We propose that Ivan submit an early PhD paper to the ESWC 2016 PhD symposium call (Deadline: December 18)
  • Next meeting: Thursday, November 5

10/11/2015 (thesis follow-up meeting)

  • Ivan reports on his reading of Aldo's paper, A Comparison of Knowledge Extraction Tools for the Semantic Web. He read this paper looking for a comparison of semantic annotation APIs that could be applied to blog posts.
  • Ivan wonders whether we are forgetting the semantic model and ontology part of the PhD. Interestingly, while looking for blog annotation tools, he landed on a paper written by Aldo linking annotation and semantic (ontology-driven) resources
  • Talking about the weblogging corpus WWE, we find this paper: Whose thumb is it anyway? Classifying author personality from weblog text
  • Ivan would like to buy the weblogging2006 conference corpus for his weblog annotation experiments
  • Ivan has already finished the data cleaning and processing (legal and cooking weblogs only)
    • ~25000 blog posts, ~157 words per post on average
  • Jorge thinks that it might be interesting to share this corpus with the scientific community: the corpus itself is a result.
  • TODO: verify whether we can use weblog posts for a publicly shared academic corpus
  • Adeline proposes to write a paper describing the corpus for CILC 2016 (Deadline for abstract submission: December 1, 2015).
  • What's our main question for the ESWC PhD student paper? (Abstract submission deadline: December 11; full paper submission: December 18)
  • Jorge thinks that the experiment for ESWC might focus on semantic tag suggestion
  • Adeline prefers to concentrate on a more specific and focused task, namely automatic tag prediction: comparing the tags we predict from text with the tags chosen by the blog post author.
  • Adeline suggests that Ivan explore the use of annotation on the corpus.
  • We agree on writing a paper based on a very modest tag suggestion system, which would eventually become the baseline for future experiments based on dynamic annotation.
  • Draft of the paper structure (according to Adeline):
    1. this is my corpus
    2. this is my experimental setup
    3. those are the results of my tag suggestion (using Open Calais or Alchemy)
    4. this system would be the baseline of my future experiments
  • Ivan proposes the following question: how to measure the quality of a tag?
  • Do we have the means to tell the good tags from the bad tags?
  • Jorge proposes an experiment with students in which they are given a blog post and a mix of tags from the author and from Ivan's system, and choose the tags that best fit the post.
  • Ivan thinks that the quality of a tag might be a good idea to build the paper around.
  • François proposes to look at the tags globally, focusing not on the authors but on how the tags split the corpus into two subcorpora: one annotated by the tags and one without any annotation whatsoever. Could you learn from the annotated subcorpus to process the one without annotation? You could split the corpus randomly in two, tag one half and not the other
  • François's idea leads to a long discussion about the information quantity of each tag (inspired by classic Shannon information theory) and about which tags are meaningful and which aren't.
  • We all agree on a measure of the informational quality (or quantity) of a tag
  • We realize that we don't know exactly what the differences between categories and tags are
  • We need to study the difference between categories and words
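
One possible reading of the "information quantity of a tag" discussed in this meeting is the Shannon entropy of the two-way split that a tag induces on the corpus (posts with the tag versus posts without it). The sketch below is an assumption about how such a measure could be defined, not the measure the group settled on; the function name and toy corpus are hypothetical.

```python
import math


def tag_split_entropy(post_tags, tag):
    """Entropy in bits of the split (has tag / lacks tag) over a list of tag sets."""
    n = len(post_tags)
    if n == 0:
        raise ValueError("empty corpus")
    p = sum(1 for tags in post_tags if tag in tags) / n
    if p in (0.0, 1.0):
        return 0.0  # a tag on every post, or on none, separates nothing
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


corpus = [
    {"blog", "law"},
    {"blog", "law"},
    {"blog", "cooking"},
    {"blog", "cooking", "tarte"},
]
for tag in ("blog", "law", "tarte"):
    print(tag, round(tag_split_entropy(corpus, tag), 3))
```

Under this definition a tag attached to every post scores zero, and a tag used only once or twice scores low as well, which matches the intuition that both extremes are of little use for indexing.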

19/11/2015 (thesis follow-up meeting)

  • What should be the message for the ESWC paper?
  • Adeline suggests starting with a discussion of how the existing tools do not take the dynamicity problem into account
  • Ivan proposes to focus on maintaining annotations in a changing environment
  • You can annotate the blogger according to the new structure.
  • Adeline: "I have this problem, these tools and these cases; the tools as they are don't tackle this question, so our goal will be to enrich the existing tools with this extension".
  • Then Ivan could argue that this dynamicity extension could be used for other problems where the semantic system changes...
  • "The concrete example for your problem is blog annotation, and you will focus on how to enrich the family of tools you just described".
  • Ivan asks: are we forcing the scenario too much in order to test our ideas on blogs?
  • We discuss whether it's worth interviewing bloggers or sending questionnaires in order to find a concrete application scenario showing the relevance of dynamicity for bloggers
  • Actually, the point is to convince ourselves of the validity of the dynamic annotation task based on the actual blog corpus of legal, recipe and video game texts.
  • Measuring recall and precision of the tag prediction tool (like comparing Open Calais with another tool). Whatever the precision and recall of the tools turn out to be, we will show that the dynamicity issue is not addressed.
  • Maybe the second tool should be as different as possible from Open Calais.
  • The narrative for the ESWC paper is: why blogs are annotated, how annotations are used, and what tools are used for blog annotation (that's the intro). Then we show some data from the corpus and make the point that existing tools don't address the dynamicity problem. We would need some experiments to make things more concrete.
  • Then we discuss an abstract for a corpus linguistics conference... Is this conference CILC 2016?
  • This second paper is How weblogs are tagged?. The goal: simply describe, from the collected data, how tagging is done, the tag distribution and so on...
  • We agree on the following due dates:
    • December 1st: abstract for CILC 2016
    • December 11: abstract for ESWC
    • December 18: full paper for the ESWC PhD conference
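
The precision and recall measurement mentioned in this meeting can be sketched as below, comparing the tags proposed by a tool with the tags chosen by the author of each post. Micro-averaging over posts and exact string matching between tags are assumptions made for the sketch; the function name and toy data are hypothetical.

```python
def precision_recall(suggested, reference):
    """Micro-averaged precision and recall; both inputs are lists of tag sets, one per post."""
    true_positives = sum(len(s & r) for s, r in zip(suggested, reference))
    n_suggested = sum(len(s) for s in suggested)
    n_reference = sum(len(r) for r in reference)
    precision = true_positives / n_suggested if n_suggested else 0.0
    recall = true_positives / n_reference if n_reference else 0.0
    return precision, recall


# Hypothetical output of a tagging tool versus the authors' own tags.
tool_tags = [{"law", "court"}, {"recipe"}]
author_tags = [{"law"}, {"recipe", "dessert"}]
print(precision_recall(tool_tags, author_tags))
```

Exact matching is strict: a tool that proposes "recipes" where the author wrote "recipe" gets no credit, so some normalisation (lowercasing, lemmatisation) may be needed before comparing tools.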

25/01/2016 (thesis follow-up meeting)

  • We discuss CILC 2016: there's this weird rule about each author paying 75€
  • Ivan presents the plan of what he intends to present in the CILC paper
  • Adeline proposes to focus the presentation exclusively on the corpus structure:
    1. Motivation
    2. Overall presentation of the corpus
    3. Analysis
      • Tags and categories distribution
      • Types of blogs and types of tags
      • Evolution over time
    4. Conclusions
  • Ivan shows some statistics from his blog metadata database
  • Iván will present figures about his blog analysis for the next meeting (Monday, Feb 1st, 16h)

2/02/2016 thesis follow-up meeting

  • We could possibly withdraw the CILC paper
  • We agree to send a paper to CICLing: deadline February 15
  • Ivan gives us a presentation of his blog corpus database
  • We would like to query the database with the following questions:
    1. number of tags per blog
    2. number of categories per blog
  • Outline of the CICLing paper: "How blogs are tagged?" "Semantic annotation: how blogs are tagged?"
    1. Introduction: motivation: working on the dynamicity of annotation; blog annotation is meant for information retrieval and for search engine referencing, but tagging is mostly done on a distributed, subjective and "humor" basis. We are focusing on non-commercial blogs. We are not focusing on the normalization of the tags per se.
    2. State of art
      1. manual annotation
      2. tag prediction tools
      3. blog platforms
      4. -> tools are not really efficient and there is a need for a certain dynamicity in annotation
    3. Corpus analysis
      1. Overall presentation of the corpus
      2. Tagging activity
      3. (Correlation between tags and categories)
      4. Evolution over time
    4. First experiments on tag prediction
      1. Basic comparison between 2 tools?
      2. Evaluation
    5. Discussion
      1. Examples of tags that should benefit from dynamicity ("vintage"). Examples drawn from the evolution graphs. Bad tags are kept forever; new ones are used but older ones are not changed. Changes in the granularity of the categories. Backward annotation is an important feature to offer bloggers.
    6. Short conclusion

8/02/2016 thesis follow-up meeting

See the DocEng 2016 call

9/2/2016 collaborative coding session to define the experiments to present in the CICLing paper

  1. Unfortunately, those annotations are often made on subjective grounds and not in a systematic way.
    • What would be a "systematic" way to annotate under objective grounds?
      1. Analysis of semantic coherence of tags and categories
        1. Useless tags: low frequency tags (used once or twice in all the blog)
        2. Frequency analysis by blog post vs tags (calculate the correlation between the most frequent words in every post and compare it against the attributed tags)
        3. Topic analysis with MALLET (calculate the most frequent topics in every post and compare them against the attributed tags and categories)
        4. WordNet synset frequency analysis, inspired by Olivier Ferret's ACL paper (calculate the most frequent WordNet synsets and compare them against the attributed tags and categories)
  2. Although there are currently several tools to help bloggers to annotate their posts, these tools do not take into account the already existing information inside the blogs nor the evolution of their topics.
    • How do we take into account the existing information and the evolution of the topic?
      1. A qualitative analysis (and comparison) of previous tag suggestion tools is needed. We don't have the time to make a full analysis of all the approaches before the CICLing paper, although that will be necessary afterwards. It is possible, however:
        • to use existing tag suggestion tools to annotate a subcorpus of blog posts and compare them against the author's attributed tags
    • We must compare the present post against the collection of past posts in the following way:
      1. Calculate and compare the tags of any post with the tags of previous posts ( vs post.past[n].tags)
      2. Calculate and compare the categories of any post with the categories of previous posts ( vs post.past[n].cats)
      3. Calculate and compare the most frequent words of any post with the most frequent words of past posts ( vs post.past[n].freq)
      4. Calculate and compare the relevant topics of any post with the most relevant topics of past posts ( vs post.past[n].topics)
      5. Calculate and compare the most frequent WordNet synsets of any post with the most frequent synsets of past posts ( vs post.past[n].synsets)
      6. Where:
        • freq = most frequent words in a post
        • topics = relevant topics in a post
        • synsets = most frequent WordNet synsets in a post
        • post.past[n] = the collection of previous posts of
  3. By analysing blog textual data we try to delineate blog annotation practices, and we evaluate the annotation tools with respect to the bloggers’ requirements.
    • To prove this point it would be necessary to run a survey among blog authors, and we don't have the time before the CICLing deadline
  4. This paper presents an analysis of a corpus of blogs in French and an evaluation of blog annotation tools. From these results, we explain what are the advanced annotation functionalities that blog platforms should offer
    • A detailed description of the blog corpus according to previous presentation of Ivan's statistics.
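
The comparison of a post against its past posts listed above could be prototyped with a simple set-overlap score. The sketch below uses the Jaccard coefficient over feature sets (tags, categories, frequent words and so on); the choice of Jaccard, the function names, the whitespace tokenisation and the stopword list are all assumptions, since the notes do not fix a comparison measure.

```python
from collections import Counter


def frequent_words(text, k=5, stopwords=frozenset()):
    """The k most frequent non-stopword tokens of a post (naive whitespace tokenisation)."""
    words = [w for w in text.lower().split() if w not in stopwords]
    return {w for w, _ in Counter(words).most_common(k)}


def jaccard(a, b):
    """Size of the intersection over size of the union; 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def compare_with_past(post_features, past_features):
    """Overlap between one feature set of the current post and the same feature of each past post."""
    return [jaccard(post_features, past) for past in past_features]


current_tags = {"law", "court"}
past_tags = [{"law"}, {"recipe"}, set()]
print(compare_with_past(current_tags, past_tags))
print(frequent_words("le juge et le tribunal et le juge", k=2, stopwords={"le", "et"}))
```

The same `compare_with_past` call works for any of the listed features, as long as each one is represented as a set per post.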

Experiments we agree to try before the CICLing paper

  1. A detailed description of the blog corpus according to previous presentation of Ivan's statistics.
  2. To calculate frequent words, frequent synsets and relevant topics and look for correlation between those measures and tags and categories (synchronic analysis)
  3. To compare existing tools by annotating a subcorpus of blog posts and calculating recall and precision against the author's attributed tags
  4. To calculate frequent words, frequent synsets and relevant topics and look for correlation between those measures and tags and categories from past posts (diachronic analysis)


  1. We agree on experimental protocols for experiments 2 and 3
  2. Ivan starts implementing experiment 2
  3. Jorge starts implementing and documenting experiment 3


  • François (skype), Adeline, Ivan and Jorge (skype)
  1. We talked about the CILC 2016 paper (Deadline: 1st May, paper length: 15 pages), called:
    • A French Weblog Corpus For New Insights in Blog Post Tagging
  • TODO
    1. Ask for permission for the corpus rights
    2. State of the art
      • Compare our blog corpus with other blog corpora
      • Compare with other existing diachronic corpora
      • Cite the annotation tools we got
    3. Iván will take the lead on the corpus collection section
    4. Iván will install his database on the Cluster TAL (Kilroy virtual machine)
    5. François on the Corpus presentation and corpus activity sections


  • François, Adeline, Ivan and Jorge (via Skype)
  1. We discuss whether we are building a platform, a system, or an annotation system.
  2. What we have:
    • A corpus
    • A very basic prediction system
  3. If we want to work on dynamic annotation we have to take a step back (Adeline). We have done very little in that direction, but we could work on a generic framework for dynamic annotation: issues, applications, main scenarios for dynamic annotation. That would be an abstract framework, which we could then think of instantiating for blogs.
  4. Working with categories is much simpler than working with ontologies. If we say that we have a generic framework then we can identify a few scenarios that are relevant for blogs and try to implement this.
  5. Experimentation and evaluation.
  6. Abstract general framework: an abstract method to be instantiated in the blog corpus
  7. What are our research questions?
    • How could we use past annotation to propose new ones?
    • What's the quality of past annotations?
    • How does a change in the semantic structures of the annotation system affect past and future posts?
    • Is it possible to have similar posts that are not at all annotated in the same way?
    • consistency of the annotation
    • the annotation of these posts is not consistent... we have to reannotate
    • I have a new post with no annotation: how should I annotate that post with respect to the existing ones?
    • Do past tags allow us to predict new tags?
    • How good are past tags for indexing/searching/describing the posts?
    • Can we predict categories?
    • How good are categories for indexing/searching/describing the posts?
    • How subjective are tags? How related are they to the post content?
  8. Adeline proposes to switch from a data-driven approach to a goal-driven approach, and to decide which scenarios we want to test
  9. We have three tracks:
    • Continue doing experiments on data and tag prediction
    • Have a generic approach to analyze dynamicity
    • We could have the dynamic approach instantiated on the blog
  10. How about listing the research questions and decide which track to follow.
  11. What are the nice functionalities that we dream for blogs and blog platforms?
  12. What are the dynamicity problems, both on existing and imaginary platforms?
  13. How do we express the dynamicity problem in terms of blog posts?
  14. How do we model the dynamicity problem (rule changing, type system changes) on the blog post
  15. Ivan will model the dynamicity problem in terms of blogs and work on applications scenarios.


  • Adeline, Francois, Ivan and Jorge (skype)
  1. Ivan and Adeline made a flow diagram of the blog annotation process (suggest a tag, etc.).
  2. We are talking about a scenario where we cluster a blog post.
  3. Adeline: we need a formalization of classification cases
  4. Ivan describes formalization criteria: size of the category, relations between the concepts inside the classes
  5. We probably won't go far into the semantics and the concepts of the classes
  6. We need a definition of balance (how well balanced are the categories)
  7. Adeline: semantic relations between categories are a very difficult problem
  8. Clarify the different classification scenarios
  9. Adeline: work on two levels: 1) revise the diagram (and maybe work with different diagram levels) and 2) consider different steps: the condition for triggering the revision, the options available to the user, and the impact of each of these options. Let's take one case: there's a category with zero documents associated. In this case we could point out the empty categories to the user, who could decide to delete the category or merge it with another category.
  10. Operations on categories: add a category, delete a category or change the population of a category.
  11. Adeline: after defining these very simple operations, we can work with more complex scenarios.
  12. Ivan: the hardest problem is to identify the triggering event signalling that the user's attention is needed on a certain category.
  13. Adeline: 1) Which quality criterion interests us? (quality measure: balance) What is a good categorisation (quality measure to be defined)? 2) What operations can be performed on the categories? 3) What is the dynamics of the annotation system: at what point does the user annotate, and at what point do we check the quality?
  14. Adeline: axes on which we need to make progress: quality measure, operations and system dynamics.
  15. François: the value of our approach is that a user is there to modulate the prediction.
  16. Adeline: we lack an overall view of the thesis work: we need to get on with the definition of dynamic annotation.
  17. For next week: make progress on at least one of the three topics (quality measure, revision micro-scenarios, general dynamics of the system (which steps are supervised and which are unsupervised))
  18. Next meeting: Friday 23 September at 15h (Ivan will make a proposal regarding the axes, and Jorge and François will give feedback on the articles that Ivan sent us).
  19. CICLing: camera-ready version due 25 September. Look at the graph from the poster so that François, Jorge and Iván can decide whether to include it in the CICLing camera-ready version or to write a new paper.
  20. Adeline: ontological distance
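
The three axes from this meeting (a quality measure, operations on categories, and a trigger for revision) can be illustrated with a small sketch. It assumes categories are stored as a mapping from category name to a set of post identifiers, and it uses normalised entropy of category sizes as one candidate definition of "balance"; both choices, and all function names, are hypothetical.

```python
import math


def balance(categories):
    """Normalised entropy of non-empty category sizes: 1.0 means perfectly balanced."""
    sizes = [len(posts) for posts in categories.values() if posts]
    if len(sizes) < 2:
        return 0.0
    total = sum(sizes)
    entropy = -sum((s / total) * math.log(s / total) for s in sizes)
    return entropy / math.log(len(sizes))


def empty_categories(categories):
    """Trigger condition from the notes: categories with zero associated documents."""
    return sorted(name for name, posts in categories.items() if not posts)


def merge(categories, source, target):
    """Return a new mapping where `source` is removed and its posts are added to `target`."""
    merged = {name: set(posts) for name, posts in categories.items() if name != source}
    merged[target] = merged.get(target, set()) | set(categories[source])
    return merged


categories = {"law": {1, 2}, "cooking": {3, 4}, "games": set()}
print(empty_categories(categories))
print(round(balance(categories), 3))
print(merge(categories, "games", "law"))
```

Here the functions only report and propose: in line with the discussion, the decision to delete or merge a flagged category would remain with the user.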


  • François, Ivan, Jorge
  1. We talk about the article: we made it fit in 12 pages, but Ivan has new material (notably approaches that take past tags into account). We agree to include this part, even if it takes the article to 13 pages.
  2. Ivan presents schemas for adding annotations to the model
  3. For next week: more detail on slides 4 and 5 (formalization with respect to entropy measures) and a proof of concept, on real data (our corpus), of the heuristics from slides 4 and 5 (relating the heuristics to the state of the art in clustering might be desirable, but we need to make sure it is useful: are we really facing a clustering problem?).



  1. Documentation of Erwan's original system
  2. Blog classification: Adding Linguistic Knowledge to Improve the K-NN Algorithm
  3. Automated Blog Classification: Challenges and Pitfalls.
  4. TagAssist: Automatic Tag Suggestion for Blog Posts
  5. Semantic indexing using WordNet senses
  6. Platanne platform
  7. UIMA
  8. Dynamic semantic annotation
  9. Fernando Perez-Tellez, John Cardiff, Paolo Rosso, David Pinto: Weblog and short text feature extraction and impact on categorisation. Journal of Intelligent and Fuzzy Systems 27(5): 2529-2544 (2014)
  10. Exploiting Category-Specific Information for Multi-Document Summarization. Proceedings of COLING 2012: Technical Papers, pages 2093–2108, COLING 2012, Mumbai, December 2012.