Notes on the SemEval 2015 conference
Wednesday, June 3 2015
The top 10 systems [English task] did not show statistically significant differences among them.
Aligning words between sentences has been the most popular approach for the top three participants (DLS@CU, ExBThemis, Samsung). They use WordNet (Miller, 1995), Mikolov Embeddings (Mikolov et al., 2013; Baroni et al., 2014) and PPDB (Ganitkevitch et al., 2013).
Most teams add a machine learning algorithm to learn the output scores, but note that the Samsung team did not use one in their best run.
Only about one fifth of the systems were unsupervised, among which the top performing system, UMDuluth-BlueTeam-run1, was able to come within 0.1 correlation points of the top performing system on Wikipedia and within 0.03 on the Newswire dataset. This relatively narrow gap suggests that unsupervised semantic textual similarity is a viable option for languages with limited resources.
Systems worth studying (and whose papers are worth reading)
Samsung: 4th place (no significant statistical difference with 1st) without machine learning
ExBThemis: Best paper, 2nd in English, 1st in Spanish (they'll give a presentation today)
DLS@CU: Our alignment master Sultan (1st in English two years in a row)
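To make the alignment idea concrete for myself, here is a minimal sketch of an alignment-based similarity score: the proportion of content words, over both sentences, that have an aligned counterpart. The aligner is a deliberately naive exact match on lowercased tokens, and the stop-word list and function names are my own assumptions; the actual systems align with WordNet, embeddings and PPDB.

```python
# Toy stop-word list (an assumption; real systems use larger lists).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "on", "in", "to", "and"}


def content_words(sentence):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in sentence.lower().split() if w not in STOP_WORDS]


def align(words1, words2):
    """Naive one-to-one aligner: pair identical words.

    Placeholder for a real aligner (WordNet, embeddings, PPDB...).
    """
    remaining = list(words2)
    pairs = []
    for w in words1:
        if w in remaining:
            remaining.remove(w)  # each word can be aligned only once
            pairs.append((w, w))
    return pairs


def alignment_similarity(s1, s2):
    """Proportion of content words (over both sentences) that are aligned."""
    w1, w2 = content_words(s1), content_words(s2)
    if not w1 and not w2:
        return 0.0
    pairs = align(w1, w2)
    return 2 * len(pairs) / (len(w1) + len(w2))
```

For example, `alignment_similarity("the cat sat on the mat", "a cat is on a mat")` aligns two of the five content words on each side of the pair. Swapping `align` for a better aligner leaves the scoring untouched.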
Thursday, June 4 2015
I said hi to Greg Grefenstette (INRIA) and Mariana Apidianaki (LIMSI)… Vive la France!
I had a chat with Eneko Agirre (he sends greetings to Davide) and Daniel Cer, from Google (I still have to ask him about querying for the immigrants project)
Marco Baroni's keynote on distributional semantic models
He starts with some words about Adam Kilgarriff:
“He was totally allergic to bullshit”
“He wrote a great paper in 1997: I don't believe in word senses”
“He never got a paper accepted in the main ACL conference (which is a scandal, given his contribution to the field)”
“Recently we were talking about all these young people doing deep learning: they don't make any distinction between language and vision”
On the multimodal skip-gram model
Inspired by language learning by children
Training the model for standard distributional semantics with 20K words extracted from a baby language learning corpus. They try to predict the objects the babies are learning (hat, ring) with a skip-gram model. The corpus is composed of words and object images (I don't totally understand it). The task is called “Matching words with objects”
“Look at the kitty! Look at the oink!”
The model tries to predict from the word kitty the right cute animal image
Concept learning, word learning, synonym learning inspired by the human cognition process
How MMSkipGram visualizes new concepts
As far as I understand, he's trying to train an MMSkipGram model to learn unknown words and associate them with images, inspired by the baby language learning process… looks interesting, but…
Models of language acquisition
SemEval-2015 Task 1: Paraphrase and Semantic Similarity in Twitter (PIT)
A task worth participating in… two subtasks: paraphrase identification (binary) and semantic similarity between tweets. In contrast to STS, there are many more spelling variations and much more street language. Very interesting indeed, especially if we are thinking about processing @menosdias tweets.
352 features combined with logistic regression
a lot of metrics, machine translation, biological metrics, etc!
word and phrase embeddings (unless you're under a rock you must have heard about word and phrase embeddings!)
Tweet alignments with embeddings (each atom to a corresponding atom on other side)
Poster session
Next year we should also participate in Interpretable STS, in order to explain why we assign a given score (and to improve our alignment algorithms)
Word embeddings are the new buzzword
Alignment, word embeddings, SVM
Tokenization, case correction, unsupervised POS tagging, lemmatization, detection of dataset-specific stop words, identification of measurement & temporal expressions, state-of-the-art NER (Hänig et al 2014, winner of GermEval-2014)
Non-alignment features
Character n-grams, pathlen similarity, number overlaps, word n-gram similarity, sentence length, average word length
Alignment features
Direction-dependent m:n alignments of types EQUI, OPPO, SPE, SIM, REL, NOALI
Align in strict order: NE, normalized temporal expressions, normalized measurements, arbitrary token n-grams (1-5), negations, remaining content words
Proportion features for EQUI, OPPO, SPE, REL
Binned frequency features for OPPO, SPE, REL, NOALI
Han et al 2013 align-and-penalize features “good alignment vs bad alignment”
A robust system across all the corpora. 2nd in English, 1st in Spanish with a huge gap over the next best system
SVR using 40 alignment features and 51 non-alignment features
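A few of the non-alignment features listed above are easy to sketch. This is my own minimal version for a sentence pair, not the ExB Themis implementation: the feature names, the trigram size and the use of Jaccard overlap are all assumptions.

```python
import re


def char_ngrams(text, n=3):
    """Set of character n-grams of the lowercased text."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Jaccard overlap of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def non_alignment_features(s1, s2):
    """A few surface features for a sentence pair; names are illustrative."""
    t1, t2 = s1.split(), s2.split()
    # Numbers are matched as integers or simple decimals.
    nums1 = set(re.findall(r"\d+(?:\.\d+)?", s1))
    nums2 = set(re.findall(r"\d+(?:\.\d+)?", s2))
    return {
        "char_3gram_jaccard": jaccard(char_ngrams(s1), char_ngrams(s2)),
        "number_overlap": jaccard(nums1, nums2),
        "length_ratio": min(len(t1), len(t2)) / max(len(t1), len(t2), 1),
        "avg_word_len_1": sum(map(len, t1)) / max(len(t1), 1),
        "avg_word_len_2": sum(map(len, t2)) / max(len(t2), 1),
    }
```

A dictionary like this could be turned into a feature vector and passed, together with the alignment features, to a regressor such as an SVR.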
SemEval-2015 Task 13: Multilingual All-Words Sense Disambiguation and Entity Linking
Babelnet, Tokenized, POS tagged documents in four languages
Example: the concept of medicine as a drug (its specification varies according to the source: Wikipedia, WordNet, etc.)
Very interesting dataset in English, Spanish and Italian with lots of ambiguous terms
Resources used by the participants:
DBpedia Spotlight, Wikipedia Miner, evolutionary game theory using a non-cooperative multiplayer game setting, Tagme, EL services, BabelNet
optimizing multiple objective functions, document monosemy plus personalized page rank
The winning approach: content words tagged by exploiting their translations in other languages
LIMSI: Translations as Source of Indirect Supervision for Multilingual All-Words Sense Disambiguation and Entity Linking
The LIMSI system exploits the parallelism of the multilingual test data
assumption of sense correspondence between a word and its translation in context (Diab and Resnik, 2002)
sentence and word (lemma) level alignments (Hunalign, GIZA++)
keep the Spanish translation for English words, the English translation for Spanish and Italian words
sense selection for a word w in context:
the set Sw of synsets of w in BabelNet is retrieved
Sw is filtered to keep only synsets that contain both w and its aligned translation t in this context
if more than one sense remains, the synsets are ranked using the default sense comparator in the BabelNet API and the highest ranked synset is kept
BFS helps to find also the wrong senses…
The LIMSI system needs no training, it only relies on alignment and sense ranking
weaker performance for Spanish and Italian due to the problematic sense ranking in these languages (performed by BabelNet)
when multiple senses are retained after filtering by alignment
BFS is needed
alignment-based filtering remains beneficial as the translation might occur in only one synset
BFS= Babelnet First Sense
BFS predictions are often wrong, especially in Spanish and Italian
Perspectives: experiment with alignments provided by MT systems, train a WSD system on data annotated by the alignment-based method
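The sense selection procedure noted above (filter by aligned translation, then fall back on the default ranking) can be sketched as follows. This is my reading of the talk, not LIMSI's code: the synset inventory is a plain list standing in for BabelNet, the synset ids are made up, and backing off to the full ranked list when the filter removes everything is my own assumption.

```python
def select_sense(word, translation, ranked_synsets):
    """Pick a synset for `word` given its aligned `translation`.

    `ranked_synsets` is a list of (synset_id, lexicalizations) pairs, already
    ordered by a default sense ranking (first = most likely sense).
    """
    # Keep only synsets that contain both the word and its aligned translation.
    filtered = [
        (sid, lexs) for sid, lexs in ranked_synsets
        if word in lexs and translation in lexs
    ]
    # Assumption: if the filter removes everything, use the full ranked list.
    candidates = filtered or ranked_synsets
    # Several candidates may remain: keep the highest-ranked one.
    return candidates[0][0] if candidates else None


# Toy inventory with made-up ids (not real BabelNet synsets).
TOY_SYNSETS = [
    ("syn_factory", {"plant", "planta", "fábrica"}),
    ("syn_organism", {"plant", "planta", "vegetal"}),
]
```

With this toy inventory, `select_sense("plant", "vegetal", TOY_SYNSETS)` picks the organism sense because the translation occurs in only one synset, while the ambiguous translation "planta" falls through to the first-ranked sense, which is where a wrong default ranking hurts.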
SemEval-2015 Task 14: Analysis of Clinical Text
Corpus: 100 annotated notes (109K words)
400K unannotated notes
Annotations: subject, course, severity, generic, body location
Task 1: identify the disorder span + CUI (concept unique identifier) normalization
Task 2: disorder slot filling
# 2a: gold-standard disorder spans are provided
# 2b: no gold standard, just raw text
The hardest thing: the entity linking part (body part identification and CUI)
# CRF-based span recognition, bag of words, bigrams, POS, chunks, dependency, specialized lexicons, trigger terms, distance to disorder spans, dependency parse information
UTH-CCB: The Participation of the SemEval 2015 Challenge – Task 14
SemEval-2015 Task 15: A CPA dictionary-entry-building task
CPA: Corpus pattern analysis.
corpus driven technique for mapping meaning onto words in text
tools and resources to identify and represent unambiguously the main semantic patterns in which words are used
Sense Discriminative Patterns
CPA Parsing, CPA Clustering, CPA lexicography
MICROCHECK (29 verbs 378 patterns, 4529 annotated sentences)
WINGSPREAD (93 verbs, 856 patterns, 12440 annotated sentences: ~10K learning, ~400 testing)
ACL-2015 tutorial: Patterns for semantic processing
BLCUNLP: Corpus Pattern Analysis for Verbs Based on Dependency Chain
SemEval-2015 Task 9: CLIPEval Implicit Polarity of Events
Phrase level sentiment
Message level sentiment
Topic level sentiment
I had to go to the toilet, so I missed most of the presentation, which looked very good: Trento was the winner of subtask A (phrases) using a deep convolutional NN with additional input for phrases; the second used message level sentiment + character n-grams; the third used model iteration (tetrai)
For subtask B (message polarity, the most popular task of SemEval) the winner put together four top performing classifiers from previous editions of the task; the second used a deep convolutional NN and the third used logistic regression with special weighting for positives and negatives
For subtask C (topic extraction) both systems use the “system for subtask B”
General resources: tokenization, stemming, lemmatization, stopword removal, POS tagging
Twitter-specific resources: (I lost them)
Classifiers: SVM, MaxEnt, Naive Bayes
Deep learning, embeddings (unitn, INESC-ID)
integration of ensemble methods (Webis)
input: a list of terms; output: the same list of terms with a polarity score.
MaxDiff method of annotation. Which term is the most positive and which is the least positive?
rotten is less positive than; #happiness is more positive than
Very interesting and didactic paper on deep learning for NLP
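MaxDiff judgments like these are usually turned into real-valued scores by simple counting. A minimal sketch, assuming the common counting rule (times chosen most positive minus times chosen least positive, divided by times shown); whether the task organizers used exactly this rule is my assumption, and apart from "rotten" and "#happiness" the terms in the usage note are toy data.

```python
from collections import Counter


def maxdiff_scores(annotations):
    """Turn MaxDiff judgments into scores in [-1, 1].

    Each annotation is (terms_shown, most_positive, least_positive).
    score(term) = (#times most positive - #times least positive) / #times shown
    """
    shown, best, worst = Counter(), Counter(), Counter()
    for terms, most, least in annotations:
        shown.update(terms)
        best[most] += 1
        worst[least] += 1
    return {t: (best[t] - worst[t]) / shown[t] for t in shown}
```

For instance, two judgments over the tuples ("rotten", "okay", "#happiness", "fine") and ("rotten", "okay", "great", "fine"), with "rotten" picked as least positive both times, push "rotten" to the bottom of the scale while terms never picked stay at zero.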
The key to success is the initialization of the NN
Deep Learning models in NLP
ConvNet architecture
sentence matrix, word embeddings, phrase indicator features, convolutional feature map, pooled representation, softmax
For the message classification task you need to add more features for each word
Models for twitter sentiment analysis
SVM with various n-gram, char-gram, lexicon features
State of the art model (NRC) in Semeval 13 and 14
Deep learning models have shown excellent results on many NLP sentence classification tasks but have so far failed to beat carefully engineered methods
Three-step pre-training process
Train the network on a large distantly supervised corpus of 10M tweets
Fine-tune the network on the supervised dataset (about 10k tweets)
Major novelty: how the network weights are initialized
ConvNet params
wide convolution, max-pooling, filter width 5
word embeddings dimensionality 100
number of feature maps 300
Importance of pre-training: Three different experiments: random, unsupervised, distant
careful weights
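To fix the vocabulary for myself (sentence matrix, wide convolution, feature map, max-pooling), here is a dependency-free sketch of one feature map. It is only an illustration of the operations named above, not the authors' network: it has a single filter, no bias, no non-linearity and no softmax layer, and like most deep learning code it computes a cross-correlation (the filter is not flipped).

```python
def wide_convolution(sentence, filt):
    """Wide 1-D convolution of an n x d sentence matrix with an m x d filter.

    The sentence is zero-padded with m-1 rows on each side, so the feature
    map has n + m - 1 entries.
    """
    m, d = len(filt), len(filt[0])
    pad = [[0.0] * d for _ in range(m - 1)]
    padded = pad + [list(row) for row in sentence] + pad
    feature_map = []
    for i in range(len(padded) - m + 1):
        total = 0.0
        # Dot product between the filter and the window starting at row i.
        for j in range(m):
            for k in range(d):
                total += filt[j][k] * padded[i + j][k]
        feature_map.append(total)
    return feature_map


def max_pool(feature_map):
    """Max-over-time pooling: one number per feature map."""
    return max(feature_map)
```

With the parameters from the talk, each sentence row would be a 100-dimensional word embedding, the filter would span 5 words, and 300 such filters would give a 300-dimensional pooled representation.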
CLaC-SentiPipe: SemEval2015 Subtasks 10 B,E, and Task 11
Negation and modality
They created a resource for irony called Gezi (and they used a resource called NRC)
95 features!
primary features: polarity class, lexical resource, linguistic context
secondary features: emoticons, highest and lowest sentiment scores, POS counts, named entities
no bag of words but context aware polarity classes
SemEval-2015 Task 12: Aspect Based Sentiment Analysis
NLANGP: Supervised Machine Learning System for Aspect Category Classification and Opinion Target Extraction
Friday, June 5
I talked with Daniel Cer from Google, who helped me get (at least) 100 Google queries per day for unoporuno with a tweak to the Google API dedicated to websites
Nice talk with Georgeta Bordea, she was the one who built Saffron Expert Finding System, she would be glad to collaborate or do things related to our highly qualified immigration project
Had lunch with Greg Grefenstette: invited him to work on the Quijote project: he's already counting Don Quixote's words!
SemEval-2015 Task 4: TimeLine: Cross-Document Event Ordering (or Newsreader project)
TempEval-3 corpus and evaluation methodology proposed by UzZaman (2011)
Relations represented as timegraph
Interesting task!
4 teams participating with 13 unique runs
Three corpora: Airbus, GM, stock market. Two tracks and two subtracks
First task focusing on cross-document ordering of events
If we're thinking about AGESS we should participate in this task
SPINOZA_VU: An NLP Pipeline for Cross Document TimeLines
based on the NewsReader pipeline system (Eneko Agirre)
subtask first addressed at document level and then aggregated at corpus level
entity driven instead of event-driven
time-lines are obtained in the post-processing
NER CoNLL, NED (named entity disambiguation): DBpedia Spotlight
Entity coreference: Stanford Multi-Pass Sieve, Event Coreference:?
Event detection, timex detection and normalization, TLINK detection and classification
System trained on TempEval-2 corpus
Timex Detection and normalization: TimePro system
TLINK: TimeProRel system
TimeLine aggregation module:
Very low F values… the most difficult part was time ordering (low recall for the temporal relations available)… “we were missing temporal relations for anchoring”
QA TempEval: TimeML was originally developed to support research in complex temporal QA. TempEval mainly focused on more straightforward temporal information extraction. QA TempEval focuses on an end-user QA task, as opposed to the earlier corpus-based evaluation
This evaluation is about the accuracy for answering targeted questions
It's easier to evaluate
Task description: plain documents with DCT (document creation time, TempEval-3 format). The plain documents are fed into the participating systems, which annotate timexes, events and temporal relations: the output is TimeML-annotated documents.
Test dataset creation: question sets and key documents.
example of question: is event21 after event19? (the answer orders the events). The questions are yes/no temporal questions regarding any of the 13 Allen interval relations holding between two designated temporal entities.
Corpus: Wikinews, WSJ, NYT, Wikipedia articles, informal blog posts
each system's annotation represents its temporal knowledge of the documents
the annotation of each system is fed into a temporal QA system (UzZaman et al 2012) that answers questions on behalf of the systems
Given a system's TimeML annotated documents, the TimeML QA process consists of three main steps: (lost)
Participants: rule-based timex module, SVM liblinear for event and relation detection and classification, SVM separated event detection
very low recall on the results
main finding: using event co-reference may help
systems are still far from deeply understanding the temporal aspects of NL (recall: 30%)
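To see why missing relations translate into low recall, here is a small sketch of how a yes/no question such as "is event21 after event19?" can be answered from a system's annotated relations. It is not the UzZaman et al. QA system and does not reconstruct the three steps I lost above: it only handles BEFORE pairs (not the 13 Allen relations), and the function names are my own.

```python
def is_before(relations, a, b):
    """Return True if `a` BEFORE `b` follows from the annotated pairs.

    `relations` is an iterable of (x, y) pairs meaning "x BEFORE y".
    Uses a depth-first search, i.e. the transitive closure of BEFORE.
    """
    successors = {}
    for x, y in relations:
        successors.setdefault(x, set()).add(y)
    stack, seen = [a], set()
    while stack:
        node = stack.pop()
        for nxt in successors.get(node, ()):
            if nxt == b:
                return True
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def answer_is_after(relations, x, y):
    """Answer "is x after y?" with "yes", "no" or "unknown"."""
    if is_before(relations, y, x):
        return "yes"
    if is_before(relations, x, y):
        return "no"
    return "unknown"
```

If a system annotates event19 BEFORE event20 and event20 BEFORE event21, the question is answered by transitivity; drop either link and the answer becomes "unknown", which counts against recall.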
HLT-FBK: a Complete Temporal Processing System for QA TempEval
ML based (SVM in Yamcha)
Training: TimeBank and AQUAINT data from TempEval3 task
News reader pipeline: tokenization, pos, constituency parser, dependency parser, named entity recognition, SRL
Timex identification: classification of all tokens into 9 classes (B-DATE, I-DATE, B-TIME… etc)
Timex normalization: time expression normalizer for English: timenorm (Bethard, 2013)
Two classifiers: event detection and event classification
Features: lemma, pos, chunk, entity type (NE or Timex), verb tense and polarity, etc.
All predicates identified by the SRL (semantic role labelling)
System described in Paramita Mirza and Sara Tonelli. 2014. Classifying Temporal Relations with Simple Features.
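The token-level timex classes mentioned above (B-DATE, I-DATE, B-TIME…) follow the usual BIO scheme, so the classifier output has to be decoded back into spans. A minimal sketch of that decoding step; the function name and the choice to treat a stray I- tag as the start of a new span are my own assumptions, not something stated in the talk.

```python
def bio_to_spans(tokens, tags):
    """Group tokens tagged B-X / I-X / O into (label, text) spans."""
    spans, current_label, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        # A span starts on B-X, or on an I-X that does not continue the
        # current label (assumption: treat stray I- tags as span starts).
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_label
        )
        if tag == "O" or starts_new:
            if current_tokens:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
        if tag != "O":
            current_label = tag[2:]
            current_tokens.append(token)
    if current_tokens:
        spans.append((current_label, " ".join(current_tokens)))
    return spans
```

The recovered spans are what a normalizer such as timenorm would then receive as input.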
SemEval-2015 Task 6: Clinical TempEval
Clinical events identification (April 23: the patient did not have any postoperative bleeding)
Detection of events in relation of the time when the document was written (narrative container relation).
Annotated with the THYME extension of ISO-TimeML
Corpus: ~300 documents; ~40000 events
event/time spans: begin, end
event/time attributes: begin, end, value
document time relations: begin, end, relation
narrative container relations: begin1, end1, begin2, end2
ML systems had better recall, rule-based systems had better precision (accuracy)
Tools: PyConText and Moonstone
initiate work on end to end temporal reasoning
approach: UIMA/ClearTK (liblinear): BIO-representations
cTAKES; pyConText
Features: lexical, section, HeidelTime lexicon
CRF++, cTAKES, lexical, semantic type, context window
SemEval 2015, Task 7: Diachronic Text Evaluation
interesting task: dating text snippets according to their style
linear models to extend pairwise decision
linking to Wikipedia and Google n-gram
stylistic classification problem
a crawler to crawl text snippets
interesting for AGESS
UCD : Diachronic Text Classification with Character, Word, and Syntactic N-grams
stylometric text classification
word epoch disambiguation (Mihalcea and Nastase, 2012)
temporal text ranking (Niculae et al, 2014): Temporal Text Ranking and Automatic Dating of Texts
identifying period-specific language
direct lookup
focus on language style
treat it as a multiclass classification (Weka SMO 1-vs-1 polynomial)
label each text using non overlapping year ranges
CPWS features
Naive Bayes estimate of p(y|w) for each year (the probability of year y given that word w is used)
Multiclass classification seems to work better
character n-grams are highly effective features for diachronic classification (but not very satisfying)
the prior distribution over date-labels has a significant domain-specific effect
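As a reminder of how simple the baseline idea is, here is a toy multinomial Naive Bayes over character trigrams that labels a snippet with one of several non-overlapping year ranges. This is my own sketch, not the UCD system (which used Weka SMO): the class name, the add-one smoothing and the four training snippets in the usage note are all assumptions and toy data.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """List of character n-grams of the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NaiveBayesDater:
    """Multinomial Naive Bayes over character n-grams, add-one smoothing."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.ngram_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.ngram_counts[label].update(char_ngrams(text))
        self.vocab = {g for c in self.ngram_counts.values() for g in c}
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.label_counts.items():
            counts = self.ngram_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # prior over date labels
            for gram in char_ngrams(text):
                score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Trained on toy snippets such as "thou art welcome hither" labelled "1600-1700" and "send me an email today" labelled "1950-2000", the model dates "thou hast" to the earlier range. The explicit prior term is where the domain-specific effect of the label distribution noted above would enter.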
SemEval-2015 Task 8: SpaceEval
Question answering about location of objects, events
Text to scene conversion/visualization
Generating textual description of images
Navigational instructions to a robot
Adopts the ISO-Space encoding for spatial information (and the ISO-Space metamodel)
qualitative spatial link: RCC8 relations for the topological relations between elements (QLink)
QLinks: qualitative spatial links
Uses SpatialML relation types based on RCC8
Example: the book is on the table
spatialsignal(s1, cluster=“on-1”, semantictype=topological, directional)
qslink(qsl1, trajector=se1, landmark=se2, signal=s1, relType=EC)
Very few participants on this task
Standard machine learning models
Lexical, syntactical, open source features
SpRL-CWW: Spatial Relation Classification with Independent Multi-class Models
Spatial role labeling
Sequential labeler, generate candidate relation tuples, multi-class classifiers
spatial elements and signal
the ball is in the backyard of the house: detect signals (in) with lemmatize, pos-tagger, etc.
Spatial element: the ball; spatial signal: in; place: the backyard; spatial signal: of; spatial element: the house
classify candidate spatial relations and label arguments with multi-class classifiers for each relation type
dependency path to spatial signal, lemma, pos, direction from spatial signal.
best features: raw string in a 5 word window, 300-dimension GloVe word vector; POS bigrams for a 5-word window (best feature)
Taxonomy extraction: given a list of domain-specific terms, structure them in a taxonomy
Subtask: term extraction, relation discovery, taxonomy construction
Domains: chemical, equipment, food, science.
combined gold standards:
Wikipedia Bitaxonomy (WiBi)
The Google product taxonomy (food)
material handling equipment (equipment)
taxonomy of fields and their subfields (science)
baselines: all the nodes connected to the root concept, string inclusion (science and network science)
structural evaluation: presence of cycles and intermediate nodes
Evaluation: cumulative Fowlkes & Mallows (a measure for comparing clusterings)
Generalised F&M and cumulative F&M
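For reference, the standard (flat) Fowlkes-Mallows index between two clusterings is the geometric mean of pairwise precision and recall. The sketch below implements only that basic index; the generalised and cumulative variants used in the task, which compare taxonomies level by level, are not reproduced here.

```python
import math
from itertools import combinations


def fowlkes_mallows(labels_a, labels_b):
    """Fowlkes-Mallows index between two flat clusterings of the same items."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            tp += 1   # pair together in both clusterings
        elif same_a:
            fp += 1   # together in A only
        elif same_b:
            fn += 1   # together in B only
    if tp == 0:
        return 0.0
    return math.sqrt((tp / (tp + fp)) * (tp / (tp + fn)))
```

Identical clusterings score 1.0, and the score drops as pairs of items that are grouped together in one clustering are separated in the other.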
The task didn't provide the corpus, just the terms: each participant had to find his own corpus
the baseline is closer to the base system
Taxonomy visualisation
Relation discovery: lexico-syntactic patterns have high precision but low recall
co-occurrence based approaches improve results
taxonomy construction: approaches are less known or difficult to reimplement
no corpus was provided and participants had no gold standard
terms: one to nine words
substring inclusion: bicycle helmet < helmet (suffix)
main intuition: hypernyms and hyponyms often occur together
strategy: cooccurrence statistics and term frequencies in a collection of documents
Wikipedia (only text, no categories, redirects, titles)
sentencized: 125 million sentences
counts of term cooccurrence in the same sentence (document frequency of terms)
method: consider all domain terms B co-occurring in the same Wikipedia sentences, eliminate any candidate B that appears in fewer documents than A, retain N=3
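The co-occurrence method above fits in a few lines. A minimal sketch under my own simplifications: sentences are given as sets of tokens, terms are single tokens (the real term lists have up to nine words), the sentence is the counting unit, and ranking candidates by raw co-occurrence count before keeping the top N is my assumption about how "retain N=3" is applied.

```python
from collections import Counter
from itertools import permutations


def hypernym_candidates(sentences, terms, top_n=3):
    """Rank hypernym candidates B for each term A by sentence co-occurrence.

    `sentences` is an iterable of token sets; `terms` is the domain term list.
    A candidate B is dropped when it occurs in fewer sentences than A.
    """
    terms = set(terms)
    freq = Counter()    # number of sentences containing each term
    cooc = Counter()    # number of sentences containing both A and B
    for sent in sentences:
        present = terms & set(sent)
        freq.update(present)
        cooc.update(permutations(present, 2))
    result = {}
    for a in terms:
        cands = [
            (cooc[(a, b)], b) for b in terms
            if b != a and cooc[(a, b)] > 0 and freq[b] >= freq[a]
        ]
        # Highest co-occurrence first; ties broken alphabetically.
        cands.sort(key=lambda x: (-x[0], x[1]))
        result[a] = [b for _, b in cands[:top_n]]
    return result
```

The frequency filter encodes the intuition that a hypernym is more general, and so more frequent, than its hyponyms: the most frequent term in the list ends up with no candidates and becomes a natural root.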
SemEval 2015 Task 18: Broad-Coverage Semantic Dependency Parsing
SemEval-2016 Task Announcements and closing session
Compute the degree of semantic similarity between paired sentences (as usual)
Annotated data
Evaluation: Pearson correlation to mean of human scores
Applications: deep QA, distillation, generation, machine reading, MT, plagiarism detection, paraphrasing, textual inference, summarization and many more
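Since the evaluation is the Pearson correlation between system scores and the mean human scores, a plain implementation is handy for quick checks on our own runs. A minimal sketch with no handling of constant inputs (it would divide by zero in that case):

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

Here `xs` would be the system's similarity scores and `ys` the mean human scores for the same sentence pairs.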
Datasets 2016
Plagiarism detection
QA question-question
Post-edited MT
Q&A Answer-answer
Data selection will target weakness of existing techniques
Pilot task: crosslingual STS
John said he is considered a witness but not a suspect
“Él ya no es un sospechoso”, John dijo
Interpretable STS
Full task on its own
Student grading scenario
Given a pair of sentences
chunk the sentences /gold chunks provided
systems align chunks across both sentences
score similarity for each chunk pair
classify the type of relation: EQUI, OPPO, SPE, SIMI, REL, FACT, OPI
Annotation guidelines public, high quality annotation (75 F1), high participation
Novelties: allow for N:M alignments
New test data: same datasets and education-related dataset
Given a question Q, find a good answer A from a collection of CQA threads
Given a question, find a similar question
English and Arabic tasks
tweet level polarity
topic level polarity (pos/neg/neu, 5 stars)
topic trend detection (pos/neg/neu, 5 stars)
Aspect-based sentiment analysis (ABSA)
input: a target and a tweet pair
output: determine whether the author of the text is in favor of, against or neutral towards the target
two subtask: labeled training data and no labeled training data
Sentiment intensity of English and Arabic
Meaning representation parsing
Input: the soldier was not afraid of dying
Output: (f / fear-01 :arg0 (s / soldier) :arg2 (d / die-01 :arg0 s) :polarity "-")
resources: 15000 training pairs (LDC/DEFT), tokenizer, aligner (Pourdamghani et al 14), AMR manipulation library, baseline parser (Flanigan et al 14), scorer (Cai & Knight '13)
Chinese semantic dependency parsing
Semantic Analysis Track
Detection of minimal semantics units and their meanings (DimSUM)
Lexical semantic task
units (single/multiwords expressions) and classes (noun+verbs)
I googled restaurants in the area and Fuji Sushi came up and reviews were great so I made a carry out order.
I googled restaurants in the area and Fuji_Sushi came_up … carry_out
googled (V:COMMUNICATION) restaurants (N:GROUP) area (N:LOCATION)
Tag the English sentence for MWEs and supersenses
Domains: online reviews and tweets (Copenhagen supersense dataset, Johannsen et al 2014)
Complex Word Identification
The cat perched on the mat
complex: perched
simple: cat, mat
Format: 2247 training instances, 88000 testing instances
Corpus comes from 400 non-native speakers of English, with 42 distinct native languages: ~4000 complex words were found
Clinical TempEval
Task definition: organise domain-specific terms in a taxonomy
Datasets: chemicals, equipment, food, science
goldstandards: wordnet, wikipedia, online taxonomies
multilingual settings: English, French, Italian and Dutch
evaluation: structural evaluation
comparison against gold standards
the corpus will be provided to participants this year.
semantic taxonomy enrichment
task objective: given a word and a gloss, identify the WordNet synset that is its synonym or hyponym
task oriented to words missing in WordNet
Closing session
peer-reviewed all future task proposals
organized 14 tasks in 5 tracks
new paper reviewing guidelines
improve replicability (possibly introducing the chance of paper rejection)
the semeval experience
noting the reviews!!!
release the initial versions of submitted papers, with anonymized reviews and ratings, after SemEval as a corpus for analysis!!!
what makes a good review?
Semeval 2017 Task: predicting reviewing quality (???)
Use Sultan to align @menosdias tweets
While drinking coffee with the charming Houda Bouamor (from Qatar Carnegie Mellon) we had the idea of a multilingual summarizer (French-LIPN, Spanish-IIMAS, Arabic-Qatar) based on alignment and moderate generation (LORIA). She said she could get some funding for a one-year project from Qatar.
It suddenly occurs to me that @menosdías might be enough for a SemEval task
the more I think about it, the more I tell myself: let's do chunking and alignment with @menosdías using pure semantic similarity and we've got the paper (and the entity extraction, at least for the tweets)
thinking about AGESS (Automatic Generation of State of the Arts)… I think we should put a PhD student to work on time (and to participate in the SemEval temporal tasks)
Using coreference reduces the number of unknown relations significantly
SpaceEval might be useful for the GolemGenFred project
Orientation Link (OLink) describes non-topological relationships between spatial elements (the chair is in front of the couch)
Data sources:
SE (spatial entities), SS (spatial signal), MI (motion signal)
Three configurations: un-annotated text, manually annotated spatial elements, manually annotated spatial elements with attributes
word2vec… and wordembedding
Can Selectional Preferences Help Automatic Semantic Role Labeling? Shumin Wu and Martha Palmer (interesting poster, semantic role labelling with LDA)