Language Modeling With Sentence-Level Mixtures
This paper introduces a simple mixture language model that attempts to capture long distance constraints in a sentence or paragraph. The model is an m-component mixture of digram models. The models were constructed using a 5K vocabulary and trained using a 76 million word Wall Street Journal text corpus. Using the BU recognition system, experiments show a 7% improvement in recognition accuracy with the mixture digram models as compared to using a digram model.
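As a rough illustration of the idea (not code from the paper), a sentence-level mixture model scores the whole sentence under each component digram (bigram) model and combines the per-sentence probabilities with mixture weights; the sketch below assumes the component models and weights are already given.

    import math

    def sentence_log_prob(sentence, components, weights):
        """Sentence-level mixture of bigram models: each component scores the
        whole sentence, and the per-sentence probabilities are mixed."""
        total = 0.0
        for bigram_probs, weight in zip(components, weights):
            log_p = 0.0
            prev = "<s>"
            for word in sentence + ["</s>"]:
                # bigram_probs maps (previous word, word) -> P(word | previous word)
                log_p += math.log(bigram_probs.get((prev, word), 1e-10))
                prev = word
            total += weight * math.exp(log_p)
        return math.log(total)

    # Toy usage with two hand-specified component models (hypothetical numbers).
    topic_a = {("<s>", "stocks"): 0.4, ("stocks", "fell"): 0.5, ("fell", "</s>"): 0.6}
    topic_b = {("<s>", "stocks"): 0.1, ("stocks", "fell"): 0.2, ("fell", "</s>"): 0.3}
    print(sentence_log_prob(["stocks", "fell"], [topic_a, topic_b], [0.7, 0.3]))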
Identifying Unknown Proper Names In Newswire Text
The identification of unknown proper names in text is a significant challenge for NLP systems operating on unrestricted text. A system which indexes documents according to name references can be useful for information retrieval or as a preprocessor for more knowledge intensive tasks such as database extraction. This paper describes a system which uses text skimming techniques for deriving proper names and their semantic attributes automatically from newswire text, without relying on any listing of name elements. In order to identify new names, the system treats proper names as (potentially) context-dependent linguistic expressions. In addition to using information in the local context, the system exploits a computational model of discourse which identifies individuals based on the way they are described in the text, instead of relying on their description in a pre-existing knowledge base.
Representation of Atypical Entities in Ontologies
This paper is a contribution to the study of formal ontology. Some entities belong more or less to a class. In particular, some individual entities are attached to classes even though they do not satisfy all the properties of the class. To specify whether an individual entity belonging to a class is typical or not, we borrow the topological concepts of interior, border, closure, and exterior. We define a system of relations by adapting these topological operators. A scale of typicality, based on topology, is introduced. It makes it possible to define levels of typicality at which individual entities are more or less typical elements of a concept.
Geo-WordNet: Automatic Georeferencing of WordNet
WordNet has been used extensively as a resource for the Word Sense Disambiguation (WSD) task, both as a sense inventory and a repository of semantic relationships. Recently, we investigated the possibility of using it as a resource for the Geographical Information Retrieval task, more specifically for the toponym disambiguation task, which can be considered a specialization of WSD. We found that it would be very useful to assign coordinates to the geographical entities in WordNet, especially in order to implement geometric shape-based disambiguation methods. This paper presents Geo-WordNet, an automatic annotation of WordNet with geographical coordinates. The annotation has been carried out by extracting geographical synsets from WordNet, together with their holonyms and hypernyms, and comparing them to the entries in the Wikipedia-World geographical database. A weight was calculated for each of the candidate annotations, on the basis of matches found between the database entries and the synset gloss, holonyms, and hypernyms. The resulting resource may be used in Geographical Information Retrieval related tasks, especially for toponym disambiguation.
Robust Generic And Query-Based Summarization
"We present a robust summarisation system developed within the GATE architecture that makes use of robust components for semantic tagging and coreference resolution provided by GATE. Our system combines GATE components with well established statistical techniques developed for the purpose of text summarisation research. The system supports "generic" and query-based summarisation addressing the need for user adaptation. "
Description Of The LINK System Used For MUC-4
The University of Michigan's natural language processing system, called LINK, is a unification-based system which we have developed over the last four years. Prior to MUC-4, LINK had been used to extract information from free-form texts in two narrow application domains. One application corpus contained terse descriptions of symptoms displayed by malfunctioning automobiles, and the repairs which fixed them. The other corpus described sequences of activities to be performed on an assembly line. In empirical testing in these two domains, LINK correctly processed 70% of previously unseen descriptions. A template was counted as correct only if all of the fillers in the template were filled correctly. In addition, LINK generated incomplete (but not incorrect) templates for another 15% of the descriptions. These previous domains were much narrower than the MUC-4 terrorism domain. As a comparison, the lexicons for the previous domains contained only 300-500 words, compared with 6700 words in our MUC-4 test configuration. Previous grammar size ranged from 75-100 rules, compared with over 500 rules in the MUC-4 knowledge base. In addition, the previous application domains consisted only of single-sentence inputs. Thus, the integration of information from multiple sentences was not an issue in our previous work.
Research And Development In Natural Language Understanding
Brief Summary of Objectives: There are three objectives of the contract: to perform research and development in parallel parsing, semantic representation, ill-formed input, discourse, and tools for linguistic knowledge acquisition; to integrate software components from BBN and elsewhere to produce Janus, DARPA's New Generation Natural Language Interface; and to demonstrate state-of-the-art natural language technology in DARPA applications. The following software has been distributed: the natural language system; IRACQ, a knowledge acquisition system; system components and knowledge bases of Janus; the KL-TWO knowledge representation and inference system integrated with Janus; and various components for DARPA's Spoken Language Systems Project at BBN.
Automatic Pattern Acquisition For Japanese Information Extraction
One of the central issues for information extraction is the cost of customization from one scenario to another. Research on the automated acquisition of patterns is important for portability and scalability. In this paper, we introduce Tree-Based Pattern representation where a pattern is denoted as a path in the dependency tree of a sentence. We outline the procedure to acquire Tree-Based Patterns in Japanese from un-annotated text. The system extracts the relevant sentences from the training data based on TF/IDF scoring and the common paths in the parse tree of relevant sentences are taken as extracted patterns.
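A minimal sketch of the TF/IDF-based relevance scoring step described above (an illustration, not the authors' code); sentences scoring above a threshold would then be parsed so that common dependency paths can be collected as patterns.

    import math
    from collections import Counter

    def tfidf_score(sentence_tokens, doc_freq, num_docs):
        """Score a sentence by averaging TF * IDF over its tokens."""
        tf = Counter(sentence_tokens)
        score = 0.0
        for term, freq in tf.items():
            idf = math.log(num_docs / (1 + doc_freq.get(term, 0)))
            score += freq * idf
        return score / max(len(sentence_tokens), 1)

    # Toy usage: document frequencies collected from a (hypothetical) training corpus.
    doc_freq = {"acquired": 3, "company": 40, "the": 95}
    print(tfidf_score(["the", "company", "acquired"], doc_freq, num_docs=100))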
Detecting Dependencies Between Semantic Verb Subclasses And Subcategorization Frames In Text Corpora
We present a method for individuating dependencies between the semantic class of predicates and their associated subcategorization frames, and describe an implementation which allows the acquisition of such dependencies from bracketed texts.
Application-Driven Automatic Subgrammar Extraction
For many applications, the space and run-time requirements of broad-coverage grammars appear unreasonably large in relation to the relative simplicity of the task at hand. On the other hand, handcrafted development of application-dependent grammars is in danger of duplicating work which is then difficult to re-use in other contexts of application. To overcome this problem, we present in this paper a procedure for the automatic extraction of application-tuned, consistent subgrammars from proven large-scale generation grammars. The procedure has been implemented for large-scale systemic grammars and builds on the formal equivalence between systemic grammars and typed unification-based grammars. Its evaluation for the generation of encyclopedia entries is described, and directions of future development, applicability, and extensions are discussed.
Combining Text And Heuristics For Cost-Sensitive Spam Filtering
Spam filtering is a text categorization task with special features that make it interesting and difficult. First, the task has traditionally been performed using heuristics from the domain. Second, a cost model is required to avoid misclassification of legitimate messages. We present a comparative evaluation of several machine learning algorithms applied to spam filtering, considering the text of the messages and a set of heuristics for the task. Cost-oriented biasing and evaluation are performed.
A Flexible Framework For Developing Mixed-Initiative Dialog Systems
We present a new framework for rapid development of mixed-initiative dialog systems. Using this framework, a developer can author sophisticated dialog systems for multiple channels of interaction by specifying an interaction modality, a rich task hierarchy and task parameters, and domain-specific modules. The framework includes a dialog history that tracks input, output, and results. We present the framework and preliminary results in two application domains.
Resolving Quasi Logical Forms
"SRI International Cambridge Research Centre 23 Millers Yard Cambridge CB2 IRQ U.K.The paper describes intermediate and resolved logical form representations of sentences involving referring expressions and a reference resolution process for mapping between these representations. The intermediate representation, Quasi Logical Form (or QLF), may contain unresolved terms corresponding to anaphoric noun phrases covering bound variable anaphora, reflexives, and definite descriptions. Implicit relations arising in constructs such as compound nominals appear in QLF as unresolved formulae. The QLF representation is also neutral with respect to ambiguities corresponding to quantifier scope and the collective/distributive distinction, the latter being treated as quantifier resolution. Reference candidates are proposed according to an ordered set of "reference resolution rules" producing possible resolved logical forms to which linguistic and pragmatic constraints are then applied."
Constituent-based Accent Prediction
Near-perfect automatic accent assignment is attainable for citation-style speech, but better computational models are needed to predict accent in extended, spontaneous discourses. This paper presents an empirically motivated theory of the discourse focusing nature of accent in spontaneous speech. Hypotheses based on this theory lead to a new approach to accent prediction, in which patterns of deviation from citation form accentuation, defined at the constituent or noun phrase level, are automatically learned from an annotated corpus. Machine learning experiments on 1031 noun phrases from eighteen spontaneous direction-giving monologues show that accent assignment can be significantly improved by up to 4%-6% relative to a hypothetical baseline system that would produce only citation-form accentuation, giving error rate reductions of 11%-25%.
The Italian Lexical Sample Task At Senseval-3
The Italian lexical sample task at SENSEVAL-3 provided a framework to evaluate supervised and semi-supervised WSD systems. This paper reports on the task preparation - which offered the opportunity to review and refine the Italian MultiWordNet - and on the results of the six participants, focussing on both the manual and automatic tagging procedures.
Dimensionality Reduction Aids Term Co-Occurrence Based Multi-Document Summarization
A key task in an extraction system for query-oriented multi-document summarisation, necessary for computing relevance and redundancy, is modelling text semantics. In the Embra system, we use a representation derived from the singular value decomposition of a term co-occurrence matrix. We present methods to show the reliability of performance improvements. We find that Embra performs better with dimensionality reduction.
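The representation described above can be sketched as follows (an illustration under assumed details, not the Embra implementation): build a term co-occurrence matrix, reduce it with a truncated SVD, and represent a sentence as the sum of the reduced vectors of its terms.

    import numpy as np

    def reduced_term_vectors(cooccurrence, k):
        """Truncated SVD of a term co-occurrence matrix; rows become k-dim term vectors."""
        u, s, _ = np.linalg.svd(cooccurrence, full_matrices=False)
        return u[:, :k] * s[:k]

    def sentence_vector(term_indices, term_vectors):
        # Represent a sentence as the sum of the reduced vectors of its terms.
        return term_vectors[term_indices].sum(axis=0)

    # Toy usage: a 4-term co-occurrence matrix reduced to 2 dimensions.
    cooc = np.array([[2., 1., 0., 0.],
                     [1., 3., 1., 0.],
                     [0., 1., 2., 1.],
                     [0., 0., 1., 1.]])
    vecs = reduced_term_vectors(cooc, k=2)
    print(sentence_vector([0, 2], vecs))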
Object-Extraction And Question-Parsing Using CCG
Accurate dependency recovery has recently been reported for a number of wide-coverage statistical parsers using Combinatory Categorial Grammar (CCG). However, overall figures give no indication of a parser's performance on specific constructions, nor how suitable a parser is for specific applications. In this paper we give a detailed evaluation of a CCG parser on object extraction dependencies found in WSJ text. We also show how the parser can be used to parse questions for Question Answering. The accuracy of the original parser on questions is very poor, and we propose a novel technique for porting the parser to a new domain, by creating new labelled data at the lexical category level only. Using a supertagger to assign categories to words, trained on the new data, leads to a dramatic increase in question parsing accuracy.
Dependency-based paraphrasing for recognizing textual entailment
This paper addresses syntax-based paraphrasing methods for Recognizing Textual Entailment (RTE). In particular, we describe a dependency-based paraphrasing algorithm, using the DIRT data set, and its application in the context of a straightforward RTE system based on aligning dependency trees. We find a small positive effect of dependency-based paraphrasing on both the RTE3 development and test sets, but the added value of this type of paraphrasing deserves further analysis.
A Quantitative Approach To Preposition-Pronoun Contraction In Polish
This paper presents the current results of an ongoing research project on corpus distribution of prepositions and pronouns within Polish preposition-pronoun contractions. The goal of the project is to provide a quantitative description of Polish preposition-pronoun contractions taking into consideration morphosyntactic properties of their components. It is expected that the results will provide a basis for a revision of the traditionally assumed inflectional paradigms of Polish pronouns and, thus, for a possible remodeling of these paradigms. The results of corpus-based investigations of the distribution of prepositions within preposition-pronoun contractions can be used for grammar-theoretical and lexicographic purposes.
Lexicalized Phonotactic Word Segmentation
This paper presents a new unsupervised algorithm (WordEnds) for inferring word boundaries from transcribed adult conversations. Phone ngrams before and after observed pauses are used to bootstrap a simple discriminative model of boundary marking. This fast algorithm delivers high performance even on morphologically complex words in English and Arabic, and promising results on accurate phonetic transcriptions with extensive pronunciation variation. Expanding training data beyond the traditional miniature datasets pushes performance numbers well above those previously reported. This suggests that WordEnds is a viable model of child language acquisition and might be useful in speech understanding.
Mixed-Source Multi-Document Speech-to-Text Summarization
Speech-to-text summarization systems usually take as input the output of an automatic speech recognition (ASR) system that is affected by issues like speech recognition errors, disfluencies, or difficulties in the accurate identification of sentence boundaries. We propose the inclusion of related, solid background information to cope with the difficulties of summarizing spoken language and the use of multi-document summarization techniques in single document speech-to-text summarization. In this work, we explore the possibilities offered by phonetic information to select the background information and conduct a perceptual evaluation to better assess the relevance of the inclusion of that information. Results show that summaries generated using this approach are considerably better than those produced by an up-to-date latent semantic analysis (LSA) summarization method and suggest that humans prefer summaries restricted to the information conveyed in the input source.
Improving Bitext Word Alignments Via Syntax-Based Reordering Of English
We present an improved method for automated word alignment of parallel texts which takes advantage of knowledge of syntactic divergences, while avoiding the need for syntactic analysis of the less resource rich language, and retaining the robustness of syntactically agnostic approaches such as the IBM word alignment models. We achieve this by using simple, easily-elicited knowledge to produce syntax-based heuristics which transform the target language (e.g. English) into a form more closely resembling the source language, and then by using standard alignment methods to align the transformed bitext. We present experimental results under variable resource conditions. The method improves word alignment performance for language pairs such as English-Korean and English-Hindi, which exhibit longer-distance syntactic divergences.
Knowledge Representation Method Based On Predicate Calculus In An Intelligent CAI System
A knowledge representation method is introduced for application in an ICAI system that teaches a programming language. Knowledge about the syntax and semantics of that language is represented by a set of axioms written in the predicate calculus language. The directed graph of concepts is mentioned as a method to represent an instructional structure of the domain knowledge. A proof procedure for answering students' questions is described.
Information Extraction And Semantic Constraints
We consider the problem of extracting specified types of information from natural language text. To properly analyze the text, we wish to apply semantic (selectional) constraints whenever possible; however, we cannot expect to have semantic patterns for all the input we may encounter in real texts. We therefore use preference semantics: selecting the analysis which maximizes the number of semantic patterns matched. We describe a specific information extraction task, and report on the benefits of using preference semantics for this task.
What's in a Colour? Studying and Contrasting Colours with COMPARA
In this paper we present contrastive colour studies done using COMPARA, the largest edited parallel corpus in the world (as far as we know). The studies were the result of semantic annotation of the corpus in this domain. We chose to start with colour because it is a relatively contained lexical category and the subject of many arguments in linguistics. We begin by explaining the criteria involved in the annotation process, not only for the colour categories but also for the colour groups created in order to do finer-grained analyses, presenting also some quantitative data regarding these categories and groups. We proceed to compare the two languages according to the diversity of available lexical items, morphological and syntactic properties, and then try to understand the translation of colour. We end by explaining how any user who wants to do serious studies using the corpus can collaborate in enhancing the corpus and making their semantic annotations widely available as well.
Text Authoring, Knowledge Acquisition And Description Logics
We present a principled approach to the problem of connecting a controlled document authoring system with a knowledge base. We start by describing closed-world authoring situations, in which the knowledge base is used for constraining the possible documents and orienting the user’s selections. Then we move to open-world authoring situations in which, additionally, choices made during authoring are echoed back to the knowledge base. In this way the information implicitly encoded in a document becomes explicit in the knowledge base and can be re-exploited for simplifying the authoring of new documents. We show how a Datalog KB is sufficient for the closed-world situation, while a Description Logic KB is better-adapted to the more complex open-world situation. All along, we pay special attention to logically sound solutions and to decidability issues in the different processes.
Temporal Reasoning In Natural Language Understanding: The Temporal Structure Of The Narrative
This paper proposes a new framework for discourse analysis, in the spirit of Grosz and Sidner (1986) and Webber (1987a,b), but differentiated with respect to the type or genre of discourse. It is argued that different genres call for different representations and processing strategies; particularly important is the distinction between subjective, performative discourse and objective discourse, of which narrative is a primary example. This paper concentrates on narratives and introduces the notions of temporal focus (proposed also in Webber (1987b)) and narrative move. The processing tasks involved in reconstructing the temporal structure of a narrative (Webber's e/s structure) are formulated in terms of these two notions. The remainder of the paper analyzes the durational and aspectual knowledge needed for those tasks. Distinctions are established between grammatical aspect, aspectual class, and the aspectual perspective of a sentence in discourse; it is shown that in English, grammatical aspect underdetermines the aspectual perspective.
Standardising Bilingual Lexical Resources According to the Lexicon Markup Framework
The Dutch HLT agency for language and speech technology (known as TST-centrale) at the Institute for Dutch Lexicology is responsible for the maintenance, distribution and accessibility of (Dutch) digital language resources. In this paper we present a project which aims to standardise the format of a set of bilingual lexicons in order to make them available to potential users, to facilitate the exchange of data (among the resources and with other (monolingual) resources), and to enable reuse of these lexicons for NLP applications like machine translation and multilingual information retrieval. We pay special attention to the methods and tools we used and to some of the problematic issues we encountered during the conversion process. As these problems are mainly caused by the fact that the standard LMF model fails to represent the detailed semantic and pragmatic distinctions made in our bilingual data, we propose some modifications to the standard. In general, we think that a standard for lexicons should provide a model for bilingual lexicons that is able to represent all the detailed and fine-grained translation information which is generally found in these types of lexicons.
Improving Chinese Semantic Role Classification with Hierarchical Feature Selection Strategy
In recent years, with the development of Chinese semantically annotated corpora, such as the Chinese Proposition Bank and the Normalization Bank, the Chinese semantic role labeling (SRL) task has advanced considerably. As in English, Chinese SRL can be divided into two tasks: semantic role identification (SRI) and classification (SRC). Many features have been introduced for these tasks and promising results have been achieved. In this paper, we focus mainly on the second task: SRC. After exploiting the linguistic discrepancy between numbered arguments and ARGMs, we built a semantic role classifier based on a hierarchical feature selection strategy. Different from previous SRC systems, we divided SRC into three subtasks in sequence and trained models for each subtask. Under the hierarchical architecture, each argument is first determined to be either a numbered argument or an ARGM, and is then classified into fine-grained categories. Finally, we integrated the idea of exploiting argument interdependence into our system and further improved the performance. With this method, the classification precision of our system is 94.68%, which significantly outperforms the strong baseline. It is also the state of the art on Chinese SRC.
Making Tree Kernels Practical For Natural Language Learning
In recent years tree kernels have been proposed for the automatic learning of natural language applications. Unfortunately, they show (a) an inherent super-linear complexity and (b) a lower accuracy than traditional attribute/value methods. In this paper, we show that tree kernels are very helpful in the processing of natural language, as (a) we provide a simple algorithm to compute tree kernels in linear average running time, and (b) our study of the classification properties of diverse tree kernels shows that kernel combinations always improve on the traditional methods. Experiments with Support Vector Machines on the predicate argument classification task provide empirical support for our thesis.
Utilizing Domain-Specific Information For Processing Compact Text
This paper identifies the types of sentence fragments found in the text of two domains: medical records and Navy equipment status messages. The fragment types are related to full sentence forms on the basis of the elements which were regularly deleted. A breakdown of the fragment types and their distributions in the two domains is presented. An approach to reconstructing the semantic class of deleted elements in the medical records is proposed, based on the semantic patterns recognized in the domain.
Opportunities For Advanced Speech Processing In Military Computer-Based Systems
This paper presents a study of military applications of advanced speech processing technology which includes three major elements: (1) review and assessment of current efforts in military applications of speech technology; (2) identification of opportunities for future military applications of advanced speech technology; and (3) identification of problem areas where research in speech processing is needed to meet application requirements, and of current research thrusts which appear promising. The relationship of this study to previous assessments of military applications of speech technology is discussed, and substantial recent progress is noted. Current efforts in military applications of speech technology which are highlighted include: (1) narrowband (2400 b/s) and very low-rate (50-1200 b/s) secure voice communication; (2) voice/data integration in computer networks; (3) speech recognition in fighter aircraft, military helicopters, battle management, and air traffic control training systems; and (4) noise and interference removal for human listeners. Opportunities for advanced applications are identified by means of descriptions of several generic systems which would be possible with advances in speech technology and in system integration. These generic systems include: (1) integrated multi-rate voice/data communications terminal; (2) interactive speech enhancement system; (3) voice-controlled pilot's associate system; (4) advanced air traffic control training systems; (5) battle management command and control support system with spoken natural language interface; and (6) spoken language translation system. In identifying problem areas and research efforts to meet application requirements, it is observed that some of the most promising research involves the integration of speech algorithm techniques including speech coding, speech recognition, and speaker recognition.
A Framework For MT And Multilingual NLG Systems Based On Uniform Lexico-Structural Processing
In this paper we describe an implemented framework for developing monolingual or multilingual natural language generation (NLG) applications and machine translation (MT) applications. The framework demonstrates a uniform approach to generation and transfer based on declarative lexico-structural transformations of dependency structures of syntactic or conceptual levels ("uniform lexico-structural processing"). We describe how this framework has been used in practical NLG and MT applications, and report the lessons learned.
Structural Ambiguity And Conceptual Relations
Lexical co-occurrence statistics are becoming widely used in the syntactic analysis of unconstrained text. However, analyses based solely on lexical relationships suffer from sparseness of data: it is sometimes necessary to use a less informed model in order to reliably estimate statistical parameters. For example, the "lexical association" strategy for resolving ambiguous prepositional phrase attachments [Hindle and Rooth 1991] takes into account only the attachment site (a verb or its direct object) and the preposition, ignoring the object of the preposition. We investigated an extension of the lexical association strategy to make use of noun class information, thus permitting a disambiguation strategy to take more information into account. Although in preliminary experiments the extended strategy did not yield improved performance over lexical association alone, a qualitative analysis of results suggests that the problem lies not in the noun class information, but rather in the multiplicity of classes available for each noun in the absence of sense disambiguation. This suggests several possible revisions of our proposal.
EFLUF - An Implementation Of A FLexible Unification Formalism
In this paper we describe EFLUF, an implementation of FLUF. The idea behind this environment is to provide a basis for experimenting with unification grammars. In this environment we want to allow the user to affect as many features of the formalism as possible, thus being able to test and compare various constructions proposed for unification-based formalisms. The paper exemplifies the main features of EFLUF and shows how these can be used for defining a grammar. The most interesting features of EFLUF are the various possibilities to affect the behavior of the system. The user can define new constructions and how they unify, define new syntax for these constructions, and use external unification modules. The paper also discusses how a system like EFLUF would work for a larger application and suggests some additional features and restrictions that would be needed for this.
Hybrid Text Chunking
This paper proposes an error-driven HMM-based text chunk tagger with a context-dependent lexicon. Compared with a standard HMM-based tagger, this tagger incorporates more contextual information into a lexical entry. Moreover, an error-driven learning approach is adopted to decrease the memory requirement by keeping only positive lexical entries, which makes it possible to further incorporate more context-dependent lexical entries. Finally, memory-based learning is adopted to further improve the performance of the chunk tagger.
User-Friendly Text Prediction For Translators
Text prediction is a form of interactive machine translation that is well suited to skilled translators. In principle it can assist in the production of a target text with minimal disruption to a translator's normal routine. However, recent evaluations of a prototype prediction system showed that it significantly decreased the productivity of most translators who used it. In this paper, we analyze the reasons for this and propose a solution which consists in seeking predictions that maximize the expected benefit to the translator, rather than just trying to anticipate some amount of upcoming text. Using a model of a “typical translator” constructed from data collected in the evaluations of the prediction prototype, we show that this approach has the potential to turn text prediction into a help rather than a hindrance to a translator.
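A toy sketch of the selection criterion (our illustration of the idea, not the authors' model): among candidate predictions, choose the one maximizing expected benefit, modeled here as acceptance probability times keystrokes saved minus a fixed reading cost.

    def expected_benefit(prediction, accept_prob, read_cost=0.5):
        """Expected benefit = P(accept) * characters saved - cost of reading the proposal."""
        return accept_prob * len(prediction) - read_cost

    def best_prediction(candidates):
        # candidates: list of (predicted text, estimated acceptance probability)
        scored = [(expected_benefit(text, p), text) for text, p in candidates]
        benefit, text = max(scored)
        return text if benefit > 0 else None  # propose nothing if no candidate helps

    # Toy usage: a short, likely prediction can beat a longer, unlikely one.
    print(best_prediction([("the agreement", 0.4), ("the", 0.9), ("the agreement was signed", 0.1)]))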
A Computational Grammar Of Discourse-Neutral Prosodic Phrasing In English
"We describe an experimental text-to-speech system that uses information about syntactic constituency, adjacency to a verb, and constituent length to determine prosodic phrasing for synthetic speech. Central goal of our work has been to characterize "discourse neutral" phrasing, i.e. sentence-level phrasing patterns that are independent of discourse semantics. Our account builds on Bachenko et al. (1986), but differs in its treatment of clausal structure and predicate-argument relations. Results so far indicate that the current system performs well when measured against a corpus of judgments of prosodic phrasing."
Noun-Phrase Co-occurrence Statistics for Semi-Automatic Semantic Lexicon Construction
" Generating semantic lexicons semi-automatically could be a great time saver, relative to creating them by hand. In this paper, we present an algorithm for extracting potential entries for a category from an on-line corpus, based upon a small set of exemplars. Our algorithm finds more correct terms and fewer incorrect ones than previous work in this area. Additionally, the entries that are generated potentially provide broader coverage of the category than would occur to an individual coding them by hand. Our algorithm finds many terms not included within Wordnet (many more than previous algorithms ), and could be viewed as an "enhancer" of existing broad-coverage resources. "
Simple Features For Statistical Word Sense Disambiguation
In this paper, we describe our experiments on statistical word sense disambiguation (WSD) using two systems based on different approaches: Naive Bayes on word tokens and Maximum Entropy on local syntactic and semantic features. In the first approach, we consider a context window and a sub-window within it around the word to disambiguate. Within the outside window, only content words are considered, but within the sub-window, all words are taken into account. Both window sizes are tuned by the system for each word to disambiguate and accuracies of 75% and 67% were respectively obtained for coarse and fine grained evaluations. In the second system, sense resolution is done using an approximate syntactic structure as well as semantics of neighboring nouns as features to a Maximum Entropy learner. Accuracies of 70% and 63% were obtained for coarse and fine grained evaluations.
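A minimal sketch of the first approach (an illustration with made-up counts, not the authors' system): Naive Bayes over the words in a context window around the ambiguous word, with add-one smoothing.

    import math

    def naive_bayes_sense(context_words, sense_priors, word_counts, vocab_size):
        """Pick the sense maximizing log P(sense) + sum log P(word | sense)."""
        best_sense, best_score = None, float("-inf")
        for sense, prior in sense_priors.items():
            counts = word_counts[sense]
            total = sum(counts.values())
            score = math.log(prior)
            for w in context_words:
                score += math.log((counts.get(w, 0) + 1) / (total + vocab_size))
            if score > best_score:
                best_sense, best_score = sense, score
        return best_sense

    # Toy usage for "bank": counts gathered from (hypothetical) sense-tagged training data.
    priors = {"bank/finance": 0.6, "bank/river": 0.4}
    counts = {"bank/finance": {"money": 10, "loan": 7}, "bank/river": {"water": 8, "shore": 5}}
    print(naive_bayes_sense(["loan", "money", "interest"], priors, counts, vocab_size=1000))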
DUC 2005: Evaluation Of Question-Focused Summarization Systems
The Document Understanding Conference (DUC) 2005 evaluation had a single user-oriented, question-focused summarization task, which was to synthesize from a set of 25-50 documents a well-organized, fluent answer to a complex question. The evaluation shows that the best summarization systems have difficulty extracting relevant sentences in response to complex questions (as opposed to representative sentences that might be appropriate to a generic summary). The relatively generous allowance of 250 words for each answer also reveals how difficult it is for current summarization systems to produce fluent text from multiple documents.
Overview Of MUC-7
The tasks performed by the systems participating in the seventh Message Understanding Conference and the Second Multilingual Entity Task are described here in general terms with examples.
Textual Entailment Through Extended Lexical Overlap and Lexico-Semantic Matching
This paper presents two systems for textual entailment, both employing decision trees as a supervised learning algorithm. The first one is based primarily on the concept of lexical overlap, considering a bag of words similarity overlap measure to form a mapping of terms in the hypothesis to the source text. The second system is a lexico-semantic matching between the text and the hypothesis that attempts an alignment between chunks in the hypothesis and chunks in the text, and a representation of the text and hypothesis as two dependency graphs. Their performances are compared and their positive and negative aspects are analyzed.
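A minimal sketch of the first system's core measure (an illustration, not the submitted system): the proportion of hypothesis tokens that also occur in the text, which can then serve as a feature for an entailment decision.

    def lexical_overlap(text, hypothesis):
        """Fraction of hypothesis tokens covered by the text (bag-of-words overlap)."""
        text_bag = set(text.lower().split())
        hyp_tokens = hypothesis.lower().split()
        covered = sum(1 for tok in hyp_tokens if tok in text_bag)
        return covered / max(len(hyp_tokens), 1)

    # Toy usage: a high overlap would push a decision tree toward "entailed".
    print(lexical_overlap("The cat sat on the mat in the kitchen", "A cat sat on the mat"))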
ILR-Based MT Comprehension Test with Multi-Level Questions
We present results from a new Interagency Language Roundtable (ILR) based comprehension test. This new test design presents questions at multiple ILR difficulty levels within each document. We incorporated Arabic machine translation (MT) output from three independent research sites, arbitrarily merging these materials into one MT condition. We contrast the MT condition, for both text and audio data types, with high quality human reference Gold Standard (GS) translations. Overall, subjects achieved 95% comprehension for GS and 74% for MT, across 4 genres and 3 difficulty levels. Surprisingly, comprehension rates do not correlate highly with translation error rates, suggesting that we are measuring an additional dimension of MT quality. We observed that it takes 15% more time overall to read MT than GS.
Event Matching Using the Transitive Closure of Dependency Relations
This paper describes a novel event-matching strategy using features obtained from the transitive closure of dependency relations. The method yields a model capable of matching events with an F-measure of 66.5%.
Economical Global Access to a VoiceXML Gateway Using Open Source Technologies
Voice over IP and open source technologies are becoming popular choices for organizations. However, when accessing VoiceXML gateways, these systems fail to serve global users economically. The objective of this paper is to demonstrate how an existing web application can be modified using VoiceXML to enable non-visual access from any phone. Moreover, we describe a way of linking an existing PSTN-based phone line to a VoiceXML gateway even when the voice service provider (VSP) does not provide a local geographical number for global customers to access the application. In addition, we introduce an economical way for small businesses to overcome the high cost of setting up and using a commercial VoiceXML gateway. The method is based on the Asterisk server. To elucidate the entire process, we present a sample Package Tracking System application, which is based on an existing website and provides the same functionality as the website does. We also present an online demonstration, which provides global access to commercial voice platforms (i.e. Voxeo, Tellme Studio, Bevocal and DemandVoice). This paper also discusses various scenarios in which spoken interaction can play a significant role.
Probabilistic Disambiguation Models For Wide-Coverage HPSG Parsing
This paper reports the development of log-linear models for disambiguation in wide-coverage HPSG parsing. The estimation of log-linear models requires high computational cost, especially with wide-coverage grammars. Using techniques to reduce the estimation cost, we trained the models using 20 sections of the Penn Treebank. A series of experiments empirically evaluated the estimation techniques, and also examined the performance of the disambiguation models on the parsing of real-world sentences.
Semi-Supervised Maximum Entropy Based Approach To Acronym And Abbreviation Normalization In Medical Texts
Text normalization is an important aspect of successful information retrieval from medical documents such as clinical notes, radiology reports and discharge summaries. In the medical domain, a significant part of the general problem of text normalization is abbreviation and acronym disambiguation. Numerous abbreviations are used routinely throughout such texts and knowing their meaning is critical to data retrieval from the document. In this paper I will demonstrate a method of automatically generating training data for Maximum Entropy (ME) modeling of abbreviations and acronyms and will show that using ME modeling is a promising technique for abbreviation and acronym normalization. I report on the results of an experiment involving training a number of ME models used to normalize abbreviations and acronyms on a sample of 10,000 rheumatology notes with ~89% accuracy.
From Cogram To Alcogram: Toward A Controlled English Grammar Checker
In this paper we describe the roots of Controlled English (CE), the analysis of several existing CE grammars, the development of a well-founded 150-rule CE grammar (COGRAM), the elaboration of an algorithmic variant (ALCOGRAM) as a basis for NLP applications, the use of ALCOGRAM in a CAI program teaching writers how to use it effectively, and the preparatory study into a Controlled English grammar and style checker within a desktop publishing (DTP) environment.
Sparse Multi-Scale Grammars for Discriminative Latent Variable Parsing
We present a discriminative, latent variable approach to syntactic parsing in which rules exist at multiple scales of refinement. The model is formally a latent variable CRF grammar over trees, learned by iteratively splitting grammar productions (not categories). Different regions of the grammar are refined to different degrees, yielding grammars which are three orders of magnitude smaller than the single-scale baseline and 20 times smaller than the split-and-merge grammars of Petrov et al. (2006). In addition, our discriminative approach integrally admits features beyond local tree configurations. We present a multi-scale training method along with an efficient CKY-style dynamic program. On a variety of domains and languages, this method produces the best published parsing accuracies with the smallest reported grammars.
Chinese Syntactic Parsing Based On Extended GLR Parsing Algorithm With PCFG
This paper presents an extended GLR parsing algorithm with the grammar PCFG*, which is based on Tomita's GLR parsing algorithm and extends it further. We also define the new grammar PCFG*, which is based on PCFG and assigns not only a probability but also a frequency to each rule. Our syntactic parsing system thus combines a rule-based approach with a statistical approach. Our experiments cover two tasks: Chinese base noun phrase identification and full syntactic parsing, and the results are compared in three ways. The experiments show that the extended GLR parsing algorithm with PCFG* is an efficient parsing method and a straightforward way to combine statistical properties with rules. The experimental results for both tasks are presented in this paper.
Dialog Control In A Natural Language System
In this paper a method for controlling the dialog in a natural language (NL) system is presented. It provides a deep modeling of information processing based on time dependent propositional attitudes of the interacting agents. Knowledge about the state of the dialog is represented in a dedicated language and changes of this state are described by a compact set of rules. An appropriate organization of rule application is introduced including the initiation of an adequate system reaction. Finally the application of the method in an NL consultation system is outlined.
Designing and Evaluating a Russian Tagset
This paper reports the principles behind designing a tagset to cover Russian morphosyntactic phenomena, modifications of the core tagset, and its evaluation. The tagset and associated morphosyntactic specifications are based on the MULTEXT-East framework, while the decisions in designing it were aimed at achieving a balance between parameters important for linguists and the possibility to detect and disambiguate them automatically. The final tagset contains about 600 tags and achieves about 95% accuracy on the disambiguated portion of the Russian National Corpus. We have also produced a test set of tagging models and corpora that can be shared with other researchers.
Effect Of Cross-Language IR In Bilingual Lexicon Acquisition From Comparable Corpora
Within the framework of translation knowledge acquisition from WWW news sites, this paper studies issues on the effect of cross-language retrieval of relevant texts in bilingual lexicon acquisition from comparable corpora. We experimentally show that it is quite effective to reduce the candidate bilingual term pairs against which bilingual term correspondences are estimated, in terms of both computational complexity and the performance of precise estimation of bilingual term correspondences.
Creating A Test Collection For Citation- Based IR Experiments
We present an approach to building a test collection of research papers. The approach is based on the Cranfield 2 tests but uses as its vehicle a current conference; research questions and relevance judgements of all cited papers are elicited from conference authors. The resultant test collection is different from TREC's in that it comprises scientific articles rather than newspaper text and, thus, allows for IR experiments that include citation information. The test collection currently consists of 170 queries with relevance judgements; the document collection is the ACL Anthology. We describe properties of our queries and relevance judgements, and demonstrate the use of the test collection in an experimental setup. One potentially problematic property of our collection is that queries have a low number of relevant documents; we discuss ways of alleviating this.
On Reasoning With Ambiguities
The paper addresses the problem of reasoning with ambiguities. Semantic representations are presented that leave scope relations between quantifiers and/or other operators unspecified. Truth conditions are provided for these representations, and different consequence relations are judged on the basis of intuitive correctness. Finally, inference patterns are presented that operate directly on these underspecified structures, i.e. do not rely on any translation into the set of their disambiguations.
Information based Intonation Synthesis
This paper presents a model for generating prosodically appropriate synthesized responses to database queries using Combinatory Categorial Grammar (CCG - cf. [22]), a formalism which easily integrates the notions of syntactic constituency, prosodic phrasing and information structure. The model determines accent locations within phrases on the basis of contrastive sets derived from the discourse structure and a domain-independent knowledge base.
Coreference Resolution Strategies From An Application Perspective
As part of our TIPSTER III research program, we have continued our research into strategies to resolve coreferences within a free text document; this research was begun during our TIPSTER II research program. In the TIPSTER II Proceedings paper, "An Evaluation of Coreference Resolution Strategies for Acquiring Associated Information", the goal was to evaluate the contributions of various techniques for associating an entity with three types of information: 1) name variations, 2) descriptive phrases, and 3) location information. This paper discusses the evolution of the coreference resolution techniques of the NLToolset, as they have been applied to an information extraction application similar to the MUC Scenario Template task. Development of this application motivated new coreference resolution algorithms which were specific to the type of entity being handled. It has also raised the importance of understanding the structure of a document in order to guide the coreference resolution process. In the remainder of this paper, Section 2 discusses entity-related coreference resolution techniques and Section 3 the relevance of document zoning. Section 4 concludes with a discussion of future work, which will include location merging, event coreference resolution, and event merging.
How To Obey The 7 Commandments For Spoken Dialogue?
We describe the design and implementation of the dialogue management module in a voice operated car-driver information system. The literature on designing 'good' user interfaces involving natural language dialogue in general and speech in particular is abundant with useful guidelines for actual development. We have tried to summarize these guidelines in 7 'meta-guidelines', or commandments. Even though state-of-the-art Speech Recognition modules perform well, speech recognition errors cannot be precluded. For the current application, the fact that the car is an acoustically hostile environment is an extra complication. This means that special attention should be paid to effective methods to compensate for speech recognition errors. Moreover, this should be done in a way which is not disturbing for the driver. In this paper, we show how these constraints influence the design and subsequent implementation of the Dialogue Manager module, and how the additional requirements fit in with the 7 commandments.
HMM Word And Phrase Alignment For Statistical Machine Translation
HMM-based models are developed for the alignment of words and phrases in bitext. The models are formulated so that alignment and parameter estimation can be performed efficiently. We find that Chinese-English word alignment performance is comparable to that of IBM Model-4 even over large training bitexts. Phrase pairs extracted from word alignments generated under the model can also be used for phrase-based translation, and in Chinese to English and Arabic to English translation, performance is comparable to systems based on Model-4 alignments. Direct phrase pair induction under the model is described and shown to improve translation performance.
Query Translation In Chinese-English Cross-Language Information Retrieval
This paper proposes a new query translation method based on the mutual information matrices of terms in the Chinese and English corpora. Instead of looking up a bilingual phrase dictionary, a compositional phrase (one whose translation can be derived from the translations of its components) in the query can be indirectly translated via a general-purpose Chinese-English dictionary look-up procedure. A novel selection method for translations of query terms is also presented in detail. Our query translation method ultimately constructs an English query in which each query term has a weight. The evaluation results show that the retrieval performance achieved by our query translation method is about 73% of that of monolingual information retrieval and about 28% higher than that of simple word-by-word translation.
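A rough sketch of the kind of co-occurrence statistic involved (an illustration, not the paper's exact formulation): pointwise mutual information between terms, estimated from document-level co-occurrence counts, from which per-language term-term matrices can be built.

    import math

    def pmi(term_a, term_b, doc_sets, num_docs):
        """Pointwise mutual information of two terms from document co-occurrence counts."""
        n_a = len(doc_sets.get(term_a, set()))
        n_b = len(doc_sets.get(term_b, set()))
        n_ab = len(doc_sets.get(term_a, set()) & doc_sets.get(term_b, set()))
        if n_a == 0 or n_b == 0 or n_ab == 0:
            return float("-inf")
        return math.log((n_ab * num_docs) / (n_a * n_b))

    # Toy usage: doc_sets maps a term to the set of document ids containing it.
    doc_sets = {"computer": {1, 2, 5}, "network": {2, 5, 7}, "weather": {3}}
    print(pmi("computer", "network", doc_sets, num_docs=10))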
Analyzing The Reading Comprehension Task
In this paper we describe a method for analyzing the reading comprehension task. First, we describe a method of classifying facts (information) into categories or levels, where each level signifies a different degree of difficulty of extracting a fact from a piece of text containing it. We then proceed to show how one can use this model to analyze the complexity of the reading comprehension task. Finally, we analyze five different reading comprehension tasks and present results from this analysis.
Automatic Text Categorization In Terms Of Genre And Author
The two main factors that characterize a text are its content and its style, and both can be used as a means of categorization. In this paper we present an approach to text categorization in terms of genre and author for Modern Greek. In contrast to previous stylometric approaches, we attempt to take full advantage of existing natural language processing (NLP) tools. To this end, we propose a set of style markers including analysis-level measures that represent the way in which the input text has been analyzed and capture useful stylistic information without additional cost. We present a set of small-scale but reasonable experiments in text genre detection, author identification, and author verification tasks and show that the proposed method performs better than the most popular distributional lexical measures, i.e., functions of vocabulary richness and frequencies of occurrence of the most frequent words. All the presented experiments are based on unrestricted text downloaded from the World Wide Web without any manual text preprocessing or text sampling. Various performance issues regarding the training set size and the significance of the proposed style markers are discussed. Our system can be used in any application that requires fast and easily adaptable text categorization in terms of stylistically homogeneous categories. Moreover, the procedure of defining analysis-level markers can be followed in order to extract useful stylistic information using existing text processing tools.
Identifying Cross-Document Relations between Sentences
A pair of sentences in different newspaper articles on an event can have one of several relations. Of these, we have focused on two, i.e., equivalence and transition. Equivalence is the relation between two sentences that have the same information on an event. Transition is the relation between two sentences that have the same information except for values of numeric attributes. We propose methods of identifying these relations. We first split a dataset consisting of pairs of sentences into clusters according to their similarities, and then construct a classifier for each cluster to identify equivalence relations. We also adopt a "coarse-to-fine" approach. We further propose using the identified equivalence relations to address the task of identifying transition relations.
Automatic Acquisition Of Bilingual Rules For Extraction Of Bilingual Word Pairs From Parallel Corpora
In this paper, we propose a new learning method to solve the sparse data problem in automatic extraction of bilingual word pairs from parallel corpora in various languages. Our learning method automatically acquires rules that are effective against the sparse data problem, using only parallel corpora and no other bilingual resource (e.g., a bilingual dictionary or machine translation system). We call this method Inductive Chain Learning (ICL). ICL can limit the search scope for the decision of equivalents. Using ICL, the recall of three systems based on similarity measures improved by 8.0, 6.1, and 6.0 percentage points, respectively. In addition, the recall of GIZA++ improved by 6.6 percentage points using ICL.
Detecting Parser Errors Using Web-based Semantic Filters
NLP systems for tasks such as question answering and information extraction typically rely on statistical parsers. But the efficacy of such parsers can be surprisingly low, particularly for sentences drawn from heterogeneous corpora such as the Web. We have observed that incorrect parses often result in wildly implausible semantic interpretations of sentences, which can be detected automatically using semantic information obtained from the Web. Based on this observation, we introduce Web-based semantic filtering — a novel, domain-independent method for automatically detecting and discarding incorrect parses. We measure the effectiveness of our filtering system, called WOODWARD, on two test collections. On a set of TREC questions, it reduces error by 67%. On a set of more complex Penn Treebank sentences, the reduction in error rate was 20%.
Semantic Composition with (Robust) Minimal Recursion Semantics
We discuss semantic composition in Minimal Recursion Semantics (MRS) and Robust Minimal Recursion Semantics (RMRS). We demonstrate that a previously defined formal algebra applies to grammar engineering across a much greater range of frameworks than was originally envisaged. We show how this algebra can be adapted to composition in grammar frameworks where a lexicon is not assumed, and how this underlies a practical implementation of semantic construction for the RASP system.
Exploiting Wikipedia as External Knowledge for Named Entity Recognition
We explore the use of Wikipedia as external knowledge to improve named entity recognition (NER). Our method retrieves the corresponding Wikipedia entry for each candidate word sequence and extracts a category label from the first sentence of the entry, which can be thought of as a definition part. These category labels are used as features in a CRF-based NE tagger. We demonstrate using the CoNLL 2003 dataset that the Wikipedia category labels extracted by such a simple method actually improve the accuracy of NER.
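A minimal sketch of the feature extraction step (our illustration; the entry lookup and candidate handling in the paper are more involved): take the first sentence of the retrieved Wikipedia entry and use the head word after "is a/an" as a category label feature.

    import re

    def category_label(first_sentence):
        """Heuristically pull a category label from a definition-style first sentence."""
        match = re.search(r"\bis (?:an?|the) ((?:\w+ ){0,2}\w+)", first_sentence.lower())
        return match.group(1).split()[-1] if match else None

    # Toy usage with a (hypothetical) retrieved first sentence of an entry.
    sent = "Franz Fischler is an Austrian politician."
    print(category_label(sent))  # -> "politician", usable as a CRF feature like WIKI_CAT=politician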
HunPos - An Open Source Trigram Tagger
In the world of non-proprietary NLP software the standard, and perhaps the best, HMM-based POS tagger is TnT (Brants, 2000). We argue here that some of the criticism aimed at HMM performance on languages with rich morphology should more properly be directed at TnT's peculiar license, free but not open source, since it is those details of the implementation which are hidden from the user that hold the key for improved POS tagging across a wider variety of languages. We present HunPos, a free and open source (LGPL-licensed) alternative, which can be tuned by the user to fully utilize the potential of HMM architectures, offering performance comparable to more complex models, but preserving the ease and speed of the training and tagging process.
Syntactic Reordering Integrated with Phrase-based SMT
We present a novel approach to word reordering which successfully integrates syntactic structural knowledge with phrase-based SMT. This is done by constructing a lattice of alternatives based on automatically learned probabilistic syntactic rules. In decoding, the alternatives are scored based on the output word order, not the order of the input. Unlike previous approaches, this makes it possible to successfully integrate syntactic reordering with phrase-based SMT. On an English-Danish task, we achieve an absolute improvement in translation quality of 1.1% BLEU. Manual evaluation supports the claim that the present approach is significantly superior to previous approaches.
A Syllable based Word Recognition Model For Korean Noun Extraction
Noun extraction is very important for many NLP applications such as information retrieval, automatic text classification, and information extraction. Most of the previous Korean noun extraction systems use a morphological analyzer or a Part-of-Speech (POS) tagger. Therefore, they require a great deal of linguistic knowledge, such as morpheme dictionaries and rules (e.g. morphosyntactic rules and morphological rules). This paper proposes a new noun extraction method that uses a syllable-based word recognition model. It finds the most probable syllable-tag sequence of the input sentence by using statistical information automatically acquired from a POS-tagged corpus and extracts nouns by detecting word boundaries. Furthermore, it does not require any labor for constructing and maintaining linguistic knowledge. We have performed various experiments with a wide range of variables influencing the performance. The experimental results show that, without morphological analysis or POS tagging, the proposed method achieves performance comparable to the previous methods.
Word Alignment For Languages With Scarce Resources Using Bilingual Corpora Of Other Language Pairs
This paper proposes an approach to improve word alignment for languages with scarce resources using bilingual corpora of other language pairs. To perform word alignment between languages L1 and L2, we introduce a third language L3. Although only small amounts of bilingual data are available for the desired language pair L1-L2, large-scale bilingual corpora in L1-L3 and L2-L3 are available. Based on these two additional corpora and with L3 as the pivot language, we build a word alignment model for L1 and L2. This approach can build a word alignment model for two languages even if no bilingual corpus is available in this language pair. In addition, we build another word alignment model for L1 and L2 using the small L1-L2 bilingual corpus. Then we interpolate the above two models to further improve word alignment between L1 and L2. Experimental results indicate a relative error rate reduction of 21.30% as compared with the method only using the small bilingual corpus in L1 and L2.
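The interpolation step can be pictured as a linear combination of two lexical translation tables, one induced through the pivot language and one estimated from the small direct corpus. The sketch below is a minimal illustration under that assumption; the tables, the interpolation weight, and the word pair are invented.

```python
def interpolate(p_pivot, p_direct, lam=0.6):
    """Linearly interpolate two lexical translation tables p(target | source):
    p = lam * p_pivot + (1 - lam) * p_direct. Each table maps a source word
    to a dict of target-word probabilities."""
    combined = {}
    for src in set(p_pivot) | set(p_direct):
        targets = set(p_pivot.get(src, {})) | set(p_direct.get(src, {}))
        combined[src] = {
            tgt: lam * p_pivot.get(src, {}).get(tgt, 0.0)
               + (1 - lam) * p_direct.get(src, {}).get(tgt, 0.0)
            for tgt in targets
        }
    return combined

# Toy tables: the pivot-induced model is broad but noisy, the direct
# model (from the small L1-L2 corpus) is sparse but reliable.
p_pivot  = {"maison": {"house": 0.7, "home": 0.2, "building": 0.1}}
p_direct = {"maison": {"house": 0.9, "home": 0.1}}
print(interpolate(p_pivot, p_direct)["maison"])
```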
Presuppositions As Beliefs
Most theories of presupposition implicitly assume that presuppositions are facts, and that all agents involved in a discourse share belief in the presuppositions that it generates. These unrealistic assumptions can be eliminated if each presupposition is treated as the belief of an agent. However, it is not enough to consider only the beliefs of the speaker; we show that the beliefs of other agents are often involved. We describe a new model, including an improved definition of presupposition, that treats presuppositions as beliefs and considers the beliefs of all agents involved in the discourse. We show that treating presuppositions as beliefs makes it possible to explain phenomena that cannot be explained otherwise.
Extracting Nested Collocations
This paper provides an approach to the semi-automatic extraction of collocations from corpora using statistics. The growing availability of large textual corpora, and the increasing number of applications of collocation extraction, have given rise to various approaches to the topic. In this paper, we address the problem of nested collocations, that is, those that form part of longer collocations. Most approaches to date have treated substrings of collocations as collocations only if they appeared frequently enough by themselves in the corpus. These techniques left many collocations unextracted. In this paper, we propose an algorithm for the semi-automatic extraction of nested uninterrupted and interrupted collocations, paying particular attention to nested collocations.
The Role Of Initiative In Tutorial Dialogue
This work is the first systematic investigation of initiative in human-human tutorial dialogue. We studied initiative management in two dialogue strategies: didactic tutoring and Socratic tutoring. We hypothesized that didactic tutoring would involve mostly tutor initiative while Socratic tutoring would be mixed-initiative, and that more student initiative would lead to more learning (i.e., task success for the tutor). Surprisingly, students had initiative more of the time in the didactic dialogues (21% of the turns) than in the Socratic dialogues (10% of the turns), and there was no direct relationship between student initiative and learning. However, Socratic dialogues were more interactive than didactic dialogues, as measured by the percentage of tutor utterances that were questions and the percentage of words in the dialogue uttered by the student, and interactivity had a positive correlation with learning.
Linguistically Informed Statistical Models Of Constituent Structure For Ordering In Sentence Realization
We present several statistical models of syntactic constituent order for sentence realization. We compare several models, including simple joint models inspired by existing statistical parsing models, and several novel conditional models. The conditional models leverage a large set of linguistic features without manual feature selection. We apply and evaluate the models in sentence realization for French and German and find that a particular conditional model outperforms all others. We employ a version of that model in an evaluation on unordered trees from the Penn TreeBank. We offer this result on standard data as a reference-point for evaluations of ordering in sentence realization.
Linguistic Resources for Reconstructing Spontaneous Speech Text
The output of a speech recognition system is not always ideal for subsequent downstream processing, in part because speakers themselves often make mistakes. A system would accomplish speech reconstruction of its spontaneous speech input if its output were to represent, in flawless, fluent, and content-preserving English, the message that the speaker intended to convey. These cleaner speech transcripts would allow for more accurate language processing as needed for NLP tasks such as machine translation and conversation summarization, which often rely on grammatical input. Recognizing that supervised statistical methods to identify and transform ill-formed areas of the transcript will require richly labeled resources, we have built the Spontaneous Speech Reconstruction corpus. This small corpus of reconstructed and aligned conversational telephone speech transcriptions for the Fisher conversational telephone speech corpus (Strassel and Walker, 2004) was annotated on several levels including string transformations and predicate-argument structure, and will be shared with the linguistic research community.
RUNDKAST: an Annotated Norwegian Broadcast News Speech Corpus
This paper describes the Norwegian broadcast news speech corpus RUNDKAST. The corpus contains recordings of approximately 77 hours of broadcast news shows from the Norwegian broadcasting company NRK. The corpus covers both read and spontaneous speech as well as spontaneous dialogues and multipart discussions, including frequent occurrences of non-speech material (e.g. music, jingles). The recordings have large variations in speaking styles, dialect use and recording/transmission quality. RUNDKAST has been annotated for research in speech technology. The entire corpus has been manually segmented and transcribed using hierarchical levels. A subset of one hour of read and spontaneous speech from 10 different speakers has been manually annotated using broad phonetic labels. We provide a description of the database content, the annotation tools and strategies, and the conventions used for the different levels of annotation. A corpus of this kind has up to this point not been available for Norwegian, but is considered a necessary part of the infrastructure for language technology research in Norway. The RUNDKAST corpus is planned to be included in a future national Norwegian language resource bank.
Frequency Estimates For Statistical Word Similarity Measures
Statistical measures of word similarity have application in many areas of natural language processing, such as language modeling and information retrieval. We report a comparative study of two methods for estimating word cooccurrence frequencies required by word similarity measures. Our frequency estimates are generated from a terabyte-sized corpus of Web data, and we study the impact of corpus size on the effectiveness of the measures. We base the evaluation on one TOEFL question set and two practice question sets, each consisting of a number of multiple choice questions seeking the best synonym for a given target word. For two question sets, a context for the target word is provided, and we examine a number of word similarity measures that exploit this context. Our best combination of similarity measure and frequency estimation method answers 6-8% more questions than the best results previously reported for the same question sets.
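One representative cooccurrence-based measure is pointwise mutual information; the sketch below shows, under invented counts standing in for Web-scale frequency estimates, how such a measure would pick the best synonym in a multiple-choice question. It is only an illustration of the general setup, not the paper's specific measures or estimation methods.

```python
import math

def pmi(c_xy, c_x, c_y, n):
    """Pointwise mutual information from co-occurrence counts:
    log2( P(x,y) / (P(x) P(y)) ), with a small floor to avoid log(0)."""
    p_xy = max(c_xy, 0.5) / n
    return math.log2(p_xy / ((c_x / n) * (c_y / n)))

def best_synonym(target, choices, cooc, unigram, n):
    """Pick the choice with the highest PMI with the target word."""
    return max(choices,
               key=lambda w: pmi(cooc.get((target, w), 0),
                                 unigram[target], unigram[w], n))

# Invented counts standing in for Web-derived frequency estimates.
n = 10_000_000
unigram = {"enormously": 1200, "appropriately": 3000,
           "uniquely": 900, "tremendously": 800, "decidedly": 700}
cooc = {("enormously", "tremendously"): 45,
        ("enormously", "appropriately"): 3,
        ("enormously", "uniquely"): 2,
        ("enormously", "decidedly"): 4}
print(best_synonym("enormously",
                   ["appropriately", "uniquely", "tremendously", "decidedly"],
                   cooc, unigram, n))
```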
TRIPHONE Analysis: A Combined Method For The Correction Of Orthographical And Typographical Errors
Most existing systems for the correction of word level errors are oriented toward either typographical or orthographical errors. Triphone analysis is a new correction strategy which combines phonemic transcription with trigram analysis. It corrects both kinds of errors (also in combination) and is superior for orthographical errors.
An Efficient A* Stack Decoder Algorithm For Continuous Speech Recognition With A Stochastic Language Model
The stack decoder is an attractive algorithm for controlling the acoustic and language model matching in a continuous speech recognizer. A previous paper described a near-optimal admissible Viterbi A* search algorithm for use with non-cross-word acoustic models and no-grammar language models [16]. This paper extends this algorithm to include unigram language models and describes a modified version of the algorithm which includes the full (forward) decoder, cross-word acoustic models and longer-span language models. The resultant algorithm is not admissible, but has been demonstrated to have a low probability of search error and to be very efficient.
A Representation For Complex And Evolving Data Dependencies In Generation
This paper introduces an approach to representing the kinds of information that components in a natural language generation (NLG) system will need to communicate to one another. This information may be partial, may involve more than one level of analysis and may need to include information about the history of a derivation. We present a general representation scheme capable of handling these cases. In addition, we make a proposal for organising intermodule communication in an NLG system by having a central server for this information. We have validated the approach by a reanalysis of an existing NLG system and through a full implementation of a runnable specification.
Example-Based Sense Tagging Of Running Chinese Text
This paper describes a sense tagging technique for the automatic sense tagging of running Chinese text. The system takes as input running Chinese text, and outputs sense disambiguated text. Whereas previous work (Yarowsky, 1992; Gale et al., 1992, 1993) relies heavily on the role of statistics, the present system makes use of Machine Readable/Tractable Dictionaries (Wilks et al., 1990; Guo, in press) and an example-based reasoning technique (Nagao, 1984; Sumita et al., 1990) to treat novel words, compound words, and phrases found in the input text.
Towards An Implementable Dependency Grammar
Syntactic models should be descriptively adequate and parsable. A syntactic description is autonomous in the sense that it has certain explicit formal properties. Such a description relates to the semantic interpretation of the sentences, and to the surface text. As the formalism is implemented in a broad-coverage syntactic parser, we concentrate on issues that must be resolved by any practical system that uses such models. The correspondence between the structure and linear order is discussed.
What's Yours And What's Mine: Determining Intellectual Attribution In Scientific Text
We believe that identifying the structure of scientific argumentation in articles can help in tasks such as automatic summarization or the automated construction of citation indexes. One particularly important aspect of this structure is the question of who a given scientific statement is attributed to: other researchers, the field in general, or the authors themselves. We present the algorithm and a systematic evaluation of a system which can recognize the most salient textual properties that contribute to the global argumentative structure of a text. In this paper we concentrate on two particular features, namely the occurrences of prototypical agents and their actions in scientific text.
Some Considerations On Guidelines For Bilingual Alignment And Terminology Extraction
Despite progress in the development of computational means, human input is still critical in the production of consistent and usable aligned corpora and term banks. This is especially true for specialized corpora and term banks whose end-users are often professionals with very stringent requirements for accuracy, consistency and coverage. In the compilation of a high quality Chinese-English legal glossary for the ELDoS project, we have identified a number of issues that make human input critical for term alignment and extraction. These include the identification of low-frequency terms, paraphrastic expressions, and discontinuous units, and the maintenance of consistent term granularity. Although manual intervention can more satisfactorily address these issues, steps must also be taken to address intra- and inter-annotator inconsistency.
The Reconstruction Engine: A Computer Implementation Of The Comparative Method
We describe the implementation of a computer program, the Reconstruction Engine (RE), which models the comparative method for establishing genetic affiliation among a group of languages. The program is a research tool designed to aid the linguist in evaluating specific hypotheses, by calculating the consequences of a set of postulated sound changes (proposed by the linguist) on complete lexicons of several languages. It divides the lexicons into a phonologically regular part and a part that deviates from the sound laws. RE is bi-directional: given words in modern languages, it can propose cognate sets (with reconstructions); given reconstructions, it can project the modern forms that would result from regular changes. RE operates either interactively, allowing word-by-word evaluation of hypothesized sound changes and semantic shifts, or in a "batch" mode, processing entire multilingual lexicons. We describe the algorithms implemented in RE, specifically the parsing and combinatorial techniques used to make projections upstream or downstream in the sense of time, the procedures for creating and consolidating cognate sets based on these projections, and the ad hoc techniques developed for handling the semantic component of the comparative method. Other programs and computational approaches to historical linguistics are briefly reviewed. Some results from a study of the Tamang languages of Nepal (a subgroup of the Tibeto-Burman family) are presented, and data from these languages are used throughout for exemplification of the operation of the program. Finally, we discuss features of RE that make it possible to handle the complex and sometimes imprecise representations of lexical items, and speculate on possible directions for future research.
Using Chunk Based Partial Parsing of Spontaneous Speech in Unrestricted Domains for Reducing Word Error Rate in Speech Recognition
In this paper, we present a chunk-based partial parsing system for spontaneous, conversational speech in unrestricted domains. We show that the chunk parses produced by this parsing system can be usefully applied to the task of reranking N-best lists from a speech recognizer, using a combination of chunk-based n-gram model scores and chunk coverage scores. The input for the system is N-best lists generated from speech recognizer lattices. The hypotheses from the N-best lists are tagged for part of speech, "cleaned up" by a preprocessing pipeline, parsed by a part-of-speech-based chunk parser, and rescored using a backpropagation neural net trained on the chunk-based scores. Finally, the reranked N-best lists are generated. The results of a system evaluation are promising in that a chunk accuracy of 87.4% is achieved and the best performance on a randomly selected test set is a decrease in word error rate of 0.3 percent (absolute), measured on the new first hypotheses in the reranked N-best lists.
Studying Feature Generation From Various Data Representations For Answer Extraction
In this paper, we study how to generate features from various data representations, such as surface texts and parse trees, for answer extraction. Besides the features generated from the surface texts, we mainly discuss feature generation from the parse trees. We propose and compare three methods, namely feature vectors, string kernels and tree kernels, to represent the syntactic features in Support Vector Machines. The experiment on the TREC question answering task shows that the features generated from the more structured data representations significantly improve the performance over that based on the features generated from the surface texts. Furthermore, the contribution of individual features is discussed in detail.
Exploring Semantic Constraints For Document Retrieval
In this paper, we explore the use of structured content as semantic constraints for enhancing the performance of traditional term-based document retrieval in special domains. First, we describe a method for automatic extraction of semantic content in the form of attribute-value (AV) pairs from natural language texts based on domain models constructed from a semi-structured web resource. Then, we explore the effect of combining a state-of-the-art term-based IR system and a simple constraint-based search system that uses the extracted AV pairs. Our evaluation results have shown that such a combination produces some improvement in IR performance over the term-based IR system on our test collection.
Correlations in the Organization of Large-Scale Syntactic Dependency Networks
We study the correlations in the connectivity patterns of large scale syntactic dependency networks. These networks are induced from treebanks: their vertices denote word forms which occur as nuclei of dependency trees. Their edges connect pairs of vertices if at least two instance nuclei of these vertices are linked in the dependency structure of a sentence. We examine the syntactic dependency networks of seven languages. In all these cases, we consistently obtain three findings. Firstly, clustering, i.e., the probability that two vertices which are linked to a common vertex are linked on their part, is much higher than expected by chance. Secondly, the mean clustering of vertices decreases with their degree - this finding suggests the presence of a hierarchical network organization. Thirdly, the mean degree of the nearest neighbors of a vertex x tends to decrease as the degree of x grows - this finding indicates disassortative mixing in the sense that links tend to connect vertices of dissimilar degrees. Our results indicate the existence of common patterns in the large scale organization of syntactic dependency networks.
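The three network statistics described above can be computed directly from a graph. The sketch below uses networkx on an invented toy network; in the study the vertices would be word forms and the edges would come from treebank dependency structures.

```python
import networkx as nx

# Toy stand-in for a syntactic dependency network: vertices are word
# forms, edges link forms whose instances were linked in dependency trees.
G = nx.Graph([("the", "dog"), ("the", "cat"), ("dog", "barks"),
              ("cat", "sleeps"), ("dog", "sleeps"), ("big", "dog"),
              ("big", "cat"), ("barks", "loudly")])

# Mean clustering coefficient (to be compared against a random baseline).
print("mean clustering:", nx.average_clustering(G))

# Clustering as a function of degree: hierarchical organization shows up
# as clustering decreasing with degree.
clustering = nx.clustering(G)
for v in G:
    print(v, G.degree(v), round(clustering[v], 3))

# Mean degree of a vertex's nearest neighbours: a decrease with the
# vertex's own degree indicates disassortative mixing.
print(nx.average_neighbor_degree(G))
```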
Aiduti in Japanese Multi-Party Design Conversations
Japanese backchannel utterances, aizuti, in a multi-party design conversation were examined, and aizuti functions were analyzed in comparison with its functions in two-party dialogues. In addition to the two major functions, signaling acknowledgment and turn-management, it was argued that aizuti in multi-party conversations are involved in joint construction of design plans through management of the floor structure, and display of participants' readiness to engage in collaborative elaboration of jointly constructed proposals.
Mildly Context-Sensitive Dependency Languages
Dependency-based representations of natural language syntax require a fine balance between structural flexibility and computational complexity. In previous work, several constraints have been proposed to identify classes of dependency structures that are well-balanced in this sense; the best-known but also most restrictive of these is projectivity. Most constraints are formulated on fully specified structures, which makes them hard to integrate into models where structures are composed from lexical information. In this paper, we show how two empirically relevant relaxations of projectivity can be lexicalized, and how combining the resulting lexicons with a regular means of syntactic composition gives rise to a hierarchy of mildly context-sensitive dependency languages.
Segmentation for English-to-Arabic Statistical Machine Translation
In this paper, we report on a set of initial results for English-to-Arabic Statistical Machine Translation (SMT). We show that morphological decomposition of the Arabic source is beneficial, especially for smaller-size corpora, and investigate different recombination techniques. We also report on the use of Factored Translation Models for English-to-Arabic translation.
Discriminative vs. Generative Approaches in Semantic Role Labeling
This paper describes the two algorithms we developed for the CoNLL 2008 Shared Task "Joint learning of syntactic and semantic dependencies". Both algorithms start by parsing the sentence using the same syntactic parser. The first algorithm uses machine learning methods to identify the semantic dependencies in four stages: identification and labeling of predicates, and identification and labeling of arguments. The second algorithm uses a generative probabilistic model, choosing the semantic dependencies that maximize the probability with respect to the model. A hybrid algorithm combining the best stages of the two algorithms attains 86.62% labeled syntactic attachment accuracy, 73.24% labeled semantic dependency F1 and 79.93% labeled macro F1 score for the combined WSJ and Brown test sets.
Modelling The Substitutability Of Discourse Connectives
Processing discourse connectives is important for tasks such as discourse parsing and generation. For these tasks, it is useful to know which connectives can signal the same coherence relations. This paper presents experiments in modelling the substitutability of discourse connectives. It shows that substitutability affects distributional similarity. A novel variance-based function for comparing probability distributions is found to assist in predicting substitutability.
An Approach To Non-Singular Terms In Discourse
A new Theory of Names and Descriptions that offers a uniform treatment for many types of non-singular concepts found in natural language discourse is presented. We introduce a layered model of the language denotational base (the universe) in which every world object is assigned a layer (level) reflecting its relative singularity with respect to other objects in the universe. We define the notion of relative singularity of world objects as an abstraction class of the layer-membership relation.
Syntactic Constraints On Relativization In Japanese
This paper discusses the formalization of relative clauses in Japanese based on the JPSG framework. We characterize them as adjuncts to nouns, and formalize them in terms of constraints among grammatical features. Furthermore, we claim that there is a constraint on the number of slash elements and present supporting facts.
A Corpus-Based Approach To Deriving Lexical Mappings
This paper proposes a novel, corpus-based method for producing mappings between lexical resources. Results from a preliminary experiment using part of speech tags suggest that this is a promising area for future research.
Combining Hierarchical Clustering And Machine Learning To Predict High-Level Discourse Structure
We propose a novel method to predict the inter-paragraph discourse structure of text, i.e. to infer which paragraphs are related to each other and form larger segments on a higher level. Our method combines a clustering algorithm with a model of segment "relatedness" acquired in a machine learning step. The model integrates information from a variety of sources, such as word co-occurrence, lexical chains, cue phrases, punctuation, and tense. Our method outperforms an approach that relies on word co-occurrence alone.
Comparatives And Ellipsis
This paper analyses the syntax and semantics of English comparatives, and some types of ellipsis. It improves on other recent analyses in the computational linguistics literature in three respects: (i) it uses no tree- or logical-form rewriting devices in building meaning representations; (ii) this results in a fully reversible linguistic description, equally suited for analysis or generation; and (iii) the analysis extends to types of elliptical comparative not treated elsewhere.
Spock - a Spoken Corpus Client
Spock is an open source tool for the easy deployment of time-aligned corpora. It is fully web-based, and has very limited server-side requirements. It allows the end-user to search the corpus in a text-driven manner, obtaining both the transcription and the corresponding sound fragment in the result page. Spock has an administration environment to help manage the sound files and their respective transcription files, and also provides statistical data about the files at hand. Spock uses a proprietary file format for storing the alignment data, but the integrated admin environment allows you to import files from a number of common file formats. Spock is not intended as a transcriber program: it is not meant as an alternative to programs such as ELAN, Wavesurfer, or Transcriber, but rather to make corpora created with these tools easily available online. For the end user, Spock provides a very easy way of accessing spoken corpora, without the need to install any special software, which might make time-aligned corpora accessible to a large group of users who might otherwise never look at them.
Comparison And Classification Of Dialects
This project measures and classifies language variation. In contrast to earlier dialectology, we seek a comprehensive characterization of (potentially gradual) differences between dialects, rather than a geographic delineation of (discrete) features of individual words or pronunciations. More general characterizations of dialect differences then become available. We measure phonetic (un)relatedness between dialects using Levenshtein distance, and classify by clustering distances but also by analysis through multidimensional scaling.
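A minimal sketch of the phonetic distance computation, assuming plain Levenshtein distance over invented transcriptions; the actual study aggregates such distances over a word list and then applies clustering and multidimensional scaling to the resulting dialect-by-dialect matrix.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two
    phonetic transcriptions (sequences of segments)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

# Invented transcriptions of one word in three dialects.
pron = {"dialect_A": "mɛlək", "dialect_B": "mɪlk", "dialect_C": "mɛlk"}
for a in pron:
    for b in pron:
        if a < b:
            d = levenshtein(pron[a], pron[b])
            # Normalise by the longer transcription; averaging over a word
            # list would yield the dialect-by-dialect distance matrix.
            print(a, b, d / max(len(pron[a]), len(pron[b])))
```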
Learning To Detect Conversation Focus Of Threaded Discussions
In this paper we present a novel feature-enriched approach that learns to detect the conversation focus of threaded discussions by combining NLP analysis and IR techniques. Using the graph-based algorithm HITS, we integrate different features such as lexical similarity, poster trustworthiness, and speech act analysis of human conversations with feature-oriented link generation functions. It is the first quantitative study to analyze human conversation focus in the context of online discussions that takes into account heterogeneous sources of evidence. Experimental results using a threaded discussion corpus from an undergraduate class show that it achieves significant performance improvements compared with the baseline system.
Zero Pronoun Resolution In A Machine Translation System By Using Japanese To English Verbal Semantic Attributes
A method for anaphora resolution of zero pronouns in Japanese-language texts using verbal semantic attributes is proposed. This method focuses attention on the semantic attributes of verbs and examines the context from the relationship between the semantic attributes of verbs governing zero pronouns and the semantic attributes of verbs governing their referents. The semantic attributes of verbs are created using two different viewpoints: dynamic characteristics of verbs and the relationship of verbs to cases. By using this method, it is shown that, in the case of translating newspaper articles, the major portion (93%) of the anaphora resolution of zero pronouns necessary for machine translation can be achieved by using only linguistic knowledge. Factors to be given special attention when incorporating this method into a machine translation system are examined, together with suggested conditions for the detection of zero pronouns and methods for their conversion. This study considers four factors that are important when implementing this method in a Japanese-to-English machine translation system: the difference in conception between Japanese and English expressions, the difference in case frame patterns between Japanese and English, restrictions by voice, and restrictions by translation structure. Implementation of the proposed method with due consideration of these points leads to a viable method for anaphora resolution of zero pronouns in a practical machine translation system.
Automatic Extraction Of Grammars From Annotated Text
The primary objective of this project is to develop a robust, high-performance parser for English by automatically extracting a grammar from an annotated corpus of bracketed sentences, called the Treebank. The project is a collaboration between the IBM Continuous Speech Recognition Group and the University of Pennsylvania Department of Computer Sciences.
The Text REtrieval Conferences (TRECs) - Summary
There have been four Text REtrieval Conferences (TRECs); TREC-1 in November 1992, TREC-2 in August 1993, TREC-3 in November 1994 and TREC-4 in November 1995. The number of participating systems has grown from 25 in TREC-1 to 36 in TREC-4, including most of the major text retrieval software companies and most of the universities doing research in text retrieval (see table for some of the participants). The diversity of the participating groups has ensured that TREC represents many different approaches to text retrieval, while the emphasis on individual experiments evaluated in a common setting has proven to be a major strength of TREC. The test design and test collection used for document detection in TIPSTER was also used in TREC. The participants ran the various tasks, sent results into NIST for evaluation, presented the results at the TREC conferences, and submitted papers for a proceedings. The test collection consists of over 1 million documents from diverse full-text sources, 250 topics, and the set of relevant documents or "right answers" to those topics. A Spanish collection has been built and used during TREC-3 and TREC-4, with a total of 50 topics. TREC-1 required significant system rebuilding by most groups due to the huge increase in the size of the document collection (from a traditional test collection of several megabytes in size to the 2 gigabyte TIPSTER collection). The results from TREC-2 showed significant improvements over the TREC-1 results, and should be viewed as the appropriate baseline representing state-of-the-art retrieval techniques as scaled up to handling a 2 gigabyte collection. TREC-3 therefore provided the first opportunity for more complex experimentation. The major experiments in TREC-3 included the development of automatic query expansion techniques, the use of passages or sub-documents to increase the precision of retrieval results, and the use of the training information to select only the best terms for routing queries. Some groups explored hybrid approaches (such as the use of the Rocchio methodology in systems not using a vector space model), and others tried approaches that were radically different from their original approaches. TREC-4 allowed a continuation of many of these complex experiments. The topics were made much shorter and this change triggered extensive investigations in automatic query expansion. There were also five new tasks called tracks. These were added to help focus research on certain known problem areas, and included such issues as investigating searching as an interactive task by examining the process as well as the outcome, investigating techniques for merging results from the various TREC subcollections, examining the effects of corrupted data, and evaluating routing systems using a specific effectiveness measure. Additionally more groups participated in a track for Spanish retrieval. The TREC conferences have proven to be very successful, allowing broad participation in the overall DARPA TIPSTER effort, and causing widespread use of a very large test collection. All conferences have had very open, honest discussions of technical issues, and there have been large amounts of "cross-fertilization" of ideas. This will be a continuing effort, with a TREC-5 conference scheduled in November of 1996.
Morphological Productivity In The Lexicon
In this paper we outline a lexical organization for Turkish that makes use of lexical rules for inflections, derivations, and lexical category changes to control the proliferation of lexical entries. Lexical rules handle changes in grammatical roles, enforce type constraints, and control the mapping of sub-categorization frames in valency changing operations. A lexical inheritance hierarchy facilitates the enforcement of type constraints. Semantic compositions in inflections and derivations are constrained by the properties of the terms and predicates. The design has been tested as part of an HPSG grammar for Turkish. In terms of performance, run-time execution of the rules seems to be a far better alternative than pre-compilation. The latter causes exponential growth in the lexicon due to intensive use of inflections and derivations in Turkish.
Reconciliation Of Unsupervised Clustering, Segmentation And Cohesion
This extended abstract examines the progress of a project on unsupervised language learning, and focuses on two different approaches to segmentation, as well as how cohesion may be generalized from its definitive morpho-syntactic instantiation. It is intended as a discussion paper, and outlines the specific hypotheses currently being tested.
Detection Of Language (Model) Errors
Bigram language models are popular in many language processing applications, for both Indo-European and Asian languages. However, when the language model for Chinese is applied in a novel domain, the accuracy is reduced significantly, from 96% to 78% in our evaluation. We apply pattern recognition techniques (i.e. Bayesian, decision tree and neural network classifiers) to discover language model errors. We have examined two general types of features: model-based and language-specific features. In our evaluation, the Bayesian classifier produces the best recall (80%), but its precision is low (60%). The neural network classifier produces good recall (75%) and precision (80%), but both the Bayesian and neural network classifiers have a low skip ratio (65%). The decision tree classifier produces the best precision (81%) and skip ratio (76%), but its recall is the lowest (73%).
Why Can't Jose Read? The Problem Of Learning Semantic Associations In A Robot Environment
We study the problem of learning to recognise objects in the context of autonomous agents. We cast object recognition as the process of attaching meaningful concepts to specific regions of an image. In other words, given a set of images and their captions, the goal is to segment the image, in either an intelligent or naive fashion, then to find the proper mapping between words and regions. In this paper, we demonstrate that a model that learns spatial relationships between individual words not only provides accurate annotations, but also allows one to perform recognition that respects the real-time constraints of an autonomous, mobile robot.
Critical Tokenization And Its Properties
Tokenization is the process of mapping sentences from character strings into strings of words. This paper sets out to study critical tokenization, a distinctive type of tokenization following the principle of maximum tokenization. The objective in this paper is to develop its mathematical description and understanding. The main results are as follows: (1) Critical points are all and only unambiguous token boundaries for any character string on a complete dictionary; (2) Any critically tokenized word string is a minimal element in the partially ordered set of all tokenized word strings with respect to the word string cover relation; (3) Any tokenized string can be reproduced from a critically tokenized word string but not vice versa; (4) Critical tokenization forms the sound mathematical foundation for categorizing tokenization ambiguity into critical and hidden types, providing a precise mathematical understanding of conventional concepts like combinational and overlapping ambiguities; (5) Many important maximum tokenization variations, such as forward and backward maximum matching and shortest tokenization, are all true subclasses of critical tokenization. It is believed that critical tokenization provides a precise mathematical description of the principle of maximum tokenization. Important implications and practical applications of critical tokenization in effective ambiguity resolution and in efficient tokenization implementation are also carefully examined.
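Forward maximum matching, mentioned above as one of the maximum-tokenization variants subsumed by critical tokenization, can be sketched as a greedy longest-match scan; the dictionary and input string below are invented.

```python
def forward_maximum_matching(chars, dictionary):
    """Greedy forward maximum matching: at each position take the longest
    dictionary word starting there, falling back to a single character.
    This is one of the maximum-tokenization variants the paper situates
    as a subclass of critical tokenization."""
    max_len = max(len(w) for w in dictionary)
    tokens, i = [], 0
    while i < len(chars):
        for k in range(min(max_len, len(chars) - i), 0, -1):
            candidate = chars[i:i + k]
            if k == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += k
                break
    return tokens

# Toy dictionary and string (any alphabet works; Chinese in the paper).
dictionary = {"发展", "中国", "国家", "发展中国家", "中", "国"}
print(forward_maximum_matching("发展中国家", dictionary))
```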
An Empirical Comparison of Goodness Measures for Unsupervised Chinese Word Segmentation with a Unified Framework
This paper reports our empirical evaluation and comparison of several popular goodness measures for unsupervised segmentation of Chinese texts using Bakeoff-3 data sets with a unified framework. Assuming no prior knowledge about Chinese, this framework relies on a goodness measure to identify word candidates from unlabeled texts and then applies a generalized decoding algorithm to find the optimal segmentation of a sentence into such candidates with the greatest sum of goodness scores. Experiments show that description length gain outperforms other measures because of its strength for identifying short words. Further performance improvement is also reported, achieved by proper candidate pruning and by ensemble segmentation to integrate the strengths of individual measures.
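The generalized decoding step can be illustrated as a dynamic program that picks the segmentation with the greatest sum of goodness scores; the sketch below assumes a precomputed goodness table (invented here, where the paper would use, for example, description length gain estimated from unlabeled text).

```python
def best_segmentation(sentence, goodness, max_word_len=4):
    """Dynamic program: among all segmentations into candidate words,
    return the one maximizing the sum of goodness scores. Multi-character
    strings absent from the goodness table are disallowed; single
    characters fall back to a score of 0."""
    n = len(sentence)
    best = [(0.0, [])] + [(float("-inf"), None)] * n
    for i in range(1, n + 1):
        for k in range(1, min(max_word_len, i) + 1):
            word = sentence[i - k:i]
            score = goodness.get(word, 0.0 if k == 1 else float("-inf"))
            prev_score, prev_seg = best[i - k]
            if prev_score + score > best[i][0]:
                best[i] = (prev_score + score, prev_seg + [word])
    return best[n][1]

# Invented goodness scores standing in for description length gain.
goodness = {"自然": 2.1, "语言": 2.3, "自然语言": 3.9, "处理": 2.0}
print(best_segmentation("自然语言处理", goodness))
```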
Semantic Role Labeling As Sequential Tagging
In this paper we present a semantic role labeling system submitted to the CoNLL-2005 shared task. The system makes use of partial and full syntactic information and converts the task into sequential BIO tagging. As a result, the labeling architecture is very simple. Building on a state-of-the-art set of features, a binary classifier for each label is trained using AdaBoost with fixed-depth decision trees. The final system, which combines the outputs of two base systems, achieved F1=76.59 on the official test set. Additionally, we provide results comparing the system when using partial vs. full parsing input information.
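The conversion of labeled argument spans into a per-token BIO sequence, as used above, can be sketched as follows; the tokens and argument spans are invented.

```python
def spans_to_bio(num_tokens, spans):
    """Convert labeled argument spans [(start, end, label), ...]
    (end exclusive) into a BIO tag sequence, one tag per token."""
    tags = ["O"] * num_tokens
    for start, end, label in spans:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags

tokens = ["The", "luxury", "auto", "maker", "sold", "1,214", "cars"]
# Invented argument spans for the predicate 'sold'.
spans = [(0, 4, "A0"), (5, 7, "A1")]
print(list(zip(tokens, spans_to_bio(len(tokens), spans))))
```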
Total Rank Distance And Scaled Total Rank Distance: Two Alternative Metrics In Computational Linguistics
In this paper we propose two metrics to be used in various areas of computational linguistics. Our construction is based on the supposition that in most natural languages the most important information is carried by the first part of the unit. We introduce total rank distance and scaled total rank distance, prove that they are metrics, and investigate their maximum and expected values. Finally, a short application is presented: we investigate the similarity of Romance languages by computing the scaled total rank distance between the digram rankings of each language.
Learning Dependency Relations of Japanese Compound Functional Expressions
This paper proposes an approach to processing Japanese compound functional expressions by identifying them and analyzing their dependency relations through a machine learning technique. First, we formalize the task of identifying Japanese compound functional expressions in a text as a machine-learning-based chunking problem. Next, we apply a dependency analysis method based on the cascaded chunking model to the results of identifying compound functional expressions. The results of experimental evaluation show that the dependency analysis model achieves improvements when applied after identifying compound functional expressions, compared with the case where it is applied without identifying compound functional expressions.
Semi-Supervised Classification for Extracting Protein Interaction Sentences using Dependency Parsing
We introduce a relation extraction method to identify the sentences in biomedical text that indicate an interaction among the protein names mentioned. Our approach is based on the analysis of the paths between two protein names in the dependency parse trees of the sentences. Given two dependency trees, we define two separate similarity functions (kernels) based on cosine similarity and edit distance among the paths between the protein names. Using these similarity functions, we investigate the performances of two classes of learning algorithms, Support Vector Machines and k-nearest-neighbor, and the semi-supervised counterparts of these algorithms, transductive SVMs and harmonic functions, respectively. We report a significant improvement over previous results in the literature and introduce a new benchmark dataset. Semi-supervised algorithms perform better than their supervised counterparts by a wide margin, especially when the amount of labeled data is limited.
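One of the two path similarity functions can be illustrated as a cosine kernel over bag-of-token representations of the dependency paths; the sketch below uses invented paths and is not the authors' exact formulation (their edit-distance kernel is analogous).

```python
import math
from collections import Counter

def cosine_path_kernel(path1, path2):
    """Cosine similarity between bag-of-token vectors of two dependency
    paths connecting the protein names in two sentences."""
    v1, v2 = Counter(path1), Counter(path2)
    dot = sum(v1[t] * v2[t] for t in v1)
    norm = math.sqrt(sum(c * c for c in v1.values())) * \
           math.sqrt(sum(c * c for c in v2.values()))
    return dot / norm if norm else 0.0

# Invented paths between two protein mentions in two sentences.
path1 = ["PROT", "nsubj", "interacts", "prep_with", "PROT"]
path2 = ["PROT", "nsubj", "binds", "prep_to", "PROT"]
print(cosine_path_kernel(path1, path2))
```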
Unsupervised Language Model Adaptation Incorporating Named Entity Information
Language model (LM) adaptation is important for both speech and language processing. It is often achieved by combining a generic LM with a topic-specific model that is more relevant to the target document. Unlike previous work on unsupervised LM adaptation, this paper investigates how effectively using named entity (NE) information, instead of considering all the words, helps LM adaptation. We evaluate two latent topic analysis approaches in this paper, namely, clustering and Latent Dirichlet Allocation (LDA). In addition, a new dynamically adapted weighting scheme for topic mixture models is proposed based on LDA topic analysis. Our experimental results show that the NE-driven LM adaptation framework outperforms the baseline generic LM. The best result is obtained using the LDA-based approach by expanding the named entities with syntactically filtered words, together with using a large number of topics, which yields a perplexity reduction of 14.23% compared to the baseline generic LM.
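The combination of a generic LM with a topic-specific mixture can be sketched as a linear interpolation; the distributions and weight below are invented, and the paper's dynamically adapted weighting and NE-driven topic inference are not reproduced here.

```python
def adapted_prob(word, p_generic, topic_word_prob, doc_topic_weights, lam=0.7):
    """Interpolate a generic LM with a topic mixture:
       p(w) = lam * p_generic(w) + (1 - lam) * sum_t p(t | doc) * p(w | t)."""
    p_topic = sum(doc_topic_weights[t] * topic_word_prob[t].get(word, 1e-6)
                  for t in doc_topic_weights)
    return lam * p_generic.get(word, 1e-6) + (1 - lam) * p_topic

# Invented distributions; in the paper the topic posterior would be
# inferred (e.g. by LDA) from the named entities of the target document.
p_generic = {"court": 0.0004, "election": 0.0003}
topic_word_prob = {"politics": {"election": 0.01, "court": 0.002},
                   "law": {"court": 0.015, "election": 0.001}}
doc_topic_weights = {"politics": 0.8, "law": 0.2}
for w in ("court", "election"):
    print(w, adapted_prob(w, p_generic, topic_word_prob, doc_topic_weights))
```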
Improving Alignments for Better Confusion Networks for Combining Machine Translation Systems
The state-of-the-art system combination method for machine translation (MT) is the word-based combination using confusion networks. One of the crucial steps in confusion network decoding is the alignment of different hypotheses to each other when building a network. In this paper, we present new methods to improve alignment of hypotheses using word synonyms and a two-pass alignment strategy. We demonstrate that combination with the new alignment technique yields up to 2.9 BLEU point improvement over the best input system and up to 1.3 BLEU point improvement over a state-of-the-art combination method on two different language pairs.
Approximate Factoring for A* Search
We present a novel method for creating A* estimates for structured search problems. In our approach, we project a complex model onto multiple simpler models for which exact inference is efficient. We use an optimization framework to estimate parameters for these projections in a way which bounds the true costs. Similar to Klein and Manning (2003), we then combine completion estimates from the simpler models to guide search in the original complex model. We apply our approach to bitext parsing and lexicalized parsing, demonstrating its effectiveness in these domains.
Parsing Aligned Parallel Corpus By Projecting Syntactic Relations From Annotated Source Corpus
Example-based parsing has already been proposed in the literature. In particular, attempts are being made to develop techniques for language pairs where the source and target languages are different, e.g. the Direct Projection Algorithm (Hwa et al., 2005). This enables one to develop a parsed corpus for target languages having fewer linguistic tools with the help of a resource-rich source language. The DPA algorithm works on the assumption of Direct Correspondence, which simply means that the relation between two words of the source language sentence can be projected directly between the corresponding words of the parallel target language sentence. However, we find that this assumption does not always hold. This leads to incorrect parse structures for the target language sentence. As a solution we propose an algorithm called pseudo DPA (pDPA) that can work even if the Direct Correspondence assumption is not guaranteed. The proposed algorithm works in a recursive manner by considering the embedded phrase structures from the outermost level to the innermost. The present work discusses the pDPA algorithm, and illustrates it with respect to the English-Hindi language pair. Link Grammar based parsing has been considered as the underlying parsing scheme for this work.
Chart Parsing According To The Slot And Filler Principle
A parser is an algorithm that assigns a structural description to a string according to a grammar. It follows from this definition that there are three general issues in parser design: the structure to be assigned, the type of grammar, the recognition algorithm. Common parsers employ phrase structure descriptions, rule-based grammars, and derivation or transition oriented recognition. The following choices result in a new parser: The structure to be assigned to the input is a dependency tree with lexical, morpho-syntactic and functional-syntactic information associated with each node and coded by complex categories which are subject to unification. The grammar is lexicalized, i.e. the syntactical relationships are stated as part of the lexical descriptions of the elements of the language. The algorithm relies on the slot and filler principle in order to draw up complex structures. It utilizes a well-formed substring table (chart) which allows for discontinuous segments.
Communication In Large Distributed AI Systems For Natural Language Processing
We describe the design and implementation of a communication system for large AI projects, capable of supporting various software components in a heterogeneous hardware and programming-language environment. The system is based on a modification of the channel approach introduced by Hoare (1978). It is a three-layered approach with a de facto standard network layer (PVM), core routines, and interfaces to five different programming languages, together with support for the transparent exchange of complex data types. A special component takes over name service functions. It also records the actual configuration of the modules present in the application and the created channels. We describe the integration of this communication facility in two versions of a speech-to-speech translation system, which differ with regard to quality and quantity of data distributed within the applications and with regard to the degree of interactivity involved in processing.
Czech-English Dependency Tree-Based Machine Translation
We present some preliminary results of a Czech-English translation system based on dependency trees. The fully automated process includes: morphological tagging, analytical and tectogrammatical parsing of Czech, tectogrammatical transfer based on lexical substitution using word-to-word translation dictionaries enhanced by the information from the English-Czech parallel corpus of WSJ, and a simple rule-based system for generation from English tectogrammatical representation. In the evaluation part, we compare results of the fully automated and the manually annotated processes of building the tectogrammatical representation.
Optimizing Disambiguation In Swahili
It is argued in this paper that an optimal solution to disambiguation is a combination of linguistically motivated rules and resolution based on probability or heuristic rules. By disambiguation is here meant ambiguity resolution on all levels of language analysis, including morphology and semantics. The discussion is based on Swahili, for which a comprehensive analysis system has been developed by using two-level description in morphology and constraint grammar formalism in disambiguation. Particular attention is paid to optimising the use of different solutions for achieving maximal precision with minimal rule writing.
Automatic Detection Of Omissions In Translations
ADOMIT is an algorithm for Automatic Detection of OMIssions in Translations. The algorithm relies solely on geometric analysis of bitext maps and uses no linguistic information. This property allows it to deal equally well with omissions that do not correspond to linguistic units, such as might result from word-processing mishaps. ADOMIT has proven itself by discovering many errors in a hand-constructed gold standard for evaluating bitext mapping algorithms. Quantitative evaluation on simulated omissions showed that, even with today's poor bitext mapping technology, ADOMIT is a valuable quality control tool for translators and translation bureaus.
Spatiotemporal Annotation Using MiniSTEx: how to deal with Alternative, Foreign, Vague and/or Obsolete Names?
We are currently developing MiniSTEx, a spatiotemporal annotation system to handle temporal and/or geospatial information directly and indirectly expressed in texts. In the end, the aim is to locate all eventualities in a text on a time axis and/or a map to ensure an optimal base for automatic temporal and geospatial reasoning. A first version of MiniSTEx was originally developed for Dutch, keeping in mind that it should also be useful for other European languages, and for multilingual applications. In order to meet these desiderata we need the MiniSTEx system to be able to draw the conclusions human readers belonging to the intended audience would also draw, e.g. based on their (spatiotemporal) world knowledge, i.e. the common knowledge such readers share. The world knowledge MiniSTEx uses is contained in interconnected tables in a database. At the moment it is used for Dutch and English. Special attention will be paid to the problems we face when looking at older texts or recent historical or encyclopedic texts, i.e. texts with lots of references to times and locations that are not compatible with our current maps and calendars.
Semantic Construction In F-TAG
We propose a semantic construction method for Feature-Based Tree Adjoining Grammar which is based on the derived tree, compare it with related proposals and briefly discuss some implementation possibilities.
BBN: Description Of The PLUM System As Used For MUC-3
Traditional approaches to the problem of extracting data from texts have emphasized handcrafted linguistic knowledge. In contrast, BBN's PLUM system (Probabilistic Language Understanding Model) was developed as part of a DARPA-funded research effort on integrating probabilistic language models with more traditional linguistic techniques. Our research and development goals are more rapid development of new applications, the ability to train (and re-train) systems based on user markings of correct and incorrect output, more accurate selection among interpretations when more than one is found, and more robust partial interpretation when no complete interpretation can be found. We have previously performed experiments on components of the system with texts from the Wall Street Journal; however, the MUC-3 task is the first end-to-end application of PLUM. All components except parsing were developed in the last five months, and cannot therefore be considered fully mature. The parsing component, the MIT Fast Parser [4], originated outside BBN and has a more extensive history prior to MUC-3. A central assumption of our approach is that in processing unrestricted text for data extraction, a non-trivial amount of the text will not be understood. As a result, all components of PLUM are designed to operate on partially understood input, taking advantage of information when available, and not failing when information is unavailable. The following section describes the major PLUM components.
Sentence Reduction For Automatic Text Summarization
We present a novel sentence reduction system for automatically removing extraneous phrases from sentences that are extracted from a document for summarization purposes. The system uses multiple sources of knowledge to decide which phrases in an extracted sentence can be removed, including syntactic knowledge, context information, and statistics computed from a corpus that consists of examples written by human professionals. Reduction can significantly improve the conciseness of automatic summaries.
Combining Knowledge Sources To Reorder N-Best Speech Hypothesis Lists
A simple and general method is described that can combine different knowledge sources to reorder N-best lists of hypotheses produced by a speech recognizer. The method is automatically trainable, acquiring information from both positive and negative examples. In experiments, the method was tested on a 1000-utterance sample of unseen ATIS data.
Lexical Concept Acquisition From Collocation Map
This paper introduces an algorithm for automatically acquiring the conceptual structure of each word from a corpus. The concept of a word is defined within a probabilistic framework. A variation of the Belief Net, called the Collocation Map, is used to compute the probabilities. The Belief Net captures the conditional independences of words, which are obtained from cooccurrence relations. The computation in general Belief Nets is known to be NP-hard, so we adopted Gibbs sampling for the approximation of the probabilities. The use of a Belief Net to model lexical meaning is unique in that the network is larger than expected in most other applications, and this changes the attitude toward the use of Belief Nets. The lexical concept obtained from the Collocation Map best reflects the subdomain of language usage. The potential applications of the conditional probabilities that the Collocation Map provides may extend to very diverse areas of language processing such as sense disambiguation, thesaurus construction, automatic indexing, and document classification.
Prediction Of Vowel And Consonant Place Of Articulation
A deductive approach is used to predict vowel and consonant places of articulation. Based on two main criteria, viz. simple and efficient use of an acoustic tube, along with maximum acoustic dispersion, the Distinctive Regions Model (DRM) of speech production derives regions that closely correspond to established vowel and consonant places of articulation.
Bootstrapping Without The Boot
"Bootstrapping" methods for learning require a small amount of supervision to seed the learning process. We show that it is sometimes possible to eliminate this last bit of supervision, by trying many candidate seeds and selecting the one with the most plausible outcome. We discuss such "strapping" methods in general, and exhibit a particular method for strapping word-sense classifiers for ambiguous words. Our experiments on the Canadian Hansards show that our unsupervised technique is significantly more effective than picking seeds by hand (Yarowsky, 1995), which in turn is known to rival supervised methods. "
Adding Domain Specificity To An MT System
In the development of a machine translation system, one important issue is being able to adapt to a specific domain without requiring time-consuming lexical work. We have experimented with using a statistical word-alignment algorithm to derive word association pairs (French-English) that complement an existing multipurpose bilingual dictionary. This word association information is added to the system at the time of the automatic creation of our translation pattern database, thereby making this database more domain specific. This technique significantly improves the overall quality of translation, as measured in an independent blind evaluation.
Improving English-Spanish Statistical Machine Translation: Experiments in Domain Adaptation, Sentence Paraphrasing, Tokenization, and Recasing
We describe the experiments of the UC Berkeley team on improving English-Spanish machine translation of news text, as part of the WMT'08 Shared Translation Task. We experiment with domain adaptation, combining a small in-domain news bi-text and a large out-of-domain one from the Europarl corpus, building two separate phrase translation models and two separate language models. We further add a third phrase translation model trained on a version of the news bi-text augmented with monolingual sentence-level syntactic paraphrases on the source-language side, and we combine all models in a log-linear model using minimum error rate training. Finally, we experiment with different tokenization and recasing rules, achieving 35.09% Bleu score on the WMT'07 news test data when translating from English to Spanish, which is a sizable improvement over the highest Bleu score achieved on that dataset at WMT'07: 33.10% (in fact, by our system). On the WMT'08 English to Spanish news translation, we achieve 21.92%, which makes our team the second best on Bleu score.
Learning N-Best Correction Models from Implicit User Feedback in a Multi-Modal Local Search Application
We describe a novel n-best correction model that can leverage implicit user feedback (in the form of clicks) to improve performance in a multi-modal speech-search application. The proposed model works in two stages. First, the n-best list generated by the speech recognizer is expanded with additional candidates, based on confusability information captured via user click statistics. In the second stage, this expanded list is rescored and pruned to produce a more accurate and compact n-best list. Results indicate that the proposed n-best correction model leads to significant improvements over the existing baseline, as well as other traditional n-best rescoring approaches.
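The two-stage idea can be sketched roughly as follows, with the shape of the click statistics, the interpolation weight, and all function names assumed for illustration rather than taken from the paper.

```python
# Rough sketch of the two-stage n-best correction idea: (1) expand the
# recognizer's n-best list with confusable alternatives mined from click
# logs, (2) rescore and prune.  The interpolation weight and the structure
# of the click statistics are assumptions, not the paper's actual model.
def correct_nbest(nbest, click_confusions, keep=5, alpha=0.7):
    """nbest: list of (hypothesis, asr_score); click_confusions:
    dict hypothesis -> list of (alternative, click_probability)."""
    expanded = dict(nbest)
    for hyp, asr_score in nbest:
        for alt, click_p in click_confusions.get(hyp, []):
            # give the alternative a score derived from its source hypothesis
            cand_score = asr_score * click_p
            expanded[alt] = max(expanded.get(alt, 0.0), cand_score)
    # rescore: interpolate ASR evidence with click-derived evidence, then prune
    rescored = []
    for hyp, score in expanded.items():
        click_boost = max((p for h, alts in click_confusions.items()
                           for a, p in alts if a == hyp), default=0.0)
        rescored.append((hyp, alpha * score + (1 - alpha) * click_boost))
    return sorted(rescored, key=lambda x: -x[1])[:keep]
```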
Discourse-New Detectors For Definite Description Resolution: A Survey And A Preliminary Proposal
Vieira and Poesio (2000) proposed an algorithm for definite description (dd) resolution that incorporates a number of heuristics for detecting discourse-new descriptions. The inclusion of such detectors was motivated by the observation that more than 50% of definite descriptions (dds) in an average corpus are discourse new (Poesio and Vieira, 1998), but whereas the inclusion of detectors for non-anaphoric pronouns in algorithms such as Lappin and Leass' (1994) leads to clear improvements in precision, the improvements in anaphoric dd resolution (as opposed to classification) brought about by the detectors were rather small. In fact, Ng and Cardie (2002a) challenged the motivation for the inclusion of such detectors, reporting no improvements, or even worse performance. We re-examine the literature on the topic in detail, and propose a revised algorithm, taking advantage of the improved discourse-new detection techniques developed by Uryupina (2003).
Comparing Information Extraction Pattern Models
Several recently reported techniques for the automatic acquisition of Information Extraction (IE) systems have used dependency trees as the basis of their extraction pattern representation. These approaches have used a variety of pattern models (schemes for representing IE patterns based on particular parts of the dependency analysis). An appropriate model should be expressive enough to represent the information which is to be extracted from text without being overly complicated. Four previously reported pattern models are evaluated using existing IE evaluation corpora and three dependency parsers. It was found that one model, linked chains, could represent around 95% of the information of interest without generating an unwieldy number of possible patterns.
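To make the linked-chains model concrete, the sketch below builds chains as root-to-node label paths in a toy dependency parse and pairs chains that diverge at the verbal root; the parse format and the pairing rule are illustrative simplifications, not the evaluated systems themselves.

```python
# Illustrative sketch of the "linked chains" pattern model: from a dependency
# parse given as (token, head index, relation) triples, chains are
# root-to-node label paths and a linked chain joins two such chains at the
# verbal root.  The toy parse format is an assumption for illustration.
from itertools import combinations

def chains(parse, root):
    """parse: list of (token, head, rel); returns all root-to-node label paths."""
    children = {}
    for i, (_, head, _) in enumerate(parse):
        children.setdefault(head, []).append(i)
    paths = []
    def walk(node, path):
        for child in children.get(node, []):
            tok, _, rel = parse[child]
            new_path = path + [(rel, tok)]
            paths.append(new_path)
            walk(child, new_path)
    walk(root, [])
    return paths

def linked_chains(parse, root):
    """Pairs of chains that share only the verbal root."""
    cs = chains(parse, root)
    return [(a, b) for a, b in combinations(cs, 2) if a[0] != b[0]]

# Example: "Smith joined Acme" with "joined" as root (index 1).
parse = [("Smith", 1, "subj"), ("joined", -1, "root"), ("Acme", 1, "obj")]
print(linked_chains(parse, 1))
```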
Multilingual Text Entry using Automatic Language Detection
Computer users increasingly need to produce text written in multiple languages. However, typical computer interfaces require the user to change the text entry software each time a different language is used. This is cumbersome, especially when language changes are frequent. To solve this problem, we propose TypeAny, a novel front-end interface that detects the language of the user's key entry and automatically dispatches the input to the appropriate text entry system. Unlike previously reported methods, TypeAny can handle more than two languages, and can easily support any new language even if the available corpus is small. When evaluating this method, we obtained language detection accuracy of 96.7% when an appropriate language had to be chosen from among three languages. The number of control actions needed to switch languages decreased by over 93% when using TypeAny rather than a conventional method.
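A minimal sketch of detecting the input language from the characters typed so far, assuming a character-bigram model per language with add-one smoothing; TypeAny's actual models and training data are not reproduced here.

```python
# Score the typed key sequence under a character-bigram model per language
# and pick the highest-scoring one.  The training strings are made up.
import math
from collections import Counter

def train_bigram_model(text):
    return Counter(zip(text, text[1:])), Counter(text)

def log_prob(text, model, vocab_size=100):
    bigrams, unigrams = model
    lp = 0.0
    for a, b in zip(text, text[1:]):
        # add-one smoothing so unseen bigrams do not zero out the score
        lp += math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
    return lp

def detect_language(typed, models):
    return max(models, key=lambda lang: log_prob(typed, models[lang]))

models = {
    "en": train_bigram_model("the quick brown fox jumps over the lazy dog"),
    "fr": train_bigram_model("le renard brun rapide saute par dessus le chien"),
}
print(detect_language("le renard", models))  # expected to pick "fr"
```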
A Semantic Approach To Textual Entailment: System Evaluation and Task Analysis
This paper discusses our contribution to the third RTE Challenge - the SALSA RTE system. It builds on an earlier system based on a relatively deep linguistic analysis, which we complement with a shallow component based on word overlap. We evaluate their (combined) performance on various data sets. However, we could only partly replicate earlier observations that combining the features improves overall accuracy.
Extracting Data Records from Unstructured Biomedical Full Text
In this paper, we address the problem of extracting data records and their attributes from unstructured biomedical full text. There has been little effort reported on this in the research community. We argue that semantics is important for record extraction and other finer-grained language processing tasks. We derive a data record template, including semantic language models, from unstructured text and represent it with a discourse-level Conditional Random Field (CRF) model. We evaluate the approach from the perspective of Information Extraction and achieve significant improvements in system performance compared with other baseline systems.
Mining Wiki Resources for Multilingual Named Entity Recognition
In this paper, we describe a system by which the multilingual characteristics of Wikipedia can be utilized to annotate a large corpus of text with Named Entity Recognition (NER) tags requiring minimal human intervention and no linguistic expertise. This process, though of value in languages for which resources exist, is particularly useful for less commonly taught languages. We show how the Wikipedia format can be used to identify possible named entities and discuss in detail the process by which we use the Category structure inherent to Wikipedia to determine the named entity type of a proposed entity. We further describe the methods by which English language data can be used to bootstrap the NER process in other languages. We demonstrate the system by using the generated corpus as training sets for a variant of BBN's Identifinder in French, Ukrainian, Spanish, Polish, Russian, and Portuguese, achieving overall F-scores as high as 84.7% on independent, human-annotated corpora, comparable to a system trained on up to 40,000 words of human-annotated newswire.
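The category-based typing step might be sketched as follows, assuming hand-picked keyword cues and a simple majority vote; the paper's actual procedure for exploiting the Category structure is more elaborate.

```python
# Hedged sketch of category-based entity typing: decide a proposed entity's
# NER type from the Wikipedia categories of its article.  The keyword lists
# and the majority-vote rule are illustrative simplifications.
CATEGORY_CUES = {
    "PER": ["births", "deaths", "people", "players", "politicians"],
    "LOC": ["cities", "countries", "rivers", "populated places"],
    "ORG": ["companies", "organizations", "universities", "clubs"],
}

def entity_type(categories):
    """categories: list of category strings for one Wikipedia article."""
    votes = {t: 0 for t in CATEGORY_CUES}
    for cat in categories:
        lowered = cat.lower()
        for etype, cues in CATEGORY_CUES.items():
            if any(cue in lowered for cue in cues):
                votes[etype] += 1
    best = max(votes, key=votes.get)
    return best if votes[best] > 0 else "O"   # "O" = no named entity type found

print(entity_type(["1961 births", "Presidents of the United States"]))  # PER
```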
Modeling Local Coherence: An Entity-Based Approach
This article proposes a novel framework for representing and measuring local coherence. Central to this approach is the entity-grid representation of discourse, which captures patterns of entity distribution in a text. The algorithm introduced in the article automatically abstracts a text into a set of entity transition sequences and records distributional, syntactic, and referential information about discourse entities. We re-conceptualize coherence assessment as a learning task and show that our entity-based representation is well-suited for ranking-based generation and text classification tasks. Using the proposed representation, we achieve good performance on text ordering, summary coherence evaluation, and readability assessment.
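A minimal sketch of the entity-grid representation: each entity receives a role per sentence (S for subject, O for object, X for other, '-' for absent), and coherence features are the distribution of role transitions between adjacent sentences. The toy grid below is hand-built; in the article it is produced from parsed, coreference-resolved text.

```python
# Build transition-probability features from an entity grid.
from collections import Counter
from itertools import product

def transition_features(grid, roles="SOX-"):
    """grid: dict entity -> list of roles, one per sentence."""
    counts = Counter()
    total = 0
    for roles_per_sentence in grid.values():
        for a, b in zip(roles_per_sentence, roles_per_sentence[1:]):
            counts[(a, b)] += 1
            total += 1
    # probability of each possible transition, in a fixed feature order
    return {t: counts[t] / total for t in product(roles, repeat=2)}

grid = {
    "Microsoft": ["S", "O", "S"],
    "market":    ["-", "X", "O"],
    "earnings":  ["O", "-", "-"],
}
features = transition_features(grid)
print(features[("S", "O")], features[("-", "-")])
```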
Discriminative Language Modeling With Conditional Random Fields And The Perceptron Algorithm
This paper describes discriminative language modeling for a large vocabulary speech recognition task. We contrast two parameter estimation methods: the perceptron algorithm, and a method based on conditional random fields (CRFs). The models are encoded as deterministic weighted finite state automata, and are applied by intersecting the automata with word-lattices that are the output from a baseline recognizer. The perceptron algorithm has the benefit of automatically selecting a relatively small feature set in just a couple of passes over the training data. However, using the feature set output by the perceptron algorithm (initialized with the perceptron weights), CRF training provides an additional 0.5% reduction in word error rate, for a total 1.8% absolute reduction from the baseline of 39.2%.
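A sketch of the perceptron side of the comparison, simplified to n-best lists rather than intersected word lattices, with an assumed unigram/bigram feature map; it is meant only to illustrate the update rule, not the paper's lattice-based implementation.

```python
# Perceptron-style discriminative LM trained on n-best lists.
from collections import defaultdict

def features(hypothesis):
    """Unigram + bigram indicator features over the hypothesis words."""
    feats = defaultdict(int)
    words = hypothesis.split()
    for w in words:
        feats["u:" + w] += 1
    for a, b in zip(words, words[1:]):
        feats["b:" + a + "_" + b] += 1
    return feats

def perceptron_train(data, epochs=3):
    """data: list of (nbest, oracle) where nbest is a list of hypothesis
    strings and oracle is the lowest-error hypothesis among them."""
    weights = defaultdict(float)
    score = lambda hyp: sum(weights[f] * v for f, v in features(hyp).items())
    for _ in range(epochs):
        for nbest, oracle in data:
            best = max(nbest, key=score)
            if best != oracle:                     # standard perceptron update
                for f, v in features(oracle).items():
                    weights[f] += v
                for f, v in features(best).items():
                    weights[f] -= v
    return weights
```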
Archivus: A Multimodal System For Multimedia Meeting Browsing And Retrieval
This paper presents Archivus, a multimodal language-enabled meeting browsing and retrieval system. The prototype is in an early stage of development, and we are currently exploring the role of natural language for interacting in this relatively unfamiliar and complex domain. We briefly describe the design and implementation status of the system, and then focus on how this system is used to elicit useful data for supporting hypotheses about multimodal interaction in the domain of meeting retrieval and for developing NLP modules for this specific domain.
Feature Logic With Disjunctive Unification
We introduce feature terms containing sorts, variables, negation and named disjunction for the specification of feature structures. We show that the possibility to label distinctions with names has major advantages both for the use of feature logic in computational linguistics and for its implementation. We give an open-world semantics for feature terms, where the denotation of a term is determined in dependence on the disjunctive context, i.e. the choices taken for the disjunctions. We define context-unique feature descriptions, a relational, constraint-based representation language, and give a normalization procedure that allows the consistency of feature terms to be tested. This procedure not only avoids expansion to disjunctive normal form but also maintains, as far as possible, structure sharing between information contained in different disjuncts. Context-unique feature descriptions can be easily implemented in environments that support ordinary unification (such as PROLOG).
MTriage: Web-enabled Software for the Creation, Machine Translation, and Annotation of Smart Documents
Progress in the Machine Translation (MT) research community, particularly for statistical approaches, is intensely data-driven. Acquiring source language documents for testing, creating training datasets for customized MT lexicons, and building parallel corpora for MT evaluation require translators and non-native speaking analysts to handle large document collections. These collections are further complicated by differences in format, encoding, source media, and access to metadata describing the documents. Automated tools that allow language professionals to quickly annotate, translate, and evaluate foreign language documents are essential to improving MT quality and efficacy. The purpose of this paper is to present our research approach to improving MT through pre-processing source language documents. In particular, we discuss the development and use of MTriage, an application environment that enables the translator to mark up documents with metadata for MT parameterization and routing. The use of MTriage as a web-enabled front end to multiple MT engines has leveraged the capabilities of our human translators for creating lexicons from NFW (Not-Found-Word) lists, writing reference translations, and creating parallel corpora for MT development and evaluation.
Analyzing Japanese Double-Subject Construction Having An Adjective Predicate
This paper describes a method for analyzing the Japanese double-subject construction having an adjective predicate, based on valency structure. A simple sentence usually has only one subjective case in most languages. However, many Japanese adjectives (and some verbs) can dominate two surface subjective cases within a simple sentence. Such a sentence structure is called the double-subject construction. This paper classifies the Japanese double-subject construction into four types and describes the problems that arise when analyzing these types with ordinary approaches to Japanese sentence analysis. It then proposes a method for analyzing Japanese double-subject constructions having an adjective predicate that overcomes the problems described. By applying this method to sentence analysis in Japanese-to-English machine translation systems, translation accuracy can be improved, because the method correctly analyzes the double-subject construction.
A Semantic-Based Approach To Interoperability Of Classification Hierarchies: Evaluation Of Linguistic Techniques
Classification Hierarchies (CHs) are widely used to organize documents in a way that makes their retrieval easier. Common examples of CHs are Web directories, marketplace catalogs, and file systems. In this paper we discuss and evaluate CtxMatch, an approach to interoperability that discovers mappings among CHs by considering the semantic interpretation of their nodes. CtxMatch performs linguistic processing of the labels attached to the nodes, including tokenization, Part-of-Speech tagging, multiword recognition, and word sense disambiguation. We present an evaluation of the overall performance of the approach on Web directories, as well as a systematic analysis of the linguistic modules involved.
A Matching Technique in Example-Based Machine Translation
This paper addresses an important problem in Example-Based Machine Translation (EBMT), namely how to measure similarity between a sentence fragment and a set of stored examples. A new method is proposed that measures similarity according to both surface structure and content. A second contribution is the use of clustering to make retrieval of the best matching example from the database more efficient. Results on a large number of test cases from the CELEX database are presented.
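One way to picture the proposed matching, under assumed definitions: combine a surface measure (token-level edit similarity) with a content measure (content-word overlap) and return the best-scoring stored example. The clustering step used for efficient retrieval is omitted, and both component measures and the weighting are illustrative, not the paper's actual definitions.

```python
# Hedged sketch of example matching that weighs surface structure and content.
import difflib

STOPWORDS = {"the", "a", "an", "of", "to", "in"}

def surface_sim(frag, example):
    # token-level sequence similarity as a stand-in for surface-structure match
    return difflib.SequenceMatcher(None, frag.split(), example.split()).ratio()

def content_sim(frag, example):
    # Jaccard overlap of content words as a stand-in for content match
    f = {w for w in frag.lower().split() if w not in STOPWORDS}
    e = {w for w in example.lower().split() if w not in STOPWORDS}
    return len(f & e) / len(f | e) if f | e else 0.0

def best_match(fragment, examples, w_surface=0.5):
    score = lambda ex: (w_surface * surface_sim(fragment, ex)
                        + (1 - w_surface) * content_sim(fragment, ex))
    return max(examples, key=score)

examples = ["the board approved the merger", "the committee rejected the plan"]
print(best_match("the board rejected the merger", examples))
```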
Connecting Text Mining and Pathways using the PathText Resource
Many systems have been developed in the past few years to assist researchers in the discovery of knowledge published as English text, for example in the PubMed database. At the same time, higher-level collective knowledge is often published using a graphical notation representing all the entities in a pathway and their interactions. We believe that these pathway visualizations could serve as an effective user interface for knowledge discovery if they can be linked to the text in publications. Since the graphical elements in a pathway are of a very different nature from their corresponding descriptions in English text, we developed a prototype system called PathText. The goal of PathText is to serve as a bridge between these two different representations. In this paper, we first describe the overall architecture and the interfaces of the PathText system, and then provide some details about the core Text Mining components.