Activity detection for information access to oral communication Oral communication is ubiquitous and carries important information yet it is also time consuming to document. Given the development of storage media and networks one could just record and store a conversation for documentation. The question is, however, how an interesting information piece would be found in a large database . Traditional information retrieval techniques use a histogram of keywords as the document representation but oral communication may offer additional indices such as the time and place of the rejoinder and the attendance. An alternative index could be the activity such as discussing, planning, informing, story-telling, etc. This paper addresses the problem of the automatic detection of those activities in meeting situation and everyday rejoinders. Several extensions of this basic idea are being discussed and/or evaluated: Similar to activities one can define subsets of larger database and detect those automatically which is shown on a large database of TV shows . Emotions and other indices such as the dominance distribution of speakers might be available on the surface and could be used directly. Despite the small size of the databases used some results about the effectiveness of these indices can be obtained.
Dialogue Interaction with the DARPA Communicator Infrastructure: The Development of Useful Software To support engaging human users in robust, mixed-initiative speech dialogue interactions which reach beyond current capabilities in dialogue systems , the DARPA Communicator program [1] is funding the development of a distributed message-passing infrastructure for dialogue systems which all Communicator participants are using. In this presentation, we describe the features of and requirements for a genuinely useful software infrastructure for this purpose.
Interlingua-Based Broad-Coverage Korean-to-English Translation in CCLING At MIT Lincoln Laboratory, we have been developing a Korean-to-English machine translation system CCLINC (Common Coalition Language System at Lincoln Laboratory) . The CCLINC Korean-to-English translation system consists of two core modules , language understanding and generation modules mediated by a language neutral meaning representation called a semantic frame . The key features of the system include: (i) Robust efficient parsing of Korean (a verb final language with overt case markers , relatively free word order , and frequent omissions of arguments ). (ii) High quality translation via word sense disambiguation and accurate word order generation of the target language . (iii) Rapid system development and porting to new domains via knowledge-based automated acquisition of grammars . Having been trained on Korean newspaper articles on missiles and chemical biological warfare, the system produces the translation output sufficient for content understanding of the original document .
Is That Your Final Answer? The purpose of this research is to test the efficacy of applying automated evaluation techniques , originally devised for the evaluation of human language learners , to the output of machine translation (MT) systems . We believe that these evaluation techniques will provide information about both the human language learning process , the translation process and the development of machine translation systems . This, the first experiment in a series of experiments, looks at the intelligibility of MT output . A language learning experiment showed that assessors can differentiate native from non-native language essays in less than 100 words . Even more illuminating was the factors on which the assessors made their decisions. We tested this to see if similar criteria could be elicited from duplicating the experiment using machine translation output . Subjects were given a set of up to six extracts of translated newswire text . Some of the extracts were expert human translations , others were machine translation outputs . The subjects were given three minutes per extract to determine whether they believed the sample output to be an expert human translation or a machine translation . Additionally, they were asked to mark the word at which they made this decision. The results of this experiment, along with a preliminary analysis of the factors involved in the decision making process will be presented here.
Listen-Communicate-Show (LCS): Spoken Language Command of Agent-based Remote Information Access Listen-Communicate-Show (LCS) is a new paradigm for human interaction with data sources . We integrate a spoken language understanding system with intelligent mobile agents that mediate between users and information sources . We have built and will demonstrate an application of this approach called LCS-Marine . Using LCS-Marine , tactical personnel can converse with their logistics system to place a supply or information request. The request is passed to a mobile, intelligent agent for execution at the appropriate database . Requestors can also instruct the system to notify them when the status of a request changes or when a request is complete. We have demonstrated this capability in several field exercises with the Marines and are currently developing applications of this technology in new domains .
On Combining Language Models : Oracle Approach In this paper, we address the problem of combining several language models (LMs) . We find that simple interpolation methods , like log-linear and linear interpolation , improve the performance but fall short of the performance of an oracle . The oracle knows the reference word string and selects the word string with the best performance (typically, word or semantic error rate ) from a list of word strings , where each word string has been obtained by using a different LM . Actually, the oracle acts like a dynamic combiner with hard decisions using the reference . We provide experimental results that clearly show the need for a dynamic language model combination to improve the performance further . We suggest a method that mimics the behavior of the oracle using a neural network or a decision tree . The method amounts to tagging LMs with confidence measures and picking the best hypothesis corresponding to the LM with the best confidence .
Towards an Intelligent Multilingual Keyboard System This paper proposes a practical approach employing n-gram models and error-correction rules for Thai key prediction and Thai-English language identification . The paper also proposes rule-reduction algorithm applying mutual information to reduce the error-correction rules . Our algorithm reported more than 99% accuracy in both language identification and key prediction .
SPoT: A Trainable Sentence Planner Sentence planning is a set of inter-related but distinct tasks, one of which is sentence scoping , i.e. the choice of syntactic structure for elementary speech acts and the decision of how to combine them into one or more sentences . In this paper, we present SPoT , a sentence planner , and a new methodology for automatically training SPoT on the basis of feedback provided by human judges . We reconceptualize the task into two distinct phases. First, a very simple, randomized sentence-plan-generator (SPG) generates a potentially large list of possible sentence plans for a given text-plan input . Second, the sentence-plan-ranker (SPR) ranks the list of output sentence plans , and then selects the top-ranked plan . The SPR uses ranking rules automatically learned from training data . We show that the trained SPR learns to select a sentence plan whose rating on average is only 5% worse than the top human-ranked sentence plan .
Low-cost, High-performance Translation Retrieval: Dumber is Better In this paper, we compare the relative effects of segment order , segmentation and segment contiguity on the retrieval performance of a translation memory system . We take a selection of both bag-of-words and segment order-sensitive string comparison methods , and run each over both character- and word-segmented data , in combination with a range of local segment contiguity models (in the form of N-grams ). Over two distinct datasets , we find that indexing according to simple character bigrams produces a retrieval accuracy superior to any of the tested word N-gram models . Further,in their optimum configuration , bag-of-words methods are shown to be equivalent to segment order-sensitive methods in terms of retrieval accuracy , but much faster. We also provide evidence that our findings are scalable.
Guided Parsing of Range Concatenation Languages The theoretical study of the range concatenation grammar [RCG] formalism has revealed many attractive properties which may be used in NLP . In particular, range concatenation languages [RCL] can be parsed in polynomial time and many classical grammatical formalisms can be translated into equivalent RCGs without increasing their worst-case parsing time complexity . For example, after translation into an equivalent RCG , any tree adjoining grammar can be parsed in O(n6) time . In this paper, we study a parsing technique whose purpose is to improve the practical efficiency of RCL parsers . The non-deterministic parsing choices of the main parser for a language L are directed by a guide which uses the shared derivation forest output by a prior RCL parser for a suitable superset of L . The results of a practical evaluation of this method on a wide coverage English grammar are given.
Extracting Paraphrases from a Parallel Corpus While paraphrasing is critical both for interpretation and generation of natural language , current systems use manual or semi-automatic methods to collect paraphrases . We present an unsupervised learning algorithm for identification of paraphrases from a corpus of multiple English translations of the same source text . Our approach yields phrasal and single word lexical paraphrases as well as syntactic paraphrases .
Alternative Phrases and Natural Language Information Retrieval This paper presents a formal analysis for a large class of words called alternative markers , which includes other (than) , such (as) , and besides . These words appear frequently enough in dialog to warrant serious attention , yet present natural language search engines perform poorly on queries containing them. I show that the performance of a search engine can be improved dramatically by incorporating an approximation of the formal analysis that is compatible with the search engine 's operational semantics . The value of this approach is that as the operational semantics of natural language applications improve, even larger improvements are possible.
Extending Lambek grammars: a logical account of minimalist grammars We provide a logical definition of Minimalist grammars , that are Stabler's formalization of Chomsky's minimalist program . Our logical definition leads to a neat relation to categorial grammar , (yielding a treatment of Montague semantics ), a parsing-as-deduction in a resource sensitive logic , and a learning algorithm from structured data (based on a typing-algorithm and type-unification ). Here we emphasize the connection to Montague semantics which can be viewed as a formal computation of the logical form .
Evaluating a Trainable Sentence Planner for a Spoken Dialogue System Techniques for automatically training modules of a natural language generator have recently been proposed, but a fundamental concern is whether the quality of utterances produced with trainable components can compete with hand-crafted template-based or rule-based approaches . In this paper We experimentally evaluate a trainable sentence planner for a spoken dialogue system by eliciting subjective human judgments . In order to perform an exhaustive comparison, we also evaluate a hand-crafted template-based generation component , two rule-based sentence planners , and two baseline sentence planners . We show that the trainable sentence planner performs better than the rule-based systems and the baselines , and as well as the hand-crafted system .
Using Machine Learning Techniques to Interpret WH-questions We describe a set of supervised machine learning experiments centering on the construction of statistical models of WH-questions . These models , which are built from shallow linguistic features of questions , are employed to predict target variables which represent a user's informational goals . We report on different aspects of the predictive performance of our models , including the influence of various training and testing factors on predictive performance , and examine the relationships among the target variables.
Effective Utterance Classification with Unsupervised Phonotactic Models This paper describes a method for utterance classification that does not require manual transcription of training data . The method combines domain independent acoustic models with off-the-shelf classifiers to give utterance classification performance that is surprisingly close to what can be achieved using conventional word-trigram recognition requiring manual transcription . In our method, unsupervised training is first used to train a phone n-gram model for a particular domain ; the output of recognition with this model is then passed to a phone-string classifier . The classification accuracy of the method is evaluated on three different spoken language system domains .
In Question Answering, Two Heads Are Better Than One Motivated by the success of ensemble methods in machine learning and other areas of natural language processing , we developed a multi-strategy and multi-source approach to question answering which is based on combining the results from different answering agents searching for answers in multiple corpora . The answering agents adopt fundamentally different strategies, one utilizing primarily knowledge-based mechanisms and the other adopting statistical techniques . We present our multi-level answer resolution algorithm that combines results from the answering agents at the question, passage, and/or answer levels . Experiments evaluating the effectiveness of our answer resolution algorithm show a 35.0% relative improvement over our baseline system in the number of questions correctly answered , and a 32.8% improvement according to the average precision metric .
Semantic Coherence Scoring Using an Ontology In this paper we present ONTOSCORE , a system for scoring sets of concepts on the basis of an ontology . We apply our system to the task of scoring alternative speech recognition hypotheses (SRH) in terms of their semantic coherence . We conducted an annotation experiment and showed that human annotators can reliably differentiate between semantically coherent and incoherent speech recognition hypotheses . An evaluation of our system against the annotated data shows that, it successfully classifies 73.2% in a German corpus of 2.284 SRHs as either coherent or incoherent (given a baseline of 54.55%).
Statistical Phrase-Based Translation We propose a new phrase-based translation model and decoding algorithm that enables us to evaluate and compare several, previously proposed phrase-based translation models . Within our framework, we carry out a large number of experiments to understand better and explain why phrase-based models outperform word-based models . Our empirical results, which hold for all examined language pairs , suggest that the highest levels of performance can be obtained through relatively simple means: heuristic learning of phrase translations from word-based alignments and lexical weighting of phrase translations . Surprisingly, learning phrases longer than three words and learning phrases from high-accuracy word-level alignment models does not have a strong impact on performance. Learning only syntactically motivated phrases degrades the performance of our systems.
A Generative Probabilistic OCR Model for NLP Applications In this paper, we introduce a generative probabilistic optical character recognition (OCR) model that describes an end-to-end process in the noisy channel framework , progressing from generation of true text through its transformation into the noisy output of an OCR system . The model is designed for use in error correction , with a focus on post-processing the output of black-box OCR systems in order to make it more useful for NLP tasks . We present an implementation of the model based on finite-state models , demonstrate the model 's ability to significantly reduce character and word error rate , and provide evaluation results involving automatic extraction of translation lexicons from printed text .
Statistical Sentence Condensation using Ambiguity Packing and Stochastic Disambiguation Methods for Lexical-Functional Grammar We present an application of ambiguity packing and stochastic disambiguation techniques for Lexical-Functional Grammars (LFG) to the domain of sentence condensation . Our system incorporates a linguistic parser/generator for LFG , a transfer component for parse reduction operating on packed parse forests , and a maximum-entropy model for stochastic output selection . Furthermore, we propose the use of standard parser evaluation methods for automatically evaluating the summarization quality of sentence condensation systems . An experimental evaluation of summarization quality shows a close correlation between the automatic parse-based evaluation and a manual evaluation of generated strings . Overall summarization quality of the proposed system is state-of-the-art, with guaranteed grammaticality of the system output due to the use of a constraint-based parser/generator .
Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network We present a new part-of-speech tagger that demonstrates the following ideas: (i) explicit use of both preceding and following tag contexts via a dependency network representation , (ii) broad use of lexical features , including jointly conditioning on multiple consecutive words , (iii) effective use of priors in conditional loglinear models , and (iv) fine-grained modeling of unknown word features . Using these ideas together, the resulting tagger gives a 97.24% accuracy on the Penn Treebank WSJ , an error reduction of 4.4% on the best previous single automatically learned tagging result.
Getting More Mileage from Web Text Sources for Conversational Speech Language Modeling using Class-Dependent Mixtures Sources of training data suitable for language modeling of conversational speech are limited. In this paper, we show how training data can be supplemented with text from the web filtered to match the style and/or topic of the target recognition task , but also that it is possible to get bigger performance gains from the data by using class-dependent interpolation of N-grams .
Adaptation Using Out-of-Domain Corpus within EBMT In order to boost the translation quality of EBMT based on a small-sized bilingual corpus , we use an out-of-domain bilingual corpus and, in addition, the language model of an in-domain monolingual corpus . We conducted experiments with an EBMT system . The two evaluation measures of the BLEU score and the NIST score demonstrated the effect of using an out-of-domain bilingual corpus and the possibility of using the language model .
Unsupervised Learning of Morphology for English and Inuktitut We describe a simple unsupervised technique for learning morphology by identifying hubs in an automaton . For our purposes, a hub is a node in a graph with in-degree greater than one and out-degree greater than one. We create a word-trie , transform it into a minimal DFA , then identify hubs . Those hubs mark the boundary between root and suffix , achieving similar performance to more complex mixtures of techniques.
Word Alignment with Cohesion Constraint We present a syntax-based constraint for word alignment , known as the cohesion constraint . It requires disjoint English phrases to be mapped to non-overlapping intervals in the French sentence . We evaluate the utility of this constraint in two different algorithms. The results show that it can provide a significant improvement in alignment quality .
Bootstrapping for Named Entity Tagging Using Concept-based Seeds A novel bootstrapping approach to Named Entity (NE) tagging using concept-based seeds and successive learners is presented. This approach only requires a few common noun or pronoun seeds that correspond to the concept for the targeted NE , e.g. he/she/man/woman for PERSON NE . The bootstrapping procedure is implemented as training two successive learners . First, decision list is used to learn the parsing-based NE rules . Then, a Hidden Markov Model is trained on a corpus automatically tagged by the first learner . The resulting NE system approaches supervised NE performance for some NE types .
A Phrase-Based Unigram Model for Statistical Machine Translation In this paper, we describe a phrase-based unigram model for statistical machine translation that uses a much simpler set of model parameters than similar phrase-based models . The units of translation are blocks - pairs of phrases . During decoding , we use a block unigram model and a word-based trigram language model . During training , the blocks are learned from source interval projections using an underlying word alignment . We show experimental results on block selection criteria based on unigram counts and phrase length.
Cooperative Model Based Language Understanding in Dialogue In this paper, we propose a novel Cooperative Model for natural language understanding in a dialogue system . We build this based on both Finite State Model (FSM) and Statistical Learning Model (SLM) . FSM provides two strategies for language understanding and have a high accuracy but little robustness and flexibility. Statistical approach is much more robust but less accurate. Cooperative Model incorporates all the three strategies together and thus can suppress all the shortcomings of different strategies and has all the advantages of the three strategies.
JAVELIN: A Flexible, Planner-Based Architecture for Question Answering The JAVELIN system integrates a flexible, planning-based architecture with a variety of language processing modules to provide an open-domain question answering capability on free text . The demonstration will focus on how JAVELIN processes questions and retrieves the most likely answer candidates from the given text corpus . The operation of the system will be explained in depth through browsing the repository of data objects created by the system during each question answering session .
Using Predicate-Argument Structures for Information Extraction In this paper we present a novel, customizable : IE paradigm that takes advantage of predicate-argument structures . We also introduce a new way of automatically identifying predicate argument structures , which is central to our IE paradigm . It is based on: (1) an extended set of features ; and (2) inductive decision tree learning . The experimental results prove our claim that accurate predicate-argument structures enable high quality IE results.
Hierarchical Directed Acyclic Graph Kernel: Methods for Structured Natural Language Data This paper proposes the Hierarchical Directed Acyclic Graph (HDAG) Kernel for structured natural language data . The HDAG Kernel directly accepts several levels of both chunks and their relations , and then efficiently computes the weighed sum of the number of common attribute sequences of the HDAGs . We applied the proposed method to question classification and sentence alignment tasks to evaluate its performance as a similarity measure and a kernel function . The results of the experiments demonstrate that the HDAG Kernel is superior to other kernel functions and baseline methods .
Clustering Polysemic Subcategorization Frame Distributions Semantically Previous research has demonstrated the utility of clustering in inducing semantic verb classes from undisambiguated corpus data . We describe a new approach which involves clustering subcategorization frame (SCF) distributions using the Information Bottleneck and nearest neighbour methods. In contrast to previous work, we particularly focus on clustering polysemic verbs . A novel evaluation scheme is proposed which accounts for the effect of polysemy on the clusters , offering us a good insight into the potential and limitations of semantically classifying undisambiguated SCF data .
A Machine Learning Approach to Pronoun Resolution in Spoken Dialogue We apply a decision tree based approach to pronoun resolution in spoken dialogue . Our system deals with pronouns with NP- and non-NP-antecedents . We present a set of features designed for pronoun resolution in spoken dialogue and determine the most promising features . We evaluate the system on twenty Switchboard dialogues and show that it compares well to Byron's (2002) manually tuned system .
Optimizing Story Link Detection is not Equivalent to Optimizing New Event Detection Link detection has been regarded as a core technology for the Topic Detection and Tracking tasks of new event detection . In this paper we formulate story link detection and new event detection as information retrieval task and hypothesize on the impact of precision and recall on both systems. Motivated by these arguments, we introduce a number of new performance enhancing techniques including part of speech tagging , new similarity measures and expanded stop lists . Experimental results validate our hypothesis.
Corpus-based Discourse Understanding in Spoken Dialogue Systems This paper concerns the discourse understanding process in spoken dialogue systems . This process enables the system to understand user utterances based on the context of a dialogue . Since multiple candidates for the understanding result can be obtained for a user utterance due to the ambiguity of speech understanding , it is not appropriate to decide on a single understanding result after each user utterance . By holding multiple candidates for understanding results and resolving the ambiguity as the dialogue progresses, the discourse understanding accuracy can be improved. This paper proposes a method for resolving this ambiguity based on statistical information obtained from dialogue corpora . Unlike conventional methods that use hand-crafted rules , the proposed method enables easy design of the discourse understanding process . Experiment results have shown that a system that exploits the proposed method performs sufficiently and that holding multiple candidates for understanding results is effective.
Flexible Guidance Generation using User Model in Spoken Dialogue Systems We address appropriate user modeling in order to generate cooperative responses to each user in spoken dialogue systems . Unlike previous studies that focus on user 's knowledge or typical kinds of users , the user model we propose is more comprehensive. Specifically, we set up three dimensions of user models : skill level to the system, knowledge level on the target domain and the degree of hastiness . Moreover, the models are automatically derived by decision tree learning using real dialogue data collected by the system. We obtained reasonable classification accuracy for all dimensions. Dialogue strategies based on the user modeling are implemented in Kyoto city bus information system that has been developed at our laboratory. Experimental evaluation shows that the cooperative responses adaptive to individual users serve as good guidance for novice users without increasing the dialogue duration for skilled users .
Unsupervised Learning of Arabic Stemming using a Parallel Corpus This paper presents an unsupervised learning approach to building a non-English (Arabic) stemmer . The stemming model is based on statistical machine translation and it uses an English stemmer and a small (10K sentences) parallel corpus as its sole training resources . No parallel text is needed after the training phase . Monolingual, unannotated text can be used to further improve the stemmer by allowing it to adapt to a desired domain or genre . Examples and results will be given for Arabic , but the approach is applicable to any language that needs affix removal . Our resource-frugal approach results in 87.5% agreement with a state of the art, proprietary Arabic stemmer built using rules , affix lists , and human annotated text , in addition to an unsupervised component . Task-based evaluation using Arabic information retrieval indicates an improvement of 22-38% in average precision over unstemmed text , and 96% of the performance of the proprietary stemmer above.
Language Model Based Arabic Word Segmentation We approximate Arabic's rich morphology by a model that a word consists of a sequence of morphemes in the pattern prefix*-stem-suffix* (* denotes zero or more occurrences of a morpheme ). Our method is seeded by a small manually segmented Arabic corpus and uses it to bootstrap an unsupervised algorithm to build the Arabic word segmenter from a large unsegmented Arabic corpus . The algorithm uses a trigram language model to determine the most probable morpheme sequence for a given input . The language model is initially estimated from a small manually segmented corpus of about 110,000 words . To improve the segmentation accuracy , we use an unsupervised algorithm for automatically acquiring new stems from a 155 million word unsegmented corpus , and re-estimate the model parameters with the expanded vocabulary and training corpus . The resulting Arabic word segmentation system achieves around 97% exact match accuracy on a test corpus containing 28,449 word tokens . We believe this is a state-of-the-art performance and the algorithm can be used for many highly inflected languages provided that one can create a small manually segmented corpus of the language of interest.
Exploiting Parallel Texts for Word Sense Disambiguation: An Empirical Study
A central problem of word sense disambiguation (WSD) is the lack of manually sense-tagged data required for supervised learning . In this paper, we evaluate an approach to automatically acquire sense-tagged training data from English-Chinese parallel corpora , which are then used for disambiguating the nouns in the SENSEVAL-2 English lexical sample task . Our investigation reveals that this method of acquiring sense-tagged data is promising. On a subset of the most difficult SENSEVAL-2 nouns , the accuracy difference between the two approaches is only 14.0%, and the difference could narrow further to 6.5% if we disregard the advantage that manually sense-tagged data have in their sense coverage . Our analysis also highlights the importance of the issue of domain dependence in evaluating WSD programs .
Towards a Resource for Lexical Semantics: A Large German Corpus with Extensive Semantic Annotation We describe the ongoing construction of a large, semantically annotated corpus resource as reliable basis for the large-scale acquisition of word-semantic information , e.g. the construction of domain-independent lexica . The backbone of the annotation are semantic roles in the frame semantics paradigm . We report experiences and evaluate the annotated data from the first project stage. On this basis, we discuss the problems of vagueness and ambiguity in semantic annotation .
Towards a Model of Face-to-Face Grounding We investigate the verbal and nonverbal means for grounding , and propose a design for embodied conversational agents that relies on both kinds of signals to establish common ground in human-computer interaction . We analyzed eye gaze , head nods and attentional focus in the context of a direction-giving task . The distribution of nonverbal behaviors differed depending on the type of dialogue move being grounded, and the overall pattern reflected a monitoring of lack of negative feedback . Based on these results, we present an ECA that uses verbal and nonverbal grounding acts to update dialogue state .
Comparison between CFG filtering techniques for LTAG and HPSG An empirical comparison of CFG filtering techniques for LTAG and HPSG is presented. We demonstrate that an approximation of HPSG produces a more effective CFG filter than that of LTAG . We also investigate the reason for that difference.
Lower and higher estimates of the number of "true analogies" between sentences contained in a large multilingual corpus The reality of analogies between words is refuted by noone (e.g., I walked is to to walk as I laughed is to to laugh, noted I walked : to walk :: I laughed : to laugh). But computational linguists seem to be quite dubious about analogies between sentences : they would not be enough numerous to be of any use. We report experiments conducted on a multilingual corpus to estimate the number of analogies among the sentences that it contains. We give two estimates, a lower one and a higher one. As an analogy must be valid on the level of form as well as on the level of meaning , we relied on the idea that translation should preserve meaning to test for similar meanings .
Evaluating Multiple Aspects of Coherence in Student Essays CriterionSM Online Essay Evaluation Service includes a capability that labels sentences in student writing with essay-based discourse elements (e.g., thesis statements ). We describe a new system that enhances Criterion 's capability, by evaluating multiple aspects of coherence in essays . This system identifies features of sentences based on semantic similarity measures and discourse structure . A support vector machine uses these features to capture breakdowns in coherence due to relatedness to the essay question and relatedness between discourse elements . Intra-sentential quality is evaluated with rule-based heuristics . Results indicate that the system yields higher performance than a baseline on all three aspects.
Improving Multilingual Summarization: Using Redundancy in the Input to Correct MT errors In this paper, we use the information redundancy in multilingual input to correct errors in machine translation and thus improve the quality of multilingual summaries . We consider the case of multi-document summarization , where the input documents are in Arabic , and the output summary is in English . Typically, information that makes it to a summary appears in many different lexical-syntactic forms in the input documents . Further, the use of multiple machine translation systems provides yet more redundancy , yielding different ways to realize that information in English . We demonstrate how errors in the machine translations of the input Arabic documents can be corrected by identifying and generating from such redundancy , focusing on noun phrases .
A Maximum Entropy Word Aligner for Arabic-English Machine Translation This paper presents a maximum entropy word alignment algorithm for Arabic-English based on supervised training data . We demonstrate that it is feasible to create training material for problems in machine translation and that a mixture of supervised and unsupervised methods yields superior performance . The probabilistic model used in the alignment directly models the link decisions . Significant improvement over traditional word alignment techniques is shown as well as improvement on several machine translation tests . Performance of the algorithm is contrasted with human annotation performance .
Translating with non-contiguous phrases This paper presents a phrase-based statistical machine translation method , based on non-contiguous phrases , i.e. phrases with gaps. A method for producing such phrases from a word-aligned corpora is proposed. A statistical translation model is also presented that deals such phrases , as well as a training method based on the maximization of translation accuracy , as measured with the NIST evaluation metric . Translations are produced by means of a beam-search decoder . Experimental results are presented, that demonstrate how the proposed method allows to better generalize from the training data .
Automatically Evaluating Answers to Definition Questions Following recent developments in the automatic evaluation of machine translation and document summarization , we present a similar approach, implemented in a measure called POURPRE , for automatically evaluating answers to definition questions . Until now, the only way to assess the correctness of answers to such questions involves manual determination of whether an information nugget appears in a system's response. The lack of automatic methods for scoring system output is an impediment to progress in the field, which we address with this work. Experiments with the TREC 2003 and TREC 2004 QA tracks indicate that rankings produced by our metric correlate highly with official rankings , and that POURPRE outperforms direct application of existing metrics.
Pattern Visualization for Machine Translation Output We describe a method for identifying systematic patterns in translation data using part-of-speech tag sequences . We incorporate this analysis into a diagnostic tool intended for developers of machine translation systems , and demonstrate how our application can be used by developers to explore patterns in machine translation output .
Evaluating the Word Sense Disambiguation Performance of Statistical Machine Translation We present the first known empirical test of an increasingly common speculative claim, by evaluating a representative Chinese-to-English SMT model directly on word sense disambiguation performance , using standard WSD evaluation methodology and datasets from the Senseval-3 Chinese lexical sample task . Much effort has been put in designing and evaluating dedicated word sense disambiguation (WSD) models , in particular with the Senseval series of workshops. At the same time, the recent improvements in the BLEU scores of statistical machine translation (SMT) suggests that SMT models are good at predicting the right translation of the words in source language sentences . Surprisingly however, the WSD accuracy of SMT models has never been evaluated and compared with that of the dedicated WSD models . We present controlled experiments showing the WSD accuracy of current typical SMT models to be significantly lower than that of all the dedicated WSD models considered. This tends to support the view that despite recent speculative claims to the contrary, current SMT models do have limitations in comparison with dedicated WSD models , and that SMT should benefit from the better predictions made by the WSD models .
Statistical Machine Translation Part I: Hands-On Introduction Statistical machine translation (SMT) is currently one of the hot spots in natural language processing . Over the last few years dramatic improvements have been made, and a number of comparative evaluations have shown, that SMT gives competitive results to rule-based translation systems , requiring significantly less development time. This is particularly important when building translation systems for new language pairs or new domains . This workshop is intended to give an introduction to statistical machine translation with a focus on practical considerations. Participants should be able, after attending this workshop, to set out building an SMT system themselves and achieving good baseline results in a short time. The tutorial will cover the basics of SMT : Theory will be put into practice. STTK , a statistical machine translation tool kit , will be introduced and used to build a working translation system . STTK has been developed by the presenter and co-workers over a number of years and is currently used as the basis of CMU's SMT system . It has also successfully been coupled with rule-based and example based machine translation modules to build a multi engine machine translation system . The source code of the tool kit will be made available.
Harvesting the Bitexts of the Laws of Hong Kong From the Web In this paper we present our recent work on harvesting English-Chinese bitexts of the laws of Hong Kong from the Web and aligning them to the subparagraph level via utilizing the numbering system in the legal text hierarchy . Basic methodology and practical techniques are reported in detail. The resultant bilingual corpus , 10.4M English words and 18.3M Chinese characters , is an authoritative and comprehensive text collection covering the specific and special domain of HK laws. It is particularly valuable to empirical MT research . This piece of work has also laid a foundation for exploring and harvesting English-Chinese bitexts in a larger volume from the Web .
Using Machine Translation Evaluation Techniques to Determine Sentence-level Semantic Equivalence The task of machine translation (MT) evaluation is closely related to the task of sentence-level semantic equivalence classification . This paper investigates the utility of applying standard MT evaluation methods (BLEU, NIST, WER and PER) to building classifiers to predict semantic equivalence and entailment . We also introduce a novel classification method based on PER which leverages part of speech information of the words contributing to the word matches and non-matches in the sentence . Our results show that MT evaluation techniques are able to produce useful features for paraphrase classification and to a lesser extent entailment . Our technique gives a substantial improvement in paraphrase classification accuracy over all of the other models used in the experiments.
Automatic generation of paraphrases to be used as translation references in objective evaluation measures of machine translation We propose a method that automatically generates paraphrase sets from seed sentences to be used as reference sets in objective machine translation evaluation measures like BLEU and NIST . We measured the quality of the paraphrases produced in an experiment, i.e., (i) their grammaticality : at least 99% correct sentences ; (ii) their equivalence in meaning : at least 96% correct paraphrases either by meaning equivalence or entailment ; and, (iii) the amount of internal lexical and syntactical variation in a set of paraphrases : slightly superior to that of hand-produced sets . The paraphrase sets produced by this method thus seem adequate as reference sets to be used for MT evaluation .
Annotating Honorifics Denoting Social Ranking of Referents This paper proposes an annotating scheme that encodes honorifics (respectful words). Honorifics are used extensively in Japanese , reflecting the social relationship (e.g. social ranks and age) of the referents . This referential information is vital for resolving zero pronouns and improving machine translation outputs . Annotating honorifics is a complex task that involves identifying a predicate with honorifics , assigning ranks to referents of the predicate , calibrating the ranks , and connecting referents with their predicates .
Discriminative Reranking for Natural Language Parsing This article considers approaches which rerank the output of an existing probabilistic parser . The base parser produces a set of candidate parses for each input sentence , with associated probabilities that define an initial ranking of these parses . A second model then attempts to improve upon this initial ranking , using additional features of the tree as evidence. The strength of our approach is that it allows a tree to be represented as an arbitrary set of features , without concerns about how these features interact or overlap and without the need to define a derivation or a generative model which takes these features into account . We introduce a new method for the reranking task , based on the boosting approach to ranking problems described in Freund et al. (1998). We apply the boosting method to parsing the Wall Street Journal treebank . The method combined the log-likelihood under a baseline model (that of Collins [1999]) with evidence from an additional 500,000 features over parse trees that were not included in the original model . The new model achieved 89.75% F-measure , a 13% relative decrease in F-measure error over the baseline model's score of 88.2%. The article also introduces a new algorithm for the boosting approach which takes advantage of the sparsity of the feature space in the parsing data . Experiments show significant efficiency gains for the new algorithm over the obvious implementation of the boosting approach . We argue that the method is an appealing alternative - in terms of both simplicity and efficiency - to work on feature selection methods within log-linear (maximum-entropy) models . Although the experiments in this article are on natural language parsing (NLP) , the approach should be applicable to many other NLP problems which are naturally framed as ranking tasks , for example, speech recognition , machine translation , or natural language generation .
Improving Machine Translation Performance by Exploiting Non-Parallel Corpora We present a novel method for discovering parallel sentences in comparable, non-parallel corpora . We train a maximum entropy classifier that, given a pair of sentences , can reliably determine whether or not they are translations of each other. Using this approach, we extract parallel data from large Chinese, Arabic, and English non-parallel newspaper corpora . We evaluate the quality of the extracted data by showing that it improves the performance of a state-of-the-art statistical machine translation system . We also show that a good-quality MT system can be built from scratch by starting with a very small parallel corpus (100,000 words ) and exploiting a large non-parallel corpus . Thus, our method can be applied with great benefit to language pairs for which only scarce resources are available.
Scaling Phrase-Based Statistical Machine Translation to Larger Corpora and Longer Phrases In this paper we describe a novel data structure for phrase-based statistical machine translation which allows for the retrieval of arbitrarily long phrases while simultaneously using less memory than is required by current decoder implementations. We detail the computational complexity and average retrieval times for looking up phrase translations in our suffix array-based data structure . We show how sampling can be used to reduce the retrieval time by orders of magnitude with no loss in translation quality .
Dependency Treelet Translation: Syntactically Informed Phrasal SMT We describe a novel approach to statistical machine translation that combines syntactic information in the source language with recent advances in phrasal translation . This method requires a source-language dependency parser , target language word segmentation and an unsupervised word alignment component . We align a parallel corpus , project the source dependency parse onto the target sentence , extract dependency treelet translation pairs , and train a tree-based ordering model . We describe an efficient decoder and show that using these tree-based models in combination with conventional SMT models provides a promising approach that incorporates the power of phrasal SMT with the linguistic generality available in a parser .
Word Sense Disambiguation vs. Statistical Machine Translation We directly investigate a subject of much recent debate: do word sense disambigation models help statistical machine translation quality ? We present empirical results casting doubt on this common, but unproved, assumption. Using a state-of-the-art Chinese word sense disambiguation model to choose translation candidates for a typical IBM statistical MT system , we find that word sense disambiguation does not yield significantly better translation quality than the statistical machine translation system alone. Error analysis suggests several key factors behind this surprising finding, including inherent limitations of current statistical MT architectures .
Machine Translation Using Probabilistic Synchronous Dependency Insertion Grammars Syntax-based statistical machine translation (MT) aims at applying statistical models to structured data . In this paper, we present a syntax-based statistical machine translation system based on a probabilistic synchronous dependency insertion grammar . Synchronous dependency insertion grammars are a version of synchronous grammars defined on dependency trees . We first introduce our approach to inducing such a grammar from parallel corpora . Second, we describe the graphical model for the machine translation task , which can also be viewed as a stochastic tree-to-tree transducer . We introduce a polynomial time decoding algorithm for the model . We evaluate the outputs of our MT system using the NIST and Bleu automatic MT evaluation software . The result shows that our system outperforms the baseline system based on the IBM models in both translation speed and quality .
A Localized Prediction Model for Statistical Machine Translation In this paper, we present a novel training method for a localized phrase-based prediction model for statistical machine translation (SMT) . The model predicts blocks with orientation to handle local phrase re-ordering . We use a maximum likelihood criterion to train a log-linear block bigram model which uses real-valued features (e.g. a language model score ) as well as binary features based on the block identities themselves, e.g. block bigram features. Our training algorithm can easily handle millions of features . The best system obtains a 18.6% improvement over the baseline on a standard Arabic-English translation task .
Paraphrasing with Bilingual Parallel Corpora Previous work has used monolingual parallel corpora to extract and generate paraphrases . We show that this task can be done using bilingual parallel corpora , a much more commonly available resource . Using alignment techniques from phrase-based statistical machine translation , we show how paraphrases in one language can be identified using a phrase in another language as a pivot. We define a paraphrase probability that allows paraphrases extracted from a bilingual parallel corpus to be ranked using translation probabilities , and show how it can be refined to take contextual information into account. We evaluate our paraphrase extraction and ranking methods using a set of manual word alignments , and contrast the quality with paraphrases extracted from automatic alignments .
Dependency-Based Statistical Machine Translation We present a Czech-English statistical machine translation system which performs tree-to-tree translation of dependency structures . The only bilingual resource required is a sentence-aligned parallel corpus . All other resources are monolingual . We also refer to an evaluation method and plan to compare our system's output with a benchmark system .
Word Sense Induction: Triplet-Based Clustering and Automatic Evaluation In this paper a novel solution to automatic and unsupervised word sense induction (WSI) is introduced. It represents an instantiation of the one sense per collocation observation (Gale et al., 1992). Like most existing approaches it utilizes clustering of word co-occurrences . This approach differs from other approaches to WSI in that it enhances the effect of the one sense per collocation observation by using triplets of words instead of pairs. The combination with a two-step clustering process using sentence co-occurrences as features allows for accurate results. Additionally, a novel and likewise automatic and unsupervised evaluation method inspired by Schutze's (1992) idea of evaluation of word sense disambiguation algorithms is employed. Offering advantages like reproducability and independency of a given biased gold standard it also enables automatic parameter optimization of the WSI algorithm .
Addressee Identification in Face-to-Face Meetings We present results on addressee identification in four-participants face-to-face meetings using Bayesian Network and Naive Bayes classifiers . First, we investigate how well the addressee of a dialogue act can be predicted based on gaze , utterance and conversational context features . Then, we explore whether information about meeting context can aid classifiers ' performances . Both classifiers perform the best when conversational context and utterance features are combined with speaker's gaze information . The classifiers show little gain from information about meeting context .
CDER: Efficient MT Evaluation Using Block Movements Most state-of-the-art evaluation measures for machine translation assign high costs to movements of word blocks. In many cases though such movements still result in correct or almost correct sentences . In this paper, we will present a new evaluation measure which explicitly models block reordering as an edit operation . Our measure can be exactly calculated in quadratic time . Furthermore, we will show how some evaluation measures can be improved by the introduction of word-dependent substitution costs . The correlation of the new measure with human judgment has been investigated systematically on two different language pairs . The experimental results will show that it significantly outperforms state-of-the-art approaches in sentence-level correlation . Results from experiments with word dependent substitution costs will demonstrate an additional increase of correlation between automatic evaluation measures and human judgment .
Automatic Segmentation of Multiparty Dialogue In this paper, we investigate the problem of automatically predicting segment boundaries in spoken multiparty dialogue . We extend prior work in two ways. We first apply approaches that have been proposed for predicting top-level topic shifts to the problem of identifying subtopic boundaries . We then explore the impact on performance of using ASR output as opposed to human transcription . Examination of the effect of features shows that predicting top-level and predicting subtopic boundaries are two distinct tasks: (1) for predicting subtopic boundaries , the lexical cohesion-based approach alone can achieve competitive results, (2) for predicting top-level boundaries , the machine learning approach that combines lexical-cohesion and conversational features performs best, and (3) conversational cues , such as cue phrases and overlapping speech , are better indicators for the top-level prediction task. We also find that the transcription errors inevitable in ASR output have a negative impact on models that combine lexical-cohesion and conversational features , but do not change the general preference of approach for the two tasks.
Ensemble Methods for Unsupervised WSD Combination methods are an effective way of improving system performance . This paper examines the benefits of system combination for unsupervised WSD . We investigate several voting- and arbiter-based combination strategies over a diverse pool of unsupervised WSD systems . Our combination methods rely on predominant senses which are derived automatically from raw text . Experiments using the SemCor and Senseval-3 data sets demonstrate that our ensembles yield significantly better results when compared with state-of-the-art.
An Improved Redundancy Elimination Algorithm for Underspecified Representations We present an efficient algorithm for the redundancy elimination problem : Given an underspecified semantic representation (USR) of a scope ambiguity , compute an USR with fewer mutually equivalent readings . The algorithm operates on underspecified chart representations which are derived from dominance graphs ; it can be applied to the USRs computed by large-scale grammars . We evaluate the algorithm on a corpus , and show that it reduces the degree of ambiguity significantly while taking negligible runtime.
Using Machine Learning Techniques to Build a Comma Checker for Basque In this paper, we describe the research using machine learning techniques to build a comma checker to be integrated in a grammar checker for Basque . After several experiments, and trained with a little corpus of 100,000 words , the system guesses correctly not placing commas with a precision of 96% and a recall of 98%. It also gets a precision of 70% and a recall of 49% in the task of placing commas . Finally, we have shown that these results can be improved using a bigger and a more homogeneous corpus to train, that is, a bigger corpus written by one unique author .
Unsupervised Relation Disambiguation Using Spectral Clustering This paper presents an unsupervised learning approach to disambiguate various relations between named entities by use of various lexical and syntactic features from the contexts . It works by calculating eigenvectors of an adjacency graph 's Laplacian to recover a submanifold of data from a high dimensionality space and then performing cluster number estimation on the eigenvectors . Experiment results on ACE corpora show that this spectral clustering based approach outperforms the other clustering methods .
Automatic Construction of Polarity-tagged Corpus from HTML Documents This paper proposes a novel method of building polarity-tagged corpus from HTML documents . The characteristics of this method is that it is fully automatic and can be applied to arbitrary HTML documents . The idea behind our method is to utilize certain layout structures and linguistic pattern . By using them, we can automatically extract such sentences that express opinion. In our experiment, the method could construct a corpus consisting of 126,610 sentences .
Intelligent Access to Text: Integrating Information Extraction Technology into Text Browsers
In this paper we show how two standard outputs from information extraction (IE) systems - named entity annotations and scenario templates - can be used to enhance access to text collections via a standard text browser . We describe how this information is used in a prototype system designed to support information workers ' access to a pharmaceutical news archive as part of their industry watch function. We also report results of a preliminary, qualitative user evaluation of the system, which while broadly positive indicates further work needs to be done on the interface to make users aware of the increased potential of IE-enhanced text browsers .
Natural Language Generation in Dialog Systems
Recent advances in Automatic Speech Recognition technology have put the goal of naturally sounding dialog systems within reach. However, the improved speech recognition has brought to light a new problem: as dialog systems understand more of what the user tells them, they need to be more sophisticated at responding to the user . The issue of system response to users has been extensively studied by the natural language generation community , though rarely in the context of dialog systems . We show how research in generation can be adapted to dialog systems , and how the high cost of hand-crafting knowledge-based generation systems can be overcome by employing machine learning techniques .
A Three-Tiered Evaluation Approach for Interactive Spoken Dialogue Systems
We describe a three-tiered approach for evaluation of spoken dialogue systems . The three tiers measure user satisfaction , system support of mission success and component performance . We describe our use of this approach in numerous fielded user studies conducted with the U.S. military.
TAP-XL: An Automated Analyst's Assistant
The TAP-XL Automated Analyst's Assistant is an application designed to help an English -speaking analyst write a topical report , culling information from a large inflow of multilingual, multimedia data . It gives users the ability to spend their time finding more data relevant to their task, and gives them translingual reach into other languages by leveraging human language technology .
Some Computational Complexity Results for Synchronous Context-Free Grammars
This paper investigates some computational problems associated with probabilistic translation models that have recently been adopted in the literature on machine translation . These models can be viewed as pairs of probabilistic context-free grammars working in a 'synchronous' way. Two hardness results for the class NP are reported, along with an exponential time lower-bound for certain classes of algorithms that are currently used in the literature.
BLEU in characters: towards automatic MT evaluation in languages without word delimiters
Automatic evaluation metrics for Machine Translation (MT) systems , such as BLEU or NIST , are now well established. Yet, they are scarcely used for the assessment of language pairs like English-Chinese or English-Japanese , because of the word segmentation problem . This study establishes the equivalence between the standard use of BLEU in word n-grams and its application at the character level. The use of BLEU at the character level eliminates the word segmentation problem : it makes it possible to directly compare commercial systems outputting unsegmented texts with, for instance, statistical MT systems which usually segment their outputs .
Interactively Exploring a Machine Translation Model
This paper describes a method of interactively visualizing and directing the process of translating a sentence . The method allows a user to explore a model of syntax-based statistical machine translation (MT) , to understand the model 's strengths and weaknesses, and to compare it to other MT systems . Using this visualization method , we can find and address conceptual and practical problems in an MT system . In our demonstration at ACL , new users of our tool will drive a syntax-based decoder for themselves.
Computational Complexity of Statistical Machine Translation
In this paper we study a set of problems that are of considerable importance to Statistical Machine Translation (SMT) but which have not been addressed satisfactorily by the SMT research community . Over the last decade, a variety of SMT algorithms have been built and empirically tested whereas little is known about the computational complexity of some of the fundamental problems of SMT . Our work aims at providing useful insights into the the computational complexity of those problems. We prove that while IBM Models 1-2 are conceptually and computationally simple, computations involving the higher (and more useful) models are hard . Since it is unlikely that there exists a polynomial time solution for any of these hard problems (unless P = NP and P#P = P ), our results highlight and justify the need for developing polynomial time approximations for these computations. We also discuss some practical ways of dealing with complexity .
Structuring Knowledge for Reference Generation: A Clustering Algorithm
This paper discusses two problems that arise in the Generation of Referring Expressions : (a) numeric-valued attributes , such as size or location; (b) perspective-taking in reference . Both problems, it is argued, can be resolved if some structure is imposed on the available knowledge prior to content determination . We describe a clustering algorithm which is sufficiently general to be applied to these diverse problems, discuss its application, and evaluate its performance.
Answering the Question You Wish They Had Asked: The Impact of Paraphrasing for Question Answering
State-of-the-art Question Answering (QA) systems are very sensitive to variations in the phrasing of an information need . Finding the preferred language for such a need is a valuable task. We investigate that claim by adopting a simple MT-based paraphrasing technique and evaluating QA system performance on paraphrased questions . We found a potential increase of 35% in MRR with respect to the original question .
A Comparison of Tagging Strategies for Statistical Information Extraction
There are several approaches that model information extraction as a token classification task , using various tagging strategies to combine multiple tokens . We describe the tagging strategies that can be found in the literature and evaluate their relative performances. We also introduce a new strategy, called Begin/After tagging or BIA , and show that it is competitive to the best other strategies.
InfoMagnets: Making Sense of Corpus Data
We introduce a new interactive corpus exploration tool called InfoMagnets . InfoMagnets aims at making exploratory corpus analysis accessible to researchers who are not experts in text mining . As evidence of its usefulness and usability, it has been used successfully in a research context to uncover relationships between language and behavioral patterns in two distinct domains: tutorial dialogue (Kumar et al., submitted) and on-line communities (Arguello et al., 2006). As an educational tool , it has been used as part of a unit on protocol analysis in an Educational Research Methods course .
Polarized Unification Grammars
This paper proposes a generic mathematical formalism for the combination of various structures : strings , trees , dags , graphs , and products of them. The polarization of the objects of the elementary structures controls the saturation of the final structure . This formalism is both elementary and powerful enough to strongly simulate many grammar formalisms , such as rewriting systems , dependency grammars , TAG , HPSG and LFG .
Word Vectors and Two Kinds of Similarity
This paper examines what kind of similarity between words can be represented by what kind of word vectors in the vector space model . Through two experiments, three methods for constructing word vectors , i.e., LSA-based, cooccurrence-based and dictionary-based methods , were compared in terms of the ability to represent two kinds of similarity , i.e., taxonomic similarity and associative similarity . The result of the comparison was that the dictionary-based word vectors better reflect taxonomic similarity , while the LSA-based and the cooccurrence-based word vectors better reflect associative similarity .
Investigations on Event-Based Summarization
We investigate independent and relevant event-based extractive mutli-document summarization approaches . In this paper, events are defined as event terms and associated event elements . With independent approach, we identify important contents by frequency of events . With relevant approach, we identify important contents by PageRank algorithm on the event map constructed from documents . Experimental results are encouraging.
FERRET: Interactive Question-Answering for Real-World Environments
This paper describes FERRET , an interactive question-answering (Q/A) system designed to address the challenges of integrating automatic Q/A applications into real-world environments. FERRET utilizes a novel approach to Q/A known as predictive questioning which attempts to identify the questions (and answers ) that users need by analyzing how a user interacts with a system while gathering information related to a particular scenario.
Computational Analysis of Move Structures in Academic Abstracts
This paper introduces a method for computational analysis of move structures in abstracts of research articles . In our approach, sentences in a given abstract are analyzed and labeled with a specific move in light of various rhetorical functions . The method involves automatically gathering a large number of abstracts from the Web and building a language model of abstract moves . We also present a prototype concordancer , CARE , which exploits the move-tagged abstracts for digital learning . This system provides a promising approach to Web-based computer-assisted academic writing .
Re-Usable Tools for Precision Machine Translation
The LOGON MT demonstrator assembles independently valuable general-purpose NLP components into a machine translation pipeline that capitalizes on output quality . The demonstrator embodies an interesting combination of hand-built, symbolic resources and stochastic processes .
Testing The Psychological Reality of a Representational Model
A research program is described in which a particular representational format for meaning is tested as broadly as possible. In this format, developed by the LNR research group at The University of California at San Diego, verbs are represented as interconnected sets of subpredicates . These subpredicates may be thought of as the almost inevitable inferences that a listener makes when a verb is used in a sentence . They confer a meaning structure on the sentence in which the verb is used.
Fragments of a Theory of Human Plausible Reasoning
The paper outlines a computational theory of human plausible reasoning constructed from analysis of people's answers to everyday questions. Like logic , the theory is expressed in a content-independent formalism . Unlike logic , the theory specifies how different information in memory affects the certainty of the conclusions drawn. The theory consists of a dimensionalized space of different inference types and their certainty conditions , including a variety of meta-inference types where the inference depends on the person's knowledge about his own knowledge. The protocols from people's answers to questions are analyzed in terms of the different inference types . The paper also discusses how memory is structured in multiple ways to support the different inference types , and how the information found in memory determines which inference types are triggered.
PATH-BASED AND NODE-BASED INFERENCE IN SEMANTIC NETWORKS
Two styles of performing inference in semantic networks are presented and compared. Path-based inference allows an arc or a path of arcs between two given nodes to be inferred from the existence of another specified path between the same two nodes . Path-based inference rules may be written using a binary relational calculus notation . Node-based inference allows a structure of nodes to be inferred from the existence of an instance of a pattern of node structures . Node-based inference rules can be constructed in a semantic network using a variant of a predicate calculus notation . Path-based inference is more efficient, while node-based inference is more general. A method is described of combining the two styles in a single system in order to take advantage of the strengths of each. Applications of path-based inference rules to the representation of the extensional equivalence of intensional concepts , and to the explication of inheritance in hierarchies are sketched.
ON FROFF: A TEXT PROCESSING SYSTEM FOR ENGLISH TEXTS AND FIGURES
In order to meet the needs of a publication of papers in English, many systems to run off texts have been developed. In this paper, we report a system FROFF which can make a fair copy of not only texts but also graphs and tables indispensable to our papers. Its selection of fonts , specification of character size are dynamically changeable, and the typing location can be also changed in lateral or longitudinal directions. Each character has its own width and a line length is counted by the sum of each character . By using commands or rules which are defined to facilitate the construction of format expected or some mathematical expressions , elaborate and pretty documents can be successfully obtained.
ATNS USED AS A PROCEDURAL DIALOG MODEL
An attempt has been made to use an Augmented Transition Network as a procedural dialog model . The development of such a model appears to be important in several respects: as a device to represent and to use different dialog schemata proposed in empirical conversation analysis ; as a device to represent and to use models of verbal interaction ; as a device combining knowledge about dialog schemata and about verbal interaction with knowledge about task-oriented and goal-directed dialogs . A standard ATN should be further developed in order to account for the verbal interactions of task-oriented dialogs .
Metaphor - A Key to Extensible Semantic Analysis
Interpreting metaphors is an integral and inescapable process in human understanding of natural language . This paper discusses a method of analyzing metaphors based on the existence of a small number of generalized metaphor mappings . Each generalized metaphor contains a recognition network , a basic mapping , additional transfer mappings , and an implicit intention component . It is argued that the method reduces metaphor interpretation from a reconstruction to a recognition task . Implications towards automating certain aspects of language learning are also discussed.
Expanding the Horizons of Natural Language Interfaces
Current natural language interfaces have concentrated largely on determining the literal meaning of input from their users . While such decoding is an essential underpinning, much recent work suggests that natural language interfaces will never appear cooperative or graceful unless they also incorporate numerous non-literal aspects of communication , such as robust communication procedures . This paper defends that view, but claims that direct imitation of human performance is not the best way to implement many of these non-literal aspects of communication ; that the new technology of powerful personal computers with integral graphics displays offers techniques superior to those of humans for these aspects, while still satisfying human communication needs . The paper proposes interfaces based on a judicious mixture of these techniques and the still valuable methods of more traditional natural language interfaces.
Flexiable Parsing
When people use natural language in natural settings, they often use it ungrammatically, missing out or repeating words, breaking-off and restarting, speaking in fragments, etc.. Their human listeners are usually able to cope with these deviations with little difficulty. If a computer system wishes to accept natural language input from its users on a routine basis, it must display a similar indifference. In this paper, we outline a set of parsing flexibilities that such a system should provide. We go, on to describe FlexP , a bottom-up pattern-matching parser that we have designed and implemented to provide these flexibilities for restricted natural language input to a limited-domain computer system.
AN IMPROVED LEFT-CORNER PARSING ALGORITHM
This paper proposes a series of modifications to the left corner parsing algorithm for context-free grammars . It is argued that the resulting algorithm is both efficient and flexible and is, therefore, a good choice for the parser used in a natural language interface .
An Efficient Easily Adaptable System for Interpreting Natural Language Queries
This paper gives an overall account of a prototype natural language question answering system , called Chat-80 . Chat-80 has been designed to be both efficient and easily adaptable to a variety of applications. The system is implemented entirely in Prolog , a programming language based on logic . With the aid of a logic-based grammar formalism called extraposition grammars , Chat-80 translates English questions into the Prolog subset of logic . The resulting logical expression is then transformed by a planning algorithm into efficient Prolog , cf. query optimisation in a relational database . Finally, the Prolog form is executed to yield the answer.
Scruffy Text Understanding: Design and Implementation of 'Tolerant' Understanders
Most large text-understanding systems have been designed under the assumption that the input text will be in reasonably neat form, e.g., newspaper stories and other edited texts . However, a great deal of natural language texts e.g., memos , rough drafts , conversation transcripts etc., have features that differ significantly from neat texts , posing special problems for readers, such as misspelled words , missing words , poor syntactic construction , missing periods , etc. Our solution to these problems is to make use of expectations , based both on knowledge of surface English and on world knowledge of the situation being described. These syntactic and semantic expectations can be used to figure out unknown words from context , constrain the possible word-senses of words with multiple meanings ( ambiguity ), fill in missing words ( ellipsis ), and resolve referents ( anaphora ). This method of using expectations to aid the understanding of scruffy texts has been incorporated into a working computer program called NOMAD , which understands scruffy texts in the domain of Navy messages.
LIMITED DOMAIN SYSTEMS FOR LANGUAGE TEACHING
This abstract describes a natural language system which deals usefully with ungrammatical input and describes some actual and potential applications of it in computer aided second language learning . However, this is not the only area in which the principles of the system might be used, and the aim in building it was simply to demonstrate the workability of the general mechanism, and provide a framework for assessing developments of it.
A PROPER TREATMEMT OF SYNTAX AND SEMANTICS IN MACHINE TRANSLATION
A proper treatment of syntax and semantics in machine translation is introduced and discussed from the empirical viewpoint. For English-Japanese machine translation , the syntax directed approach is effective where the Heuristic Parsing Model (HPM) and the Syntactic Role System play important roles. For Japanese-English translation , the semantics directed approach is powerful where the Conceptual Dependency Diagram (CDD) and the Augmented Case Marker System (which is a kind of Semantic Role System ) play essential roles. Some examples of the difference between Japanese sentence structure and English sentence structure , which is vital to machine translation are also discussed together with various interesting ambiguities .
Entity-Oriented Parsing
An entity-oriented approach to restricted-domain parsing is proposed. In this approach, the definitions of the structure and surface representation of domain entities are grouped together. Like semantic grammar , this allows easy exploitation of limited domain semantics . In addition, it facilitates fragmentary recognition and the use of multiple parsing strategies , and so is particularly useful for robust recognition of extra-grammatical input . Several advantages from the point of view of language definition are also noted. Representative samples from an entity-oriented language definition are presented, along with a control structure for an entity-oriented parser , some parsing strategies that use the control structure , and worked examples of parses . A parser incorporating the control structure and the parsing strategies is currently under implementation .
A COMPUTATIONAL THEORY OF DISPOSITIONS
Informally, a disposition is a proposition which is preponderantly, but not necessarily always, true. For example, birds can fly is a disposition , as are the propositions Swedes are blond and Spaniards are dark. An idea which underlies the theory described in this paper is that a disposition may be viewed as a proposition with implicit fuzzy quantifiers which are approximations to all and always, e.g., almost all, almost always, most, frequently, etc. For example, birds can fly may be interpreted as the result of suppressing the fuzzy quantifier most in the proposition most birds can fly. Similarly, young men like young women may be read as most young men like mostly young women. The process of transforming a disposition into a proposition is referred to as explicitation or restoration . Explicitation sets the stage for representing the meaning of a proposition through the use of test-score semantics (Zadeh, 1978, 1982). In this approach to semantics , the meaning of a proposition , p, is represented as a procedure which tests, scores and aggregates the elastic constraints which are induced by p. The paper closes with a description of an approach to reasoning with dispositions which is based on the concept of a fuzzy syllogism . Syllogistic reasoning with dispositions has an important bearing on commonsense reasoning as well as on the management of uncertainty in expert systems . As a simple application of the techniques described in this paper, we formulate a definition of typicality -- a concept which plays an important role in human cognition and is of relevance to default reasoning .
Controlling Lexical Substitution in Computer Text Generation
This report describes Paul , a computer text generation system designed to create cohesive text through the use of lexical substitutions . Specifically, this system is designed to deterministically choose between pronominalization , superordinate substitution , and definite noun phrase reiteration . The system identifies a strength of antecedence recovery for each of the lexical substitutions , and matches them against the strength of potential antecedence of each element in the text to select the proper substitutions for these elements.
A LOGICAL FORMALISM FOR THE REPRESENTATION OF DETERMINERS
Determiners play an important role in conveying the meaning of an utterance , but they have often been disregarded, perhaps because it seemed more important to devise methods to grasp the global meaning of a sentence , even if not in a precise way. Another problem with determiners is their inherent ambiguity . In this paper we propose a logical formalism , which, among other things, is suitable for representing determiners without forcing a particular interpretation when their meaning is still not clear.
SYNTHESIZING WEATHER FORECASTS FROM FORMATFED DATA
This paper describes a system ( RAREAS ) which synthesizes marine weather forecasts directly from formatted weather data . Such synthesis appears feasible in certain natural sublanguages with stereotyped text structure . RAREAS draws on several kinds of linguistic and non-linguistic knowledge and mirrors a forecaster's apparent tendency to ascribe less precise temporal adverbs to more remote meteorological events. The approach can easily be adapted to synthesize bilingual or multi-lingual texts .
THE CORRECTION OF ILL-FORMED INPUT USING HISTORY-BASED EXPECTATION WITH APPLICATIONS TO SPEECH UNDERSTANDING
A method for error correction of ill-formed input is described that acquires dialogue patterns in typical usage and uses these patterns to predict new inputs. Error correction is done by strongly biasing parsing toward expected meanings unless clear evidence from the input shows the current sentence is not expected. A dialogue acquisition and tracking algorithm is presented along with a description of its implementation in a voice interactive system . A series of tests are described that show the power of the error correction methodology when stereotypic dialogue occurs.
Attention, Intentions, And The Structure Of Discourse
In this paper we explore a new theory of discourse structure that stresses the role of purpose and processing in discourse . In this theory, discourse structure is composed of three separate but interrelated components: the structure of the sequence of utterances (called the linguistic structure ), a structure of purposes (called the intentional structure ), and the state of focus of attention (called the attentional state ). The linguistic structure consists of segments of the discourse into which the utterances naturally aggregate. The intentional structure captures the discourse-relevant purposes , expressed in each of the linguistic segments as well as relationships among them. The attentional state is an abstraction of the focus of attention of the participants as the discourse unfolds. The attentional state , being dynamic, records the objects, properties, and relations that are salient at each point of the discourse . The distinction among these components is essential to provide an adequate explanation of such discourse phenomena as cue phrases , referring expressions , and interruptions . The theory of attention, intention, and aggregation of utterances is illustrated in the paper with a number of example discourses . Various properties of discourse are described, and explanations for the behaviour of cue phrases , referring expressions , and interruptions are explored. This theory provides a framework for describing the processing of utterances in a discourse . Discourse processing requires recognizing how the utterances of the discourse aggregate into segments , recognizing the intentions expressed in the discourse and the relationships among intentions , and tracking the discourse through the operation of the mechanisms associated with attentional state . This processing description specifies in these recognition tasks the role of information from the discourse and from the participants ' knowledge of the domain.
REFERENCE IDENTIFICATION AND REFERENCE IDENTIFICATION FAILURES
The goal of this work is the enrichment of human-machine interactions in a natural language environment . Because a speaker and listener cannot be assured to have the same beliefs , contexts , perceptions , backgrounds , or goals , at each point in a conversation , difficulties and mistakes arise when a listener interprets a speaker's utterance . These mistakes can lead to various kinds of misunderstandings between speaker and listener , including reference failures or failure to understand the speaker's intention . We call these misunderstandings miscommunication . Such mistakes can slow, and possibly break down, communication . Our goal is to recognize and isolate such miscommunications and circumvent them. This paper highlights a particular class of miscommunication --- reference problems --- by describing a case study and techniques for avoiding failures of reference . We want to illustrate a framework less restrictive than earlier ones by allowing a speaker leeway in forming an utterance about a task and in determining the conversational vehicle to deliver it. The paper also promotes a new view for extensional reference .
The Relationship Between Tree Adjoining Grammars And Head Grammars
We examine the relationship between the two grammatical formalisms : Tree Adjoining Grammars and Head Grammars . We briefly investigate the weak equivalence of the two formalisms . We then turn to a discussion comparing the linguistic expressiveness of the two formalisms .
A LOGICAL SEMANTICS FOR FEATURE STRUCTURES
Unification-based grammar formalisms use structures containing sets of features to describe linguistic objects . Although computational algorithms for unification of feature structures have been worked out in experimental research, these algorithms become quite complicated, and a more precise description of feature structures is desirable. We have developed a model in which descriptions of feature structures can be regarded as logical formulas , and interpreted by sets of directed graphs which satisfy them. These graphs are, in fact, transition graphs for a special type of deterministic finite automaton . This semantics for feature structures extends the ideas of Pereira and Shieber [11], by providing an interpretation for values which are specified by disjunctions and path values embedded within disjunctions . Our interpretation differs from that of Pereira and Shieber by using a logical model in place of a denotational semantics . This logical model yields a calculus of equivalences , which can be used to simplify formulas . Unification is attractive, because of its generality, but it is often computationally inefficient. Our model allows a careful examination of the computational complexity of unification . We have shown that the consistency problem for formulas with disjunctive values is NP-complete . To deal with this complexity , we describe how disjunctive values can be specified in a way which delays expansion to disjunctive normal form .
The Multimedia Articulation of Answers in a Natural Language Database Query System
This paper describes a domain independent strategy for the multimedia articulation of answers elicited by a natural language interface to database query applications . Multimedia answers include videodisc images and heuristically-produced complete sentences in text or text-to-speech form . Deictic reference and feedback about the discourse are enabled. The interface thus presents the application as cooperative and conversational.
An Architecture for Anaphora Resolution
In this paper, we describe the pronominal anaphora resolution module of Lucy , a portable English understanding system . The design of this module was motivated by the observation that, although there exist many theories of anaphora resolution , no one of these theories is complete. Thus we have implemented a blackboard-like architecture in which individual partial theories can be encoded as separate modules that can interact to propose candidate antecedents and to evaluate each other's proposals.
Machine Translation Using Isomorphic UCGs
This paper discusses the application of Unification Categorial Grammar (UCG) to the framework of Isomorphic Grammars for Machine Translation pioneered by Landsbergen. The Isomorphic Grammars approach to MT involves developing the grammars of the Source and Target languages in parallel, in order to ensure that SL and TL expressions which stand in the translation relation have isomorphic derivations . The principle advantage of this approach is that knowledge concerning translation equivalence of expressions may be directly exploited, obviating the need for answers to semantic questions that we do not yet have. Semantic and other information may still be incorporated, but as constraints on the translation relation , not as levels of textual representation . After introducing this approach to MT system design, and the basics of monolingual UCG , we will show how the two can be integrated, and present an example from an implemented bi-directional English-Spanish fragment . Finally we will present some outstanding problems with the approach.
On the Generation and Interpretation of Demonstrative Expressions
This paper presents necessary and sufficient conditions for the use of demonstrative expressions in English and discusses implications for current discourse processing algorithms . We examine a broad range of texts to show how the distribution of demonstrative forms and functions is genre dependent . This research is part of a larger study of anaphoric expressions , the results of which will be incorporated into a natural language generation system .
Parsing with Category Coocurrence Restrictions
This paper summarizes the formalism of Category Cooccurrence Restrictions (CCRs) and describes two parsing algorithms that interpret it. CCRs are Boolean conditions on the cooccurrence of categories in local trees which allow the statement of generalizations which cannot be captured in other current syntax formalisms . The use of CCRs leads to syntactic descriptions formulated entirely with restrictive statements . The paper shows how conventional algorithms for the analysis of context free languages can be adapted to the CCR formalism . Special attention is given to the part of the parser that checks the fulfillment of logical well-formedness conditions on trees .
Solving Some Persistent Presupposition Problems
Soames 1979 provides some counterexamples to the theory of natural language presuppositions that is presented in Gazdar 1979. Soames 1982 provides a theory which explains these counterexamples. Mercer 1987 rejects the solution found in Soames 1982 leaving these counterexamples unexplained. By reappraising these insightful counterexamples, the inferential theory for natural language presuppositions described in Mercer 1987, 1988 gives a simple and straightforward explanation for the presuppositional nature of these sentences .
Directing the Generation of Living Space Descriptions
We have developed a computational model of the process of describing the layout of an apartment or house, a much-studied discourse task first characterized linguistically by Linde (1974). The model is embodied in a program, APT , that can reproduce segments of actual tape-recorded descriptions, using organizational and discourse strategies derived through analysis of our corpus .
Island Parsing and Bidirectional Charts
Chart parsing is directional in the sense that it works from the starting point (usually the beginning of the sentence) extending its activity usually in a rightward manner. We shall introduce the concept of a chart that works outward from islands and makes sense of as much of the sentence as it is actually possible, and after that will lead to predictions of missing fragments . So, for any place where the easily identifiable fragments occur in the sentence , the process will extend to both the left and the right of the islands , until possibly completely missing fragments are reached. At that point, by virtue of the fact that both a left and a right context were found, heuristics can be introduced that predict the nature of the missing fragments .
Interactive Translation : a new approach
A new approach for Interactive Machine Translation where the author interacts during the creation or the modification of the document is proposed. The explanation of an ambiguity or an error for the purposes of correction does not use any concepts of the underlying linguistic theory : it is a reformulation of the erroneous or ambiguous sentence . The interaction is limited to the analysis step of the translation process . This paper presents a new interactive disambiguation scheme based on the paraphrasing of a parser 's multiple output. Some examples of paraphrasing ambiguous sentences are presented.
NETL: A System for Representing and Using Real-World Knowledge
Computer programs so far have not fared well in modeling language acquisition . For one thing, learning methodology applicable in general domains does not readily lend itself in the linguistic domain . For another, linguistic representation used by language processing systems is not geared to learning . We introduced a new linguistic representation , the Dynamic Hierarchical Phrasal Lexicon (DHPL) [Zernik88], to facilitate language acquisition . From this, a language learning model was implemented in the program RINA , which enhances its own lexical hierarchy by processing examples in context. We identified two tasks: First, how linguistic concepts are acquired from training examples and organized in a hierarchy ; this task was discussed in previous papers [Zernik87]. Second, we show in this paper how a lexical hierarchy is used in predicting new linguistic concepts . Thus, a program does not stall even in the presence of a lexical unknown , and a hypothesis can be produced for covering that lexical gap .
COMPLEX: A Computational Lexicon for Natural Language Systems
Although every natural language system needs a computational lexicon , each system puts different amounts and types of information into its lexicon according to its individual needs. However, some of the information needed across systems is shared or identical information. This paper presents our experience in planning and building COMPLEX , a computational lexicon designed to be a repository of shared lexical information for use by Natural Language Processing (NLP) systems . We have drawn primarily on explicit and implicit information from machine-readable dictionaries (MRD's) to create a broad coverage lexicon .
MODELING THE USER IN NATURAL LANGUAGE SYSTEMS
For intelligent interactive systems to communicate with humans in a natural manner, they must have knowledge about the system users . This paper explores the role of user modeling in such systems . It begins with a characterization of what a user model is and how it can be used. The types of information that a user model may be required to keep about a user are then identified and discussed. User models themselves can vary greatly depending on the requirements of the situation and the implementation, so several dimensions along which they can be classified are presented. Since acquiring the knowledge for a user model is a fundamental problem in user modeling , a section is devoted to this topic. Next, the benefits and costs of implementing a user modeling component for a system are weighed in light of several aspects of the interaction requirements that may be imposed by the system. Finally, the current state of research in user modeling is summarized, and future research topics that must be addressed in order to achieve powerful, general user modeling systems are assessed.
Generation for Dialogue Translation Using Typed Feature Structure Unification
This article introduces a bidirectional grammar generation system called feature structure-directed generation , developed for a dialogue translation system . The system utilizes typed feature structures to control the top-down derivation in a declarative way. This generation system also uses disjunctive feature structures to reduce the number of copies of the derivation tree . The grammar for this generator is designed to properly generate the speaker's intention in a telephone dialogue .
Sentence disambiguation by document preference sets oriented
This paper proposes document oriented preference sets(DoPS) for the disambiguation of the dependency structure of sentences . The DoPS system extracts preference knowledge from a target document or other documents automatically. Sentence ambiguities can be resolved by using domain targeted preference knowledge without using complicated large knowledgebases . Implementation and empirical results are described for the the analysis of dependency structures of Japanese patent claim sentences .
A phonological knowledge base system using unification-based formalism: a case study of Korean phonology
This paper describes the framework of a Korean phonological knowledge base system using the unification-based grammar formalism : Korean Phonology Structure Grammar (KPSG) . The approach of KPSG provides an explicit development model for constructing a computational phonological system : speech recognition and synthesis system . We show that the proposed approach is more describable than other approaches such as those employing a traditional generative phonological approach .
Synchronous Tree-Adjoining Grammars
The unique properties of tree-adjoining grammars (TAG) present a challenge for the application of TAGs beyond the limited confines of syntax , for instance, to the task of semantic interpretation or automatic translation of natural language . We present a variant of TAGs , called synchronous TAGs , which characterize correspondences between languages . The formalism's intended usage is to relate expressions of natural languages to their associated semantics represented in a logical form language , or to their translates in another natural language ; in summary, we intend it to allow TAGs to be used beyond their role in syntax proper . We discuss the application of synchronous TAGs to concrete examples, mentioning primarily in passing some computational issues that arise in its interpretation.
Japanese Sentence Analysis as Argumentation
This paper proposes that sentence analysis should be treated as defeasible reasoning , and presents such a treatment for Japanese sentence analyses using an argumentation system by Konolige, which is a formalization of defeasible reasoning , that includes arguments and defeat rules that capture defeasibility .
Spelling-checking for Highly Inflective Languages
Spelling-checkers have become an integral part of most text processing software . From different reasons among which the speed of processing prevails they are usually based on dictionaries of word forms instead of words . This approach is sufficient for languages with little inflection such as English , but fails for highly inflective languages such as Czech , Russian , Slovak or other Slavonic languages . We have developed a special method for describing inflection for the purpose of building spelling-checkers for such languages. The speed of the resulting program lies somewhere in the middle of the scale of existing spelling-checkers for English and the main dictionary fits into the standard 360K floppy , whereas the number of recognized word forms exceeds 6 million (for Czech ). Further, a special method has been developed for easy word classification .
Toward a Real-Time Spoken Language System Using Commercial Hardware
We describe the methods and hardware that we are using to produce a real-time demonstration of an integrated Spoken Language System . We describe algorithms that greatly reduce the computation needed to compute the N-Best sentence hypotheses . To avoid grammar coverage problems we use a fully-connected first-order statistical class grammar . The speech-search algorithm is implemented on a board with a single Intel i860 chip , which provides a factor of 5 speed-up over a SUN 4 for straight C code . The board plugs directly into the VME bus of the SUN4 , which controls the system and contains the natural language system and application back end .
A New Paradigm for Speaker-Independent Training and Speaker Adaptation
This paper reports on two contributions to large vocabulary continuous speech recognition . First, we present a new paradigm for speaker-independent (SI) training of hidden Markov models (HMM) , which uses a large amount of speech from a few speakers instead of the traditional practice of using a little speech from many speakers . In addition, combination of the training speakers is done by averaging the statistics of independently trained models rather than the usual pooling of all the speech data from many speakers prior to training . With only 12 training speakers for SI recognition , we achieved a 7.5% word error rate on a standard grammar and test set from the DARPA Resource Management corpus . This performance is comparable to our best condition for this test suite, using 109 training speakers . Second, we show a significant improvement for speaker adaptation (SA) using the new SI corpus and a small amount of speech from the new (target) speaker . A probabilistic spectral mapping is estimated independently for each training (reference) speaker and the target speaker . Each reference model is transformed to the space of the target speaker and combined by averaging . Using only 40 utterances from the target speaker for adaptation , the error rate dropped to 4.1% --- a 45% reduction in error compared to the SI result.
AN EDITOR FOR THE EXPLANATORY AND COMBINATORY DICTIONARY OF CONTEMPORARY FRENCH (DECFC)
This paper presents a specialized editor for a highly structured dictionary . The basic goal in building that editor was to provide an adequate tool to help lexicologists produce a valid and coherent dictionary on the basis of a linguistic theory . If we want valuable lexicons and grammars to achieve complex natural language processing , we must provide very powerful tools to help create and ensure the validity of such complex linguistic databases . Our most important task in building the editor was to define a set of coherence rules that could be computationally applied to ensure the validity of lexical entries . A customized interface for browsing and editing was also designed and implemented.
Free Indexation: Combinatorial Analysis and A Compositional Algorithm
The principle known as free indexation plays an important role in the determination of the referential properties of noun phrases in the principle-and-parameters language framework . First, by investigating the combinatorics of free indexation , we show that the problem of enumerating all possible indexings requires exponential time . Secondly, we exhibit a provably optimal free indexation algorithm .
Robust Processing of Real-World Natural-Language Texts
It is often assumed that when natural language processing meets the real world, the ideal of aiming for complete and correct interpretations has to be abandoned. However, our experience with TACITUS ; especially in the MUC-3 evaluation , has shown that principled techniques for syntactic and pragmatic analysis can be bolstered with methods for achieving robustness. We describe three techniques for making syntactic analysis more robust---an agenda-based scheduling parser , a recovery technique for failed parses , and a new technique called terminal substring parsing . For pragmatics processing , we describe how the method of abductive inference is inherently robust, in that an interpretation is always possible, so that in the absence of the required world knowledge , performance degrades gracefully. Each of these techniques have been evaluated and the results of the evaluations are presented.
An Efficient Chart-based Algorithm for Partial-Parsing of Unrestricted Texts
We present an efficient algorithm for chart-based phrase structure parsing of natural language that is tailored to the problem of extracting specific information from unrestricted texts where many of the words are unknown and much of the text is irrelevant to the task. The parser gains algorithmic efficiency through a reduction of its search space . As each new edge is added to the chart , the algorithm checks only the topmost of the edges adjacent to it, rather than all such edges as in conventional treatments. The resulting spanning edges are insured to be the correct ones by carefully controlling the order in which edges are introduced so that every final constituent covers the longest possible span . This is facilitated through the use of phrase boundary heuristics based on the placement of function words , and by heuristic rules that permit certain kinds of phrases to be deduced despite the presence of unknown words . A further reduction in the search space is achieved by using semantic rather than syntactic categories on the terminal and non-terminal edges , thereby reducing the amount of ambiguity and thus the number of edges , since only edges with a valid semantic interpretation are ever introduced.
Temporal Structure of Discourse
In this paper discourse segments are defined and a method for discourse segmentation primarily based on abduction of temporal relations between segments is proposed. This method is precise and computationally feasible and is supported by previous work in the area of temporal anaphora resolution .
Syntactic Ambiguity Resolution Using A Discrimination and Robustness Oriented Adaptive Learning Algorithm
In this paper, a discrimination and robustness oriented adaptive learning procedure is proposed to deal with the task of syntactic ambiguity resolution . Owing to the problem of insufficient training data and approximation error introduced by the language model , traditional statistical approaches , which resolve ambiguities by indirectly and implicitly using maximum likelihood method , fail to achieve high performance in real applications. The proposed method remedies these problems by adjusting the parameters to maximize the accuracy rate directly. To make the proposed algorithm robust, the possible variations between the training corpus and the real tasks are also taken into consideration by enlarging the separation margin between the correct candidate and its competing members. Significant improvement has been observed in the test. The accuracy rate of syntactic disambiguation is raised from 46.0% to 60.62% by using this novel approach.
Quasi-Destructive Graph Unification with Structure-Sharing
Graph unification remains the most expensive part of unification-based grammar parsing . We focus on one speed-up element in the design of unification algorithms : avoidance of copying of unmodified subgraphs . We propose a method of attaining such a design through a method of structure-sharing which avoids log(d) overheads often associated with structure-sharing of graphs without any use of costly dependency pointers . The proposed scheme eliminates redundant copying while maintaining the quasi-destructive scheme's ability to avoid over copying and early copying combined with its ability to handle cyclic structures without algorithmic additions.
A Similarity-Driven Transfer System
The transfer phase in machine translation (MT) systems has been considered to be more complicated than analysis and generation , since it is inherently a conglomeration of individual lexical rules . Currently some attempts are being made to use case-based reasoning in machine translation , that is, to make decisions on the basis of translation examples at appropriate pints in MT . This paper proposes a new type of transfer system , called a Similarity-driven Transfer System (SimTran) , for use in such case-based MT (CBMT) .
Interactive Speech Understanding
This paper introduces a robust interactive method for speech understanding . The generalized LR parsing is enhanced in this approach. Parsing proceeds from left to right correcting minor errors. When a very noisy portion is detected, the parser skips that portion using a fake non-terminal symbol . The unidentified portion is resolved by re-utterance of that portion which is parsed very efficiently by using the parse record of the first utterance . The user does not have to speak the whole sentence again. This method is also capable of handling unknown words , which is important in practical systems. Detected unknown words can be incrementally incorporated into the dictionary after the interaction with the user . A pilot system has shown great effectiveness of this approach.
Recognizing Unregistered Names for Mandarin Word Identification
Word Identification has been an important and active issue in Chinese Natural Language Processing . In this paper, a new mechanism, based on the concept of sublanguage , is proposed for identifying unknown words , especially personal names , in Chinese newspapers . The proposed mechanism includes title-driven name recognition , adaptive dynamic word formation , identification of 2-character and 3-character Chinese names without title . We will show the experimental results for two corpora and compare them with the results by the NTHU's statistic-based system , the only system that we know has attacked the same problem. The experimental results have shown significant improvements over the WI systems without the name identification capability.
Reconstructing Spatial Image from Natural Language Texts
This paper describes the understanding process of the spatial descriptions in Japanese . In order to understand the described world , the authors try to reconstruct the geometric model of the global scene from the scenic descriptions drawing a space. It is done by an experimental computer program SPRINT , which takes natural language texts and produces a model of the described world . To reconstruct the model , the authors extract the qualitative spatial constraints from the text , and represent them as the numerical constraints on the spatial attributes of the entities . This makes it possible to express the vagueness of the spatial concepts and to derive the maximally plausible interpretation from a chunk of information accumulated as the constraints. The interpretation reflects the temporary belief about the world .
Multi-Site Data Collection for a Spoken Language Corpus: MADCOW
This paper describes a recently collected spoken language corpus for the ATIS (Air Travel Information System) domain . This data collection effort has been co-ordinated by MADCOW (Multi-site ATIS Data COllection Working group) . We summarize the motivation for this effort, the goals, the implementation of a multi-site data collection paradigm , and the accomplishments of MADCOW in monitoring the collection and distribution of 12,000 utterances of spontaneous speech from five sites for use in a multi-site common evaluation of speech, natural language and spoken language .
Spoken Language Processing in the Framework of Human-Machine Communication at LIMSI
The paper provides an overview of the research conducted at LIMSI in the field of speech processing , but also in the related areas of Human-Machine Communication , including Natural Language Processing , Non Verbal and Multimodal Communication . Also presented are the commercial applications of some of the research projects. When applicable, the discussion is placed in the framework of international collaborations.
The MIT ATIS System: February 1992 Progress Report
This paper describes the status of the MIT ATIS system as of February 1992, focusing especially on the changes made to the SUMMIT recognizer . These include context-dependent phonetic modelling , the use of a bigram language model in conjunction with a probabilistic LR parser , and refinements made to the lexicon . Together with the use of a larger training set , these modifications combined to reduce the speech recognition word and sentence error rates by a factor of 2.5 and 1.6, respectively, on the October '91 test set . The weighted error for the entire spoken language system on the same test set is 49.3%. Similar results were also obtained on the February '92 benchmark evaluation .
Recent Improvements and Benchmark Results for Paramax ATIS System
This paper describes three relatively domain-independent capabilities recently added to the Paramax spoken language understanding system : non-monotonic reasoning , implicit reference resolution , and database query paraphrase . In addition, we discuss the results of the February 1992 ATIS benchmark tests . We describe a variation on the standard evaluation metric which provides a more tightly controlled measure of progress. Finally, we briefly describe an experiment which we have done in extending the n-best speech/language integration architecture to improving OCR accuracy .
Towards History-based Grammars: Using Richer Models for Probabilistic Parsing
We describe a generative probabilistic model of natural language , which we call HBG , that takes advantage of detailed linguistic information to resolve ambiguity . HBG incorporates lexical, syntactic, semantic, and structural information from the parse tree into the disambiguation process in a novel way. We use a corpus of bracketed sentences , called a Treebank , in combination with decision tree building to tease out the relevant aspects of a parse tree that will determine the correct parse of a sentence . This stands in contrast to the usual approach of further grammar tailoring via the usual linguistic introspection in the hope of generating the correct parse . In head-to-head tests against one of the best existing robust probabilistic parsing models , which we call P-CFG, the HBG model significantly outperforms P-CFG , increasing the parsing accuracy rate from 60% to 75%, a 37% reduction in error.
MAP Estimation of Continuous Density HMM: Theory and Applications
We discuss maximum a posteriori estimation of continuous density hidden Markov models (CDHMM) . The classical MLE reestimation algorithms , namely the forward-backward algorithm and the segmental k-means algorithm , are expanded and reestimation formulas are given for HMM with Gaussian mixture observation densities . Because of its adaptive nature, Bayesian learning serves as a unified approach for the following four speech recognition applications, namely parameter smoothing , speaker adaptation , speaker group modeling and corrective training . New experimental results on all four applications are provided to show the effectiveness of the MAP estimation approach .
One Sense Per Discourse
It is well-known that there are polysemous words like sentence whose meaning or sense depends on the context of use. We have recently reported on two new word-sense disambiguation systems , one trained on bilingual material (the Canadian Hansards ) and the other trained on monolingual material ( Roget's Thesaurus and Grolier's Encyclopedia ). As this work was nearing completion, we observed a very strong discourse effect. That is, if a polysemous word such as sentence appears two or more times in a well-written discourse , it is extremely likely that they will all share the same sense . This paper describes an experiment which confirmed this hypothesis and found that the tendency to share sense in the same discourse is extremely strong (98%). This result can be used as an additional source of constraint for improving the performance of the word-sense disambiguation algorithm . In addition, it could also be used to help evaluate disambiguation algorithms that did not make use of the discourse constraint .
Exploring Correlation Of Dependency Relation Paths For Answer Extraction
In this paper, we explore correlation of dependency relation paths to rank candidate answers in answer extraction. Using the correlation measure, we compare dependency relations of a candidate answer and mapped question phrases in sentence with the corresponding relations in question. Different from previous studies, we propose an approximate phrase mapping algorithm and incorporate the mapping score into the correlation measure. The correlations are further incorporated into a Maximum Entropy-based ranking model which estimates path weights from training. Experimental results show that our method significantly outperforms state-of-the-art syntactic relation-based methods by up to 20% in MRR.
Automatic Processing Of Large Corpora For The Resolution Of Anaphora References
Manual acquisition of semantic constraints in broad domains is very expensive. This paper presents an automatic scheme for collecting statistics on cooccurrence patterns in a large corpus. To a large extent, these statistics reflect semantic constraints and thus are used to disambiguate anaphora references and syntactic ambiguities. The scheme was implemented by gathering statistics on the output of other linguistic tools. An experiment was performed to resolve references of the pronoun "it" in sentences that were randomly selected from the corpus. The results of the experiment show that in most of the cases the cooccurrence statistics indeed reflect the semantic constraints and thus provide a basis for a useful disambiguation tool.
Kullback-Leibler Distance Between Probabilistic Context-Free Grammars And Probabilistic Finite Automata
We consider the problem of computing the Kullback-Leibler distance, also called the relative entropy, between a probabilistic context-free grammar and a probabilistic finite automaton. We show that there is a closed-form (analytical) solution for one part of the Kullback-Leibler distance, viz. the cross-entropy. We discuss several applications of the result to the problem of distributional approximation of probabilistic context-free grammars by means of probabilistic finite automata.
Spelling Correction In Agglutinative Languages
Methods developed for spelling correction for languages like English (see the review by Kukich (Kukich, 1992)) are not readily applicable to agglutinative languages. This poster presents an approach to spelling correction in agglutinative languages that is based on two-level morphology and a dynamic-programming based search algorithm. After an overview of our approach, we present results from experiments with spelling correction in Turkish.
Robust Continuous Speech Recognition Technology Program Summary
The major objective of this program is to develop and demonstrate robust, high performance continuous speech recognition (CSR) techniques focussed on application in Spoken Language Systems (SLS) which will enhance the effectiveness of military and civilian computer-based systems. A key complementary objective is to define and develop applications of robust speech recognition and understanding systems, and to help catalyze the transition of spoken language technology into military and civilian systems, with particular focus on application of robust CSR to mobile military command and control. The research effort focusses on developing advanced acoustic modelling, rapid search, and recognition-time adaptation techniques for robust large-vocabulary CSR, and on applying these techniques to the new ARPA large-vocabulary CSR corpora and to military application tasks.
A Framework for Customizable Generation of Hypertext Presentations
In this paper, we present a framework, Presentor, for the development and customization of hypertext presentation generators. Presentor offers intuitive and powerful declarative languages specifying the presentation at different levels: macro-planning, micro-planning, realization, and formatting. Presentor is implemented and is portable cross-platform and cross-domain. It has been used with success in several application domains including weather forecasting, object modeling, system description and requirements summarization.
A Practical Methodology For The Evaluation Of Spoken Language Systems
A meaningful evaluation methodology can advance the state-of-the-art by encouraging mature, practical applications rather than "toy" implementations. Evaluation is also crucial to assessing competing claims and identifying promising technical approaches. While work in speech recognition (SR) has a history of evaluation methodologies that permit comparison among various systems, until recently no methodology existed for either developers of natural language (NL) interfaces or researchers in speech understanding (SU) to evaluate and compare the systems they developed. Recently considerable progress has been made by a number of groups involved in the DARPA Spoken Language Systems (SLS) program to agree on a methodology for comparative evaluation of SLS systems, and that methodology has been put into practice several times in comparative tests of several SLS systems. These evaluations are probably the only NL evaluations other than the series of Message Understanding Conferences (Sundheim, 1989; Sundheim, 1991) to have been developed and used by a group of researchers at different sites, although several excellent workshops have been held to study some of these problems (Palmer et al., 1989; Neal et al., 1991). This paper describes a practical "black-box" methodology for automatic evaluation of question-answering NL systems. While each new application domain will require some development of special resources, the heart of the methodology is domain-independent, and it can be used with either speech or text input. The particular characteristics of the approach are described in the following section: subsequent sections present its implementation in the DARPA SLS community, and some problems and directions for future development.
Integrating Syntactic Priming Into An Incremental Probabilistic Parser, With An Application To Psycholinguistic Modeling
The psycholinguistic literature provides evidence for syntactic priming, i.e., the tendency to repeat structures. This paper describes a method for incorporating priming into an incremental probabilistic parser. Three models are compared, which involve priming of rules between sentences, within sentences, and within coordinate structures. These models simulate the reading time advantage for parallel structures found in human data, and also yield a small increase in overall parsing accuracy.
Estimating Class Priors In Domain Adaptation For Word Sense Disambiguation
Instances of a word drawn from different domains may have different sense priors (the proportions of the different senses of a word). This in turn affects the accuracy of word sense disambiguation (WSD) systems trained and applied on different domains. This paper presents a method to estimate the sense priors of words drawn from a new domain, and highlights the importance of using well calibrated probabilities when performing these estimations. By using well calibrated probabilities, we are able to estimate the sense priors effectively to achieve significant improvements in WSD accuracy.
An Attempt To Automatic Thesaurus Construction From An Ordinary Japanese Language Dictionary
How to obtain hierarchical relations(e.g. superordinate -hyponym relation, synonym relation) is one of the most important problems for thesaurus construction. A pilot system for extracting these relations automatically from an ordinary Japanese language dictionary (Shinmeikai Kokugojiten, published by Sansei-do, in machine readable form) is given. The features of the definition sentences in the dictionary, the mechanical extraction of the hierarchical relations and the estimation of the results are discussed.
The Transfer Phase Of Mu Machine Translation System
The interlingual approach to MT has been repeatedly advocated by researchers originally interested in natural language understanding who take machine translation to be one possible application. However, not only the ambiguity but also the vagueness which every natural language inevitably has leads this approach into essential difficulties. In contrast, our project, the Mu-project, adopts the transfer approach as the basic framework of MT. This paper describes the detailed construction of the transfer phase of our system from Japanese to English, and gives some examples of problems which seem difficult to treat in the interlingual approach. The basic design principles of the transfer phase of our system have already been mentioned in (1) (2). Some of the principles which are relevant to the topic of this paper are: (a) Multiple Layer of Grammars (b) Multiple Layer Presentation (c) Lexicon Driven Processing (d) Form-Oriented Dictionary Description. This paper also shows how these principles are realized in the current system.
Why Nitpicking Works: Evidence For Occam's Razor In Error Correctors
Empirical experience and observations have shown us when powerful and highly tunable classifiers such as maximum entropy classifiers, boosting and SVMs are applied to language processing tasks, it is possible to achieve high accuracies, but eventually their performances all tend to plateau out at around the same point. To further improve performance, various error correction mechanisms have been developed, but in practice, most of them cannot be relied on to predictably improve performance on unseen data; indeed, depending upon the test set, they are as likely to degrade accuracy as to improve it. This problem is especially severe if the base classifier has already been finely tuned. In recent work, we introduced N-fold Templated Piped Correction, or NTPC ("nitpick"), an intriguing error corrector that is designed to work in these extreme operating conditions. Despite its simplicity, it consistently and robustly improves the accuracy of existing highly accurate base models. This paper investigates some of the more surprising claims made by NTPC, and presents experiments supporting an Occam's Razor argument that more complex models are damaging or unnecessary in practice.
Acquisition Of Verb Entailment From Text
The study addresses the problem of automatic acquisition of entailment relations between verbs. While this task has much in common with paraphrases acquisition which aims to discover semantic equivalence between verbs, the main challenge of entailment acquisition is to capture asymmetric, or directional, relations. Motivated by the intuition that it often underlies the local structure of coherent text, we develop a method that discovers verb entailment using evidence about discourse relations between clauses available in a parsed corpus. In comparison with earlier work, the proposed method covers a much wider range of verb entailment types and learns the mapping between verbs with highly varied argument structures.
Forest-Based Statistical Sentence Generation
This paper presents a new approach to statistical sentence generation in which alternative phrases are represented as packed sets of trees, or forests, and then ranked statistically to choose the best one. This representation offers advantages in compactness and in the ability to represent syntactic information. It also facilitates more efficient statistical ranking than a previous approach to statistical generation. An efficient ranking algorithm is described, together with experimental results showing significant improvements over simple enumeration or a lattice-based approach.
An NTU-Approach To Automatic Sentence Extraction For Summary Generation
Automatic summarization and information extraction are two important Internet services. MUC and SUMMAC play their appropriate roles in the next generation Internet. This paper focuses on the automatic summarization and proposes two different models to extract sentences for summary generation under two tasks initiated by SUMMAC-1. For categorization task, positive feature vectors and negative feature vectors are used cooperatively to construct generic, indicative summaries. For adhoc task, a text model based on relationship between nouns and verbs is used to filter out irrelevant discourse segment, to rank relevant sentences, and to generate the user-directed summaries. The result shows that the NormF of the best summary and that of the fixed summary for adhoc tasks are 0.456 and 0. 447. The NormF of the best summary and that of the fixed summary for categorization task are 0.4090 and 0.4023. Our system outperforms the average system in categorization task but does a common job in adhoc task.
A Method for Relating Multiple Newspaper Articles by Using Graphs, and Its Application to Webcasting
This paper describes methods for relating (threading) multiple newspaper articles, and for visualizing various characteristics of them by using a directed graph. A set of articles is represented by a set of word vectors, and the similarity between the vectors is then calculated. The graph is constructed from the similarity matrix. By applying some constraints on the chronological ordering of articles, an efficient threading algorithm that runs in 0(n) time (where n is the number of articles) is obtained. The constructed graph is visualized with words that represent the topics of the threads, and words that represent new information in each article. The threading technique is suitable for Webcasting (push) applications. A threading server determines relationships among articles from various news sources, and creates files containing their threading information. This information is represented in eXtended Markup Language (XML), and can be visualized on most Web browsers. The XML-based representation and a current prototype are described in this paper.
A Flexible Example-Based Parser Based on the SSTC
In this paper we sketch an approach for Natural Language parsing. Our approach is an example-based approach, which relies mainly on examples that already parsed to their representation structure, and on the knowledge that we can get from these examples the required information to parse a new input sentence. In our approach, examples are annotated with the Structured String Tree Correspondence (SSTC) annotation schema where each SSTC describes a sentence, a representation tree as well as the correspondence between substrings in the sentence and subtrees in the representation tree. In the process of parsing, we first try to build subtrees for phrases in the input sentence which have been successfully found in the example-base - a bottom up approach. These subtrees will then be combined together to form a single rooted representation tree based on an example with similar representation structure - a top down approach.Keywords:
Named Entity Recognition Using An HMM-Based Chunk Tagger
This paper proposes a Hidden Markov Model (HMM) and an HMM-based chunk tagger, from which a named entity (NE) recognition (NER) system is built to recognize and classify names, times and numerical quantities. Through the HMM, our system is able to apply and integrate four types of internal and external evidences : 1) simple deterministic internal feature of the words, such as capitalization and digitalization ; 2) internal semantic feature of important triggers ; 3) internal gazetteer feature; 4) external macro context feature. In this way, the NER problem can be resolved effectively. Evaluation of our system on MUC-6 and MUC-7 English NE tasks achieves F-measures of 96.6% and 94.1% respectively. It shows that the performance is significantly better than reported by any other machine-learning system. Moreover, the performance is even consistently better than those based on handcrafted rules.
Using A Hybrid System Of Corpus- And Knowledge-Based Techniques To Automate The Induction Of A Lexical Sublanguage Grammar
Porting a Natural Language Processing (NLP) system to a new domain remains one of the bottlenecks in syntactic parsing, because of the amount of effort required to fix gaps in the lexicon, and to attune the existing grammar to the idiosyncracies of the new sublanguage. This paper shows how the process of fitting a lexicalized grammar to a domain can be automated to a great extent by using a hybrid system that combines traditional knowledge-based techniques with a corpus-based approach.
Detecting Transliterated Orthographic Variants Via Two Similarity Metrics
We propose a detection method for orthographic variants caused by transliteration in a large corpus. The method employs two similarities. One is string similarity based on edit distance. The other is contextual similarity by a vector space model. Experimental results show that the method performed a 0.889 F-measure in an open test.
Understanding Temporal Expressions In Emails
Recent years have seen increasing research on extracting and using temporal information in natural language applications. However most of the works found in the literature have focused on identifying and understanding temporal expressions in newswire texts. In this paper we report our work on anchoring temporal expressions in a novel genre, emails. The highly under-specified nature of these expressions fits well with our constraint-based representation of time, Time Calculus for Natural Language (TCNL). We have developed and evaluated a Temporal Expression Anchoror (TEA), and the result shows that it performs significantly better than the baseline, and compares favorably with some of the closely related work.
Research And Development For Spoken Language Systems
The goal of this research is to develop a spoken language system that will demonstrate the usefulness of voice input for interactive problem solving. The system will accept continuous speech, and will handle multiple speakers without explicit speaker enrollment. Combining speech recognition and natural language processing to achieve speech understanding, the system will be demonstrated in an application domain relevant to the DoD. The objective of this project is to develop a robust and high-performance speech recognition system using a segment-based approach to phonetic recognition. The recognition system will eventually be integrated with natural language processing to achieve spoken language understanding.
Computer Generation Of Multiparagraph English Text
This paper reports recent research into methods for creating natural language text. A new processing paradigm called Fragment-and-Compose has been created and an experimental system implemented in it. The knowledge to be expressed in text is first divided into small propositional units, which are then composed into appropriate combinations and converted into text.KDS (Knowledge Delivery System), which embodies this paradigm, has distinct parts devoted to creation of the propositional units, to organization of the text, to prevention of excess redundancy, to creation of combinations of units, to evaluation of these combinations as potential sentences, to selection of the best among competing combinations, and to creation of the final text. The Fragment-and-Compose paradigm and the computational methods of KDS are described.
Design Of A Hybrid Deterministic Parser
A deterministic parser is under development which represents a departure from traditional deterministic parsers in that it combines both symbolic and connectionist components. The connectionist component is trained either from patterns derived from the rules of a deterministic grammar. The development and evolution of such a hybrid architecture has lead to a parser which is superior to any known deterministic parser. Experiments are described and powerful training techniques are demonstrated that permit decision-making by the connectionist component in the parsing process. This approach has permitted some simplifications to the rules of other deterministic parsers, including the elimination of rule packets and priorities. Furthermore, parsing is performed more robustly and with more tolerance for error. Data are presented which show how a connectionist (neural) network trained with linguistic rules can parse both expected (grammatical) sentences as well as some novel (ungrammatical or lexically ambiguous) sentences.
Supervised Ranking In Open-Domain Text Summarization
The paper proposes and empirically motivates an integration of supervised learning with unsupervised learning to deal with human biases in summarization. In particular, we explore the use of probabilistic decision tree within the clustering framework to account for the variation as well as regularity in human created summaries. The corpus of human created extracts is created from a newspaper corpus and used as a test set. We build probabilistic decision trees of different flavors and integrate each of them with the clustering framework. Experiments with the corpus demonstrate that the mixture of the two paradigms generally gives a significant boost in performance compared to cases where either ofthe two is considered alone.
Sequential Conditional Generalized Iterative Scaling
We describe a speedup for training conditional maximum entropy models. The algorithm is a simple variation on Generalized Iterative Scaling, but converges roughly an order of magnitude faster, depending on the number of constraints, and the way speed is measured. Rather than attempting to train all model parameters simultaneously, the algorithm trains them sequentially. The algorithm is easy to implement, typically uses only slightly more memory, and will lead to improvements for most maximum entropy problems.
Word Re-Ordering And DP-Based Search In Statistical Machine Translation
In this paper, we describe a search procedure for statistical machine translation (MT) based on dynamic programming (DP). Starting from a DP-based solution to the traveling salesman problem, we present a novel technique to restrict the possible word reordering between source and target language in order to achieve an efficient search algorithm. A search restriction especially useful for the translation direction from German to English is presented. The experimental tests are carried out on the Verbmobil task (German-English, 8000-word vocabulary), which is a limited-domain spoken-language task.
Role Of Word Sense Disambiguation In Lexical Acquisition : Predicting Semantics From Syntactic Cues
This paper addresses the issue of word-sense ambiguity in extraction from machine-readable resources for the construction of large-scale knowledge sources. We describe two experiments: one which ignored word-sense distinctions, resulting in 6.3% accuracy for semantic classification of verbs based on (Levin, 1993); and one which exploited word-sense distinctions, resulting in 97.9% accuracy. These experiments were dual purpose: (1) to validate the central thesis of the work of (Levin, 1993), i.e., that verb semantics and syntactic behavior are predictably related; (2) to demonstrate that a 15-fold improvement can be achieved in deriving semantic information from syntactic cues if we first divide the syntactic cues into distinct groupings that correlate with different word senses. Finally, we show that we can provide effective acquisition techniques for novel word senses using a combination of online sources.
PRC Inc: Description Of The PAKTUS System Used For MUC-3
The PRC Adaptive Knowledge-based Text Understanding System (PAKTUS) has been under development as an Independent Research and Development project at PRC since 1984. The objective is a generic system of tools, including a core English lexicon, grammar, and concept representations, for building natural language processing (NLP) systems for text understanding. Systems built with PAKTUS are intended to generate input to knowledge based systems ordata base systems. Input to the NLP system is typically derived from an existing electronic message stream, such as a news wire. PAKTUS supports the adaptation of the generic core to a variety of domains: JINTACCS messages, RAINFORM messages, news reports about a specific type of event, such as financial transfers or terrorist acts, etc., by acquiring sublanguage and domain-specific grammar, words, conceptual mappings, and discourse patterns. The long-term goal is a system that can support the processing of relatively long discourses in domains that are fairly broad with a high rate of success.
Memorisation for Glue Language Deduction and Categorial Parsing
The multiplicative fragment of linear logic has found a number of applications in computational linguistics: in the "glue language" approach to LFG semantics, and in the formulation and parsing of various categorial grammars. These applications call for efficient deduction methods. Although a number of deduction methods for multiplicative linear logic are known, none of them are tabular methods, which bring a substantial efficiency gain by avoiding redundant computation (cf. chart methods in CFG parsing): this paper presents such a method, and discusses its use in relation to the above applications.
Data-Driven Generation Of Emphatic Facial Displays
We describe an implementation of data-driven selection of emphatic facial displays for an embodied conversational agent in a dialogue system. A corpus of sentences in the domain of the target dialogue system was recorded, and the facial displays used by the speaker were annotated. The data from those recordings was used in a range of models for generating facial displays, each model making use of a different amount of context or choosing displays differently within a context. The models were evaluated in two ways: by cross-validation against the corpus, and by asking users to rate the output. The predictions of the cross-validation study differed from the actual user ratings. While the cross-validation gave the highest scores to models making a majority choice within a context, the user study showed a significant preference for models that produced more variation. This preference was especially strong among the female subjects.
Head-Driven Parsing For Word Lattices
We present the first application of the head-driven statistical parsing model of Collins (1999) as a simultaneous language model and parser for large-vocabulary speech recognition. The model is adapted to an online left to right chart-parser for word lattices, integrating acoustic, n-gram, and parser probabilities. The parser uses structural and lexical dependencies not considered by n-gram models, conditioning recognition on more linguistically-grounded relationships. Experiments on the Wall Street Journal treebank and lattice corpora show word error rates competitive with the standard n-gram language model while extracting additional structural information useful for speech understanding.
Improving Language Model Size Reduction Using Better Pruning Criteria
Reducing language model (LM) size is a critical issue when applying a LM to realistic applications which have memory constraints. In this paper, three measures are studied for the purpose of LM pruning. They are probability, rank, and entropy. We evaluated the performance of the three pruning criteria in a real application of Chinese text input in terms of character error rate (CER). We first present an empirical comparison, showing that rank performs the best in most cases. We also show that the high-performance of rank lies in its strong correlation with error rate. We then present a novel method of combining two criteria in model pruning. Experimental results show that the combined criterion consistently leads to smaller models than the models pruned using either of the criteria separately, at the same CER.
Finite-State Multimodal Parsing And Understanding
Multimodal interfaces require effective parsing and understanding of utterances whose content is distributed across multiple input modes. Johnston 1998 presents an approach in which strategies for multimodal integration are stated declaratively using a unification-based grammar that is used by a multidimensional chart parser to compose inputs. This approach is highly expressive and supports a broad class of interfaces, but offers only limited potential for mutual compensation among the input modes, is subject to significant concerns in terms of computational complexity, and complicates selection among alternative multimodal interpretations of the input. In this paper, we present an alternative approach in which multimodal parsing and understanding are achieved using a weighted finite-state device which takes speech and gesture streams as inputs and outputs their joint interpretation. This approach is significantly more efficient, enables tight-coupling of multimodal understanding with speech recognition, and provides a general probabilistic framework for multimodal ambiguity resolution.
Exploring Syntactic Features For Relation Extraction Using A Convolution Tree Kernel
This paper proposes to use a convolution kernel over parse trees to model syntactic structure information for relation extraction. Our study reveals that the syntactic structure features embedded in a parse tree are very effective for relation extraction and these features can be well captured by the convolution tree kernel. Evaluation on the ACE 2003 corpus shows that the convolution kernel over parse trees can achieve comparable performance with the previous best-reported feature-based methods on the 24 ACE relation subtypes. It also shows that our method significantly outperforms the previous two dependency tree kernels on the 5 ACE relation major types.
Performing Integrated Syntactic And Semantic Parsing Using Classification
This paper describes a particular approach to parsing that utilizes recent advances in unification-based parsing and in classification-based knowledge representation. As unification-based grammatical frameworks are extended to handle richer descriptions of linguistic information, they begin to share many of the properties that have been developed in KL-ONE-like knowledge representation systems. This commonality suggests that some of the classification-based representation techniques can be applied to unification-based linguistic descriptions. This merging supports the integration of semantic and syntactic information into the same system, simultaneously subject to the same types of processes, in an efficient manner. The result is expected to be more efficient parsing due to the increased organization of knowledge. The use of a KL-ONE style representation for parsing and semantic interpretation was first explored in the PSI-KLONE system [2], in which parsing is characterized as an inference process called incremental description refinement.
Developing And Empirically Evaluating Robust Explanation Generators : The KNIGHT Experiments
"To explain complex phenomena, an explanation system must be able to select information from a formal representation of domain knowledge, organize the selected information into multisentential discourse plans, and realize the discourse plans in text. Although recent years have witnessed significant progress in the development of sophisticated computational mechanisms for explanation, empirical results have been limited. This paper reports on a seven-year effort to empirically study explanation generation from semantically rich, large-scale knowledge bases. In particular, it describes a robust explanation system that constructs multisentential and multi-paragraph explanations from the a large-scale knowledge base in the domain of botanical anatomy, physiology, and development. We introduce the evaluation methodology and describe how performance was assessed with this methodology in the most extensive empirical evaluation conducted on an explanation system. In this evaluation, scored within ""half a grade"" of domain experts, and its performance exceeded that of one of the domain experts."
Automatic Acquisition Of English Topic Signatures Based On A Second Language
We present a novel approach for automatically acquiring English topic signatures. Given a particular concept, or word sense, a topic signature is a set of words that tend to co-occur with it. Topic signatures can be useful in a number of Natural Language Processing (NLP) applications, such as Word Sense Disambiguation (WSD) and Text Summarisation. Our method takes advantage of the different way in which word senses are lexicalised in English and Chinese, and also exploits the large amount of Chinese text available in corpora and on the Web. We evaluated the topic signatures on a WSD task, where we trained a second-order vector cooccurrence algorithm on standard WSD datasets, with promising results.
A Machine Learning Approach To German Pronoun Resolution
This paper presents a novel ensemble learning approach to resolving German pronouns. Boosting, the method in question, combines the moderately accurate hypotheses of several classifiers to form a highly accurate one. Experiments show that this approach is superior to a single decision-tree classifier. Furthermore, we present a standalone system that resolves pronouns in unannotated text by using a fully automatic sequence of preprocessing modules that mimics the manual annotation process. Although the system performs well within a limited textual domain, further research is needed to make it effective for open-domain question answering and text summarisation.
Classifying Ellipsis In Dialogue : A Machine Learning Approach
This paper presents a machine learning approach to bare slice disambiguation in dialogue. We extract a set of heuristic principles from a corpus-based sample and formulate them as probabilistic Horn clauses. We then use the predicates of such clauses to create a set of domain independent features to annotate an input dataset, and run two different machine learning algorithms : SLIPPER, a rule-based learning algorithm, and TiMBL, a memory-based system. Both learners perform well, yielding similar success rates of approx 90%. The results show that the features in terms of which we formulate our heuristic principles have significant predictive power, and that rules that closely resemble our Horn clauses can be learnt automatically from these features.
Feature Vector Quality And Distributional Similarity
We suggest a new goal and evaluation criterion for word similarity measures. The new criterion – meaning-entailing substitutability – fits the needs of semantic-oriented NLP applications and can be evaluated directly (independent of an application) at a good level of human agreement. Motivated by this semantic criterion we analyze the empirical quality of distributional word feature vectors and its impact on word similarity results, proposing an objective measure for evaluating feature vector quality. Finally, a novel feature weighting and selection function is presented, which yields superior feature vectors and better word similarity performance.
Filtering Speaker-Specific Words From Electronic Discussions
The work presented in this paper is the first step in a project which aims to cluster and summarise electronic discussions in the context of help-desk applications. The eventual objective of this project is to use these summaries to assist help-desk users and operators. In this paper, we identify features of electronic discussions that influence the clustering process, and offer a filtering mechanism that removes undesirable influences. We tested the clustering and filtering processes on electronic newsgroup discussions, and evaluated their performance by means of two experiments : coarse-level clustering simple information retrieval.
Part- Of- Speech Tagging In Context
We present a new HMM tagger that exploits context on both sides of a word to be tagged, and evaluate it in both the unsupervised and supervised case. Along the way, we present the first comprehensive comparison of unsupervised methods for part-of-speech tagging, noting that published results to date have not been comparable across corpora or lexicons. Observing that the quality of the lexicon greatly impacts the accuracy that can be achieved by the algorithms, we present a method of HMM training that improves accuracy when training of lexical probabilities is unstable. Finally, we show how this new tagger achieves state-of-the-art results in a supervised, non-training intensive framework.
Generation Of Relative Referring Expressions Based On Perceptual Grouping
Past work of generating referring expressions mainly utilized attributes of objects and binary relations between objects. However, such an approach does not work well when there is no distinctive attribute among objects. To overcome this limitation, this paper proposes a method utilizing the perceptual groups of objects and n-ary relations among them. The key is to identify groups of objects that are naturally recognized by humans. We conducted psychological experiments with 42 subjects to collect referring expressions in such situations, and built a generation algorithm based on the results. The evaluation using another 23 subjects showed that the proposed method could effectively generate proper referring expressions.
Direct Orthographical Mapping For Machine Transliteration
Machine transliteration/back-transliteration plays an important role in many multilingual speech and language applications. In this paper, a novel framework for machine transliteration/backtransliteration that allows us to carry out direct orthographical mapping (DOM) between two different languages is presented. Under this framework, a joint source-channel transliteration model, also called n-gram transliteration model (n-gram TM), is further proposed to model the transliteration process. We evaluate the proposed methods through several transliteration/backtransliteration experiments for English/Chinese and English/Japanese language pairs. Our study reveals that the proposed method not only reduces an extensive system development effort but also improves the transliteration accuracy significantly.
A Lemma- Based Approach To A Maximum Entropy Word Sense Disambiguation System For Dutch
In this paper, we present a corpus-based supervised word sense disambiguation (WSD) system for Dutch which combines statistical classification (maximum entropy) with linguistic information. Instead of building individual classifiers per ambiguous wordform, we introduce a lemma-based approach. The advantage of this novel method is that it clusters all inflected forms of an ambiguous word in one classifier, therefore augmenting the training material available to the algorithm. Testing the lemma-based model on the Dutch Senseval-2 test data, we achieve a significant increase in accuracy over the wordform model. Also, the WSD system based on lemmas is smaller and more robust.
Term Aggregation : Mining Synonymous Expressions Using Personal Stylistic Variations
We present a text mining method for finding synonymous expressions based on the distributional hypothesis in a set of coherent corpora. This paper proposes a new methodology to improve the accuracy of a term aggregation system using each author's text as a coherent corpus. Our approach is based on the idea that one person tends to use one expression for one meaning. According to our assumption, most of the words with similar context features in each author's corpus tend not to be synonymous expressions. Our proposed method improves the accuracy of our term aggregation system, showing that our approach is successful.
Detection Of Question- Answer Pairs In Email Conversations
While sentence extraction as an approach to summarization has been shown to work in documents of certain genres, because of the conversational nature of email communication where utterances are made in relation to one made previously, sentence extraction may not capture the necessary segments of dialogue that would make a summary coherent. In this paper, we present our work on the detection of question-answer pairs in an email conversation for the task of email summarization. We show that various features based on the structure of email-threads can be used to improve upon lexical similarity of discourse segments for question-answer pairing.
Fast Computation Of Lexical Affinity Models
We present a framework for the fast computation of lexical affinity models. The framework is composed of a novel algorithm to efficiently compute the co-occurrence distribution between pairs of terms, an independence model, and a parametric affinity model. In comparison with previous models, which either use arbitrary windows to compute similarity between words or use lexical affinity to create sequential models, in this paper we focus on models intended to capture the co-occurrence patterns of any pair of words or phrases at any distance in the corpus. The framework is flexible, allowing fast adaptation to applications and it is scalable. We apply it in combination with a terabyte corpus to answer natural language tests, achieving encouraging results.
Fine-Grained Word Sense Disambiguation Based On Parallel Corpora, Word Alignment, Word Clustering And Aligned Wordnets
The paper presents a method for word sense disambiguation based on parallel corpora. The method exploits recent advances in word alignment and word clustering based on automatic extraction of translation equivalents and being supported by available aligned wordnets for the languages in the corpus. The wordnets are aligned to the Princeton Wordnet, according to the principles established by EuroWordNet. The evaluation of the WSD system, implementing the method described herein showed very encouraging results. The same system used in a validation mode, can be used to check and spot alignment errors in multilingually aligned wordnets as BalkaNet and EuroWordNet.
Minimum Bayes-Risk Decoding For Statistical Machine Translation
We present Minimum Bayes-Risk (MBR) decoding for statistical machine translation. This statistical approach aims to minimize expected loss of translation errors under loss functions that measure translation performance. We describe a hierarchy of loss functions that incorporate different levels of linguistic information from word strings, word-to-word alignments from an MT system, and syntactic structure from parse-trees of source and target language sentences. We report the performance of the MBR decoders on a Chinese-to-English translation task. Our results show that MBR decoding can be used to tune statistical MT performance for specific loss functions.
Confidence Estimation For Information Extraction
Information extraction techniques automatically create structured databases from unstructured data sources, such as the Web or newswire documents. Despite the successes of these systems, accuracy will always be imperfect. For many reasons, it is highly desirable to accurately estimate the confidence the system has in the correctness of each extracted field. The information extraction system we evaluate is based on a linear-chain conditional random field (CRF), a probabilistic model which has performed well on information extraction tasks because of its ability to capture arbitrary, overlapping features of the input in a Markov model. We implement several techniques to estimate the confidence of both extracted fields and entire multi-field records, obtaining an average precision of 98% for retrieving correct fields and 87% for multi-field records.
GE NLTOOLSET: Description Of The System As Used For MUC-4
The GE NLToolset is a set of text interpretation tools designed to be easily adapted to new domains. This report summarizes the system and its performance on the MUC-4 task.
Exploring And Exploiting The Limited Utility Of Captions In Recognizing Intention In Information Graphics
This paper presents a corpus study that explores the extent to which captions contribute to recognizing the intended message of an information graphic. It then presents an implemented graphic interpretation system that takes into account a variety of communicative signals, and an evaluation study showing that evidence obtained from shallow processing of the graphic's caption has a significant impact on the system's success. This work is part of a larger project whose goal is to provide sight-impaired users with effective access to information graphics.
Log-Linear Models For Word Alignment
We present a framework for word alignment based on log-linear models. All knowledge sources are treated as feature functions, which depend on the source langauge sentence, the target language sentence and possible additional variables. Log-linear models allow statistical alignment models to be easily extended by incorporating syntactic information. In this paper, we use IBM Model 3 alignment probabilities, POS correspondence, and bilingual dictionary coverage as features. Our experiments show that log-linear models significantly outperform IBM translation models.
Automatic Induction Of A CCG Grammar For Turkish
This paper presents the results of automatically inducing a Combinatory Categorial Grammar (CCG) lexicon from a Turkish dependency treebank. The fact that Turkish is an agglutinating free word order language presents a challenge for language theories. We explored possible ways to obtain a compact lexicon, consistent with CCG principles, from a treebank which is an order of magnitude smaller than Penn WSJ.
Two-Phase Shift-Reduce Deterministic Dependency Parser of Chinese
In the Chinese language, a verb may have its dependents on its left, right or on both sides. The ambiguity resolution of right-side dependencies is essential for dependency parsing of sentences with two or more verbs. Previous works on shift-reduce dependency parsers may not guarantee the connectivity of a dependency tree due to their weakness at resolving the right-side dependencies. This paper proposes a two-phase shift-reduce dependency parser based on SVM learning. The left-side dependents and right-side nominal dependents are detected in Phase I, and right-side verbal dependents are decided in Phase II. In experimental evaluation, our proposed method outperforms previous shift-reduce dependency parsers for the Chine language, showing improvement of dependency accuracy by 10.08%.
Focusing On Focus : A Formalization
We present an operable definition of focus which is argued to be of a cognito-pragmatic nature and explore how it is determined in discourse in a formalized manner. For this purpose, a file card model of discourse model and knowledge store is introduced enabling the decomposition and formal representation of its determination process as a programmable algorithm (FDA). Interdisciplinary evidence from social and cognitive psychology is cited and the prospect of the integration of focus via FDA as a discourse-level construct into speech synthesis systems, in particular, concept-to-speech systems, is also briefly discussed.
A Comparison Of Rule-Invocation Strategies In Context-Free Chart Parsing
Currently several grammatical formalisms converge towards being declarative and towards utilizing context-free phrase-structure grammar as a backbone, e.g. LFG and PATR-II. Typically the processing of these formalisms is organized within a chart-parsing framework. The declarative character of the formalisms makes it important to decide upon an overall optimal control strategy on the part of the processor. In particular, this brings the rule-invocation strategy into critical focus: to gain maximal processing efficiency, one has to determine the best way of putting the rules to use. The aim of this paper is to provide a survey and a practical comparison of fundamental rule-invocation strategies within context-free chart parsing.
A Bidirectional Model For Natural Language Processing
In this paper I will argue for a model of grammatical processing that is based on uniform processing and knowledge sources. The main feature of this model is to view parsing and generation as two strongly interleaved tasks performed by a single parametrized deduction process. It will be shown that this view supports flexible and efficient natural language processing.
A Probabilistic Context-Free Grammar For Disambiguation In Morphological Parsing
One of the major problems one is faced with when decomposing words into their constituent parts is ambiguity: the generation of multiple analyses for one input word, many of which are implausible. In order to deal with ambiguity, the MORphological PArser MORPA is provided with a probabilistic context-free grammar (PCFG), i.e. it combines a "conventional" context-free morphological grammar to filter out ungrammatical segmentations with a probability-based scoring function which determines the likelihood of each successful parse. Consequently, remaining analyses can be ordered along a scale of plausibility. Test performance data will show that a PCFG yields good results in morphological parsing. MORPA is a fully implemented parser developed for use in a text-to-speech conversion system.
Chinese Word Segmentation in FTRD Beijing
This paper presents a word segmentation system in France Telecom R&D Beijing, which uses a unified approach to word breaking and OOV identification. The output can be customized to meet different segmentation standards through the application of an ordered list of transformation. The system participated in all the tracks of the segmentation bakeoff -- PK-open, PK-closed, AS-open, AS-closed, HK-open, HK-closed, MSR-open and MSR- closed -- and achieved the state-of-the-art performance in MSR-open, MSR-close and PK-open tracks. Analysis of the results shows that each component of the system contributed to the scores.
Coping With Derivation In A Morphological Component
In this paper a morphological component with a limited capability to automatically interpret (and generate) derived words is presented. The system combines an extended two-level morphology [Trost, 1991a; Trost, 1991b] with a feature-based word grammar building on a hierarchical lexicon. Polymorphemic stems not explicitly stored in the lexicon are given a compositional interpretation. That way the system allows to minimize redundancy in the lexicon because derived words that are transparent need not to be stored explicitly. Also, words formed ad-hoc can be recognized correctly. The system is implemented in CommonLisp and has been tested on examples from German derivation.
Full Text Parsing Using Cascades Of Rules: An Information Extraction Perspective
This paper proposes an approach to full parsing suitable for Information Extraction from texts. Sequences of cascades of rules deterministically analyze the text, building unambiguous structures. Initially basic chunks are analyzed; then argumental relations are recognized; finally modifier attachment is performed and the global parse tree is built. The approach was proven to work for three languages and different domains. It was implemented in the IE module of FACILE, a EU project for multilingual text classification and IE.
New Results With The Lincoln Tied-Mixture HMM CSR System
The following describes recent work on the Lincoln CSR system. Some new variations in semiphone modeling have been tested. A very simple improved duration model has reduced the error rate by about 10% in both triphone and semiphone systems. A new training strategy has been tested which, by itself, did not provide useful improvements but suggests that improvements can be obtained by a related rapid adaptation technique. Finally, the recognizer has been modified to use bigram back-off language models. The system was then transferred from the RM task to the ATIS CSR task and a limited number of development tests performed. Evaluation test results are presented for both the RM and ATIS CSR tasks.
Reading more into Foreign Languages
GLOSSER is designed to support reading and learning to read in a foreign language. There are four language pairs currently supported by GLOSSER: English-Bulgarian, English-Estonian, English-Hungarian and French-Dutch. The program is operational on UNIX and Windows '95 platforms, and has undergone a pilot user-study. A demonstration (in UNIX) for Applied Natural Language Processing emphasizes components put to novel technical uses in intelligent computer-assisted morphological analysis (ICALL), including disambiguated morphological analysis and lemmatized indexing for an aligned bilingual corpus of word examples.
Identifying Topics By Position
This paper addresses the problem of identifying likely topics of texts by their position in the text. It describes the automated training and evaluation of an Optimal Position Policy, a method of locating the likely positions of topic-bearing sentences based on genre-specific regularities of discourse structure. This method can be used in applications such as information retrieval, routing, and text summarization.
Hidden-Variable Models For Discriminative Reranking
We describe a new method for the representation of NLP structures within reranking approaches. We make use of a conditional log-linear model, with hidden variables representing the assignment of lexical items to word clusters or word senses. The model learns to automatically make these assignments based on a discriminative training criterion. Training and decoding with the model requires summing over an exponential number of hidden-variable assignments: the required summations can be computed efficiently and exactly using dynamic programming. As a case study, we apply the model to parse reranking. The model gives an F-measure improvement of ~1.25% beyond the base parser, and an ~0.25% improvement beyond Collins (2000) reranker. Although our experiments are focused on parsing, the techniques described generalize naturally to NLP structures other than parse trees.
Taiwan Child Language Corpus : Data Collection and Annotation
Taiwan Child Language Corpus contains scripts transcribed from about 330 hours of recordings of fourteen young children from Southern Min Chinese speaking families in Taiwan. The format of the corpus adopts the Child Language Data Exchange System (CHILDES). The size of the corpus is about 1.6 million words. In this paper, we describe data collection, transcription, word segmentation, and part-of-speech annotation of this corpus. Applications of the corpus are also discussed.
Dynamic Strategy Selection in Flexible Parsing
Robust natural language interpretation requires strong semantic domain models, fail-soft recovery heuristics, and very flexible control structures. Although single-strategy parsers have met with a measure of success, a multi-strategy approach is shown to provide a much higher degree of flexibility, redundancy, and ability to bring task-specific domain knowledge (in addition to general linguistic knowledge) to bear on both grammatical and ungrammatical input. A parsing algorithm is presented that integrates several different parsing strategies, with case-frame instantiation dominating. Each of these parsing strategies exploits different types of knowledge; and their combination provides a strong framework in which to process conjunctions, fragmentary input, and ungrammatical structures, as well as less exotic, grammatically correct input. Several specific heuristics for handling ungrammatical input are presented within this multi-strategy framework.
Parsing with Discontinuous Constituents
By generalizing the notion of location of a constituent to allow discontinuous locations, one can describe the discontinuous constituents of non-configurational languages. These discontinuous constituents can be described by a variant of definite clause grammars, and these grammars can be used in conjunction with a proof procedure to create a parser for non-configurational languages.
The Acquisition and Application of Context Sensitive Grammar for English
A system is described for acquiring a context-sensitive, phrase structure grammar which is applied by a best-path, bottom-up, deterministic parser. The grammar was based on English news stories and a high degree of success in parsing is reported. Overall, this research concludes that CSG is a computationally and conceptually tractable approach to the construction of phrase structure grammar for news story text.
A Quantitative Evaluation of Linguistic Tests for the Automatic Prediction of Semantic Markedness
We present a corpus-based study of methods that have been proposed in the linguistics literature for selecting the semantically unmarked term out of a pair of antonymous adjectives. Solutions to this problem are applicable to the more general task of selecting the positive term from the pair. Using automatically collected data, the accuracy and applicability of each method is quantified, and a statistical analysis of the significance of the results is performed. We show that some simple methods are indeed good indicators for the answer to the problem while other proposed methods fail to perform better than would be attributable to chance. In addition, one of the simplest methods, text frequency, dominates all others. We also apply two generic statistical learning methods for combining the indications of the individual methods, and compare their performance to the simple methods. The most sophisticated complex learning method offers a small, but statistically significant, improvement over the original tests.
Probing the lexicon in evaluating commercial MT systems
In the past the evaluation of machine translation systems has focused on single system evaluations because there were only few systems available. But now there are several commercial systems for the same language pair. This requires new methods of comparative evaluation. In the paper we propose a black-box method for comparing the lexical coverage of MT systems. The method is based on lists of words from different frequency classes. It is shown how these word lists can be compiled and used for testing. We also present the results of using our method on 6 MT systems that translate between English and German.
On Interpreting F-Structures as UDRSs
We describe a method for interpreting abstract flat syntactic representations, LFG f-structures, as underspecified semantic representations, here Underspecified Discourse Representation Structures (UDRSs). The method establishes a one-to-one correspondence between subsets of the LFG and UDRS formalisms. It provides a model theoretic interpretation and an inferential component which operates directly on underspecified representations for f-structures through the translation images of f-structures as UDRSs.
Construct Algebra : Analytical Dialog Management
In this paper we describe a systematic approach for creating a dialog management system based on a Construct Algebra, a collection of relations and operations on a task representation. These relations and operations are analytical components for building higher level abstractions called dialog motivators. The dialog manager, consisting of a collection of dialog motivators, is entirely built using the Construct Algebra.
Mining the Web for Bilingual Text
STRAND (Resnik, 1998) is a language-independent system for automatic discovery of text in parallel translation on the World Wide Web. This paper extends the preliminary STRAND results by adding automatic language identification, scaling up by orders of magnitude, and formally evaluating performance. The most recent end-product is an automatically acquired parallel corpus comprising 2491 English-French document pairs, approximately 1.5 million words per language.
Verb-Noun Collocation SyntLex Dictionary : Corpus-Based Approach
The project presented here is a part of a long term research program aiming at a full lexicon grammar for Polish (SyntLex). The main of this project is computer-assisted acquisition and morpho-syntactic description of verb-noun collocations in Polish. We present methodology and resources obtained in three main project phases which are: dictionary-based acquisition of collocation lexicon, feasibility study for corpus-based lexicon enlargement phase, corpus-based lexicon enlargement and collocation description. In this paper we focus on the results of the third phase. The presented here corpus-based approach permitted us to triple the size the verb-noun collocation dictionary for Polish. In the paper we describe the SyntLex Dictionary of Collocations and announce some future research intended to be a separate project continuation.
Czech MWE Database
In this paper we deal with a recently developed large Czech MWE database containing at the moment 160 000 MWEs (treated as lexical units). It was compiled from various resources such as encyclopedias and dictionaries, public databases of proper names and toponyms, collocations obtained from Czech WordNet, lists of botanical and zoological terms and others. We describe the structure of the database and give basic types of MWEs according to domains they belong to. We compare the built MWEs database with the corpus data from Czech National Corpus (approx. 100 mil. tokens) and present results of this comparison in the paper. These MWEs have not been obtained from the corpus since their frequencies in it are rather low. To obtain a more complete list of MWEs we propose and use a technique exploiting the Word Sketch Engine, which allows us to work with statistical parameters such as frequency of MWEs and their components as well as with the salience for the whole MWEs. We also discuss exploitation of the database for working out a more adequate tagging and lemmatization. The final goal is to be able to recognize MWEs in corpus text and lemmatize them as complete lexical units, i. e. to make tagging and lemmatization more adequate.
Using Log-linear Models for Tuning Machine Translation Output
We describe a set of experiments to explore statistical techniques for ranking and selecting the best translations in a graph of translation hypotheses. In a previous paper (Carl, 2007) we have described how the hypotheses graph is generated through shallow mapping and permutation rules. We have given examples of its nodes consisting of vectors representing morpho-syntactic properties of words and phrases. This paper describes a number of methods for elaborating statistical feature functions from some of the vector components. The feature functions are trained off-line on different types of text and their log-linear combination is then used to retrieve the best M translation paths in the graph. We compare two language modelling toolkits, the CMU and the SRI toolkit and arrive at three results: 1) word-lemma based feature function models produce better results than token-based models, 2) adding a PoS-tag feature function to the word-lemma model improves the output and 3) weights for lexical translations are suitable if the training material is similar to the texts to be translated.
Chinese Term Extraction Based on Delimiters
Existing techniques extract term candidates by looking for internal and contextual information associated with domain specific terms. The algorithms always face the dilemma that fewer features are not enough to distinguish terms from non-terms whereas more features lead to more conflicts among selected features. This paper presents a novel approach for term extraction based on delimiters which are much more stable and domain independent. The proposed approach is not as sensitive to term frequency as that of previous works. This approach has no strict limit or hard rules and thus they can deal with all kinds of terms. It also requires no prior domain knowledge and no additional training to adapt to new domains. Consequently, the proposed approach can be applied to different domains easily and it is especially useful for resource-limited domains. Evaluations conducted on two different domains for Chinese term extraction show significant improvements over existing techniques which verifies its efficiency and domain independent nature. Experiments on new term extraction indicate that the proposed approach can also serve as an effective tool for domain lexicon expansion.
From Sentence to Discourse : Building an Annotation Scheme for Discourse Based on Prague Dependency Treebank
The present paper reports on a preparatory research for building a language corpus annotation scenario capturing the discourse relations in Czech. We primarily focus on the description of the syntactically motivated relations in discourse, basing our findings on the theoretical background of the Prague Dependency Treebank 2.0 and the Penn Discourse Treebank 2. Our aim is to revisit the present-day syntactico-semantic (tectogrammatical) annotation in the Prague Dependency Treebank, extend it for the purposes of a sentence-boundary-crossing representation and eventually to design a new, discourse level of annotation. In this paper, we propose a feasible process of such a transfer, comparing the possibilities the Praguian dependency-based approach offers with the Penn discourse annotation based primarily on the analysis and classification of discourse connectives.
Unsupervised Acquisition of Verb Subcategorization Frames from Shallow-Parsed Corpora
In this paper, we reported experiments of unsupervised automatic acquisition of Italian and English verb subcategorization frames (SCFs) from general and domain corpora. The proposed technique operates on syntactically shallow-parsed corpora on the basis of a limited number of search heuristics not relying on any previous lexico-syntactic knowledge about SCFs. Although preliminary, reported results are in line with state-of-the-art lexical acquisition systems. The issue of whether verbs sharing similar SCFs distributions happen to share similar semantic properties as well was also explored by clustering verbs that share frames with the same distribution using the Minimum Description Length Principle (MDL). First experiments in this direction were carried out on Italian verbs with encouraging results.
A Multi-Path Architecture For Machine Translation Of English Text Into American Sign Language Animation
The translation of English text into American Sign Language (ASL) animation tests the limits of traditional MT architectural designs. A new semantic representation is proposed that uses virtual reality 3D scene modeling software to produce spatially complex ASL phenomena called "classifier predicates." The model acts as an interlingua within a new multi-pathway MT architecture design that also incorporates transfer and direct approaches into a single system.
Integrating Natural Language Components Into Graphical Discourse
In our current research into the design of cognitively well-motivated interfaces relying primarily on the display of graphical information, we have observed that graphical information alone does not provide sufficient support to users - particularly when situations arise that do not simply conform to the users' expectations. This can occur due to too much information being requested, too little, information of the wrong kind, etc. To solve this problem, we are working towards the integration of natural language generation to augment the interaction
The LIMSI Continuous Speech Dictation System
A major axis of research at LIMSI is directed at multilingual, speaker-independent, large vocabulary speech dictation. In this paper the LIMSI recognizer which was evaluated in the ARPA NOV93 CSR test is described, and experimental results on the WSJ and BREF corpora under closely matched conditions are reported. For both corpora word recognition experiments were carried out with vocabularies containing up to 20k words. The recognizer makes use of continuous density HMM with Gaussian mixture for acoustic modeling and n-gram statistics estimated on the newspaper texts for language modeling. The recognizer uses a time-synchronous graph-search strategy which is shown to still be viable with a 20k-word vocabulary when used with bigram back-off language models. A second forward pass, which makes use of a word graph generated with the bigram, incorporates a trigram language model. Acoustic modeling uses cepstrum-based features, context-dependent phone models (intra and interword), phone duration models, and sex-dependent models.
Categorizing Unknown Words : Using Decision Trees To Identify Names And Misspellings
This paper introduces a system for categorizing unknown words. The system is based on a multi-component architecture where each component is responsible for identifying one class of unknown words. The focus of this paper is the components that identify names and spelling errors. Each component uses a decision tree architecture to combine multiple types of evidence about the unknown word. The system is evaluated using data from live closed captions - a genre replete with a wide variety of unknown words.
NEC Corporation And University Of Sheffield: Description Of NEC/Sheffleld System Used For MET Japanese
Recognition of proper nouns in Japanese text has been studied as a part of the more general problem of morphological analysis in Japanese text processing ([1] [2]). It has also been studied in the framework of Japanese information extraction ([3]) in recent years. Our approach to the Multi-lingual Evaluation Task (MET) for Japanese text is to consider the given task as a morphological analysis problem in Japanese. Our morphological analyzer has done all the necessary work for the recognition and classification of proper names, numerical and temporal expressions, i.e. Named Entity (NE) items in the Japanese text. The analyzer is called "Amorph". Amorph recognizes NE items in two stages: dictionary lookup and rule application. First, it uses several kinds of dictionaries to segment and tag Japanese character strings. Second, based on the information resulting from the dictionary lookup stage, a set of rules is applied to the segmented strings in order to identify NE items. When a segment is found to be an NE item, this information is added to the segment and it is used to generate the final output.
A Practically Unsupervised Learning Method To Identify Single-Snippet Answers To Definition Questions On The Web
We present a practically unsupervised learning method to produce single-snippet answers to definition questions in question answering systems that supplement Web search engines. The method exploits on-line encyclopedias and dictionaries to generate automatically an arbitrarily large number of positive and negative definition examples, which are then used to train an svm to separate the two classes. We show experimentally that the proposed method is viable, that it outperforms the alternative of training the system on questions and news articles from trec, and that it helps the search engine handle definition questions significantly better.
Lexically-Based Terminology Structuring : Some Inherent Limits
Terminology structuring has been the subject of much work in the context of terms extracted from corpora: given a set of terms, obtained from an existing resource or extracted from a corpus, identifying hierarchical (or other types of) relations between these terms. The present paper focusses on terminology structuring by lexical methods, which match terms on the basis on their content words, taking morphological variants into account. Experiments are done on a 'flat' list of terms obtained from an originally hierarchically-structured terminology: the French version of the US National Library of Medicine MeSH thesaurus. We compare the lexically-induced relations with the original MeSH relations: after a quantitative evaluation of their congruence through recall and precision metrics, we perform a qualitative, human analysis ofthe 'new' relations not present in the MeSH. This analysis shows, on the one hand, the limits of the lexical structuring method. On the other hand, it also reveals some specific structuring choices and naming conventions made by the MeSH designers, and emphasizes ontological commitments that cannot be left to automatic structuring.
Alignment And Extraction Of Bilingual Legal Terminology From Context Profiles
In this study, we propose a knowledge-independent method for aligning terms and thus extracting translations from a small, domain-specific corpus consisting of parallel English and Chinese court judgments from Hong Kong. With a sentence-aligned corpus, translation equivalences are suggested by analysing the frequency profiles of parallel concordances. The method overcomes the limitations of conventional statistical methods which require large corpora to be effective, and lexical approaches which depend on existing bilingual dictionaries. Pilot testing on a parallel corpus of about 113K Chinese words and 120K English words gives an encouraging 85% precision and 45% recall. Future work includes fine-tuning the algorithm upon the analysis of the errors, and acquiring a translation lexicon for legal terminology by filtering out general terms.
Coedition To Share Text Revision Across Languages And Improve MT A Posteriori
Coedition of a natural language text and its representation in some interlingual form seems the best and simplest way to share text revision across languages. For various reasons, UNL graphs are the best candidates in this context. We are developing a prototype where, in the simplest sharing scenario, naive users interact directly with the text in their language (L0), and indirectly with the associated graph. The modified graph is then sent to the UNL-L0 deconverter and the result shown. If is is satisfactory, the errors were probably due to the graph, not to the deconverter, and the graph is sent to deconverters in other languages. Versions in some other languages known by the user may be displayed, so that improvement sharing is visible and encouraging. As new versions are added with appropriate tags and attributes in the original multilingual document, nothing is ever lost, and cooperative working on a document is rendered feasible. On the internal side, liaisons are established between elements of the text and the graph by using broadly available resources such as a LO-English or better a L0-UNL dictionary, a morphosyntactic parser of L0, and a canonical graph2tree transformation. Establishing a "best" correspondence between the "UNL-tree+L0" and the "MS-L0 structure", a lattice, may be done using the dictionary and trying to align the tree and the selected trajectory with as few crossing liaisons as possible. A central goal of this research is to merge approaches from pivot MT, interactive MT, and multilingual text authoring.
Unsupervised Learning Of Word Sense Disambiguation Rules By Estimating An Optimum Iteration Number In The EM Algorithm
In this paper, we improve an unsupervised learning method using the Expectation-Maximization (EM) algorithm proposed by Nigam et al. for text classification problems in order to apply it to word sense disambiguation (WSD) problems. The improved method stops the EM algorithm at the optimum iteration number. To estimate that number, we propose two methods. In experiments, we solved 50 noun WSD problems in the Japanese Dictionary Task in SENSEVAL2. The score of our method is a match for the best public score of this task. Furthermore, our methods were confirmed to be effective also for verb WSD problems.
Modeling User Language Proficiency In A Writing Tutor For Deaf Learners Of English
In this paper we discuss a proposed user knowledge modeling architecture for the ICICLE system, a language tutoring application for deaf learners of written English. The model will represent the language proficiency of the user and is designed to be referenced during both writing analysis and feedback production. We motivate our model design by citing relevant research on second language and cognitive skill acquisition, and briefly discuss preliminary empirical evidence supporting the design. We conclude by showing how our design can provide a rich and robust information base to a language assessment / correction application by modeling user proficiency at a high level of granularity and specificity.
Using Decision Trees to Construct a Practical Parser
This paper describes novel and practical Japanese parsers that uses decision trees. First, we construct a single decision tree to estimate modification probabilities; how one phrase tends to modify another. Next, we introduce a boosting algorithm in which several decision trees are constructed and then combined for probability estimation. The two constructed parsers are evaluated by using the EDR Japanese annotated corpus. The single-tree method outperforms the conventional Japanese stochastic methods by 4%. Moreover, the boosting version is shown to have significant advantages; 1) better parsing accuracy than its single-tree counterpart for any amount of training data and 2) no over-fitting to data for various iterations.
Automatic Estimation of Word Significance oriented for Speech-based Information Retrieval
Automatic estimation of word significance oriented for speech-based Information Retrieval (IR) is addressed. Since the significance of words differs in IR, automatic speech recognition (ASR) performance has been evaluated based on weighted word error rate (WWER), which gives a weight on errors from the viewpoint of IR, instead of word error rate (WER), which treats all words uniformly. A decoding strategy that minimizes WWER based on a Minimum Bayes-Risk framework has been shown, and the reduction of errors on both ASR and IR has been reported. In this paper, we propose an automatic estimation method for word significance (weights) based on its influence on IR. Specifically, weights are estimated so that evaluation measures of ASR and IR are equivalent. We apply the proposed method to a speech-based information retrieval system, which is a typical IR system, and show that the method works well.
Paraphrasing Depending on Bilingual Context Toward Generalization of Translation Knowledge
This study presents a method to automatically acquire paraphrases using bilingual corpora, which utilizes the bilingual dependency relations obtained by projecting a monolingual dependency parse onto the other language sentence based on statistical alignment techniques. Since the paraphrasing method is capable of clearly disambiguating the sense of an original phrase using the bilingual context of dependency relation, it would be possible to obtain interchangeable paraphrases under a given context. Also, we provide an advanced method to acquire generalized translation knowledge using the extracted paraphrases. We applied the method to acquire the generalized translation knowledge for Korean-English translation. Through experiments with parallel corpora of a Korean and English language pairs, we show that our paraphrasing method effectively extracts paraphrases with high precision, 94.3% and 84.6% respectively for Korean and English, and the translation knowledge extracted from the bilingual corpora could be generalized successfully using the paraphrases with the 12.5% compression ratio.
Statistics Learning And Universal Grammar : Modeling Word Segmentation
This paper describes a computational model of word segmentation and presents simulation results on realistic acquisition. In particular, we explore the capacity and limitations of statistical learning mechanisms that have recently gained prominence in cognitive psychology and linguistics.
Automatic Construction Of A Transfer Dictionary Considering Directionality
In this paper, we show how to construct a transfer dictionary automatically. Dictionary construction, one of the most difficult tasks in developing a machine translation system, is expensive. To avoid this problem, we investigate how we build a dictionary using existing linguistic resources. Our algorithm can be applied to any language pairs, but for the present we focus on building a Korean-to-Japanese dictionary using English as a pivot. We attempt three ways of automatic construction to corroborate the effect of the directionality of dictionaries. First, we introduce "one-time look up" method using a Korean-to-English and a Japanese-to-English dictionary. Second, we show a method using "overlapping constraint" with a Korean-to-English dictionary and an English-to-Japanese dictionary. Third, we consider another alternative method rarely used for building a dictionary: an English-to-Korean dictionary and English-to-Japanese dictionary. We found that the first method is the most effective and the best result can be obtained from combining the three methods.
Annotating Discourse Connectives And Their Arguments
This paper describes a new, large scale discourse-level annotation project - the Penn Discourse TreeBank (PDTB). We present an approach to annotating a level of discourse structure that is based on identifying discourse connectives and their arguments. The PDTB is being built directly on top of the Penn TreeBank and Propbank, thus supporting the extraction of useful syntactic and semantic features and providing a richer substrate for the development and evaluation of practical algorithms. We provide a detailed preliminary analysis of inter-annotator agreement - both the level of agreement and the types of inter-annotator variation.
INTEX : A Syntactic Role Driven Protein-Protein Interaction Extractor For Bio-Medical Text
In this paper, we present a fully automated extraction system, named IntEx, to identify gene and protein interactions in biomedical text. Our approach is based on first splitting complex sentences into simple clausal structures made up of syntactic roles. Then, tagging biological entities with the help of biomedical and linguistic ontologies. Finally, extracting complete interactions by analyzing the matching contents of syntactic roles and their linguistically significant combinations. Our extraction system handles complex sentences and extracts multiple and nested interactions specified in a sentence. Experimental evaluations with two other state of the art extraction systems indicate that the IntEx system achieves better performance without the labor intensive pattern engineering requirement.
Distributional Measures Of Concept- Distance : A Task-Oriented Evaluation
We propose a framework to derive the distance between concepts from distributional measures of word co-occurrences. We use the categories in a published thesaurus as coarse-grained concepts, allowing all possible distance values to be stored in a concept-concept matrix roughly.01% the size of that created by existing measures. We show that the newly proposed concept-distance measures outperform traditional distributional word-distance measures in the tasks of (1) ranking word pairs in order of semantic distance, and (2) correcting real-word spelling errors. In the latter task, of all the WordNet-based measures, only that proposed by Jiang and Conrath outperforms the best distributional concept-distance measures.
Learning to Transform Linguistic Graphs
We argue in favor of the the use of labeled directed graph to represent various types of linguistic structures, and illustrate how this allows one to view NLP tasks as graph transformations. We present a general method for learning such transformations from an annotated corpus and describe experiments with two applications of the method: identification of non-local depenencies (using Penn Treebank data) and semantic role labeling (using Proposition Bank data).
Credibility Improves Topical Blog Post Retrieval
Topical blog post retrieval is the task of ranking blog posts with respect to their relevance for a given topic. To improve topical blog post retrieval we incorporate textual credibility indicators in the retrieval process. We consider two groups of indicators: post level (determined using information about individual blog posts only) and blog level (determined using information from the underlying blogs). We describe how to estimate these indicators and how to integrate them into a retrieval approach based on language models. Experiments on the TREC Blog track test set show that both groups of credibility indicators significantly improve retrieval effectiveness; the best performance is achieved when combining them.
Lyric-based Song Sentiment Classification with Sentiment Vector Space Model
Lyric-based song sentiment classification seeks to assign songs appropriate sentiment labels such as light-hearted heavy-hearted. Four problems render vector space model (VSM)-based text classification approach ineffective: 1) Many words within song lyrics actually contribute little to sentiment; 2) Nouns and verbs used to express sentiment are ambiguous; 3) Negations and modifiers around the sentiment keywords make particular contributions to sentiment; 4) Song lyric is usually very short. To address these problems, the sentiment vector space model (s-VSM) is proposed to represent song lyric document. The preliminary experiments prove that the s-VSM model outperforms the VSM model in the lyric-based song sentiment classification task.
Source Language Markers in EUROPARL Translations
This paper shows that it is very often possible to identify the source language of medium-length speeches in the EUROPARL corpus on the basis of frequency counts of word n-grams (87.2%-96.7% accuracy depending on classification method). The paper also examines in detail which positive markers are most powerful and identifies a number of linguistic aspects as well as culture- and domain-related ones.
Bayesian Semi-Supervised Chinese Word Segmentation for Statistical Machine Translation
Words in Chinese text are not naturally separated by delimiters, which poses a challenge to standard machine translation (MT) systems. In MT, the widely used approach is to apply a Chinese word segmenter trained from manually annotated data, using a fixed lexicon. Such word segmentation is not necessarily optimal for translation. We propose a Bayesian semi-supervised Chinese word segmentation model which uses both monolingual and bilingual information to derive a segmentation suitable for MT. Experiments show that our method improves a state-of-the-art MT system in a small and a large data environment.
The Impact of Reference Quality on Automatic MT Evaluation
Language resource quality is crucial in NLP. Many of the resources used are derived from data created by human beings out of an NLP context, especially regarding MT and reference translations. Indeed, automatic evaluations need high-quality data that allow the comparison of both automatic and human translations. The validation of these resources is widely recommended before being used. This paper describes the impact of using different-quality references on evaluation. Surprisingly enough, similar scores are obtained in many cases regardless of the quality. Thus, the limitations of the automatic metrics used within MT are also discussed in this regard.
A Linguistic Knowledge Discovery Tool : Very Large Ngram Database Search with Arbitrary Wildcards
In this paper, we will describe a search tool for a huge set of ngrams. The tool supports queries with an arbitrary number of wildcards. It takes a fraction of a second for a search, and can provide the fillers of the wildcards. The system runs on a single Linux PC with reasonable size memory (less than 4GB) and disk space (less than 400GB). This system can be a very useful tool for linguistic knowledge discovery and other NLP tasks.
Unsupervised Learning of Bulgarian POS Tags
This paper presents an approach to the unsupervised learning of parts of speech which uses both morphological and syntactic information. While the model is more complex than those which have been employed for unsupervised learning of POS tags in English, which use only syntactic information, the variety of languages in the world requires that we consider morphology as well. In many languages, morphology provides better clues to a word's category than word order. We present the computational model for POS learning, and present results for applying it to Bulgarian, a Slavic language with relatively free word order and rich morphology.
A Latent Variable Model of Synchronous Parsing for Syntactic and Semantic Dependencies
We propose a solution to the challenge of the CoNLL 2008 shared task that uses a generative history-based latent variable model to predict the most likely derivation of a synchronous dependency parser for both syntactic and semantic dependencies. The submitted model yields 79.1% macro-average F1 performance, for the joint task, 86.9% syntactic dependencies LAS and 71.0% semantic dependencies F1. A larger model trained after the deadline achieves 80.5% macro-average F1, 87.6% syntactic dependencies LAS, and 73.1% semantic dependencies F1.
Integrating Discourse Markers Into A Pipelined Natural Language Generation Architecture
Pipelined Natural Language Generation (NLG) systems have grown increasingly complex as architectural modules were added to support language functionalities such as referring expressions, lexical choice, and revision. This has given rise to discussions about the relative placement of these new modules in the overall architecture. Recent work on another aspect of multi-paragraph text, discourse markers, indicates it is time to consider where a discourse marker insertion algorithm fits in. We present examples which suggest that in a pipelined NLG architecture, the best approach is to strongly tie it to a revision component. Finally, we evaluate the approach in a working multi-page system.
Multi-Tagging For Lexicalized-Grammar Parsing
With performance above 97% accuracy for newspaper text, part of speech (pos) tagging might be considered a solved problem. Previous studies have shown that allowing the parser to resolve pos tag ambiguity does not improve performance. However, for grammar formalisms which use more fine-grained grammatical categories, for example tag and ccg, tagging accuracy is much lower. In fact, for these formalisms, premature ambiguity resolution makes parsing infeasible. We describe a multi-tagging approach which maintains a suitable level of lexical category ambiguity for accurate and efficient ccg parsing. We extend this multi-tagging approach to the pos level to overcome errors introduced by automatically assigned pos tags. Although pos tagging accuracy seems high, maintaining some pos tag ambiguity in the language processing pipeline results in more accurate ccg supertagging.
Discursive Usage Of Six Chinese Punctuation Marks
Both rhetorical structure and punctuation have been helpful in discourse processing. Based on a corpus annotation project, this paper reports the discursive usage of 6 Chinese punctuation marks in news commentary texts: Colon, Dash, Ellipsis, Exclamation Mark, Question Mark, and Semicolon. The rhetorical patterns of these marks are compared against patterns around cue phrases in general. Results show that these Chinese punctuation marks, though fewer in number than cue phrases, are easy to identify, have strong correlation with certain relations, and can be used as distinctive indicators of nuclearity in Chinese texts.
Partial Descriptions And Systemic Grammar
This paper examines the properties of feature-based partial descriptions built on top of Halliday's systemic networks. We show that the crucial operation of consistency checking for such descriptions is NP-complete, and therefore probably intractable, but proceed to develop algorithms which can sometimes alleviate the unpleasant consequences of this intractability.
Classifier Assignment By Corpus-Based Approach
This paper presents an algorithm for selecting an appropriate classifier word for a noun. In Thai language, it frequently happens that there is fluctuation in the choice of classifier for a given concrete noun, both from the point of view of the whole speech community and individual speakers. Basically, there is no exact rule for classifier selection. As far as we can do in the rule-based approach is to give a default rule to pick up a corresponding classifier of each noun. Registration of classifier for each noun is limited to the type of unit classifier because other types are open due to the meaning of representation. We propose a corpus-based method (Biber,1993; Nagao,1993; Smadja,1993) which generates Noun Classifier Associations (NCA) to overcome the problems in classifier assignment and semantic construction of noun phrase. The NCA is created statistically from a large corpus and recomposed under concept hierarchy constraints and frequency of occurrences.
An Unsupervised Learning Method For Associative Relationships Between Verb Phrases
This paper describes an unsupervised learning method for associative relationships between verb phrases, which is important in developing reliable Q&A systems. Consider the situation that a user gives a query "How much petrol was imported to Japan from Saudi Arabia?" to a Q&A system, but the text given to the system includes only the description "X tonnes of petrol was conveyed to Japan from Saudi Arabia". We think that the description is a good clue to find the answer for our query, "X tonnes". But there is no large-scale database that provides the associative relationship between "imported" and "conveyed". Our aim is to develop an unsupervised learning method that can obtain such an associative relationship, which we call scenario consistency. The method we are currently working on uses an expectation-maximization (EM) based word-clustering algorithm, and we have evaluated the effectiveness of this method using Japanese verb phrases.
Automatic Learning Of Language Model Structure
Statistical language modeling remains a challenging task, in particular for morphologically rich languages. Recently, new approaches based on factored language models have been developed to address this problem. These models provide principled ways of including additional conditioning variables other than the preceding words, such as morphological or syntactic features. However, the number of possible choices for model parameters creates a large space of models that cannot be searched exhaustively. This paper presents an entirely data-driven model selection procedure based on genetic search, which is shown to outperform both knowledge-based and random selection procedures on two different language modeling tasks (Arabic and Turkish).
Montagovian Definite Clause Grammar
This paper reports a completed stage of ongoing research at the University of York. Landsbergen's advocacy of analytical inverses for compositional syntax rules encourages the application of Definite Clause Grammar techniques to the construction of a parser returning Montague analysis trees. A parser MDCC is presented which implements an augmented Friedman - Warren algorithm permitting post referencing* and interfaces with a language of intenslonal logic translator LILT so as to display the derivational history of corresponding reduced IL formulae. Some familiarity with Montague's PTQ and the basic DCG mechanism is assumed.
An Approach To Sentence-Level Anaphora In Machine Translation
Theoretical research in the area of machine translation usually involves the search for and creation of an appropriate formalism. An important issue in this respect is the way in which the compositionality of translation is to be defined. In this paper, we will introduce the anaphoric component of the Mimo formalism. It makes the definition and translation of anaphoric relations possible, relations which are usually problematic for systems that adhere to strict compositionality. In Mimo, the translation of anaphoric relations is compositional. The anaphoric component is used to define linguistic phenomena such as wh-movement, the passive and the binding of reflexives and pronouns mono-lingually. The actual working of the component will be shown in this paper by means of a detailed discussion of wh-movement.
Interpretation Of Nominal Compounds : Combining Domain-Independent And Domain-Specific Information
A domain independent model is proposed for the automated interpretation of nominal compounds in English. This model is meant to account for productive rules of interpretation which are inferred from the morpho-syntactic and semantic characteristics of the nominal constituents. In particular, we make extensive use of Pustejovsky's principles concerning the predicative information associated with nominals. We argue that it is necessary to draw a line between generalizable semantic principles and domain-specific semantic information. We explain this distinction and we show how this model may be applied to the interpretation of compounds in real texts, provided that complementary semantic information are retrieved.
Modeling Local Coherence: An Entity-Based Approach
This paper considers the problem of automatic assessment of local coherence. We present a novel entity-based representation of discourse which is inspired by Centering Theory and can be computed automatically from raw text. We view coherence assessment as a ranking learning problem and show that the proposed discourse representation supports the effective learning of a ranking function. Our experiments demonstrate that the induced model achieves significantly higher accuracy than a state-of-the-art coherence model.
Using Conditional Random Fields For Sentence Boundary Detection In Speech
Sentence boundary detection in speech is important for enriching speech recognition output, making it easier for humans to read and downstream modules to process. In previous work, we have developed hidden Markov model (HMM) and maximum entropy (Maxent) classifiers that integrate textual and prosodic knowledge sources for detecting sentence boundaries. In this paper, we evaluate the use of a conditional random field (CRF) for this task and relate results with this model to our prior work. We evaluate across two corpora (conversational telephone speech and broadcast news speech) on both human transcriptions and speech recognition output. In general, our CRF model yields a lower error rate than the HMM and Max-ent models on the NIST sentence boundary detection task in speech, although it is interesting to note that the best results are achieved by three-way voting among the classifiers. This probably occurs because each model has different strengths and weaknesses for modeling the knowledge sources.
Using Emoticons To Reduce Dependency In Machine Learning Techniques For Sentiment Classification
Sentiment Classification seeks to identify a piece of text according to its author's general feeling toward their subject, be it positive or negative. Traditional machine learning techniques have been applied to this problem with reasonable success, but they have been shown to work well only when there is a good match between the training and test data with respect to topic. This paper demonstrates that match with respect to domain and time is also important, and presents preliminary experiments with training data labeled with emoticons, which has the potential of being independent of domain, topic and time.
Trend Survey on Japanese Natural Language Processing Studies over the Last Decade
Using natural language processing, we carried out a trend survey on Japanese natural language processing studies that have been done over the last ten years. We determined the changes in the number of papers published for each research organization and on each research area as well as the relationship between research organizations and research areas. This paper is useful for both recognizing trends in Japanese NLP and constructing a method of supporting trend surveys using NLP.
Finding Content-Bearing Terms Using Term Similarities
This paper explores the issue of using different co-occurrence similarities between terms for separating query terms that are useful for retrieval from those that are harmful. The hypothesis under examination is that useful terms tend to be more similar to each other than to other query terms. Preliminary experiments with similarities computed using first-order and second-order co-occurrence seem to confirm the hypothesis. Term similarities could then be used for determining which query terms are useful and best reflect the user's information need. A possible application would be to use this source of evidence for tuning the weights of the query terms.
The Structure Of Communicative Context Of Dialogue Interaction
We propose a draft scheme of the model formalizing the structure of communicative context in dialogue interaction. The relationships between the interacting partners are considered as system of three automata representing the partners of the dialogue and environment.
Non-Deterministic Recursive Ascent Parsing
A purely functional implementation of LR-parsers is given, together with a simple correctness proof. It is presented as a generalization of the recursive descent parser. For non-LR grammars the time-complexity of our parser is cubic if the functions that constitute the parser are implemented as memo-functions, i.e. functions that memorize the results of previous invocations. Memo-functions also facilitate a simple way to construct a very compact representation of the parse forest. For LR(0) grammars, our algorithm is closely related to the recursive ascent parsers recently discovered by Kruse-man Aretz [1] and Roberts [2]. Extended CF grammars (grammars with regular expressions at the right hand side) can be parsed with a simple modification of the LR-parser for normal CF grammars.
Probabilistic CFG With Latent Annotations
This paper defines a generative probabilistic model of parse trees, which we call PCFG-LA. This model is an extension of PCFG in which non-terminal symbols are augmented with latent variables. Finegrained CFG rules are automatically induced from a parsed corpus by training a PCFG-LA model using an EM-algorithm. Because exact parsing with a PCFG-LA is NP-hard, several approximations are described and empirically compared. In experiments using the Penn WSJ corpus, our automatically trained model gave a performance of 86.6% (F1, sentences < 40 words), which is comparable to that of an unlexicalized PCFG parser created using extensive manual feature selection.
Exploring Various Knowledge In Relation Extraction
Extracting semantic relationships between entities is challenging. This paper investigates the incorporation of diverse lexical, syntactic and semantic knowledge in feature-based relation extraction using SVM. Our study illustrates that the base phrase chunking information is very effective for relation extraction and contributes to most of the performance improvement from syntactic aspect while additional information from full parsing gives limited further enhancement. This suggests that most of useful information in full parse trees for relation extraction is shallow and can be captured by chunking. We also demonstrate how semantic information such as WordNet and Name List, can be used in feature-based relation extraction to further improve the performance. Evaluation on the ACE corpus shows that effective incorporation of diverse features enables our system outperform previously best-reported systems on the 24 ACE relation subtypes and significantly outperforms tree kernel-based systems by over 20 in F-measure on the 5 ACE relation types.
Automatic Acquisition Of Adjectival Subcategorization From Corpora
This paper describes a novel system for acquiring adjectival subcategorization frames (scfs) and associated frequency information from English corpus data. The system incorporates a decision-tree classifier for 30 scf types which tests for the presence of grammatical relations (grs) in the output of a robust statistical parser. It uses a powerful pattern-matching language to classify grs into frames hierarchically in a way that mirrors inheritance-based lexica. The experiments show that the system is able to detect scf types with 70% precision and 66% recall rate. A new tool for linguistic annotation of scfs in corpus data is also introduced which can considerably alleviate the process of obtaining training and test data for subcategorization acquisition.
Automatic recognition of French expletive pronoun occurrences
We present a tool, called ILIMP, which takes as input a raw text in French and produces as output the same text in which every occurrence of the pronoun il is tagged either with tag [ANA] for anaphoric or [IMP] for impersonal or expletive. This tool is therefore designed to distinguish between the anaphoric occurrences of il, for which an anaphora resolution system has to look for an antecedent, and the expletive occurrences of this pronoun, for which it does not make sense to look for an antecedent. The precision rate for ILIMP is 97,5%. The few errors are analyzed in detail. Other tasks using the method developed for ILIMP are described briefly, as well as the use of ILIMP in a modular syntactic analysis system.
A PROBLEM SOLVING APPROACH TO GENERATING TEXT FROM SYSTEMIC GRAMMARS
Systemic grammar has been used for AI text generation work in the past, but the implementations have tended be ad hoc or inefficient. This paper presents an approach to systemic text generation where AI problem solving techniques are applied directly to an unadulterated systemic grammar. This approach is made possible by a special relationship between systemic grammar and problem solving: both are organized primarily as choosing from alternatives. The result is simple, efficient text generation firmly based in a linguistic theory.
User Studies And The Design Of Natural Language Systems
This paper presents a critical discussion of the various approaches that have been used in the evaluation of Natural Language systems. We conclude that previous approaches have neglected to evaluate systems in the context of their use, e.g. solving a task requiring data retrieval. This raises questions about the validity of such approaches. In the second half of the paper, we report a laboratory study using the Wizard of Oz technique to identify NL requirements for carrying out this task. We evaluate the demands that task dialogues collected using this technique, place upon a prototype Natural Language system. We identify three important requirements which arose from the task that we gave our subjects: operators specific to the task of database access, complex contextual reference and reference to the structure of the information source. We discuss how these might be satisfied by future Natural Language systems.
LFG Semantics Via Constraints
Semantic theories of natural language associate meanings with utterances by providing meanings for lexical items and rules for determining the meaning of larger units given the meanings of their parts. Traditionally, meanings are combined via function composition, which works well when constituent structure trees are used to guide semantic composition. More recently, the functional structure of LFG has been used to provide the syntactic information necessary for constraining derivations of meaning in a cross-linguistically uniform format. It has been difficult, however, to reconcile this approach with the combination of meanings by function composition.
In contrast to compositional approaches, we present a deductive approach to assembling meanings, based on reasoning with constraints, which meshes well with the unordered nature of information in the functional structure. Our use of linear logic as a 'glue' for assembling meanings also allows for a coherent treatment of modification as well as of the LFG requirements of completeness and coherence.
Splitting The Reference Time: Temporal Anaphora And Quantification In DRT
This paper presents an analysis of temporal anaphora in sentences which contain quantification over events, within the framework of Discourse Representation Theory. The analysis in (Partee, 1984) of quantified sentences, introduced by a temporal connective, gives the wrong truth-conditions when the temporal connective in the subordinate clause is before or after. This problem has been previously analyzed in (de Swart, 1991) as an instance of the proportion problem and given a solution from a Generalized Quantifier approach. By using a careful distinction between the different notions of reference time based on (Kamp and Reyle, 1993), we propose a solution to this problem, within the framework of DRT. We show some applications of this solution to additional temporal anaphora phenomena in quantified sentences.
A Proposal For SLS Evaluation
This paper proposes an automatic, essentially domain-independent means of evaluating Spoken Language Systems (SLS) which combines software we have developed for that purpose (the "Comparator") and a set of specifications for answer expressions (the "Common Answer Specification", or CAS). The Comparator checks whether the answer provided by a SLS accords with a canonical answer, returning either true or false. The Common Answer Specification determines the syntax of answer expressions, the minimal content that must be included in them, the data to be included in and excluded from test corpora, and the procedures used by the Comparator. Though some details of the CAS are particular to individual domains, the Comparator software is domain-independent, as is the CAS approach.
Speech and Text-Image Processing in Documents
Two themes have evolved in speech and text image processing work at Xerox PARC that expand and redefine the role of recognition technology in document-oriented applications. One is the development of systems that provide functionality similar to that of text processors but operate directly on audio and scanned image data. A second, related theme is the use of speech and text-image recognition to retrieve arbitrary, user-specified information from documents with signal content. This paper discusses three research initiatives at PARC that exemplify these themes: a text-image editor[1], a wordspotter for voice editing and indexing[12], and a decoding framework for scanned-document content retrieval[4]. The discussion focuses on key concepts embodied in the research that enable novel signal-based document processing functionality.
A Statistical Profile Of The Named Entity Task
In this paper we present a statistical profile of the Named Entity task, a specific information extraction task for which corpora in several languages are available. Using the results of the statistical analysis, we propose an algorithm for lower bound estimation for Named Entity corpora and discuss the significance of the cross-lingual comparisons provided by the analysis.
Using Random Walks For Question-Focused Sentence Retrieval
We consider the problem of question-focused sentence retrieval from complex news articles describing multi-event stories published over time. Annotators generated a list of questions central to understanding each story in our corpus. Because of the dynamic nature of the stories, many questions are time-sensitive (e.g. "How many victims have been found?"). Judges found sentences providing an answer to each question. To address the sentence retrieval problem, we apply a stochastic, graph-based method for comparing the relative importance of the textual units, which was previously used successfully for generic summarization. Currently, we present a topic-sensitive version of our method and hypothesize that it can outperform a competitive baseline, which compares the similarity of each sentence to the input question via IDF-weighted word overlap. In our experiments, the method achieves a TRDR score that is significantly higher than that of the baseline.
A Formal Model For Context-Free Languages Augmented With Reduplication
A model is presented to characterize the class of languages obtained by adding reduplication to context-free languages. The model is a pushdown automaton augmented with the ability to check reduplication by using the stack in a new way. The class of languages generated is shown to lie strictly between the context-free languages and the indexed languages. The model appears capable of accommodating the sort of reduplications that have been observed to occur in natural languages, but it excludes many of the unnatural constructions that other formal models have permitted.
Some remarks on the Annotation of Quantifying Noun Groups in Treebanks
This article is devoted to the problem of quantifying noun groups in German. After a thorough description of the phenomena, the results of corpus-based investigations are described. Moreover, some examples are given that underline the necessity of integrating some kind of information other than grammar sensu stricto into the treebank. We argue that a more sophisticated and fine-grained annotation in the tree-bank would have very positve effects on stochastic parsers trained on the tree-bank and on grammars induced from the treebank, and it would make the treebank more valuable as a source of data for theoretical linguistic investigations. The information gained from corpus research and the analyses that are proposed are realized in the framework of SILVA, a parsing and extraction tool for German text corpora.
Formal Constraints on Metarules
Metagrammatical formalisms that combine context-free phrase structure rules and metarules (MPS grammars) allow concise statement of generalizations about the syntax of natural languages. Unconstrained MPS grammars, unfortunately, are not computationally safe. We evaluate several proposals for constraining them, basing our assessment on computational tractability and explanatory adequacy. We show that none of them satisfies both criteria, and suggest new directions for research on alternative metagrammatical formalisms.
A CENTERING APPROACH TO PRONOUNS
In this paper we present a formalization of the centering approach to modeling attentional structure in discourse and use it as the basis for an algorithm to track discourse context and bind pronouns. As described in [GJW86], the process of centering attention on entities in the discourse gives rise to the intersentential transitional states of continuing, retaining and shifting. We propose an extension to these states which handles some additional cases of multiple ambiguous pronouns. The algorithm has been implemented in an HPSG natural language system which serves as the interface to a database query application.
Compilation of HPSG to TAG
We present an implemented compilation algorithm that translates HPSG into lexicalized feature-based TAG, relating concepts of the two theories. While HPSG has a more elaborated principle-based theory of possible phrase structures, TAG provides the means to represent lexicalized structures more explicitly. Our objectives are met by giving clear definitions that determine the projection of structures from the lexicon, and identify maximal projections, auxiliary trees and foot nodes.
Fast Context-Free Parsing Requires Fast Boolean Matrix Multiplication
Valiant showed that Boolean matrix multiplication (BMM) can be used for CFG parsing. We prove a dual result: CFG parsers running in time O(|G||w|3-e) on a grammar G and a string w can be used to multiply m x m Boolean matrices in time O(m3-e/3). In the process we also provide a formal definition of parsing motivated by an informal notion due to Lang. Our result establishes one of the first limitations on general CFG parsing: a fast, practical CFG parser would yield a fast, practical BMM algorithm, which is not believed to exist.
Efficient Generation in Primitive Optimality Theory
This paper introduces primitive Optimality Theory (OTP), a linguistically motivated formalization of OT. OTP specifies the class of autosegmental representations, the universal generator Gen, and the two simple families of permissible constraints. In contrast to less restricted theories using Generalized Alignment, OTP's optimal surface forms can be generated with finite-state methods adapted from (Ellison, 1994). Unfortunately these methods take time exponential on the size of the grammar. Indeed the generation problem is shown NP-complete in this sense. However, techniques are discussed for making Ellison's approach fast in the typical case, including a simple trick that alone provides a 100-fold speedup on a grammar fragment of moderate size. One avenue for future improvements is a new finite-state notion, factored automata, where regular languages are represented compactly via formal intersections of FSAs.
Towards resolution of bridging descriptions
We present preliminary results concerning robust techniques for resolving bridging definite descriptions. We report our analysis of a collection of 20 Wall Street Journal articles from the Penn Treebank Corpus and our experiments with WordNet to identify relations between bridging descriptions and their antecedents.
A semantically-derived subset of English for hardware verification
To verify hardware designs by model checking, circuit specifications are commonly expressed in the temporal logic CTL. Automatic conversion of English to CTL requires the definition of an appropriately restricted subset of English. We show how the limited semantic expressibility of CTL can be exploited to derive a hierarchy of subsets. Our strategy avoids potential difficulties with approaches that take existing computational semantic analyses of English as their starting point--such as the need to ensure that all sentences in the subset possess a CTL translation.
What To Do When Lexicalization Fails: Parsing German With Suffix Analysis And Smoothing
In this paper, we present an unlexicalized parser for German which employs smoothing and suffix analysis to achieve a labelled bracket F-score of 76.2, higher than previously reported results on the NEGRA corpus. In addition to the high accuracy of the model, the use of smoothing in an unlexicalized parser allows us to better examine the interplay between smoothing and parsing results.
Alignment Model Adaptation For Domain-Specific Word Alignment
This paper proposes an alignment adaptation approach to improve domain-specific (in-domain) word alignment. The basic idea of alignment adaptation is to use out-of-domain corpus to improve in-domain word alignment results. In this paper, we first train two statistical word alignment models with the large-scale out-of-domain corpus and the small-scale in-domain corpus respectively, and then interpolate these two models to improve the domain-specific word alignment. Experimental results show that our approach improves domain-specific word alignment in terms of both precision and recall, achieving a relative error rate reduction of 6.56% as compared with the state-of-the-art technologies.
An Information-State Approach To Collaborative Reference
We describe a dialogue system that works with its interlocutor to identify objects. Our contributions include a concise, modular architecture with reversible processes of understanding and generation, an information-state model of reference, and flexible links between semantics and collaborative problem solving.
AN APPROACH TO NATURAL LANGUAGE IN THE SI-NETS PARADIGM
This article deals with the interpretation of conceptual operations underlying the communicative use of natural language (NL) within the Structured Inheritance Network (SI-Nets) paradigm. The operations are reduced to functions of a formal language, thus changing the level of abstraction of the operations to be performed on SI-Nets. In this sense, operations on SI-Nets are not merely isomorphic to single epistemological objects, but can be viewed as a simulation of processes on a different level, that pertaining to the conceptual system of NL. For this purpose, we have designed a version of KL-ONE which represents the epistemological level, while the new experimental language, KL-Conc, represents the conceptual level. KL-Conc would seem to be a more natural and intuitive way of interacting with SI-Nets.
Iteration, Habituality And Verb Form Semantics
The verb forms are often claimed to convey two kinds of information : 1. whether the event described in a sentence is present, past or future (= deictic information) 2. whether the event described in a sentence is presented as completed, going on, just starting or being finished (= aspectual information). It will be demonstrated in this paper that one has to add a third component to the analysis of verb form meanings, namely whether or not they express habituality. The framework of the analysis is model-theoretic semantics.
A Language For The Statement Of Binary Relations Over Feature Structures
Unification is often the appropriate method for expressing relations between representations in the form of feature structures; however, there are circumstances in which a different approach is desirable. A declarative formalism is presented which permits direct mappings of one feature structure into another, and illustrative examples are given of its application to areas of current interest.
A Discourse Copying Algorithm for Ellipsis and Anaphora Resolution
We give an analysis of ellipsis resolution in terms of a straightforward discourse copying algorithm that correctly predicts a wide range of phenomena. The treatment does not suffer from problems inherent in identity-of-relations analyses. Furthermore, in contrast to the approach of Dalrymple et al. [1991], the treatment directly encodes the intuitive distinction between full NPs and the referential elements that corefer with them through what we term role linking. The correct predictions for several problematic examples of ellipsis naturally result. Finally, the analysis extends directly to other discourse copying phenomena.
Representing Text Chunks
Dividing sentences in chunks of words is a useful preprocessing step for parsing, information extraction and information retrieval. (Ramshaw and Marcus, 1995) have introduced a "convenient" data representation for chunking by converting it to a tagging task. In this paper we will examine seven different data representations for the problem of recognizing noun phrase chunks. We will show that the data representation choice has a minor influence on chunking performance. However, equipped with the most suitabledata representation, our memory-based learning chunker was able to improve the best published chunking results for a standard data set.
Tagging French - Comparing A Statistical And A Constraint-Based Method
In this paper we compare two competing approaches to part-of-speech tagging, statistical and constraint-based disambiguation, using French as our test language. We imposed a time limit on our experiment: the amount of time spent on the design of our constraint system was about the same as the time we used to train and test the easy-to-implement statistical model. We describe the two systems and compare the results. The accuracy of the statistical method is reasonably good, comparable to taggers for English. But the constraint-based tagger seems to be superior even with the limited time we allowed ourselves for rule development.
An Annotation Scheme For Discourse-Level Argumentation In Research Articles
In order to build robust automatic abstracting systems, there is a need for better training resources than are currently available. In this paper, we introduce an annotation scheme for scientific articles which can be used to build such a resource in a consistent way. The seven categories of the scheme are based on rhetorical moves of argumentation. Our experimental results show that the scheme is stable, reproducible and intuitive to use.
Automatic Acquisition Of Subcategorization Frames From Tagged Text
This paper describes an implemented program that takes a tagged text corpus and generates a partial list of the subcategorization frames in which each verb occurs. The completeness of the output list increases monotonically with the total occurrences of each verb in the training corpus. False positive rates are one to three percent. Five subcategorization frames are currently detected and we foresee no impediment to detecting many more. Ultimately, we expect to provide a large subcategorization dictionary to the NLP community and to train dictionaries for specific corpora.
Large-Scale Acquisition of LCS-Based Lexicons for Foreign Language Tutoring
We focus on the problem of building large repositories of lexical conceptual structure (LCS) representations for verbs in multiple languages. One of the main results of this work is the definition of a relation between broad semantic classes and LCS meaning components. Our acquisition program - LEXICALL - takes, as input, the result of previous work on verb classification and thematic grid tagging, and outputs LCS representations for different languages. These representations have been ported into English, Arabic and Spanish lexicons, each containing approximately 9000 verbs. We are currently using these lexicons in an operational foreign language tutoring and machine translation.
Semi-Automatic Acquisition Of Domain-Specific Translation Lexicons
We investigate the utility of an algorithm for translation lexicon acquisition (SABLE), used previously on a very large corpus to acquire general translation lexicons, when that algorithm is applied to a much smaller corpus to produce candidates for domain-specific translation lexicons.
SIMULTANEOUS-DISTRIBUTIVE COORDINATION AND CONTEXT-FREENESS
English is shown to be trans-context-free on the basis of coordinations of the respectively type that involve strictly syntactic cross-serial agreement. The agreement in question involves number in nouns and reflexive pronouns and is syntactic rather than semantic in nature because grammatical number in English, like grammatical gender in languages such as French, is partly arbitrary. The formal proof, which makes crucial use of the Interchange Lemma of Ogden et al., is so constructed as to be valid even if English is presumed to contain grammatical sentences in which respectively operates across a pair of coordinate phrases one of whose members has fewer conjuncts than the other; it thus goes through whatever the facts may be regarding constructions with unequal numbers of conjuncts in the scope of respectively, whereas other arguments have foundered on this problem.
A Class-oriented Approach to Building a Paraphrase Corpus
Towards deep analysis of compositional classes of paraphrases, we have examined a class-oriented framework for collecting paraphrase examples, in which sentential paraphrases are collected for each paraphrase class separately by means of automatic candidate generation and manual judgement. Our preliminary experiments on building a paraphrase corpus have so far been producing promising results, which we have evaluated according to cost-efficiency, exhaustiveness, and reliability.
A Construction-Specific Approach to Focused Interaction in Flexible Parsing
A flexible parser can deal with input that deviates from its grammar, in addition to input that conforms to it. Ideally, such a parser will correct the deviant input: sometimes, it will be unable to correct it at all; at other times, correction will be possible, but only to within a range of ambiguous possibilities. This paper is concerned with such ambiguous situations, and with making it as easy as possible for the ambiguity to be resolved through consultation with the user of the parser - we presume interactive use. We show the importance of asking the user for clarification in as focused a way as possible. Focused interaction of this kind is facilitated by a construction-specific approach to flexible parsing, with specialized parsing techniques for each type of construction, and specialized ambiguity representations for each type of ambiguity that a particular construction can give rise to. A construction-specific approach also aids in task-specific language development by allowing a language definition that is natural in terms of the task domain to be interpreted directly without compilation into a uniform grammar formalism, thus greatly speeding the testing of changes to the language definition.
Semantic Caseframe Parsing and Syntactic Generality
We have implemented a restricted domain parser called Plume. Building on previous work at Carnegie-Mellon University e.g. [4, 5, 8], Plume's approach to parsing is based on semantic caseframe instantiation. This has the advantages of efficiency on grammatical input, and robustness in the face of ungrammatical input. While Plume is well adapted to simple declarative and imperative utterances, it handles passives, relative clauses and interrogatives in an ad hoc manner leading to patchy syntactic coverage. This paper outlines Plume as it currently exists and describes our detailed design for extending Plume to handle passives, relative clauses, and interrogatives in a general manner.
Resolving Translation Mismatches With Information Flow
Languages differ in the concepts and real-world entities for which they have words and grammatical constructs. Therefore translation must sometimes be a matter of approximating the meaning of a source language text rather than finding an exact counterpart in the target language. We propose a translation framework based on Situation Theory. The basic ingredients are an information lattice, a representation scheme for utterances embedded in contexts, and a mismatch resolution scheme defined in terms of information flow. We motivate our approach with examples of translation between English and Japanese.
Two-Level, Many-Paths Generation
Large-scale natural language generation requires the integration of vast amounts of knowledge: lexical, grammatical, and conceptual. A robust generator must be able to operate well even when pieces of knowledge are missing. It must also be robust against incomplete or inaccurate inputs. To attack these problems, we have built a hybrid generator, in which gaps in symbolic knowledge are filled by statistical methods. We describe algorithms and show experimental results. We also discuss how the hybrid generation model can be used to simplify current generators and enhance their portability, even when perfect knowledge is in principle obtainable.
Machine Transliteration
It is challenging to translate names and technical terms across languages with different alphabets and sound inventories. These items are commonly transliterated, i.e., replaced with approximate phonetic equivalents. For example, computer in English comes out as ~ i/l:::'=--~-- (konpyuutaa) in Japanese. Translating such items from Japanese back to English is even more challenging, and of practical interest, as transliterated items make up the bulk of text phrases not found in bilingual dictionaries. We describe and evaluate a method for performing backwards transliterations by machine. This method uses a generative model, incorporating several distinct stages in the transliteration process.
Approximating Context-Free Grammars with a Finite-State Calculus
Although adequate models of human language for syntactic analysis and semantic interpretation are of at least context-free complexity, for applications such as speech processing in which speed is important finite-state models are often preferred. These requirements may be reconciled by using the more complex grammar to automatically derive a finite-state approximation which can then be used as a filter to guide speech recognition or to reject many hypotheses at an early stage of processing. A method is presented here for calculating such finite-state approximations from context-free grammars. It is essentially different from the algorithm introduced by Pereira and Wright (1991; 1996), is faster in some cases, and has the advantage of being open-ended and adaptable.
A Part of Speech Estimation Method for Japanese Unknown Words using a Statistical Model of Morphology and Context
We present a statistical model of Japanese unknown words consisting of a set of length and spelling models classified by the character types that constitute a word. The point is quite simple: different character sets should be treated differently and the changes between character types are very important because Japanese script has both ideograms like Chinese (kanji) and phonograms like English (katakana). Both word segmentation accuracy and part of speech tagging accuracy are improved by the proposed model. The model can achieve 96.6% tagging accuracy if unknown words are correctly segmented.
A Pylonic Decision-Tree Language Model with Optimal Question Selection
This paper discusses a decision-tree approach to the problem of assigning probabilities to words following a given text. In contrast with previous decision-tree language model attempts, an algorithm for selecting nearly optimal questions is considered. The model is to be tested on a standard task, The Wall Street Journal, allowing a fair comparison with the well-known tri-gram model.
Two-Level Description Of Turkish Morphology
This poster paper describes a full scale two-level morphological description (Karttunen, 1983; Koskenniemi, 1983) of Turkish word structures. The description has been implemented using the PC-KIMMO environment (Antworth, 1990) and is based on a root word lexicon of about 23,000 roots words. Almost all the special cases of and exceptions to phonological and morphological rules have been implemented. Turkish is an agglutinative language with word structures formed by productive affixations of derivational and inflectional suffixes to root words. Turkish has finite-state but nevertheless rather complex morphotactics. Morphemes added to a root word or a stem can convert the word from a nominal to a verbal structure or vice-versa, or can create adverbial constructs. The surface realizations of morphological constructions are constrained and modified by a number of phonetic rules such as vowel harmony.
TUIT : A Toolkit For Constructing Multilingual TIPSTER User Interfaces
The TIPSTER Architecture has been designed to enable a variety of different text applications to use a set of common text processing modules. Since user interfaces work best when customized for particular applications , it is appropriator that no particular user interface styles or conventions are described in the TIPSTER Architecture specification. However, the Computing Research Laboratory (CRL) has constructed several TIPSTER applications that use a common set of configurable Graphical User Interface (GUI) functions. These GUIs were constructed using CRL's TIPSTER User Interface Toolkit (TUIT). TUIT is a software library that can be used to construct multilingual TIPSTER user interfaces for a set of common user tasks. CRL developed TUIT to support their work to integrate TIPSTER modules for the 6 and 12 month TIPSTER II demonstrations as well as their Oleada and Temple demonstration projects. This paper briefly describes TUIT and its capabilities.
Phonological Comprehension And The Compilation Of Optimality Theory
This paper ties up some loose ends in finite-state Optimality Theory. First, it discusses how to perform comprehension under Optimality Theory grammars consisting of finite-state constraints. Comprehension has not been much studied in OT; we show that unlike production, it does not always yield a regular set, making finite-state methods inapplicable. However, after giving a suitably flexible presentation of OT, we show carefully how to treat comprehension under recent variants of OTin which grammars can be compiled into finite-state transducers. We then unify these variants, showing that compilation is possible if all components of the grammar are regular relations, including the harmony ordering on scored candidates.
Learning Correlations between Linguistic Indicators and Semantic Constraints : Reuse of Context-Dependent Descriptions of Entities
This paper presents the results of a study on the semantic constraints imposed on lexical choice by certain contextual indicators. We show how such indicators are computed and how correlations between them and the choice of a noun phrase description of a named entity can be automatically established using supervised learning. Based on this correlation, we have developed a technique for automatic lexical choice of descriptions of entities in text generation. We discuss the underlying relationship between the pragmatics of choosing an appropriate description that serves a specific purpose in the automatically generated text and the semantics of the description itself. We present our work in the framework of the more general concept of reuse of linguistic structures that are automatically extracted from large corpora. We present a formal evaluation of our approach and we conclude with some thoughts on potential applications of our method.
Evaluation In The ARPA Machine Translation Program : 1993 Methodology
In the second year of evaluations of the ARPA HLT Machine Translation (MT) Initiative, methodologies developed and tested in 1992 were applied to the 1993 MT test runs. The current methodology optimizes the inherently subjective judgments on translation accuracy and quality by channeling the judgments of non-translators into many data points which reflect both the comparison of the performance of the research MT systems with production MT systems and against the performance of novice translators. This paper discusses the three evaluation methods used in the 1993 evaluation, the results of the evaluations , and preliminary characterizations of the Winter 1994 evaluation, now underway. The efforts under discussion focus on measuring the progress of core MT technology and increasing the sensitivity and portability of MT evaluation methodology.
Integrating Shallow Linguistic Processing Into A Unification-Based Spanish Grammar
This paper describes to what extent deep processing may benefit from shallow techniques and it presents a NLP system which integrates a linguistic PoS tagger and chunker as a preprocessing module of a broad coverage unification based grammar of Spanish. Experiments show that the efficiency of the overall analysis improves significantly and that our system also provides robustness to the linguistic processing while maintaining both the accuracy and the precision of the grammar.
Parsing And Subcategorization Data
In this paper, we compare the performance of a state-of-the-art statistical parser (Bikel, 2004) in parsing written and spoken language and in generating sub-categorization cues from written and spoken language. Although Bikel's parser achieves a higher accuracy for parsing written language, it achieves a higher accuracy when extracting subcategorization cues from spoken language. Our experiments also show that current technology for extracting subcategorization frames initially designed for written texts works equally well for spoken language. Additionally, we explore the utility of punctuation in helping parsing and extraction of subcategorization cues. Our experiments show that punctuation is of little help in parsing spoken language and extracting subcategorization cues from spoken language. This indicates that there is no need to add punctuation in transcribing spoken corpora simply in order to help parsers.
Character-Based Collocation For Mandarin Chinese
This paper describes a characters-based Chinese collocation system and discusses the advantages of it over a traditional word-based system. Since wordbreaks are not conventionally marked in Chinese text corpora, a character-based collocation system has the dual advantages of avoiding pre-processing distortion and directly accessing sub-lexical information. Furthermore, word-based collocational properties can be obtained through an auxiliary module of automatic segmentation.
Efficient Parsing Of Highly Ambiguous Context-Free Grammars With Bit Vectors
An efficient bit-vector-based CKY-style parser for context-free parsing is presented. The parser computes a compact parse forest representation of the complete set of possible analyses for large treebank grammars and long input sentences. The parser uses bit-vector operations to parallelise the basic parsing operations. The parser is particularly useful when all analyses are needed rather than just the most probable one.
Automatic Question Answering : Beyond The Factoid
In this paper we describe and evaluate a Question Answering system that goes beyond answering factoid questions. We focus on FAQ-like questions and answers , and build our system around a noisy-channel architecture which exploits both a language model for answers and a transformation model for answer/question terms, trained on a corpus of 1 million question/answer pairs collected from the Web.
Unsupervised Learning Of Field Segmentation Models For Information Extraction
The applicability of many current information extraction techniques is severely limited by the need for supervised training data.
We demonstrate that for certain field structured extraction tasks, such as classified advertisements and bibliographic citations, small amounts of prior knowledge can be used to learn effective models in a primarily unsupervised fashion. Although hidden Markov models (HMMs) provide a suitable generative model for field structured text, general unsupervised HMM learning fails to learn useful structure in either of our domains. However, one can dramatically improve the quality of the learned structure by exploiting simple prior knowledge of the desired solutions. In both domains, we found that unsupervised methods can attain accuracies with 400 unlabeled examples comparable to those attained by supervised methods on 50 labeled examples, and that semi-supervised methods can make good use of small amounts of labeled data.
Joint Learning Improves Semantic Role Labeling
Despite much recent progress on accurate semantic role labeling, previous work has largely used independent classifiers, possibly combined with separate label sequence models via Viterbi decoding. This stands in stark contrast to the linguistic observation that a core argument frame is a joint structure, with strong dependencies between arguments. We show how to build a joint model of argument frames, incorporating novel features that model these interactions into discriminative log-linear models. This system achieves an error reduction of 22% on all arguments and 32% on core arguments over a state-of-the art independent classifier for gold-standard parse trees on PropBank.
Organizing English Reading Materials For Vocabulary Learning
We propose a method of organizing reading materials for vocabulary learning. It enables us to select a concise set of reading texts (from a target corpus) that contains all the target vocabulary to be learned. We used a specialized vocabulary for an English certification test as the target vocabulary and used English Wikipedia, a free-content encyclopedia, as the target corpus. The organized reading materials would enable learners not only to study the target vocabulary efficiently but also to gain a variety of knowledge through reading. The reading materials are available on our web site.
NATURAL LANGUAGE INPUT FOR SCENE GENERATION
In this paper a system which understands and conceptualizes scenes descriptions in natural language is presented. Specifically, the following components of the system are described: the syntactic analyzer, based on a Procedural Systemic Grammar, the semantic analyzer relying on the Conceptual Dependency Theory, and the dictionary.
TENSES AS ANAPHORA
A proposal to deal with French tenses in the framework of Discourse Representation Theory is presented, as it has been implemented for a fragment at the IMS. It is based on the theory of tenses of H. Kamp and Ch. Rohrer. Instead of using operators to express the meaning of the tenses the Reichenbachian point of view is adopted and refined such that the impact of the tenses with respect to the meaning of the text is understood as contribution to the integration of the events of a sentence in the event structure of the preceeding text. Thereby a system of relevant times provided by the preceeding text and by the temporal adverbials of the sentence being processed is used. This system consists of one or more reference times and temporal perspective times, the speech time and the location time. The special interest of our proposal is to establish a plausible choice of anchors for the new event out of the system of relevant times and to update this system of temporal coordinates correctly. The problem of choice is largely neglected in the literature. In opposition to the approach of Kamp and Rohrer the exact meaning of the tenses is fixed by the resolution component and not in the process of syntactic analysis.
Talking About Trees
In this paper we introduce a modal language LTfor imposing constraints on trees, and an extension LT (LF) for imposing constraints on trees decorated with feature structures. The motivation for introducing these languages is to provide tools for formalising grammatical frameworks perspicuously, and the paper illustrates this by showing how the leading ideas of GPSG can be captured in LT (LF). In addition, the role of modal languages (and in particular, what we have called as constraint formalisms for linguistic theorising is discussed in some detail.
Parsing with an Extended Domain of Locality
One of the claimed benefits of Tree Adjoining Grammars is that they have an extended domain of locality (EDOL). We consider how this can be exploited to limit the need for feature structure unification during parsing. We compare two wide-coverage lexicalized grammars of English, LEXSYS and XTAG, finding that the two grammars exploit EDOL in different ways.
ParseTalk About Sentence- And Text-Level Anaphora
We provide a unified account of sentence-level and text-level anaphora within the framework of a dependency-based grammar model. Criteria for anaphora resolution within sentence boundaries rephrase major concepts from GB's binding theory, while those for text-level anaphora incorporate an adapted version of a Grosz-Sidner-style focus model.
The MIT Summit Speech Recognition System : A Progress Report
Recently, we initiated a project to develop a phonetically-based spoken language understanding system called SUMMIT. In contrast to many of the past efforts that make use of heuristic rules whose development requires intense knowledge engineering, our approach attempts to express the speech knowledge within a formal framework using well-defined mathematical tools. In our system, features and decision strategies are discovered and trained automatically, using a large body of speech data. This paper describes the system, and documents its current performance.
A Proposal For Lexical Disambiguation
A method of sense resolution is proposed that is based on WordNet, an on-line lexical database that incorporates semantic relations (synonymy, antonymy, hyponymy, meronymy, causal and troponymic entailment) as labeled pointers between word senses. With WordNet, it is easy to retrieve sets of semantically related words, a facility that will be used for sense resolution during text processing, as follows. When a word with multiple senses is encountered, one of two procedures will be followed. Either, (1) words related in meaning to the alternative senses of the polysemous word will be retrieved; new strings will be derived by substituting these related words into the context of the polysemous word; a large textual corpus will then be searched for these derived strings; and that sense will be chosen that corresponds to the derived string that is found most often in the corpus. Or, (2) the context of the polysemous word will be used as a key to search a large corpus; all words found to occur in that context will be noted; WordNet will then be used to estimate the semantic distance from those words to the alternative senses of the polysemous word; and that sense will be chosen that is closest in meaning to other words occurring in the same context If successful, this procedure could have practical applications to problems of information retrieval, mechanical translation, intelligent tutoring systems, and elsewhere.
Dutch Sublanguage Semantic Tagging Combined With Mark-Up Technology
In this paper, we want to show how the morphological component of an existing NLP-system for Dutch (Dutch Medical Language Processor - DMLP) has been extended in order to produce output that is compatible with the language independent modules of the LSP-MLP system (Linguistic String Project - Medical Language Processor) of the New York University. The former can take advantage of the language independent developments of the latter, while focusing on idiosyncrasies for Dutch. This general strategy will be illustrated by a practical application, namely the highlighting of relevant information in a patient discharge summary (PDS) by means of modern HyperText Mark-Up Language (HTML) technology. Such an application can be of use for medical administrative purposes in a hospital environment.
Automatic Extraction Of Subcategorization From Corpora
We describe a novel technique and implemented system for constructing a subcategorization dictionary from textual corpora. Each dictionary entry encodes the relative frequency of occurrence of a comprehensive set of subcategorization classes for English. An initial experiment, on a sample of 14 verbs which exhibit multiple complementation patterns, demonstrates that the technique achieves accuracy comparable to previous approaches, which are all limited to a highly restricted set of subcategorization classes. We also demonstrate that a subcategorization dictionary built with the system improves the accuracy of a parser by an appreciable amount
PROCESSING DICTIONARY DEFINITIONS WITH PHRASAL PATTERN HIERARCHIES
This paper shows how dictionary word sense definitions can be analysed by applying a hierarchy of phrasal patterns. An experimental system embodying this mechanism has been implemented for processing definitions from the Longman Dictionary of Contemporary English. A property of this dictionary, exploited by the system, is that it uses a restricted vocabulary in its word sense definitions. The structures generated by the experimental system are intended to be used for the classification of new word senses in terms of the senses of words in the restricted vocabulary. Examples illustrating the output generated are presented, and some qualitative performance results and problems that were encountered are discussed. The analysis process applies successively more specific phrasal analysis rules as determined by a hierarchy of patterns in which less specific patterns dominate more specific ones. This ensures that reasonable incomplete analyses of the definitions are produced when more complete analyses are not possible, resulting in a relatively robust analysis mechanism. Thus the work reported addresses two robustness problems faced by current experimental natural language processing systems: coping with an incomplete lexicon and with incomplete knowledge of phrasal constructions.
Evaluating Contextual Dependency of Paraphrases using a Latent Variable Model
This paper presents an evaluation method employing a latent variable model for paraphrases with their contexts. We assume that the context of a sentence is indicated by a latent variable of the model as a topic and that the likelihood of each variable can be inferred. A paraphrase is evaluated for whether its sentences are used in the same context. Experimental results showed that the proposed method achieves almost 60% accuracy and that there is not a large performance difference between the two models. The results also revealed an upper bound of accuracy of 77% with the method when using only topic information.
Crossed Serial Dependencies : A low-power parseable extension to GPSG
An extension to the GPSG grammatical formalism is proposed, allowing non-terminals to consist of finite sequences of category labels, and allowing schematic variables to range over such sequences. The extension is shown to be sufficient to provide a strongly adequate grammar for crossed serial dependencies, as found in e.g. Dutch subordinate clauses. The structures induced for such constructions are argued to be more appropriate to data involving conjunction than some previous proposals have been. The extension is shown to be parseable by a simple extension to an existing parsing method for GPSG.