WORD SENSE DISAMBIGUATION

 

   Main Page

   Literature Survey

   Online Thesis

   Online Code-base

   WSD Tools

   WSD Related Links

 

 

 

INDEX

Chapter 1: Introduction

Chapter 2: Word Sense Disambiguation

2.1 WSD Approaches

2.1.1 Corpus Based Approaches

2.1.1.1 Supervised Disambiguation

2.1.1.2 Unsupervised Disambiguation

2.1.2 Knowledge Based Approaches

2.1.2.1 Machine Readable Dictionaries

2.1.2.2 Thesauri

2.1.2.3 Computational Lexicons

2.2 Information Sources for WSD

2.3 General Approach and Related Work

Chapter 3: Word Sense Disambiguation Based on Sense
Similarity and Syntactic Context

3.1 Introduction

3.2 Resources of the Algorithm

3.2.1 Training Corpus

3.2.2 Concept Hierarchy

3.2.3 Similarity Measure

3.2.4 Broad-coverage Parser

3.3 Local Context

3.4 Local Context Database

3.5 Disambiguation Algorithm

3.6 Evaluation and Results

3.6.1 Weak Contexts of the Algorithm

Chapter 4: Conclusion

Chapter 1

 

INTRODUCTION

 

           

      Natural language is an integral part of our lives. Language serves as the primary vehicle by which people communicate and record information. It has the potential for expressing an enormous range of ideas, and for conveying complex thoughts. Because it is so integral to our lives, we usually take its powers and influence for granted.

      The aim of computational linguistics is, in a sense, to capture this power. It tries to enable computers to be used as aids in analyzing and processing natural language, and to understand, by analogy with computers, more about how people process natural language. By understanding language processes in procedural terms, we can give computer systems the ability to generate and interpret natural language. This would make it possible for computers to perform linguistic tasks (such as translation), process textual data (books, journals, newspapers), and make it much easier for people to access computer-stored data.

      In the field of computational linguistics, some results have already been obtained however, a number of important research problems have not been solved yet. Ambiguity is one of these problems which have been a great challenge for computational linguists. In general, people are unaware of the ambiguities in the language they use because they are very good at resolving them using context and their knowledge of the world. But computer systems do not have this knowledge, and consequently do not do a good job of making use of the context.

      Something is ambiguous when it can be understood in two or more possible ways or when it has more than one meaning. If the ambiguity is in a sentence or clause, it is called structural (syntactic) ambiguity. If it is in a single word, it is called lexical ambiguity.

      For the structural ambiguity, consider the sentence “The man saw the girl with the telescope”. This sentence is ambiguous since it can be interpreted in two ways: The man saw the girl who possessed the telescope or, the man saw the girl with the aid of the telescope. However, the sentence “The man saw the girl with a red hat” is not ambiguous for a human reader (people has the knowledge that a hat cannot be used to see), while it has the same ambiguity as the previous example for a computer.

      Lexical semantic ambiguity occurs when a single word is associated with multiple senses. In this thesis, we will be focusing on lexical semantic ambiguity. Examples of lexical ambiguity are everywhere. In fact, almost any word has more than one meaning. For example, consider the noun party. It can refer to at least 5 different things as follows:

·         an organization to gain political power

·         an occasion on which people can assemble for social interaction and entertainment

·         a band of people associated temporarily in some activity

·         a group of people gathered together for pleasure

·         a person involved in legal proceedings

All of the above senses can be subsumed in just one sense such as “group of people”, however, for various applications, such as information retrieval or machine translation, it is important to be able to distinguish between the different senses of a word. In a machine translation application, different senses of a word may be represented with different words in the target language. In order to correctly translate a text in one language to another, firstly we have to know the senses of the words and then find the best translation equivalent in the target language. Apart from these, for many other words there is no such general sense like the one for the noun party.

Lexical ambiguity can refer to both homonymy and polysemy. Homonyms are words that are written the same way, but are (historically or conceptually) really two different words with different meanings which seem unrelated. Examples are suit (“lawsuit” and “set of garments”) and bank (“river bank” and “financial institution”). If a word’s meanings are related, it is called a polyseme. The word party is polysemous because its senses can be generalized as “group of people”, that is they are related.

Now let us consider the meaning of the noun party in the following sentence:

 

Mr. Smith’s party took 38% of the votes in the last elections.

 

      It is clear to a human reader that the noun party is in the sense “an organization to gain political power” in the sentence above. Most people are not even aware of the ambiguity contained in the sentence. Humans are so skilled at resolving potential ambiguities that they do not realize they are doing it. There has been research on how people resolve ambiguities; however we still do not know exactly how humans do lexical disambiguation. Therefore, it is a difficult task to teach a computer to do the same thing. If there is more than one ambiguous word in a sentence, the number of potential interpretations of the sentence increases dramatically. The number of interpretations of a sentence is the product of all possible meanings of the words that construct that sentence. For the sentence above, party has 5, take has 42, vote has 5, last has 10, and election has 4 senses. Therefore there are 42000 possible interpretations for the example sentence. The most prominent way to disambiguate a word is examining its context. The context can be considered as the words surrounding the ambiguous word, which is the noun party in our case. Words as vote and election might be a good clue for the sense of the noun party. But context is not the only information available for disambiguation. Syntactic classes of the words in the ambiguous word’s context (whether they are noun, verb or adjective, etc.), whether the ambiguous word plays the role of object or subject in the syntactic structure of the sentence may also be used in the disambiguation process.   

      The main research question in this thesis is determining the meaning of a word in a particular usage, namely Word Sense Disambiguation, by using sense similarity and syntactic context. In order to achieve this, we make use of a WSD algorithm that relies on the intuition that “two different words are likely to have similar meanings if they occur in identical local contexts”.

      This thesis is constructed as follows. An overview of the problem of Word Sense Disambiguation and a literature survey is presented in Chapter-2. Chapter-3 contains the description of our WSD system and presents the results obtained from this system on SENSEVAL-2 and SENSEVAL-3 English all-words data. Finally, we conclude in Chapter-4 with some final remarks on the findings presented in this thesis. 

 

Chapter 2

 

WORD SENSE DISAMBIGUATION

 

Many words have more than one possible meaning, for example a bat can be a small nocturnal creature or a piece of sports equipment and a bank can mean the edge of a river or a financial institution that accepts money. When we look up a word in any dictionary, it can be seen that a word may have many meanings some of which are very different from each other. In fact in some cases a word can have two meanings which are the opposite of one another such as “consult” which can mean to ask for advice or give advice and “clip” which can mean to fasten or detach. Given these complications, it is important for a computer to correctly determine the meaning in which a word is used.

Word Sense Disambiguation (WSD) refers to the resolution of lexical semantic ambiguity and its goal is to attribute the correct senses to words in a given context. It has been described as AI-complete, which is used to describe problems or subproblems in AI, to indicate that the solution presupposes a solution to the “strong AI problem” (that is, the synthesis of a human-level intelligence). A problem that is AI-complete is, in other words, just too hard.

WSD is regarded as one of the most interesting and longest-standing problems in natural language processing. There are many uses for WSD. The most obvious application of WSD is Machine Translation. The machine translation process requires at least two stages: understanding the source language translation is done from and generating sentences in the target language. WSD is required in both stages since a word in the source language may have more than one possible translation in the target language. For example, the English word “drug” can be translated into Turkish as “ilaç” for its sense of “medicine” or as “uyuþturucu” for its sense of “dope” depending on the context. In order to be able to correctly translate a text, we need to know which sense is intended in the text.

Information Retrieval also benefits from WSD. Ambiguous words in the queries are problematic for information retrieval systems. Hence, retrieval engines need WSD for filtering out documents with senses irrelevant to the query.

Another potential area for WSD is Speech Processing. In speech synthesis, it is important to determine the correct pronunciations of words in order to generate speech that sounds natural. This process is difficult since there exists some words which are pronounced in more than one way depending on their meaning. For example, as mentioned in Stevenson 2003[1], “lead” is pronounced in one way when it is used in the sense “be in front” and in another way when it is used in the sense “a type of metal”. WSD could help speech synthesis by identifying the correct sense of the word which will also provide the correct pronunciation. The reverse problem may occur in speech recognition for homophones, words that are spelled differently but pronounced in the same way. If different spellings are treated as different senses, WSD can also be helpful in this situation.

Applications in text processing, grammatical analysis, content and thematic analysis also benefit from WSD techniques.

In the following section, we will proceed with defining WSD approaches based on the different types of training material and an overall literature review on WSD.

 

2.1 WSD Approaches

WSD algorithms can be divided into three based on the way they acquire information. These approaches are the following:

i.                    Corpus based approaches

ii.                  Knowledge based approaches

In the following sections these approaches will be explained in detail.

 

2.1.1 Corpus Based Approaches

In corpus based approaches, information is gained from training on some corpus. A corpus provides a set of samples that enables the systems to develop some numerical models. This approach can further be classified into two subclasses based on the training corpus as follows:

i.                    Supervised disambiguation

ii.                  Unsupervised disambiguation

In supervised WSD the training data is sense-tagged whereas in unsupervised WSD the training data is raw corpora which have not been semantically disambiguated.

 

2.1.1.1 Supervised Disambiguation

Supervised disambiguation is an application of the supervised learning approach for creating a classifier. A disambiguated corpus where each occurrence of an ambiguous word is annotated with a contextually appropriate sense is available for training. The aim in supervised disambiguation is to build a classifier which correctly classifies new cases based on their context of use.

Machine learning algorithms such as Bayesian classifiers (Duda and Hat, 1973) [2], decision lists (Rivest, 1987)[3], decision trees (Quinlan, 1986)[4], k-nearest neighbor and neural networks (Rumelhart et al., 1986)[5] all fall into this category.   

An example of probabilistic algorithms is Naïve Bayes (Mitchell 1997:ch. 6)[6] which has been frequently applied in WSD with good results (Pedersen, 2000)[7]. Gale, Church and Yarowsky (1992a; 1992b) [8;9] uses a variant of Bayes ratio on six ambiguous nouns and reports 90% accuracy. Mooney (1996)[10] reports that Naïve Bayes and neural networks achieved the highest performance with an accuracy of 73% in assigning the correct senses to a corpus of examples of word line which has six senses. The other algorithms in Mooney’s survey were 3-nearest neighbors, perceptron, decision tree, decision list and logic programming variants. Combining various classifiers has also been tested. Florian et al., 2002[11], combined four classifiers namely feature-enhanced Naïve Bayes, Cosine, bag-of-words Naïve Bayes and non-hierarchical decision lists, and obtained good results.

Decision lists search for discriminatory features in the training corpus and build a set of rules for disambiguation. Yarowsky (2000)[12] makes use of hierarchical decision lists and achieves top performance in the SENSEVAL framework on the 36 test words. Agirre and Martinez (2000)[13] reports that decision lists provide state-of-the-art results with simple and very fast means. This approach is reported to learn with low amounts of data.

Decision lists and Bayesian classifiers are the most popular algorithms in supervised disambiguation. For neural networks Towell and Voorhees (1998)[14], for decision trees Black (1988)[15] and Pedersen (2002)[16], for k-nearest neighbor Ng and Lee (1996)[17] and for information-theoretic approaches Brown et al. (1991a)[18] are some examples of the work done on WSD.

A major problem with supervised approaches is the need for a large sense-tagged training set. Despite the availability of large corpora, manually sense-tagging of a corpus is very difficult and very few sense-tagged data are available now. 

The two largest corpora that are available are the SemCor corpus (Landes et al., 1998)[19] and the SENSEVAL corpus (Kilgariff and Rosenzweig, 2000)[20]. The SemCor corpus, created by the Princeton University, is a subset of the English Brown corpus containing almost 700,000 running words. In SemCor, all the words are tagged by part of speech and more than 200,000 content words are also lemmatized and sense-tagged according to Princeton WordNet 1.6 (mappings for later versions of WordNet are also available). SENSEVAL corpus is derived from the HECTOR corpus and dictionary project. It is a joint Oxford University Press and Digital project which took place in the early 1990s. In the course of the project a 20-million word corpus (which also served as a pilot for the British National Corpus) was developed.

There have been several efforts for finding a way to sense-tag corpora automatically. Bootstrapping is the most frequently used method for this purpose. Bootstrapping relies on a small number of instances of each sense for each lexeme of interest. These sense-tagged instances are used as seeds to train an initial classifier. This initial classifier is then used to extract a larger training set from the remaining untagged corpus. With each iteration of this process, the training corpus grows and the untagged corpus shrinks.

Hearst (1991)[21] generates a seed set by simply hand-tagging a small set of examples from the untagged corpus. However, during the training phase each occurrence of a set of nouns to be disambiguated is manually sense-tagged in several occurrences. Schütze (1992, 1993)[22;23] proposes a method that avoids tagging each occurrence in the training corpus. Yarowsky (1995)[24] proposes an alternative technique by using two constraints named as “One sense per collocation” and “One sense per discourse” and reports an accuracy of 96% on twelve words.  “One sense per collocation” argues that nearby words provide strong and consistent clues to the sense of a target word, conditional on relative distance, order and syntactic relationship. Also, “One sense per discourse” constraint argues that the sense of a target word is highly consistent within any given document. Different bootstrapping techniques are also presented in Mihalcea and Moldovan (2001a) [25] and Mihalcea (2002)[26].

Another method for avoiding hand-tagged data is using parallel corpora (Brown et al. 1991b)[27]. In this method, bilingual corpora are used since different senses of some words often translate differently in another language.  By using a parallel aligned corpus, the translation of each occurrence of such words can be used to determine their correct senses automatically. In Dagan and Itai (1994)[28], Ide et al. (2002)[29] and Ng et al. (2003)[30], various uses of parallel corpora for WSD and its disadvantages can be found.

Another problem that supervised disambiguation methods face with is data sparseness. Since the sense-tagged training corpus is finite and very few for WSD, some senses of polysemous words are very likely to be missing. For a supervised algorithm to be successful, the training data must ensure that all senses of a polysemous word are covered. Smoothing is used to solve the data sparseness problem. The task of reevaluating some of the zero-probabilities or low-probabilities and assigning them non-zero values is called smoothing. Some of the smoothing methods are add-one smoothing, Witten-Bell smoothing (Witten and Bell, 1991)[31], and Good-Turing smoothing (Good, 1953)[32].

 

2.1.1.2 Unsupervised Disambiguation

In unsupervised disambiguation, information is gathered from raw corpora which have not been semantically disambiguated. Unsupervised methods correspond to clustering tasks rather than sense tagging tasks. Indeed, completely unsupervised disambiguation is not possible for word senses since sense tagging requires characterization of the senses. WSD can be divided into two subproblems: sense discrimination and sense labeling. If sense labeling is part of the task, an outside source of knowledge is necessary to define the senses. Sense discrimination can be done in a completely unsupervised way. It divides the occurrences of a word into a number of classes by determining for any two occurrences whether they belong to the same sense or not (Schütze and Pedersen 1995; Pedersen and Bruce, 1997; Schütze, 1998)[33;34;35]. Schütze’s (1998)[35] results indicate that for coarse binary distinctions, unsupervised techniques can achieve results approaching those of supervised and bootstrapping methods. Purandere and Pedersen (2004)[36] present a systematic comparison of discrimination techniques proposed by Pedersen and Bruce (1997; 1998)[34;37] and by Schütze (1992; 1998)[22;35]. 

Infrequent senses and senses that have few collocations are hard to isolate in unsupervised disambiguation. In general, accuracy of unsupervised WSD systems are 5% to 10% lower than that of other algorithms since no lexical resources for training or defining senses are used.

 

2.1.2 Knowledge Based Approaches

All of the methods described up to now require considerable amount of work to create a classifier for each entry in the lexicon. Because of this reason, they were able to report results on very few lexical items (in general between 4 to 12 items). With the availability of large-scale lexical resources, such as dictionaries, thesauri and corpora, work on WSD focused on large-scale disambiguation. In the following subsections WSD based on machine readable dictionaries, thesauri and computational lexicons will be presented.

2.1.2.1 Machine Readable Dictionaries

Machine readable dictionaries (MRD) provide a ready-made information source of word senses. The first attempt to use MRD’s came from Lesk (1986)[38]. He starts from the simple idea that a word’s dictionary definitions are likely to be good indicators of the senses they define. By using Oxford Advanced Learner’s Dictionary (OALD), he counts overlapping content words in the sense definitions of the ambiguous word and in the definitions of context words occurring nearby and selects the sense that achieves the maximum number of overlaps. The accuracy of the method is reported to be 50-70% which is a good result regarding the fact that a fine set of sense distinctions have been used such as OALD.

Cowie et al. (1992)[39] tried to improve Lesk’s approach by optimizing the overlap of all words in a single sentence simultaneously. However, it was found computationally very expensive. Therefore, Cowie et al. (1992)[39] used simulated annealing (Sampson 1986)[40] for the first time in natural language processing. They evaluated this approach using a corpus consisting of 50 sentences extracted from Longman Dictionary of Contemporary English (LDOCE). 47% of the words were reported to be correctly disambiguated to sense level and 72% to more rough grained senses.

Stevenson and Wilks (2001)[41] computed the overlap by normalizing the contribution of a word to the overlap count. Pedersen and Banerjee (2002)[42] described a different version of the Lesk’s algorithm by employing glosses contained in WordNet (Fellbaum, 1998; Miller et al., 1990)[19;43], a lexical database for English which will be described later in detail.

Because of the fact that dictionaries are created for human use, not for computers, there are some inconsistencies (Veronis and Ide 1991; Ide and Veronis 1993a, 1993b)[44;45;46]. Although they provide detailed information at the lexical level, they lack pragmatic information used for sense determination. For instance, the relation between ash and tobacco, cigarette or tray is very indirect in a dictionary whereas the word ash co-occurs very frequently with these words in a corpus (Ide and Veronis, 1998)[47].

2.1.2.2 Thesauri

Thesauri provide information about relationships among words. Thesaurus based disambiguation makes use of the semantic categorization provided by a thesaurus or a dictionary with subject categories. The most frequently used thesaurus in WSD is Roget’s International Thesaurus (Roget, 1946) which was put into machine-tractable form in 1950’s. The basic inference in thesaurus-based disambiguation is that semantic categories of the words in a context determine the semantic category of that context as a whole. And this category is then determines the correct senses that are used.

Walker (1987:254) proposed an algorithm as follows: each word is assigned to one or more subject categories in the thesaurus. If the word is assigned to several subjects, then it is assumed that they correspond to different senses of the word. Black (1988: 187) applied this approach to five different words and achieved accuracies around 50%.

Yarowsky (1992) proposed an approach for marking words with their categories from Roget’s thesaurus. The important aspect of the approach was that he used a context of 50 words either side so that 100 words were considered in the training examples for each ambiguous word. This method was tested on 12 ambiguous words and reported to achieve 92% accuracy. Yarowsky notes that this method is best for extracting topical information, most successful for nouns.  

Similar to machine readable dictionaries, a thesaurus is a resource for humans, so there is not enough information about word relations. Therefore, Roget’s or any other thesauri have not been used so extensively.

 

2.1.2.3 Computational Lexicons

The usefulness of lexical relations in linguistic, psycholinguistic and computational research has led to a number of efforts to create large electronic databases of such relations. Beginning from the mid-1980’s, construction of semantic lexicons by hand has emerged. Some examples of these lexicons are WordNet (Fellbaum 1998; Miller et al. 1990), CyC (Lenat and Guha 1990), ACQUILEX (Briscoe 1991), and COMPLEX (Grishman, Macleod, and Meyers 1994).

Since WordNet is the most popular lexicon among the above and since it is used in the work done in this thesis both for sense evaluation and for similarity measure, this section presents detailed information about it.

 

WordNet

WordNet is an online lexical reference system which was developed at Princeton University under the direction of Professor George A. Miller. It combines many features used for WSD in one system. It includes definitions of word senses as in a dictionary; it defines “synsets” of synonymous words representing a single lexical concept like a thesaurus; and it includes word-to-word relations.

WordNet consists of three databases: noun database, verb database and one database for adjectives and adverbs. Each database consists of lexical entries corresponding to unique orthographic forms. Each form is associated with a set of senses.

 

Overview of noun car

The noun car has 5 senses (first 3 from tagged texts)

1. (598) car, auto, automobile, machine, motorcar -- (4-wheeled motor vehicle; usually propelled                                       by an internal combustion engine; "he needs a car to get to work")

2. (24) car, railcar, railway car, railroad car -- (a wheeled vehicle adapted to the rails of railroad;                                                      "three cars had jumped the rails")

3. (1) cable car, car -- (a conveyance for passengers or freight on a cable railway; "they took a                                                      cable car to the top of the mountain")

4. car, gondola -- (car suspended from an airship and carrying personnel and cargo and power                                                      plant)

5. car, elevator car -- (where passengers ride up and down; "the car was on the top floor")

 


 

Figure-1: WordNet 2.0 entry for the noun “car

 

WordNet’s sense entries consist of a set of synonyms, a dictionary-style definition, or gloss, and some example uses as shown in Figure-1. The numbers in front of some senses are the frequency values obtained from SemCor corpus. This simple listing of lexical entries looks like an ordinary dictionary. However, WordNet contains a set of domain-independent lexical relations which hold among WordNet entries, senses or sets of synonyms. Lexical relations are only available within the same database. Figure-2, 3 and 4 show a subset of the relations associated with each of the three databases.

 

Relation

Definition

Example

Hypernym

Hyponym

Has-member

Member-of

Has-part

Part-of

Antonym

From concepts to superordinates

From concepts to subtypes

From groups to their members

From members to their groups

From wholes to parts

From parts to wholes

Opposites

breakfast → meal

meal → lunch

faculty → professor

pilot → crew

table → leg

course → meal

leader → follower

 

Figure-2: Noun relations in WordNet

 

Relation

Definition

Example

Hypernym

Troponym

Entails

Antonym

From events to superordinate events

From events to their subtypes

From events to the events they entail

Opposites

fly → travel

walk → stroll

snore → sleep

increase → decrease

 

Figure-3: Verb relations in WordNet

 

Relation

Definition

Example

Antonym (adjective)

Antonym (adverb)

Opposite

Opposite

heavy → light

quickly → slowly

 

Figure-4: Adjective and adverb relations in WordNet

 

Synonym and hyponym relations are the most useful relations presented in WordNet. Synonym means different lexemes with the same meaning. Two WordNet entries are considered to be synonyms if they can be successfully substituted in some context. A synset is a set of synonyms. Consider the following example of a synset:

{ car, auto, automobile, machine, motorcar }

The gloss of this synset describes it as “4-wheeled motor vehicle; usually propelled by an internal combustion engine; "he needs a car to get to work" as shown in Figure-1 for the first sense of noun car. Instead of representing concepts using logical terms, WordNet represents them as synsets.

Each synset is related to its immediately more general and more specific synsets via direct hypernym and hyponym relations. To find chains of more general or more specific synsets, one can follow a transitive chain of hypernym and hyponym relations.  As an example, the hypernym chain for the noun valley is shown in Figure-5. In this figure, successively more general synsets are shown on successive indented lines. The chain starts from valley and ends in entity.

 

valley, vale

       => natural depression, depression

           => geological formation, formation

               => natural object

                   => object, physical object

                              =>entity

Figure-5: Hyponymy chain for the noun “valley

 

Since WordNet provides the broadest set of lexical information in one resource, it is used widespreadly. Another reason for its wide-spread use is, it is the first broad-coverage lexical resource that is freely available.

The earliest attempts to use WordNet in WSD were in information retrieval field. Voorhees (1993) and Richardson and Smeaton (1994) created knowledge bases using WordNet’s hierarchy. Li et al. (1995) proposed a WordNet-based algorithm for WSD. Disambiguation was done by semantic similarity between words and heuristic rules. Heuristic rules were based on the semantic similarity and the WordNet hierarchy. Leacock et al. (1998) used WordNet to counter data sparseness problem. Hawkins (1999) built up a WSD system that works with frequency and contextual information that is based on WordNet. Fellbaum et al. (2001) proposed a system that made use of syntactic clustering and semantic distinctions extracted from WordNet.

As seen, there has been various work regarding WordNet. But WordNet is mostly used to determine semantic similarity between senses. Resnik (1995a) computed information content of words which is a measure of the specificity of the concept that subsumes the words in the WordNet hypernym hierarchy. Agirre and Rigau (1996) employ WordNet to determine the conceptual distance among concepts whereas Mihalcea and Moldovan (1999) exploit semantic density and WordNet glosses in an all words design. Lin (1997; 1998; 2000) described a semantic similarity measure where similarity between two objects is defined to be the amount of information contained in the commonality between the objects divided by the amount of information in the description of the objects. We used the same measure for the work conducted in this thesis. Detailed information is given in Section 3.2.3.

Other approaches using WordNet are Jiang (1997), Agirre and Martinez (2000), Agirre et al. (2001), Hayres (2001) and Banerjee et al. (2003). A combination of MRDs and WordNet has also been tried with success (Litkowski, 2000, 2001; Mihalcea and Moldovan, 1998).

2.2 Information Sources for WSD

There are various information sources or feature types used in WSD regardless of the type of the approach. To disambiguate a word, a diversity of information, including syntactic tags, word frequencies, collocations, semantic context, role-related expectations, and syntactic restrictions can be considered.

In Agirre and Martinez (2001), a comparison of WSD systems has been made based on the information source they used. Some of these sources are as follows:

Frequency of senses: Frequency information is used to measure the likelihood of each possible sense appearing in the text. Therefore this information is generally used in statistical approaches and it is generally learned from hand-tagged data such as SemCor corpus. WordNet senses are ordered according to the frequencies of the senses in the SemCor corpus.

Part of speech (POS): Part of speech tagging is regarded as the first step of the disambiguation process if the tokens have the same orthographic forms but different syntactic classes. It is necessary since it reduces the number of possible senses a word can belong to. An orthographic form may even be unambiguous in one syntactic class whereas it has more than one sense in another. For instance, in WordNet 2.0 handle has 5 senses as a verb, but only one sense as a noun. The impact of knowledge resources on WSD in examined in Gaustad (2004). The results show that accurate POS information is beneficial for WSD and that including the POS of the ambiguous word itself as well as POS of the context increases the disambiguation accuracy.

Morphology: It is defined specially as the relation between derived words and their roots. For instance, the noun agreement has 6 senses, its verbal root agree 7. A stemmer tries to reduce various forms of a word to a single stem. Since English is a language with little inflectional morphology, it is not certain that using morphology will lead to significant improvements in WSD. With other languages, such as German or Italian, morphology is of greater influence. Derivational morphology would be useful only for disambiguating roots.

Collocations: Collocation is the relationship among any group of words that tend to co-occur in a predictable configuration. Disambiguation relies heavily on collocational information. For example, the noun match has 9 senses. However, it has only one possible sense in “football match”. It is observed that collocations are strong indicators if they are learned from hand-tagged corpora. Although they are strong, they should be used with other sources. They should not be treated as rules for sense-filtering alone (as in Hirst 1987 or Dahlgren, McDowell, and Stabler 1989).

Semantic word associations: These can be classified as follows:

i.                    Taxonomial organization: This refers to the classification of words in a hierarchy and the lexical-semantic relationships holding between words such as a dog is a kind of animal. This kind of information can be extracted from ontologies like WordNet.

ii.                  Situation and Topic: Information about the situation or topic enables a WSD system to see the ambiguous word in a broader context. For example, if the word mouse is used in an office situation and the topic is computer use, the most probable sense of the word mouse will be “computer tool”, not “animal”. Semantic word associations around topic and situation are powerful when learned from hand-tagged corpora. Associations learned from MRDs can also be useful.

iii.                Argument-head relations: These relations provide important clues for disambiguation such as the relationship between dog and bite in the sentence “the dog bite the man.” Yarowsky (1993) showed that collocation information can be captured using argument-head relations. Argument-head relations work as a strong indicator if they are given as a sense-to-word relation.

Syntactic structure (subcategorization information): Subcategorization and dependency relations refer to certain kinds of relations between words and phrases. For example the verb want can be followed by an infinitive, as in “I want to fly to Istanbul”, or a noun phrase, as in “I want a flight to Istanbul”. But the verb find cannot be followed by an infinitive. Verbs have several possible patterns of arguments. A particular set of arguments that a verb can appear with is referred to a subcategorization frame. Subcategorization frames capture syntactic regularities about complements.

Gaustad (2004) used dependency relations in his WSD system for Dutch and showed that syntactic information provides very useful disambiguation clues and leads to a significant increase in performance. For the work presented in this thesis, we also used syntactic information and dependency relations, so detailed information about this kind of knowledge source will be given in the following sections.     

Agirre and Martinez (2001) made a comparison between the contributions of the above resources to WSD. According to their observations, if learned from hand-tagged corpora, collocations and semantic word associations are the most important knowledge types for WSD, but they also mentioned that syntactic cues are equally reliable. On the other hand, taxonomical information was found to be very weak.

2.3 General Approach and Related Work

As can be seen from the previous sections, most of the study in WSD is concentrated so far on supervised learning since these algorithms usually achieve the best performance. This fact is shown by the results of SENSEVAL-2 (Cotton et al., 1998). However these systems have the drawback that they are only applicable to the words for which sense tagged data is available, and they are strongly dependent to the labeled training data available. In addition, almost all of these systems make use of the picking the first sense heuristic.

Supervised systems and first sense heuristic has come to a natural bound of approximately 70% precision (Mihalcea, 2002). In order to improve this value, benefits of knowledge-based or unsupervised alternatives should be examined.

In this thesis, we present a disambiguation algorithm that relies on the information gained from syntactic context and sense similarity. The WSD approach used in the system cannot be specified as unsupervised since it uses a lexicon, namely WordNet, for sense disambiguation and for semantic similarity metric; however, it may also be regarded as unsupervised in the sense that no sense-tagged data is needed for training. All details about the system will be given in Chapter-3. In this section, a number of WSD systems that have employed syntactic context in the past are introduced.

The algorithm used in this thesis is very similar to the algorithm presented in Lin (1997). Lin mentions the same problems that supervised algorithms face with: need for a large sense-tagged training set and data sparseness. In order not to deal with these problems, Lin proposes an algorithm adopting the intuition that “two different words are likely to have similar meanings if they occur in identical local contexts”. This means that the same knowledge sources are used for all words and that instead of building separate classifiers for each word, past usages of other words are used to disambiguate the current word.

Using a broad-coverage parser, a 25-million-word Wall Street Journal corpus was parsed and dependency relationships were extracted. Dependency relationships were stored as dependency triplets which were composed of the type of the dependency relation, the word related to the ambiguous word via the dependency relation and information on whether the word is the head or the modifier in the relationship. A local context database was constructed for each of these triplets (local contexts). An entry in the database consisted of twenty words found from the training corpus that were most likely to occur in that particular local context.

In the disambiguation process, input is parsed and local contexts are extracted. Local context database is searched to find the words that occurred in the same local context as the ambiguous word. Based on a similarity maximization algorithm between the words, the polysemous word is disambiguated. The algorithm used in the similarity maximization part of both Lin’s work and our work is very similar to the algorithm presented in (Resnik, 1995a).

Since Wall Street Journal is mostly business news, Lin only used the “press reportage” of SemCor corpus for testing. An accuracy of 56.1% is reported for nouns.

Li, Szpakowicz, and Matwin (1995), presents an algorithm for automatic word sense disambiguation, based on lexical knowledge contained in WordNet and on the results of surface-syntactic analysis. The algorithm was designed to support text analysis with minimal precoded knowledge. They proposed a set of heuristic rules that are based on the idea that objects of the same or similar verbs are similar. Disambiguation is done by computing the semantic similarity between words and applying the heuristic rules based on this similarity.

Stetina et al. (1998) achieve good results with syntactic relations as features compared to the frequency baseline. They use a measure of semantic distance based on WordNet to find similar features. The features are extracted using a statistical parser (Collins, 1996), and consist of the head and modifiers of each phrase.

Martinez et al. (2002) discuss the contribution of various syntactic features to WSD: grammatical relations coded as the presence of adjuncts/arguments in isolation or as subcategorization frames, and instantiated grammatical relations between words.

The set of syntactic features was extracted using Minipar (Lin, 1993). Two types of syntactic relations were taken into account: instantiated grammatical relations (IGR) and grammatical relations (GR). IGR are triples of [wordsense relation value] where the value is either the word form or the lemma. For GR, bigrams [wordsense relation] and n-grams [wordsense relation1 relation2...] are stored. Three types of n-grams are defined: all subcategorization information including part of speech given by Minipar, subcategorization information occurring in the sentence, and all dependencies in the parse tree. Their results on the Senseval-2 lexical sample data for English indicate that both types of syntactic relations, IGR and GR, together provide the best F-score (IGR getting better precision than coverage and GR lower precision, but higher coverage). They also came to the conclusion that syntactic features effectively improve WSD precision.

Gaustad (2004) also investigated whether structural syntactic information is helpful for lexical semantic disambiguation or not. He used the Alpino dependency parser for Dutch in order to extract the dependency relations. He reported that using dependency relations in conjunction with the lemma of the ambiguous word and its part of speech as features performs better than the model including only immediately left and right context.

 

 

Chapter 3

 

Word Sense Disambiguation Based on Sense

Similarity and Syntactic Context

 

 

3.1 Introduction

Most corpus-based WSD algorithms make use of the previous usages of a polysemous word in order to disambiguate that word. These algorithms basically rely on the intuition that “two occurrences of the same word have identical meanings if they have similar local contexts”. However, this intuition can be misleading. In order for this method to have the desired consequences; a polysemous word must be observed several times before making a good classifier. Knowing that there are nearly 15000 polysemous nouns in WordNet 2.0, the corpus must contain millions of words. If the corpus fails to contain a polysemous word or all the senses of a word, the algorithm can lead to false results.

In this thesis, we make use of a WSD algorithm that relies on the intuition that “two different words are likely to have similar meanings if they occur in identical local contexts”. By this approach, a generalization can be made since it allows using the knowledge learned for one word on other words; instead of using a separate classifier for each polysemous word. Hence, the size of the training corpus can be much smaller. Moreover, this method is able to disambiguate infrequent or unseen words in the training corpus.

The basic steps of the algorithm are as follows:

                    i.            Parse the training corpus and build up a local context database by extracting local contexts according to some rules.

                  ii.            Similarly, parse the input text and extract local contexts of each ambiguous word.

                iii.            For each ambiguous word w, search the local context database and find words that are in an identical local context as w. These words are the similar words set S of w which will be used in sense selection.

                iv.            Select a sense of w that maximizes the similarity between w and its similar words set S that was found in the previous step.

Consider the sentence below:

Jane applied for a job.

 

The noun “job” has 15 possible meanings in WordNet 2.0. To disambiguate the noun “job”, we consider the other words that appeared in an identical context as “job”. According to our system, the local context of “job” is “N:prep-for:apply:head:V”, that is there is a relation between the verb “apply” and the noun “job” via the preposition “for”. In Table-1, the words that occurred in the local context “N:prep-for:apply:head:V” are presented. The numbers in table indicate the likelihood ratios of the words in that local context. The meaning of “job” in the example can be determined by choosing one of its 15 senses that is most similar to the meanings of the words in Table-1.

 

N:prep-for:apply:head:V

license                   419.46

membership          324.86

job                         299.67

approval                241.81

patent                    233.65

permission             207.59

amnesty                198.54

permit                    166.21

listing                     141.75

loan                       141.00 

citizenship                   100.85

asylum                          92.37

visa                               84.21

status                            72.32

grant                             69.68

exemption                     59.04

legalization                   58.55

noun.person                 54.34

refund                           52.63

residence                      50.20

Table-1: Words that have a relation with “apply” via “for”

This chapter is structured as follows: Firstly we present the resources needed for the algorithm. In section 3.3 and section 3.4, the concepts of local context and local context database are explained respectively. Later, the disambiguation algorithm is presented in section 3.5. Finally, we conclude this chapter with the results obtained from the experiments.

3.2 Resources of the Algorithm

In order to implement this algorithm the following resources are needed:

                    i.            an untagged training corpus,

                  ii.            a concept hierarchy,

                iii.            a similarity measure,

                iv.            a broad-coverage parser.

In this section, all of the above resources are presented one by one.

3.2.1 Training Corpus

For the training corpus, we used a 100-million-word Wall Street Journal corpus of the periods presented below:

Source        Period              Original Publication(s)

---------------------------------------------------------------

WSJ           1987-9             ACL/DCI and CSR WSJ0

WSJ           1990-2             TIPSTER Vol. 2, version 2

WSJ           1992-4             CSR LM-1 corpus

 

The WSJ 1992 collection on TIPSTER covers only the first three months of the year.  Data from the remaining months of 1992 were taken from the LDC, along with all of 1993 and the first three months of 1994.

3.2.2 Concept Hierarchy (WORDNET 2.0)

As the concept hierarchy WordNet 2.0 is used. Information about the general WordNet structure is given in section 2.1.2.3. WordNet 2.0 is available for Unix and Windows platforms. WordNet 2.0 includes more than 42,000 new links between nouns and verbs that are morphologically related, a topical organization for many areas that classifies synsets by category, region, or usage, gloss and synset fixes, and new terminology.

3.2.2 Concept Hierarchy (WORDNET 2.0)

As the concept hierarchy WordNet 2.0 is used. Information about the general WordNet structure is given in section 2.1.2.3. WordNet 2.0 is available for Unix and Windows platforms. WordNet 2.0 includes more than 42,000 new links between nouns and verbs that are morphologically related, a topical organization for many areas that classifies synsets by category, region, or usage, gloss and synset fixes, and new terminology.

 

3.2.3 Similarity Measure

In order to find the similarity between the senses of a polysemous word and another word we need a similarity measure. There is a strong relationship between a word’s semantics (or meaning) and its syntactic behavior. As discussed in Weeds (2002), if two words mean similar things then they will occur in documents in similar contexts. Accordingly, we can attempt to calculate the similarity between all senses of two words by finding all of the contexts that each appears in. If a description of a word is made up of the contexts it appears in and their associated frequencies then we can mathematically measure the similarity between these descriptions. For example, the word dog may occur as the subject of the word eat x number of times - this will increase the similarity between dog and other words which also occur as the subject of the word eat e.g. cat. Many measures have been proposed to calculate the similarity between the descriptions of words; we used WordNet Similarity Package v.0.15 which implements the semantic relatedness measures described by Leacock & Chodorow (1998), Jiang & Conrath (1997), Resnik (1995), Lin (1998), Hirst & St-Onge (1998), Wu & Palmer (1994), the adapted gloss overlap measure by Banerjee and Pedersen (2002), and two measures based on context vectors by Patwardhan (2003). This package is able to calculate the similarity between either two words or two senses (or synsets). We use it to find the similarity between two senses.

We used Lin’s measure (1998) in the experiments conducted for this thesis.

 

Lin’s Measure[1]

This measure is derived from the following assumptions (Lin, 1998):

 

i.                    The commonality between the words A and B is measured by I(common(A,B)) where common(A,B) is a proposition that states the commonalities between A and B and I(s) is the amount of information contained in the proposition s.

ii.                  The differences between the words A and B is measured by

I(describe(A,B)) – I(common(A,B))

             where describe(A,B) is a proposition that describes what A and B are.

iii.                The similarity between A and B, sim(A,B), is a function of their commonality and differences. That is,

sim(A,B) = f ( I(common(A,B)) , I(describe(A,B)))

                   where the domain of function f(x,y) is { (x,y) | x ≥ 0 , y > 0 , y ≥ x } .

iv.                Similarity is independent of the unit used in the information measure.

                  According to the Information Theory (Cover and Thomas, 1991), I(s) = -logbP(s),          where P(s) is the probability of s, and b is the unit. When b=2, I(s) is the number of bits                needed to encode s. Since logbx =  , assumption (iv) means that the function f                 must satisfy the following condition:

                                                     

v.                  Similarity is additive with respect to commonality.

If common(A,B) consists of two independent parts, then the sim(A,B) is the sum of the          similarities computed when each part of the commonality is considered. In other words:             f(x1 + x2, y) = f(x1,y) + f(x2,y). Therefore,,                              which means that when there is no commonality between A and B, their similarity is 0, no            matter how different they are.

vi.                The similarity between a pair of identical objects is 1.

When A and B are identical, knowing their commonalities means knowing what they are, i.e., I(common(A,B)) = I(describe(A,B)). Therefore, the function f must have the following property: 

vii.              The function f(x,y) is continuous.

 

Similarity Theorem: The similarity between A and B is measured by the ratio between the amount of information needed to state the commonality of A and B and the information needed to fully describe what A and B are.

 

sim(A,B)

Proof: To prove the theorem, we need to show f(x,y) = . Since f(x,y) = f(,y) (due to assumption (iv)), we only need to show that when  is a rational number, f(x,y) = .The result can be generalized to all real numbers because f is continuous and for any real number, there are rational numbers that are infinitely close to it.

Suppose m and n are positive integers.

                                          f(nx,y) = f((n-1)x,y) + f(x,y) = n f(x,y)  (due to assumption (v))

Thus,

Substituting  for x in this equation:

Since