WORD SENSE DISAMBIGUATION

 

   Main Page

   Literature Survey

   Online Thesis

   Online Code-base

   WSD Tools

   WSD Related Links

 

 

 

INDEX

Chapter 1: Introduction

Chapter 2: Word Sense Disambiguation

2.1 WSD Approaches

2.1.1 Corpus Based Approaches

2.1.1.1 Supervised Disambiguation

2.1.1.2 Unsupervised Disambiguation

2.1.2 Knowledge Based Approaches

2.1.2.1 Machine Readable Dictionaries

2.1.2.2 Thesauri

2.1.2.3 Computational Lexicons

2.2 Information Sources for WSD

2.3 General Approach and Related Work

Chapter 3: Word Sense Disambiguation Based on Sense
Similarity and Syntactic Context

3.1 Introduction

3.2 Resources of the Algorithm

3.2.1 Training Corpus

3.2.2 Concept Hierarchy

3.2.3 Similarity Measure

3.2.4 Broad-coverage Parser

3.3 Local Context

3.4 Local Context Database

3.5 Disambiguation Algorithm

3.6 Evaluation and Results

3.6.1 Weak Contexts of the Algorithm

Chapter 4: Conclusion

Chapter 1

 

INTRODUCTION

 

           

      Natural language is an integral part of our lives. Language serves as the primary vehicle by which people communicate and record information. It has the potential for expressing an enormous range of ideas, and for conveying complex thoughts. Because it is so integral to our lives, we usually take its powers and influence for granted.

      The aim of computational linguistics is, in a sense, to capture this power. It tries to enable computers to be used as aids in analyzing and processing natural language, and to understand, by analogy with computers, more about how people process natural language. By understanding language processes in procedural terms, we can give computer systems the ability to generate and interpret natural language. This would make it possible for computers to perform linguistic tasks (such as translation), process textual data (books, journals, newspapers), and make it much easier for people to access computer-stored data.

      In the field of computational linguistics, some results have already been obtained however, a number of important research problems have not been solved yet. Ambiguity is one of these problems which have been a great challenge for computational linguists. In general, people are unaware of the ambiguities in the language they use because they are very good at resolving them using context and their knowledge of the world. But computer systems do not have this knowledge, and consequently do not do a good job of making use of the context.

      Something is ambiguous when it can be understood in two or more possible ways or when it has more than one meaning. If the ambiguity is in a sentence or clause, it is called structural (syntactic) ambiguity. If it is in a single word, it is called lexical ambiguity.

      For the structural ambiguity, consider the sentence “The man saw the girl with the telescope”. This sentence is ambiguous since it can be interpreted in two ways: The man saw the girl who possessed the telescope or, the man saw the girl with the aid of the telescope. However, the sentence “The man saw the girl with a red hat” is not ambiguous for a human reader (people has the knowledge that a hat cannot be used to see), while it has the same ambiguity as the previous example for a computer.

      Lexical semantic ambiguity occurs when a single word is associated with multiple senses. In this thesis, we will be focusing on lexical semantic ambiguity. Examples of lexical ambiguity are everywhere. In fact, almost any word has more than one meaning. For example, consider the noun party. It can refer to at least 5 different things as follows:

·         an organization to gain political power

·         an occasion on which people can assemble for social interaction and entertainment

·         a band of people associated temporarily in some activity

·         a group of people gathered together for pleasure

·         a person involved in legal proceedings

All of the above senses can be subsumed in just one sense such as “group of people”, however, for various applications, such as information retrieval or machine translation, it is important to be able to distinguish between the different senses of a word. In a machine translation application, different senses of a word may be represented with different words in the target language. In order to correctly translate a text in one language to another, firstly we have to know the senses of the words and then find the best translation equivalent in the target language. Apart from these, for many other words there is no such general sense like the one for the noun party.

Lexical ambiguity can refer to both homonymy and polysemy. Homonyms are words that are written the same way, but are (historically or conceptually) really two different words with different meanings which seem unrelated. Examples are suit (“lawsuit” and “set of garments”) and bank (“river bank” and “financial institution”). If a word’s meanings are related, it is called a polyseme. The word party is polysemous because its senses can be generalized as “group of people”, that is they are related.

Now let us consider the meaning of the noun party in the following sentence:

 

Mr. Smith’s party took 38% of the votes in the last elections.

 

      It is clear to a human reader that the noun party is in the sense “an organization to gain political power” in the sentence above. Most people are not even aware of the ambiguity contained in the sentence. Humans are so skilled at resolving potential ambiguities that they do not realize they are doing it. There has been research on how people resolve ambiguities; however we still do not know exactly how humans do lexical disambiguation. Therefore, it is a difficult task to teach a computer to do the same thing. If there is more than one ambiguous word in a sentence, the number of potential interpretations of the sentence increases dramatically. The number of interpretations of a sentence is the product of all possible meanings of the words that construct that sentence. For the sentence above, party has 5, take has 42, vote has 5, last has 10, and election has 4 senses. Therefore there are 42000 possible interpretations for the example sentence. The most prominent way to disambiguate a word is examining its context. The context can be considered as the words surrounding the ambiguous word, which is the noun party in our case. Words as vote and election might be a good clue for the sense of the noun party. But context is not the only information available for disambiguation. Syntactic classes of the words in the ambiguous word’s context (whether they are noun, verb or adjective, etc.), whether the ambiguous word plays the role of object or subject in the syntactic structure of the sentence may also be used in the disambiguation process.   

      The main research question in this thesis is determining the meaning of a word in a particular usage, namely Word Sense Disambiguation, by using sense similarity and syntactic context. In order to achieve this, we make use of a WSD algorithm that relies on the intuition that “two different words are likely to have similar meanings if they occur in identical local contexts”.

      This thesis is constructed as follows. An overview of the problem of Word Sense Disambiguation and a literature survey is presented in Chapter-2. Chapter-3 contains the description of our WSD system and presents the results obtained from this system on SENSEVAL-2 and SENSEVAL-3 English all-words data. Finally, we conclude in Chapter-4 with some final remarks on the findings presented in this thesis. 

 

Chapter 2

 

WORD SENSE DISAMBIGUATION

 

Many words have more than one possible meaning, for example a bat can be a small nocturnal creature or a piece of sports equipment and a bank can mean the edge of a river or a financial institution that accepts money. When we look up a word in any dictionary, it can be seen that a word may have many meanings some of which are very different from each other. In fact in some cases a word can have two meanings which are the opposite of one another such as “consult” which can mean to ask for advice or give advice and “clip” which can mean to fasten or detach. Given these complications, it is important for a computer to correctly determine the meaning in which a word is used.

Word Sense Disambiguation (WSD) refers to the resolution of lexical semantic ambiguity and its goal is to attribute the correct senses to words in a given context. It has been described as AI-complete, which is used to describe problems or subproblems in AI, to indicate that the solution presupposes a solution to the “strong AI problem” (that is, the synthesis of a human-level intelligence). A problem that is AI-complete is, in other words, just too hard.

WSD is regarded as one of the most interesting and longest-standing problems in natural language processing. There are many uses for WSD. The most obvious application of WSD is Machine Translation. The machine translation process requires at least two stages: understanding the source language translation is done from and generating sentences in the target language. WSD is required in both stages since a word in the source language may have more than one possible translation in the target language. For example, the English word “drug” can be translated into Turkish as “ilaç” for its sense of “medicine” or as “uyuþturucu” for its sense of “dope” depending on the context. In order to be able to correctly translate a text, we need to know which sense is intended in the text.

Information Retrieval also benefits from WSD. Ambiguous words in the queries are problematic for information retrieval systems. Hence, retrieval engines need WSD for filtering out documents with senses irrelevant to the query.

Another potential area for WSD is Speech Processing. In speech synthesis, it is important to determine the correct pronunciations of words in order to generate speech that sounds natural. This process is difficult since there exists some words which are pronounced in more than one way depending on their meaning. For example, as mentioned in Stevenson 2003[1], “lead” is pronounced in one way when it is used in the sense “be in front” and in another way when it is used in the sense “a type of metal”. WSD could help speech synthesis by identifying the correct sense of the word which will also provide the correct pronunciation. The reverse problem may occur in speech recognition for homophones, words that are spelled differently but pronounced in the same way. If different spellings are treated as different senses, WSD can also be helpful in this situation.

Applications in text processing, grammatical analysis, content and thematic analysis also benefit from WSD techniques.

In the following section, we will proceed with defining WSD approaches based on the different types of training material and an overall literature review on WSD.

 

2.1 WSD Approaches

WSD algorithms can be divided into three based on the way they acquire information. These approaches are the following:

i.                    Corpus based approaches

ii.                  Knowledge based approaches

In the following sections these approaches will be explained in detail.

 

2.1.1 Corpus Based Approaches

In corpus based approaches, information is gained from training on some corpus. A corpus provides a set of samples that enables the systems to develop some numerical models. This approach can further be classified into two subclasses based on the training corpus as follows:

i.                    Supervised disambiguation

ii.                  Unsupervised disambiguation

In supervised WSD the training data is sense-tagged whereas in unsupervised WSD the training data is raw corpora which have not been semantically disambiguated.

 

2.1.1.1 Supervised Disambiguation

Supervised disambiguation is an application of the supervised learning approach for creating a classifier. A disambiguated corpus where each occurrence of an ambiguous word is annotated with a contextually appropriate sense is available for training. The aim in supervised disambiguation is to build a classifier which correctly classifies new cases based on their context of use.

Machine learning algorithms such as Bayesian classifiers (Duda and Hat, 1973) [2], decision lists (Rivest, 1987)[3], decision trees (Quinlan, 1986)[4], k-nearest neighbor and neural networks (Rumelhart et al., 1986)[5] all fall into this category.   

An example of probabilistic algorithms is Naïve Bayes (Mitchell 1997:ch. 6)[6] which has been frequently applied in WSD with good results (Pedersen, 2000)[7]. Gale, Church and Yarowsky (1992a; 1992b) [8;9] uses a variant of Bayes ratio on six ambiguous nouns and reports 90% accuracy. Mooney (1996)[10] reports that Naïve Bayes and neural networks achieved the highest performance with an accuracy of 73% in assigning the correct senses to a corpus of examples of word line which has six senses. The other algorithms in Mooney’s survey were 3-nearest neighbors, perceptron, decision tree, decision list and logic programming variants. Combining various classifiers has also been tested. Florian et al., 2002[11], combined four classifiers namely feature-enhanced Naïve Bayes, Cosine, bag-of-words Naïve Bayes and non-hierarchical decision lists, and obtained good results.

Decision lists search for discriminatory features in the training corpus and build a set of rules for disambiguation. Yarowsky (2000)[12] makes use of hierarchical decision lists and achieves top performance in the SENSEVAL framework on the 36 test words. Agirre and Martinez (2000)[13] reports that decision lists provide state-of-the-art results with simple and very fast means. This approach is reported to learn with low amounts of data.

Decision lists and Bayesian classifiers are the most popular algorithms in supervised disambiguation. For neural networks Towell and Voorhees (1998)[14], for decision trees Black (1988)[15] and Pedersen (2002)[16], for k-nearest neighbor Ng and Lee (1996)[17] and for information-theoretic approaches Brown et al. (1991a)[18] are some examples of the work done on WSD.

A major problem with supervised approaches is the need for a large sense-tagged training set. Despite the availability of large corpora, manually sense-tagging of a corpus is very difficult and very few sense-tagged data are available now. 

The two largest corpora that are available are the SemCor corpus (Landes et al., 1998)[19] and the SENSEVAL corpus (Kilgariff and Rosenzweig, 2000)[20]. The SemCor corpus, created by the Princeton University, is a subset of the English Brown corpus containing almost 700,000 running words. In SemCor, all the words are tagged by part of speech and more than 200,000 content words are also lemmatized and sense-tagged according to Princeton WordNet 1.6 (mappings for later versions of WordNet are also available). SENSEVAL corpus is derived from the HECTOR corpus and dictionary project. It is a joint Oxford University Press and Digital project which took place in the early 1990s. In the course of the project a 20-million word corpus (which also served as a pilot for the British National Corpus) was developed.

There have been several efforts for finding a way to sense-tag corpora automatically. Bootstrapping is the most frequently used method for this purpose. Bootstrapping relies on a small number of instances of each sense for each lexeme of interest. These sense-tagged instances are used as seeds to train an initial classifier. This initial classifier is then used to extract a larger training set from the remaining untagged corpus. With each iteration of this process, the training corpus grows and the untagged corpus shrinks.

Hearst (1991)[21] generates a seed set by simply hand-tagging a small set of examples from the untagged corpus. However, during the training phase each occurrence of a set of nouns to be disambiguated is manually sense-tagged in several occurrences. Schütze (1992, 1993)[22;23] proposes a method that avoids tagging each occurrence in the training corpus. Yarowsky (1995)[24] proposes an alternative technique by using two constraints named as “One sense per collocation” and “One sense per discourse” and reports an accuracy of 96% on twelve words.  “One sense per collocation” argues that nearby words provide strong and consistent clues to the sense of a target word, conditional on relative distance, order and syntactic relationship. Also, “One sense per discourse” constraint argues that the sense of a target word is highly consistent within any given document. Different bootstrapping techniques are also presented in Mihalcea and Moldovan (2001a) [25] and Mihalcea (2002)[26].

Another method for avoiding hand-tagged data is using parallel corpora (Brown et al. 1991b)[27]. In this method, bilingual corpora are used since different senses of some words often translate differently in another language.  By using a parallel aligned corpus, the translation of each occurrence of such words can be used to determine their correct senses automatically. In Dagan and Itai (1994)[28], Ide et al. (2002)[29] and Ng et al. (2003)[30], various uses of parallel corpora for WSD and its disadvantages can be found.

Another problem that supervised disambiguation methods face with is data sparseness. Since the sense-tagged training corpus is finite and very few for WSD, some senses of polysemous words are very likely to be missing. For a supervised algorithm to be successful, the training data must ensure that all senses of a polysemous word are covered. Smoothing is used to solve the data sparseness problem. The task of reevaluating some of the zero-probabilities or low-probabilities and assigning them non-zero values is called smoothing. Some of the smoothing methods are add-one smoothing, Witten-Bell smoothing (Witten and Bell, 1991)[31], and Good-Turing smoothing (Good, 1953)[32].

 

2.1.1.2 Unsupervised Disambiguation

In unsupervised disambiguation, information is gathered from raw corpora which have not been semantically disambiguated. Unsupervised methods correspond to clustering tasks rather than sense tagging tasks. Indeed, completely unsupervised disambiguation is not possible for word senses since sense tagging requires characterization of the senses. WSD can be divided into two subproblems: sense discrimination and sense labeling. If sense labeling is part of the task, an outside source of knowledge is necessary to define the senses. Sense discrimination can be done in a completely unsupervised way. It divides the occurrences of a word into a number of classes by determining for any two occurrences whether they belong to the same sense or not (Schütze and Pedersen 1995; Pedersen and Bruce, 1997; Schütze, 1998)[33;34;35]. Schütze’s (1998)[35] results indicate that for coarse binary distinctions, unsupervised techniques can achieve results approaching those of supervised and bootstrapping methods. Purandere and Pedersen (2004)[36] present a systematic comparison of discrimination techniques proposed by Pedersen and Bruce (1997; 1998)[34;37] and by Schütze (1992; 1998)[22;35]. 

Infrequent senses and senses that have few collocations are hard to isolate in unsupervised disambiguation. In general, accuracy of unsupervised WSD systems are 5% to 10% lower than that of other algorithms since no lexical resources for training or defining senses are used.

 

2.1.2 Knowledge Based Approaches

All of the methods described up to now require considerable amount of work to create a classifier for each entry in the lexicon. Because of this reason, they were able to report results on very few lexical items (in general between 4 to 12 items). With the availability of large-scale lexical resources, such as dictionaries, thesauri and corpora, work on WSD focused on large-scale disambiguation. In the following subsections WSD based on machine readable dictionaries, thesauri and computational lexicons will be presented.

2.1.2.1 Machine Readable Dictionaries

Machine readable dictionaries (MRD) provide a ready-made information source of word senses. The first attempt to use MRD’s came from Lesk (1986)[38]. He starts from the simple idea that a word’s dictionary definitions are likely to be good indicators of the senses they define. By using Oxford Advanced Learner’s Dictionary (OALD), he counts overlapping content words in the sense definitions of the ambiguous word and in the definitions of context words occurring nearby and selects the sense that achieves the maximum number of overlaps. The accuracy of the method is reported to be 50-70% which is a good result regarding the fact that a fine set of sense distinctions have been used such as OALD.

Cowie et al. (1992)[39] tried to improve Lesk’s approach by optimizing the overlap of all words in a single sentence simultaneously. However, it was found computationally very expensive. Therefore, Cowie et al. (1992)[39] used simulated annealing (Sampson 1986)[40] for the first time in natural language processing. They evaluated this approach using a corpus consisting of 50 sentences extracted from Longman Dictionary of Contemporary English (LDOCE). 47% of the words were reported to be correctly disambiguated to sense level and 72% to more rough grained senses.

Stevenson and Wilks (2001)[41] computed the overlap by normalizing the contribution of a word to the overlap count. Pedersen and Banerjee (2002)[42] described a different version of the Lesk’s algorithm by employing glosses contained in WordNet (Fellbaum, 1998; Miller et al., 1990)[19;43], a lexical database for English which will be described later in detail.

Because of the fact that dictionaries are created for human use, not for computers, there are some inconsistencies (Veronis and Ide 1991; Ide and Veronis 1993a, 1993b)[44;45;46]. Although they provide detailed information at the lexical level, they lack pragmatic information used for sense determination. For instance, the relation between ash and tobacco, cigarette or tray is very indirect in a dictionary whereas the word ash co-occurs very frequently with these words in a corpus (Ide and Veronis, 1998)[47].

2.1.2.2 Thesauri

Thesauri provide information about relationships among words. Thesaurus based disambiguation makes use of the semantic categorization provided by a thesaurus or a dictionary with subject categories. The most frequently used thesaurus in WSD is Roget’s International Thesaurus (Roget, 1946) which was put into machine-tractable form in 1950’s. The basic inference in thesaurus-based disambiguation is that semantic categories of the words in a context determine the semantic category of that context as a whole. And this category is then determines the correct senses that are used.

Walker (1987:254) proposed an algorithm as follows: each word is assigned to one or more subject categories in the thesaurus. If the word is assigned to several subjects, then it is assumed that they correspond to different senses of the word. Black (1988: 187) applied this approach to five different words and achieved accuracies around 50%.

Yarowsky (1992) proposed an approach for marking words with their categories from Roget’s thesaurus. The important aspect of the approach was that he used a context of 50 words either side so that 100 words were considered in the training examples for each ambiguous word. This method was tested on 12 ambiguous words and reported to achieve 92% accuracy. Yarowsky notes that this method is best for extracting topical information, most successful for nouns.  

Similar to machine readable dictionaries, a thesaurus is a resource for humans, so there is not enough information about word relations. Therefore, Roget’s or any other thesauri have not been used so extensively.

 

2.1.2.3 Computational Lexicons

The usefulness of lexical relations in linguistic, psycholinguistic and computational research has led to a number of efforts to create large electronic databases of such relations. Beginning from the mid-1980’s, construction of semantic lexicons by hand has emerged. Some examples of these lexicons are WordNet (Fellbaum 1998; Miller et al. 1990), CyC (Lenat and Guha 1990), ACQUILEX (Briscoe 1991), and COMPLEX (Grishman, Macleod, and Meyers 1994).

Since WordNet is the most popular lexicon among the above and since it is used in the work done in this thesis both for sense evaluation and for similarity measure, this section presents detailed information about it.

 

WordNet

WordNet is an online lexical reference system which was developed at Princeton University under the direction of Professor George A. Miller. It combines many features used for WSD in one system. It includes definitions of word senses as in a dictionary; it defines “synsets” of synonymous words representing a single lexical concept like a thesaurus; and it includes word-to-word relations.

WordNet consists of three databases: noun database, verb database and one database for adjectives and adverbs. Each database consists of lexical entries corresponding to unique orthographic forms. Each form is associated with a set of senses.

 

Overview of noun car

The noun car has 5 senses (first 3 from tagged texts)

1. (598) car, auto, automobile, machine, motorcar -- (4-wheeled motor vehicle; usually propelled                                       by an internal combustion engine; "he needs a car to get to work")

2. (24) car, railcar, railway car, railroad car -- (a wheeled vehicle adapted to the rails of railroad;                                                      "three cars had jumped the rails")

3. (1) cable car, car -- (a conveyance for passengers or freight on a cable railway; "they took a                                                      cable car to the top of the mountain")

4. car, gondola -- (car suspended from an airship and carrying personnel and cargo and power                                                      plant)

5. car, elevator car -- (where passengers ride up and down; "the car was on the top floor")

 


 

Figure-1: WordNet 2.0 entry for the noun “car

 

WordNet’s sense entries consist of a set of synonyms, a dictionary-style definition, or gloss, and some example uses as shown in Figure-1. The numbers in front of some senses are the frequency values obtained from SemCor corpus. This simple listing of lexical entries looks like an ordinary dictionary. However, WordNet contains a set of domain-independent lexical relations which hold among WordNet entries, senses or sets of synonyms. Lexical relations are only available within the same database. Figure-2, 3 and 4 show a subset of the relations associated with each of the three databases.

 

Relation

Definition

Example

Hypernym

Hyponym

Has-member

Member-of

Has-part

Part-of

Antonym

From concepts to superordinates

From concepts to subtypes

From groups to their members

From members to their groups

From wholes to parts

From parts to wholes

Opposites

breakfast → meal

meal → lunch

faculty → professor

pilot → crew

table → leg

course → meal

leader → follower

 

Figure-2: Noun relations in WordNet

 

Relation

Definition

Example

Hypernym

Troponym

Entails

Antonym

From events to superordinate events

From events to their subtypes

From events to the events they entail

Opposites

fly → travel

walk → stroll

snore → sleep

increase → decrease

 

Figure-3: Verb relations in WordNet

 

Relation

Definition

Example

Antonym (adjective)

Antonym (adverb)

Opposite

Opposite

heavy → light

quickly → slowly

 

Figure-4: Adjective and adverb relations in WordNet

 

Synonym and hyponym relations are the most useful relations presented in WordNet. Synonym means different lexemes with the same meaning. Two WordNet entries are considered to be synonyms if they can be successfully substituted in some context. A synset is a set of synonyms. Consider the following example of a synset:

{ car, auto, automobile, machine, motorcar }

The gloss of this synset describes it as “4-wheeled motor vehicle; usually propelled by an internal combustion engine; "he needs a car to get to work" as shown in Figure-1 for the first sense of noun car. Instead of representing concepts using logical terms, WordNet represents them as synsets.

Each synset is related to its immediately more general and more specific synsets via direct hypernym and hyponym relations. To find chains of more general or more specific synsets, one can follow a transitive chain of hypernym and hyponym relations.  As an example, the hypernym chain for the noun valley is shown in Figure-5. In this figure, successively more general synsets are shown on successive indented lines. The chain starts from valley and ends in entity.

 

valley, vale

       => natural depression, depression

           => geological formation, formation

               => natural object

                   => object, physical object

                              =>entity

Figure-5: Hyponymy chain for the noun “valley

 

Since WordNet provides the broadest set of lexical information in one resource, it is used widespreadly. Another reason for its wide-spread use is, it is the first broad-coverage lexical resource that is freely available.

The earliest attempts to use WordNet in WSD were in information retrieval field. Voorhees (1993) and Richardson and Smeaton (1994) created knowledge bases using WordNet’s hierarchy. Li et al. (1995) proposed a WordNet-based algorithm for WSD. Disambiguation was done by semantic similarity between words and heuristic rules. Heuristic rules were based on the semantic similarity and the WordNet hierarchy. Leacock et al. (1998) used WordNet to counter data sparseness problem. Hawkins (1999) built up a WSD system that works with frequency and contextual information that is based on WordNet. Fellbaum et al. (2001) proposed a system that made use of syntactic clustering and semantic distinctions extracted from WordNet.

As seen, there has been various work regarding WordNet. But WordNet is mostly used to determine semantic similarity between senses. Resnik (1995a) computed information content of words which is a measure of the specificity of the concept that subsumes the words in the WordNet hypernym hierarchy. Agirre and Rigau (1996) employ WordNet to determine the conceptual distance among concepts whereas Mihalcea and Moldovan (1999) exploit semantic density and WordNet glosses in an all words design. Lin (1997; 1998; 2000) described a semantic similarity measure where similarity between two objects is defined to be the amount of information contained in the commonality between the objects divided by the amount of information in the description of the objects. We used the same measure for the work conducted in this thesis. Detailed information is given in Section 3.2.3.

Other approaches using WordNet are Jiang (1997), Agirre and Martinez (2000), Agirre et al. (2001), Hayres (2001) and Banerjee et al. (2003). A combination of MRDs and WordNet has also been tried with success (Litkowski, 2000, 2001; Mihalcea and Moldovan, 1998).

2.2 Information Sources for WSD

There are various information sources or feature types used in WSD regardless of the type of the approach. To disambiguate a word, a diversity of information, including syntactic tags, word frequencies, collocations, semantic context, role-related expectations, and syntactic restrictions can be considered.

In Agirre and Martinez (2001), a comparison of WSD systems has been made based on the information source they used. Some of these sources are as follows:

Frequency of senses: Frequency information is used to measure the likelihood of each possible sense appearing in the text. Therefore this information is generally used in statistical approaches and it is generally learned from hand-tagged data such as SemCor corpus. WordNet senses are ordered according to the frequencies of the senses in the SemCor corpus.

Part of speech (POS): Part of speech tagging is regarded as the first step of the disambiguation process if the tokens have the same orthographic forms but different syntactic classes. It is necessary since it reduces the number of possible senses a word can belong to. An orthographic form may even be unambiguous in one syntactic class whereas it has more than one sense in another. For instance, in WordNet 2.0 handle has 5 senses as a verb, but only one sense as a noun. The impact of knowledge resources on WSD in examined in Gaustad (2004). The results show that accurate POS information is beneficial for WSD and that including the POS of the ambiguous word itself as well as POS of the context increases the disambiguation accuracy.

Morphology: It is defined specially as the relation between derived words and their roots. For instance, the noun agreement has 6 senses, its verbal root agree 7. A stemmer tries to reduce various forms of a word to a single stem. Since English is a language with little inflectional morphology, it is not certain that using morphology will lead to significant improvements in WSD. With other languages, such as German or Italian, morphology is of greater influence. Derivational morphology would be useful only for disambiguating roots.

Collocations: Collocation is the relationship among any group of words that tend to co-occur in a predictable configuration. Disambiguation relies heavily on collocational information. For example, the noun match has 9 senses. However, it has only one possible sense in “football match”. It is observed that collocations are strong indicators if they are learned from hand-tagged corpora. Although they are strong, they should be used with other sources. They should not be treated as rules for sense-filtering alone (as in Hirst 1987 or Dahlgren, McDowell, and Stabler 1989).

Semantic word associations: These can be classified as follows:

i.                    Taxonomial organization: This refers to the classification of words in a hierarchy and the lexical-semantic relationships holding between words such as a dog is a kind of animal. This kind of information can be extracted from ontologies like WordNet.

ii.                  Situation and Topic: Information about the situation or topic enables a WSD system to see the ambiguous word in a broader context. For example, if the word mouse is used in an office situation and the topic is computer use, the most probable sense of the word mouse will be “computer tool”, not “animal”. Semantic word associations around topic and situation are powerful when learned from hand-tagged corpora. Associations learned from MRDs can also be useful.

iii.                Argument-head relations: These relations provide important clues for disambiguation such as the relationship between dog and bite in the sentence “the dog bite the man.” Yarowsky (1993) showed that collocation information can be captured using argument-head relations. Argument-head relations work as a strong indicator if they are given as a sense-to-word relation.

Syntactic structure (subcategorization information): Subcategorization and dependency relations refer to certain kinds of relations between words and phrases. For example the verb want can be followed by an infinitive, as in “I want to fly to Istanbul”, or a noun phrase, as in “I want a flight to Istanbul”. But the verb find cannot be followed by an infinitive. Verbs have several possible patterns of arguments. A particular set of arguments that a verb can appear with is referred to a subcategorization frame. Subcategorization frames capture syntactic regularities about complements.

Gaustad (2004) used dependency relations in his WSD system for Dutch and showed that syntactic information provides very useful disambiguation clues and leads to a significant increase in performance. For the work presented in this thesis, we also used syntactic information and dependency relations, so detailed information about this kind of knowledge source will be given in the following sections.     

Agirre and Martinez (2001) made a comparison between the contributions of the above resources to WSD. According to their observations, if learned from hand-tagged corpora, collocations and semantic word associations are the most important knowledge types for WSD, but they also mentioned that syntactic cues are equally reliable. On the other hand, taxonomical information was found to be very weak.

2.3 General Approach and Related Work

As can be seen from the previous sections, most of the study in WSD is concentrated so far on supervised learning since these algorithms usually achieve the best performance. This fact is shown by the results of SENSEVAL-2 (Cotton et al., 1998). However these systems have the drawback that they are only applicable to the words for which sense tagged data is available, and they are strongly dependent to the labeled training data available. In addition, almost all of these systems make use of the picking the first sense heuristic.

Supervised systems and first sense heuristic has come to a natural bound of approximately 70% precision (Mihalcea, 2002). In order to improve this value, benefits of knowledge-based or unsupervised alternatives should be examined.

In this thesis, we present a disambiguation algorithm that relies on the information gained from syntactic context and sense similarity. The WSD approach used in the system cannot be specified as unsupervised since it uses a lexicon, namely WordNet, for sense disambiguation and for semantic similarity metric; however, it may also be regarded as unsupervised in the sense that no sense-tagged data is needed for training. All details about the system will be given in Chapter-3. In this section, a number of WSD systems that have employed syntactic context in the past are introduced.

The algorithm used in this thesis is very similar to the algorithm presented in Lin (1997). Lin mentions the same problems that supervised algorithms face with: need for a large sense-tagged training set and data sparseness. In order not to deal with these problems, Lin proposes an algorithm adopting the intuition that “two different words are likely to have similar meanings if they occur in identical local contexts”. This means that the same knowledge sources are used for all words and that instead of building separate classifiers for each word, past usages of other words are used to disambiguate the current word.

Using a broad-coverage parser, a 25-million-word Wall Street Journal corpus was parsed and dependency relationships were extracted. Dependency relationships were stored as dependency triplets which were composed of the type of the dependency relation, the word related to the ambiguous word via the dependency relation and information on whether the word is the head or the modifier in the relationship. A local context database was constructed for each of these triplets (local contexts). An entry in the database consisted of twenty words found from the training corpus that were most likely to occur in that particular local context.

In the disambiguation process, input is parsed and local contexts are extracted. Local context database is searched to find the words that occurred in the same local context as the ambiguous word. Based on a similarity maximization algorithm between the words, the polysemous word is disambiguated. The algorithm used in the similarity maximization part of both Lin’s work and our work is very similar to the algorithm presented in (Resnik, 1995a).

Since Wall Street Journal is mostly business news, Lin only used the “press reportage” of SemCor corpus for testing. An accuracy of 56.1% is reported for nouns.

Li, Szpakowicz, and Matwin (1995), presents an algorithm for automatic word sense disambiguation, based on lexical knowledge contained in WordNet and on the results of surface-syntactic analysis. The algorithm was designed to support text analysis with minimal precoded knowledge. They proposed a set of heuristic rules that are based on the idea that objects of the same or similar verbs are similar. Disambiguation is done by computing the semantic similarity between words and applying the heuristic rules based on this similarity.

Stetina et al. (1998) achieve good results with syntactic relations as features compared to the frequency baseline. They use a measure of semantic distance based on WordNet to find similar features. The features are extracted using a statistical parser (Collins, 1996), and consist of the head and modifiers of each phrase.

Martinez et al. (2002) discuss the contribution of various syntactic features to WSD: grammatical relations coded as the presence of adjuncts/arguments in isolation or as subcategorization frames, and instantiated grammatical relations between words.

The set of syntactic features was extracted using Minipar (Lin, 1993). Two types of syntactic relations were taken into account: instantiated grammatical relations (IGR) and grammatical relations (GR). IGR are triples of [wordsense relation value] where the value is either the word form or the lemma. For GR, bigrams [wordsense relation] and n-grams [wordsense relation1 relation2...] are stored. Three types of n-grams are defined: all subcategorization information including part of speech given by Minipar, subcategorization information occurring in the sentence, and all dependencies in the parse tree. Their results on the Senseval-2 lexical sample data for English indicate that both types of syntactic relations, IGR and GR, together provide the best F-score (IGR getting better precision than coverage and GR lower precision, but higher coverage). They also came to the conclusion that syntactic features effectively improve WSD precision.

Gaustad (2004) also investigated whether structural syntactic information is helpful for lexical semantic disambiguation or not. He used the Alpino dependency parser for Dutch in order to extract the dependency relations. He reported that using dependency relations in conjunction with the lemma of the ambiguous word and its part of speech as features performs better than the model including only immediately left and right context.

 

 

Chapter 3

 

Word Sense Disambiguation Based on Sense

Similarity and Syntactic Context

 

 

3.1 Introduction

Most corpus-based WSD algorithms make use of the previous usages of a polysemous word in order to disambiguate that word. These algorithms basically rely on the intuition that “two occurrences of the same word have identical meanings if they have similar local contexts”. However, this intuition can be misleading. In order for this method to have the desired consequences; a polysemous word must be observed several times before making a good classifier. Knowing that there are nearly 15000 polysemous nouns in WordNet 2.0, the corpus must contain millions of words. If the corpus fails to contain a polysemous word or all the senses of a word, the algorithm can lead to false results.

In this thesis, we make use of a WSD algorithm that relies on the intuition that “two different words are likely to have similar meanings if they occur in identical local contexts”. By this approach, a generalization can be made since it allows using the knowledge learned for one word on other words; instead of using a separate classifier for each polysemous word. Hence, the size of the training corpus can be much smaller. Moreover, this method is able to disambiguate infrequent or unseen words in the training corpus.

The basic steps of the algorithm are as follows:

                    i.            Parse the training corpus and build up a local context database by extracting local contexts according to some rules.

                  ii.            Similarly, parse the input text and extract local contexts of each ambiguous word.

                iii.            For each ambiguous word w, search the local context database and find words that are in an identical local context as w. These words are the similar words set S of w which will be used in sense selection.

                iv.            Select a sense of w that maximizes the similarity between w and its similar words set S that was found in the previous step.

Consider the sentence below:

Jane applied for a job.

 

The noun “job” has 15 possible meanings in WordNet 2.0. To disambiguate the noun “job”, we consider the other words that appeared in an identical context as “job”. According to our system, the local context of “job” is “N:prep-for:apply:head:V”, that is there is a relation between the verb “apply” and the noun “job” via the preposition “for”. In Table-1, the words that occurred in the local context “N:prep-for:apply:head:V” are presented. The numbers in table indicate the likelihood ratios of the words in that local context. The meaning of “job” in the example can be determined by choosing one of its 15 senses that is most similar to the meanings of the words in Table-1.

 

N:prep-for:apply:head:V

license                   419.46

membership          324.86

job                         299.67

approval                241.81

patent                    233.65

permission             207.59

amnesty                198.54

permit                    166.21

listing                     141.75

loan                       141.00 

citizenship                   100.85

asylum                          92.37

visa                               84.21

status                            72.32

grant                             69.68

exemption                     59.04

legalization                   58.55

noun.person                 54.34

refund                           52.63

residence                      50.20

Table-1: Words that have a relation with “apply” via “for”

This chapter is structured as follows: Firstly we present the resources needed for the algorithm. In section 3.3 and section 3.4, the concepts of local context and local context database are explained respectively. Later, the disambiguation algorithm is presented in section 3.5. Finally, we conclude this chapter with the results obtained from the experiments.

3.2 Resources of the Algorithm

In order to implement this algorithm the following resources are needed:

                    i.            an untagged training corpus,

                  ii.            a concept hierarchy,

                iii.            a similarity measure,

                iv.            a broad-coverage parser.

In this section, all of the above resources are presented one by one.

3.2.1 Training Corpus

For the training corpus, we used a 100-million-word Wall Street Journal corpus of the periods presented below:

Source        Period              Original Publication(s)

---------------------------------------------------------------

WSJ           1987-9             ACL/DCI and CSR WSJ0

WSJ           1990-2             TIPSTER Vol. 2, version 2

WSJ           1992-4             CSR LM-1 corpus

 

The WSJ 1992 collection on TIPSTER covers only the first three months of the year.  Data from the remaining months of 1992 were taken from the LDC, along with all of 1993 and the first three months of 1994.

3.2.2 Concept Hierarchy (WORDNET 2.0)

As the concept hierarchy WordNet 2.0 is used. Information about the general WordNet structure is given in section 2.1.2.3. WordNet 2.0 is available for Unix and Windows platforms. WordNet 2.0 includes more than 42,000 new links between nouns and verbs that are morphologically related, a topical organization for many areas that classifies synsets by category, region, or usage, gloss and synset fixes, and new terminology.

3.2.2 Concept Hierarchy (WORDNET 2.0)

As the concept hierarchy WordNet 2.0 is used. Information about the general WordNet structure is given in section 2.1.2.3. WordNet 2.0 is available for Unix and Windows platforms. WordNet 2.0 includes more than 42,000 new links between nouns and verbs that are morphologically related, a topical organization for many areas that classifies synsets by category, region, or usage, gloss and synset fixes, and new terminology.

 

3.2.3 Similarity Measure

In order to find the similarity between the senses of a polysemous word and another word we need a similarity measure. There is a strong relationship between a word’s semantics (or meaning) and its syntactic behavior. As discussed in Weeds (2002), if two words mean similar things then they will occur in documents in similar contexts. Accordingly, we can attempt to calculate the similarity between all senses of two words by finding all of the contexts that each appears in. If a description of a word is made up of the contexts it appears in and their associated frequencies then we can mathematically measure the similarity between these descriptions. For example, the word dog may occur as the subject of the word eat x number of times - this will increase the similarity between dog and other words which also occur as the subject of the word eat e.g. cat. Many measures have been proposed to calculate the similarity between the descriptions of words; we used WordNet Similarity Package v.0.15 which implements the semantic relatedness measures described by Leacock & Chodorow (1998), Jiang & Conrath (1997), Resnik (1995), Lin (1998), Hirst & St-Onge (1998), Wu & Palmer (1994), the adapted gloss overlap measure by Banerjee and Pedersen (2002), and two measures based on context vectors by Patwardhan (2003). This package is able to calculate the similarity between either two words or two senses (or synsets). We use it to find the similarity between two senses.

We used Lin’s measure (1998) in the experiments conducted for this thesis.

 

Lin’s Measure[1]

This measure is derived from the following assumptions (Lin, 1998):

 

i.                    The commonality between the words A and B is measured by I(common(A,B)) where common(A,B) is a proposition that states the commonalities between A and B and I(s) is the amount of information contained in the proposition s.

ii.                  The differences between the words A and B is measured by

I(describe(A,B)) – I(common(A,B))

             where describe(A,B) is a proposition that describes what A and B are.

iii.                The similarity between A and B, sim(A,B), is a function of their commonality and differences. That is,

sim(A,B) = f ( I(common(A,B)) , I(describe(A,B)))

                   where the domain of function f(x,y) is { (x,y) | x ≥ 0 , y > 0 , y ≥ x } .

iv.                Similarity is independent of the unit used in the information measure.

                  According to the Information Theory (Cover and Thomas, 1991), I(s) = -logbP(s),          where P(s) is the probability of s, and b is the unit. When b=2, I(s) is the number of bits                needed to encode s. Since logbx =  , assumption (iv) means that the function f                 must satisfy the following condition:

                                                     

v.                  Similarity is additive with respect to commonality.

If common(A,B) consists of two independent parts, then the sim(A,B) is the sum of the          similarities computed when each part of the commonality is considered. In other words:             f(x1 + x2, y) = f(x1,y) + f(x2,y). Therefore,,                              which means that when there is no commonality between A and B, their similarity is 0, no            matter how different they are.

vi.                The similarity between a pair of identical objects is 1.

When A and B are identical, knowing their commonalities means knowing what they are, i.e., I(common(A,B)) = I(describe(A,B)). Therefore, the function f must have the following property: 

vii.              The function f(x,y) is continuous.

 

Similarity Theorem: The similarity between A and B is measured by the ratio between the amount of information needed to state the commonality of A and B and the information needed to fully describe what A and B are.

 

sim(A,B)

Proof: To prove the theorem, we need to show f(x,y) = . Since f(x,y) = f(,y) (due to assumption (iv)), we only need to show that when  is a rational number, f(x,y) = .The result can be generalized to all real numbers because f is continuous and for any real number, there are rational numbers that are infinitely close to it.

Suppose m and n are positive integers.

                                          f(nx,y) = f((n-1)x,y) + f(x,y) = n f(x,y)  (due to assumption (v))

Thus,

Substituting  for x in this equation:

Since is rational, there exist m and n such that =. Therefore,

 

For example, Figure-6 is a fragment of the WordNet. The nodes are synsets. The links represent hypernym relationships. For every synset C, logP(C) can be defined as the probability that a randomly selected noun refers to an instance of C. The probabilities are estimated by the frequency of synsets in SemCor.

If x is a hill and y is a coast, the commonality between x and y is that “x is a geological-formation and y is a geological-formation”. The information contained in this statement is -2xlogP(geological-formation). The similarity between the synsets hill and coast is:

Generally speaking,

sim(A,B) =

 

where  is the probability that an object belongs to all the maximally specific super classes (Ci s) of both A and B.

 

Figure-6: A fragment of WordNet

3.2.4 Broad-coverage Parser (MINIPAR)

Parsing means taking an input and producing some sort of structure for it. A parser for a natural language such as English, Turkish, etc. is a program that constructs derivations or phrase structure trees for the sentences of that language. It supplies a correct grammatical analysis for a given sentence by separating it into its constituents and labeling each of them. It may also offer additional information, such as the semantic class (e.g., Person, Physical Object) of each word and the functional class (e.g., Subject, Direct Object) of each constituent of the sentence.

We chose to use MINIPAR (Lin, 2001) in order to extract the local contexts from the text corpus. MINIPAR is a principle-based, broad-coverage parser for English (Lin, 2001). It represents its grammar as a network of nodes and links, where the nodes represent grammatical categories and the links represent types of dependency relationships. In the Table-2 and Table-3 the lists of grammatical categories and dependency relationships are presented respectively.

 

 

GRAMMATICAL CATEGORIES

 

 

Det                                                          Determiners

PreDet                                                     Pre-determiners

PostDet                                                   Post-determiners

NUM                                                       numbers

C                                                              Clauses

I                                                               Inflectional Phrases

V                                                              Verb and Verb Phrases

N                                                              Noun and Noun Phrases

NN                                                           noun-noun modifiers

P                                                              Preposition and Preposition Phrases

PpSpec                                                   Specifiers of Preposition Phrases

A                                                              Adjective/Adverbs

Have                                                       have

Aux                                                         Auxiliary verbs

Be                                                            Different forms of “be”

COMP                                                    Complementizer

VBE                                                        be used as a linking verb

V_N                                                        verbs with one argument , i.e., intransitive verbs

V_N_N                                                   verbs with two arguments, i.e., transitive verbs

V_N_I                                                    verbs taking small clause as complement

SentAdjunct                                          Sentence adjunct

U                                                              undefined

 

Table-2 : Grammatical Categories of MINIPAR

 

 

 

 

GRAMMATICAL RELATIONSHIPS

 

 

appo                                                       "ACME president, --appo → P.W. Buckman"

aux                                                          "should ← aux-- resign"

be                                                            "is ← be-- sleeping"

c                                                               "that ← c-- John loves Mary"

comp1                                                    first complement

det                                                           "the ← det -- hat"

gen                                                          "Jane's ← gen -- uncle"

have                                                        "have ← have-- disappeared"

i                                                                relationship btw. a C clause and its I clause

inv-aux                                                   inverted auxiliary: "Will ← inv-aux-- you stop it?

inv-be                                                     inverted be: "Is ← inv-be-- she sleeping"

inv-have                                                 inverted have: "Have ← inv-have-- you slept"

mod                                                        relationship btw. a word and its adjunct modifier

pnmod                                                    post nominal modifier

p-spec                                                     specifier of prepositional phrases

pcomp-c                                                 clausal complement of prepositions

pcomp-n                                                nominal complement of prepositions

post                                                         post determiner

pre                                                           pre determiner

pred                                                         predicate of a clause

rel                                                            relative clause

vrel                                                          passive verb modifier of nouns

wha, whn, whp                                      wh-elements at C-spec positions

obj                                                           object of verbs

obj2                                                        second object of ditransitive verbs

subj                                                         subject of verbs

s                                                               surface subject


 

 

Table-3 : Grammatical Relationships of MINIPAR

 

MINIPAR achieves about 88% precision and 80% recall with respect to dependency relationships (Lin, 1998a), evaluated on the SUSANNE corpus (Sampson, 1995), a subset of the Brown Corpus of American English. In (Alam, 2002), they also performed a test on 584 sentences. Instead of checking the parse trees, they checked the frames corresponding to the sentences. They counted all the errors caused by MINIPAR and its accuracy was found 77.6% in terms of correctly parsed sentences.

As reported in (Alam, 2002), the majority of MINIPAR errors fall in the following categories:

1. Tagging errors: some nouns are mistagged as verbs.

2. Attachment errors: some prepositional phrases (PP) that should be attached to their immediate preceding nouns are attached to the verbs.

3. Missing lexical entries: some domain specific words such as download and their usages are not in the MINIPAR lexicon. This introduces parsing errors because such words are tagged as nouns by default.

4. Inability to handle ungrammatical sentences: in a real world application, it is unrealistic to expect the user to enter only grammatical sentences. Although MINIPAR still produces a syntactic tree for an ungrammatical sentence, the tree is ill formed and cannot be used to extract the semantic information being expressed.

3.3 Local Context

Context is the only source of information we use to identify the meaning of a polysemous word. A great deal of work done in WSD uses local context as the primary information source. However, every system has its own definition of a local context. It is generally considered as a small window of words surrounding a target word in a text. In these cases, the local context is generally regarded as all words or characters falling within some window without considering any features or any kinds of relations. Below are some features that are taken into consideration in local context selection:

Distance: Examining a context of a few words around the target word to disambiguate is the mostly used method in WSD. In this method, there is always a question concerning the optimal window size N.

Yarowsky (1993, 1994a and b) examines different windows of local context, including 1-contexts, k-contexts, and words pairs at offsets -1 and -2, -1 and + 1, and +1 and +2, and sorts them using a log-likelihood ratio to find the most reliable evidence for disambiguation. He makes the observation that the optimal value of k varies with the kind of ambiguity: he suggests that local ambiguities need only a window of k = 3 or 4, while semantic or topic-based ambiguities require a larger window of 20-50 words. No single best measure is reported, suggesting that for different ambiguous words, different distance relations are more efficient.

Collocation: The term “collocation” has been used in various senses in WSD work. Yarowsky (1993) adapts the definition of collocation as “the co-occurrence of two words in some defined relation.” He determined that in cases of binary ambiguity, there exists “one sense per collocation,” that is, in a given collocation a word is used with only one sense with 90-99% probability.

Syntactic relations: In most WSD work to date, syntactic information is used. Selectional restrictions rely on parsing, frames, semantic networks, the application of selectional preferences, etc.

In this thesis, the local context of a word is defined in terms of the syntactic dependencies between the word and the other words in the same sentence. Dependency grammars generate parses where words in a sentence are related directly to the word which is its syntactic head. As mentioned before, we used MINIPAR in order to generate a parse including dependency relationships between words for a sentence. Some examples of dependencies generated by MINIPAR can be found in Figure-7.

 

Figure-7

 

In Figure-7, the labels on the arrows indicate the grammatical relationships generated by MINIPAR which are shown in detail in Table-2. The local context of a word W1 is defined as a tuple as follows:

 

POS1 : relation : W2 : position : POS2

 

where POS1 is the part of speech of word W1; relation is the grammatical relationship between W1 and W2; W2 is the word related to W1 via the grammatical relationship; position is the indicator of W2 whether it is the syntactic head or the syntactic modifier(mod) in the grammatical relation and POS2 is the part of speech of word W2. Part of speech tagging is also done by MINIPAR and the possible part of speech tags are presented in Table-1.

The following local contexts are generated for the first sentence (dependency diagram) in Figure-1:

word          local contexts          

buy             ( V : aux : will : mod : Aux ) (V : subj : I : mod : N) (V : obj : car : mod : N )

will              ( Aux : aux : buy : head : V )

I                  ( N : subj : buy : head : V )

car              ( N : obj : buy : head : V ) ( N : mod : red : mod : A ) ( N : det : a : mod : Det)

red              ( A : mod : car : head : N )

a                 ( Det : det : car : head : N )

3.4 Local Context Database

In order to construct a local context database, initially, we have parsed the 100-million-word training corpus with MINIPAR and extracted local contexts. Later, we have introduced two types of modifications: function words are removed and they are bridged with semantic relations. Below, the general rules for excluding function words from the database are presented:

1. If one of the words in the dependency relationship is an auxiliary verb, that is if POS1 or POS2 is equal to “Aux” and also the relationship is “aux”, then we delete that dependency relationship from the database. Auxiliary verbs (will, shall, should, can, etc.) do not provide us any information about the meaning of a word.

2. If one of the words in the dependency relationship is a sentence adjunct, that is if POS1 or POS2 is equal to “SentAdjunct”, then we delete that dependency relationship from the database. Sentence adjuncts are always optional because they do not depend on the verb, and they may occur initially or finally. For example, “When you receive that card, you should know that you are not being invited to join the party.”

3. A complementizer is a conjunction which marks a complement clause. If one of the words in the dependency relationship is a complementizer, that is if POS1 or POS2 is equal to “COMP”, then we delete that dependency relationship from the database. An example for a complementizer is: I know that he is here.”

4. If the dependency relation is “punc” or if POS1 or POS2 is equal to “U” or “have”, then these dependency relationships are deleted from the database.

5. Contractions (`m, `s, `re, etc.) are deleted from the database.

6. Some words such as “one” are used very frequently in English. Therefore, since the information gained from these kinds of words does not contribute to the disambiguation process, dependency relationships including these words are deleted from the database.

7. Determiners are deleted from the database, i.e. a, an, the.

In addition to the dependency relationships produced by the parser, we created new dependency relationships by bridging function words. The rules about how to create these will be presented on example sentences. Consider the five sentences and related parser outputs below:

(1)  I went to the cinema.

                                          go                    V:mod:to:mod:Prep

                                          to                     Prep:mod:go:head:V

                                          to                     Prep:pcomp-n:cinema:mod:N

                                          cinema            N:pcomp-n:to:head:Prep

 

(2)  I am sitting at a table.

                                          sit                     V:mod:at:mod:Prep

                                          at                     Prep:mod:sit:head:V

                                          at                     Prep:pcomp-n:table:mod:N

                                          table                N:pcomp-n:at:head:Prep

(3)  The gardener cut the grass.

                                          cut                   V:subj:gardener:mod:N

                                          gardener         N:subj:cut:head:V

                                          cut                   V:obj:grass:mod:N

                                          grass               N:obj:cut:head:V

(4)  The thief who stole her bag is caught.

                                          thief                 N:rel:fin:mod:C

                                          fin[2]                   C:rel:thief:head:N

                                          fin                    C:whn:who:mod:N

                                          who                 N:whn:fin:head:C

                                          fin                    C:i:steal:mod:V

                                          steal                 V:i:fin:head:C

                                          steal                 V:subj:who:mod:N

                                          who                 N:subj:steal:head:V

 

(5) His car is very fast.

                                          be                    VBE:pred:fast:mod:A

                                          fast                  A:pred:be:head:VBE

                                          fast                  A:subj:car:mod:N

                                          car                   N:subj:fast:head:A

 

      Rule-1:

In the sentences (1) and (2), consider the relations between the words go-to-cinema and sit-at-table. The parser produces the relationship between a word and its adjunct modifier for word pairs go-to and sit-at where the prepositions “to” and “at” are the adjunct modifiers for the verbs “go” and “sit”; and a nominal complement of preposition relations for to-cinema and at-table word pairs where “cinema” and “table” are nominal complements for prepositions “to” and “at”. But it does not produce any relations for go-cinema and sit-table word pairs since there is no syntactic dependency in these pairs. However there is a semantic dependency in each of these pairs. So, as a rule, we introduce new relations for every nominal complement of prepositions, and the relation is different for each preposition such as prep-to, prep-at, prep-with, etc. In this case below tuples are produced:

go                      V:prep-to:cinema:modifier:N

sit                       V:prep-at:table:modifier:N

cinema               N:prep-to:go:head:V

table                  N:prep-at:sit:head:V

 

      Rule-2:

For the sentence (3), the parser produces a subject relation for gardener-cut pair where “gardener” is the subject and an object relation for cut-grass pair where “grass” is the object. There is no syntactic relation in the gardener-grass pair. For these kind of subject and object pairs, as a rule, we introduce the new “s-ob” relation resulting in tuples:

 

 gardener            N:s-ob:grass:modifier:N

 grass                 N:s-ob:gardener:head:N

 

Rule-3:

In the sentence (4), there exists a relative pronoun “who”, which is the syntactic subject of the verb “steal”. But, this relative pronoun does not give us any semantic information. So, as a rule, for the subordinate clauses constructed from relative pronouns (who, which, that), we assign the main clause’s subject to be the subject of the subordinate clause’s verb, like in the example pair thief-steal:

 

                                                steal                 V:subj:thief:modifier:N

                                                thief                 N:subj:steal:head:V

 

       Rule-4:

From the sentence (5), we get a subject relation for the pair car-fast where “car” is the subject. This is syntactically correct but it would be better to give a different name to the subject relation that is formed by the verb “be”. Therefore, a new relation is introduced with the name “subj-be” for such cases:

 

                                                fast                  A:subj-be:car:modifier:N

                                                car                   N:subj-be:fast:head:A

 

Rules that are explained above are applied to the output of the parser to obtain the local context tuples for the local context database. An entry in the local context database is composed of one local context and a set of 20 words which have top-20 likelihood ratios with that local context. In Figure-8, different database entries are presented for different local contexts.      

N:subj:employ:head:V

plant                     947.30

company              380.15

unit                       148.10

facility                   129.88

operation              126.82

factory                  125.74

division                 121.38

firm                         81.04

business                 73.01

mill                          70.60 

noun.person               51.79

venture                       41.58

industry                      41.37

kitchen                       40.80

noun.possession        27.45

enterprise                   26.42

manufacturer              26.20

he                               23.67

arrowhead                  19.91

digital                          19.65

 

 

 

 

 

 

 

 

 

Figure-8

For the local context “N:subj:employ:head:V”, the subjects of employ that have the highest 20 likelihood ratios are presented. So, “plant” is the most probable noun that is supposed to occur as the subject of the verb “employ” according to our local context database. The most probable verb to occur in a subject relationship with the noun “student” is “learn”. We can infer from the database that the thing that is most applied for is “license” and the noun “girl” is mostly modified by a quantifier.

There are some expressions as “noun.quantity” or “noun.person” in the local context database. These expressions are obtained by exposing the output of the parser to a named entity recognition process. “noun.person” includes all proper names recognized as a person, “noun.group” includes all proper names recognized as organization or corporation names, “noun.location” are the proper names used for addresses and geological forms, “noun.time” is the proper name for dates, “noun.possession” is the proper name regarding money and “noun.quantity” is used for numbers.

 

Likelihood Ratio

The likelihood ratio of a local context and a word is obtained by treating the word and the local context in a tuple as a bigram and applying the log-likelihood formula presented in Dunning 1993. The log-likelihood is formulated as follows:

 

where

  

also

 , , and .

 

3.5 Disambiguation Algorithm

After explaining the concept of local context and local context database, the steps of the algorithm is presented in this section in detail.

Step-1: Input text is parsed and local contexts of each word are extracted. Let LCw denote the set   of local contexts of word w. (There can be more than one local context for a word.)

      For each word ambiguous word w in the input text:

      Step-2: Search the local context database for every local context lc in LCw in order to find the          words that appeared in an identical local context as w. Let these words be denoted as     Selectorsw :

Selectorsw =

            where lc is a local context and C(lc) is a set of (word likelihood) pairs as shown in Figure-8.

           

            Step-3: Select a sense s of w that maximizes the similarity between w and Selectorsw.

 

                        For the similarity maximization process, suppose that w = W0 ,

                        selectors of W0 are  Selectorsw = {W1,W2,…..Wk}, and

                        the word Wi has ni senses which are denoted as {…..,}.

                       

Step-3.1: Construct a similarity matrix which is divided into (k+1)(k+1) blocks as shown in Figure-9. (One block for each word {W0 , W1 ,W2,…..Wk} .)

This matrix is used to keep the similarity values between every sense of each word. For example, suppose that the word W2 has 4 senses and the word W7 has 8 senses. Then the block B27 will be a 48 matrix consisting of the similarity values between every sense combination of W2 and W7.

The blocks on the diagonals will be all consisting of 0s. Similarity values lower than 0.2 are considered as noise and are ignored.

 

Text Box: otherwise

 

Text Box: if i ≠ j and sim(sil , sjm ) ≥ 0.2

       

                                               

where sim(sil , sjm ) is the similarity value found by the WordNet Similarity Package using Lin’s measure.

Step-3.2: Let A be the set of polysemous words in {W0 , W1 ,W2,…..Wk} :

A = { Wi | ni  > 1 , 0 < i ≤ k}

Step-3.3: Find a sense of words in A that gets the highest total support from other words. That is, find a sense which is most similar to every word. Call this sense :

 

    0<i≤k

where sil is one of the senses of word Wi ( l th sense of Wi  where l [1,ni ] ) that is in the set A  (Wi A) and support(sil,Wj) is defined as the support sil gets from Wj:

 

 

Step-3.4: The word is disambiguated and its sense is selected as . is removed from the set of ambiguous words set A.

 

 

Step-3.5: Since is disambiguated, the similarity values between other senses of and senses of other words are removed.

For all l, j, m, such that l [1,] and and and [1,nj]:

Step-3.6: If the disambiguated word is W0 , that is imax = 0 , stop. Otherwise repeat from Step3.3.

 

senses of W0

.   .   .  

W1 , …, Wk-1

.   .   .   .

senses of Wk

 .  .  . 

 

 

 

            0

 

 

 

 

         B0k

 

 

 

           B10

 

 

 

                  

 

 

 

          B1k

 

.

.

.

 

 

 

.

.

.

 

 

            Bk0

 

 

 

 

          0

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Figure-9: Similarity matrix

 

Step-4: The sense found for the word w in Step-3 is assigned to the word.

 

3.6 Evaluation and Results

In order to see how well our system performs, we used the SENSEVAL-2 (Palmer et al., 2001) and SENSEVAL-3 English all-words task data. Both of these data sets consist of three texts from the Penn Treebank corpus.

 We only worked with nouns. Because of the similarity measure, it is impossible to do disambiguation on adverbs and adjectives. Hypernym relationship is not defined for adjectives and adverbs in WordNet. For the verbs, since they have more senses in WordNet and the difference between the senses is very subtle, the results were not promising.

There are 1136 annotated nouns in SENSEVAL-2 and 951 annotated nouns in SENSEVAL-3. Table-1 shows precision and recall figures for the results that are obtained by our system.

 

Data

Precision

Recall

SENSEVAL-2

59.11 %

40.80 %

SENSEVAL-3

56.80 %

40.00 %

Table-4: Results of Senseval data

 

In order to make a comparison, the top-10 performing systems in SENSEVAL-2 English all words task are presented in Table-5. As seen, most of the systems use sense-tagged data for training. When the systems that have the same precision and recall values are observed, it is found that they make use of the first sense heuristic for the cases their systems are unable to produce an answer.

There is a variety of reasons for the recall of our system to be low. The main reason is that we have eliminated the cases for which the parser produced wrong part of speech tagging. In order to find tagging errors, parser’s tagging was compared with the POS tags provided with SENSEVAL data. Secondly, local contexts of the nouns may sometimes not exist in the local context database. Thirdly, during the similarity maximization algorithm, there may be some time that all of the support values become zero, so no sense could be selected and the algorithm stopped.

   System Name                          Precision                     Recall              Supervised

     100 SMUaw                             0.698                           0.698                 Yes

     100 CNTS-Antwerp                  0.645                           0.645                 Yes

     100 Sinequa-LIA – HMM          0.626                          0.626                 Yes

      98.908 UNED - AW-U2           0.583                           0.577                 No

       98.908UNED - AW-U             0.565                           0.559                 No

       95.552 UCLA - gchao2            0.485                           0.463                 Yes   

      95.552 UCLA - gchao3             0.482                           0.461                 Yes

     108.5 CL Research – DIMAP    0.423                           0.46                   No

    100 CL Research - DIMAP (R)  0.46                             0.46                   No

       89.729 UCLA – gchao             0.508                           0.456                 Yes

Table-5: Top-10 performing systems in SENSEVAL-2

 

In order to make a comparison with Lin’s measure, we have also used the adapted lesk measure (Banerjee and Pedersen, 2002). We chose the lesk measure since it is reported to perform best among the other measures in WordNet Similarity Package in (McCarthy et al., 2004; Patwardhan et al., 2003). Lesk measure works by finding overlaps in the glosses of the two synsets. The similarity value is the sum of the squares of the overlap lengths. A single word overlap results in a score of 1 whereas a two-consecutive-words overlap results in a score of 4.        

For comparing the similarity measures lin and lesk, we conducted an experiment on a subset of SENSEVAL-2 English all-words data consisting of only polysemous nouns and obtained the results presented on Table-6. We didn’t use this similarity measure in all our experiments since it is very inefficient in the sense that it requires a lot of computational time.

 

Measure

Precision

Lin

43.9 %

Lesk

51.6 %

 

Table-6: Comparison of measures

3.6.1 Weak Contexts of the Algorithm

Even our system obtained up-to-date results compared to unsupervised systems, there are some weak contexts. Firstly, in the disambiguation process, all the local contexts of an ambiguous word are treated equally. However, some local contexts provide a lot of information while some others provide less or don’t provide anything.  We eliminated some of these local contexts and introduced new ones but a statistical model can be implemented on the disambiguation algorithm.

Another weakness is the idiomatic usages of words. The intuition that similar words occur in identical contexts does not always hold. For example, consider the following sentence:

Money talks.

Suppose we want to disambiguate “money” which has 3 senses. The most frequent subjects of “talk” according to our local context database are, noun.person, people, peace, man, leader, executive, noun.group, president., etc. None of these words is similar to the “wealth” meaning of “money”.

 

 

Chapter 4

 

CONCLUSION

 

In the first part of this thesis an overview of WSD is presented regarding different approaches as well as different knowledge sources. Most of the work done in the past 25 years is tried to be summarized. In spite of all those work done on WSD, relative to other fields, little progress seems to have been made. WSD is still an open problem waiting to be solved. Starting from 1980’s with the availability of more resources and statistical methods like Naïve Bayes, there has been an improvement on the earlier results; however, it appears that the limit of these supervised methods has been reached currently. The problems of scarce sense-tagged data and data sparseness still remain to be a barrier in front of WSD.

Another problem of WSD is the difficulty of determining or even defining word senses. Lately, WordNet has become a de facto standard for lexical resources however it also have lots of limitations. Its sense distinctions are very fine-grained and there are gaps in the WordNet hierarchy regarding the lexical relations. Actually it is not a hierarchy because of these gaps.

Considering these, in the second part of the thesis, we presented an algorithm which did not need sense-tagged corpus. Instead of statistical methods, we tried to obtain information from syntactic local contexts and sense similarity. With the help of a principle-based broad coverage parser, we extracted the local contexts. By removing function words, we built semantic relations by bridging the words connected by function words. We computed the similarity values between the words that occurred in the same local contexts using a similarity metric proposed by Lin (1997). We used WordNet in order to evaluate our system however because of the fine-grained senses and some weaknesses in the algorithm, we were able to achieve a precision about 59%. Lin (1997) also presented a similar system and evaluated his system with relaxed correctness criterion instead of evaluating his system according to WordNet senses. He reported nearly 70% accuracy with the relaxed evaluation criteria which is successful for an unsupervised system.

Therefore, according to our observations, the future direction in WSD research should focus on completely unsupervised methods (without using any ontology such as WordNet). Going on the other way, trying the methods that have been examined in the history of WSD, would cause a cycle in the WSD research and improvement.  


 

[2] “fin” is introduced by the parser. It has no meaning. They are deleted from the database.


 

[1] This section is taken from Lin(1997)