2.1.2 Knowledge Based Approaches
All of the methods
described up to now require considerable amount of work to create a
classifier for each entry in the lexicon. Because of this reason, they were
able to report results on very few lexical items (in general between 4 to 12
items). With the availability of large-scale lexical resources, such as
dictionaries, thesauri and corpora, work on WSD focused on large-scale
disambiguation. In the following subsections WSD based on machine readable
dictionaries, thesauri and computational lexicons will be presented.
2.1.2.1 Machine Readable Dictionaries
Machine readable
dictionaries (MRD) provide a ready-made information source of word senses.
The first attempt to use MRD’s came from Lesk (1986)[38]. He starts from the
simple idea that a word’s dictionary definitions are likely to be good
indicators of the senses they define. By using Oxford Advanced Learner’s
Dictionary (OALD), he counts overlapping content words in the sense
definitions of the ambiguous word and in the definitions of context words
occurring nearby and selects the sense that achieves the maximum number of
overlaps. The accuracy of the method is reported to be 50-70% which is a
good result regarding the fact that a fine set of sense distinctions have
been used such as OALD.
Cowie et al. (1992)[39]
tried to improve Lesk’s approach by optimizing the overlap of all words in a
single sentence simultaneously. However, it was found computationally very
expensive. Therefore, Cowie et al. (1992)[39] used simulated annealing
(Sampson 1986)[40] for the first time in natural language processing. They
evaluated this approach using a corpus consisting of 50 sentences extracted
from Longman Dictionary of Contemporary English (LDOCE). 47% of the words
were reported to be correctly disambiguated to sense level and 72% to more
rough grained senses.
Stevenson and Wilks
(2001)[41] computed the overlap by normalizing the contribution of a word to
the overlap count. Pedersen and Banerjee (2002)[42] described a different
version of the Lesk’s algorithm by employing glosses contained in WordNet
(Fellbaum, 1998; Miller et al., 1990)[19;43], a lexical database for English
which will be described later in detail.
Because of the fact that
dictionaries are created for human use, not for computers, there are some
inconsistencies (Veronis and Ide 1991; Ide and Veronis 1993a,
1993b)[44;45;46]. Although they provide detailed information at the lexical
level, they lack pragmatic information used for sense determination. For
instance, the relation between ash and tobacco, cigarette
or tray is very indirect in a dictionary whereas the word ash
co-occurs very frequently with these words in a corpus (Ide and Veronis,
1998)[47].
2.1.2.2 Thesauri
Thesauri provide
information about relationships among words. Thesaurus based
disambiguation makes use of the semantic categorization provided by a
thesaurus or a dictionary with subject categories. The most frequently
used thesaurus in WSD is Roget’s International Thesaurus (Roget, 1946)
which was put into machine-tractable form in 1950’s. The basic inference
in thesaurus-based disambiguation is that semantic categories of the
words in a context determine the semantic category of that context as a
whole. And this category is then determines the correct senses that are
used.
Walker (1987:254) proposed an algorithm as follows: each word
is assigned to one or more subject categories in the thesaurus. If the
word is assigned to several subjects, then it is assumed that they
correspond to different senses of the word. Black (1988: 187) applied
this approach to five different words and achieved accuracies around
50%.
Yarowsky (1992) proposed
an approach for marking words with their categories from Roget’s
thesaurus. The important aspect of the approach was that he used a
context of 50 words either side so that 100 words were considered in the
training examples for each ambiguous word. This method was tested on 12
ambiguous words and reported to achieve 92% accuracy. Yarowsky notes
that this method is best for extracting topical information, most
successful for nouns.
Similar to machine
readable dictionaries, a thesaurus is a resource for humans, so there is
not enough information about word relations. Therefore, Roget’s or any
other thesauri have not been used so extensively.
2.1.2.3 Computational Lexicons
The usefulness of lexical
relations in linguistic, psycholinguistic and computational research has
led to a number of efforts to create large electronic databases of such
relations. Beginning from the mid-1980’s, construction of semantic
lexicons by hand has emerged. Some examples of these lexicons are
WordNet (Fellbaum 1998; Miller et al. 1990), CyC (Lenat and Guha 1990),
ACQUILEX (Briscoe 1991), and COMPLEX (Grishman, Macleod, and Meyers
1994).
Since WordNet is the most
popular lexicon among the above and since it is used in the work done in
this thesis both for sense evaluation and for similarity measure, this
section presents detailed information about it.
WordNet
WordNet is an online
lexical reference system which was developed at
Princeton University under the direction of
Professor George A. Miller. It combines many features used for WSD
in one system. It includes definitions of word senses as in a
dictionary; it defines “synsets” of synonymous words representing a
single lexical concept like a thesaurus; and it includes word-to-word
relations.
WordNet consists of three
databases: noun database, verb database and one database for adjectives
and adverbs. Each database consists of lexical entries corresponding to
unique orthographic forms. Each form is associated with a set of senses.
|
Overview of noun car
The noun car has 5 senses (first 3 from tagged texts)
1. (598) car, auto, automobile, machine, motorcar -- (4-wheeled
motor vehicle; usually propelled
by an internal combustion
engine; "he needs a car to get to work")
2. (24) car, railcar, railway car, railroad car -- (a wheeled
vehicle adapted to the rails of railroad;
"three cars had
jumped the rails")
3. (1) cable car, car -- (a conveyance for passengers or freight
on a cable railway; "they took a
cable car to
the top of the mountain")
4. car, gondola -- (car suspended from an airship and carrying
personnel and cargo and power
plant)
5. car, elevator car -- (where passengers ride up and down; "the
car was on the top floor") |
Figure-1:
WordNet 2.0 entry for the noun “car”
WordNet’s sense entries
consist of a set of synonyms, a dictionary-style definition, or gloss, and
some example uses as shown in Figure-1. The numbers in front of some senses
are the frequency values obtained from SemCor corpus. This simple listing of
lexical entries looks like an ordinary dictionary. However, WordNet contains
a set of domain-independent lexical relations which hold among WordNet
entries, senses or sets of synonyms. Lexical relations are only available
within the same database. Figure-2, 3 and 4 show a subset of the relations
associated with each of the three databases.
|
Relation |
Definition |
Example |
|
Hypernym
Hyponym
Has-member
Member-of
Has-part
Part-of
Antonym |
From concepts to superordinates
From concepts to subtypes
From groups to their members
From members to their groups
From wholes to parts
From parts to wholes
Opposites |
breakfast → meal
meal → lunch
faculty → professor
pilot → crew
table → leg
course → meal
leader → follower |
Figure-2:
Noun relations in WordNet
|
Relation |
Definition |
Example |
|
Hypernym
Troponym
Entails
Antonym |
From events to superordinate events
From events to their subtypes
From events to the events they entail
Opposites |
fly → travel
walk → stroll
snore → sleep
increase → decrease |
Figure-3:
Verb relations in WordNet
|
Relation |
Definition |
Example |
|
Antonym (adjective)
Antonym (adverb) |
Opposite
Opposite |
heavy → light
quickly → slowly |
Figure-4:
Adjective and adverb
relations in WordNet
Synonym and hyponym
relations are the most useful relations presented in WordNet. Synonym means
different lexemes with the same meaning. Two WordNet entries are considered
to be synonyms if they can be successfully substituted in some context. A
synset is a set of synonyms. Consider the following example of a synset:
{ car, auto, automobile, machine, motorcar }
The gloss of this synset
describes it as “4-wheeled motor vehicle; usually propelled by an internal
combustion engine; "he needs a car to get to work" as shown in Figure-1 for
the first sense of noun car. Instead of representing concepts using
logical terms, WordNet represents them as synsets.
Each synset is related to
its immediately more general and more specific synsets via direct hypernym
and hyponym relations. To find chains of more general or more specific
synsets, one can follow a transitive chain of hypernym and hyponym
relations. As an example, the hypernym chain for the noun valley is
shown in Figure-5. In this figure, successively more general synsets are
shown on successive indented lines. The chain starts from valley and
ends in entity.
|
valley, vale
=> natural depression, depression
=> geological formation, formation
=> natural object
=> object, physical object
=>entity |
Figure-5:
Hyponymy chain for the noun
“valley”
Since WordNet provides the
broadest set of lexical information in one resource, it is used
widespreadly. Another reason for its wide-spread use is, it is the first
broad-coverage lexical resource that is freely available.
The earliest attempts to use
WordNet in WSD were in information retrieval field. Voorhees (1993) and
Richardson and Smeaton (1994) created knowledge bases using WordNet’s
hierarchy. Li et al. (1995) proposed a WordNet-based algorithm for WSD.
Disambiguation was done by semantic similarity between words and heuristic
rules. Heuristic rules were based on the semantic similarity and the WordNet
hierarchy. Leacock et al. (1998) used WordNet to counter data sparseness
problem. Hawkins (1999) built up a WSD system that works with frequency and
contextual information that is based on WordNet. Fellbaum et al. (2001)
proposed a system that made use of syntactic clustering and semantic
distinctions extracted from WordNet.
As seen, there has been
various work regarding WordNet. But WordNet is mostly used to determine
semantic similarity between senses. Resnik (1995a) computed information
content of words which is a measure of the specificity of the concept that
subsumes the words in the WordNet hypernym hierarchy. Agirre and Rigau
(1996) employ WordNet to determine the conceptual distance among concepts
whereas Mihalcea and Moldovan (1999) exploit semantic density and WordNet
glosses in an all words design. Lin (1997; 1998; 2000) described a semantic
similarity measure where similarity between two objects is defined to be the
amount of information contained in the commonality between the objects
divided by the amount of information in the description of the objects. We
used the same measure for the work conducted in this thesis. Detailed
information is given in Section 3.2.3.
Other approaches using
WordNet are Jiang (1997), Agirre and Martinez (2000), Agirre et al. (2001),
Hayres (2001) and Banerjee et al. (2003). A combination of MRDs and WordNet
has also been tried with success (Litkowski, 2000, 2001; Mihalcea and
Moldovan, 1998).
2.2 Information Sources for WSD
There are various
information sources or feature types used in WSD regardless of the type of
the approach. To disambiguate a word, a diversity of information, including
syntactic tags, word frequencies, collocations, semantic context,
role-related expectations, and syntactic restrictions can be considered.
In Agirre and Martinez
(2001), a comparison of WSD systems has been made based on the information
source they used. Some of these sources are as follows:
Frequency of senses:
Frequency
information is used to measure the likelihood of each possible sense
appearing in the text. Therefore this information is generally used in
statistical approaches and it is generally learned from hand-tagged data
such as SemCor corpus. WordNet senses are ordered according to the
frequencies of the senses in the SemCor corpus.
Part of speech (POS):
Part of speech tagging is regarded as the first step of the disambiguation
process if the tokens have the same orthographic forms but different
syntactic classes. It is necessary since it reduces the number of possible
senses a word can belong to. An orthographic form may even be unambiguous in
one syntactic class whereas it has more than one sense in another. For
instance, in WordNet 2.0 handle has 5 senses as a verb, but only one
sense as a noun. The impact of knowledge resources on WSD in examined in
Gaustad (2004). The results show that accurate POS information is beneficial
for WSD and that including the POS of the ambiguous word itself as well as
POS of the context increases the disambiguation accuracy.
Morphology:
It is defined specially as the relation between derived words and their
roots. For instance, the noun agreement has 6 senses, its verbal root
agree 7. A stemmer tries to reduce various forms of a word to a
single stem. Since English is a language with little inflectional
morphology, it is not certain that using morphology will lead to significant
improvements in WSD. With other languages, such as German or Italian,
morphology is of greater influence. Derivational morphology would be useful
only for disambiguating roots.
Collocations:
Collocation is the relationship among any group of words that tend to
co-occur in a predictable configuration. Disambiguation relies heavily on
collocational information. For example, the noun match has 9 senses.
However, it has only one possible sense in “football match”. It is
observed that collocations are strong indicators if they are learned from
hand-tagged corpora. Although they are strong, they should be used with
other sources. They should not be treated as rules for sense-filtering alone
(as in Hirst 1987 or Dahlgren, McDowell, and Stabler 1989).
Semantic word
associations:
These can be classified as follows:
i.
Taxonomial
organization:
This refers to the classification of words in a hierarchy and the
lexical-semantic relationships holding between words such as a dog is
a kind of animal. This kind of information can be extracted from
ontologies like WordNet.
ii.
Situation
and Topic:
Information about the situation or topic enables a WSD system to see the
ambiguous word in a broader context. For example, if the word mouse
is used in an office situation and the topic is computer use, the most
probable sense of the word mouse will be “computer tool”, not
“animal”. Semantic word associations around topic and situation are powerful
when learned from hand-tagged corpora. Associations learned from MRDs can
also be useful.
iii.
Argument-head relations:
These relations provide important clues for disambiguation such as the
relationship between dog and bite in the sentence “the dog
bite the man.” Yarowsky (1993) showed that collocation information can be
captured using argument-head relations. Argument-head relations work as a
strong indicator if they are given as a sense-to-word relation.
Syntactic structure
(subcategorization information):
Subcategorization and dependency relations refer to certain
kinds of relations between words and phrases. For example the verb want
can be followed by an infinitive, as in “I want to fly to Istanbul”,
or a noun phrase, as in “I want a flight to Istanbul”. But the verb
find cannot be followed by an infinitive. Verbs have several possible
patterns of arguments. A particular set of arguments that a verb can appear
with is referred to a subcategorization frame. Subcategorization frames
capture syntactic regularities about complements.
Gaustad (2004) used
dependency relations in his WSD system for Dutch and showed that syntactic
information provides very useful disambiguation clues and leads to a
significant increase in performance. For the work presented in this thesis,
we also used syntactic information and dependency relations, so detailed
information about this kind of knowledge source will be given in the
following sections.
Agirre and Martinez (2001)
made a comparison between the contributions of the above resources to WSD.
According to their observations, if learned from hand-tagged corpora,
collocations and semantic word associations are the most important knowledge
types for WSD, but they also mentioned that syntactic cues are equally
reliable. On the other hand, taxonomical information was found to be very
weak.
2.3 General Approach and Related Work
As can be seen from the
previous sections, most of the study in WSD is concentrated so far on
supervised learning since these algorithms usually achieve the best
performance. This fact is shown by the results of SENSEVAL-2 (Cotton et al.,
1998). However these systems have the drawback that they are only applicable
to the words for which sense tagged data is available, and they are strongly
dependent to the labeled training data available. In addition, almost all of
these systems make use of the picking the first sense heuristic.
Supervised systems and first
sense heuristic has come to a natural bound of approximately 70% precision
(Mihalcea, 2002). In order to improve this value, benefits of
knowledge-based or unsupervised alternatives should be examined.
In this thesis, we present a
disambiguation algorithm that relies on the information gained from
syntactic context and sense similarity. The WSD approach used in the system
cannot be specified as unsupervised since it uses a lexicon, namely WordNet,
for sense disambiguation and for semantic similarity metric; however, it may
also be regarded as unsupervised in the sense that no sense-tagged data is
needed for training. All details about the system will be given in
Chapter-3. In this section, a number of WSD systems that have employed
syntactic context in the past are introduced.
The algorithm used in this
thesis is very similar to the algorithm presented in Lin (1997). Lin
mentions the same problems that supervised algorithms face with: need for a
large sense-tagged training set and data sparseness. In order not to deal
with these problems, Lin proposes an algorithm adopting the intuition that
“two different words are likely to have similar meanings if they occur in
identical local contexts”. This means that the same knowledge sources are
used for all words and that instead of building separate classifiers for
each word, past usages of other words are used to disambiguate the current
word.
Using a broad-coverage
parser, a 25-million-word Wall Street Journal corpus was parsed and
dependency relationships were extracted. Dependency relationships were
stored as dependency triplets which were composed of the type of the
dependency relation, the word related to the ambiguous word via the
dependency relation and information on whether the word is the head or the
modifier in the relationship. A local context database was constructed for
each of these triplets (local contexts). An entry in the database consisted
of twenty words found from the training corpus that were most likely to
occur in that particular local context.
In the disambiguation
process, input is parsed and local contexts are extracted. Local context
database is searched to find the words that occurred in the same local
context as the ambiguous word. Based on a similarity maximization algorithm
between the words, the polysemous word is disambiguated. The algorithm used
in the similarity maximization part of both Lin’s work and our work is very
similar to the algorithm presented in (Resnik, 1995a).
Since Wall Street Journal is
mostly business news, Lin only used the “press reportage” of SemCor corpus
for testing. An accuracy of 56.1% is reported for nouns.
Li, Szpakowicz, and Matwin
(1995), presents an algorithm for automatic word sense disambiguation, based
on lexical knowledge contained in WordNet and on the results of
surface-syntactic analysis. The algorithm was designed to support text
analysis with minimal precoded knowledge. They proposed a set of heuristic
rules that are based on the idea that objects of the same or similar verbs
are similar. Disambiguation is done by computing the semantic similarity
between words and applying the heuristic rules based on this similarity.
Stetina et al. (1998)
achieve good results with syntactic relations as features compared to the
frequency baseline. They use a measure of semantic distance based on WordNet
to find similar features. The features are extracted using a statistical
parser (Collins, 1996), and consist of the head and modifiers of each
phrase.
Martinez et al. (2002)
discuss the contribution of various syntactic features to WSD: grammatical
relations coded as the presence of adjuncts/arguments in isolation or as
subcategorization frames, and instantiated grammatical relations between
words.
The set of syntactic
features was extracted using Minipar (Lin, 1993). Two types of syntactic
relations were taken into account: instantiated grammatical relations (IGR)
and grammatical relations (GR). IGR are triples of [wordsense relation
value] where the value is either the word form or the lemma. For GR, bigrams
[wordsense relation] and n-grams [wordsense relation1 relation2...] are
stored. Three types of n-grams are defined: all subcategorization
information including part of speech given by Minipar, subcategorization
information occurring in the sentence, and all dependencies in the parse
tree. Their results on the Senseval-2 lexical sample data for English
indicate that both types of syntactic relations, IGR and GR, together
provide the best F-score (IGR getting better precision than coverage and GR
lower precision, but higher coverage). They also came to the conclusion that
syntactic features effectively improve WSD precision.
Gaustad (2004) also
investigated whether structural syntactic information is helpful for lexical
semantic disambiguation or not. He used the Alpino dependency parser for
Dutch in order to extract the dependency relations. He reported that using
dependency relations in conjunction with the lemma of the ambiguous word and
its part of speech as features performs better than the model including only
immediately left and right context.
Chapter 3
Word Sense
Disambiguation Based on Sense
Similarity
and Syntactic Context
3.1 Introduction
Most corpus-based WSD
algorithms make use of the previous usages of a polysemous word in order to
disambiguate that word. These algorithms basically rely on the intuition
that “two occurrences of the same word have identical meanings
if they have similar local contexts”. However, this intuition can be
misleading. In order for this method to have the desired consequences; a
polysemous word must be observed several times before making a good
classifier. Knowing that there are nearly 15000 polysemous nouns in WordNet
2.0, the corpus must contain millions of words. If the corpus fails to
contain a polysemous word or all the senses of a word, the algorithm can
lead to false results.
In this thesis, we make
use of a WSD algorithm that relies on the intuition that “two different
words are likely to have similar meanings if they occur in
identical local contexts”. By this approach, a generalization can be
made since it allows using the knowledge learned for one word on other
words; instead of using a separate classifier for each polysemous word.
Hence, the size of the training corpus can be much smaller. Moreover, this
method is able to disambiguate infrequent or unseen words in the training
corpus.
The basic steps of the
algorithm are as follows:
i.
Parse the
training corpus and build up a local context database by extracting local
contexts according to some rules.
ii.
Similarly,
parse the input text and extract local contexts of each ambiguous word.
iii.
For each
ambiguous word w, search the local context database and find words
that are in an identical local context as w. These words are the
similar words set S of w which will be used in sense
selection.
iv.
Select a sense
of w that maximizes the similarity between w and its similar
words set S that was found in the previous step.
Consider the sentence
below:
Jane applied for a job.
The noun “job” has
15 possible meanings in WordNet 2.0. To disambiguate the noun “job”,
we consider the other words that appeared in an identical context as
“job”. According to our system, the local context of “job” is “N:prep-for:apply:head:V”,
that is there is a relation between the verb “apply” and the noun “job”
via the preposition “for”. In Table-1, the words that occurred in the
local context “N:prep-for:apply:head:V” are presented. The numbers in
table indicate the likelihood ratios of the words in that local context. The
meaning of “job” in the example can be determined by choosing one of its 15
senses that is most similar to the meanings of the words in Table-1.
|
N:prep-for:apply:head:V |
|
license 419.46
membership 324.86
job 299.67
approval 241.81
patent 233.65
permission 207.59
amnesty 198.54
permit 166.21
listing 141.75
loan 141.00 |
citizenship 100.85
asylum 92.37
visa 84.21
status 72.32
grant 69.68
exemption 59.04
legalization 58.55
noun.person 54.34
refund 52.63
residence 50.20 |
Table-1:
Words that have a relation with “apply” via “for”
This chapter is structured
as follows: Firstly we present the resources needed for the algorithm. In
section 3.3 and section 3.4, the concepts of local context and local context
database are explained respectively. Later, the disambiguation algorithm is
presented in section 3.5. Finally, we conclude this chapter with the results
obtained from the experiments.
3.2 Resources of the
Algorithm
In order to implement
this algorithm the following resources are needed:
i.
an untagged training corpus,
ii.
a concept hierarchy,
iii.
a similarity measure,
iv. a
broad-coverage parser.
In this section, all of
the above resources are presented one by one.
3.2.1 Training Corpus
For the training corpus,
we used a 100-million-word Wall Street Journal corpus of the periods
presented below:
Source Period
Original Publication(s)
---------------------------------------------------------------
WSJ 1987-9
ACL/DCI and CSR WSJ0
WSJ 1990-2
TIPSTER Vol. 2, version 2
WSJ 1992-4
CSR LM-1 corpus
The WSJ 1992 collection on
TIPSTER covers only the first three months of the year. Data from the
remaining months of 1992 were taken from the LDC, along with all of 1993 and
the first three months of 1994.
3.2.2 Concept Hierarchy (WORDNET 2.0)
As the concept
hierarchy WordNet 2.0 is used. Information about the general WordNet
structure is given in section 2.1.2.3.
WordNet 2.0 is available for Unix and Windows platforms. WordNet 2.0
includes more than 42,000 new links between nouns and verbs that are
morphologically related, a topical organization for many areas that
classifies synsets by category, region, or usage, gloss and synset fixes,
and new terminology.
3.2.2 Concept Hierarchy (WORDNET 2.0)
As the concept
hierarchy WordNet 2.0 is used. Information about the general WordNet
structure is given in section 2.1.2.3.
WordNet 2.0 is available for Unix and Windows platforms. WordNet 2.0
includes more than 42,000 new links between nouns and verbs that are
morphologically related, a topical organization for many areas that
classifies synsets by category, region, or usage, gloss and synset fixes,
and new terminology.
3.2.3 Similarity Measure
In order to find the
similarity between the senses of a polysemous word and another word we need
a similarity measure. There is a strong relationship between a word’s
semantics (or meaning) and its syntactic behavior. As discussed in Weeds
(2002), if two words mean similar things then they will occur in documents
in similar contexts. Accordingly, we can attempt to calculate the similarity
between all senses of two words by finding all of the contexts that each
appears in. If a description of a word is made up of the contexts it appears
in and their associated frequencies then we can mathematically measure the
similarity between these descriptions. For example, the word dog may
occur as the subject of the word eat x number of times - this will
increase the similarity between dog and other words which also occur
as the subject of the word eat e.g. cat. Many measures have
been proposed to calculate the similarity between the descriptions of words;
we used WordNet Similarity Package v.0.15 which implements the semantic
relatedness measures described by Leacock & Chodorow (1998), Jiang & Conrath
(1997), Resnik (1995), Lin (1998), Hirst & St-Onge (1998), Wu & Palmer
(1994), the adapted gloss overlap measure by Banerjee and Pedersen (2002),
and two measures based on context vectors by Patwardhan (2003). This package
is able to calculate the similarity between either two words or two senses
(or synsets). We use it to find the similarity between two senses.
We used Lin’s measure
(1998) in the experiments conducted for this thesis.
Lin’s Measure
This measure is derived
from the following assumptions (Lin, 1998):
i.
The commonality between the words A and B is measured
by I(common(A,B)) where common(A,B) is a proposition that
states the commonalities between A and B and I(s) is
the amount of information contained in the proposition s.
ii.
The differences between the words A and B is measured
by
I(describe(A,B)) – I(common(A,B))
where
describe(A,B) is a proposition that describes what A and B
are.
iii.
The similarity between A and B, sim(A,B), is a
function of their commonality and differences. That is,
sim(A,B)
= f ( I(common(A,B)) , I(describe(A,B)))
where
the domain of function f(x,y) is { (x,y) | x ≥ 0 , y > 0 , y ≥ x }
.
iv.
Similarity is independent of the unit used in the information
measure.
According to the Information Theory (Cover and Thomas, 1991), I(s) = -logbP(s),
where P(s) is the probability of s, and b is
the unit. When b=2, I(s) is the number of bits
needed to encode s. Since logbx =
,
assumption (iv) means that the function f must
satisfy the following condition:

v.
Similarity is additive with respect to commonality.
If
common(A,B) consists of two independent parts, then the sim(A,B)
is the sum of the similarities computed when each part of the
commonality is considered. In other words: f(x1 +
x2, y) = f(x1,y) + f(x2,y). Therefore,
,
which means that when there is no commonality
between A and B, their similarity is 0, no matter
how different they are.
vi.
The similarity between a pair of identical objects is 1.
When A and B
are identical, knowing their commonalities means knowing what they are,
i.e., I(common(A,B)) = I(describe(A,B)). Therefore, the function f
must have the following property:

vii.
The function f(x,y) is continuous.
Similarity Theorem:
The similarity between A and B is measured by the ratio
between the amount of information needed to state the commonality of A
and B and the information needed to fully describe what A and
B are.
sim(A,B)

Proof:
To prove the theorem, we need to show f(x,y) =
.
Since f(x,y) = f(
,y)
(due to assumption (iv)), we only need to show that when
is
a rational number, f(x,y) =
.The
result can be generalized to all real numbers because f is continuous
and for any real number, there are rational numbers that are infinitely
close to it.
Suppose m and
n are positive integers.
f(nx,y) = f((n-1)x,y) + f(x,y) = n
f(x,y) (due to assumption (v))
Thus,

Substituting
for
x in this equation:

Since