June 21, 2006

Clustering Word Pairs to Answer Analogy Questions

Ergun Biçici and Deniz Yuret. In Proceedings of the Fifteenth Turkish Symposium on Artificial Intelligence and Neural Networks (TAINN 2006)

Abstract: We focus on answering word analogy questions by using clustering techniques. The increased performance in answering word similarity questions can have many possible applications, including question answering and information retrieval. We present an analysis of clustering algorithms' performance on answering word similarity questions. This paper's contributions can be summarized as: (i) casting the problem of solving word analogy questions as an instance of learning clusterings of data and measuring the effectiveness of prominent clustering techniques in learning semantic relations; (ii) devising a heuristic approach to combine the results of different clusterings for the purpose of distinctly separating word pair semantics; (iii) answering SAT-type word similarity questions using our technique.
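
A degenerate nearest-neighbour sketch of the general idea (not the paper's actual algorithm): represent each word pair by a vector describing its relation, and answer a question by picking the choice pair whose vector is closest to the stem pair's. The 2-d "relation vectors" below are invented for illustration; in practice they would come from corpus statistics, and clustering would replace the single nearest-neighbour lookup.

```python
# Toy illustration: answer an SAT-style analogy question by comparing
# word-pair relation vectors. The vectors below are made up.

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def answer(stem_vec, choice_vecs):
    """Pick the index of the choice pair closest to the stem pair."""
    return min(range(len(choice_vecs)), key=lambda i: dist(stem_vec, choice_vecs[i]))

# stem: carpenter:wood -- choices: (a) mason:stone (b) teacher:chalk (c) bird:sky
stem = (0.9, 0.1)                          # "works-with-material" direction
choices = [(0.8, 0.2), (0.1, 0.9), (0.5, 0.5)]
print(answer(stem, choices))
```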

Full post... Related link

June 11, 2006

HLT/NAACL/CONLL '06 Braindump


Joakim Nivre
Sabine Buchholz
Gülşen Eryiğit
Giuseppe Attardi

Ablation study
Active learning
Bayes point machines
Bayesian = no overfitting?
Bayesian discriminative
Bayesian hypothesis testing
Bayesian nonparametric
Bayesian unsupervised
Belief propagation
Boosting - vs. SVM?
Bootstrapping - co-training, active learning
CKY packed chart
CKY parsing algorithm
CRF conditional random fields vs EM vs Grad.Descent
Collins algorithm for large margin learning
Construction Grammar
Convex duality
DIRT, phrasal similarity etc. (RTE-2?)
DOP math
Derivational entropy
Discriminative log linear models
EM vs ML vs just sampling
Expectation propagation - q||p focus on mode, p||q focus on volume
Exponential updates
FTA finite tree automata
Fisher kernel
G^2 statistic: unseen events - sampling or dependency
Gamma distribution - Gaussian conjugate
Generative vs discriminative
Global linear model
Grammar consistency
ILP inductive logic programming - (can't do pos tagging?) (vs. GPA?)
Inside outside probabilities
Jensen inequality, EM math
Kernel over sent. tree pairs? data dependent kernels?
Laplace approx, Hessian vs. variational vs. bayes integration
Log linear parsing model
Margins vs boosting
Markov random fields
MCMC - why is q in the acceptance probability?
MCMC - do we need it just for EM hidden vars? when two hidden next to each other in bayesian network
P-test problems
ROC curve
Regularization vs Bayesian priors
SVM math
SVM vs Winnow vs Perceptron
Sampling paradigm vs diagnostic paradigm
Self-training vs EM?
Self-training, co-training
Shift-reduce for dependency
Stochastic gradient descent
Tree kernels, molt kernel
Variational approximation
Variational EM
X^2 statistic, anova: dependency?
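
One of the MCMC questions above (why is q in the acceptance probability?) has a compact answer: the proposal density q appears in the Metropolis-Hastings ratio to correct for asymmetric proposals, so that detailed balance holds and the target p is the stationary distribution. A minimal sketch with a symmetric Gaussian proposal, in which case the q terms cancel (plain Metropolis):

```python
import math
import random

# Metropolis-Hastings acceptance probability:
#   min(1, p(x') q(x|x') / (p(x) q(x'|x)))
# The q terms correct for asymmetric proposals; with the symmetric
# Gaussian proposal below they cancel.

def mh(log_p, x0, steps, scale=1.0, seed=0):
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(steps):
        xp = x + rng.gauss(0, scale)       # symmetric proposal
        if rng.random() < math.exp(min(0.0, log_p(xp) - log_p(x))):
            x = xp                         # accept; otherwise keep x
        samples.append(x)
    return samples

# sample from a standard normal; the sample mean should be near 0
samples = mh(lambda x: -0.5 * x * x, 0.0, 20000)
print(sum(samples) / len(samples))
```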

ACE data
CCGbank - penn treebank in categorial? grammar with dependencies
Charniak parser
Chinese treebank
Conll (04,05), Senseval-3 SRL data
Conll-x original treebanks
Mallet maxent
Manning parser
McDonald parser
NER system from BBN (Boris)
Nivre Maltparser
NLPWin parser - Heidorn 2000
Ontonotes - palmer
Opennlp maxent
RTE data
Wordnet evoke relation
Yamada, Matsumoto head rules

Cited Papers
Hearst92 (auto-wn)
Yamada and Matsumoto (vs. Nivre?)
Eisner05 (from dreyer)
Eisner bilexical chapter
CLE - chu, liu, edmonds? mst parsing.
Smith, Eisner 05: log-linear models on unlabeled data
Roth and Yih ICML '05
Geffet and Dagan
van Zaanen, unsupervised, 00, 01, 39%
Alexander Clark, unsupervised, 00, 01, 42%
Klein and Manning, unsupervised, 02, 04, 78% on 10 word sent.
kazama, torisawa '05
roth 2000 - linear xform not sufficient for nlp
JMLR03 - LDA paper
Berger, statistical decision theory and bayesian analysis
Mackay, information theory, inference and learning algorithms
Wasserman, all of statistics
Bishop, pattern recognition and machine learning
Andrieu et al., Machine Learning 2003, introduction to MCMC
Wainwright, Jordan, graphical models etc. UCB stat tr649, 2003
Murphy, A brief intro to graphical models (url)
Minka, using lower bounds to approx integrals
Minka, bayesian conditional random fields
Barnard, matching words and pictures
Neal and Minka have a bunch of stuff on Bayesian methods

Papers in HLT

*** Data sparseness:
Wang06, svm with similarity
Haghighi06, Klein, prototypes best paper
Ando06, ASO - multi-task learning
Semi-supervised tutorial

1. semi supervised learning, prototypes (klein)
2. word similarity, ASO, ECOC - share information
3. context similarity: homology idea
4. LSA, ecoc, clustering
5. what kind of bayes prior is this?

dim reduction, LSA, grouping similar cases, clustering: if
predicting y|x, we are grouping x's that are similar in y space
(i.e. words that are similar in their distributions, then predicting
the distribution of a rarely seen word). Idea behind semi-supervised,
or homologues: y|x,c (c=context, x=word, y=sense): group things
similar in x,c space.
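
A minimal sketch of the grouping idea: back off a rarely seen word's next-word distribution to that of its distributionally most similar frequent word. The toy corpus and word choices are invented for illustration.

```python
from collections import Counter
from math import sqrt

# Group words by the similarity of their context distributions, then
# use a frequent word's distribution in place of a rare word's.

corpus = ("the cat sat on the mat . the dog sat on the rug . "
          "a cat ran . a dog ran . the wombat sat").split()

ctx = {}  # word -> counts of following words
for w, nxt in zip(corpus, corpus[1:]):
    ctx.setdefault(w, Counter())[nxt] += 1

def cos(a, b):
    num = sum(a[k] * b[k] for k in a)
    return num / (sqrt(sum(v * v for v in a.values())) *
                  sqrt(sum(v * v for v in b.values())))

def backoff(rare, frequent):
    """Return the frequent word whose context distribution is closest."""
    return max(frequent, key=lambda w: cos(ctx[rare], ctx[w]))

print(backoff("wombat", ["cat", "dog", "mat"]))
```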

External vs internal similarity:
people use externally defined similarity (e.g. the Lin thesaurus) to
unsparse the data. However, similarity should be part of the
learning process, i.e. "similar" should mean occurring in similar
structures when you are learning structures.
re: wang paper

Similarity as soft parts of speech:
use feature bags instead of continuous space
feature bags = mixture of distributions?

Log-linear models on unlabeled data smith, eisner '05:
apply to wsd, ner

semi-supervised: abney04 analyzes yarowsky95 bootstrap
blum and mitchell 98 - co-training
nigam and ghani 2000 - co-em
abney 2002,
collins and singer 1999 (NER)
goldman and zhou 2000 (active learning)
ando and zhang 2005 - classifier struct vs input space
data manifold: the lower-dimensional surface to which the data is restricted; may change the distance metric

*** Dependency parsing:
Corston-Oliver06, multilingual dep parsing bayes pt mach
Nivre06, maltparser
Grammar induction section
missed: Atterer, Brooks, Clarke on unsup parsing

Shift-reduce parsers:
how do they deal with a->b->c vs. a->b, a->c

Iterative dependency parsing:
Do you order by distance or label type?

Add GPA to Nivre parser:
Attardi - maxent does worse than SVM
maxent vs naive bayes vs SVM question

Handling non-projective trees:
1. nivre does pre and post processing
2. attardi adds moves to switch during parsing
3. mcdonald uses mst algorithm to parse non-projective
Example: I saw a man yesterday who was suspicious looking.
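
Non-projectivity can be checked by testing for crossing arcs: a tree is projective iff no two dependency arcs cross when drawn above the sentence. The toy head assignments below follow the example sentence; the arc from "man" to "was" crosses the arc from "saw" to "yesterday".

```python
# A tree is projective iff no two arcs (head, dependent) cross.

def crossing(a, b):
    (i, j), (k, l) = sorted(a), sorted(b)
    return i < k < j < l or k < i < l < j

def projective(arcs):
    return not any(crossing(a, b) for a in arcs for b in arcs)

# 0:I 1:saw 2:a 3:man 4:yesterday 5:who 6:was 7:suspicious 8:looking
arcs = [(1, 0), (1, 3), (3, 2), (1, 4), (3, 6), (6, 5), (6, 8), (8, 7)]
print(projective(arcs))  # (3,6) crosses (1,4): non-projective
```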

Parse order:
Can process left to right or right to left with
deterministic dependency parsers

Fixing errors in det. parsing
can backtrack or can perform repairs
how does my phd algorithm relate?

Four dep parsing approaches: (long talks at conll-x)
1. learn parse moves
1.1 non-projectives by adding moves
1.2 non-projectives by pre-post processing
2. learn tree structures
2.1 eisner n^3 algorithm
2.2 mcdonald mst algorithm
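
Approach 1 ("learn parse moves") can be sketched with an arc-standard oracle that turns a gold dependency tree into the shift/left/right move sequence a classifier would be trained to reproduce. The move conventions below are one common variant, not any particular paper's, and the sketch assumes a projective gold tree.

```python
# Arc-standard oracle: derive the move sequence for a projective gold tree.
# LEFT attaches stack[-2] as dependent of stack[-1]; RIGHT attaches
# stack[-1] as dependent of stack[-2], once all its dependents are done.

def oracle_moves(heads):               # heads[i] = head of word i (-1 = root)
    n = len(heads)
    attached = [0] * n                 # dependents attached so far
    ndeps = [sum(1 for h in heads if h == i) for i in range(n)]
    stack, buf, moves = [], list(range(n)), []
    while buf or len(stack) > 1:
        if len(stack) >= 2 and heads[stack[-2]] == stack[-1]:
            attached[stack[-1]] += 1; stack.pop(-2); moves.append("LEFT")
        elif (len(stack) >= 2 and heads[stack[-1]] == stack[-2]
              and attached[stack[-1]] == ndeps[stack[-1]]):
            attached[stack[-2]] += 1; stack.pop(); moves.append("RIGHT")
        else:
            stack.append(buf.pop(0)); moves.append("SHIFT")
    return moves

# "the cat sat": the<-cat, cat<-sat, sat=root
print(oracle_moves([1, 2, -1]))
```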

Parsing algorithms:
n^2 algorithm for non-projective
n^3 is the best for projective ???

Lazy application of constraints usually wins
- do not always have to check them all

ando - ASO: Rich Caruana - backprop in multi-task learning
bach and jordan - predictive low rank decomposition
vs. semi supervised?

*** DOP parsing:
Zuidema06, what are the productive ... DOP
Bod06, unsupervised dop model

Goodman03, PCFG reduction
Johnson02, estimator inconsistent
Bod: unsuper better than super on <= 40 words!
Previous results from bod (u-dop): van Zaanen, Klein, Clark

*** Anaphora:
Garera06 (yarowsky), resolving anaphora
WSD ecoc idea:
feature vectors for senses - could include common anaphora

Natural levels:
natural levels in wn like animal can be picked by anaphora?
same word, many anaphora - depends on what you want to emphasize
more support on the feature bag representation
garera-yarowsky paper.

*** Bayesian:
Beyond EM tutorial

*** Semantic role labeling:
SRL tutorial
MacCartney06, RTE

*** WSD:
levin06 - evaluation of utility ... WSD
Clustering vs supervised:
score classes using clustering metrics
re: levin06 - evaluation of utility

*** Misc.
Mohri06 best paper: PCFG - zeroes because of sampling or constraints?
Whittaker06 QA system ask-ed
NER tutorial
Satta: 2 theory papers
Derivational entropy (vs cross entropy): when equal cross ent
minimized, can use instead of ML for inf models

General Ideas, Questions
what are they, similarity? features? inner products?
integral kernels in wikipedia?
help compute margin?
euclidean distance vs inner products?
nonparametric bayesian has infinite inner product spaces, related?
if features on input, output => kernels vs. ecoc?
is it phi(y) or phi(x,y): attardi, collins
do they solve feature selection, transformation?
how are they an alternative for manual feature construction?
why does collins compare to boosting?
why no overfitting due to margin?
kernel related to nearest neighbor distance transformations?

Capturing distr similarity of words:
process from more frequent to less frequent
each successive word is modeled as one of the previous
distributions + original extension (gzip idea)
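
A sketch of the gzip idea above: walk the vocabulary from most to least frequent and describe each new word's context distribution as (closest previous word, plus its own novel contexts). The toy counts are invented.

```python
from collections import Counter

ctx = {                       # word -> observed following-word counts (toy)
    "the":  Counter(cat=5, dog=4, mat=3),
    "a":    Counter(cat=2, dog=2),
    "some": Counter(dog=1, cheese=1),
}

def overlap(a, b):            # shared probability mass of two distributions
    ta, tb = sum(a.values()), sum(b.values())
    return sum(min(a[k] / ta, b[k] / tb) for k in a)

order = sorted(ctx, key=lambda w: -sum(ctx[w].values()))
seen = [order[0]]
for w in order[1:]:
    base = max(seen, key=lambda p: overlap(ctx[w], ctx[p]))
    extension = set(ctx[w]) - set(ctx[base])
    print(f"{w}: base={base}, new contexts={sorted(extension)}")
    seen.append(w)
```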

Machine learning course:
pick classic nlp papers,
target ml topics that will let students understand them
give projects on nlp
1. wsd: yarowsky; clustering, unsup wsd: schutze, charniak
2. parsing: dop, collins99, charniak00, mcdonald05
3. semrel: SRL, RTE, Rel.Ext., Framenet, Propbank, Turney
4. morphology: induction, child data, disambiguation
5. NER, MT, coreference ...

How to remove label dependency in WSD and SRL:
1. use similarity ranking turney style
2. learn one label from few positive examples (prototypes)
given a corpus, measure precision and recall
example: hatchback-sedan vs car-truck vs train-airplane
all imply different resolution
3. generation: replace words, synt structures
4. if we need negative examples, how do you come up with them?

Clustering words and senses together:
ws discrimination works on a single word multiple senses
word similarity groups multiple words without regard to senses
must do both!

ECOC on WSD: first project to try

1. convert penn treebank using nivre, ccg
2. write supervised parser
3. write unsupervised parser
Idea: Co-training with frequent word pairs and syntactic patterns
Idea: organize training sequence - shorter sentences first
Idea: only evaluate on 10 word sentences!
Idea: how does dop math (all subtrees) translate to dependency?

people fighting phrase structure
use dep parser on previous SRL data, framenet, propbank
participate in next

1. get latest turkish tb from gulsen
2. use semi-supervised or other to improve
3. exp models, naive bayes, svm on irrelev. redundant attr?
4. paraphrasing turkish word in english
5. n-best for turkish morphology?
idea: think about turkish tb representation - can we split and make
like english?

do you need mcmc only with hidden variables?
other questions on tutorial notes

Modeling inter sentence dependencies:
follow S-V or V-O across sentences
see where dependencies are
statistical tests? bayesian network induction?
can get entailment or time distr of concepts...
similar to the definition in semantic relations:
do we want a frequency distribution or logical necessity
can we distinguish the two?

Probabilistic common sense db
concepts - semantic relations vs words - syntactic structures

Full post...

June 08, 2006

Dependency Parsing as a Classification Problem

Deniz Yuret. In Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X)

Abstract: This paper presents an approach to dependency parsing which can utilize any standard machine learning (classification) algorithm. A decision list learner was used in this work. The training data provided in the form of a treebank is converted to a format in which each instance represents information about one word pair, and the classification indicates the existence, direction, and type of the link between the words of the pair. Several distinct models are built to identify the links between word pairs at different distances. These models are applied sequentially to give the dependency parse of a sentence, favoring shorter links. An analysis of the errors, attribute selection, and comparison of different languages is presented.
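
A toy illustration of the instance encoding the abstract describes: one instance per word pair, labeled with the existence, direction, and type of the link. The feature choices below are invented, not the paper's.

```python
# Convert a small dependency tree into word-pair classification instances.

def pair_instances(words, links, max_dist=3):
    """links: {(head_idx, dep_idx): label}. Yields (features, class)."""
    for i in range(len(words)):
        for j in range(i + 1, min(i + 1 + max_dist, len(words))):
            feats = (words[i], words[j], j - i)        # toy features
            if (i, j) in links:                        # head on the left
                cls = "R-" + links[(i, j)]
            elif (j, i) in links:                      # head on the right
                cls = "L-" + links[(j, i)]
            else:
                cls = "NONE"
            yield feats, cls

words = ["the", "cat", "sat"]
links = {(1, 0): "det", (2, 1): "subj"}
for inst in pair_instances(words, links):
    print(inst)
```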

Full post... Related link

June 05, 2006

Learning Morphological Disambiguation Rules for Turkish

Deniz Yuret and Ferhan Türe. In Proceedings of the Human Language Technology Conference - North American Chapter of the Association for Computational Linguistics Annual Meeting (HLT-NAACL 2006)
You can download the paper here, the stand-alone Turkish morphological disambiguator here, and the code used in this paper to train the model here.

Abstract: In this paper, we present a rule based model for morphological disambiguation of Turkish. The rules are generated by a novel decision list learning algorithm using supervised training. Morphological ambiguity (e.g. lives = live+s or life+s) is a challenging problem for agglutinative languages like Turkish where close to half of the words in running text are morphologically ambiguous. Furthermore, it is possible for a word to take an unlimited number of suffixes, therefore the number of possible morphological tags is unlimited. We attempted to cope with these problems by training a separate model for each of the 126 morphological features recognized by the morphological analyzer. The resulting decision lists independently vote on each of the potential parses of a word and the final parse is selected based on our confidence on these votes. The accuracy of our model (96%) is slightly above the best previously reported results which use statistical models. For comparison, when we train a single decision list on full tags instead of using separate models on each feature we get 91% accuracy.
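
A hedged sketch of the voting scheme the abstract outlines: independent per-feature models each contribute a confidence for the features of a candidate parse, and the parse with the highest combined confidence wins. The candidate parses and confidence values below are invented for illustration.

```python
# Combine independent per-feature votes to select a morphological parse.

def best_parse(candidates, feature_conf):
    """candidates: parse -> set of features; feature_conf: feature -> confidence."""
    def score(parse):
        s = 1.0
        for f in candidates[parse]:
            s *= feature_conf.get(f, 0.5)   # unseen feature: uninformative vote
        return s
    return max(candidates, key=score)

# toy ambiguity in the spirit of lives = live+s or life+s
candidates = {"live+Verb+Pres+3sg": {"Verb", "Pres"},
              "life+Noun+Plural":   {"Noun", "Plural"}}
conf = {"Verb": 0.3, "Pres": 0.6, "Noun": 0.8, "Plural": 0.9}
print(best_parse(candidates, conf))
```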

Full post... Related link

June 01, 2006

Ray Solomonoff

Ray Solomonoff gave a nice series of lectures at Bilgi University this week and piqued my interest in algorithmic probability again. The link below contains some of his stuff. Should find out more about Levin's search (http://www.idsia.ch/~juergen/optimalsearch.html). Juergen's website contains more interesting stuff as well:

Full post... Related link

Non-parametric Bayesian methods

[Math_ Learning_] Some tutorials:
Full post...

Information Theory, Inference, and Learning Algorithms by David MacKay

[Books_] I just discovered this wonderful book by David MacKay. It has a very good interpretation of Occam's razor in Bayesian terms. Very nice examples of Bayesian inference and pitfalls of orthodox statistics. The book is at the intersection of communication, inference, and learning (although very little learning). I am still undecided about the pedagogic value of information theory for teaching inference - for the most part the information theory results can be stated in probability terms and vice versa. It inspired me to replan my machine learning course more on foundations and less on various algorithmic incarnations. The book is freely available on the net at the above address (which is another reason I appreciated it and bought a hardcopy at Amazon).
Full post... Related link