[Language_]

People

Joakim Nivre

Sabine Buchholz

Gulsen Eryigit

Giuseppe Attardi

Terms

Ablation study

Active learning

Bayes point machines

Bayesian = no overfitting?

Bayesian discriminative

Bayesian hypothesis testing

Bayesian nonparametric

Bayesian unsupervised

Belief propagation

Boosting - vs. SVM?

Bootstrapping - co-training, active learning

CCG

CKY packed chart

CKY parsing algorithm

CRF conditional random fields vs EM vs Grad.Descent

Co-training

Collins algorithm for large margin learning

Construction Grammar

Convex duality

DIRT, phrasal similarity etc. (RTE-2?)

DOP math

DOP, L-DOP, U-DOP

Derivational entropy

Discriminative log linear models

EM vs ML vs just sampling

Expectation propagation - q||p focus on mode, p||q focus on volume

Exponential updates

FTA finite tree automata

Fisher kernel

G^2 statistic: unseen events - sampling or dependency

Gamma distribution - conjugate prior for the Gaussian precision

Generative vs discriminative

Global linear model

Grammar consistency

ILP inductive logic programming - (can't do pos tagging?) (vs. GPA?)

Inside outside probabilities

Jensen inequality, EM math

Kappa

Kernel over sent. tree pairs? data dependent kernels?

LDA

LFG

Laplace approx, Hessian vs. variational vs. bayes integration

Laplacian

Lattice

Log linear parsing model

Margins vs boosting

Markov random fields

MCMC - why is q in the acceptance probability?

MCMC - do we need it just for EM hidden vars? when two hidden next to each other in bayesian network

P-test problems

ROC curve

RTG

Regularization vs Bayesian priors

Renormalization

SCFG

STSG

SVM math

SVM vs Winnow vs Perceptron

Sampling paradigm vs diagnostic paradigm

Self-training

Self-training vs EM?

Self-training, co-training

Semi-supervised

Shift-reduce for dependency

Stochastic gradient descent

TAG

Tree kernels, molt kernel

Variational approximation

Variational EM

X^2 statistic, anova: dependency?
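On the G^2 and X^2 entries above: a minimal sketch of how both statistics are computed from a contingency table of counts, assuming the usual independence null hypothesis. The function names and the example counts are made up for illustration. Both statistics are zero when observed counts match the independence expectation exactly, and they diverge most when some cells are small, which is where the unseen-events question comes in.

```python
import math

def expected_counts(table):
    """Expected cell counts under independence of rows and columns."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def pearson_x2(table):
    """Pearson X^2: sum of (observed - expected)^2 / expected."""
    exp = expected_counts(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, exp)
               for o, e in zip(obs_row, exp_row))

def g2(table):
    """Log-likelihood ratio G^2: 2 * sum of observed * ln(observed / expected).
    Cells with zero observed count contribute nothing to the sum."""
    exp = expected_counts(table)
    return 2.0 * sum(o * math.log(o / e)
                     for obs_row, exp_row in zip(table, exp)
                     for o, e in zip(obs_row, exp_row)
                     if o > 0)

# Hypothetical co-occurrence counts for a word pair (rows: w1 present/absent,
# columns: w2 present/absent).
dependent = [[30, 10], [10, 30]]
independent = [[10, 20], [30, 60]]
print(pearson_x2(dependent), g2(dependent))
print(pearson_x2(independent), g2(independent))
```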

Resources

ACE data

ATIS3

BUGS

CCGbank - penn treebank in categorial? grammar with dependencies

CHILDES

Charniak parser

Chinese treebank

Conll (04,05), Senseval-3 SRL data

Conll-x original treebanks

Framenet

Mallet maxent

Manning parser

McDonald parser

Mindnet

NER system from BBN (Boris)

Nivre Maltparser

Nlpwin parser - heidorn 2000

NomBank

Ontonotes - palmer

Opennlp maxent

Pascal-network.org

Propbank

RTE data

Wordnet evoke relation

Yamada, Matsumoto head rules

Cited Papers

Abney96

Goodman03

Bod03

Hearst92 (auto-wn)

McDonald05

Yamada and Matsumoto (vs. Nivre?)

Nivre03

Collins02

Eisner05 (from dreyer)

Eisner bilexical chapter

CLE - chu, liu, edmonds? mst parsing.

Smith, Eisner 05: log-linear models on unlabeled data

Schutze93

Schutze98

Roth and Yih ICML '05

Geffet and Dagan

VanZaanen 00, 01 unsupervised 39%

Alex Clark, unsupervised, 00, 01, 42%

Klein and Manning, unsupervised, 02, 04, 78% on 10 word sent.

kazama, torisawa '05

roth 2000 - linear xform not sufficient for nlp

JMLR03 - LDA paper

Berger, statistical decision theory and bayesian analysis

Mackay, information theory, inference and learning algorithms

Wasserman, all of statistics

Bishop, pattern recognition and machine learning

Andrieu ML2003, introduction to MCMC

Wainwright, Jordan, graphical models etc. UCB stat tr649, 2003

Murphy, A brief intro to graphical models (url)

Minka, using lower bounds to approx integrals

Minka, bayesian conditional random fields

Barnard, matching words and pictures

Neal and Minka has a bunch of stuff on bayesian

Papers in HLT

*** Data sparseness:

Wang06, svm with similarity

Haghighi06, Klein, prototypes best paper

Ando06, ASO - multi-task learning

Semi-supervised tutorial

1. semi supervised learning, prototypes (klein)

2. word similarity, ASO, ECOC - share information

3. context similarity: homology idea

4. LSA, ecoc, clustering

5. what kind of bayes prior is this?

dim reduction, LSA, grouping similar cases, clustering: if predicting y|x, we are grouping x's similar in y space (i.e. words that are similar in their distributions, then predicting distribution of rarely seen word). idea behind semi-supervised, or homologues: y|x,c (c=context, x=word, y=sense): group things similar in x,c space.
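A rough sketch of the "words that are similar in their distributions" idea: represent each word by counts of its neighbouring words and compare with cosine similarity. The toy corpus, the window size, and the choice of cosine are all assumptions for illustration, not something taken from the talks.

```python
import math
from collections import Counter, defaultdict

def context_counts(sentences, window=1):
    """Map each word to a Counter of the words seen within `window` positions."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

# Toy corpus (hypothetical): "cat" and "dog" share contexts, "bird" does not.
corpus = ["the cat sleeps", "the dog sleeps", "the cat eats",
          "the dog eats", "the dog barks", "a bird sings"]
vectors = context_counts(corpus)
print(cosine(vectors["cat"], vectors["dog"]))
print(cosine(vectors["cat"], vectors["bird"]))
```

A rarely seen word could then borrow the distribution of its nearest neighbours, which is the grouping-in-y-space reading of the note.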

External vs internal similarity:

people use externally defined similarity (e.g. Lin's thesaurus) to make the data less sparse. however, similarity should be part of the learning process, i.e. similar means occurring in similar structures when you are learning structures.

re: wang paper

Similarity as soft parts of speech:

use feature bags instead of continuous space

feature bags = mixture of distributions?

Log-linear models on unlabeled data smith, eisner '05:

apply to wsd, ner

semi-supervised: abney04 analyzes yarowsky95 bootstrap

blum and mitchell 98 - co-training

nigam and ghani 2000 - co-em

abney 2002,

collins and singer 1999 (NER)

goldman and zhou 2000 (active learning)

ando and zhang 2005 - classifier struct vs input space

data manifold: the lower dimensional surface to which the data is restricted may change the distance metric

*** Dependency parsing:

Corston-Oliver06, multilingual dep parsing bayes pt mach

Nivre06, maltparser

Grammar induction section

missed: Atterer, Brooks, Clark on unsup parsing

Shift-reduce parsers:

how do they deal with a->b->c vs. a->b, a->c
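One possible answer to the question above, sketched with the arc-standard transition system (an assumption; the notes do not say which system was meant). In arc-standard a word can only be attached to its head after it has collected all of its own dependents, so the chain and the sibling structure differ in when the first RIGHT-ARC fires. The function name and the toy words are illustrative only.

```python
def run_arc_standard(words, transitions):
    """Apply arc-standard transitions and return the (head, dependent) arcs.

    SHIFT      moves the next buffer word onto the stack.
    LEFT-ARC   makes the stack top the head of the second item and removes
               the second item.
    RIGHT-ARC  makes the second item the head of the stack top and removes
               the top.
    """
    stack, buffer, arcs = [], list(words), []
    for t in transitions:
        if t == "SHIFT":
            stack.append(buffer.pop(0))
        elif t == "LEFT-ARC":
            dep = stack.pop(-2)
            arcs.append((stack[-1], dep))
        elif t == "RIGHT-ARC":
            dep = stack.pop()
            arcs.append((stack[-1], dep))
        else:
            raise ValueError(f"unknown transition: {t}")
    return arcs

words = ["a", "b", "c"]

# a->b->c: b must wait on the stack until c has been attached to it.
chain = run_arc_standard(
    words, ["SHIFT", "SHIFT", "SHIFT", "RIGHT-ARC", "RIGHT-ARC"])

# a->b, a->c: b has no dependents, so it is attached to a before c is shifted.
siblings = run_arc_standard(
    words, ["SHIFT", "SHIFT", "RIGHT-ARC", "SHIFT", "RIGHT-ARC"])

print(chain)
print(siblings)
```

The classifier therefore has to decide at the [a, b] configuration whether b still expects dependents, which is where lookahead features on the buffer matter.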

Iterative dependency parsing:

Do you order by distance or label type?

Add GPA to Nivre parser:

Attardi - maxent does worse than SVM

maxent vs naive bayes vs SVM question

Handling non-projective trees:

1. nivre does pre and post processing

2. attardi adds moves to switch during parsing

3. mcdonald uses mst algorithm to parse non-projective

Example: I saw a man yesterday who was suspicious looking.
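To see why the example sentence is non-projective, here is a small check for crossing arcs. The head indices are my own guess at a reasonable annotation (relative clause "was" attached to "man", "yesterday" attached to "saw"), not taken from any treebank, and `is_projective` is a hypothetical helper name.

```python
def is_projective(heads):
    """heads[i] is the 1-based position of the head of token i+1; 0 is the root.

    A tree is projective if no two arcs cross when drawn above the sentence.
    """
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for l1, r1 in arcs:
        for l2, r2 in arcs:
            # Two arcs cross if exactly one endpoint of the second lies
            # strictly inside the first.
            if l1 < l2 < r1 < r2:
                return False
    return True

# 1 I  2 saw  3 a  4 man  5 yesterday  6 who  7 was  8 suspicious  9 looking
# Assumed heads: I->saw, saw->ROOT, a->man, man->saw, yesterday->saw,
# who->was, was->man, suspicious->looking, looking->was
heads = [2, 0, 4, 2, 2, 7, 4, 9, 7]
print(is_projective(heads))            # arc saw-yesterday crosses arc man-was

# Without the relative clause the same analysis has no crossing arcs.
print(is_projective([2, 0, 4, 2, 2]))
```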

Parse order:

Can process left to right or right to left with deterministic dependency parsers

Fixing errors in det. parsing

can backtrack or can perform repairs

how does my phd algorithm relate?

Four dep parsing approaches: (long talks at conll-x)

1. learn parse moves

1.1 non-projectives by adding moves

1.2 non-projectives by pre-post processing

2. learn tree structures

2.1 eisner n^3 algorithm

2.2 mcdonald mst algorithm

Parsing algorithms:

n^2 algorithm for non-projective

n^3 is the best for projective ???

Lazy application of constraints usually wins

- do not always have to check them all

ando - ASO: rich caruana - backprop in multi-task learning

bach and jordan - predictive low rank decomposition

vs. semi supervised?

*** DOP parsing:

Zuidema06, what are the productive ... DOP

Bod06, unsupervised dop model

Goodman03, PCFG reduction

Johnson02, estimator inconsistent

Bod: unsuper better than super on <= 40 words!

Previous results from bod (u-dop): VanZaanen, Klein, Clark

*** Anaphora:

Garera06 (yarowsky), resolving anaphora

WSD ecoc idea:

feature vectors for senses - could include common anaphora
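A minimal sketch of the decoding half of ECOC as it might apply here: each sense gets a binary codeword (one bit per binary feature, which could include "can be referred to by anaphor X"), one binary classifier predicts each bit, and the sense whose codeword is closest in Hamming distance wins. The sense labels, the codewords, and the predicted bits below are all invented for illustration.

```python
def hamming(a, b):
    """Number of positions where two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))

def decode(predicted_bits, codebook):
    """Return the sense whose codeword is nearest to the predicted bits."""
    return min(codebook, key=lambda sense: hamming(codebook[sense], predicted_bits))

# Hypothetical codebook: one row per sense, one column per binary feature.
codebook = {
    "bank/finance": (1, 1, 0, 0, 1),
    "bank/river":   (0, 0, 1, 1, 1),
    "bank/tilt":    (1, 0, 1, 0, 0),
}

# Output of five (hypothetical) per-bit classifiers, with one bit wrong
# relative to the "bank/finance" codeword.
predicted = (1, 1, 0, 1, 1)
print(decode(predicted, codebook))
```

The error-correcting part comes from keeping codewords far apart, so a few wrong bit classifiers still decode to the right sense.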

Natural levels:

natural levels in wn like animal can be picked by anaphora?

same word, many anaphora - depends on what you want to emphasize

more support on the feature bag representation

garera-yarowsky paper.

*** Bayesian:

Beyond EM tutorial

*** Semantic role labeling:

SRL tutorial

MacCartney06, RTE

*** WSD:

levin06 - evaluation of utility ... WSD

Clustering vs supervised:

score classes using clustering metrics

re: levin06 - evaluation of utility

*** Misc.

Mohri06 best paper: PCFG - zeroes because of sampling or constraints?

Whittaker06 QA system ask-ed

NER tutorial

Satta: 2 theory papers

Derivational entropy (vs cross entropy): when equal cross ent minimized, can use instead of ML for inf models

General Ideas, Questions

Kernels:

what are they, similarity? features? inner products?

integral kernels in wikipedia?

help compute margin?

euclidean distance vs inner products?

nonparametric bayesian has infinite inner product spaces, related?

if features on input, output => kernels vs. ecoc?

is it phi(y) or phi(x,y): attardi, collins

do they solve feature selection, transformation?

how are they an alternative for manual feature construction?

why does collins compare to boosting?

why no overfitting due to margin?

kernel related to nearest neighbor distance transformations?

Capturing distr similarity of words:

process from more frequent to less frequent

each successive word is modeled as one of the previous distributions + original extension (gzip idea)

Machine learning course:

pick classic nlp papers,

target ml topics that will let students understand them

give projects on nlp

1. wsd: yarowsky; clustering, unsup wsd: schutze, charniak

2. parsing: dop, collins99, charniak00, mcdonald05

3. semrel: SRL, RTE, Rel.Ext., Framenet, Propbank, Turney

4. morphology: induction, child data, disambiguation

5. NER, MT, coreference ...

How to remove label dependency in WSD and SRL:

1. use similarity ranking turney style

2. learn one label from few positive examples (prototypes)

given a corpus, measure precision and recall

example: hatchback-sedan vs car-truck vs train-airplane

all imply different resolution

3. generation: replace words, synt structures

4. if we need negative examples, how do you come up with them?

Clustering words and senses together:

ws discrimination works on a single word multiple senses

word similarity groups multiple words without regard to senses

must do both!

ECOC on WSD: first project to try

Parsing:

1. convert penn treebank using nivre, ccg

2. write supervised parser

3. write unsupervised parser

Idea: Co-training with frequent word pairs and syntactic patterns

Idea: organize training sequence - shorter sentences first

Idea: only evaluate on 10 word sentences!

Idea: how does dop math (all subtrees) translate to dependency?

SRL:

people fighting phrase structure

use dep parser on previous SRL data, framenet, propbank

participate in next

Morphology:

1. get latest turkish tb from gulsen

2. use semi-supervised or other to improve

3. exp models, naive bayes, svm on irrelev. redundant attr?

4. paraphrasing turkish word in english

5. n-best for turkish morphology?

idea: think about turkish tb representation - can we split and make it like english?

Bayes:

do you need mcmc only with hidden variables?

other questions on tutorial notes
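On the earlier question of why q shows up in the MCMC acceptance probability, a worked statement of the standard Metropolis-Hastings rule may help. Here p is the target distribution (known up to a constant) and q(x' | x) is the proposal; the notation is mine, not the tutorial's.

```latex
% Metropolis-Hastings acceptance probability for a move x -> x'
A(x \to x') = \min\left(1,\; \frac{p(x')\, q(x \mid x')}{p(x)\, q(x' \mid x)}\right)

% The q ratio is what makes detailed balance hold for an asymmetric proposal:
p(x)\, q(x' \mid x)\, A(x \to x') \;=\; p(x')\, q(x \mid x')\, A(x' \to x)

% If the proposal is symmetric, q(x' \mid x) = q(x \mid x'), the q terms cancel
% and the rule reduces to the original Metropolis form:
A(x \to x') = \min\left(1,\; \frac{p(x')}{p(x)}\right)
```

So q is there to correct for the proposal favouring some directions over others; without it the chain would converge to a distribution skewed by the proposal rather than to p.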

Modeling inter sentence dependencies:

follow S-V or V-O across sentences

see where dependencies are

statistical tests? bayesian network induction?

can get entailment or time distr of concepts...

similar to the definition in semantic relations:

do we want a frequency distribution or logical necessity

can we distinguish the two?

Probabilistic common sense db

concepts - semantic relations vs words - syntactic structures
