I am an associate professor of Computer Engineering at Koç University in Istanbul, working at the Artificial Intelligence Laboratory. Previously I was at the MIT AI Lab and later co-founded Inquira, Inc. My research is in natural language processing and machine learning. For prospective students, here are some research topics, papers, classes, blog posts, and past students.
November 19, 2015
M.S. Thesis: Analysis of Context Embeddings in Word Sense Induction. Koç University, Department of Computer Engineering. November, 2015. (PDF, Presentation, Code)
Representing word senses with a fixed list of definitions from a manually constructed lexical database has several drawbacks. There is no guarantee that the definitions reflect the exact meaning of a target word in a given context, since they are usually too general. Moreover, lexical databases often include many rare senses while missing corpus- or domain-specific senses. Word Sense Induction (WSI) focuses on discriminating the usages of a polysemous word without using a fixed list of definitions or any hand-crafted resources.
In contrast to the most common approach in WSI, which is to apply clustering or graph partitioning to a representation of first- or second-order co-occurrences of a word, my method obtains, for each context, a probability distribution over substitute words suggested by a statistical model. This distribution is used to create context embeddings with the co-occurrence framework, which represents each context as a low-dimensional, dense vector in Euclidean space. These context embeddings are then clustered with the k-means algorithm to discriminate the usages (senses) of a word. The method has already proved useful in unsupervised part-of-speech induction and in supervised tasks such as multilingual dependency parsing. I evaluate it on the SemEval 2010 and SemEval 2013 Word Sense Induction lexical sample tasks and on a dataset I created from OntoNotes 5.0. This new lexical sample dataset has high inter-annotator agreement (IAA > 90%), and it has more instances per word type (> 500) than any previous lexical sample task.
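The pipeline described above (substitute distribution, then context embedding, then k-means) can be illustrated with a minimal sketch. It is not the thesis implementation and does not reproduce the co-occurrence framework: as a stand-in, each context embedding here is simply the probability-weighted average of substitute word vectors. The toy vectors, the toy distributions, and the function names `context_embedding` and `kmeans` are all assumptions made for illustration.

```python
import random


def context_embedding(substitute_probs, word_vectors):
    """Probability-weighted average of the substitute word vectors.

    Stand-in for the thesis's co-occurrence framework (an assumption).
    """
    dim = len(next(iter(word_vectors.values())))
    total = sum(substitute_probs.values())
    vec = [0.0] * dim
    for word, prob in substitute_probs.items():
        for i, value in enumerate(word_vectors[word]):
            vec[i] += (prob / total) * value
    return vec


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical 2-d vectors for substitutes of the target word "bank".
word_vectors = {
    "river": [1.0, 0.0],
    "shore": [0.9, 0.1],
    "money": [0.0, 1.0],
    "loan": [0.1, 0.9],
}

# Hypothetical substitute distributions, one per context of "bank".
contexts = [
    {"river": 0.7, "shore": 0.3},
    {"shore": 0.6, "river": 0.4},
    {"money": 0.8, "loan": 0.2},
    {"loan": 0.5, "money": 0.5},
]

embeddings = [context_embedding(c, word_vectors) for c in contexts]
labels = kmeans(embeddings, k=2)
print(labels)
```

On this toy input the two river-like contexts end up with one label and the two money-like contexts with the other; which numeric label each group receives depends on the random initialization.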
The contributions of this thesis are as follows: (1) I propose a method for the Word Sense Induction problem. (2) I provide a comprehensive analysis of (a) the embedding step, by comparing against other popular word embeddings, each transformed into context embeddings using the substitute word distributions for each context, and (b) the clustering step, by comparing different clustering algorithms (k-means, Spectral Clustering, DBSCAN) and different clustering approaches (a local approach, where the instances of each word type are clustered separately, and a part-of-speech-based approach, where instances tagged with the same part of speech are clustered together).
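The difference between the two clustering approaches comes down to how instances are grouped before a clustering algorithm is run on each group. The sketch below shows only that grouping step, under one reading of the description above; the instance fields `id`, `word`, and `pos` and the function `group_instances` are hypothetical and are not taken from the thesis code.

```python
from collections import defaultdict


def group_instances(instances, approach):
    """Split instances into the groups that are clustered separately.

    approach="local": one group per word type.
    approach="pos":   one group per part-of-speech tag.
    """
    key_functions = {
        "local": lambda inst: inst["word"],
        "pos": lambda inst: inst["pos"],
    }
    key = key_functions[approach]
    groups = defaultdict(list)
    for inst in instances:
        groups[key(inst)].append(inst["id"])
    return dict(groups)


# Hypothetical lexical sample instances.
instances = [
    {"id": "bank.n.1", "word": "bank", "pos": "n"},
    {"id": "bank.n.2", "word": "bank", "pos": "n"},
    {"id": "plant.n.1", "word": "plant", "pos": "n"},
    {"id": "run.v.1", "word": "run", "pos": "v"},
]

local_groups = group_instances(instances, "local")
pos_groups = group_instances(instances, "pos")
print(local_groups)
print(pos_groups)
```

Under the local grouping, "bank", "plant", and "run" would each get their own clustering run; under the part-of-speech grouping, all noun instances would be clustered in a single run and the verb instances in another.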
The code to replicate the results in this thesis can be found at https://github.com/osmanbaskaya/wsid.