Course: Information retrieval and natural language processing
ECTS credits: 3
Language: Croatian
Duration: 1 semester
Status: Compulsory, elective
Method of teaching: 1 hour of lectures, 1 hour of exercises
Prerequisites: None
Assessment: Completion of all weekly written assignments, final exam



Course description:


This course focuses on a series of natural language processing tasks that are useful for textual information retrieval. It starts by introducing basic concepts such as tokenization, indexing and weighting, then moves on to more complex NLP problems such as morphological normalization (stemming and lemmatization) and document similarity assessment. It introduces multiple information retrieval paradigms, such as the vector space model and probabilistic information retrieval. The course ends with a practical task in which supervised machine learning, specifically the multinomial Naive Bayes classifier, is applied to document classification. Multiple experimental settings are evaluated and compared.
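
As an illustration of the final practical task, the following is a minimal sketch of a multinomial Naive Bayes document classifier with add-one (Laplace) smoothing. It is not taken from the course materials: the class name, the `alpha` parameter, the whitespace tokenization and the toy training set are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


class MultinomialNB:
    """Illustrative multinomial Naive Bayes over token lists (hypothetical class)."""

    def __init__(self, alpha=1.0):
        # alpha is the additive smoothing constant; 1.0 gives Laplace smoothing.
        self.alpha = alpha

    def fit(self, docs, labels):
        self.n_docs = len(labels)
        self.class_counts = Counter(labels)      # number of documents per class
        self.term_counts = defaultdict(Counter)  # term frequencies per class
        for tokens, label in zip(docs, labels):
            self.term_counts[label].update(tokens)
        self.vocab = {t for counts in self.term_counts.values() for t in counts}
        return self

    def log_scores(self, tokens):
        """Return log P(class) + sum of log P(term | class) for each class."""
        scores = {}
        for label, n_label in self.class_counts.items():
            counts = self.term_counts[label]
            total = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_label / self.n_docs)
            for term in tokens:
                if term in self.vocab:  # terms never seen in training are ignored
                    score += math.log((counts[term] + self.alpha) / total)
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)


# Toy training data, invented for illustration only.
train = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project meeting notes", "ham"),
]
model = MultinomialNB().fit([text.split() for text, _ in train],
                            [label for _, label in train])
print(model.predict("buy cheap pills".split()))
print(model.predict("monday meeting".split()))
```

Comparing settings, as the description mentions, could then amount to varying choices such as the tokenizer, the normalization step or `alpha` and measuring accuracy on held-out documents.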


Course objectives:


Students master the basic IR-related NLP tasks, such as tokenization, construction of an inverted index, TF-IDF weighting, document vectorization, cosine vector similarity, stemming and lemmatization. They become acquainted with two information retrieval paradigms: the vector space model and probabilistic information retrieval. Finally, they master the basics of supervised machine learning and its evaluation principles through a document classification task.
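
To make several of these tasks concrete, here is a minimal sketch covering tokenization, inverted index construction, TF-IDF weighting and cosine similarity. It is an illustration under stated assumptions rather than the course's own implementation: the function names, the toy corpus and the choice of raw term frequency multiplied by log(N / df) are all assumptions, and other weighting variants are common.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical toy corpus, invented for illustration only.
DOCS = {
    "d1": "The cat sat on the mat.",
    "d2": "The dog chased the cat.",
    "d3": "Stock markets fell sharply today.",
}


def tokenize(text):
    """Lowercase the text and split it into runs of word characters."""
    return re.findall(r"\w+", text.lower())


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def tfidf_vectors(docs):
    """Return a sparse {term: weight} vector per document.

    Assumed weighting: raw term frequency times log(N / df).
    """
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    df = Counter(term for toks in tokens.values() for term in set(toks))
    n = len(docs)
    return {
        doc_id: {t: tf * math.log(n / df[t]) for t, tf in Counter(toks).items()}
        for doc_id, toks in tokens.items()
    }


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


index = build_inverted_index(DOCS)
vectors = tfidf_vectors(DOCS)
print(index["cat"])
print(cosine(vectors["d1"], vectors["d2"]) > cosine(vectors["d1"], vectors["d3"]))
```

In this sketch, documents that share no terms get a similarity of exactly zero, which is one reason morphological normalization such as stemming or lemmatization matters before vectorization.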


Reading list: