Subject: Natural language and text processing
Course: Language engineering
ECTS credits: 6
Language: Croatian, English
Duration: 1 semester
Status: compulsory elective
Method of teaching: 2 lecture hours and 2 seminar hours
Prerequisite: Introduction to Natural Language Processing course
Assessment: written exam

Course description:
Natural language processing is the automatic analysis of human language by computer algorithms. It is used to convert one form of language into another and to parse language into a structured form. Language conversion includes summarization, paraphrasing, and translation, while parsing covers the transformation of unstructured data into structured form. The course topics are: language engineering in the context of intelligent systems, levels of lexical knowledge, word-based systems (regular expressions, word-based information retrieval, spelling checkers, word structure, and dictionary knowledge representation), and sentence-based systems (grammars, syntactic categories, automatic tagging, sentence parsing).
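
As an illustration of the word-based systems listed above, the following minimal Python sketch combines a regular expression tokenizer with word-based information retrieval over an inverted index. It is only a sketch under stated assumptions, not course material: the toy corpus and the names DOCS, tokenize, build_index and search are hypothetical and were invented for this example.

```python
import re
from collections import defaultdict

# Hypothetical toy corpus, invented for this illustration.
DOCS = {
    1: "The parser builds a tree for every sentence.",
    2: "A spelling checker compares every word with a dictionary.",
    3: "The tagger assigns a syntactic category to every word.",
}

# One token is one run of letters, digits or underscores.
WORD = re.compile(r"\w+")


def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return WORD.findall(text.lower())


def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents containing every query word."""
    words = tokenize(query)
    if not words:
        return []
    # index.get avoids creating empty entries for unknown words.
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)


INDEX = build_index(DOCS)
print(search(INDEX, "every word"))  # prints [2, 3]
```

The tokenizer turns unstructured text into a structured list of words, and the index then answers queries by intersecting the document sets of the query words. A sentence-based system would add grammatical analysis on top of this word level.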

Course objectives:
Students should acquire theoretical and practical knowledge of the main principles of language engineering, one of the most important types of intelligent systems in the information era. By the end of the course, students should be able to:
a) recognize the features that distinguish a natural language system from other intelligent systems
b) demonstrate in-depth knowledge of word-based systems as well as sentence-based systems
c) explain how an approach based on linguistic rules differs from an approach based on pure statistics
d) evaluate existing systems

Quality check and success of the course: The quality and success of the course will be assessed by combining internal and external evaluation. Internal evaluation will be carried out by teachers and students by means of a survey at the end of the semester. External evaluation will be carried out by colleagues attending the course, through monitoring and assessment of the course.

Reading list:
1. Ivan A. Sag & Thomas Wasow: Syntactic theory: A formal introduction, Stanford: CSLI 1999.
2. Marko Tadić. Jezične tehnologije i hrvatski jezik. Exlibris, Zagreb, 2003.

Additional reading list:
1. Daniel Jurafsky & James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2000.
2. Copestake, Ann. Analysing Sentences, Noel Burton-Roberts, Longman, 1997.
3. Allen, James. Natural Language Understanding. Redwood, CA: Benjamin, 1995.
4. Marko Tadić and Božo Bekavac. Preparation of POS tagging of Croatian using CLaRK System. Proceedings of RANLP2003 Conference (Borovets 2003), Bulgarian Academy of Sciences, pp. 455-459.
5. Marko Tadić and Krešimir Šojat. Finding Multiword Term Candidates in Croatian. Proceedings of RANLP2003 Conference (Borovets 2003), Bulgarian Academy of Sciences, pp. 102-107.
6. Evans, Roger, and Gerald Gazdar. DATR: A Language for Lexical Knowledge Representation. Computational Linguistics 22(2), pp. 167-216.
7. Pinker, Steven. The Language Instinct. London: Penguin, 1994.