Title Details:
Text corpora and applications
Authors: Tantos, Alexandros
Reviewer: Gkotsoulia, Paraskevi
Description:
Abstract:
Text corpora (TC) are one of the main linguistic resources for automated natural language processing. This chapter introduces basic concepts for (a) the creation and (b) the utilization of TC. First, after highlighting the qualitative difference between annotated and unannotated TC, it analyzes the basic criteria for selecting and classifying TC for targeted and more effective linguistic or non-linguistic processing of text data. Creating a corpus is a difficult task that requires observing several basic criteria for the selection of texts, so that the collected language sample is representative of the linguistic variety it aims to represent. Next, the types of corpora are presented, along with examples of how they can be used. In addition, the reader becomes familiar with XML, the dominant markup language for the majority of annotated corpora today. The last part of the chapter presents the basic principles of probability theory that are necessary for a number of applications in computational linguistics. To that end, hypothesis formulation and testing serve as an example of the analysis of categorical variables in linguistic data, and the process of hypothesis testing on corpora is described step by step using a concrete example. Hypothesis testing is an essential everyday tool for computational linguists and others who process linguistic data.
Technical Editors: Minos, Panagiotis
Type: Chapter
Creation Date: 2015
Item Details:
License: http://creativecommons.org/licenses/by-nc-sa/3.0/gr
Handle: http://hdl.handle.net/11419/2210
Bibliographic Reference: Tantos, A. (2015). Text corpora and applications [Chapter]. In Tantos, A., Markantonatou, S., Anastassiadis Symeonidis, A., & Kyriakopoulou, P. (2015). Computational Linguistics [Undergraduate textbook]. Kallipos, Open Academic Editions. https://hdl.handle.net/11419/2210
Language: Greek
Is Part of: Computational Linguistics
Publication Origin: Kallipos, Open Academic Editions