GENRE CLASSIFICATION IN THE GENERAL INTERNET CORPUS OF RUSSIAN

Piperski A.C.

Corpora are an indispensable research tool in present-day linguistics. To achieve reliable results in a corpus-based study, a scholar should take metadata into account, i.e. the sociolinguistic, regional and genre-related properties of the texts included in the corpus. In most corpora metadata are added manually, which is not feasible when constructing large Web-based corpora. Since the General Internet Corpus of Russian (GICR) is one such corpus, it has to rely on automated metadata tagging. The developers of GICR propose a novel approach to genre classification that does not postulate any a priori categories. Instead, machine learning algorithms are used to cluster texts on the basis of automatically extractable features.
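
The idea of clustering texts by automatically extractable features, with no genre labels fixed in advance, can be illustrated with a minimal sketch. The abstract does not say which features or which algorithm GICR uses, so everything here is an assumption made for illustration: the three surface features in `extract_features`, the use of k-means, the choice of two clusters, and the toy English texts.

```python
import re
from statistics import mean, pstdev


def extract_features(text):
    """Surface features computable without manual annotation.

    The feature set is a hypothetical choice for this sketch, not GICR's.
    """
    words = re.findall(r"\w+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = max(len(words), 1)
    return [
        mean(len(w) for w in words) if words else 0.0,  # mean word length
        n_words / max(len(sentences), 1),               # words per sentence
        sum(text.count(c) for c in "!?") / n_words,     # "!" and "?" per word
    ]


def standardize(rows):
    """Rescale every feature column to zero mean and unit variance."""
    columns = list(zip(*rows))
    mus = [mean(col) for col in columns]
    sds = [pstdev(col) or 1.0 for col in columns]  # guard constant columns
    return [[(x - m) / s for x, m, s in zip(row, mus, sds)] for row in rows]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain k-means; the first k points serve as deterministic initial centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [mean(col) for col in zip(*members)]
    return labels


# Toy documents: two chatty texts and two academic-sounding ones.
texts = [
    "Wow! Really? No way! That is so cool!",
    "The present investigation examines morphological variation across "
    "regional newspaper corpora and evaluates statistical classification procedures.",
    "Ha! Are you sure? I am not! Tell me more!",
    "Subsequent experiments demonstrate considerable improvements whenever "
    "supplementary distributional information becomes available during "
    "unsupervised clustering.",
]

features = standardize([extract_features(t) for t in texts])
labels = kmeans(features, k=2)
print(labels)
```

On this toy input the two chatty texts end up in one cluster and the two academic-sounding texts in the other, although no genre names were supplied. Interpreting what each cluster means is left to the researcher, which is the sense in which no a priori categories are postulated.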

Bibliographic reference

Piperski A.C. GENRE CLASSIFICATION IN THE GENERAL INTERNET CORPUS OF RUSSIAN // Научное обозрение. Физико-математические науки. 2020. No. 1. P. 48-48;
URL: https://physics-mathematics.ru/en/article/view?id=64 (accessed: 24.06.2026).