AN APPROACH TO AUTOMATIC EXTRACTION OF INFORMATION ABOUT AUTHORS AND THEIR TEXTS FROM WEB FORUMS

Pronin A.K., Kopylov N.Y.

This article describes an approach to automatically extracting information about authors and their texts from web forums. The algorithm is built on the concept of style trees, an approach that aggregates similar nodes in the tree representing a page's Document Object Model (DOM). Nodes are considered similar if they share the same HTML tag name and the same parent node. In the final steps, simple heuristics are applied, based on observed characteristics of the texts that contain users' pseudonyms and their messages. In testing, the algorithm reached 80% accuracy. Its practical value lies in expanding the text resources available as sources of natural discourse, particularly when building a very large text corpus.
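
The node-similarity rule above (same tag name, same parent) can be illustrated with a minimal sketch. It is not the authors' implementation: the names Node, TreeBuilder, similar_groups and guess_author_and_message are hypothetical, the sample HTML is invented, and the final length-based heuristic is only a stand-in assumption, since the abstract does not say which heuristics were actually used.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags without a closing tag; they are added to the tree but never "opened".
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """Minimal DOM node (hypothetical helper, not from the article)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # Node or str items, in document order

    def elements(self):
        return [c for c in self.children if isinstance(c, Node)]

    def full_text(self):
        """Text of this node and its descendants, whitespace-normalized."""
        parts = [c if isinstance(c, str) else c.full_text()
                 for c in self.children]
        return " ".join(" ".join(parts).split())


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree with the standard-library parser."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray closers.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data)


def similar_groups(node, min_size=2):
    """Yield (parent, tag, nodes) for siblings that share one tag name."""
    by_tag = defaultdict(list)
    for child in node.elements():
        by_tag[child.tag].append(child)
    for tag, nodes in by_tag.items():
        if len(nodes) >= min_size:
            yield node, tag, nodes
    for child in node.elements():
        yield from similar_groups(child, min_size)


def guess_author_and_message(post):
    """Assumed placeholder heuristic: shortest text block is the pseudonym,
    longest is the message. The article's real heuristics are not given."""
    texts = [t for t in (c.full_text() for c in post.elements()) if t]
    if len(texts) < 2:
        return None
    return min(texts, key=len), max(texts, key=len)


# Invented sample page for illustration only.
html = """
<div id="thread">
  <div class="post"><b>alice</b><p>Has anyone tried the new parser?</p></div>
  <div class="post"><b>bob</b><p>Yes, it works fine on my forum dump.</p></div>
  <div class="post"><b>carol</b><p>Same here, no problems so far.</p></div>
</div>
"""
builder = TreeBuilder()
builder.feed(html)
groups = list(similar_groups(builder.root))
pairs = [guess_author_and_message(post) for post in groups[0][2]]
for author, message in pairs:
    print(author, "|", message)
```

In this sketch the three post containers form the only group, because they are the only siblings with a repeated tag name under a common parent. A real forum page would yield many such groups (navigation links, table rows), so an additional selection step would be needed to decide which group holds the posts.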

Bibliographic reference

Pronin A.K., Kopylov N.Y. AUTOMATIC EXTRACTION OF INFORMATION ABOUT AUTHORS AND THEIR TEXTS FROM WEB FORUM PAGES // Научное обозрение. Физико-математические науки. 2020. No. 1. P. 49-50;
URL: https://physics-mathematics.ru/en/article/view?id=67 (accessed: 24.06.2026).