Данг Куинь Ньы

Faculty of Computer Science

Profile on hse.ru, tel.: 27319


Positions

Visiting Lecturer: Faculty of Computer Science, Department of Big Data and Information Retrieval

Bio

· Started working at HSE University in 2026.

Work experience

· 2022-2024: International Laboratory for Intelligent Systems and Structural Analysis (Research Assistant)

Researcher identifiers

ORCID: 0000-0003-0450-7063

Publications (5)

Relative Chaoticity of Natural Languages

2026 · ARTICLE · en

This paper presents a novel approach to analyzing and grouping natural languages based on the degree of their chaoticity. It clusters 52 languages from 18 language families, according to the value of the entropy–complexity pair, to reveal the chaotic properties of semantic trajectories. The obtained clusters appear to be closely correlated with the family of languages under consideration as well as to certain language characteristics (word order, alignment, locus of marking, and morphological complexity). The study also proposes a robust method for assessing the chaoticity of a time series. The findings suggest the pressing need for a more in-depth investigation of how particular linguistic features and chaotic aspects of language are interrelated.
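As a rough illustration of what an entropy–complexity pair can look like for a scalar time series, the sketch below uses ordinal (permutation) patterns with normalised Shannon entropy and a Jensen-Shannon statistical complexity. This is an assumption made for illustration: the abstract does not state which estimator the paper uses, and the function names and the `order` parameter are hypothetical.

```python
import math
import random
from collections import Counter
from itertools import permutations


def shannon(probs):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_complexity(series, order=3):
    """Return (normalised permutation entropy, statistical complexity).

    Assumes a scalar series and ordinal patterns of length `order`;
    ties are broken by position in the window.
    """
    if len(series) < order:
        raise ValueError("series is shorter than the pattern length")
    n_patterns = math.factorial(order)
    # Count the ordinal pattern (argsort) of every sliding window.
    counts = Counter(
        tuple(sorted(range(order), key=lambda k: series[i + k]))
        for i in range(len(series) - order + 1)
    )
    total = sum(counts.values())
    probs = [counts.get(p, 0) / total for p in permutations(range(order))]
    uniform = 1.0 / n_patterns
    # Entropy normalised by its maximum, log(order!).
    h = shannon(probs) / math.log(n_patterns)
    # Jensen-Shannon divergence between the pattern distribution and uniform.
    mixture = [(p + uniform) / 2 for p in probs]
    js = shannon(mixture) - shannon(probs) / 2 - math.log(n_patterns) / 2
    # Largest possible divergence (one pattern only), used for normalisation.
    js_max = -0.5 * (
        (n_patterns + 1) / n_patterns * math.log(n_patterns + 1)
        - 2 * math.log(2 * n_patterns)
        + math.log(n_patterns)
    )
    return h, h * js / js_max


rng = random.Random(0)
noise = [rng.random() for _ in range(2000)]
h_noise, c_noise = entropy_complexity(noise)
h_ramp, c_ramp = entropy_complexity(list(range(100)))
```

Under these assumptions, white noise lands near the high-entropy, low-complexity corner of the plane, and a monotone ramp gives zero entropy and zero complexity; chaotic series are usually expected somewhere in between, with higher complexity.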


A Language and Its Holes: The First-Order Homology of the Large-Scale Geometrical Structure of a Natural Language

2025 · ARTICLE · en

The present paper employs topological data analysis methods to reveal ‘holes’ (stable persistent homologies) in the semantic spaces of words, bigrams, and trigrams of the English and Russian languages, and to ascertain their boundaries. Furthermore, the paper selects those holes that belong to the large‐scale (coarse‐grained) structure of the language and are not just local inhomogeneities of the sample; it appears that there are around a dozen of them for each of the languages (English and Russian). These boundaries delineate ‘blind spots’ of the respective language (the regions of the semantic spaces that do not contain words/bigrams/trigrams of the language), that is, regions of concepts that the language cannot see through its lens. The secondary goal of the paper is to solve the bot‐detection problem in its strong statement, that is, one trains the classifiers on one set of bots and tests them on another set of bots. To this end, we estimate the average distances from words, bigrams, and trigrams of a text to the boundaries of the nearest ‘hole’, for texts both written by humans and generated by bots, and construct classifiers. The classifiers show comparatively good results: the average accuracy amounts to 0.8.


Spot the Bot: the Inverse Problems of NLP

2024 · ARTICLE · en

This paper concerns the problem of distinguishing human-written and bot-generated texts. In contrast to the classical problem formulation, in which the focus falls on one type of bot only, we consider the problem of distinguishing texts written by any person from those generated by any bot; this involves analysing the large-scale, coarse-grained structure of the language semantic space. To construct the training and test datasets, we propose to separate not the texts of bots, but bots themselves, so the test sample contains the texts of those bots (and people) that were not in the training sample. We aim to find efficient and versatile features, rather than a complex classification model architecture that only deals with a particular type of bot. In the study we derive features for human-written and bot-generated texts, using clustering (Wishart and K-Means, as well as fuzzy variations) and nonlinear dynamic techniques (entropy-complexity measures). We then deliberately use the simplest of classifiers (support vector machine, decision tree, random forest) and the derived characteristics to identify whether the text is human-written or not. The large-scale simulation shows good classification results (a classification quality of over 96%), although varying for languages of different language families.
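The idea of separating bots rather than texts can be sketched as a split by author, so that every bot or person in the test sample is absent from the training sample. This is a minimal sketch and not the authors' procedure: the `(text, author_id, label)` sample layout, the function name, and the `test_fraction` parameter are assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_by_author(samples, test_fraction=0.25, seed=0):
    """Split (text, author_id, label) samples so no author is on both sides.

    `author_id` names the bot or person that produced the text (hypothetical
    field). Each label needs at least two authors, otherwise the training
    side ends up with no examples of that label.
    """
    # Collect the distinct authors seen under each label ("bot", "human").
    by_label = defaultdict(set)
    for _text, author_id, label in samples:
        by_label[label].add(author_id)

    rng = random.Random(seed)
    test_authors = set()
    for authors in by_label.values():
        ordered = sorted(authors)
        rng.shuffle(ordered)
        # Hold out at least one author per label.
        n_test = max(1, round(len(ordered) * test_fraction))
        test_authors.update(ordered[:n_test])

    train = [s for s in samples if s[1] not in test_authors]
    test = [s for s in samples if s[1] in test_authors]
    return train, test


samples = [(f"bot text {i}", f"bot_{i % 4}", "bot") for i in range(12)]
samples += [(f"human text {i}", f"person_{i % 4}", "human") for i in range(12)]
train, test = split_by_author(samples)
```

Splitting at the author level means a classifier is scored only on bots and people it has never seen, which is the stricter setting the abstract describes.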


Semantic and sentiment trajectories of literary masterpieces

2023 · ARTICLE · en

The paper deals with semantic and sentiment trajectories of literary masterpieces (we used corpora of 12 languages of various language families), composed of individual embeddings or n-grams. We ascertain that, for all languages, semantic and sentiment trajectories are markedly chaotic: positive largest Lyapunov exponents; ‘entropy-complexity’ pairs belonging to the ‘chaotic’ area of the respective plane; the distinctive ‘chaotic’ drop of the number of false nearest neighbours at a particular value of an embedding dimension. The Russian language turns out to be more ‘chaotic’ than, for example, the English one; we attribute this fact to the free order of words. The Esperanto language, for various ‘approaches’ to different Indo-European languages. The results do not corroborate its claim to be equidistant from all languages. However, it seems to be equidistant from all Indo-European languages. These characteristics are utilised in order to develop a method to compare styles of an original masterpiece and its translations (to automatically assess translation quality). It appears that machine translations are still worse than human ones; however, the Facebook translation, for example, is comparable with them.


Spot the Bot: Distinguishing Human-Written and Bot-Generated Texts Using Clustering and Information Theory Techniques

2023 · CHAPTER · en


Courses (0)

No courses.