Olga Nikolaevna Lyashevskaya (Ляшевская Ольга Николаевна)

Faculty of Humanities

Profile on hse.ru | tel.: 22724 | +7 (906) 798-60-21

Professional interests

Russian language, lexicography, computational linguistics, semantics, cognitive linguistics, corpus linguistics, semantics of grammar, 16.00.00 Linguistics

Positions

Professor: Faculty of Humanities, School of Linguistics

Bio

· Joined HSE University (НИУ ВШЭ) in 2011.
· Academic and teaching experience: 28 years.

Education

1999 · Candidate of Philological Sciences: All-Russian Institute for Scientific and Technical Information of the Russian Academy of Sciences (VINITI RAN), specialty 05.13.17 "Theoretical Foundations of Computer Science", dissertation topic: Non-standard number behavior of Russian nouns
1998 · Postgraduate studies: All-Russian Institute for Scientific and Technical Information of the Russian Academy of Sciences (VINITI RAN), specialty "Theoretical Foundations of Computer Science"
1995 · Specialist degree: Russian State University for the Humanities (RGGU), Faculty of Theoretical and Applied Linguistics, specialty "Linguistics", qualification "Linguist, specialist in theoretical and applied linguistics"

Work experience

· Since 2011: Senior Research Fellow, Department of Corpus Linguistics and Linguistic Poetics, Vinogradov Russian Language Institute of the Russian Academy of Sciences (IRYa RAN), Moscow (part-time)
· 2011–2012: Manager of the ontology group, linguistics department, Yandex LLC
· 2010–2011: førsteamanuensis (Associate Professor)
· 2008–2010: post-doc, Institute of Linguistics, University of Tromsø, Norway
· 2008–2011: doctoral candidate, Vinogradov Russian Language Institute of the Russian Academy of Sciences (IRYa RAN), Moscow
· 2002–2008: Senior Research Fellow, Department of Linguistic Research
· 2000–2002: Senior Research Fellow, Department of Theoretical and Applied Problems of Informatics, All-Russian Institute for Scientific and Technical Information (VINITI RAN), Moscow
· 1997–2001: teacher of Russian as a foreign language
· 1996–1998: teaching technician, Faculty of Philology, Lomonosov Moscow State University
· 1995–1996: lead specialist of the dean's office, Faculty of Theoretical and Applied Linguistics, RGGU

Awards and honors

· Letter of Gratitude from the Vice Rector of HSE University (July 2025)
· Certificate of Honor from the Faculty of Humanities, HSE University (November 2024)
· Certificate of Honor from the Ministry of Science and Higher Education of the Russian Federation (November 2022)
· Letter of Gratitude from the Vice Rector of HSE University (November 2021)
· Letter of Gratitude from the Higher School of Economics (January 2017)
· Bonus for academic achievements and contribution to the reputation of HSE University (2017–2019)
· Bonus for academic work (2016–2017)
· Bonus for a publication in a List B journal (2025–2026, 2024–2025)
· Bonus for a publication in a List A journal (or an equivalent scholarly publication) (2023–2024)
· Bonus for a publication in an international peer-reviewed scholarly publication (2022–2023, 2021–2022, 2019–2020)
· Bonus for an article in a foreign peer-reviewed journal (2014–2016, 2012–2014)
· Best Teacher: 2019, 2017, 2013

Grants and projects

· Research and Study Group "Materials for a Frequency Dictionary of Russian Poetry" (HSE University Academic Fund, 2018, principal investigator)
2020 · DiAsPol250 «The Development of the Polish Aspect System in the Last 250 Years against the Background of Neighbouring Languages», Beethoven II – Polish-German Funding Initiative (DFG/NCN), 2018-2020, cooperation partner
· TWIRLL: Targeting Wordforms in Russian Language Learning, international academic cooperation grant from the Norwegian research fund SIU, with the University of Tromsø (CPRU-2017/10027)
2020 · DigiPalSlav: Digital Paleoslavistics, Alexander von Humboldt-Stiftung, Programm zur Förderung von Institutspartnerschaften Abteilung Förderung und Netzwerk, 2018-2020, cooperation partner
2017 · Research and Study Group "REALEC for Truly Necessary Words" (HSE University Academic Fund, 2016-2017, principal investigator)
2018 · Development of Russian National Corpus (RNC) modules for automatic annotation and dictionary support of Middle Russian and Church Slavonic texts (Russian Foundation for the Humanities (RGNF), grant no. 17-04-12064, 2017-2018, team member)
2016 · Evaluation standards for methods of automatic information extraction from texts (Russian Foundation for Basic Research (RFBR), grant no. 15-07-09306, 2014-2016, principal investigator)
2016 · Development of the Historical modules of the RNC (RGNF, grant no. 15-04-12050, 2015-2016, team member)
2015 · A quantitative corpus study of the grammatical category of number (HSE University Academic Fund, individual project, 2014-2015)
2014 · Syntactic annotation of a corpus with resolved lexico-grammatical homonymy (Program of Fundamental Research of the Presidium of the Russian Academy of Sciences "Corpus Linguistics", 2012-2014)
2014 · FrameBank: annotation of semantic roles and morphosyntactic marking of frame participants (based on the RNC) (Program of Fundamental Research of the Presidium of the Russian Academy of Sciences "Corpus Linguistics", 2012-2014)
2013 · Frequency dictionary of Russian grammar and lexical co-occurrence (HSE University Academic Fund, individual project, 2012-2013)
· Word-formation annotation of the RNC (Program of Fundamental Research of the Presidium of the Russian Academy of Sciences "Corpus Linguistics", 2011)
· FrameBank (Program of Fundamental Research of the Presidium of the Russian Academy of Sciences "Corpus Linguistics", 2011)
2012 · From corpus to dictionary: automatic methods for identifying and cataloguing Russian constructions (RFBR, grant no. 10-06-00586а, jointly with O. A. Mitrofanova, 2010-2012)
2012 · "Exploring Emptiness: Russian Verbal Morphology and Cognitive Linguistics" (Norsk forskningsråd / Research Council of Norway, project grant of Laura Janda and Tore Nesset, 2008-2012)
2009 · Topological types of Russian object nouns (RGNF, grant no. 07-04-00240а, 2007-2009)

Conferences (30)

· 2025: 10th Workshop on Slavic Natural Language Processing (Slavic NLP 2025), 31.07.2025, Vienna, Austria. Talk: Rubic2: Ensemble Model for Russian Lemmatization
· 2025: Third Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2025) (Tallinn). Talk: The application of corpus-based language distance measurement to the diatopic variation study (on the material of the Old Novgorodian birchbark letters)
· 2024: Russian Language in a Multilingual World (Moscow). Talk: The Russian Constructicon as a research and teaching resource
· 2022: 6th Kolmogorov Seminar on Computational Linguistics and Language Sciences (Moscow). Talk: On the task of developing a version of the RNC corpora with resolved ambiguity in morphological and syntactic annotation
· 2022: 46th IITP RAS school-conference "Information Technologies and Systems" (ITiS-2022) (Ognikovo, Moscow Region). Talk: Experience of applying transformer models to the lemmatization of modern and historical Russian texts
· 2022: International Conference on Historical Lexicography and Lexicology (ICHLL 2022) (Lorient). Talk: Automatic collection of parallel thesauri in dictionary/corpus joint system
· 2022: 25th International Conference on Text, Speech, and Dialogue (TSD 2022) (Brno). Talk: Review of Practices of Collecting and Annotating Texts in the Learner Corpus REALEC
· 2022: 13th Conference on Language Resources and Evaluation (LREC 2022) (Marseille). Talk: Constructing a Lexical Resource of Russian Derivational Morphology
· 2022: Gasparov Readings 2022 (Moscow). Talk: "Within the walls of seething cities": on the semantic boundaries of the epithet in light of corpus data
· 2021: 27th International Conference on Computational Linguistics and Intellectual Technologies "Dialogue-2021" (Moscow). Talk: Adjunct role labeling for Russian
· 2021: XIX EURALEX Congress (Alexandroupolis). Talk: Revised entries in the multi-volume edition and TEI encoding: a case of the historical dictionary of Russian
· 2021: 11th International Conference on Historical Lexicography and Lexicology (ICHLL 11) (Logroño, La Rioja). Talk: Example, usage variant, and linking between dictionary and corpus data
· 2021: 11th International Conference on Historical Lexicography and Lexicology (ICHLL 11) (Logroño, La Rioja). Talk: Lemmatization in corpus-to-dictionary systems: The case study for Old Church Slavonic
· 2021: 18th International Conference on Distributed Computing and Artificial Intelligence (DCAI) (Salamanca). Talk: Automated Metaphor Identification in Russian and its Implications for Metaphor Studies
· 2021: 11th International Conference SLOVKO 2021: NLP, Corpus Linguistics and Interdisciplinarity (Bratislava). Talk: An HMM-based PoS Tagger for Old Church Slavonic
· 2021: SCLC-2020/2021: The Slavic Cognitive Linguistics Conference (June 3-6, 2021) (Tromsø). Talk: On syntactic structures in the Russian Constructicon entries and beyond
· 2021: El’Manuscript 2021. Textual heritage and information technologies (Freiburg). Talk: Lemmatization of the Middle Russian Corpus within the RNC: Choice of Solutions
· 2021: Slavic aspect and (diachronic) corpora. International workshop (Mainz). Talk: Profiling the behavior of verbs in the Middle Russian Corpus
· 2021: The 10th International Conference on Analysis of Images, Social Networks and Texts (Tbilisi). Talk: Sculpting enhanced dependencies for Belarusian
· 2020: 26th International Conference on Computational Linguistics and Intellectual Technologies (Moscow). Talk: Russian Constructicon: a new linguistic resource, its design and specifics
· 2020: 26th International Conference on Computational Linguistics and Intellectual Technologies (Moscow). Talk: GRAMEVAL 2020 Shared Task: Russian Full Morphology and Universal Dependencies Parsing
· 2019: Digital Transformations & Global Society 2019 (DTGS’2019) (Saint Petersburg). Talk: A cross-genre morphological tagging and lemmatization of the Russian poetry: distinctive test sets and evaluation
· 2019: Dialogue (25th International Conference on Computational Linguistics and Intellectual Technologies) (Moscow). Talk: A Simple Fingerprint Approach to Extracting the Global Prosodic Properties from Field Data
· 2019: Historical Corpora and Variation (Cagliari). Talk: Spelling variation and word clusters in the Middle Russian Corpus
· 2019: Quantitative Approaches to Versification (Prague). Talk: Lexical Diversity and Colour Hues in Russian Poetry: A Corpus-Based Study of Adjectives
· 2019: Inter-campus Digital Humanities conference "DH Meet-Up HSE" (Moscow). Talk: Data of the RNC Poetic Corpus as an object of digital culture
· 2019: Towards a multilingual constructicon: issues, approaches, perspectives (Düsseldorf). Talk: Russian Constructicon: clusters, families, and usage scenarios

Researcher identifiers

ORCID: 0000-0001-8374-423X
ResearcherID: E-8855-2014
SPIN (Russian Science Citation Index, РИНЦ): 6340-5599
Google Scholar: https://scholar.google.ru/citations?user=5XzprO8AAAAJ&hl=ru
Scopus AuthorID: 37090988800

Publications (116)

Russian Constructicon 2.0: New Features and New Perspectives of the Biggest Constructicon Ever Built

2023 · CHAPTER · en

Disambiguation in context in the Russian National Corpus: 20 years later

2023 · CHAPTER · en

An updated annotation of the Main, Media, and some other corpora of the Russian National Corpus (RNC) features the part-of-speech and other morphological information, lemmas, dependency structures, and constituency types. Transformer-based architectures are used to resolve the homonymy in context according to a schema based on the manually disambiguated subcorpus of the Main corpus (morphology and lexicon) and UD-SynTagRus (syntax). The paper discusses the challenges in applying the models to texts of different registers, orthographies, and time periods, on the one hand, and making the new version convenient for users accustomed to the old search practices, on the other. The re-annotated corpus data form the basis for the enhancement of the RNC tools such as word and n-gram frequency lists, collocations, corpus comparison, and Word at a glance.

From web to dialects: how to enhance non-standard Russian lects lemmatisation?

2023 · CHAPTER · en

The growing need for using small data distinguished by a set of distributional properties becomes all the more apparent in the era of large language models (LLM). In this paper, we show that for the lemmatisation of the web as corpora texts, heterogeneous social media texts, and dialect texts, the morphological tagging by a model trained on a small dataset with specific properties generally works better than the morphological tagging by a model trained on a large dataset. The material we use is Russian non-standard texts and interviews with dialect speakers. The sequence-to-sequence lemmatisation with the help of taggers trained on smaller linguistically aware datasets achieves the average results of 85 to 90 per cent. These results are generally (but not always) 1-2 per cent higher than the results of lemmatisation with the help of the large-dataset-trained taggers. We analyse these results and outline the possible further research directions.

The Effect of (Historical) Language Variation on the East Slavic Lects Lemmatisers Performance

2023 · ARTICLE · en

The need to develop tools for historical and regional variations is becoming more urgent in natural language processing. In this paper, we present two candidate systems for lemmatising historical East Slavic lects (Late Old East Slavic and Middle Russian), as well as modern regional East Slavic lects (Belogornoje and Megra): BERT-based end-to-end pipeline with language-specific heuristics and sequence-to-sequence BART-based encoder-decoder. To evaluate their predictions, we use accuracy score and string similarity measures, such as Levenshtein distance. The BERT-based model is more suitable for the regional data, achieving 85% accuracy score, and only 74% on the historical data. BART-based model climbs up to 92.6% accuracy score on the historical data, yet gets only 80% on the regional data. We provide an error analysis and discuss ways to enhance models, such as dictionary lookup and spellchecker.

Automated Metaphor Identification in Russian and Its Implications for Metaphor Studies

2022 · CHAPTER · en

Sculpting enhanced dependencies for Belarusian

2022 · CHAPTER · en

Enhanced Universal Dependencies (EUD) are enhanced graphs expressed on top of basic dependency trees. EUD support representation of deeper syntactic relations in constructions such as coordination, gapping, relative clauses, and argument sharing through control and raising. The paper presents experiments on the EUD parsing of the low-resource Belarusian language, for which no corpora with enhanced annotations were available. Models trained on the Universal Dependencies treebanks of two closely related Slavic languages, Russian and Ukrainian, were used to parse sentences translated from Belarusian. After that, EUD were projected to the original sentences, which gave us ELAS (Enhanced Labeled Attachment Score) 78.1% for both Russian and Ukrainian in evaluation. We also trained a model of one of the IWPT 2020 Shared Task participants on the obtained annotations in Belarusian and achieved ELAS 83.4%. The analysis shows that the most common mistakes of cross-lingual parsing are rooted in different theoretical perspectives and practice approaches to the annotation of particular types of clauses in the three Slavic treebanks. Russian and Ukrainian EUD transfer models tend to make mistakes when dealing with the predicate argument relations, which are hard to identify without understanding the semantics of the sentence. The alignment method decreases the quality of the annotation by confusing tokens that occur in a sentence more than once.

Accuracy, syntactic complexity and task type at play in examination writing: A corpus-based study

2022 · CHAPTER · en

This chapter explores the association between syntactic complexity and syntactic accuracy in essays written by Russian learners of English in reply to two examination task types: a description of graphical material (Task 1) and an opinion essay (Task 2). A Poisson regression model served to predict the number of syntactic errors. Two syntactic complexity parameters were statistically significant in predicting syntactic accuracy in both tasks: the numbers of sentences and adverbial clauses. Three more parameters predicted the accuracy in Task 1 only: maximum depth of syntactic trees, and the numbers of adjective + noun and noun + infinitive constructions. Six parameters were related to syntactic accuracy in Task 2: the numbers of all clauses, of tokens and of T-units; the average length of sentence; and the numbers of coordinated and of participle + noun constructions.

Word-formation complexity: a learner corpus-based study

2022 · ARTICLE · en

The article examines the word-formation complexity of learner texts, understood as a system of measures that reflect the variety of word-formation devices at different levels, from simple to advanced, used by the learner. It analyzes the relationship between complexity and the errors learners make in word formation. The study is based on data from REALEC, a corpus of English examination essays written by university students whose native language is Russian. An approach to measuring word-formation complexity is proposed, based on Bauer and Nation's classification of suffixes (Bauer & Nation 1993), and the correspondence between the complexity index values and the number of word-formation errors annotated in the corpus texts is analyzed, taking the type of examination task into account. The hypothesis is put forward that the number of errors should decrease as complexity increases, and a statistical analysis of the complexity and accuracy parameters is carried out. The paper shows, first, that the use of word-formation suffixes of higher complexity is associated with the number of errors in the texts. Second, different levels of the complexity hierarchy affect accuracy in different directions: in particular, the use of irregular word-formation patterns is positively associated with the number of errors. Third, the type of examination task should be taken into account, including the expected formal and register features of the text. The hypothesis was confirmed for regular but infrequent suffixal patterns when used in descriptions of figures and graphs, that is, texts that follow a fixed format and include elements of academic writing. For argumentative essays, however, the hypothesis requires refinement.

Review of Practices of Collecting and Annotating Texts in the Learner Corpus REALEC

2022 · CHAPTER · en

REALEC, a learner corpus released in open access, had received 6,054 essays written in English by HSE undergraduate students in their English university-level examination by the year 2020. This paper reports on the data collection and manual annotation approaches for the texts of 2014–2019 and discusses the computer tools available for working with the corpus. This provides the basis for the ongoing development of automated annotation for the new portions of learner texts in the corpus. The observations in the first part were made on the reliability of the total of 134,608 error tags manually annotated across the texts in the corpus. Some examples are given in the paper to emphasize the role of the interference with learners’ L1 (Russian), one more direction of the future corpus research. A number of studies carried out by the research team working on the basis of the REALEC data are listed as examples of the research potential that the corpus has been providing.

Constructing a Lexical Resource of Russian Derivational Morphology

2022 · CHAPTER · en

Words of any language are to some extent related through the ways they are formed. For instance, the verb ‘exempl-ify’ and the noun ‘example-s’ are both based on the word ‘example’, but the verb is derived from it, while the noun is inflected. In Natural Language Processing of Russian, the inflection is satisfactorily processed; however, there are only a few machine-trackable resources that capture derivations even though Russian has both of these morphological processes very rich. Therefore, we devote this paper to improving one of the methods of constructing such resources and to the application of the method to a Russian lexicon, which results in the creation of the largest lexical resource of Russian derivational relations. The resulting database dubbed DeriNet.RU includes more than 300 thousand lexemes connected with more than 164 thousand binary derivational relations. To create such data, we combined the existing machine-learning methods that we improved to manage this goal. The whole approach is evaluated on our newly created data set of manual, parallel annotation. The resulting DeriNet.RU is freely available under an open license agreement.

Courses (12)

Computer Tools for Linguistic Research · 5 times

2025/2026, 2024/2025, 2023/2024, 2022/2023, 2021/2022 · Nizhny Novgorod · English

Research Seminar "Analysis and Visualization of Text Data" · 3 times

2025/2026, 2024/2025, 2023/2024 · Master's · Russian

Research Seminar "Interpretation of Linguistic Phenomena in Large Language Models"

2025/2026 · Bachelor's · Russian

Fundamentals of Corpus Research

2025/2026 · Master's / Mago-Lego · English

Programming and Linguistic Data · 5 times

2025/2026, 2024/2025, 2023/2024, 2022/2023, 2021/2022 · Bachelor's · Russian

Theoretical and Applied Lexicography · 4 times

2025/2026, 2023/2024, 2022/2023, 2021/2022 · Bachelor's · Russian

Corpus Linguistics · 3 times

2024/2025, 2023/2024, 2022/2023 · Master's / Mago-Lego · Russian

Master Classes

2024/2025 · Master's · Russian

Research Seminar "Neural Network Modeling of Long Language Units"

2024/2025 · Bachelor's · Russian

Advanced Topics in Corpus Linguistics

2023/2024 · Master's / Mago-Lego · Russian

Analysis and Visualization of Text Data

2022/2023 · Master's · Russian

Research Seminar "Corpus Linguistics and Foreign Language Learning"

2022/2023 · Nizhny Novgorod · Russian