Браславский Павел Исаакович

Faculty of Computer Science

Profile on hse.ru · tel.: 27276

Positions

Senior Research Fellow: Faculty of Computer Science, Laboratory for Models and Methods of Computational Pragmatics
Associate Professor: Faculty of Computer Science, Department of Big Data and Information Retrieval

Bio

· Joined HSE University in 2020.

Education

2000 · Candidate of Sciences
1997 · Specialist degree: Ural State Technical University, Yekaterinburg; specialty "Computers, Computer Complexes, Systems, and Networks"; qualification "Systems Engineer"

Work experience

· Yandex, SKB Kontur, JetBrains Research

Awards and incentives

· Bonus for a publication in a List A journal (or an equivalent scholarly publication) (2025–2026, 2023–2024)
· Bonus for a publication in an international peer-reviewed scholarly publication (2022–2023, 2021–2022, 2020–2021)

Grants and projects

· for the degree of Candidate of Sciences

Researcher identifiers

ORCID: 0000-0002-6964-458X
ResearcherID: P-5139-2016
SPIN (RSCI): 7958-6601
Google Scholar: https://scholar.google.com/citations?user=ch2vRdcAAAAJ&hl=en
Scopus AuthorID: 16548847400

Publications (50)

CausalQA: A Benchmark for Causal Question Answering

2023 · CHAPTER · en

You Told Me That Joke Twice: A Systematic Investigation of Transferability and Robustness of Humor Detection Models

2023 · CHAPTER · en

In this study, we focus on automatic humor detection, a highly relevant task for conversational AI. To date, there are several English datasets for this task, but little research on how models trained on them generalize and behave in the wild. To fill this gap, we carefully analyze existing datasets, train RoBERTa-based and Naïve Bayes classifiers on each of them, and test on the rest. Training and testing on the same dataset yields good results, but the transferability of the models varies widely. Models trained on datasets with jokes from different sources show better transferability, while the amount of training data has a smaller impact. The behavior of the models on out-of-domain data is unstable, suggesting that some of the models overfit, while others learn non-specific humor characteristics. An adversarial attack shows that models trained on pun datasets are less robust. We also evaluate the sense of humor of the ChatGPT and Flan-UL2 models in a zero-shot scenario. The LLMs demonstrate competitive results on humor datasets and a more stable behavior on out-of-domain data. We believe that the obtained results will facilitate the development of new datasets and evaluation methodologies in the field of computational humor. We’ve made all the data from the study and the trained models publicly available at https://github.com/Humor-Research/Humor-detection.

Computing the Similarity of Javadoc Comments (original title: Вычисление схожести комментариев Javadoc)

2023 · ARTICLE · ru

Comments in source code are an important part of software documentation. Many software projects suffer from low-quality comments, which are often created by copying and contain numerous errors and inaccuracies. For similar methods, classes, and the like, copying comments with small changes is justified, but even then developers make mistakes. In this study, we address the problem of detecting similar source code comments, which makes it possible to improve the quality of code comments. For the task of determining the similarity of Javadoc comments, we evaluated traditional string similarity algorithms and modern machine learning methods. In our experiment, we use a collection of Javadoc comments from four industrial open-source Java projects. We found that LCS (Longest Common Subsequence) is the best algorithm for our task, considering both quality (precision 94%, recall 74%) and performance.

Towards Understanding and Answering Comparative Questions

2022 · CHAPTER · en

In this paper, we analyze comparative questions and answers. At least 3% of the questions submitted to search engines are comparative, ranging from simple facts like "Did Messi or Ronaldo score more goals in 2021?" to life-changing and probably highly subjective questions like "Is it better to move abroad or stay?". Ideally, answers to subjective comparative questions would reflect diverse opinions so that the asker can come to a well-informed decision. To better understand the information needs behind comparative questions, we develop approaches to extract the mentioned comparison objects and aspects. As a first step to answer comparative questions, we develop an approach that detects the stances of potential result nuggets (i.e., text passages containing the comparison objects). Our approaches are trained and evaluated on a set of 31,000 English questions from existing datasets that we label as comparative or not. In the 3,500 comparative questions, we label the comparison objects, aspects, and predicates. For 950 questions, we collect answers from online forums and label the stance towards the comparison objects. In the experiments, our approaches recall 71% of the comparative questions with a perfect precision of 1.0, recall 92% of subjective comparative questions with a precision of 0.98, and identify the comparison objects and aspects with an F1 of 0.93 and 0.80, respectively. The stance detector fine-tuned on pairs of objects and answers achieves an accuracy of 0.63.

Identifying Argumentative Questions in Web Search Logs

2022 · CHAPTER · en

NamedEntityRangers at SemEval-2022 Task 11: Transformer-based Approaches for Multilingual Complex Named Entity Recognition

2022 · CHAPTER · en

This paper presents the two submissions of NamedEntityRangers Team to the MultiCoNER Shared Task, hosted at SemEval-2022. We evaluate two state-of-the-art approaches, both of which use pre-trained multilingual language models, but in different ways. The first approach follows the token classification schema, in which each token is assigned a tag. The second approach follows a recent template-free paradigm, in which an encoder-decoder model translates the input sequence of words to a special output, encoding named entities with predefined labels. We utilize RemBERT and mT5 as backbone models for these two approaches, respectively. Our results show that the oldie but goodie token classification outperforms the template-free method by a wide margin. Our code is available at: https://github.com/Abiks/MultiCoNER.

Jokingbird: Funny Headline Generation for News

2022 · CHAPTER · en

In this study, we address the problem of generating funny headlines for news articles. Funny headlines are beneficial even for serious news stories – they attract and entertain the reader. Automatically generated funny headlines can serve as prompts for news editors. More generally, humor generation can be applied to other domains, e.g. conversational systems. Like previous approaches, our methods are based on lexical substitutions. We consider two techniques for generating substitute words: one based on BERT and another based on collocation strength and semantic distance. At the final stage, a humor classifier chooses the funniest variant from the generated pool. An in-house evaluation of 200 generated headlines showed that the BERT-based model produces the funniest and in most cases grammatically correct output.

Entity Linking over Nested Named Entities for Russian

2022 · CHAPTER · en

In this paper, we describe entity linking annotation over nested named entities in the recently released Russian NEREL dataset for information extraction. The NEREL collection is currently the largest Russian dataset annotated with entities and relations. It includes 933 news texts with annotation of 29 entity types and 49 relation types. The paper describes the main design principles behind NEREL’s entity linking annotation, provides its statistics, and reports evaluation results for several entity linking baselines. To date, 38,152 entity mentions in 933 documents are linked to Wikidata. The NEREL dataset is publicly available.

Text Simplification for Scientific Information Access

2021 · CHAPTER · en

RuBQ 2.0: An Innovated Russian Question Answering Dataset

2021 · CHAPTER · en

The paper describes the second version of RuBQ, a Russian dataset for knowledge base question answering (KBQA) over Wikidata. Whereas the first version builds on Q&A pairs harvested online, the extension is based on questions obtained through search engine query suggestion services. The questions underwent crowdsourced and in-house annotation in a quite different fashion compared to the first edition. The dataset doubled in size: RuBQ 2.0 contains 2,910 questions along with the answers and SPARQL queries. The dataset also incorporates answer-bearing paragraphs from Wikipedia for the majority of questions. The dataset is suitable for the evaluation of KBQA, machine reading comprehension (MRC), hybrid question answering, as well as semantic parsing. We provide the analysis of the dataset and report several KBQA and MRC baseline results. The dataset is freely available under the CC-BY-4.0 license.

Courses (2)

09.06.01. Informatics and Computer Engineering

2022/2023 · Doctoral programme · English

Research Problems in Natural Language Processing

2021/2022 · Doctoral programme · English