Трофимова Екатерина Алексеевна

Faculty of Humanities

Profile on hse.ru · Tel.: 27261

Professional interests

machine learning, data analysis in physics, high-energy physics, astrophysics

Positions

Visiting Lecturer: Faculty of Humanities, School of Linguistics
Visiting Lecturer: Faculty of Computer Science, Department of Big Data and Information Retrieval

Bio

· Started working at HSE University in 2026.

Education

2025 · Candidate of Sciences: National Research University Higher School of Economics
2020 · Master's degree: National Research University Higher School of Economics, field of study "Applied Mathematics and Informatics", qualification "Master"
2015 · Bachelor's degree: Lomonosov Moscow State University, field of study "Economics", qualification "Bachelor"

Work experience

· March 2016 to January 2017: National Settlement Depository, Moscow Exchange Group. Securities Department. Junior Specialist
· May 2017 to January 2019: BNP Paribas Cardif Russia. Junior Analyst
· October 2019 to February 2020: Sberbank Corporate University. Seminar Instructor

Awards and commendations

· Letter of gratitude from the Senior Director for Research and Development, HSE University (August 2024)

Grants and projects

2026 · In April 2026, the International Conference on Intelligent Systems and Artificial Intelligence Applications (ISAA 2026) was held at the University of Nizwa (Sultanate of Oman). HSE University co-organized the event together with the University of Nizwa and the University of Technology and Applied Sciences, Ibri. HSE University researchers were also among the keynote speakers at the conference.

Conferences (2)

· 2020: International Conference on Computer Simulation in Physics and beyond (Moscow). Talk: Galaxies Clusters Reconstruction with Variational Autoencoders
· 2020: International Conference on Computer Simulation in Physics and beyond (Moscow). Talk: Fast simulation of the LHCb electromagnetic calorimeter response using VAEs and GANs

Researcher identifiers

ORCID: 0000-0001-5436-8511
ResearcherID: ABG-3251-2020
SPIN (RSCI): 3612-0507
Google Scholar: https://scholar.google.com/citations?hl=ru&user=eu6bEGMAAAAJ

Publications (7)

High-accuracy eosinophil detection in eosinophilic esophagitis histological images using machine learning model YOLO11

2025 · ARTICLE · ru

Objective. To evaluate the effectiveness of the YOLO11 machine learning (ML) model with a transformer architecture (and of its fine-tuned version) for automated segmentation and detection of eosinophils in histological images with varying quality of fixation, tissue staining, and sectioning under routine clinical practice conditions, in order to improve the diagnostic accuracy of eosinophilic esophagitis (EoE). Material and methods. A multicenter retrospective analysis was performed on histological images obtained by whole slide imaging (WSI) from 60 patients with EoE. Of 653 tissue section images, 54 were manually analyzed and annotated. The prepared dataset was used to train the YOLO11 model. Results. By epoch 150 of training, precision and recall for bounding boxes had improved steadily, and recall ultimately reached 0.98, indicating very high sensitivity of the model to the target regions of the image (eosinophils). The Jaccard similarity coefficient (IoU) reached 0.94, indicating that the model is highly effective at precisely localizing and segmenting eosinophils in histological images. The model also performs well in terms of precision and recall for both bounding boxes and masks, which complements its segmentation capabilities and provides clear and accurate delineation of eosinophil boundaries. When evaluating the model on new unlabeled data, we identified a number of limitations that require further refinement. In particular, performance was reduced on clusters of eosinophils. Conclusion. The solution we developed on the basis of the YOLO11 model is a step forward in automating the histological assessment of eosinophilic esophagitis, offering a high-precision tool for analyzing eosinophilic infiltration.
A promising direction for further research is the annotation of an additional set of WSI images, fine-tuning of the model, and experiments with models based on the vision transformer concept. The results of segmentation and of peak eosinophil count determination are to be compared retrospectively with the peak eosinophil count determined by pathologists using light microscopy.
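
For readers unfamiliar with the metrics quoted in this abstract, the following is a minimal sketch of how the Jaccard index (IoU) of two bounding boxes and the precision and recall of a detector are commonly computed. The `(x1, y1, x2, y2)` box format, the function names, and the sample counts are assumptions made for illustration; they are not taken from the paper and do not reproduce its evaluation code.

```python
def iou(box_a, box_b):
    """Jaccard index (IoU) of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle (may be empty).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical boxes and counts, chosen only to exercise the functions.
    print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
    print(precision_recall(tp=98, fp=5, fn=2))
```

In this sketch a recall of 0.98 would correspond, for example, to 98 detected eosinophils out of 100 annotated ones; the actual counts behind the reported figure are not given in the abstract.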

Linguacodus: A synergistic framework for transformative code generation in machine learning pipelines

2024 · ARTICLE · en

In the ever-evolving landscape of machine learning, seamless translation of natural language descriptions into executable code remains a formidable challenge. This paper introduces Linguacodus, an innovative framework designed to tackle this challenge by deploying a dynamic pipeline that iteratively transforms natural language task descriptions into code through high-level data-shaping instructions. The core of Linguacodus is a fine-tuned large language model, empowered to evaluate diverse solutions for various problems and select the most fitting one for a given task. This paper details the fine-tuning process and sheds light on how natural language descriptions can be translated into functional code. Linguacodus represents a substantial leap towards automated code generation, effectively bridging the gap between task descriptions and executable code. It holds great promise for advancing machine learning applications across diverse domains. Additionally, we propose an algorithm capable of transforming a natural description of an ML task into code with minimal human interaction. In extensive experiments on a vast machine learning code dataset originating from Kaggle, we showcase the effectiveness of Linguacodus. The investigations highlight its potential applications across diverse domains, emphasizing its impact on applied machine learning in various scientific fields.

Code4ML: a large-scale dataset of annotated Machine Learning code

2023 · ARTICLE · en

The use of program code as a data source is increasingly expanding among data scientists. The purpose of the usage varies from the semantic classification of code to the automatic generation of programs. However, the machine learning model application is somewhat limited without annotating the code snippets. To address the lack of annotated datasets, we present the Code4ML corpus. It contains code snippets, task summaries, competitions, and dataset descriptions publicly available from Kaggle—the leading platform for hosting data science competitions. The corpus consists of ~2.5 million snippets of ML code collected from ~100 thousand Jupyter notebooks. A representative fraction of the snippets is annotated by human assessors through a user-friendly interface specially designed for that purpose. Code4ML dataset can help address a number of software engineering or data science challenges through a data-driven approach. For example, it can be helpful for semantic code classification, code auto-completion, and code generation for an ML task specified in natural language.

Machine learning code snippets semantic classification

2023 · ARTICLE · en

Program code has recently become a valuable active data source for training various data science models, from code classification to controlled code synthesis. Annotating code snippets plays an essential role in such tasks. This article presents a novel approach that leverages CodeBERT, a powerful transformer-based model, to classify code snippets extracted from Code4ML automatically. Code4ML is a comprehensive machine learning code corpus compiled from Kaggle, a renowned data science competition platform. The corpus includes code snippets and information about the respective kernels and competitions, but it is limited in the quality of the tagged data, which is ~0.2%. Our method addresses the lack of labeled snippets for supervised model training by exploiting the internal ambiguity in particular labeled snippets where multiple class labels are combined. Using a specially designed algorithm, we effectively separate these ambiguous fragments, thereby expanding the pool of training data. This data augmentation approach greatly increases the amount of labeled data and improves the overall quality of the trained models. The experimental results demonstrate the prowess of the proposed code classifier, achieving an impressive F1 test score of ~89%. This achievement not only enhances the practicality of CodeBERT for classifying code snippets but also highlights the importance of enriching large-scale annotated machine learning code datasets such as Code4ML. With a significant increase in accurately annotated code snippets, Code4ML is becoming an even more valuable resource for learning and improving various data processing models.

Fast simulation of the LHCb electromagnetic calorimeter response using VAEs and GANs

2021 · ARTICLE · en

Modern experiments in high-energy physics require an increasing amount of simulated data. Monte-Carlo simulation of calorimeter responses is by far the most computationally expensive part of such simulations. Recent works have shown that the application of generative neural networks to this task can significantly speed up the simulations while maintaining an appropriate degree of accuracy. This paper explores different approaches to designing and training generative neural networks for simulation of the electromagnetic calorimeter response in the LHCb experiment.

Galaxy Clusters Reconstruction

2021 · ARTICLE · en

In the present work, we introduce a machine learning-based approach for galaxy clustering. Clusters must be determined in order to estimate the masses of galaxy groups. Knowledge of the mass distribution is crucial in dark matter research and in the study of the large-scale structure of the Universe. State-of-the-art telescopes allow data to be accumulated across various spectroscopy ranges, which highlights the need for algorithms with a substantial generalization property. The data we deal with is a combination of more than twenty different catalogues, and all of the combined galaxies must be clustered. We perform a regression on the redshifts with a coefficient of determination R^2 of 0.99992 on the validation dataset, with a training dataset of 3,154,894 galaxies (0.0016 z

Segmentation of EM showers for neutrino experiments with deep graph neural networks

2021 · ARTICLE · en

We introduce a first-ever algorithm for the reconstruction of multiple showers from the data collected with electromagnetic (EM) sampling calorimeters. Such detectors are widely used in High Energy Physics to measure the energy and kinematics of in-going particles. In this work, we consider the case when many electrons pass through an Emulsion Cloud Chamber (ECC) brick, initiating electron-induced electromagnetic showers, which can be the case with long exposure times or large input particle flux. For example, the SHiP experiment is planning to use emulsion detectors for dark matter search and neutrino physics investigation. The expected full flux of the SHiP experiment is about 10^20 particles over five years. To reduce the cost of the experiment associated with the replacement of the ECC brick and off-line data taking (emulsion scanning), it is decided to increase exposure time. Thus, we expect to observe a lot of overlapping showers, which turns EM shower reconstruction into a challenging point cloud segmentation problem. Our reconstruction pipeline consists of a Graph Neural Network that predicts an adjacency matrix and a clustering algorithm. We propose a new layer type (EmulsionConv) that takes into account geometrical properties of shower development in the ECC brick. For the clustering of overlapping showers, we use a modified hierarchical density-based clustering algorithm. Our method does not use any prior information about the incoming particles and identifies up to 87% of electromagnetic showers in emulsion detectors. The achieved energy resolution over 16,577 showers is σ_E/E = (0.095 ± 0.005) + (0.134 ± 0.011)/√E. The main test bench for the algorithm for reconstructing electromagnetic showers is going to be SND@LHC.
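
The energy resolution formula quoted in this abstract can be evaluated numerically with a short sketch such as the one below. The function names are hypothetical, the energy unit is whatever the paper uses (the abstract does not state it), and the uncertainty estimate assumes the two fitted parameters are uncorrelated, which the abstract does not confirm.

```python
import math

# Central values and uncertainties as quoted in the abstract.
A, A_ERR = 0.095, 0.005   # constant term
B, B_ERR = 0.134, 0.011   # stochastic term, scaled by 1/sqrt(E)


def relative_energy_resolution(energy, a=A, b=B):
    """Return sigma_E / E = a + b / sqrt(E) for a positive energy."""
    if energy <= 0:
        raise ValueError("energy must be positive")
    return a + b / math.sqrt(energy)


def resolution_uncertainty(energy, a_err=A_ERR, b_err=B_ERR):
    """Uncertainty on sigma_E / E, ASSUMING the two parameters are uncorrelated."""
    if energy <= 0:
        raise ValueError("energy must be positive")
    return math.sqrt(a_err ** 2 + (b_err / math.sqrt(energy)) ** 2)


if __name__ == "__main__":
    for energy in (1.0, 4.0, 9.0):  # illustrative energies only
        res = relative_energy_resolution(energy)
        err = resolution_uncertainty(energy)
        print(f"E = {energy}: sigma_E/E = {res:.3f} +/- {err:.3f}")
```

Because the second term falls off as 1/√E, the relative resolution improves with energy and approaches the constant term at high energies.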

Courses (1)

Research Seminar "Introduction to Specialty"

2023/2024 · Master's programme · English