Maria Sergeevna Poptsova

Faculty of Computer Science

Profile on hse.ru · Tel.: +7 (495) 531-00-00 | 27335

Professional interests

· bioinformatics
· genomics
· comparative genomics
· machine learning
· Machine learning and data analysis
· 34.03.23 Mathematical biology and theoretical modeling of biological processes. Bioinformatics

Positions

Center Director: Faculty of Computer Science, Institute of Artificial Intelligence and Digital Sciences, Center for Biomedical Research and Technology
Associate Professor: Faculty of Computer Science, Department of Big Data and Information Retrieval
Academic Supervisor of the educational programme: Data Analysis in Biology and Medicine

Bio

· Began working at HSE University in 2016.
· Research and teaching experience: 17 years.

Education

2004 · Candidate of Sciences in Physics and Mathematics: Lomonosov Moscow State University, specialty 01.00.00 "Physics and Mathematics" and 03.01.02 "Biophysics", dissertation topic: Transformation of autowaves in locally inhomogeneous active media
1995 · Specialist degree: Lomonosov Moscow State University, specialty "Physics", qualification "Physicist"

Work experience

· 09/16-09/17
· Associate Professor
· Faculty of Business and Management, Higher School of Economics

· 10/12-present
· Senior Research Fellow
· Department of Biophysics, Faculty of Physics, Moscow State University

· 01/10-05/11
· Research Fellow
· Department of Pathology and Laboratory Medicine, Institute for Computational Biomedicine, Weill Cornell Medical College, Cornell University
· Summary: worked in a laboratory studying prostate cancer. Developed an algorithm and wrote a program to compute the degree of impact of CNVs on biological pathways (publication in progress). Analyzed second-generation sequencing data to find endogenous causes of genome breakage in aggressive forms of tumors.

· 4/09-12/09, 2/05-1/08
· Research Fellow
· Molecular and Cell Biology Department, University of Connecticut
· Summary: worked under a NASA grant within the Applied Information Systems Research (AISR) program (http://aisrp.nasa.gov/). Took part in developing algorithms for processing large data sets (applied to biological systems) and in implementing these algorithms with parallel computing on Unix-based cluster systems (parallel supercomputers).

· Founder and co-owner
· Janussys, Ltd. (www.janussys.ru)
· Computational linguistics software development company
· Summary: a company working in mathematical linguistics, in particular developing machine translation algorithms and creating multilingual dictionaries. Publisher of the multimedia English-Russian illustrated dictionary "Janus" (2002). Currently seeking investors for a project to build a next-generation machine translation system.

Awards and recognition

· Commendation from the Vice Rector of HSE University (November 2025)
· Certificate of Honor of HSE University (May 2025)
· Commendation from the Vice Rector of HSE University (October 2024)
· Commendation from the Vice Rector of HSE University (December 2023)
· Commendation from the First Vice Rector of HSE University (December 2023)
· Letter of gratitude from the First Vice Rector of HSE University (February 2023)
· Commendation from HSE University (May 2022)
· Commendation from the Faculty of Computer Science of HSE University (September 2019)
· Commendation from the Vice Rector of HSE University (May 2019)
· Bonus for a publication in a List A journal (or an equivalent scholarly publication) (2025–2026, 2024–2025, 2023–2024)
· Bonus for a publication in an international peer-reviewed scholarly publication (2022–2023, 2021–2022, 2020–2022, 2017–2019)
· Best Teacher: 2021
· Best Academic Supervisor in the category "Students' digital skills": 2024–2025
· Best Academic Supervisor in the category "Student satisfaction with the quality of the educational programme": 2025
· Best Academic Supervisor in the category "Cross-faculty cooperation": 2023–2024
· Best Academic Supervisor in the category "Students' work with external clients": 2023
· Best Academic Supervisor in the category "Student recruitment": 2023

Grants and projects

· for the degree of Candidate of Sciences

Researcher identifiers

ORCID: 0000-0002-7198-8234
ResearcherID: G-6985-2014
SPIN RSCI: 1361-1087
Google Scholar: https://scholar.google.com/citations?hl=en&user=9MoA58MAAAAJ
Scopus AuthorID: 16177766600

Publications (57)

Prediction of protein-protein interactions using point transformer and spherical Convex Hull graphs

2026 · ARTICLE · en

Accurate predictions and large-scale identification of protein-protein interactions (PPIs) are crucial for understanding their inherent biological mechanisms and protein functions in virtually all biological processes. Nowadays, graph-based deep learning models have made significant contributions in modeling proteins with physicochemical and geometric features. However, most of these models rely on conventional graph construction methods, such as radial cutoff or k-nearest neighbor (k-NN), which often produce sparse and weakly connected graphs, limiting the ability of neural networks to exploit the spatial relationships between nodes. To address this, we introduce PT-PPI, a geometric deep learning framework that combines protein surface point clouds with geometric graphs. Protein surfaces are encoded as oriented point clouds enriched with geometric features, then transformed into sparse, well-connected graphs using the hyperparameter-free Spherical Convex Hull (SCHull) method. These graphs are processed by a Point Transformer network, with representations coupled to ProstT5 sequence embeddings. Evaluations on the PINDER dataset show that PT-PPI surpasses LLM-based (D-SCRIPT), graph-based (GCN, GAT, Struct2Graph), and hybrid sequence-structural-based models (SpatialPPIv2). Ablation studies confirm the complementary value of surface geometry and sequence information, demonstrating that geometric deep learning on protein surfaces and point cloud representations offers a promising approach that opens the doors for further research on large-scale interactome mapping and the understanding of protein function.

Multimodal graph, surface, and language-based model for protein protein interaction prediction

2026 · ARTICLE · en

Accurate prediction of protein-protein interactions (PPIs) is fundamental to understanding biological processes and disease mechanisms. While deep learning offers a powerful alternative to costly experimental methods, existing approaches often overlook critical protein-surface information and rely on simplistic feature fusion techniques, thereby limiting performance. To address this, we introduce GSMFormer-PPI, a novel multimodal framework that integrates protein molecular surface features, 3D structural graphs, and residue-level sequence embeddings. Our architecture employs geometric deep learning (MaSIF) to extract physicochemical surface descriptors, graph convolutional networks to process structural context, and a transformer encoder with linear projectors to learn complex, cross-modal interactions beyond simple concatenation. GSMFormer-PPI was evaluated on a curated PINDER dataset, and direct comparisons showed that it outperforms traditional graph-based models. Furthermore, a cross-dataset comparison revealed that it achieves similar or higher performance to that reported by other top models. Ablation studies confirm the critical contribution of surface features and our advanced fusion strategy to the model’s superior predictive power. This work demonstrates that the integrative analysis of surface, structure, and sequence data is a vital and promising direction for advancing PPI prediction.

Molecular dynamics simulations refine the pathogenicity of ACVRL1 kinase domain variants by quantifying impacts on ATP binding in pulmonary arterial hypertension

2026 · ARTICLE · en

Single amino acid substitutions in the ATP-binding domain of ACVRL1, a key receptor in the bone morphogenetic protein (BMP) signaling pathway, are frequently classified as variants of uncertain significance (VUS), complicating molecular diagnosis for pulmonary arterial hypertension (PAH) and Hereditary Hemorrhagic Telangiectasia (HHT). Since aberrant ATP binding disrupts downstream SMAD1/5/8 phosphorylation, we employed molecular dynamics (MD) simulations to quantitatively assess the functional impact of these variants. We first validated our approach on 20 known pathogenic/likely pathogenic variants within 5Å of the ATP-binding site, finding that 18 (90%) caused significant alterations in binding affinity (|d| ≥ 0.8, p in silico mutagenesis of all possible substitutions at ATP-binding pocket positions, combined with InterVar classification under HHT phenotype, enabled reclassification of 9 of 12 (75%) VUS as likely pathogenic. Finally, we demonstrated the applicability of this approach in two PAH patients with HHT carrying ACVRL1 VUS. This work establishes MD simulation of ATP-binding affinity as an effective and scalable tool for the functional interpretation of kinase variants, with broad potential for application across other disease-associated kinases.

Deep learning captures the effect of epistasis in multifactorial diseases

2025 · ARTICLE · en

Polygenic risk score (PRS) prediction is widely used to assess the risk of diagnosis and progression of many diseases. Routinely, the weights of individual SNPs are estimated by the linear regression model that assumes independent and linear contribution of each SNP to the phenotype. However, for complex multifactorial diseases such as Alzheimer’s disease, diabetes, cardiovascular disease, cancer, and others, association between individual SNPs and disease could be non-linear due to epistatic interactions. The aim of the presented study is to explore the power of non-linear machine learning algorithms and deep learning models to predict the risk of multifactorial diseases with epistasis.

Data augmentation with generative models improves detection of Non-B DNA structures

2025 · ARTICLE · en

Non-B DNA structures, or flipons, are important functional elements that regulate a large spectrum of cellular programs. Experimental technologies for flipon detection are limited to the subsets that are active at the time of an experiment and cannot capture whole-genome functional set. Thus, the task of generating reliable whole-genome annotations of non-B DNA structures is put on deep learning models, however their quality depends on the available experimental data for training. The data augmentation approach as the combination of synthetic and real data is widely used in various fields. Deep generative models demonstrated promising results in data augmentation improving classifiers’ performance. Here we aimed at testing performance of diffusion models in comparison to other generative models in generating synthetic non-B DNA structures for data augmentation approach. We tested denoising diffusion probabilistic and implicit models (DDPM and DDIM), Wasserstein generative adversarial network (WGAN), vector quantised variational autoencoder (VQ-VAE) and showed that data augmentation improves the quality of classifiers. Diffusion models overall show the best results, but when considering three criteria of generative trilemma - quality of generated samples, diversity and sampling speed, we conclude that trade-off is possible between generative diffusion model and other architectures such as WGAN and VQ-VAE.

The prevalence of pathogenic variants in the BMPR2 gene in patients with the idiopathic pulmonary arterial hypertension in the Russian population: sequencing data and meta-analysis

2025 · ARTICLE · en

Background: Idiopathic pulmonary arterial hypertension (IPAH) is a rare and severe form of pulmonary hypertension, with a genetic basis most commonly associated with mutations in the BMPR2 gene. However, no genetic testing has been reported for IPAH patients in the Russian population, nor have systematic studies been conducted to assess the frequency of pathogenic variants in this group. Methods: The study cohort included 105 IPAH patients, consisting of 23 males and 82 females, who were managed at the PH care center in Moscow, Russia, from 2014 to 2024. Genetic testing was performed using whole-genome sequencing. Variant identification and annotation were conducted using GATK, DeepVariant, VEP, sv-callers and AnnotSV. A meta-analysis, performed with MOOSE, included 24 studies involving 3124 IPAH patients and 470 P/LP variants. Pathogenicity reassessment was carried out using InterVar, which incorporates ACMG criteria. Results: Analysis of 105 adult IPAH patients in Russia revealed 11 patients (10.48%) as carriers of pathogenic or likely pathogenic (P/LP) BMPR2 variants. As a result of the reassessment, the number of P/LP BMPR2 variants rose from 394 (59%) to 445 (67%), with 80 pathogenic variants reclassified as of uncertain significance and 152 unclassified variants reclassified as P/LP. The meta-analysis of these reevaluated pathogenic variants showed that while the frequency of P/LP variants in our cohort (10.48%) is lower than the overall average of 17.75% from the meta-analysis, the difference is not statistically significant (p = 0.062). Additionally, we report three P/LP BMPR2 variants not previously reported in the literature, one of them structural, and four P/LP variants in the TBX4, ATP13A3 and AQP1 genes from 27 IPAH genes in 3 patients. Conclusions: For the first time, we present the results of genetic testing in IPAH patients from the Russian population. Despite the considerable heterogeneity in the worldwide data, the prevalence of pathogenic BMPR2 mutations in IPAH patients from the Russian population does not significantly differ from the overall average in the meta-analysis. It is crucial to periodically reassess the pathogenicity of published variants, as half of the pathogenic BMPR2 IPAH variants were reclassified as LP or of uncertain significance.

Deep learning deciphers the related role of master regulators and G-quadruplexes in tissue specification

2025 · ARTICLE · en

G-quadruplexes (GQs) are non-canonical DNA structures encoded by G-flipons with potential roles in gene regulation and chromatin structure. Here, we explore the role of G-flipons in tissue specification. We present a deep learning-based framework for the genome-wide G-flipon predictions across 14 human tissue types. The model was trained using high-confidence experimental maps of GQ-forming sequences and ATAC-seq peaks, conjoined with the location of RNA polymerase, histone marks, and transcription factor binding sites. The training dataset for the DeepGQ model was derived from EndoQuad level 4–6 GQs. Model predictions were subsequently validated against the comprehensive EndoQuad dataset (levels 1–6) to optimize the whole-genome prediction threshold. To identify tissue-specific regulatory patterns, we classified GQ promoter predictions as either ‘core’ or ‘tissue-specific’. We identified a notable overlap between predicted unique tissue-specific GQ sites and master regulatory genes (MRGs), tissue-specific DNase-hypersensitivity sites, and proteins that modulate R-loop formation. Collectively, the findings highlight the transactions between MRG and G-flipons intermediated by RNA: DNA hybrids associated with tissue specification.

GQ-DNABERT reveals GQ proximal enhancer–promoter interactions associated with tissue-specific transcription

2025 (in press) · ARTICLE · en

Alternative DNA conformation formed by sequences called flipons are thought to play an important role in regulating various genomic processes, either repressing or enhancing transcription, chromatin organization, DNA repair, telomere maintenance, RNA splicing, translation, and stress responses. The formation of G-quadruplexes (GQs) has been investigated experimentally using various methodologies with varying degrees of overlap between the results underscoring the need for a gold-standard GQ dataset. With this aim we trained a large language model, GQ-DNABERT using EndoQuad, the most comprehensive human GQ dataset. GQ-DNABERT recalled the training data and predicted de novo GQs in intergenic and intronic regions, enriched for cis-regulatory elements (cCREs) and ATAC-seq peaks. We evaluated the predicted GQ-DNABERT proximal enhancer–promoter (pEP) pairs, using annotations from ENdb, ENCODE, Zoonomia, Chromium multiomics scATAC-seq and scRNA-seq data from normal cells, and cCREs from normal-cancer pairs. We found GQ pEP pairs correlating with gene expression, with some pairings potentially acting as tissue-specific switches. Genes with GQ pEP pairs in cancer cells are enriched in different processes compared to the corresponding normal tissues. Overall, GQ-DNABERT is a valuable tool for extending and harmonizing data collected ex vivo. We demonstrate the usefulness of GQ-DNABERT for investigating transcriptional regulation in single-cell experiments.

Host cell Z-RNAs activate ZBP1 during virus infections

2025 · ARTICLE · en

Herpes simplex virus 1 (HSV-1) and influenza A viruses (IAV) induce Z-form-nucleic-acid-binding protein 1 (ZBP1)-initiated cell death. ZBP1 is activated by Z-RNA, and the Z-RNAs that trigger ZBP1 during HSV-1 and IAV infections were assumed to be of viral origin. Here, however, we show that host cell-encoded Z-RNAs are major and sufficient ZBP1-activating ligands after infection by these two human pathogens. The majority of cellular Z-RNAs mapped to intergenic endogenous retroelements embedded within abnormally long 3′ extensions of host cell mRNAs. These aberrant host cell transcripts arose as a consequence of disruption of transcription termination (DoTT), a virus-driven phenomenon that disables cleavage and polyadenylation specificity factor (CPSF)-mediated 3′ processing of nascent pre-mRNAs. Mutant viruses lacking ICP27 or NS1 (the virus-encoded proteins responsible for inhibiting CPSF and triggering DoTT) did not induce host cell Z-RNA accrual and were attenuated in their ability to stimulate ZBP1. Ectopic expression of HSV-1 ICP27 or IAV NS1 or pharmacological blockade of CPSF activity induced accumulation of host cell Z-RNAs and activated ZBP1. These results demonstrate that DoTT-generated cellular Z-RNAs are bona fide ZBP1 ligands, and position ZBP1-activated cell death as a host response to counter viral disruption of the cellular transcriptional machinery.

Courses (9)

Bioinformatics of DNA, RNA and Proteins · 4 times

2025/2026, 2024/2025, 2023/2024, 2022/2023 · Minor · Rus

Medical Bioinformatics · 4 times

2025/2026, 2024/2025, 2023/2024, 2022/2023 · Minor · Rus

Machine Learning in Bioinformatics · 4 times

2025/2026, 2024/2025, 2023/2024, 2022/2023 · Master's / Mago-Lego · Eng

Machine Learning Methods in Bioinformatics

2024/2025 · Mago-Lego · Rus

Comparative Genomics

2024/2025 · Master's / Mago-Lego · Rus

Bioinformatics · 2 times

2022/2023, 2021/2022 · Bachelor's · Rus

Research Seminar "Data Analysis in the Natural Sciences"

2022/2023 · Bachelor's · Eng

Molecular Evolution

2021/2022 · Master's · Rus

Modern Methods of Data Analysis

2021/2022 · Master's · Eng