Maria Sergeevna Poptsova

Faculty of Computer Science

Profile on hse.ru ↗ Tel.: +7 (495) 531-00-00 | 27335

Professional interests

· bioinformatics
· genomics
· comparative genomics
· machine learning
· Machine learning and data analysis
· 34.03.23 Mathematical biology and theoretical modeling of biological processes. Bioinformatics

Positions

Director of the Center: Faculty of Computer Science, Institute of Artificial Intelligence and Digital Sciences, Center for Biomedical Research and Technology
Associate Professor: Faculty of Computer Science, Department of Big Data and Information Retrieval
Academic Supervisor of the educational programme: Data Analysis in Biology and Medicine

Bio

· Began working at HSE University in 2016.
· Research and teaching experience: 17 years.

Education

2004 · Candidate of Sciences in Physics and Mathematics: Lomonosov Moscow State University, specialties 01.00.00 "Physical and Mathematical Sciences" and 03.01.02 "Biophysics"; dissertation topic: Transformation of autowaves in locally inhomogeneous active media
1995 · Specialist degree: Lomonosov Moscow State University, specialty "Physics", qualification "Physicist"

Work experience

· 09/16-09/17: Associate Professor, Faculty of Business and Management, Higher School of Economics
· 10/12-present: Senior Research Fellow, Department of Biophysics, Faculty of Physics, Moscow State University
· 01/10-05/11: Research Fellow, Department of Pathology and Laboratory Medicine, Institute for Computational Biomedicine, Weill Cornell Medical College, Cornell University
  Summary: worked in a laboratory studying prostate cancer. Developed an algorithm and wrote a program to compute the degree of influence of CNVs on biological pathways (publication in progress). Analyzed second-generation sequencing data to find endogenous causes of genome breakage in aggressive forms of tumors.
· 4/09-12/09, 2/05-1/08: Research Fellow, Molecular and Cell Biology Department, University of Connecticut
  Summary: worked under a NASA grant within the Applied Information Systems Research (AISR) program (http://aisrp.nasa.gov/). Took part in developing algorithms for processing large data sets (applied to biological systems) and in implementing these algorithms with parallel computing on Unix-based cluster systems (parallel supercomputers).
· Founder and co-owner, Janussys, Ltd. (www.janussys.ru), a computational linguistics software development company
  Summary: a company working in mathematical linguistics, in particular developing machine translation algorithms and creating multilingual dictionaries. Publisher of the multimedia English-Russian illustrated dictionary "Janus" (2002). Currently seeking investors for a project to build a next-generation machine translation system.

Awards and commendations

· Commendation from the Vice Rector of HSE University (November 2025)
· Certificate of Honour of HSE University (May 2025)
· Commendation from the Vice Rector of HSE University (October 2024)
· Commendation from the Vice Rector of HSE University (December 2023)
· Commendation from the First Vice Rector of HSE University (December 2023)
· Letter of Thanks from the First Vice Rector of HSE University (February 2023)
· Commendation from HSE University (May 2022)
· Commendation from the Faculty of Computer Science, HSE University (September 2019)
· Commendation from the Vice Rector of HSE University (May 2019)
· Bonus for a publication in a List A journal (or an equivalent academic publication) (2025–2026, 2024–2025, 2023–2024)
· Bonus for a publication in an international peer-reviewed academic publication (2022–2023, 2021–2022, 2020–2022, 2017–2019)
· Best Teacher, 2021
· Best Academic Supervisor in the category "Students' Digital Skills", 2024–2025
· Best Academic Supervisor in the category "Student Satisfaction with the Quality of the Educational Programme", 2025
· Best Academic Supervisor in the category "Interfaculty Cooperation", 2023–2024
· Best Academic Supervisor in the category "Students' Work with External Clients", 2023
· Best Academic Supervisor in the category "Student Recruitment", 2023

Grants and projects

· For the degree of Candidate of Sciences

Researcher identifiers

ORCID: 0000-0002-7198-8234
ResearcherID: G-6985-2014
RSCI SPIN: 1361-1087
Google Scholar: https://scholar.google.com/citations?hl=en&user=9MoA58MAAAAJ
Scopus AuthorID: 16177766600

Publications (57)

Comprehensive analysis of cancer breakpoints reveals signatures of genetic and epigenetic contribution to cancer genome rearrangements

2021 · ARTICLE · en

Understanding mechanisms of cancer breakpoint mutagenesis is a difficult task, and predictive models of cancer breakpoint formation have so far failed to achieve even moderate predictive power. Here we take advantage of a machine learning approach that can gather important features from big data and quantify the contribution of different factors. We performed a comprehensive analysis of almost 630,000 cancer breakpoints and quantified the contribution of genomic and epigenomic features: non-B DNA structures, chromatin organization, transcription factor binding sites and epigenetic markers. The results showed that transcription and formation of non-B DNA structures are two major processes responsible for cancer genome fragility. Epigenetic factors, such as chromatin organization in TADs, open/closed regions, DNA methylation, and histone marks, are less informative but do make their contribution. As a general trend, individual features inside the groups show a relatively high contribution of G-quadruplexes and repeats and CTCF, GABPA, RXRA, SP1, MAX and NR2F2 transcription factors. Overall, the cancer breakpoint landscape can be represented by well-predicted hotspots and poorly predicted individual breakpoints scattered across genomes. We demonstrated that hotspot mutagenesis has genomic and epigenomic factors, and that some individual cancer breakpoints are not random noise but have a definite mutation signature. In addition, we found a long-range action of some features on breakpoint mutagenesis. Combining omics data, cancer-specific individual feature importance and adding the distant to local features, predictive models for cancer breakpoint formation achieved 70–90% ROC AUC for different cancer types; however, precision remained low at 2% and the recall did not exceed 50%.
On the one hand, the power of models strongly correlates with the size of available cancer breakpoint and epigenomic data, and on the other hand finding strong determinants of cancer breakpoint formation still remains a challenge. The strength of predictive signals of each group and of each feature inside a group can be converted into cancer-specific breakpoint mutation signatures. Overall our results add to the understanding of cancer genome rearrangement processes.

DOI ↗ PDF ↗

Randomness in Cancer Breakpoint Prediction

2021 · ARTICLE · en

Cancer genomes are susceptible to multiple rearrangements by deleting, inserting, and translocating genomic regions. Recently, the problem of finding determinants of breakpoint formations was approached with machine learning methods; however, unlike cancer point mutations, breakpoint prediction appeared to be a more difficult task, and various machine learning models did not achieve high prediction power, often only slightly exceeding the threshold of random guessing. This raised the question of whether the breakpoints are random noise in cancer mutagenesis or there exist determinants in structural mutagenesis. In the present study, we investigated randomness in cancer breakpoint genome distributions through the power of machine learning models to predict breakpoint hot spots. We divided all cancer types into three groups by degree of randomness in their breakpoint formation. We tested different density thresholds and explored the bias in hot spot definition. We also compared prediction of hot spots versus individual breakpoints. We found that hot spots are considerably better predicted than individual breakpoints; however, some individual breakpoints can also be predicted with satisfactory power, and thus, it is not proper to filter them from analyses. We demonstrated that positive-unlabeled learning can provide insights into insufficiency of cancer data sets, which are not always reflected by data set sizes. Overall, the present results support the view that the cancer breakpoint landscape can be represented by predictable dense breakpoint regions and scattered individual breakpoints, which are not all random noise, but some are generated by a detectable mechanism.

DOI ↗

Special Issue: A, B and Z: The Structure, Function and Genetics of Z-DNA and Z-RNA

2021 · ARTICLE · en

It is now difficult to believe that a biological function for the left-handed Z-DNA and Z-RNA conformations was once controversial. The papers in this Special Issue, "Z-DNA and Z-RNA: from Physical Structure to Biological Function", are based on presentations at the ABZ2021 meeting that was held virtually on 19 May 2021 and provide evidence for several biological functions of these structures. The first of its kind, this international conference gathered over 200 scientists from many disciplines to specifically address progress in research involving Z-DNA and Z-RNA. These high-energy left-handed conformers of B-DNA and A-RNA are associated with biological functions and disease outcomes, as evidenced from both mouse and human genetic studies. These alternative structures, referred to as "flipons", form under physiological conditions, regulate type I interferon responses and induce necroptosis during viral infection. They can also stimulate genetic instability, resulting in adaptive evolution and diseases such as cancer. The meeting featured cutting-edge science that was, for the most part, unpublished. We plan for the ABZ meeting to reconvene in 2022.

DOI ↗ PDF ↗

Understanding cancer breakpoint determinants with omics data

2020 · ARTICLE · en

Over the last 20 years whole-genome sequencing of cancer genomes has supported the phenomenon of cancer mutation heterogeneity both for point and structural variants. Alongside the whole-genome sequencing projects, many next-generation sequencing experiments including ChIP-seq for histone modifications and transcription factors, DNase-seq, MeDIP-Seq, Hi-C, and others were collected for thousands of cancer genomes. The machine learning approach became an efficient method of predictive modeling because machine learning algorithms are able to consider multiple factors and their interactions and rank them in order of importance. Machine learning models predicting cancer point mutations at 1Mb scale, using as predictors the state of the chromatin, epigenetic factors and non-B DNA structures, achieved a good predictive power. However, predicting cancer breakpoints appeared to be a more difficult task than predicting point mutations. Machine learning models that were successfully used to predict cancer point mutations, using the same features, could not achieve high performance in predicting cancer breakpoints. Nevertheless, the available models demonstrate that aggregating information from omics experiments increases the model prediction power. Here we review state-of-the-art machine learning approaches to predict cancer breakpoints and discuss current understanding of the determinants of cancer breakpoint formation.

DOI ↗ PDF ↗

Interethnic differences in the prevalence of main cardiovascular pharmacogenetic biomarkers

2020 · ARTICLE · en

The aim of this study was to determine the prevalence of CYP2C9, VKORC1, CYP2C19, ABCB1, CYP2D6 and SLCO1B1 gene polymorphisms among residents of the Volga region (Chuvash and Mari) and northern Caucasus (Kabardins and Ossetians). Materials & methods: The study involved 845 apparently healthy volunteers of both sexes of the four different ethnic groups living in the Russian Federation: 238 from the Chuvash ethnic group, 206 Mari, 157 Kabardins and 244 Ossetians. Results: Significant differences were identified in allele frequency of CYP2C9, VKORC1, CYP2C19, ABCB1, CYP2D6 and SLCO1B1 gene polymorphisms between the Chuvash and Kabardins, Chuvash and Ossetians, Mari and Kabardins, Mari and Ossetians.

DOI ↗

Cancer Breakpoint Hotspots Versus Individual Breakpoints Prediction by Machine Learning Models

2020 · CHAPTER · en

Genome rearrangement is a hallmark of all cancers. Cancer breakpoint prediction appeared to be a difficult task, and various machine learning models did not achieve high prediction power. We investigated the power of machine learning models to predict breakpoint hotspots selected with different density thresholds and also compared prediction of hotspots versus individual breakpoints. We found that hotspots are considerably better predicted than individual breakpoints. When choosing a selection criterion, the test ROC AUC alone is not enough to choose the best model; the lift of recall and lift of precision should also be taken into consideration. Investigation of the lift of recall and lift of precision showed that it is impossible to select one criterion of hotspot selection for all cancer types but there are three to four distinct groups of cancer with similar properties. Overall the presented results point to the necessity to choose different hotspot selection criteria for different types of cancer.

DOI ↗ PDF ↗

Deep learning approach for predicting functional Z-DNA regions using omics data

2020 · ARTICLE · en

Computational methods to predict Z-DNA regions are in high demand to understand the functional role of Z-DNA. The previous state-of-the-art method Z-Hunt is based on statistical mechanical and energy considerations about B- to Z-DNA transition using sequence information. Z-DNA ChIP-seq experiment results showed little overlap with Z-Hunt predictions, implying that sequence information alone is not sufficient to explain the emergence of Z-DNA at different genomic locations. Adding epigenetic and other functional genomic mark-ups to DNA sequence level can help reveal the functional Z-DNA sites. Here we take advantage of the deep learning approach that can analyze and extract information from large volumes of molecular biology data. We developed a machine learning approach DeepZ that aggregates information from genome-wide maps of epigenetic markers, transcription factor and RNA polymerase binding sites, and chromosome accessibility maps. With the developed model we not only verify the experimental Z-DNA predictions, but also generate the whole-genome annotation, introducing new possible Z-DNA regions, which have not yet been found in experiments and can be of interest to researchers from various fields.

DOI ↗ PDF ↗

Recognition of DNA Secondary Structures as Nucleosome Barriers with Deep Learning Methods

2020 · CHAPTER · en

DOI ↗

Tissue-specific impact of stem-loops and quadruplexes on cancer breakpoints formation

2019 · ARTICLE · en

Background: Chromosomal rearrangements are typical phenomena in cancer genomes causing gene disruptions and fusions, corruption of regulatory elements, and damage to chromosome integrity. Among the factors contributing to genomic instability are non-B DNA structures with stem-loops and quadruplexes being the most prevalent. We aimed at investigating the impact of specifically these two classes of non-B DNA structures on cancer breakpoint hotspots using a machine learning approach. Methods: We developed a procedure for machine learning model building and evaluation as the considered data are extremely imbalanced and it was required to get a reliable estimate of the prediction power. We built logistic regression models predicting cancer breakpoint hotspots based on the densities of stem-loops and quadruplexes, jointly and separately. We also tested Random Forest models varying different resampling schemes (leave-one-out cross validation, train-test split, 3-fold cross-validation) and class balancing techniques (oversampling, stratification, synthetic minority oversampling). Results: We performed analysis of 487,425 breakpoints from 2234 samples covering 10 cancer types available from the International Cancer Genome Consortium. We showed that distributions of breakpoint hotspots in different types of cancer are not correlated, confirming the heterogeneous nature of cancer. It appeared that the stem-loop-based model best explains the blood, brain, liver, and prostate cancer breakpoint hotspot profiles while the quadruplex-based model has higher performance for the bone, breast, ovary, pancreatic, and skin cancer. For the overall cancer profile and uterus cancer the joint model shows the highest performance. For particular datasets the constructed models reach high predictive power using just one predictor, and in the majority of the cases, the model built on both predictors does not increase the model performance.
Conclusion: Despite the heterogeneity in breakpoint hotspots’ distribution across different cancer types, our results demonstrate an association between cancer breakpoint hotspots and stem-loops and quadruplexes. For approximately half of the cancer types stem-loops are the most influential factors while for the others these are quadruplexes. This fact reflects the differences in regulatory potential of stem-loops and quadruplexes at the tissue-specific level, which is yet to be discovered at the genome-wide scale. The performed analysis demonstrates that the influence of stem-loops and quadruplexes on breakpoint hotspot formation is tissue-specific.

DOI ↗ PDF ↗

Recognition of 3′-end L1, Alu, processed pseudogenes, and mRNA stem-loops in the human genome using sequence-based and structure-based machine-learning models

2019 · ARTICLE · en

The role of 3’-end stem-loops in transposition was experimentally demonstrated for transposons of various species, where LINE-SINE transposons share the same 3’-end sequences, containing a stem-loop. We have discovered that 62-68% of processed pseudogenes and mRNAs also have 3’-end stem-loops. We investigated the properties of 3’-end stem-loops of human L1s, Alus, processed pseudogenes and mRNAs that do not share the same sequences, but all have 3’-end stem-loops. We have built sequence-based and structure-based machine-learning models that are able to recognize 3’-end L1, Alu, processed pseudogene and mRNA stem-loops with high performance. The sequence-based models use only sequence information and capture compositional bias in 3’-ends. The structure-based models consider physical, chemical and geometrical properties of dinucleotides composing a stem and position-specific nucleotide content of a loop and a bulge. The most important parameters include shift, tilt, rise, and hydrophilicity. The obtained results clearly point to the existence of structural constraints for 3’-end stem-loops of L1 and Alu, which are probably important for transposition, and reveal the potential of mRNAs to be recognized by the L1 machinery. The constructed models are freely available at GitHub (https://github.com/AlexShein/transposons/) and can be used for de novo discovery of transposon-related stem-loops.

DOI ↗ PDF ↗

Courses (9)

Bioinformatics of DNA, RNA, and Proteins · 4 times

2025/2026, 2024/2025, 2023/2024, 2022/2023 · Minor · ru

Medical Bioinformatics · 4 times

2025/2026, 2024/2025, 2023/2024, 2022/2023 · Minor · ru

Machine Learning in Bioinformatics · 4 times

2025/2026, 2024/2025, 2023/2024, 2022/2023 · Master's / Mago-Lego · en

Machine Learning Methods in Bioinformatics

2024/2025 · Mago-Lego · ru

Comparative Genomics

2024/2025 · Master's / Mago-Lego · ru

Bioinformatics · 2 times

2022/2023, 2021/2022 · Bachelor's · ru

Research Seminar "Data Analysis in the Natural Sciences"

2022/2023 · Bachelor's · en

Molecular Evolution

2021/2022 · Master's · ru

Modern Methods of Data Analysis

2021/2022 · Master's · en