Doctoral defence: Tarmo Puurand “Human genome studies with k-mer frequencies”

Doctoral diploma covers
Author: Andres Tennus

On 2 September at 14:15, Tarmo Puurand will defend his doctoral thesis “Human genome studies with k-mer frequencies” to obtain the degree of Doctor of Bioinformatics.

Supervisors:
Professor Maido Remm, University of Tartu
Associate Professor Lauris Kaplinski, University of Tartu

Opponent:
Professor Kateryna Makova, Penn State University, USA

Summary:
The human genome is complex and constantly changing – mutations occur all the time. Just 25 years ago, studying the genome was slow and expensive, but advances in technology have brought major breakthroughs. In the past, researchers mainly used DNA microarrays, which detect individual single-letter changes known as SNPs (single nucleotide polymorphisms). Today, it’s possible to sequence the entire genome and analyze billions of data points at once.

This study used an innovative approach based on k-mer analysis. K-mers are short DNA fragments of a fixed length k (25 letters in this study), and their frequency in the genome can be calculated without the time-consuming process of aligning every sequence to a reference. This speeds up the analysis and allows researchers to detect changes that older methods often missed – especially in repetitive or technically difficult regions.
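To make the idea of k-mer counting concrete, here is a minimal sketch in Python. It is not the software used in the thesis: the function name `count_kmers` and the toy sequence are illustrative assumptions, and the example uses k = 4 instead of 25 so the result is easy to follow by eye.

```python
from collections import Counter

def count_kmers(sequence: str, k: int = 25) -> Counter:
    """Count every k-letter substring (k-mer) of a DNA sequence.

    Hypothetical helper for illustration only; no reference genome
    and no alignment step is involved.
    """
    sequence = sequence.upper()
    counts = Counter()
    # Slide a window of width k along the sequence, one letter at a time.
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        # Skip windows containing ambiguous bases such as N.
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts

# Toy example with k = 4 (the study itself uses 25-letter k-mers).
toy_read = "ACGTACGTAC"
toy_counts = count_kmers(toy_read, k=4)
for kmer, n in sorted(toy_counts.items()):
    print(kmer, n)
```

This sketch counts k-mers on one strand only. Production k-mer counters typically also merge each k-mer with its reverse complement and use far more compact data structures, since a whole genome yields billions of k-mers.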

One of the key innovations in this work is identifying Y chromosome haplogroups from a very small amount of sequence data. While traditional methods usually require about 20× coverage for reliable results, this study used less than 1% of randomly selected genome data. This was possible thanks to repetitive sequences on the Y chromosome, which were previously considered too complex to analyze.

The method presented in this study uses these repeats as a kind of natural “amplifier,” similar to how DNA is copied in a lab. Over time, these regions have accumulated unique mutations that help identify a person’s paternal lineage, or haplogroup. This approach, which is based on k-mer frequencies, alignment-free, and scalable, opens up new possibilities for genome research, especially in cases where only limited data is available or where traditional methods fall short.
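The “amplifier” effect can be illustrated with a rough back-of-the-envelope model. The sketch below assumes that sequencing reads are sampled at random, so the number of times a given k-mer is observed follows a Poisson distribution whose mean is the sequencing depth multiplied by the k-mer's copy number in the genome. This model, the function name `prob_seen`, and the depth and copy-number values are assumptions chosen for illustration, not figures taken from the thesis.

```python
import math

def prob_seen(depth: float, copies: int) -> float:
    """Probability that a k-mer is observed at least once.

    Assumes a simple Poisson sampling model (an illustrative
    assumption): the expected number of observations is
    depth * copies.
    """
    expected = depth * copies
    return 1.0 - math.exp(-expected)

# Hypothetical very low sequencing depth (not a value from the thesis).
depth = 0.01

# Compare a single-copy k-mer with k-mers repeated 10 and 100 times.
for copies in (1, 10, 100):
    print(f"copies={copies:>3}  P(seen at least once)={prob_seen(depth, copies):.3f}")
```

Under these assumptions, a k-mer that occurs once in the genome is almost never sampled at very low depth, whereas a k-mer present in many copies of a repeat is far more likely to appear in the data, which is the intuition behind using repeats as a signal source.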