ESR 12: A machine learning pipeline to identify and characterise novel viruses in metagenome data

Yasas Wijesekara

I’m Yasas Wijesekara, an aspiring bioinformatician from Sri Lanka. I obtained my undergraduate degree in Microbiology from the University of Pune, India. After completing my studies, I had the opportunity to join a research team working on the molecular phylogenetics of reptiles. Over time, my interest in computational biology grew and encouraged me to make the transition from wet lab to dry lab. For my Master’s thesis, I developed an alignment-free method for viral genome clustering. I also became interested in applying machine learning to the life sciences to build efficient predictive models. Following this idea, I joined the University of Greifswald in March 2021 to work in the field of computational virology.

In my project, I wish to explore the use of artificial intelligence to identify novel viruses in sequencing data, which would help us understand viral diversity and the complex dynamics of viruses in different biomes.

I’m deeply and madly in love with computers, and I always keep tabs on the latest trends in the tech world. I also love hiking, and I usually don’t forget to bring my camera along.

Host institution:
University Medicine Greifswald (UMG), Germany
Local supervisor:
Prof. Dr. Lars Kaderali (UMG)
Local co-supervisor:
Prof. Dr. Martin Beer (Friedrich-Löffler-Institut)
Project partner:
Nikolas Basler (ESR 11)
Work packages:
WP 1.1 Virus identification
WP 2.1 Microevolution: Virus quasispecies

Project description

The wide use of deep sequencing technology in biology and medicine has led to huge publicly available sequencing databases, offering novel insights into the genetic and genomic makeup of life. Similarly, metagenome sequencing projects offer a deep view into microbiomes. Only a minuscule fraction of the virome has so far been identified; however, many hitherto unknown viral sequences lie hidden in already available data.

The objective of this project is to develop efficient computational pipelines to search for novel viruses in genome or metagenome sequencing data, as well as in public DNA and RNA sequencing databases such as NCBI GenBank/SRA. Template-based search strategies will be developed to find unknown viruses related to a known virus family, which serves as the search template. We have previously used a similar strategy to identify a new group of viruses in fish that are related to Hepatitis B virus. We will extend the template-based approach with machine learning methods. For this purpose, a deep convolutional neural network will be trained on sequencing data from known viruses. Additionally, we will employ support vector machines with a suitable string kernel. We will compare the performance of these two pattern recognition methods and apply them to two different problems: identifying viral sequences or integrated viral subsequences after genome assembly, and identifying them directly at the level of raw sequencing reads. Efficient hashing techniques, k-mer string matching and rapid filtering based on suffix arrays will be used to speed up the search through large databases. Beyond identifying viral sequences, we will also train the algorithms to predict the most likely host(s) from assembled viral sequences, and to predict pathogenicity. The developed tools will be tested on NCBI GenBank data and on metagenome data from the Study of Health in Pomerania (SHIP), a population-based study carried out in Greifswald for which genome and microbiome sequencing data are available. Furthermore, in collaboration with the partners Friedrich-Löffler-Institut and AllGenetics, the developed methods will be tested on additional metagenome sequencing data and clinical samples available from these partners.
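
To give a rough idea of what a string kernel over k-mers can look like, here is a minimal Python sketch. The project description does not say which string kernel will be used; the k-spectrum kernel shown here is only one common choice and is an assumption for illustration. The function names `kmer_counts` and `spectrum_kernel` and the default `k = 3` are hypothetical and not part of the project.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq: str, k: int) -> Counter:
    """Count all overlapping k-mers in seq, skipping k-mers with ambiguous bases."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Assumption: only unambiguous DNA bases are counted (e.g. 'N' is skipped).
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def spectrum_kernel(a: str, b: str, k: int = 3) -> float:
    """Normalised k-spectrum kernel: cosine similarity of the k-mer count vectors."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    # Dot product over k-mers present in a; missing k-mers in b count as 0.
    dot = sum(n * cb[kmer] for kmer, n in ca.items())
    norm = sqrt(sum(n * n for n in ca.values())) * sqrt(sum(n * n for n in cb.values()))
    return dot / norm if norm else 0.0
```

A kernel of this kind yields a similarity value between 0 and 1 for any pair of sequences. A matrix of such values could in principle be passed to a support vector machine that accepts precomputed kernels, although whether the project takes that route is not stated above.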