ESR 12: A machine learning pipeline to identify and characterise novel viruses in metagenome data

Host institution:
University Medicine Greifswald (UMG), Germany
Local supervisor:
Prof. Dr. Lars Kaderali (UMG)
Local co-supervisor:
Prof. Dr. Martin Beer (Friedrich-Löffler-Institut)
Project partner:
ESR 11
Work packages:
WP 1.1 Virus identification
WP 2.1 Microevolution: Virus quasispecies

The wide use of deep sequencing technology in biology and medicine has led to huge publicly available sequencing databases, offering novel insights into the genetic and genomic makeup of life. Similarly, metagenome sequencing projects offer a deep view into microbiomes. Only a minuscule fraction of the virome has been identified so far, and many hitherto unknown viral sequences lie hidden in data that are already available.

The objective of this project is to develop efficient computational pipelines to search for novel viruses in genome or metagenome sequencing data, as well as in public DNA and RNA sequencing databases such as NCBI GenBank/SRA. Template-based search strategies will be developed to find unknown viruses that are related to a known virus family, which serves as the search template. We have previously used a similar strategy to identify a new group of viruses in fish that are related to hepatitis B virus.

We will extend the template-based approach with machine learning methods. A deep convolutional neural network will be trained on sequencing data from known viruses. Additionally, we will employ support vector machines with a suitable string kernel and compare the performance of these two pattern recognition methods. Both methods will be applied to two different problems: identifying viral sequences or integrated viral subsequences after genome assembly, and identifying them directly at the level of raw sequencing reads. Efficient hashing techniques, k-mer string matching and rapid filtering based on suffix arrays will be used to speed up the search through large databases. Beyond identifying viral sequences, we will also train the algorithms to predict the most likely host(s) from assembled viral sequences, and to predict pathogenicity.

The developed tools will be tested on NCBI GenBank data and on metagenome data from the Study of Health in Pomerania (SHIP), a population-based study carried out in Greifswald for which genome as well as microbiome sequencing data are available. Furthermore, in collaboration with the partners Friedrich-Löffler-Institut and AllGenetics, the developed methods will be tested on additional metagenome sequencing data and clinical samples available from these partners.
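To make two of the building blocks named above more concrete, the following is a minimal Python sketch, not the project's actual implementation. It illustrates hash-based k-mer filtering of reads against a template sequence, and the k-spectrum kernel as one simple example of a string kernel that could be used with a support vector machine. The function names, the toy sequences, the choice of k and the `min_fraction` threshold are all assumptions made for illustration only.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count all overlapping k-mers of length k in a sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a, b, k):
    """k-spectrum string kernel: inner product of the two k-mer count vectors."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    if len(ca) > len(cb):
        ca, cb = cb, ca  # iterate over the smaller table
    return sum(n * cb[kmer] for kmer, n in ca.items())


def shared_kmer_fraction(read, template_kmers, k):
    """Fraction of the read's k-mers that also occur in the template k-mer set."""
    total = len(read) - k + 1
    if total <= 0:
        return 0.0
    read = read.upper()
    hits = sum(read[i:i + k] in template_kmers for i in range(total))
    return hits / total


def prefilter(reads, template, k=4, min_fraction=0.5):
    """Keep reads sharing enough k-mers with the template (hash-set lookup).

    k and min_fraction are illustrative values, not tuned parameters.
    """
    template_kmers = set(kmer_counts(template, k))
    return [
        r for r in reads
        if shared_kmer_fraction(r, template_kmers, k) >= min_fraction
    ]


# Toy data (hypothetical sequences, far shorter than real reads).
template = "ACGTACGTAC"
reads = ["CGTACG", "TTTTTT", "ACGTTT"]
candidates = prefilter(reads, template, k=4, min_fraction=0.5)
```

In this sketch the cheap set-membership filter discards reads that share few k-mers with the template before any more expensive classifier is run. A kernel matrix built from `spectrum_kernel` could, for instance, be passed to an SVM implementation that accepts precomputed kernels; realistic values of k and the threshold would have to be chosen empirically.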