WP 1.2 Host prediction

Hosts of viruses infecting higher eukaryotes have been well described in the past, whereas the specific hosts of bacteria-infecting viruses are usually unknown. The aim of VIROINF in this work package is to investigate how viruses of Bacteria or Archaea (bacteriophages or just phages) are functionally linked to each other and to their unicellular hosts. This is crucial for understanding how phages impact the microbiome: they act as key regulators of bacterial population size and can be introduced into natural systems to influence which species are present. Evidence to date indicates that this relationship tends to be highly specific, with each phage targeting a limited number of bacterial hosts.

VIROINF will, for the first time, address this topic using a combined experimental and computational approach. ESR 5 will develop methods to link phages to their hosts experimentally, and to screen for and isolate phages that infect specific hosts at the level of single phage particles. Using these methods, ESR 5 and ESR 10 will generate a multi-layer database containing:

  • genomes of phages that infect selected hosts;
  • viral-tagged metagenomes of phages infecting the same hosts in contaminated groundwater ecosystems;
  • community co-occurrence of microbiomes and viromes from the same ecosystems.

These data will be complemented by text-mined literature data from BioRelate and by data from ESR 13. Combined, these sources will comprise tens of thousands of interactions. The resulting dataset will be available for benchmarking bioinformatics approaches for inferring host specificity (ESR 10).
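As an illustration of how the community co-occurrence layer might be turned into candidate interactions, the following minimal sketch correlates phage and host abundance profiles across samples and keeps strongly correlated pairs. It is not the project's method: the function names, the Pearson measure, the 0.8 threshold and all abundance values are assumptions made for the example, and real phage-host dynamics may be lagged or anti-correlated, which this simple measure would miss.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length abundance profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no co-occurrence signal
    return cov / sqrt(var_x * var_y)

def cooccurrence_candidates(phage_abund, host_abund, threshold=0.8):
    """Return (phage, host, r) triples with r >= threshold, strongest first."""
    pairs = []
    for phage, p_profile in phage_abund.items():
        for host, h_profile in host_abund.items():
            r = pearson(p_profile, h_profile)
            if r >= threshold:
                pairs.append((phage, host, round(r, 3)))
    return sorted(pairs, key=lambda t: -t[2])

# Hypothetical relative abundances across five samples (made-up numbers).
phage_abund = {"phage_A": [0.1, 0.4, 0.8, 0.2, 0.0]}
host_abund = {
    "host_X": [0.2, 0.8, 1.6, 0.4, 0.0],  # proportional to phage_A
    "host_Y": [0.9, 0.1, 0.0, 0.7, 1.0],  # roughly opposite trend
}
candidates = cooccurrence_candidates(phage_abund, host_abund)
```

Pairs found this way are only hypotheses; in the setting described here they would be checked against the experimentally linked phage-host pairs in the same database.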

ESR 10 will develop machine learning tools to link newly identified bacteriophages to their bacterial hosts using a range of signals, including CRISPR sequences and prophages in bacterial genomes (obtained by ESR 5), k-mer frequencies and other genome composition signals, and protein domains. The tools will be based on approaches such as random forests and neural networks/deep learning. Both are emerging as two of the most successful machine learning approaches, owing to their ability to discover non-linear signals in data and to cope with biases in training datasets. This matters because known viruses represent only a minority of all viruses, and VIROINF aims to create a scalable tool that predicts a host at the optimal taxonomic level for a given virus. The knowledge of different levels of virus-host co-evolution that will be generated in WP 2.1 will be applied here. The aim will be to infer networks of probable virus-host relationships at species level. To study virus-host specificity at strain level, we will focus on the bacteriophages that infect Staphylococcus aureus. Phages are important mediators of bacterial virulence and hence impact human health. Understanding how bacterial gene content is influenced by this lateral gene transfer mechanism is important for understanding how viruses impact both the specific bacteria and the human host.
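To make the genome composition signal concrete, the sketch below builds normalised k-mer frequency vectors and ranks candidate hosts by cosine similarity to a phage. This is a simple nearest-profile baseline, not the random forest or deep learning models described above, although the same vectors could serve as input features for such models. All function names and the toy sequences are hypothetical, and the value of k is an assumption (k = 2 here only because the toy sequences are short).

```python
from collections import Counter
from itertools import product
from math import sqrt

def kmer_profile(seq, k=4):
    """Normalised k-mer frequency vector over the ACGT alphabet."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # k-mers containing other symbols (e.g. N) are left out of the total
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_hosts(phage_seq, host_seqs, k=4):
    """Return (host, similarity) pairs, most similar composition first."""
    phage_vec = kmer_profile(phage_seq, k)
    scores = {
        name: cosine(phage_vec, kmer_profile(seq, k))
        for name, seq in host_seqs.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy sequences, invented for illustration only.
hosts = {
    "host_AT_rich": "ATATTAAATTTATAATATTA" * 5,
    "host_GC_rich": "GCGGCCGCGGGCCGCGCCGG" * 5,
}
phage = "TTATAATATATTAAATATAT" * 3
ranked = rank_hosts(phage, hosts, k=2)
```

In this toy case the AT-rich host ranks first because the phage shares no dinucleotides with the GC-rich host. On real genomes the differences are far subtler, which is one reason for combining composition with other signals such as CRISPR spacers and prophages.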