Virus-host interactions are highly specific because viruses (including bacteriophages, or phages) have co-evolved with the systems they infect and depend on for replication. As a consequence, phylogenetic relatedness and shared genomic properties of host species are strong predictors of host infectivity. However, our knowledge of virus-host relationships remains sparse, and hosts are unknown for many phages, particularly those assembled from metagenomic data. In this project, existing resources and genomic signals will be leveraged to infer missing or unobserved but probable hosts, e.g., where infection has not been observed due to host immunity. Machine learning models will be trained on these data to predict phage-bacteria interactions at the species and strain levels.
Specific objectives are:
- Construction of bipartite networks of observed virus-host species links based on databases such as Virus-Host DB and MVP, text-mined literature data from company partner BioRelate, and data from ESR 13. Combined, these sources will comprise many tens of thousands of interactions, both observed phage-bacteria interactions and interactions inferred from prophage and CRISPR-Cas-associated viral sequences embedded in host genomes.
- Quantification of the co-phylogenetic signal using the software Jane to test the preferential host-switching model in bacteria and to determine the extent of co-speciation versus host-switching. This will give information on how stable phage-host interactions are at the species level.
- Application of combined support vector machine (SVM) models to virus-host prediction using different representations of the virus genome information: nucleotide sequence, amino acid sequence, amino acid properties and protein domains.
- Analysis of virus-host links at the strain level. This objective will use the bacterium Staphylococcus aureus (the area of speciality of José Penades) as a model system to study strain-level phage-host interaction prediction in the context of CRISPR-Cas immunity and host pan-genomes.
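As an illustration of the first objective, the following minimal sketch shows one way a bipartite phage-host network could be assembled from observed and inferred links while keeping track of the evidence behind each edge. It is not the project's implementation: the record layout, the evidence labels ("observed", "prophage", "crispr_spacer") and all phage and host names are hypothetical placeholders, and real input would come from the databases and text-mined sources listed above.

```python
from collections import defaultdict

# Hypothetical example records: (phage, host, evidence source).
# All identifiers and labels are placeholders, not real data.
observed = [
    ("phage_A", "host_1", "observed"),
    ("phage_A", "host_2", "observed"),
    ("phage_B", "host_2", "observed"),
]
inferred = [
    ("phage_B", "host_3", "crispr_spacer"),
    ("phage_C", "host_3", "prophage"),
    ("phage_A", "host_1", "prophage"),  # same edge, additional evidence
]


def build_bipartite(records):
    """Build both sides of a bipartite network plus per-edge evidence."""
    phage_to_hosts = defaultdict(set)
    host_to_phages = defaultdict(set)
    evidence = defaultdict(set)
    for phage, host, source in records:
        phage_to_hosts[phage].add(host)
        host_to_phages[host].add(phage)
        evidence[(phage, host)].add(source)
    return dict(phage_to_hosts), dict(host_to_phages), dict(evidence)


def host_range(phage_to_hosts, phage):
    """Return the sorted list of hosts linked to a phage (empty if unknown)."""
    return sorted(phage_to_hosts.get(phage, set()))


phage_to_hosts, host_to_phages, evidence = build_bipartite(observed + inferred)
```

Storing the evidence per edge keeps directly observed infections distinguishable from links inferred from prophage or CRISPR-Cas spacer matches, so downstream models can weight or filter them differently.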