

Modelling HIV-1 evolution

RNA viruses, such as HIV, have extremely high mutation rates, largely attributable to a low-fidelity reverse transcriptase. This high mutation rate, combined with a large effective population size and a short generation time, results in an enormous capacity to generate diversity and to adapt to changes in the host environment. This has important implications for the design and monitoring of drug treatment regimens and is the primary reason for the difficulty experienced in developing an HIV vaccine that is effective across the globe. Moreover, the intra-host evolution of the virus is an essential part of HIV disease dynamics. It is therefore vital to develop tools to understand HIV evolution within and across infected individuals. HIV data collection efforts are increasing in scale and resolution as a result of ultra-high-throughput sequencing technology and the large-scale application of the Single Genome Amplification (SGA) technique. Efforts to monitor the large-scale roll-out of antiretroviral treatment in South Africa will generate large quantities of molecular sequence data. The availability of data, together with the biologically and clinically relevant insights that can be obtained by modelling intra-host evolution, emphasizes the need to develop and apply more accurate and biologically relevant evolutionary models.

Probabilistic models of sequence evolution typically assume a continuous-time Markov process defined by a rate matrix Q, whose elements qij denote the instantaneous rates of substitution from codon i to codon j. This enables the determination of a transition probability matrix P(t) = exp(Qt), as a function of time, that describes the probability of a substitution from state i to state j in a time interval t. Typically, sequences are modelled along the branches of a phylogenetic tree, which may be known already or may be included as one of the parameters to be estimated. In one approach, the likelihood of a codon alignment, given an estimated phylogenetic tree of relationships, is calculated under a model of evolution that constrains all sites to evolve neutrally and is then compared with the likelihood under a model in which a subset of sites is permitted to evolve adaptively. Positive selection is inferred when the latter model is favoured (using standard model comparison techniques).
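The step from the rate matrix Q to the transition probability matrix P(t) can be illustrated with a small numerical sketch. It is not the project's implementation: it assumes a toy four-state Jukes-Cantor nucleotide model in place of a codon model, and the names JC69_Q, mat_mul and transition_matrix are hypothetical. The matrix exponential is approximated with a truncated Taylor series combined with scaling and squaring.

```python
def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def transition_matrix(q, t, squarings=10, terms=12):
    """Approximate P(t) = exp(Q t) by a Taylor series with scaling and squaring."""
    n = len(q)
    factor = t / (2 ** squarings)          # shrink Q t so the series converges fast
    scaled = [[q[i][j] * factor for j in range(n)] for i in range(n)]
    result = identity(n)
    term = identity(n)
    for k in range(1, terms + 1):
        product = mat_mul(term, scaled)
        term = [[value / k for value in row] for row in product]   # A^k / k!
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):             # undo the scaling: exp(A)^(2^s)
        result = mat_mul(result, result)
    return result


# Assumed toy model: Jukes-Cantor, all off-diagonal rates equal, rows sum to zero.
JC69_Q = [[-1.0 if i == j else 1.0 / 3.0 for j in range(4)] for i in range(4)]

if __name__ == "__main__":
    for row in transition_matrix(JC69_Q, 0.5):
        print(["%.4f" % value for value in row])
```

Each row of the resulting matrix sums to one, as expected for a matrix of transition probabilities. A codon model would use the same calculation with a much larger Q.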

The distinction between neutral and adaptive evolution is captured by comparing the relative rates of substitutions that alter the encoded amino acid (referred to as non-synonymous substitutions) and substitutions that, because of the degeneracy of the genetic code, have no effect on the encoded amino acid sequence (referred to as synonymous substitutions). The latter class results in the substitution of one codon (or nucleotide triplet, encoding an amino acid) for another codon that encodes the same amino acid, often assumed not to have any effect on organism fitness, and therefore to occur at the neutral rate of evolution (which, according to evolutionary theory, is the same as the rate of mutation). When coding sequences evolve such that the rate of non-synonymous substitution (dN) is greater than the rate of synonymous substitution (dS), positive Darwinian selection (i.e. adaptive evolution) can be inferred. The inference of positive selection and the sites that are evolving under selection can reveal key aspects of organism biology – particularly in the context of pathogen infection and host resistance.
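The distinction between the two classes of substitution can be made concrete with a short sketch that labels a single-nucleotide codon change using the standard genetic code. The function name classify_substitution is hypothetical, and the sketch only classifies individual changes; it does not estimate dN or dS.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(codon_from, codon_to):
    """Label a single-nucleotide codon change by its effect on the protein."""
    differences = sum(a != b for a, b in zip(codon_from, codon_to))
    if differences != 1:
        raise ValueError("expected codons differing at exactly one position")
    aa_from, aa_to = CODON_TABLE[codon_from], CODON_TABLE[codon_to]
    if "*" in (aa_from, aa_to):
        return "stop"                      # changes to or from a stop codon
    return "synonymous" if aa_from == aa_to else "non-synonymous"


if __name__ == "__main__":
    print(classify_substitution("CTG", "CTA"))   # leucine to leucine
    print(classify_substitution("AAA", "AGA"))   # lysine to arginine
```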

Strategies such as the one discussed above have been used to investigate the evolution of resistance to drug treatment in pathogens, including HIV-1, and have also been used to study escape from immune responses. The models proposed in this study are modifications of these general models of sequence evolution and will be designed to provide more power to detect the effects of immune responses on intra-host HIV evolution, and to detect more sensitively the footprint of host immune genotype on the virus. In the context of antiretroviral drug treatment, the study proposes to extend directional selection models to the detection of secondary resistance mutations and to apply these models to sequence data sets from southern Africa and to publicly available sequence data sets.

The descriptive project objectives are:

  • to develop and apply methods to evaluate the intra-host evolutionary selective pressure acting on HIV-1 protein-coding genes;
  • to develop a method to model viral evolution and host genotype simultaneously and to use this model to identify the footprint of the host genotype on the autologous viral sequence;
  • to model viral evasion of host immune responses, in particular cytotoxic T-lymphocyte (CTL) responses;
  • to use evolutionary modelling to confirm putative CTL (and, later, antibody) epitopes; and
  • to detect compensatory mutations associated with the evolution of anti-retroviral drug resistance.

Each of the objectives stated above will be addressed through the development of custom probabilistic models of evolution implemented in HyPhy, R or C/C++.

Develop and apply methods to evaluate the intra-host evolutionary selective pressure acting on HIV-1 protein-coding genes

Previous models of sequence evolution give very high rates of false positives when applied to recombining sequences such as HIV. A model was recently developed to alleviate the confounding effects of recombination on parameter estimation. As part of another study, the researchers intend to carry out an exhaustive survey of selective pressures in HIV protein-coding genes using this model with a view to re-evaluating the published evidence of selection and its biological significance using an unbiased method. The intention is to distinguish between evolutionary selective pressures acting on viruses within individual hosts and selective pressures that shape the evolution of the pandemic more generally. In the first case, the researchers will focus on selective pressures acting on the virus in acute infection. This critical stage of the disease is believed to have a significant impact on set-point viral load, which in turn affects the rate of disease progression. Understanding the nature of the virus isolated from acute infection and its adaptation to the new host environment has potential implications for vaccine development.

The PI of this research is an investigator on the CHAVI project, through which he has access to a large set of highly accurate sequences from the HIV env gene, generated through the Single Genome Amplification technique from samples collected in North America (approximately 3 400 unpublished complete sub-type B coding sequences generated from samples from 102 acutely infected individuals). A large number of env coding sequences isolated from southern African individuals acutely infected with HIV sub-type C became available in 2008. Together, these data form a very comprehensive data set and provide a unique opportunity to understand the nature of the virus that is transmitted to newly infected individuals (a key objective of the CHAVI project) as well as the nature of the selective forces that shape the evolution of the virus in the crucial phase of infection (the primary objective of this study). The CHAVI project will also generate whole-genome HIV sequences at later stages, and this will enable investigation of the evolution of other HIV genes.

With input from other researchers, the research team has implemented a method to assess the selective pressure acting on HIV sequences in acute infection, using the HyPhy batch language for evolutionary modelling. The method combines evidence over multiple phylogenetic trees representing the evolutionary relationships of sequences from different infected individuals. For each infected individual, the time to the most recent common ancestor of the sequences in the individual is estimated using Bayesian Evolutionary Analysis Sampling Trees (BEAST). This method makes use of the Markov chain Monte Carlo (MCMC) algorithm to sample phylogenetic trees (the graphs representing the relationships between the viral sequences) according to their posterior probabilities, given the sequence data. Given a set of sampled trees and a set of input parameters, including mutation rate and generation time, credible intervals on demographic parameters of interest can be inferred, including the time to the most recent common ancestor of the viral sequences. If the most recent common ancestor of the sequences post-dates the estimated time of infection (based on clinical stage information), we can be confident that the observed viral diversity was generated during acute infection. The MCMC algorithm used by BEAST is naturally parallelisable since a single chain from which posterior distributions are derived can be sub-divided across multiple nodes. This research team has developed an automated MPI wrapper for BEAST and has tested its scalability on an acute infection data set derived from the CHAVI project. Because MCMC methods are well suited to parallel implementation, good parallelisability is expected.
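The comparison between the age of the most recent common ancestor and the time of infection can be sketched as follows. This is only an illustration under stated assumptions: the posterior samples are taken to be a plain list of ancestor ages in days before sampling (in practice they would be read from the sampler's output), the interval is a simple equal-tailed one based on order statistics, and the function names and example numbers are hypothetical.

```python
import math


def interval_from_samples(samples, mass=0.95):
    """Equal-tailed interval from posterior samples, using order statistics."""
    ordered = sorted(samples)
    n = len(ordered)
    tail = (1.0 - mass) / 2.0
    lower = ordered[math.floor(tail * (n - 1))]
    upper = ordered[math.ceil((1.0 - tail) * (n - 1))]
    return lower, upper


def diversity_arose_in_host(ancestor_ages, days_since_infection, mass=0.95):
    """True if even the oldest plausible common ancestor post-dates infection.

    ancestor_ages: sampled ages of the most recent common ancestor,
    in days before the sampling date (assumed input format).
    """
    _, oldest_plausible = interval_from_samples(ancestor_ages, mass)
    return oldest_plausible < days_since_infection


if __name__ == "__main__":
    sampled_ages = [20, 25, 31, 28, 40, 35, 22, 30, 27, 33]   # made-up values
    print(interval_from_samples(sampled_ages))
    print(diversity_arose_in_host(sampled_ages, days_since_infection=60))
```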

This research team will use the methodologies set out above to determine whether the HIV env gene evolves under positive Darwinian selective pressure in acute infection and, if so, to identify the sites in HIV env that evolve rapidly at this stage. The researchers expect that these sites will include reversions of mutations that enabled the virus to escape from immune responses that were present in the infecting individual but are absent from the newly infected host, and possibly mutations that facilitate escape from the earliest antibody responses of the newly infected host. CHAVI investigators have revealed a severe population bottleneck in HIV transmission, to the extent that most new infections result from expansion of just a single viral particle, and there is great interest in determining whether this bottleneck is selective (Does the process of HIV transmission from one host to another select for a viral strain that is particularly adapted for this purpose?). This concept is topical because, if transmission involves a severe selective bottleneck, vaccines need not target the full diversity of HIV, but merely a sub-set of strains selected for improved inter-host transmission. If the bottleneck is selective, consistent reversion of amino acids that are advantageous for transmission to a state that is most advantageous for expansion in a newly infected host will leave a detectable signature of positive selection.

Develop a method to model viral evolution and host genotype simultaneously and to use this model to identify the footprint of the host genotype on the autologous viral sequence

One of the most exciting developments in genomics recently has been the development and broad application of whole genome genotyping technologies. These technologies enable genotypes of up to a million single nucleotide polymorphisms (SNPs) to be determined in a single experiment. One of the objectives of this research is to develop the capacity to model the genome sequence of a pathogen and its host simultaneously. HIV is known to evolve under extremely strong evolutionary selective pressure. This fact is central to the current research and an essential aspect of the pathogenesis of the virus. The main driving force for the very rapid and adaptive evolution of HIV is the host environment. The dynamics of the adaptive immune response within individual hosts drives adaptive evolution of HIV within the host.

Differences between hosts in innate and adaptive immune capacity, as well as in a large number of non-immune-related host proteins with which the virus must interact, drive adaptive evolution between hosts. There have been several attempts to determine the footprint of the host immune capacity on HIV sequences, but no attempt has been made to determine the more general footprint of host genotype on viral sequence evolution. Collaborators in the CHAVI project carried out the first large-scale whole-genome association study of human loci that affect HIV disease progression. This project proposes to use the data from the whole-genome association study, together with HIV sequence data from the same individuals that were genotyped as part of that study, to investigate general host factors that place a selective pressure on HIV and thereby to establish the general host molecular footprint on the virus. This has profound implications for understanding the mechanisms through which different hosts progress to disease at very different rates and for understanding the biology of HIV pathogenesis. Determining the footprint of the host genotype on HIV sequence requires simultaneous models of sequence evolution and the host environment. Previous studies that investigated the footprint of HLA alleles on HIV sequences have generally either neglected to account for the non-independence of epidemiologically linked data or have used heuristic methods to take this into account.

This research proposes to take a more complete modelling approach to the problem of finding a relationship between viral sequence polymorphisms and host HLA genotype. The model will explicitly represent the host genotype and allow sequence evolution to depend on the HLA allele. Since the genotypes of the individuals infected by the ancestral viral sequences (represented by the internal nodes of the tree) are unknown, the method will sum over these genotypes. It is possible to encode this analysis using the HyPhy batch language; however, the analysis is extremely computationally intensive owing to the addition of a variable (HLA allele) over which all ancestral nodes of the tree have to be summed. This approach furthermore requires the researchers to test for relationships by fitting models independently for each pair consisting of an amino acid site and an HLA allele. Approximately 20 HLA alleles occur frequently enough in southern African populations to be tested with this method. The HIV genome is approximately 10 000 base pairs long and encodes approximately one third as many amino acids. The implementation of this project thus requires the optimization of roughly 20 x 3 000 (about 60 000) models. A reduction in computational requirements will be obtained by removing invariant and slowly evolving sites, although the computational requirements of this part of the project remain significant. This task is highly parallelisable since the model fitting for each pair of amino acid site and HLA allele is independent.
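The way the work divides into independent model fits can be sketched as below. This is a minimal illustration, not the project's pipeline: the alignment is assumed to be a list of equal-length amino acid strings, only strictly invariant sites are removed (slowly evolving sites are not handled), and the function names and allele labels are hypothetical placeholders.

```python
from itertools import product


def variable_sites(alignment):
    """Indices of alignment columns with more than one observed residue."""
    return [
        index
        for index, column in enumerate(zip(*alignment))
        if len(set(column)) > 1
    ]


def model_fitting_tasks(alignment, hla_alleles):
    """One independent task per (variable site, HLA allele) pair."""
    return list(product(variable_sites(alignment), hla_alleles))


if __name__ == "__main__":
    toy_alignment = ["MKV", "MRV", "MKI"]      # made-up three-residue sequences
    toy_alleles = ["allele_1", "allele_2"]     # placeholder allele labels
    for site, allele in model_fitting_tasks(toy_alignment, toy_alleles):
        print("fit model for site", site, "and", allele)
```

Because no task depends on the result of another, the task list could be handed to any job scheduler or worker pool without further coordination.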

Model viral evasion of host immune responses, in particular cytotoxic T-lymphocyte (CTL) responses

Escape from host immune responses can be associated with a significant cost in terms of viral fitness. Reversion of escape mutations to wild-type in hosts lacking a specific immune response and re-escape in hosts with the response can cause a pattern of amino acid toggling. Sequences tend to toggle between a very fit state and an easily accessible or very unfit escape variant. This research proposes to develop a model of amino acid toggling to improve the sensitivity with which it is possible to detect evasion of immune responses that carry high costs in terms of viral fitness. Positive Darwinian selection can be inferred when the rate of non-synonymous (amino-acid changing) substitution, dN, is greater than the rate of synonymous substitution, dS. Although dN > dS is sufficient to infer positive Darwinian selection, it is not necessary. The objective in this research is to develop methods to infer positive Darwinian selection from coding sequences when dN does not exceed dS. In the case of selective pressure for escape from a specific CTL response and reversion to the most fit amino acid in individuals lacking the CTL response, the researchers expect to observe toggling between the most replicatively fit but immune-susceptible state and the less fit but immune-escaped state. If the rate at which mutations between these states occur is greater than the rate predicted under random neutral mutation, positive Darwinian selection can be inferred, even when dN does not exceed dS.

Validation of the proposed models of sequence evolution requires extensive simulation, with significant high-performance computing implications. In order to determine whether these models are applicable to real-world HIV and other comparable sequence data sets, the researchers will simulate data sets using parameter values and sizes comparable to published data sets of HIV genes that have been used to carry out relatively heuristic investigations of the relationship between immune system alleles (in particular alleles of the Human Leukocyte Antigen, HLA) and sequence evolution. HyPhy allows simulation of data sets using flexible input models and will also be used to fit the models to the simulated data. Fitting the evolutionary models to data involves calculating the likelihood of a data set given a phylogenetic tree (a bifurcating tree structure representing the relationships between the sequences, with the lengths of the tree branches representing evolutionary distance), a model and a set of parameter values. The ancestral states of the sequence (the states of the sequence at the ancestral or extinct nodes of the tree) are unknown, and therefore the likelihood of the data is calculated as a sum over all possible ancestral states. The number of ancestral nodes is more or less equal to the number of sequences (which can be a few hundred) and the alphabet size for codon models of evolution is relatively large, making this sum computationally expensive to calculate directly. It can, however, be calculated in a reasonable time using Felsenstein's pruning algorithm, a dynamic programming algorithm. Fitting the model to the data involves optimization, for which the EM algorithm will be applied, although this step is difficult to parallelise. The number of simulations carried out will depend on the availability of computer resources, but ideally the researchers would like to simulate 1 000 replicate data sets for each model tested in order to reduce the variance in the false positive and power estimates.
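The sum over unknown ancestral states can be illustrated with a compact sketch of the pruning recursion for a single site. It rests on simplifying assumptions that are not part of the project: a four-state Jukes-Cantor nucleotide model with a closed-form transition probability stands in for a codon model, and the tree is encoded as nested tuples in a format invented for this example (a tip is a one-letter string, an internal node is a pair of (child, branch length) entries).

```python
import math

STATES = "ACGT"


def jc69_probability(state_from, state_to, branch_length):
    """Jukes-Cantor transition probability (branch length in substitutions per site)."""
    decay = math.exp(-4.0 * branch_length / 3.0)
    return 0.25 + 0.75 * decay if state_from == state_to else 0.25 - 0.25 * decay


def partial_likelihoods(node):
    """Return, for each state s, P(tips below node | node is in state s)."""
    if isinstance(node, str):                       # tip with an observed state
        return {s: 1.0 if s == node else 0.0 for s in STATES}
    partials = {s: 1.0 for s in STATES}
    for child, branch_length in node:               # visit each child once
        child_partials = partial_likelihoods(child)
        for s in STATES:
            partials[s] *= sum(
                jc69_probability(s, x, branch_length) * child_partials[x]
                for x in STATES
            )
    return partials


def site_likelihood(tree):
    """Likelihood of one site, with a uniform distribution over root states."""
    partials = partial_likelihoods(tree)
    return sum(0.25 * partials[s] for s in STATES)


if __name__ == "__main__":
    inner = (("A", 0.1), ("C", 0.2))
    tree = ((inner, 0.05), ("A", 0.3))
    print(site_likelihood(tree))
```

Each node is visited once and the work per node grows with the square of the number of states, which is why the recursion remains tractable where a direct sum over every combination of ancestral states would not.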

Evolutionary modelling to confirm putative CTL (and, later, antibody) epitopes

As an extension of the model proposed above, the research team will investigate the development of a model specifically designed to aid in the identification of epitopes (the regions within pathogen proteins that are recognised by the host immune system). The researchers will first concentrate on epitopes recognised by human cytotoxic T-lymphocytes (CTLs), since these epitopes are contiguous on the viral sequence. The PI of this research has worked with researchers at the South African National Institute for Communicable Diseases (NICD), who have experimentally identified the approximate regions within HIV proteins recognised by the CTL response (using the interferon-gamma ELISPOT assay). They have also generated HLA genotype data for these HIV-infected individuals. This type of data has also been generated in laboratories in other parts of the world.

HLA proteins function in presenting the epitope at the cell surface, where it interacts with receptors on immune cells, causing a CTL-mediated immune response. It is possible to hypothesise that the CTL response to a given peptide is mediated by a given HLA allele when the HLA allele is statistically over-represented among individuals exhibiting a specific CTL response. The researchers propose to confirm the existence of a relationship between specific HLA alleles and CTL-recognised peptides through the use of evolutionary modelling. In particular, the research will model the evolution of a CTL-recognised peptide, allowing different evolutionary rates along branches leading to individuals with and without a focal HLA allele. For branches of the phylogeny leading to ancestral nodes (where the HLA state is unknown because the individual infected by the ancestral virus is not present in the data set), a mixture model will be used with mixture proportions equal to the frequency (and one minus the frequency) of the HLA allele. This will enable the researchers to determine whether the evolutionary rate within a specific peptide is dependent on the presence or absence in an infected individual of the HLA allele under investigation. This approach can easily be extended by developing directional models of selection analogous to the model used to investigate drug resistance. In this instance the researchers will test whether there is a tendency at specific sites within the peptide to mutate away from the wild-type (or default state) in the presence of the HLA allele and towards the wild-type in its absence. This would provide a signature of immune escape and enable the researchers to pinpoint mutations that enable the virus to evade the immune response, while it would also provide strong evidence for an epitope for the investigated HLA allele within the peptide region modelled. The researchers have not yet determined the parallelisability of this analysis. Since it is similar in concept to the modelling of human genotype and HIV evolution, it is likely to require substantial computational resources. The researchers predict that the computational cost will scale more or less linearly with the number of epitope-HLA allele combinations.
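The mixture used on branches where the HLA state is unknown can be sketched for a single branch as follows. This is a toy illustration only: it assumes a Jukes-Cantor probability of observing the same state at both ends of a branch, and the rate multipliers, the function names and the example numbers are hypothetical rather than taken from the proposed model.

```python
import math


def same_state_probability(branch_length, rate):
    """Jukes-Cantor probability that both ends of a branch share a state."""
    return 0.25 + 0.75 * math.exp(-4.0 * rate * branch_length / 3.0)


def mixture_same_state_probability(branch_length, rate_with, rate_without,
                                   allele_frequency):
    """Weight the two rate classes by the allele frequency and its complement."""
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele_frequency must lie between 0 and 1")
    with_allele = same_state_probability(branch_length, rate_with)
    without_allele = same_state_probability(branch_length, rate_without)
    return allele_frequency * with_allele + (1.0 - allele_frequency) * without_allele


if __name__ == "__main__":
    # Made-up values: faster evolution in hosts carrying the focal allele.
    print(mixture_same_state_probability(0.1, rate_with=3.0, rate_without=1.0,
                                         allele_frequency=0.2))
```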

Detect compensatory mutations associated with the evolution of anti-retroviral drug resistance

The recently developed models of the evolution of drug resistance will be extended to the detection of compensatory mutations and applied to serially sampled HIV sequences. These models will be used to detect compensatory mutations associated with the evolution of resistance to anti-retroviral drugs, using the data set of nevirapine-treated individuals that formed the basis for a recent study and new drug treatment data sets as they become available from the National Institute for Communicable Diseases. The proposed method will test for dependence between the evolutionary rate at a candidate accessory resistance site and the resistance state at known resistance sites. This will require independent model fits and optimizations for each site and for each of the most important known drug resistance sites. This task is computationally highly intensive, but also highly parallelisable, and the researchers expect linear gains in efficiency with the number of available computer nodes.
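The text does not specify how the dependence test will be carried out. One standard possibility, shown here purely as an assumption, is a likelihood-ratio test between a null model in which the rate at the candidate site is independent of the resistance state and an alternative with one extra parameter that allows dependence. The sketch assumes nested models, one additional free parameter and the usual chi-square approximation; the function name and the log-likelihood values are hypothetical.

```python
import math


def likelihood_ratio_test(log_lik_null, log_lik_alt):
    """Likelihood-ratio test for nested models differing by one free parameter.

    Returns the test statistic and its p-value under a chi-square
    distribution with one degree of freedom.
    """
    statistic = max(0.0, 2.0 * (log_lik_alt - log_lik_null))
    # For one degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


if __name__ == "__main__":
    # Made-up maximised log-likelihoods for one candidate site.
    statistic, p_value = likelihood_ratio_test(-1520.4, -1515.1)
    print(statistic, p_value)
```

Because one such test would be run per pair of candidate site and known resistance site, some correction for multiple testing would normally be needed when interpreting the p-values.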

Each of the five objectives described above requires substantial computing resources: in obtaining sufficiently large numbers of samples for accurate estimation using MCMC sampling techniques, for analysis of multiple databases, and for extensive testing of custom models through simulations. Preliminary analyses using some of the models were computationally extremely intensive (approximately two hours per codon site). Since these models are subdivided by codon site, they can be parallelised relatively easily.
