Molecular modelling of peptides and protein-ligand complexes using knowledge-based potentials


 

Principal Investigator: Debasisa Mohanty

PhD Students
Gitanjali Yadav
Mohd Zeeshan Ansari

Collaborators
Rajesh S Gokhale

The main theme of the research project is to understand the structural principles that govern the binding of various ligands to proteins and the folding of peptides and proteins into stable conformations, and to use these principles to develop computational approaches for predicting the structures of peptides, proteins and protein-ligand complexes. The specific objective is to investigate whether knowledge-based potentials, i.e. scoring functions derived from the analysis of structural features in databases of known protein structures, can be used to predict (a) the substrate specificity of proteins involved in the biosynthesis of polyketides and peptide antibiotics, (b) the bound conformation of peptides in MHC-peptide complexes and the ranking of peptides by binding free energy, and (c) the structures of short peptides and the folds of proteins of unknown function in various genomes.

A.    Substrate specificity of proteins involved in polyketide biosynthesis

Our computational approach was successful in detecting the correlation between the substrate specificity of CHS-like proteins and their active site residues, but no such correlation could be found for the KS domains of modular polyketide synthases (PKS). The key difference between CHS and the KS domains of modular PKS is that CHS is a monofunctional enzyme which carries out polyketide synthesis by taking different starter and extender units, whereas KS domains are part of multi-enzyme complexes which contain several other domains for activities such as phosphopantetheine binding (ACP), acyl transfer (AT), ketoreduction (KR), dehydration (DH) and enoylreduction (ER). Thus a systematic analysis involving all the domains of modular PKS was necessary for understanding their substrate specificity.

Analysis of various domains present in modular PKS

We have carried out a detailed computational analysis of the amino acid sequences of various modular PKS with known substrates to decipher the relationship between their amino acid sequence and substrate specificity. Since the polyketide product synthesized by a given gene cluster is determined by the number of modules present in the cluster and the type of domains in each module, the first step in the computational prediction of the polyketide product is the correct identification of the domains present in a given ORF. The results of our sequence analysis indicate that all domains except ACP have enough sequence homology to be detected by pairwise comparison with a reference template sequence of the corresponding domain. The ACP domains, however, show a high degree of sequence variability, and hence multiple reference templates are required for their identification by pairwise sequence alignment. Our domain identification protocol has been tested on all the modular PKS with known substrates, and the results have been compared with those from the CDD server at NCBI, which is widely used for identification of protein domains. While CDD gives an ambiguous prediction for reductive domains because it cannot distinguish between ER and KR domains, our program predicts ER and KR domains correctly. In some cases we have also identified ACP domains which are not predicted by CDD, and the presence of these ACP domains in the sequence is consistent with the experimentally characterised polyketide product.
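The template-based domain search described above can be illustrated with a minimal sketch. It assumes a Smith-Waterman local alignment with a simple match/mismatch score and a linear gap penalty; the report does not state which alignment method, scoring matrix or cutoff was actually used. The function names, the 80% cutoff and the toy sequences are hypothetical and are not real PKS domain sequences. For brevity the sketch reports only the single best hit per domain type, whereas a real modular PKS contains one copy of a domain per module.

```python
def local_alignment(query, template, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment with a linear gap penalty.

    Returns (best_score, start, end), where query[start:end] is the
    query segment covered by the best-scoring local alignment.
    """
    n, m = len(query), len(template)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    # origin[i][j]: query index where the alignment ending at (i, j) begins
    origin = [[i] * (m + 1) for i in range(n + 1)]
    best = (0, 0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if query[i - 1] == template[j - 1] else mismatch
            candidates = [
                (score[i - 1][j - 1] + pair, origin[i - 1][j - 1]),  # align residues
                (score[i - 1][j] + gap, origin[i - 1][j]),           # gap in template
                (score[i][j - 1] + gap, origin[i][j - 1]),           # gap in query
            ]
            cell_score, cell_origin = max(candidates, key=lambda c: c[0])
            if cell_score <= 0:
                # Local alignment: restart rather than carry a negative score
                cell_score, cell_origin = 0, i
            score[i][j], origin[i][j] = cell_score, cell_origin
            if cell_score > best[0]:
                best = (cell_score, cell_origin, i)
    return best


def identify_domains(query, templates, min_fraction=0.8, match=2):
    """Report the best hit per domain type, trying every reference template.

    A hit is accepted when its score reaches min_fraction of the template's
    self-alignment score (an assumed, illustrative cutoff).
    """
    hits = []
    for domain, references in templates.items():
        best_hit = None
        for reference in references:
            hit_score, start, end = local_alignment(query, reference, match=match)
            if hit_score >= min_fraction * match * len(reference):
                if best_hit is None or hit_score > best_hit[3]:
                    best_hit = (domain, start, end, hit_score)
        if best_hit is not None:
            hits.append(best_hit)
    return sorted(hits, key=lambda hit: hit[1])


# Toy data (hypothetical): one template for KS and AT, two for the variable ACP
TEMPLATES = {
    "KS": ["TACSSS"],
    "AT": ["GHSLGE"],
    "ACP": ["LGFDSL", "LGLDSI"],
}
QUERY = "MMMTACSSSMMMGHSLGEMMMLGLDSIMMM"
DOMAIN_HITS = identify_domains(QUERY, TEMPLATES)
```

In this toy case the first ACP template scores below the cutoff and the domain is found only through the second template, which mirrors the reason given above for keeping several ACP references.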

The next step in the computational prediction of the polyketide product is to determine the specificity of the AT domains for various types of extender units. The active site residues in the various AT domains have been identified by multiple sequence alignment and by pairwise alignment of each AT domain sequence with the sequence of the acyl transferase from E. coli fatty acid synthase, for which a crystal structure is available. Comparison of the active site sequences using evolutionary dendrograms indicates that there are distinct patterns of active site residues characterizing specificity for malonate, methyl malonate and other unusual substrates. Docking of various types of substrates on the homology models of AT domains is being carried out to understand the structural basis of their specificity, which would presumably permit more accurate prediction.
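As an illustration of how active site residues can be read off a pairwise alignment with a structural reference, the sketch below maps reference positions through a gapped alignment and assigns a specificity by comparing the resulting residue pattern with known patterns. The aligned strings, the position numbers and the known patterns are hypothetical placeholders, and the nearest-pattern rule (Hamming distance) is an assumed simplification of the dendrogram-based comparison described above.

```python
def active_site_signature(aligned_ref, aligned_target, ref_positions):
    """Return the target residues aligned to the given reference positions.

    aligned_ref and aligned_target are equal-length gapped strings from a
    pairwise alignment; ref_positions are 1-based positions in the ungapped
    reference sequence. A gap in the target is reported as '-'.
    """
    if len(aligned_ref) != len(aligned_target):
        raise ValueError("aligned sequences must have the same length")
    wanted = set(ref_positions)
    found = {}
    ref_index = 0
    for ref_char, target_char in zip(aligned_ref, aligned_target):
        if ref_char == "-":
            continue  # insertion in the target: no reference position here
        ref_index += 1
        if ref_index in wanted:
            found[ref_index] = target_char
    missing = wanted - set(found)
    if missing:
        raise ValueError(f"positions beyond reference length: {sorted(missing)}")
    return "".join(found[pos] for pos in sorted(wanted))


def hamming(a, b):
    """Number of differing positions between two equal-length signatures."""
    if len(a) != len(b):
        raise ValueError("signatures must have the same length")
    return sum(x != y for x, y in zip(a, b))


def predict_specificity(signature, known_signatures):
    """Pick the substrate whose known signature is closest to the query."""
    return min(known_signatures,
               key=lambda substrate: hamming(signature, known_signatures[substrate]))


# Hypothetical alignment of a reference (top) and one AT domain (bottom)
ALIGNED_REF = "MK-TAYHSG"
ALIGNED_TARGET = "MRQTA-HAG"
SIGNATURE = active_site_signature(ALIGNED_REF, ALIGNED_TARGET, [2, 5, 7])

# Hypothetical signatures for domains of known specificity
KNOWN = {"malonate": "RQA", "methylmalonate": "KQS"}
PREDICTED = predict_specificity(SIGNATURE, KNOWN)
```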

Based on the results of our computational analysis of modular PKS sequences, we have developed web-enabled software for prediction of PKS domains in a given protein sequence. The program pictorially depicts the various domains and inter-domain linker regions, with clickable links to their sequences in FASTA format. Using this program we have developed a searchable database for modular PKS in collaboration with Dr Gokhale’s group. This database gives the domain organization of each modular PKS and the chemical structure of the polyketide product. It also permits searching for domains by their specificity for various extender units or by their level of sequence similarity to, or divergence from, a selected domain. The database is useful not only for our knowledge-based approach, which involves regular addition of new sequence data and analysis using different training and test sets, but will also be a valuable tool for choosing domains in the rational design of novel polyketides.

Substrate specificity of chalcone synthase (CHS)

Using our computational approach, active site residues have been identified for all CHS-like proteins with known substrates. The active site geometry in the homology models of these proteins has been analyzed in detail to find the residues which control selection of starter units and cyclization of the polyketide chain. Recent crystallographic and biochemical studies on CHS mutants have demonstrated that one can indeed produce altered polyketide products by mutating these residues, and that the mutant CHS structures have a very low RMSD (less than 1.0 Å) from the native CHS. These experimental studies lend further support to our assumption that substrate specificity of CHS-like proteins can be predicted by our knowledge-based approach. Cavity volumes have been computed for structural models of various CHS-like proteins and compared with the sizes of the polyketide products synthesized by these proteins and the number of condensation steps involved in their synthesis. An attempt is being made to investigate whether, given a PKS sequence, one can predict the number of condensation steps it catalyzes and the cyclization pattern of the polyketide intermediate.
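For reference, the RMSD between two structures with N equivalent atoms is the square root of the mean squared distance between paired atoms. The sketch below computes it for coordinates that are assumed to be already superposed and listed in the same atom order; it performs no optimal fitting, and the coordinates shown are made-up values rather than CHS data.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two pre-superposed coordinate sets.

    Each argument is a sequence of (x, y, z) tuples in angstroms, with atoms
    listed in the same order in both sets.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate sets must be non-empty and of equal size")
    squared_sum = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared_sum / len(coords_a))


# Made-up coordinates: the "mutant" is the "native" shifted by 0.5 A along z
NATIVE = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
MUTANT = [(0.0, 0.0, 0.5), (1.5, 0.0, 0.5), (3.0, 0.0, 0.5)]
DEVIATION = rmsd(NATIVE, MUTANT)
```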

Computational analysis of acyl CoA synthetase like proteins from M. tb genome

In the M. tb genome, 36 genes have been annotated as acyl CoA synthetase-like proteins. It has been postulated that these enzymes are involved in the synthesis of fatty acyl CoA from fatty acid and CoA in the presence of ATP. However, the exact substrate specificity of these enzymes is not known. Since many of these genes are located adjacent to PKS or NRPS clusters in M. tb and their products are believed to be involved in loading starter units onto the PKS cluster, identification of their substrate specificity is crucial for in silico prediction of polyketide products. Our computational analysis indicates that these sequences are likely to adopt an AMP-binding fold similar to that of the adenylation domains of NRPS proteins. Detailed analysis involving the entire sequence and the active site residues indicates that 12 of these proteins have many features similar to adenylation domains of NRPS proteins, which show specificity for amino acid-like substrates rather than medium or long chain fatty acids. The rest of the acyl CoA synthetase-like proteins show features similar to known fatty acyl CoA ligase-like proteins, which take fatty acid-like substrates. Handling of such different types of substrates by proteins having the same structural fold could possibly be achieved by the presence of two different binding sites on the structure, adjacent to the conserved AMP binding site. Docking of different types of substrates is being carried out to understand the substrate specificity of these proteins.

B.    MHC-peptide interactions

The computational protocol developed for prediction of class I MHC binding peptides could predict the sequence of the peptide and its bound conformation with reasonable accuracy for 19 different MHC-peptide complexes available in the PDB. However, for further validation, the approach had to be tested on a much larger data set. Therefore, a detailed analysis of the class I MHC binding peptide sequences available in the MHCPEP database is being carried out. The purpose of this analysis is to investigate whether allele-specific contacts inferred from the MHCPEP database can be predicted by a combination of a residue-based statistical potential and a rotamer library, or whether allele-specific amino acid pairing frequencies have to be incorporated in our protocol to achieve optimal prediction. An attempt is also being made to select a few high-scoring peptides based on our knowledge-based modelling approach and to carry out detailed molecular dynamics or Monte Carlo simulations to address the flexibility of the MHC-peptide complex.
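To make the idea of a residue-based statistical potential concrete, the following sketch derives a pair potential from contact counts using the common inverse-Boltzmann form, e(a, b) = -ln(N_obs(a, b) / N_exp(a, b)), and ranks peptides by their summed contact score. This functional form, the pseudocount, the contact counts and the pocket definition are all assumptions made for illustration; the report does not specify the exact potential used in the protocol.

```python
import math
from collections import Counter


def pair_potential(contacts, pseudocount=1.0):
    """Derive a contact potential from observed (peptide, MHC) residue pairs.

    contacts is a list of (peptide_residue, mhc_residue) tuples collected from
    known complexes. The expected count assumes residues pair at random.
    More negative values mean the pair is seen more often than expected.
    """
    pair_counts = Counter(contacts)
    peptide_counts = Counter(p for p, _ in contacts)
    mhc_counts = Counter(m for _, m in contacts)
    total = len(contacts)
    potential = {}
    for p in peptide_counts:
        for m in mhc_counts:
            expected = peptide_counts[p] * mhc_counts[m] / total
            observed = pair_counts[(p, m)]
            potential[(p, m)] = -math.log(
                (observed + pseudocount) / (expected + pseudocount)
            )
    return potential


def score_peptide(peptide, pocket_contacts, potential):
    """Sum pair energies over peptide positions that contact MHC residues.

    pocket_contacts maps a 0-based peptide position to the MHC residues it
    touches; pairs absent from the potential contribute zero.
    """
    return sum(
        potential.get((peptide[position], mhc_residue), 0.0)
        for position, mhc_residues in pocket_contacts.items()
        for mhc_residue in mhc_residues
    )


def rank_peptides(peptides, pocket_contacts, potential):
    """Order peptides from most to least favourable (lowest score first)."""
    return sorted(peptides, key=lambda pep: score_peptide(pep, pocket_contacts, potential))


# Hypothetical contact statistics and anchor pockets for one allele
CONTACTS = ([("L", "Y")] * 8 + [("L", "E")] * 2 +
            [("K", "E")] * 8 + [("K", "Y")] * 2)
POTENTIAL = pair_potential(CONTACTS)
POCKETS = {1: ["Y"], 8: ["E"]}  # second and last positions of a 9-mer
RANKED = rank_peptides(["AKAAAAAAL", "AAAAAAAAA", "ALAAAAAAK"], POCKETS, POTENTIAL)
```

Incorporating allele-specific pairing frequencies, as discussed above, would amount to building a separate contact table per allele instead of a single pooled one.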

C.    Structure prediction of peptides/proteins

Apart from structure prediction of peptides, the knowledge-based method has also been used for structure prediction of proteins. The M. tb genome contains a large number of protein sequences which show no detectable sequence similarity to any known protein and have therefore been categorized as proteins of unknown function. We have used threading methods to predict whether these sequences adopt one of the known structural folds, even in the absence of any detectable sequence similarity. It was found that for many of these proteins it is possible to assign a fold with high statistical confidence. We have also checked whether these sequences contain the conserved catalytic or active site residues required for a function compatible with the assigned structural fold. The presence of such conserved residues further supports the reliability of the fold prediction method.