Laboratory of Structural Biology, Tsinghua University
Structural Genomics & Proteomics: A Big Stage for e-Science
Structural Genomics & Structural Proteomics in Relation to Disease and Innovative Drugs
GENE → PROTEIN → STRUCTURE → FUNCTION (DRUG)

With access to sequences of entire human genomes plus those of various model organisms and many important microbial pathogens, structural biology is on the verge of a dramatic transformation. Our newfound wealth of sequence information will serve as the foundation for an important initiative in structural genomics. We are poised to embark on a systematic program of high-throughput X-ray crystallography and NMR spectroscopy aimed at developing a comprehensive view of the protein structure universe. Structural genomics will yield a large number of experimental protein structures (tens of thousands) and an even larger number of calculated comparative protein structure models (millions). This enormous body of structural data will be freely available, and promises to accelerate scientific discovery in all areas of biologic science, including biodiversity and evolution in natural ecosystems, agricultural plant genetics, breeding of farm and domestic animals, and human health and disease.
-- Stephen K. Burley (Nature Structural Biology, Volume 7, Suppl.)

Structural Genomics Initiative in China: Another 1%?

1% of the Human Genome Sequence & a Draft of the Whole Rice Genome Sequence
China is one of the six countries that provided facilities for the Human Genome Project (HGP). Since the completion of the HGP, China-based scientists have continued to make progress in genomic studies, including the recent sequencing of a rice genome, a report of which was recently published in Science.
Three major genome research centers in China have been involved in the HGP:

BGI (Beijing Genomics Institute, CAS)
Beijing Genomics Institute (BGI) is the largest non-profit genomics research institute in China.
Founded in July 1999 by a group of overseas Chinese scientists, BGI has grown rapidly with support from the Chinese Academy of Sciences.
Major sequencing projects at BGI:
• The Human Genome
• Thermoanaerobacter tengcongensis Genome
• The Porcine Genome
• The Super Hybrid Rice Genome Project

Chinese National Human Genome Center, Beijing
http://www.chgb.org.cn

CHGC (Chinese National Human Genome Center at Shanghai)
The Chinese National Human Genome Center at Shanghai was established on March 4, 1998 and formally opened on October 29, 1998. The Center is supported by the national Ministry of Science and Technology (MST), the Municipal Government and the Chinese Academy of Sciences (CAS).
• Human Genome Sequencing
• Pathogenic Microbe Genome Sequencing
• EST Sequencing and Full-length cDNA Cloning
• Disease Gene Positional Cloning and Scanning

Structural Biology in China
In 1967, China set up one of the earliest protein crystallography research groups, the Peking Insulin Structure Group, CAS, which solved the crystal structure of pig insulin in 1972. In the following years, China successfully exchanged expertise with countries all over the world. At present, about 10 major research groups based in China are undertaking structural biology studies.

Institute of Biophysics, CAS (formerly the Peking Insulin Structure Group)
Peking Insulin Structure Group (1973).
Crystal structure of insulin (1.8 Å)
UK-China collaboration in structural biology was established by Professor Hodgkin in the early 1970s.

Institute of Biophysics, CAS
Structural Biology Project, National Biology Macromolecular Laboratory
Protein Crystallography Project, National Biology Macromolecular Laboratory
• Copper-containing nitrite reductase mutant (1.65 Å)
Protein Engineering Project
• Structure of trichosanthin
• Ca2+-linked agkistrodotoxin dimer
• Scorpion neurotoxin active against mammals

Shanghai Institute of Organic Chemistry, CAS
• HIV-RT
• Tryptic fragment of recombinant cytochrome b5

Shanghai Institute of Biochemistry, CAS (Dr Jianping Ding's group)

Fujian Institute of Structural Chemistry, CAS
• Two trichosanthin molecules in the asymmetric unit
• Thermostable catechol 2,3-dioxygenase (2.0 Å)

Fudan University

University of Science & Technology of China, CAS
NMR Project, Structural Biology Laboratory, USTC
• Solution conformation of the hairpin DNA GCGCGAAAC-T-GTTTCGCGC, from 2D NMR measurements and molecular dynamics simulation; the red and green lines show the results of IRMA calculations starting from A-DNA and B-DNA, respectively
Protein Crystallography Project, Structural Biology Laboratory, USTC
• C-type lectin-like protein (a component of Agkistrodon acutus snake venom)
Protein Structure Project

Peking University
Protein Structure Project, Physical Chemistry Institute
• Oxyhemoglobin of the bar-headed goose and the greylag goose
NMR Project
• Ternary complex of porcine trypsin with mung bean inhibitor (2.7 Å)
EM Project
• α-Amylase inhibitor
Professor Ming Lou Group
• Rice dwarf virus (RDV) inner shell: backbone structure

Proposed by Professor Dong-Cai Liang, the former leader of the Peking Insulin Structure Group.
Structural Genomics Initiative

Structural Genomics Initiative in China
Supported by the National Biotechnology R&D Program and the National Frontier Research Program of the Ministry of Science & Technology, together with the National Natural Science Foundation, China officially started its "Structural Genomics Initiative" at the end of last year. Based on our achievements in genomic and medical studies, we have initiated pilot studies focused on human disease-related proteins.
The aims of the initiative are to establish a fully integrated, high-throughput system for target gene selection, expression and purification of selected proteins, protein chemistry and biochemical characterization, and protein crystallization and X-ray crystallography, leading finally to structure-based functional genomics and drug discovery.

We now have two "SGI" centers in China:
1. CAS (Chinese Academy of Sciences) (Y. Shi): 150 structures?
   • Institute of Biophysics, Academia Sinica (D. Wang)
   • Shanghai Institute of Biochemistry (J. Ding)
   • University of Science & Technology of China (Y. Shi)
2. MOE (Ministry of Education) (Z. Rao): 300 structures?
   • Tsinghua University (Zihe Rao)
   • Peking University (Min Luo)
With joint efforts in China, as well as international collaborations, our ultimate goal is to determine the structures of more than 1% of the total proteins expressed from the human genome (say, >1,000 structures), or more than 1% of the protein folds still to be discovered.

(MOE Center) Structural Genomics Initiative of Tsinghua University
Since December 2001, Tsinghua University has been running its Structural Genomics Initiative project for high-throughput structural and functional studies of proteins, funded by the 863 Program (MOST).
The mission of the Tsinghua Structural Genomics Initiative (TSGI) is to select human disease-related genes of unknown structure and determine 150(?) structures (new fold, or sequence homology of less than 30% against the PDB) by high-throughput methodologies within 5 years. We also focus on human proteins with a potential new fold.

Strategy:
Our main collaborators have offered to provide 840 soluble proteins:
1. 200 expression vectors (soluble) are expected from the group led by Prof. Qian Bo-qun, Chinese National Human Genome Center, Beijing.
2. The Cancer Institute, Chinese Academy of Medical Sciences, will supply 500 tumor-related genes or proteins (soluble).
3. The group led by Prof.
He Fuchu from AMMS will provide as many as 100 expression vectors (soluble).
4. More than 40 soluble tumor antigens, selected with an antibody library selection platform, will be added to the list within the coming 5 years by the group led by Prof. Chen Zhinan from the FMMU.
We also expect to produce soluble proteins in our own lab: a high-throughput group led by Prof. Pang Hai is to provide 1000(?) constructs for protein expression (soluble).

Strategy: Target selection strategy
Choosing an appropriate target source is a key step in effective target selection. We have developed several strategies to address this seemingly tough issue. In my personal view, there are two extremes among the strategies for choosing SG (structural genomics) targets: 1) issue-oriented targets, regardless of technical obstacles; 2) the low-hanging fruit, regardless of functional relevance. Of course, a combination of the two strategies is the best solution to target selection.

Target Sources
1. Scientific literature: CNS
2. Partners: CAS, National Genome Center, etc.
3. Public databases: NCBI, etc.

Selective Filter
Full length analysis
1. Have structures in PDB? (Similarity <30%)
2. Gene size? (>2.0 kb)
3. Have conserved domain? (New or old gene?)
Domain analysis
Single or multiple domain? (function clues)
Property prediction
1. Have signal peptide?
2. Have transmembrane regions?
3. Secondary structure, pI, Mw, Cys%, Met%, extinction
Fold recognition
3D structure prediction by threading

Crystallizable Targets
1. Soluble and stable
2. No transmembrane region
3. Appropriate protein size
4. No low-complexity region

Targets Prioritization
1. Medical relevance or functional significance
2. Protein/protein or protein/nucleic acid complex
3. Property
a. pI
b. Number of methionines (MAD)
c.
Solubility and stability
Target Database

Strategy: TSGI's target selection platform modules
Routine modules already contained in the current selection platform:
• Automated modules:
a. Threading: z-score (against PDB)
b. PSI-BLAST: homology, E-value (against PDB)
c. TMHMM: signal peptide, TMHs, topology
d. SeqQ: solubility prediction
• Manual modules:
e. ExPASy
   - ProtParam: Mw, pI, extinction, Cys%, Met%
   - GOR IV: secondary structure prediction
f. Pfam/SMART: domain boundary determination and function annotation (>20 kD proteins)
g. NCBI (PubMed, OMIM)/GeneCards: reference papers

Strategy: Target selection process
Our laboratory's target selection strategy is built on bioinformatics programs, algorithms and public databases, and is designed to meet the demands of high-throughput structure determination. We have been developing an automatic target selection platform that integrates existing public servers and databases with local BLAST and threading programs, speeding up the transition from bit-by-bit lab work to high-throughput methodologies.
• Legend:
A: Files or databases
B: Program developed by TSGI
C: Publicly available server/software
D: Manual procedure
Target selection program (1): Extract Hs_seq_uniq
Target selection program (2): Batch file / C shell writer, Target_sorting, PDB_new fold_sorting

Example: Target selection results
We have selected ~2500 tractable genes from several different databases. All the target candidates meet the following criteria:
1. E-value > 10^-4;
2. No more than one transmembrane region;
3. Homology <30%;
4. Fewer than 600 amino acids.
Nearly 600 genes have 'no hits found' against the PDB database by PSI-BLAST. We expect to find potential new folds in this candidate pool by threading.
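The four criteria above can be sketched as a simple filter. This is only an illustration, not the TSGI platform code: the `Candidate` record, its field names and the helper functions are hypothetical, and the sketch assumes that the PSI-BLAST results against the PDB and the TMHMM helix counts have already been computed and parsed for each sequence.

```python
from dataclasses import dataclass
from typing import List, Optional

# Thresholds taken from the four criteria listed on the slide.
MIN_EVALUE = 1e-4          # best PDB hit must be weaker than this
MAX_TM_REGIONS = 1         # at most one predicted transmembrane region
MAX_IDENTITY_PCT = 30.0    # homology to any PDB entry below 30%
MAX_LENGTH_AA = 600        # fewer than 600 amino acids


@dataclass
class Candidate:
    """Hypothetical record holding precomputed values for one sequence."""
    gene_id: str
    length_aa: int
    tm_regions: int                                # e.g. from TMHMM
    best_pdb_evalue: Optional[float] = None        # None means "no hits found"
    best_pdb_identity_pct: Optional[float] = None  # None means "no hits found"

    @property
    def has_pdb_hit(self) -> bool:
        return self.best_pdb_evalue is not None


def is_tractable(c: Candidate) -> bool:
    """Return True if the candidate passes all four selection criteria."""
    if c.length_aa >= MAX_LENGTH_AA:
        return False
    if c.tm_regions > MAX_TM_REGIONS:
        return False
    if not c.has_pdb_hit:
        # No PDB hit at all: E-value and homology criteria hold trivially.
        return True
    identity = c.best_pdb_identity_pct or 0.0
    return c.best_pdb_evalue > MIN_EVALUE and identity < MAX_IDENTITY_PCT


def select_targets(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only the tractable candidates."""
    return [c for c in candidates if is_tractable(c)]


def new_fold_pool(candidates: List[Candidate]) -> List[Candidate]:
    """Tractable candidates with no PDB hit, i.e. the pool sent on to threading."""
    return [c for c in select_targets(candidates) if not c.has_pdb_hit]


# Made-up example records, for illustration only.
example = [
    Candidate("g1", 250, 0),               # no PDB hit
    Candidate("g2", 700, 0),               # too long
    Candidate("g3", 300, 2, 0.5, 20.0),    # too many TM regions
    Candidate("g4", 300, 1, 0.5, 20.0),    # weak, distant PDB hit
    Candidate("g5", 300, 0, 1e-30, 95.0),  # close homolog already in PDB
]
print([c.gene_id for c in select_targets(example)])
print([c.gene_id for c in new_fold_pool(example)])
```

Treating "no hits found" as passing both the E-value and homology criteria is an assumption of this sketch. It matches the idea that such sequences form the candidate pool for new folds.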
Strategy: The power of our target selection strategy

Disease-related gene program:
Target source | Total | Contain CDS | Tractable & unknown structure
1. HBV-related gene | - | - | 7
2. Leukemia-related gene | 169 | 156 | 29
3. GeneCards | 2100 | 1264 | 84

New fold searching program:
1. Archaea proteome: 2802
2. UniGene cDNA/EST: 95927; -; 10311 (<600 aa); 583*/2303$

Note:
$: E-value: 0.01, TMHs: 1, homology: 30%, a.a. number: 600
*: No hits found in PDB
0-200 aa: 3035 (2858 non-transmembrane proteins, 94.2%)
200-400 aa: 4279 (3735 non-transmembrane proteins, 87.3%)
400-600 aa: 2997 (2567 non-transmembrane proteins, 85.7%)

Experimental progress I
Pipeline stages: Target Selection, Molecular Cloning, Protein Purification, Crystallization, Data Collection, Structure Determination, PDB, Publication
Targets and PDB codes shown: Gl7aca acylase, TTR, C2, 1GHD, 1GHE, SAK, C1027, 1C76, 1J48, TRXL, Histone, 1GH2, 1KU5, S100P, MGC, 1J55, Fabp3, 111, HBV X, AFP, Decorin, HCC1, XIP, AIP, HAF4, HCA56, HCA58, SSPC, CAB1, TDO, SARG, HPV E6, BCO, SCPT, SH, STM, TFX, 5031

Experimental progress II
Pipeline stages: Target Selection, Molecular Cloning, Protein Purification, Crystallization, Data Collection, Structure Determination, PDB, Publication
Targets shown: H3, LKB1, WNK1, LKB1_D, Mad2, NDPP1, DIO1, HCCA2, NDPP1, TUCAN, CARD8, Y14, MAPKKK3, Mad3, RHAMM, RPL27, CDC20, BUB3, DEK, Pinx, HIRIP3, Rb1, E2F1AD, HIRAH4, HIRAH2B, NASP, NAP2, H2A, H2B, H4S, NAP1, H1S

Experimental progress III
Pipeline stages: Target Selection, Molecular Cloning, Protein Purification, Crystallization, Data Collection, Structure Determination, PDB, Publication
Targets shown: FOP, GAS41, SMARCB1, SIX3, SIM1, SH3GL1, SGSH, SGCG, SEDL, SCA1, SARDH, SAH, RPS19, RLBP1, RFXAP, RFXANK, REA, RAMP, RAG2, AIP, HCC1, HCA56, HCA58, XAF4, XIP, SCG3, IKBKG, HMGIC, HNMT, BCL7A, AGT, CLN5

Experimental progress IV
Pipeline stages: Target Selection, Molecular Cloning, Protein Purification, Crystallization, Data Collection, Structure Determination, PDB, Publication
Targets shown: MCCC2, MGAT2, LOR, MDS1, TRH, ITPA, SEPN1, FANCE, ACT, APOA2, NEU1, PDGFRL, TNNT1, BLNK, PEX3, PABPN1, MSF, PDHA1, ING1, P25, ARTEMIS, AMY, C10ORF2, TCAP, BBS4, NET1, BCS1L, LGS, PRCC, TPT, EPM2A, AAAS

Strategy: Tsinghua Structural Genomics Initiative (TSGI) web page: xtal.tsinghua.edu.cn
Web page sections: Introduction, Submit, Target list

Strategy: Target selection platform hardware
The target selection and sequence analysis of human genome-scale cDNA sequences from the NCBI UniGene database is carried out on the supercomputer at the Tsinghua High Performance Computing Institute (THPCI), one of our close collaborators. The computer cluster at THPCI has 34 four-CPU Pentium III Xeon units, which can run ~1000 threading jobs in parallel in just one day.

Strategy: High-throughput cloning, expression
New-generation high-throughput molecular cloning methods: Gateway Technology, NEB intein system, T vector, ...
The new generation of molecular cloning methods uses homologous recombination to avoid the low-efficiency bottleneck caused by the restriction endonucleases and ligases used in traditional molecular cloning. This greatly improves flexibility and reliability and shortens the time needed to clone genes for expression.

Strategy: High-throughput purification, crystallization, data collection, structure solution
An automated high-throughput protein crystallization system is needed.

Strategy: Evaluation of high-throughput structure determination methods
Data processing
• The goal is to derive the intensities of diffraction spots (and their standard deviations) from X-ray images and reduce the data to the appropriate crystallographic space group.
• Many fast, user-friendly software suites are available for data processing.
• The most popular software suite is the HKL (DENZO) package.
• Alternatives include MOSFLM, DLS, D*TREK, DPS, POW.
Phasing
• The SOLVE program was the first to provide fully automated phasing from an MIR experiment.
• Automated versions of other software (e.g. Auto-SHARP and CHART) will soon be available.
• Direct methods approaches (SnB, SHELXD) are very efficient but only provide the positions of heavy atom sites.
• CNS/CCP4i can proceed from heavy atom sites or a molecular replacement solution to a density map, but require user intervention.
Model building
• Many maps are of average-to-poor quality, so building an accurate model is not straightforward.
• Interactive (i.e. manual) model building is very time-consuming and introduces errors into the model.
• Tools exist to build a model from Cα coordinates; accurate recognition of Cα positions remains a challenge.
• ARP/wARP provides complete automation for building model structures at resolutions higher than 2.3 Å.
Refinement
• Adjustment of the parameters of the model through minimization of the residuals between the experimental diffraction amplitudes (Fobs) and those calculated from the model (Fcalc).
• Commonly used programs: CNS and REFMAC (CCP4).
• Other programs: TNT, SHELXL (for high-resolution refinement), BUSTER.
• Explicit definition of stereochemical restraints is required.

Software for structure determination
General packages: CNS, CCP4
Data processing: D*TREK, DPS, HKL2000/DENZO, MOSFLM (CCP4), XDS, STRATEGY, PREDICT
Phasing
• Molecular replacement: AMORE (CCP4), CNS, MOLREP (CCP4), EPMR, ARP/wARP*
• Heavy atom sites: SHELXD, SNB, RANTAN, RSPS (CCP4)
• Heavy atom phasing: CHART*, MLPHARE (CCP4), PHASES, SHARP*, SOLVE/RESOLVE*
Model building
• Pattern searching: ESSENS, FFFEAR
• Interactive graphics: MAIN, O, QUANTA, TURBO-FRODO, XTALVIEW
• Automated model building: ARP/wARP*, RESOLVE*
Refinement: BUSTER, CNS, REFMAC (CCP4), SHELXL, TNT
Validation: PROCHECK (CCP4), SFCHECK (CCP4), WHATCHECK

Examples: tabtoxin resistance protein (TTR)
• Use the TTR structure to compare several manual and automated structure determination methods.
Advantages:
• High-resolution data (1.55 Å) collected at APS, Argonne, USA
• Three-wavelength MAD data
• Data meet the criteria for automated structure determination software (e.g. ARP/wARP)

Software used | Method | Number of residues | Figure of merit <FOM> | RMSD | Approx. time
CNS & O | manual | 170 | 0.81 | - | ~1 week
CNS & ARP/wARP | semi-automated | 168 | 0.81 | 0.34 Å | ~3 days
SOLVE & ARP/wARP | automated | 168 | 0.79 | 0.33 Å | ~24 hours
CHART & ARP/wARP | automated | 168 | 0.76 | 0.26 Å | ~6 hours
[All calculations performed using an SGI Origin2000 server]

Another 1% from China? I don't know... but...
Collaboration between the UK and China in the field of structural biology has a long history and has produced a series of fruitful results. I am confident that under the umbrella of e-Science, with the joint efforts of UK and Chinese structural biologists, we can achieve our goal.

National Exhibition of "Art & Science" (a-Science), May 1-14, 2001, National Art Gallery, Beijing
Life Science Building
Campus: Main building and its surroundings
Gymnasium
Incubating high-tech enterprises (5)
Tsinghua Science Park: 20 hectares, 100,000 m2 built
• Major Tsinghua enterprises
• International corporations
• National engineering centers
Entrepreneur Park of Tsinghua University
School of Sciences

Thank you