Is all RNA translated?

Classification Learning Problems in Bioinformatics
Quan Zou (邹权), Department of Computer Science, Xiamen University
http://datamining.xmu.edu.cn/main/~zq

Outline
• Classification learning problems at the DNA/RNA level
• Classification learning problems at the protein level
• Classification learning problems brought by new technologies

• Some concepts: gene, genome, DNA, chromosome, cell

• How does DNA affect an organism's traits?
• What is expression?

The central dogma
. . ATT CAC AGT GGA . .  ->  I H S G

• Is all DNA transcribed?
  - In humans, only 1%
• Is all RNA translated?

Gene structure in eukaryotes
[Figure: schematic of eukaryotic gene structure. Labels: non-coding region, RNA polymerase binding site, coding region, exon, non-coding region, intron]

• Problem 1: identifying coding regions (ORFs)
  Snyder, E. E., and Stormo, G. D. (1993). Identification of coding regions in genomic DNA sequences: An application of dynamic programming and neural networks. Nucleic Acids Res. 21: 607-613.
• Problem 2: distinguishing exons from introns
  Chen, T.M., Lu, C.C., Li, W.H. (2005) Prediction of splice sites with dependency graphs and their expanded Bayesian networks. Bioinformatics, 21: 471-482.
• Problem 3: identifying alternative splicing
  Dror, G. et al. (2005) Accurate identification of alternatively spliced exons using support vector machine. Bioinformatics, 21: 897-901.
• Problem 4: identifying regulatory elements
  Jiang B, Zhang MQ, Zhang X (2007) OSCAR: one-class SVM for accurate recognition of cis-elements. Bioinformatics, 23(5): 531-537.

Problem 1: identifying ORFs
• Neural networks (GRAIL: a multi-agent neural network system for gene identification)
• HMM
• Decision trees (A decision tree system for finding genes in DNA. JCB, 1998)

Problem 2: distinguishing exons from introns
• The boundary between an exon and an intron is a splice site, so this task can also be called "splice site recognition"
• Features: trinucleotides ...
• Classifiers: SVM, NB, HMM, BP NN
[Figure labels: coding region, exon, intron]

Problem 3: identifying alternative splicing

Problem 4: identifying motifs
• EM algorithm
• Gibbs sampling
Reference: Wang Jun, Guo Maozu. Research on algorithms for identifying transcription factor binding sites. Acta Electronica Sinica. 2007, 35(12A): 83-89 (in Chinese)

• Is all DNA transcribed?
  - In humans, only 1%
  - Four machine-learning-related problems so far, and there are more
• Is all RNA translated?
  - Coding RNA and non-coding RNA

[Figure labels: DNA, chromosome, transcription, ncRNA, miRNA, mRNA, tRNA, rRNA, nucleolus, translation, protein, ribosome, cytoplasm]

Examples of diseases in whose regulation microRNA is involved
• Cancers
  - Thymic cancer
  - Lung cancer
  - Rectal cancer
  - Leukemia
  - Skin cancer
  - Neuroblastoma
  - Nasopharyngeal cancer
  - Ovarian cancer
• Other diseases
  - Senile dementia
  - Diabetes
  - Cardiac hypertrophy
  - AIDS

Study 1: find precursors in long DNA sequences
Study 2: find targets given the mature microRNA
[Figure labels: DNA, microRNA precursor, nucleus, nuclear export, cytoplasm, mature microRNA, target, mRNA]

Classification problems in microRNA research
• Mining: telling real precursors from pseudo precursors
  - Homology search
  - ab initio
• Targets: telling real targets from false targets

Homology-based methods
• Use known microRNA information
• BLAST
• Step-by-step filtering
Reference: Wang, X.J. et al. (2004) Prediction and identification of Arabidopsis thaliana microRNA genes and their mRNA targets. Genome Biology. 5:R65

microRNA mining: ab initio methods
• Chenghai Xue, Fei Li, Tao He, Guo-Ping Liu, Yanda Li, Xuegong Zhang. Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine. BMC Bioinformatics. 2005, 6:310 (cited 167 times by others as of 11.12.12)
• Peng Jiang, Haonan Wu, Wenkai Wang, Wei Ma, Xiao Sun, Zuhong Lu. MiPred: classification of real and pseudo microRNA precursors using random forest prediction model with combined features. Nucleic Acids Research. 2007, 35: W339-W344 (cited 107 times by others as of 11.12.12)

Primary sequence
CUUUCUACACAGGUUGGGAUCGGUUGCAAUGCUGUGUUUCUGUAUGGUAUUGCACUUGUCCCGGCCUGUUGAGUUUGG
Secondary structure
..(((...((((((((((((.(((.(((((((((((......)))))))))))))).)))))))))))).))).....
Note: "(" and ")" have the same meaning; both indicate that the nucleotide is paired. "." indicates that the nucleotide is unpaired.
Each triplet combines one nucleotide with the pairing states of itself and its two adjacent nucleotides.
32 triplet elements, giving a 32-dimensional feature vector:
(U(((, U((., U(.(, U(.., U.((, U.(., U..(, U..., G(((, G((., ...)
Number of occurrences:
(12, 4, 3, 1, 2, 0, 0, 0, 10, 1, ...)
Normalized triplets:
(0.1846, 0.0615, 0.0462, 0.0154, 0.0308, 0, 0, 0, 0.1538, 0.0154, ...)

http://dbgroup.cs.tsinghua.edu.cn/zouquan/libid/

Classification problems in microRNA research
• Mining: telling real precursors from pseudo precursors
  - Homology search
  - ab initio
• Targets: telling real targets from false targets

Target prediction
Reference: Improving the prediction of human microRNA target genes by using ensemble algorithm. FEBS Letters 581 (2007) 1587-1593

Outline
• Machine learning problems at the DNA/RNA level
• Machine learning problems at the protein level
  - Classification and identification
  - Structure prediction
  - Interaction prediction
• Machine learning problems brought by new technologies

Reference: LY Han, J Cui, HH Lin, ZL Ji, ZW Cao, YS Li, and YZ Chen. Recent progresses in the application of machine learning approach for predicting protein functional class independent of sequence similarity. Proteomics 2006, 6(14): 4023-4037

The Protein Folding Problem
• Secondary structures
  - α-helix
    Average 10 residues, or three turns
    Glutamine, methionine, and leucine favor the α-helix
    Valine, serine, aspartic acid, and asparagine tend to destabilize helices
  - β-sheet
    Generally 5~10 residues
    Valine, isoleucine, and phenylalanine enhance β-sheets
    Proline doesn't fit well into β-sheets
  - Loop
    The sections of the sequence that connect the other two kinds of secondary structure

Protein secondary structure prediction
• Input
  IRNSSNISPASMIFRNLLILEDDLRRQAHEQKILKWQFTLFLASMAGVGAFTFYELYF
• Output
  -----------HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH----EEEEEEEE
Reference: Fusion of classifiers for protein fold recognition. Neurocomputing 68 (2005) 315-321

Interaction prediction
• Interaction networks
• Interaction site prediction
References:
Yu Jiantao, Guo Maozu, Cai Lu. Advances in research on protein interactions and methods for predicting their networks. Acta Electronica Sinica. 2007, 35(12A): 1-7 (in Chinese)
Li Minghui, et al. Protein-protein interaction site prediction based on conditional random fields. Bioinformatics. Vol. 23 no. 5 2007, pages 597-604

Outline
• Machine learning problems at the DNA/RNA level
• Machine learning problems at the protein level
• Machine learning problems brought by new technologies
  - microArray
  - Assembling
  - SNP

Machine learning problems in microarray analysis
• Sample classification (disease diagnosis)
  - High-dimensional data, small samples
  - Cost-sensitive
  - Imbalanced positive and negative examples
  - Missing attributes
• Gene clustering
  - Hierarchical clustering
References:
Clustering of synchronously and asynchronously co-regulated genes in time-series microarray data. Chinese Journal of Computers. 2007, 30: 1302-1314 (in Chinese)
A multi-class cancer classification algorithm for gene expression data based on a class tree and SVM. Journal of Computer Research and Development. 2004, 41: 436-441 (in Chinese)
Hierarchical clustering of gene expression profiles with graphics hardware acceleration. Pattern Recognition Letters. 2006, 27: 676-681
A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics. 2005, 21: 631-643

Development of sequencing technology (1)
Next-generation platforms already on the market
• GA (Illumina/Solexa): SBS (sequencing by synthesis) using reversible fluorescent terminators (FISSEQ)
• GS FLX (Roche/454 Life Sciences): SBS using pyrosequencing
• SOLiD (ABI/Agencourt): SBL (sequencing by ligation) using two-base encoding

Thirty years of sequencing development

Development of sequencing technology (2)
2nd Generation Performance

Repeat regions

Fragment assembly under graph models
References:
Butler, J., Maccallum, I., Kleber, M., Shlyakhter, I.A., Belmonte, M.K., Lander, E.S., Nusbaum, C., and Jaffe, D.B. ALLPATHS: De novo assembly of whole-genome shotgun microreads. Genome Res. 2008, 18: 810-820.
Zerbino, D. and Birney, E. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008, 18: 821-829.
J.T. Simpson et al. ABySS: A parallel assembler for short read sequence data. Genome Res. 2009

SNP
• What is a SNP?
• Why study it?

Computational problems in SNP research
• Disease prediction / population classification
  Reference: Haplotype Pattern Mining & Classification for detecting disease associated Site. CSB 2003
• nsSNP
  Reference: Finding new structural and sequence attributes to predict possible disease association of single amino acid polymorphism (SAP). Bioinformatics. 2007, 23(12): 1444-1450
• tagSNP
  Reference: Jun Wang, Mao-zu Guo, Chun-yu Wang. CGTS: a site-clustering graph based tagSNP selection algorithm in genotype data. BMC Bioinformatics. 2009
• SNP mining
  Reference: Jun Wang*, Quan Zou*, Maozu Guo. Mining SNPs from EST sequences using filters and ensemble classifiers. Genetics and Molecular Research. 2010, 9(2): 820-834.
• Genome compression
  compress a human genome from 3.2GB to 4.1MB
  Reference: Human genomes as email attachments. Bioinformatics 25: 274-275 (2009).

• Comments and suggestions of any kind are welcome
• zouquan@xmu.edu.cn
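The triplet features on the ab initio microRNA slide (one nucleotide plus the pairing states of itself and its two neighbors, 32 elements in total, normalized by the total count) can be sketched in Python as below. This is a minimal illustration, not the implementation of Xue et al.: the names `TRIPLETS` and `triplet_features` are hypothetical, the ordering of the 32 elements is arbitrary, and the sketch counts every interior position of the sequence. The slide does not state which positions were counted, so the output is not expected to reproduce the slide's exact numbers.

```python
from collections import Counter
from itertools import product

NUCLEOTIDES = "ACGU"
STATES = "(."  # "(" = paired (also used for ")"), "." = unpaired

# 4 nucleotides x 2^3 pairing patterns = 32 triplet elements.
# The ordering here is an assumption; the slide lists U first.
TRIPLETS = [n + "".join(s) for n in NUCLEOTIDES for s in product(STATES, repeat=3)]


def triplet_features(seq, struct):
    """Return the 32 normalized triplet frequencies for a sequence
    and its dot-bracket secondary structure."""
    if len(seq) != len(struct):
        raise ValueError("sequence and structure must have the same length")
    seq = seq.upper()
    # "(" and ")" both mean "paired", so merge them into one symbol.
    paired = struct.replace(")", "(")
    # Middle nucleotide + pairing state of (left, middle, right).
    counts = Counter(
        seq[i] + paired[i - 1:i + 2]
        for i in range(1, len(seq) - 1)
        if seq[i] in NUCLEOTIDES
    )
    total = sum(counts.values())
    return [counts[t] / total if total else 0.0 for t in TRIPLETS]


# Toy input (not from the slides): two interior positions, U and A.
features = triplet_features("GUAC", "(..)")
nonzero = {t: f for t, f in zip(TRIPLETS, features) if f > 0}
print(nonzero)
```

The resulting 32-dimensional vector is the kind of input that a classifier such as an SVM could then use to separate real from pseudo precursors.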
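For the slide on fragment assembly under graph models, the de Bruijn graph idea named in the Velvet reference can be illustrated with a small Python sketch. It only builds the graph (nodes are (k-1)-mers, edges are k-mers seen in the reads) and does not perform error correction or contig extraction, so it should not be read as how Velvet, ALLPATHS, or ABySS work internally. The function name `de_bruijn_graph` and the toy reads are assumptions made for this example.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it
    in at least one k-mer of the reads."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Toy overlapping reads (hypothetical, chosen for illustration).
reads = ["ATTCAC", "TCACAG", "ACAGTG"]
graph = de_bruijn_graph(reads, k=4)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Overlapping reads share k-mers, so they merge into the same path in the graph. Repeat regions create nodes with several outgoing edges, which is what makes assembly ambiguous.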
