CGDB_FAQs_CGIdent_Text

Identification of conjoined genes in the human genome

To estimate the total number of conjoined genes in the human genome, the UCSC alignments of the known gene, mRNA and EST tracks were used. A simple automated algorithm, "Conjoin", was developed and applied to the entire human genome (UCSC assembly of March 2006). The steps of the algorithm are as follows:

(i) Attach the Entrez Gene ID to the RefSeq ID of every known gene, using the UCSC file that associates known genes with their respective Entrez Gene IDs.

(ii) For a given genome, compare the alignments of the known gene track with those of the mRNA and EST tracks to identify cases where two distinct known genes (child genes) with different Entrez Gene IDs form part of a single mRNA or EST. Such mRNAs or ESTs are reported as candidate conjoined genes.

(iii) Based on the HGNC gene symbols of the child genes, classify the conjoined gene candidates as belonging to the same or to different gene families.

(iv) Based on the location of the reference coordinates of the child genes with respect to each other, classify the conjoined genes as formed by non-overlapping or by overlapping child genes (the latter includes a gene within a gene and partially overlapping genes). For conjoined genes formed by non-overlapping child genes, candidates in which at least one exon from each child gene has a match of more than 30 bp are considered true conjoined genes. For conjoined genes formed by overlapping child genes, a pre-processing step first excludes all overlapping exons from both child genes. Candidates in which at least one of the remaining unique exons from each child gene has a match of more than 30 bp are then considered true conjoined gene candidates.

(v) To remove false positives arising from duplicated regions in the genome, from genes belonging to the same gene family, or from variants of the same gene locus, perform a final step of manual curation.
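The exon-level test in step (iv) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Conjoin implementation: the `Exon` type, the function names, the use of half-open coordinates, and the choice to take a gene's span as running from its first exon start to its last exon end are all hypothetical. Only the threshold (a match of more than 30 bp) and the exclusion of overlapping exons come from the text.

```python
from dataclasses import dataclass

MIN_MATCH_BP = 30  # from the text: a match of MORE than 30 bp is required


@dataclass(frozen=True)
class Exon:
    """A genomic interval; half-open coordinates are an assumption."""
    start: int  # inclusive
    end: int    # exclusive


def overlap_bp(a, b):
    """Number of bases shared by two intervals (0 if they are disjoint)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def gene_span(exons):
    """Assumed gene extent: first exon start to last exon end."""
    return Exon(min(e.start for e in exons), max(e.end for e in exons))


def classify_child_genes(exons_a, exons_b):
    """Label a pair of child genes by the relative position of their spans."""
    if overlap_bp(gene_span(exons_a), gene_span(exons_b)) > 0:
        return "overlapping"
    return "non-overlapping"


def unique_exons(exons, other_exons):
    """Pre-processing for overlapping genes: drop exons shared with the other gene."""
    return [e for e in exons if all(overlap_bp(e, o) == 0 for o in other_exons)]


def has_supporting_exon(exons, transcript_blocks):
    """True if any exon matches an aligned transcript block by more than 30 bp."""
    return any(
        overlap_bp(e, b) > MIN_MATCH_BP for e in exons for b in transcript_blocks
    )


def passes_exon_criterion(exons_a, exons_b, transcript_blocks):
    """Apply the step (iv) test to one mRNA/EST alignment and two child genes."""
    if classify_child_genes(exons_a, exons_b) == "overlapping":
        # Both filters use the ORIGINAL exon lists of the other gene.
        exons_a, exons_b = (
            unique_exons(exons_a, exons_b),
            unique_exons(exons_b, exons_a),
        )
    return has_supporting_exon(exons_a, transcript_blocks) and has_supporting_exon(
        exons_b, transcript_blocks
    )
```

A candidate that passes this test would still go through the manual curation of step (v); the sketch covers only the automated exon-match filter.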
In some cases more than one transcript of a child gene was found to satisfy the above criteria; all such transcripts are considered to contribute to the formation of conjoined genes. This resulted in 623 and 942 conjoined gene candidates using the human mRNA and EST libraries, respectively. On merging gene symbol information from HGNC, some transcripts were identified as formed from child genes belonging to the same gene family. The products of genes from the same gene family usually show more than 40% amino acid sequence identity. The mRNA or EST sequences spanning two or more such child genes therefore have a higher chance of being misaligned or of aligning at more than two locations. In addition, because EST sequences are of poor quality and relatively short, a similar situation can arise when aligning ESTs to the whole genome. Alternatively spliced variants of one gene locus may also carry similar gene names but different Entrez Gene IDs, so some conjoined gene candidates could result from such child genes as well.

To obtain the true conjoined genes, it was essential to remove all candidates that arise from child genes belonging to the same gene family or representing variants of the same locus. For this purpose a manual curation step was introduced, in which the reference sequence coordinates of the child genes were used to remove the false positives. This resulted in 91 conjoined genes supported by both mRNA and EST evidence. In addition, 226 and 434 conjoined genes were identified that were supported by only mRNA and only EST transcripts, respectively. Of the 751 conjoined genes, 155 had more than one mRNA or EST transcript as evidence, while the large majority (596) were represented by only one mRNA or EST transcript. In total, the 751 conjoined genes were represented by 1,466 mRNA and EST transcripts connecting 1,451 human child genes.
There were 38 mRNA and 45 EST conjoined genes formed by child genes that belong to the same gene family or represent variants of the same locus.
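The evidence categories reported above (both mRNA and EST, mRNA only, EST only) amount to a partition of the conjoined genes by the kinds of transcript that support them. The following is a minimal sketch of such a tally, assuming a hypothetical mapping from a conjoined gene identifier to its set of evidence types. The identifiers in `toy` are invented for illustration; only the figures in `REPORTED` and the two totals checked at the end come from the text.

```python
from collections import Counter


def tally_by_evidence(evidence):
    """Count conjoined genes by the kinds of transcript evidence behind them.

    `evidence` maps a gene identifier to a set drawn from {"mRNA", "EST"}.
    """
    counts = Counter()
    for gene_id, kinds in evidence.items():
        if kinds == {"mRNA", "EST"}:
            counts["both"] += 1
        elif kinds == {"mRNA"}:
            counts["mRNA only"] += 1
        elif kinds == {"EST"}:
            counts["EST only"] += 1
        else:
            raise ValueError(f"unexpected evidence for {gene_id}: {kinds!r}")
    return dict(counts)


# Hypothetical toy input; these identifiers are not real database entries.
toy = {
    "CG1": {"mRNA", "EST"},
    "CG2": {"mRNA"},
    "CG3": {"EST"},
    "CG4": {"EST"},
}
toy_counts = tally_by_evidence(toy)

# Figures reported in the text, checked for internal consistency.
REPORTED = {"both": 91, "mRNA only": 226, "EST only": 434}
TOTAL_CONJOINED = 751
assert sum(REPORTED.values()) == TOTAL_CONJOINED  # 91 + 226 + 434
assert 155 + 596 == TOTAL_CONJOINED  # multi-transcript + single-transcript
```

Both reported breakdowns sum to the same total of 751 conjoined genes, which is what the two assertions confirm.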
