
Identification of conjoined genes in the human genome
To estimate the total number of conjoined genes in the human genome, UCSC alignments
from the known gene, mRNA and EST tracks were used. A simple automated algorithm, “Conjoin”,
was developed and applied to the entire human genome (UCSC assembly of March 2006). The
steps of the algorithm are as follows:

(i) Merge the Entrez Gene IDs with the RefSeq IDs of all the known genes, using the UCSC file
that associates known genes with their respective Entrez Gene IDs.

(ii) For any genome, compare the alignments of the known gene track with those of the mRNA and
EST tracks to identify cases where two distinct known genes (child genes) with different
Entrez Gene IDs form part of a single mRNA or EST. Such mRNAs or ESTs are reported as
candidate conjoined genes.

(iii) Based on the HGNC gene symbols of the child genes, classify the conjoined gene
candidates as belonging to the same or to different gene families.

(iv) Based on the location of the reference coordinates of the child genes with respect to
each other, classify the conjoined genes as formed by non-overlapping or by overlapping child
genes (the latter including gene-within-gene and partially overlapping genes). For conjoined
genes formed by non-overlapping child genes, candidates with at least one exon from each child
gene having a match of more than 30 bp are considered true conjoined genes. For conjoined genes
formed by overlapping child genes, a pre-processing step first excludes all overlapping exons
from both child genes; candidates in which at least one of the remaining unique exons from
each child gene has a match of more than 30 bp are considered true conjoined gene candidates.

(v) To remove false positives arising from duplicated regions in the genome, genes belonging
to the same gene family, or variants of the same gene locus, perform a final step of manual
curation.

In some cases more than one transcript of the child genes satisfies the above criteria; all
such transcripts are considered to be responsible for the formation of conjoined genes. This
resulted in 623 and 942 conjoined gene candidates using the human mRNA and EST libraries,
respectively.

On merging gene symbol information from
HGNC, some transcripts were identified as formed from child genes belonging to the same gene
family. The products of genes of the same gene family usually show more than 40% amino acid
sequence identity. Thus, mRNA or EST sequences spanning two or more such child genes
have a higher possibility of misalignment or of alignment at more than two locations. In
addition, owing to the poor quality and relatively small size of EST sequences, a similar
situation might arise when aligning ESTs to the whole genome. Also, alternatively spliced
variants of one gene locus are given similar gene names but carry different Entrez Gene IDs.
Thus some conjoined gene candidates could also result from such child genes. To obtain the true
conjoined genes, it was essential to remove all candidates arising from child genes
belonging to the same gene family or representing variants of the same locus. For this purpose, a
manual curation step was introduced in which the reference sequence coordinates of the child
genes were used to remove all the false positives.

This resulted in 91 conjoined genes
represented by both mRNA and EST evidence. In addition, 226 and 434 conjoined genes were
identified that were represented by only mRNA and only EST transcripts, respectively
(91 + 226 + 434 = 751 in total). It
was interesting to note that 155 of the 751 conjoined genes had more than one mRNA or EST
transcript as evidence, yet the majority (596) were represented by only one mRNA
or EST transcript. In total, the 751 conjoined genes were represented by 1,466 mRNA and
EST transcripts connecting 1,451 human child genes. There were 38 mRNA and 45 EST
conjoined genes that were formed by child genes belonging to the same gene family or
representing variants of the same locus.
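
The exon-matching rule of step (iv) can be illustrated with a minimal sketch. This is not the
Conjoin implementation, which is not reproduced here. It assumes that exons and aligned
transcript blocks are given as half-open genomic intervals on the same chromosome and strand,
that a "match" means the overlap in base pairs between an exon and an aligned block, and that an
"overlapping exon" is any exon that overlaps an exon of the other child gene. All function and
variable names are hypothetical.

```python
from typing import List, Tuple

# Half-open genomic interval [start, end); an assumption made for this sketch.
Interval = Tuple[int, int]

MIN_MATCH_BP = 30  # the text requires a match of MORE than 30 bp


def overlap_len(a: Interval, b: Interval) -> int:
    """Number of bases shared by two half-open intervals (0 if disjoint)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def unique_exons(exons: List[Interval], other: List[Interval]) -> List[Interval]:
    """Pre-processing for overlapping child genes: drop every exon that
    overlaps any exon of the other child gene."""
    return [e for e in exons if all(overlap_len(e, o) == 0 for o in other)]


def supports_gene(blocks: List[Interval], exons: List[Interval]) -> bool:
    """True if at least one exon matches an aligned block over more than 30 bp."""
    return any(overlap_len(e, b) > MIN_MATCH_BP for e in exons for b in blocks)


def is_conjoined_candidate(
    blocks: List[Interval],
    span_a: Interval, exons_a: List[Interval],
    span_b: Interval, exons_b: List[Interval],
) -> bool:
    """Apply the step (iv) rule to one transcript and one pair of child genes."""
    if overlap_len(span_a, span_b) > 0:
        # Overlapping child genes: only exons unique to each gene count.
        exons_a, exons_b = unique_exons(exons_a, exons_b), unique_exons(exons_b, exons_a)
    return supports_gene(blocks, exons_a) and supports_gene(blocks, exons_b)


if __name__ == "__main__":
    # Toy coordinates, invented for illustration only.
    blocks = [(150, 200), (400, 500), (1000, 1080)]
    print(is_conjoined_candidate(
        blocks,
        (100, 500), [(100, 200), (400, 500)],
        (1000, 1500), [(1000, 1100), (1400, 1500)],
    ))
```

Candidates passing such a filter would still go through the manual curation of step (v); the
sketch covers only the automated exon-match test.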