Isolation of members of a novel vomeronasal receptor gene
superfamily from the mouse genome using a comprehensive
data mining strategy
Michael Pearce
Summer Research Project
Chasin Lab
Columbia University
2002
Introduction
The era of genome biology brings with it vast amounts of genomic data for a multitude of
organisms. One such organism of interest is the mouse. UCSC estimates that their February
2002 draft of the mouse genome is 90-96% complete (1). This nearly complete genome
sequence provides a tool with which to discover new genes. Database driven gene finding is a
common technique used to discover homologues of known genes as well as members of gene
families. The study of gene families can provide insight into the evolutionary forces that may
have shaped the gene sequences and it can aid in the assignment of functionality to certain
sequence elements.
The gene family of present interest is the set of genes for candidate pheromone receptors
or vomeronasal receptors (VRs) of the mouse. Pheromones are chemical signals that, when
sensed by organisms, can result in a variety of behaviors or reactions, relating to danger, mating,
etc. The VRs are found in the sensory neurons, whose cell bodies lie in the epithelium of the
vomeronasal organ found at the base of the nasal septum (2,3). The VR proteins are members of
the G-protein coupled receptor (GPCR) superfamily and therefore have the common 7-transmembrane domain structure (3). However, two distinct superfamilies of vomeronasal
receptors exist. Dulac and Axel identified a superfamily of ~100 V1Rs that are expressed in the
Gαi2-containing apical half of the receptor cell layer in the vomeronasal organ (4,5). The V1R
genes have intronless coding regions (2). Matsunami and Buck have identified members of a
second VR superfamily, the V2Rs, which are found in the Gαo-containing basal receptor layer of
the vomeronasal organ (3,5). The V2R genes are related to two other GPCRs: the Ca2+-sensing
receptor and the metabotropic glutamate receptor (3). Therefore, it was hypothesized that the
V2R genes have a six-exon structure, in which all seven of the transmembrane domains are
encoded by the sixth exon (3). Unlike the V1Rs, the V2Rs have an unusually large N-terminal
extracellular domain that is believed to be involved in ligand binding (3).
Hybridization experiments done on a mouse genomic library with a V2R gene probe
resulted in signal from about 140 potential V2R genes (3). The 15 V2R cDNA sequences,
discovered by Matsunami and Buck, were used to search the February 2002 draft of the UCSC
mouse genome for the remaining members of the V2R gene family. A comprehensive data
mining technique was used that involved locating the V2R gene candidates in the genome,
determining the exon structure of the candidate genes from aligning the homologous sequences,
and analyzing the protein sequences coded for by the candidate genes.
Methods
Original V2R Gene Extraction
Each cDNA sequence discovered by Matsunami et al. (3) and Ryba et al. (5) was downloaded
from GenBank. Each cDNA sequence was used as a query sequence in a BLAT search against
the February 2002 draft of the UCSC mouse genome (1). BLAT is a BLAST-like alignment tool
that, for protein queries, finds homologous sequences with at least 80% similarity over a window of 20 amino acids
(6). For protein alignments, BLAT searches through an index of the genome that is constructed
using non-overlapping 4-mers from the genomic assembly after removal of repeats. If there are
hits against the index, then the actual genomic sequence of the area of probable homology is loaded
into memory for alignment (6).
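The index lookup described above can be illustrated with a toy sketch. This is not BLAT's implementation: it assumes the target is already a translated protein string, ignores repeat masking, and the function names `build_index` and `find_seed_hits` are hypothetical. It only shows how non-overlapping 4-mers of the target can be matched against every overlapping 4-mer of the query.

```python
from collections import defaultdict

def build_index(target: str, k: int = 4) -> dict:
    """Index the non-overlapping k-mers of the target by start position."""
    index = defaultdict(list)
    for start in range(0, len(target) - k + 1, k):
        index[target[start:start + k]].append(start)
    return index

def find_seed_hits(query: str, index: dict, k: int = 4) -> list:
    """Return (query_pos, target_pos) pairs where a query k-mer is in the index."""
    hits = []
    # Query k-mers overlap (step 1); target k-mers do not (step k).
    for q in range(len(query) - k + 1):
        for t in index.get(query[q:q + k], ()):
            hits.append((q, t))
    return hits

# Toy protein-like strings, not real V2R data.
target = "MKTLLVAGWCHYTEQNPR"
index = build_index(target)
hits = find_seed_hits("VAGWCHYT", index)
```

In a real search, the region around each seed hit would then be loaded and aligned in full, as the text describes.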
The genomic sequence from the highest scoring hit was collected for each search in FASTA
format. Each of the original cDNA sequences was then entered as input to the NCBI
mRNA/cDNA to genomic alignment tool Spidey, to determine the exon/intron structure (7).
Spidey functions by first aligning each mRNA to the genomic sequence using a high-stringency
BLAST. The result from BLAST is used to find genomic windows. Windows are constructed
using a recursive algorithm by merging the BLAST hits with consistent parameters. This is done
until all BLAST alignments are put into non-overlapping, consistent windows. A less stringent
BLAST alignment is then performed using the entire mRNA sequence within each window.
Once the program has determined that the mRNA is completely covered by the genomic
sequence, the alignments are adjusted so that good splice donor and acceptor sites are used.
The Spidey searches were done at high stringency. The resulting alignment data was manually
formatted to allow for use by a Perl script that created annotations for the original genes denoting
the starting and ending coordinates of the exons.
Original V2R cDNA Manipulations
Upon finding the genomic sequences for the original V2Rs, it became apparent that some of the
V2R cDNAs entered in GenBank were derived from the same genes. If two cDNAs aligned to
the same region with >98% identity, they were considered to be derived from the same gene and
were therefore merged. This was the case for V2R10/V2R11, V2R8/V2R9,
V2R3/V2R13 and V2R2/V2R12.
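The merging rule can be sketched as follows. This is an illustration only: it assumes "same region" means overlapping intervals on the same chromosome, the dictionary layout and function names are hypothetical, and the coordinates in the example are invented rather than taken from the BLAT output.

```python
def same_gene(a: dict, b: dict, min_identity: float = 98.0) -> bool:
    """True if two cDNA alignments overlap on one chromosome above min_identity."""
    overlap = (a["chrom"] == b["chrom"]
               and a["start"] <= b["end"]
               and b["start"] <= a["end"])
    return overlap and a["identity"] > min_identity and b["identity"] > min_identity

def merge_duplicates(alignments: list) -> list:
    """Group cDNA names that appear to derive from the same gene.

    Greedy single pass: each alignment joins the first group it matches.
    This is enough for pairs of duplicates, but it does not fuse two
    existing groups that a later alignment happens to bridge.
    """
    groups = []
    for aln in alignments:
        for group in groups:
            if any(same_gene(aln, member) for member in group):
                group.append(aln)
                break
        else:
            groups.append([aln])
    return [[m["name"] for m in g] for g in groups]

# Invented coordinates, for illustration only.
alignments = [
    {"name": "V2R10", "chrom": "chrUn", "start": 100, "end": 900, "identity": 99.5},
    {"name": "V2R11", "chrom": "chrUn", "start": 150, "end": 950, "identity": 99.1},
    {"name": "V2R14", "chrom": "chr7", "start": 100, "end": 900, "identity": 100.0},
]
groups = merge_duplicates(alignments)
```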
Once the genomic sequence had been found for the original V2R genes, it was possible to go
back and try to correct some of the incomplete cDNA sequences that had been entered into
GenBank. This was done by aligning all of the original cDNAs to the genomic sequences of
those V2Rs that were missing some exons, using Spidey at low stringency. If missing exons were
predicted by the alignment of other cDNAs, the exons were added to the cDNAs of the V2Rs in
question. This procedure was done for V2R11, V2R3, V2R9, V2R16, and V2R2.
Original V2R Protein Translation
Once the cDNA sequences had been manipulated, each V2R cDNA sequence was translated
using the NCBI program ORF Finder (8). The program translates the nucleotide sequence in all
6 reading frames. The largest ORF was chosen for each cDNA, the corresponding protein sequence was
downloaded, and annotation was added to the resulting files using scripts.
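A six-frame translation with selection of the longest ORF can be sketched as below. This is not ORF Finder's code; it assumes the standard genetic code, defines an ORF as running from the first methionine in a stop-free stretch to the next stop codon (or the end of the sequence), and uses hypothetical function names.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(dna: str) -> str:
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(seq: str) -> str:
    """Translate from position 0; codons with ambiguous bases become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def longest_orf(dna: str) -> str:
    """Return the longest Met-initiated protein over all six reading frames."""
    dna = dna.upper()
    best = ""
    for strand in (dna, reverse_complement(dna)):
        for frame in range(3):
            # Each '*'-delimited segment is a stop-free stretch; the last
            # segment of a frame may lack a terminating stop codon.
            for segment in translate(strand[frame:]).split("*"):
                start = segment.find("M")
                if start != -1 and len(segment) - start > len(best):
                    best = segment[start:]
    return best

protein = longest_orf("CCATGGCTAAATGA")  # toy sequence, not a V2R cDNA
```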
BLAT Search for Candidate Sequences
In the initial candidate search for the other members of the V2R gene family, a BLAT search
using the protein sequence of exon 6 for V2R3 was done. The exon 6 sequence was chosen
because it codes for the 7 transmembrane domains of the V2Rs and it is, therefore, the most
conserved of the exon sequences (3). The V2R3 exon sequence was chosen because it was one
of the few original V2R cDNAs that had a full exon 6 sequence and for which an
exon structure prediction had been obtained.
The BLAT search returned 207 hits. The bounding coordinates of each hit were adjusted to
generate a 30 kb region encompassing the hit, in which to search for the remaining 5 exons of the
candidate. The genomic sequences of each chromosome were individually downloaded from
UCSC and Perl scripts were used to extract the 30 kb candidate sequences from their respective
chromosomes. These sequences were saved in files named by the chromosome number and the
coordinates.
In the second candidate search, the 207 hits returned from the initial candidate search were
expanded to a 50 kb region encompassing the bounding coordinates of the hits to allow for the
discovery of candidate genes that are larger than 30 kb.
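The window expansion can be sketched as follows. The report does not state how the 30 kb or 50 kb region was positioned around each hit, so this sketch assumes the window is centered on the hit and clamped to the chromosome ends. The function names and the file-name pattern are hypothetical, modeled on identifiers such as chr7_75120603-75129887 used later in the text.

```python
def expand_hit(hit_start: int, hit_end: int, chrom_length: int,
               window: int = 30_000) -> tuple:
    """Return (start, end) of a window of about `window` bp centered on the hit.

    Assumption: equal padding on both sides, clamped to [0, chrom_length].
    """
    pad = max(window - (hit_end - hit_start), 0) // 2
    start = max(hit_start - pad, 0)
    end = min(hit_end + pad, chrom_length)
    return start, end

def candidate_name(chrom: str, start: int, end: int) -> str:
    """Name a candidate by chromosome and coordinates."""
    return f"{chrom}_{start}-{end}"

start, end = expand_hit(100_000, 100_900, chrom_length=1_000_000)
name = candidate_name("chr7", start, end)

# The second search uses the same hits with a larger window.
start50, end50 = expand_hit(100_000, 100_900, chrom_length=1_000_000, window=50_000)
```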
In the final candidate search, the BLAT search was done using the exon 6 amino acid sequences
from all of the full length V2Rs found in the previous two searches. In this search, each block of
each hit was expanded to 50 kb. This approach increased the candidate search pool but most of
the additional candidates were overlapping.
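One way to collapse overlapping candidate windows is a standard interval merge, sketched below. The report does not say how overlapping candidates were actually handled, so this is an assumption, and `merge_overlapping` is a hypothetical name. Note that merging can fuse the windows of neighboring genes in a tandem cluster, so the merged regions would still need exon prediction to separate them.

```python
def merge_overlapping(windows: list) -> list:
    """Merge overlapping (chrom, start, end) windows into non-redundant regions."""
    merged = []
    for chrom, start, end in sorted(windows):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Extend the previous region on the same chromosome.
            merged[-1] = (chrom, merged[-1][1], max(merged[-1][2], end))
        else:
            merged.append((chrom, start, end))
    return merged

# Invented coordinates, for illustration only.
windows = [("chr5", 100, 200), ("chr5", 150, 300),
           ("chr6", 100, 200), ("chr5", 400, 500)]
regions = merge_overlapping(windows)
```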
Exon Predictions in Candidates
In the initial candidate search, exon predictions were made in the 30 kb candidate sequences by
aligning each of the original adjusted V2R cDNAs with each candidate sequence using NCBI's
Spidey tool with low stringency. Each candidate was aligned with each cDNA to enable exon
prediction by the most homologous cDNA sequence for a given candidate. For these predictions,
the executable version of Spidey was downloaded and run locally on DOS.
For the initial candidate search, the results from Spidey for each candidate sequence were
separated into two groups: those alignments that had at least 80% overall identity and those that
had 60-80% identity. Only the alignments that had at least 80% identity were utilized and any
alignments that only predicted one exon were disregarded. To determine the best exon
predictions for the remaining candidates, the files were parsed manually. Choice of the best
exon prediction was first based on the highest overall percent identity, then on the fidelity of the
exon predictions with respect to the mRNA that was making the predictions (e.g. the correct
number being predicted and the correct size being predicted), and lastly on the percentage of the
mRNA that was covered.
In subsequent candidate searches, the predicted exon sequences (predicted cDNAs) from the
previous iteration of the gene search were aligned to the new candidate sequences using the
Spidey tool. Additionally, a script was written to choose the best predicted cDNA/candidate
alignment for each new candidate. The script calculates a score for each predicted
cDNA/candidate alignment and then ranks the alignments for each candidate by score. Twenty
percent of the score is based on the amount of cDNA coverage and thirty percent of the score is
based on the percent identity between the aligned sequences. The remaining fifty percent of the
score is based on the number of predicted exons. As the number deviates from 6, the alignment
score is decreased. The ranking of exon numbers from most favorable to least favorable is:
6 > 7 > 8 > 9 > fewer than 6 > more than 10.
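A scoring function with these weights might look like the sketch below. Only the 20/30/50 weighting and the ordering of exon counts come from the text. The numeric sub-scores for each exon count are hypothetical placeholders, and treating exactly 10 exons like "more than 10" is also an assumption, since the text does not cover that case.

```python
def exon_count_score(n_exons: int) -> float:
    """Sub-score in [0, 1]; the values are hypothetical, only their order is sourced."""
    if n_exons < 6:
        return 0.2
    # 10 or more exons fall through to the lowest score (assumption).
    return {6: 1.0, 7: 0.8, 8: 0.6, 9: 0.4}.get(n_exons, 0.0)

def alignment_score(coverage: float, identity: float, n_exons: int) -> float:
    """Combine cDNA coverage and percent identity (as fractions) with exon count."""
    return 0.2 * coverage + 0.3 * identity + 0.5 * exon_count_score(n_exons)

def rank_alignments(alignments: list) -> list:
    """Sort alignments for one candidate from best to worst score."""
    return sorted(
        alignments,
        key=lambda a: alignment_score(a["coverage"], a["identity"], a["n_exons"]),
        reverse=True,
    )

# Invented alignments for a single candidate.
ranked = rank_alignments([
    {"cdna": "A", "coverage": 0.95, "identity": 0.90, "n_exons": 8},
    {"cdna": "B", "coverage": 0.90, "identity": 0.88, "n_exons": 6},
    {"cdna": "C", "coverage": 0.99, "identity": 0.92, "n_exons": 4},
])
```

With half of the score tied to the exon count, a 6-exon alignment outranks a slightly more identical alignment that predicts a different number of exons.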
Only candidates for which 6 exons were predicted were used in the next steps of the gene search
procedure. However, the Spidey data of 7, 8, and 9 exon candidates was analyzed manually to
determine if the exon predictions could be adjusted to 6 exons without sacrificing the fidelity of a
full length candidate. Examples of such adjustments are:
a. the removal of a 7th exon that was predicted after a full-length exon 6
b. the splicing together of two predicted exons that were separated by a small number of
nucleotides
c. the removal of a very small internal exon (<15 nt)
Extraction of Candidate Genes
Candidate genes were defined as being bounded by the first and last exon coordinates predicted
by Spidey. The candidate gene sequences were extracted from the candidate files using Perl
scripts and the Spidey output data from the best cDNA/candidate alignment.
Extraction of Candidate Predicted cDNA
The predicted cDNA sequence (exons only) was extracted from the candidate gene files,
according to the exon coordinates predicted by Spidey, using Perl scripts.
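The extraction step amounts to concatenating exon slices of the gene sequence. The sketch below assumes 1-based, inclusive exon coordinates on the plus strand of the extracted gene sequence. The function names are hypothetical, and the Perl scripts used in the project are not reproduced here.

```python
def extract_cdna(gene_seq: str, exons: list) -> str:
    """Concatenate exons given as 1-based inclusive (start, end) pairs."""
    return "".join(gene_seq[start - 1:end] for start, end in sorted(exons))

def exon_lengths(exons: list) -> list:
    return [end - start + 1 for start, end in sorted(exons)]

def intron_lengths(exons: list) -> list:
    """Lengths of the gaps between consecutive exons."""
    ordered = sorted(exons)
    return [nxt[0] - prev[1] - 1 for prev, nxt in zip(ordered, ordered[1:])]

# Toy gene: two 3 bp exons separated by a 3 bp intron.
gene = "AAACCCGGGTTT"
exons = [(1, 3), (7, 9)]
cdna = extract_cdna(gene, exons)
```

The same coordinates give the exon and intron lengths that are summarized in the Results.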
Candidate Predicted cDNA Translation
Each candidate predicted cDNA sequence was translated using the FASTY3 program of the
FASTA package (9). The program translates the nucleotide sequence in all 6 reading frames
taking into account frameshifts and premature stop codons by comparing the translated predicted
cDNA sequence to a homologous protein sequence. For the initial candidate search, the
predicted protein sequences of each of the original adjusted V2R genes were used as input to
FASTY3. For all subsequent candidate searches, the predicted protein sequences of the previous
iteration of the gene search were used as input to FASTY3. A Matlab program was used to parse
through the FASTY3 output and return the highest scoring predicted protein sequences for each
predicted cDNA.
Multiple Alignment of Predicted cDNA Sequences and Splice Site Adjustments
Each predicted cDNA sequence was broken down into its individual exon sequence components
plus 50 bp of flanking intron sequence on both sides of the exons (except for the 5’ end of exon 1
and the 3’ end of exon 6). These sequences were placed into six exon-specific files. Each exon
specific file was aligned using the ClustalX program with the default parameters (10). These
alignments allowed me to observe the positions and sequence composition of the exon/intron
boundaries predicted by Spidey. Some exon/intron boundaries were adjusted based on
discrepancies seen between the multiple alignments and discrepancies with the consensus splice
site sequences (5’:AG/GT, 3’:AG/GT). At each successive iteration of the candidate search, the
new candidates were aligned with the predicted exons from the previous iteration.
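A simple automated check of predicted boundaries is sketched below. It tests only the canonical rule that an intron begins with GT and ends with AG, which is narrower than the manual inspection of the multiple alignments described above. Coordinates are assumed to be 1-based and inclusive on the plus strand, and `check_splice_sites` is a hypothetical name.

```python
def check_splice_sites(gene_seq: str, exons: list) -> list:
    """For each intron, report (first two bases, last two bases, is_canonical)."""
    gene_seq = gene_seq.upper()
    ordered = sorted(exons)
    report = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        # Intron spans 1-based positions prev_end+1 .. next_start-1.
        intron = gene_seq[prev_end:next_start - 1]
        donor, acceptor = intron[:2], intron[-2:]
        report.append((donor, acceptor, donor == "GT" and acceptor == "AG"))
    return report

# Toy gene: exon 1 (5 bp), a 15 bp intron, exon 2 (6 bp).
gene = "ATGAG" + "GTAAGTCCCTTTCAG" + "GTTTAA"
report = check_splice_sites(gene, [(1, 5), (21, 26)])

# Shifting the exon 1 boundary by one base breaks the donor site.
shifted = check_splice_sites(gene, [(1, 6), (21, 26)])
```

Boundaries flagged as non-canonical would be candidates for the kind of manual adjustment described in this section.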
After the splice site boundaries had been manually adjusted, Perl scripts were used to
reconstruct the new predicted cDNAs (with the newly adjusted boundaries). These adjusted
predicted cDNAs were then used as input to Spidey and aligned to the candidate gene sequence
at high stringency to generate the new exon coordinates in the gene. The adjusted cDNA
sequence was also used as input to FASTY3 to obtain the new predicted protein sequence. The
new predicted protein sequences were then compared to the original predictions for the same
candidate to decide which predicted cDNA made the better protein prediction. The predicted
cDNA that was translated into the best protein prediction was then added to the list of complete
V2Rs.
Multiple Alignment of Predicted Protein Sequences
In order to observe the degree of homology between the predicted V2R genes, the predicted
protein sequences were aligned using ClustalX. The parameters for the alignments were:
Pairwise alignment parameters: gap open = 30; gap extension = 0.75.
Multiple alignment parameters: gap open = 15; gap extension = 0.30.
A phylogenetic tree was also constructed from the multiple alignments using the Neighbor
Joining (NJ) method in ClustalX with the following parameters:
• 5000 bootstraps
• excluding positions with gaps
• correcting for multiple substitutions
Results and Discussion
Original V2R Gene Extraction
Each original mRNA was used as a query sequence in a BLAT search to look for the original
V2R genes. In total, there were 1430 hits in the UCSC mouse genome database, all of which
were of 70% identity or greater with alignments ranging from ~100 nucleotides up to the full
length of the mRNA used for the query. The highest scoring, most complete alignment was
chosen for each V2R. The first seven V2Rs resulted in approximately full-length alignments
against the database with 100% identity; these were taken to be the actual genes. Close inspection revealed that
multiple mRNAs aligned to the same genomic sequence with 100% identity and were therefore
produced from the same genes. This is the case for V2R3/V2R13, V2R10/V2R11, and
V2R2/V2R12. The duplicates were excluded from all subsequent analyses. Therefore, a
total of 5 genes were found in the database and the remaining 8 genes aligned over their full length to genomic
sequences with high probability. Since 100% matches were not found for these genes, the
highest scoring, full-length alignments were selected to represent the genes in this project. In
cases where the mRNAs of these genes aligned to the same genomic regions, the duplicates
were removed.
BLAT Search for Candidate Sequences
The protein sequence of exon 6 of V2R3 was used in a BLAT search of the UCSC mouse
genome. For the first candidate search, there were 207 hits in the UCSC mouse genome database
with alignments ranging from ~40 amino acids up to 300, the full length of the query protein.
The nucleotide sequences of all 207 hits, ranging from ~120 to 900 bp, were taken as candidate
sequences. The search area of each candidate sequence was expanded to 30 kb as explained in
the Methods.
For the second candidate search, the same 207 hits were used, but the search area was expanded
to 50 kb as explained in the methods.
For the third candidate search, the protein sequence of exon 6 of all the previously predicted
V2R genes was used in a BLAT search of the UCSC mouse genome. This search resulted in
4492 hits in the UCSC mouse genome database with alignments ranging from ~18 amino acids
up to 301, the full length of one of the query proteins. For this search, each block within each hit
was expanded to 50 kb, which generated 8481 candidates. This technique created many duplicate
candidates that were removed in subsequent steps of the gene search.
Exon Predictions in Candidates
The comprehensive gene search strategy resulted in the prediction of 32 full length V2R genes.
The predicted genes range in size from 7396 bp to 41,213 bp. The gene sizes of the members of
this family are therefore diverse, as can be seen in Figure 1, which plots the gene size within
each discovered subfamily of V2R genes.
The ranges of exon lengths are as follows:

Exon    Range of Lengths (bp)
1       215-319
2       237-300
3       764-814
4       222-232
5       46-124
6       336-3032
The distributions of exon sizes were also plotted in Figures 2a-f. Exon sizes vary the most for
exon 1 and exon 6. This is not surprising since exon 1 may contain varying amounts of non-coding sequence upstream of the translation start site. Likewise, exon 6 may contain
varying amounts of non-coding sequence following the translation termination site.
Intron sizes varied much more than exon sizes, as could be expected. The ranges of intron lengths
are as follows:

Intron  Range of Lengths (bp)
1       938-17243
2       337-8032
3       816-24454
4       698-20120
5       1134-25114
The distributions of intron sizes were also plotted in Figures 3a-f. Since introns do not contain
coding sequence, their sizes are less conserved. Some of the intron sizes may be exaggerated
due to the incomplete nature of the UCSC mouse genome sequence. A number of introns
contain long stretches of N’s, or areas of undetermined sequence, that may not correlate to the
true intron sequence lengths.
Candidate Protein Translation
Figure 4 lists the statistics generated by FASTY3 for the predicted protein sequences. The
predicted V2R protein sequences ranged in size from 659 to 865 amino acids, with ~840 being the
average size of the full-length V2R proteins. The V2R protein that was 659 aa long was an
exception to the rest, which averaged 845 aa; it was significantly shorter than the others
because its exon 6 nucleotide sequence was only ~340 bp long, followed by N’s. It was
considered full length because the first 5 exons aligned well to the rest of the predicted V2R
protein sequences in multiple alignments and it is anticipated that the sequence represented by
the N’s will also be homologous to the rest of the exon 6 amino acid sequences. Six of the
predicted proteins contain premature termination codons, but this is not currently considered a
problem since the exact intron-exon borders have not been confirmed.
The ClustalX program was used to multiply align all the 32 protein sequences. A phylogenetic
tree (Figure 4) was then built from the alignment using 5000 bootstraps. This tree showed that
the predicted V2R proteins could be divided into four sub-families based on sequence
homologies:
1. Chr 6 family
2. Chr 7 family
3. Chr 17 family
4. Other family
Despite the fact that chr7_75120603-75129887 did not cluster with the other Chr 7 family
members, this V2R gene was still considered a Chr 7 family member. However, this Chr 7
candidate may represent a fifth V2R sub-family.
Family specific multiple protein alignments (Figure 5a-d) were then done using ClustalX to
visualize the conservation among the protein sequences. From the multiple alignments it is clear
that the most conserved areas of the predicted protein sequences are those that code for the
transmembrane domains of the protein towards the last 300 aa of the sequence. The more
variable beginning parts of the protein sequence may be involved in binding the different
chemical molecules.
Conclusion
The comprehensive data mining technique has successfully extracted 32 full-length V2R gene
sequences from the mouse genome. It is suspected that there are about 140 genes in the V2R
gene family and our searches of the UCSC mouse genome reveal that there may be up to 200
genes. The inability to extract a larger number of gene sequences may be due to the incomplete
nature of the mouse genome. As more complete drafts are released, new gene searches can be
done to establish if more full-length genes actually exist. There is also the possibility that other
members of the family may not have a six exon structure.
The sequences that have been discovered should be complete although there may be some
ambiguities regarding the exact location of the donor/acceptor splice sites. The sites were
manually adjusted according to multiple alignments of the exon/intron borders, but a more in-depth study of the splice site consensus sequences around the junctions may be beneficial. In the
manual adjustments only the 5’:AG/GT and the 3’:AG/GT were taken into consideration.
Once the exact donor/acceptor splice sites have been established, it may be sensible to build a
profile HMM from the 32 V2R protein sequences that have been discovered. One could then use
this profile to search the mouse genome for new V2R genes. The final step in the gene discovery
process would be to validate the V2R gene predictions that have been made by designing PCR
primers from the sequences that have been annotated.
Figure 1(a): Other V2R sub-family gene sizes
[Stacked bar chart, "Main V2R Gene Sizes": gene size (bp, axis 0-45000) for each V2R gene, with each bar divided into Exon 1-6 and Intron 1-5. Genes plotted: chrUn:103014108-103021503, V2R2+:chrUn:49328346-49336745, chrUn:92193953-92202537, V2R11+:chrUn:87372130-87381462, chr2:180322229-180332238, chrUn:115014417-115022220, chr10:131042490-131051055, V2R16+:chr5:106687469-106700532, chr5:106925119-106934639, chr5:106525789-106535653, chr10:131103973-131114145, chr5:106478149-106489895, chr10:131076023-131087866, chrUn:53580679-53592497, chr5:106740079-106752743, chr5:106796039-106808673, chrUn:83235359-83252446, chr5:107043226-107077377, chr5:106867960-106902975, chr5:107132937-107174149.]
Figure 1(b): Chr 17 V2R sub-family gene sizes
[Stacked bar chart, "Chr 17 V2R Genes": gene size (bp, axis 0-35000) for each V2R gene, with each bar divided into Exon 1-6 and Intron 1-5. Genes plotted: chr17:21698256-21716380, chrUn:56437989-56454365, chrUn:92030462-92047610, chrUn:52898491-52919987, chr17:21645109-21670572.]
Figure 1(c): Chr 7 V2R sub-family gene sizes
[Stacked bar chart, "Chr 7 V2R Genes": gene size (bp, axis 0-25000) for each V2R gene, with each bar divided into Exon 1-6 and Intron 1-5. Genes plotted: chr7:75120603-75129887, V2R14:chr7:6861052-6880163, chr7:6926823-6944823.]
Figure 1(d): Chr 6 V2R sub-family gene sizes
[Stacked bar chart, "Chr 6 V2R Genes": gene size (bp, axis 0-50000) for each V2R gene, with each bar divided into Exon 1-6 and Intron 1-5. Genes plotted: chr6:124683781-124721091, chr6:125052213-125086188, chr6:124609108-124647363, chr6:125104653-125145083.]
Figure 2(a): V2R gene exon 1 size distribution
[Bar chart, "Exon 1 Size Distribution": exon size (bp, axis 0-350) for each predicted V2R gene.]
Figure 2(b): V2R gene exon 2 size distribution
[Bar chart, "Exon 2 Size Distribution": exon size (bp, axis 0-350) for each predicted V2R gene.]
Figure 2(c): V2R gene exon 3 size distribution
[Bar chart, "Exon 3 Size Distribution": exon size (bp, axis 730-820) for each predicted V2R gene.]
Figure 2(d): V2R gene exon 4 size distribution
[Bar chart, "Exon 4 Size Distribution": exon size (bp, axis 216-234) for each predicted V2R gene.]
Figure 2(e): V2R gene exon 5 size distribution
[Bar chart, "Exon 5 Size Distribution": exon size (bp, axis 0-140) for each predicted V2R gene.]
Figure 2(f): V2R gene exon 6 size distribution
[Bar chart, "Exon 6 Size Distribution": exon size (bp, axis 0-3500) for each predicted V2R gene.]
Figure 3(a): V2R gene intron 1 size distribution
[Bar chart, "Intron 1 Size Distribution": intron size (bp, axis 0-20000) for each predicted V2R gene.]