Isolation of members of a novel vomeronasal receptor gene superfamily from the mouse genome using a comprehensive data mining strategy

Michael Pearce
Summer Research Project
Chasin Lab
Columbia University
2002

Introduction

The era of genome biology brings with it vast amounts of genomic data for a multitude of organisms. One such organism of interest is the mouse. UCSC estimates that its February 2002 draft of the mouse genome is 90-96% complete (1). This nearly complete genome sequence provides a tool with which to discover new genes. Database-driven gene finding is a common technique used to discover homologues of known genes as well as members of gene families. The study of gene families can provide insight into the evolutionary forces that may have shaped the gene sequences, and it can aid in the assignment of function to particular sequence elements.

The gene family of present interest is the set of genes for candidate pheromone receptors, or vomeronasal receptors (VRs), of the mouse. Pheromones are chemical signals that, when sensed by an organism, can elicit a variety of behaviors or reactions relating to danger, mating, and so on. The VRs are found in sensory neurons whose cell bodies lie in the epithelium of the vomeronasal organ, located at the base of the nasal septum (2,3). The VR proteins are members of the G-protein coupled receptor (GPCR) superfamily and therefore share the common 7-transmembrane domain structure (3). However, two distinct superfamilies of vomeronasal receptors exist. Dulac and Axel identified a superfamily of ~100 V1Rs that are expressed in the Gαi2-containing apical half of the receptor cell layer in the vomeronasal organ (4,5). The V1R genes have intronless coding regions (2). Matsunami and Buck identified members of a second VR superfamily, the V2Rs, which are found in the Gαo-containing basal receptor layer of the vomeronasal organ (3,5).
The V2R genes are related to two other GPCRs: the Ca2+-sensing receptor and the metabotropic glutamate receptor (3). It was therefore hypothesized that the V2R genes have a six-exon structure in which all seven of the transmembrane domains are encoded by the sixth exon (3). Unlike the V1Rs, the V2Rs have an unusually large N-terminal extracellular domain that is believed to be involved in ligand binding (3). Hybridization experiments done on a mouse genomic library with a V2R gene probe gave signal from about 140 potential V2R genes (3).

The 15 V2R cDNA sequences discovered by Matsunami and Buck were used to search the February 2002 draft of the UCSC mouse genome for the remaining members of the V2R gene family. A comprehensive data mining technique was used that involved locating the V2R gene candidates in the genome, determining the exon structure of the candidate genes by aligning the homologous sequences, and analyzing the protein sequences encoded by the candidate genes.

Methods

Original V2R Gene Extraction

Each cDNA sequence discovered by Matsunami et al. (3) and Ryba et al. (5) was downloaded from GenBank. Each cDNA sequence was used as a query sequence in a BLAT search against the February 2002 draft of the UCSC mouse genome (1). BLAT is a BLAST-like alignment tool that, for protein queries, finds homologous sequences with 80% similarity over a window of 20 amino acids (6). For protein alignments, BLAT searches through an index of the genome that is constructed from non-overlapping 4-mers of the genomic assembly after removal of repeats. If there are hits against the index, the actual genomic sequence of the area of probable homology is loaded into memory for alignment (6).

The genomic sequence from the highest scoring hit was collected for each search in FASTA format. Each of the original cDNA sequences was then entered as input to Spidey, the NCBI mRNA/cDNA-to-genomic alignment tool, to determine the exon/intron structure (7).
Spidey functions by first aligning each mRNA to the genomic sequence using a high-stringency BLAST. The BLAST results are used to find genomic windows. Windows are constructed by a recursive algorithm that merges BLAST hits with consistent parameters, and this continues until all BLAST alignments have been placed into non-overlapping, consistent windows. A less stringent BLAST alignment is then performed using the entire mRNA sequence within each window. Once the program has determined that the mRNA is completely covered by the genomic sequence, the alignments are adjusted so that good splice donor and acceptor sites are used. The Spidey searches were done at high stringency. The resulting alignment data were manually formatted for use by a Perl script that created annotations for the original genes denoting the starting and ending coordinates of the exons.

Original V2R cDNA Manipulations

Upon finding the genomic sequences for the original V2Rs, it became apparent that some of the V2R cDNAs entered in GenBank were derived from the same genes. If two cDNAs aligned to the same region with >98% identity, they were considered to be derived from the same gene and were therefore merged. This was the case for V2R10/V2R11, V2R8/V2R9, V2R3/V2R13 and V2R2/V2R12.

Once the genomic sequence had been found for the original V2R genes, it was possible to go back and try to correct some of the incomplete cDNA sequences that had been entered into GenBank. This was done by aligning all of the original cDNAs, in Spidey at low stringency, to the genomic sequences of those V2Rs that were missing some exons. If missing exons were predicted by the alignment of other cDNAs, the exons were added to the cDNAs of the V2Rs in question. This procedure was done for V2R11, V2R3, V2R9, V2R16, and V2R2.

Original V2R Protein Translation

Once the cDNA sequences had been manipulated, each V2R cDNA sequence was translated using the NCBI program ORF Finder (8).
The program translates the nucleotide sequence in all 6 reading frames. The largest ORF was chosen, the corresponding protein sequence was downloaded for each file, and annotation was added to these files using the scripts.

BLAT Search for Candidate Sequences

In the initial candidate search for the other members of the V2R gene family, a BLAT search was done using the protein sequence of exon 6 of V2R3. The exon 6 sequence was chosen because it codes for the 7 transmembrane domains of the V2Rs and is therefore the most conserved of the exon sequences (3). The V2R3 exon sequence was chosen because V2R3 was one of the only original V2R cDNAs that had a full exon 6 sequence and for which we had found the exon structure prediction. The BLAT search returned approximately 200 hits. The bounding coordinates of each hit were adjusted to generate a 30 kb region encompassing the hit, in which to search for the remaining 5 exons of the candidate. The genomic sequences of each chromosome were individually downloaded from UCSC, and Perl scripts were used to extract the 30 kb candidate sequences from their respective chromosomes. These sequences were saved in files named by the chromosome number and the coordinates.

In the second candidate search, the hits returned from the initial candidate search were expanded to a 50 kb region encompassing the bounding coordinates of the hits, to allow for the discovery of candidate genes that are larger than 30 kb.

In the final candidate search, the BLAT search was done using the exon 6 amino acid sequences from all of the full-length V2Rs found in the previous two searches. In this search, each block of each hit was expanded to 50 kb. This approach increased the candidate search pool, but most of the additional candidates were overlapping.
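The window expansion described above can be illustrated with a short sketch. It is written in Python rather than the Perl actually used, and the names `Hit`, `expand_hit`, `region_name` and `extract` are hypothetical. The report does not say how the 30 kb or 50 kb region was positioned around a hit, so centring the window on the hit and clipping it at the chromosome ends are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """A genomic interval: 0-based start (inclusive) and end (exclusive)."""
    chrom: str
    start: int
    end: int


def expand_hit(hit, window, chrom_length):
    """Return a region of `window` bp that contains the hit.

    Assumption: the window is centred on the hit and shifted inward
    when it would run past either end of the chromosome.
    """
    window = max(window, hit.end - hit.start)  # never cut into the hit itself
    centre = (hit.start + hit.end) // 2
    start = max(0, centre - window // 2)
    end = min(chrom_length, start + window)
    start = max(0, end - window)  # recover full width near the chromosome end
    return Hit(hit.chrom, start, end)


def region_name(region):
    """File-style name built from the chromosome and the coordinates."""
    return f"{region.chrom}_{region.start}-{region.end}"


def extract(chrom_seq, region):
    """Slice the candidate sequence out of a chromosome sequence string."""
    return chrom_seq[region.start:region.end]


if __name__ == "__main__":
    hit = Hit("chr5", 100_000, 100_900)
    region = expand_hit(hit, 30_000, 1_000_000)
    print(region_name(region), region.end - region.start)
```

Re-running the same hits with `window=50_000` corresponds to the second candidate search; overlapping regions produced this way would still need to be de-duplicated in a later step.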
Exon Predictions in Candidates

In the initial candidate search, exon predictions were made in the 30 kb candidate sequences by aligning each of the original adjusted V2R cDNAs with each candidate sequence using NCBI's Spidey tool at low stringency. Each candidate was aligned with each cDNA so that exons could be predicted by the most homologous cDNA sequence for a given candidate. For these predictions, the executable version of Spidey was downloaded and run locally under DOS.

For the initial candidate search, the Spidey results for each candidate sequence were separated into two groups: alignments with at least 80% overall identity and alignments with 60-80% identity. Only the alignments with at least 80% identity were used, and any alignments that predicted only one exon were disregarded. To determine the best exon predictions for the remaining candidates, the files were parsed manually. Choice of the best exon prediction was based first on the highest overall percent identity, then on the fidelity of the exon predictions with respect to the mRNA making the predictions (e.g. the correct number and the correct sizes of exons being predicted), and lastly on the percentage of the mRNA that was covered.

In subsequent candidate searches, the predicted exon sequences (predicted cDNAs) from the previous iteration of the gene search were aligned to the new candidate sequences using the Spidey tool. Additionally, a script was written to choose the best predicted cDNA/candidate alignment for each new candidate. The script calculates a score for each predicted cDNA/candidate alignment and then ranks the alignments for each candidate by score. Twenty percent of the score is based on the amount of cDNA coverage and thirty percent of the score is based on the percent identity between the aligned sequences. The remaining fifty percent of the score is based on the number of predicted exons.
As the number of predicted exons deviates from 6, the alignment score is decreased. The ranking of exon numbers from most favorable to least favorable is: 6 > 7 > 8 > 9 > less than 6 > more than 10. Only candidates for which 6 exons were predicted were used in the next steps of the gene search procedure. However, the Spidey output for candidates with 7, 8, or 9 predicted exons was analyzed manually to determine whether the exon predictions could be adjusted to 6 exons without sacrificing the fidelity of a full-length candidate. Examples of such adjustments are:

a. the removal of a 7th exon that was predicted after a full-length exon 6
b. the splicing together of two predicted exons that were separated by a small number of nucleotides
c. the removal of a very small internal exon (<15 nt)

Extraction of Candidate Genes

Candidate genes were defined as being bounded by the first and last exon coordinates predicted by Spidey. The candidate gene sequences were extracted from the candidate files using Perl scripts and the Spidey output data from the best cDNA/candidate alignment.

Extraction of Candidate Predicted cDNA

The predicted cDNA sequence (exons only) was extracted from the candidate gene files, according to the exon coordinates predicted by Spidey, using Perl scripts.

Candidate Predicted cDNA Translation

Each candidate predicted cDNA sequence was translated using the FASTY3 program of the FASTA package (9). The program translates the nucleotide sequence in all 6 reading frames, taking into account frameshifts and premature stop codons, by comparing the translated predicted cDNA sequence to a homologous protein sequence. For the initial candidate search, the predicted protein sequences of each of the original adjusted V2R genes were used as input to FASTY3. For all subsequent candidate searches, the predicted protein sequences of the previous iteration of the gene search were used as input to FASTY3.
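As an aside on the alignment ranking described under Exon Predictions in Candidates, the weighted score can be sketched as follows. This is an illustration in Python, not the script that was actually used. The 20/30/50 weighting and the ordering of exon counts come from the text; the numeric values in `exon_factor`, the treatment of exactly 10 exons, and the names `alignment_score` and `best_alignment` are assumptions.

```python
def exon_factor(n_exons):
    """Map a predicted exon count to a value in [0, 1].

    Only the ordering 6 > 7 > 8 > 9 > fewer than 6 > more than 10 is
    taken from the text; the numbers themselves are assumed, and a
    count of exactly 10 is grouped with the least favorable class.
    """
    preferred = {6: 1.0, 7: 0.8, 8: 0.6, 9: 0.4}
    if n_exons in preferred:
        return preferred[n_exons]
    return 0.2 if n_exons < 6 else 0.0


def alignment_score(coverage, identity, n_exons):
    """Combine cDNA coverage (20%), percent identity (30%) and exon count (50%).

    `coverage` and `identity` are fractions between 0 and 1.
    """
    return 0.2 * coverage + 0.3 * identity + 0.5 * exon_factor(n_exons)


def best_alignment(alignments):
    """Pick the highest scoring alignment for one candidate."""
    return max(
        alignments,
        key=lambda a: alignment_score(a["coverage"], a["identity"], a["n_exons"]),
    )


if __name__ == "__main__":
    candidates = [
        {"cdna": "cDNA_A", "coverage": 0.95, "identity": 0.88, "n_exons": 7},
        {"cdna": "cDNA_B", "coverage": 0.90, "identity": 0.85, "n_exons": 6},
    ]
    print(best_alignment(candidates)["cdna"])
```

Because half of the score depends on the exon count, a six-exon alignment with slightly lower identity can outrank a seven-exon alignment, which matches the stated preference for the six-exon structure.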
A Matlab program was used to parse through the FASTY3 output and return the highest scoring predicted protein sequence for each predicted cDNA.

Multiple Alignment of Predicted cDNA Sequences and Splice Site Adjustments

Each predicted cDNA sequence was broken down into its individual exon sequences plus 50 bp of flanking intron sequence on both sides of each exon (except for the 5’ end of exon 1 and the 3’ end of exon 6). These sequences were placed into six exon-specific files. Each exon-specific file was aligned using the ClustalX program with the default parameters (10). These alignments allowed me to observe the positions and sequence composition of the exon/intron boundaries predicted by Spidey. Some exon/intron boundaries were adjusted based on discrepancies seen between the multiple alignments and discrepancies with the consensus splice site sequences (5’:AG/GT, 3’:AG/GT). At each successive iteration of the candidate search, the new candidates were aligned with the predicted exons from the previous iteration.

After the splice site boundaries had been manually adjusted, Perl scripts were used to reconstruct the new predicted cDNAs (with the newly adjusted boundaries). These adjusted predicted cDNAs were then used as input to Spidey and aligned to the candidate gene sequence at high stringency to generate the new exon coordinates in the gene. The adjusted cDNA sequence was also used as input to FASTY3 to obtain the new predicted protein sequence. The new predicted protein sequences were then compared to the original predictions for the same candidate to decide which predicted cDNA made the better protein prediction. The predicted cDNA that was translated into the best protein prediction was then added to the list of complete V2Rs.

Multiple Alignment of Predicted Protein Sequences

In order to observe the degree of homology between the predicted V2R genes, the predicted protein sequences were aligned using ClustalX.
The parameters for the alignments were:

Pairwise alignment parameters: gap open = 30; gap extension = 0.75
Multiple alignment parameters: gap open = 15; gap extension = 0.30

A phylogenetic tree was also constructed from the multiple alignments using the Neighbor Joining (NJ) method in ClustalX with the following parameters:

5000 bootstraps
excluding positions with gaps
correcting for multiple substitutions

Results and Discussion

Original V2R Gene Extraction

Each original mRNA was used as a query sequence in a BLAT search to look for the original V2R genes. In total, there were 1430 hits in the UCSC mouse genome database, all of which were of 70% identity or greater, with alignments ranging from ~100 nucleotides up to the full length of the mRNA used for the query. The highest scoring, most complete alignment was chosen for each V2R. The first seven V2Rs gave approximately full-length alignments against the database with 100% identity. These are the actual genes. Close inspection reveals that multiple mRNAs aligned to the same genomic sequence with 100% identity and were therefore produced from the same genes. This is the case for V2R3/V2R13, V2R10/V2R11, and V2R2/V2R12. The duplicate genes were removed from all subsequent analysis. Therefore, a total of 5 genes were found in the database, and the remaining 8 genes fully aligned to genomic sequences with high probability. Since 100% matches were not found for these genes, the highest scoring, full-length alignments were selected to represent the genes in this project. In cases where the mRNAs of these genes aligned to the same genomic regions, the duplicates were removed.

BLAT Search for Candidate Sequences

The protein sequence of exon 6 of V2R3 was used in a BLAT search of the UCSC mouse genome. For the first candidate search, there were 207 hits in the UCSC mouse genome database, with alignments ranging from ~40 amino acids up to 300, the full length of the query protein.
The nucleotide sequences of all 207 hits, ranging from ~120 bp to 900 bp, were taken as candidate sequences. The search area of each candidate sequence was expanded to 30 kb as explained in the Methods. For the second candidate search, the same 207 hits were used, but the search area was expanded to 50 kb as explained in the Methods. For the third candidate search, the protein sequences of exon 6 of all the previously predicted V2R genes were used in a BLAT search of the UCSC mouse genome. This search resulted in 4492 hits in the UCSC mouse genome database, with alignments ranging from ~18 amino acids up to 301, the full length of one of the query proteins. For this search, each block within each hit was expanded to 50 kb, which generated 8481 candidates. This technique created many duplicate candidates that were removed in subsequent steps of the gene search.

Exon Predictions in Candidates

The comprehensive gene search strategy resulted in the prediction of 32 full-length V2R genes. The predicted genes range in size from 7396 bp to 41,213 bp. The gene sizes of the members of this family are therefore diverse, as can be seen in Figure 1, which plots the gene size within each discovered subfamily of V2R genes. The ranges of exon lengths are as follows:

Exon    Range of Lengths (bp)
1       215-319
2       237-300
3       764-814
4       222-232
5       46-124
6       336-3032

The distributions of exon sizes are plotted in Figures 2a-f. Exon sizes vary the most for exons 1 and 6. This is not surprising, since exon 1 may contain varying amounts of non-coding sequence upstream of the translation start site. In the same regard, exon 6 may contain varying amounts of non-coding sequence following the translation termination site.

Intron sizes varied much more than exon sizes, as could be expected. The ranges of intron lengths are as follows:

Intron  Range of Lengths (bp)
1       938-17243
2       337-8032
3       816-24454
4       698-20120
5       1134-25114

The distributions of intron sizes were also plotted in Figures 3a-f.
Since introns do not contain coding sequence, their sizes are less conserved. Some of the intron sizes may be exaggerated due to the incomplete nature of the UCSC mouse genome sequence. A number of introns contain long stretches of N’s, or areas of undetermined sequence, that may not correspond to the true intron sequence lengths.

Candidate Protein Translation

Figure 4 lists the statistics generated by FASTY3 for the predicted protein sequences. The predicted V2R protein sequences ranged in size from 659 to 865 amino acids, with ~840 being the average size of the full-length V2R proteins. The V2R protein that was 659 aa long was an exception to the rest, which averaged 845 aa; it was significantly shorter than the others because its exon 6 nucleotide sequence was only ~340 bp long, followed by N’s. It was considered full length because the first 5 exons aligned well to the rest of the predicted V2R protein sequences in multiple alignments, and it is anticipated that the sequence represented by the N’s will also be homologous to the rest of the exon 6 amino acid sequences. Six of the predicted proteins contain premature termination codons, but this is not currently considered a problem since the exact intron-exon borders have not been confirmed.

The ClustalX program was used to multiply align all 32 protein sequences. A phylogenetic tree (Figure 4) was then built from the alignment using 5000 bootstraps. This tree showed that the predicted V2R proteins could be divided into four sub-families based on sequence homologies:

1. Chr 6 family
2. Chr 7 family
3. Chr 17 family
4. Other family

Although chr7_75120603-75129887 did not cluster with the other Chr 7 family members, this V2R gene was still considered a Chr 7 family member. However, this Chr 7 candidate may represent a fifth V2R sub-family. Family-specific multiple protein alignments (Figure 5a-d) were then done using ClustalX to visualize the conservation among the protein sequences.
From the multiple alignments it is clear that the most conserved areas of the predicted protein sequences are those that correspond to the transmembrane domains of the protein, within the last 300 aa of the sequence. The more variable beginning parts of the protein sequence may be involved in binding the different chemical molecules.

Conclusion

The comprehensive data mining technique has successfully extracted 32 full-length V2R gene sequences from the mouse genome. It is suspected that there are about 140 genes in the V2R gene family, and our searches of the UCSC mouse genome reveal that there may be up to 200 genes. The inability to extract a larger number of gene sequences may be due to the incomplete nature of the mouse genome. As more complete drafts are released, new gene searches can be done to establish whether more full-length genes actually exist. There is also the possibility that other members of the family do not have a six-exon structure.

The sequences that have been discovered should be complete, although there may be some ambiguities regarding the exact location of the donor/acceptor splice sites. The sites were manually adjusted according to multiple alignments of the exon/intron borders, but a more in-depth study of the splice site consensus sequences around the junctions may be beneficial. In the manual adjustments only the 5’:AG/GT and the 3’:AG/GT were taken into consideration.

Once the exact donor/acceptor splice sites have been established, it may be sensible to build a profile HMM from the 32 V2R protein sequences that have been discovered. One could then use this profile to search the mouse genome for new V2R genes. The final step in the gene discovery process would be to validate the V2R gene predictions that have been made by designing PCR primers from the sequences that have been annotated.
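As a starting point for the more in-depth splice site study suggested above, the predicted exon coordinates could be screened automatically before manual inspection. The sketch below is not part of the original pipeline; the function names and the 0-based, end-exclusive coordinate convention are assumptions. It checks only whether each intron begins with GT and ends with AG, the standard canonical intron dinucleotides, and does not model the wider consensus around the junctions.

```python
def intron_intervals(exons):
    """Return the (start, end) of each intron lying between consecutive exons.

    `exons` is a list of (start, end) pairs in gene coordinates,
    0-based and end-exclusive, sorted by position (assumed convention).
    """
    return [(exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1)]


def check_splice_sites(gene_seq, exons):
    """Report, per intron, whether it starts with GT and ends with AG.

    Returns a list of (intron_start, intron_end, donor_ok, acceptor_ok).
    """
    seq = gene_seq.upper()
    report = []
    for start, end in intron_intervals(exons):
        donor = seq[start:start + 2]    # first two bases of the intron
        acceptor = seq[end - 2:end]     # last two bases of the intron
        report.append((start, end, donor == "GT", acceptor == "AG"))
    return report


if __name__ == "__main__":
    # Toy gene: exon, canonical intron, exon (not a real V2R sequence).
    gene = "ATGGCC" + "GTAAAAAG" + "CCTGA"
    print(check_splice_sites(gene, [(0, 6), (14, 19)]))
```

Introns flagged as non-canonical would be the natural first candidates for boundary adjustment, keeping in mind that stretches of N's in the draft assembly will also fail this check.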
Figure 1(a): Other V2R sub-family gene sizes. Horizontal stacked bars showing Exon 1, Intron 1, Exon 2, Intron 2, Exon 3, Intron 3, Exon 4, Intron 4, Exon 5, Intron 5 and Exon 6 for each gene; x-axis: Gene Size (bp); y-axis: V2R Genes. Genes shown: chrUn:103014108-103021503, V2R2+:chrUn:49328346-49336745, chrUn:92193953-92202537, V2R11+:chrUn:87372130-87381462, chr2:180322229-180332238, chrUn:115014417-115022220, chr10:131042490-131051055, V2R16+:chr5:106687469-106700532, chr5:106925119-106934639, chr5:106525789-106535653, chr10:131103973-131114145, chr5:106478149-106489895, chr10:131076023-131087866, chrUn:53580679-53592497, chr5:106740079-106752743, chr5:106796039-106808673, chrUn:83235359-83252446, chr5:107043226-107077377, chr5:106867960-106902975, chr5:107132937-107174149.

Figure 1(b): Chr 17 V2R sub-family gene sizes. Same layout as Figure 1(a). Genes shown: chr17:21698256-21716380, chrUn:56437989-56454365, chrUn:92030462-92047610, chrUn:52898491-52919987, chr17:21645109-21670572.

Figure 1(c): Chr 7 V2R sub-family gene sizes. Same layout as Figure 1(a). Genes shown: chr7:75120603-75129887, V2R14:chr7:6861052-6880163, chr7:6926823-6944823.

Figure 1(d): Chr 6 V2R sub-family gene sizes. Same layout as Figure 1(a). Genes shown: chr6:124683781-124721091, chr6:125052213-125086188, chr6:124609108-124647363, chr6:125104653-125145083.

Figure 2(a): V2R gene exon 1 size distribution. Bar chart; x-axis: V2R Genes; y-axis: Exon Size (bp).

Figure 2(b): V2R gene exon 2 size distribution. Bar chart; x-axis: V2R Genes; y-axis: Exon Size (bp).

Figure 2(c): V2R gene exon 3 size distribution. Bar chart; x-axis: V2R Genes; y-axis: Exon Size (bp).

Figure 2(d): V2R gene exon 4 size distribution. Bar chart; x-axis: V2R Genes; y-axis: Exon Size (bp).

Figure 2(e): V2R gene exon 5 size distribution. Bar chart; x-axis: V2R Genes; y-axis: Exon Size (bp).

Figure 2(f): V2R gene exon 6 size distribution. Bar chart; x-axis: V2R Genes; y-axis: Exon Size (bp).

Figure 3(a): V2R gene intron 1 size distribution. Bar chart; x-axis: V2R Genes; y-axis: Intron Size (bp).