CN work log for microarray data prep, 3.07

2.26.07 Problem was to find genome coordinates of differentially expressed Agilent microarray genes. Received files from Melissa and Yulin Jia associating AK with oligo and original Agilent probe IDs. However, the microarray expression data supplied were associated with AK only, and the TIGR gff file uses ARxxx probe IDs and does not associate them with AKs. Kevin Childs supplied a mapping from ARxxx to AKxxx obtained from Agilent, allowing extraction of probe genome positions from the gff file.

In 54 cases the probes could not be found in the gff file. For these cases I collected the oligos and BLASTed them against the TIGR rice gene set. In all but 7 cases the LOC_Os assignment matched that from BLASTing the AK sequence directly. In the cases that did not match, I elected to use the oligo instead of the AK mapping. In 42 cases the AK BLAST position assignments differed substantially from the direct assignments made from oligo data. In these cases the oligo ones were used.

Added to the GFF file the AK numbers corresponding to the AR numbers. Features that could not be assigned TIGR locus numbers were labeled “Unassigned” in the CMap .sql loading file. They were then made unique by the addition of _# at the end, where # is an integer starting with 1 and incremented by 1. The suffix added to the Accession_id text initially was jGen, since other features (SAGE tags) with the same LOC_Os IDs are also present in the table and prevent loading. Below it will be seen that I split this suffix again to handle duplications between the two microarray gene sets.

First load attempt: csv in SQL format, broke after a minute with a memory allocation error. So I modified the script (calcGenFromPhyCN_no_SQL.pl) so that it handles and writes .csv files without any SQL code, and this worked very smoothly.

NOTE: During processing, check to identify LOC_Os assignments represented multiple times, since these will not load.
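The two uniqueness steps described above (numbering the “Unassigned” labels and checking for repeated LOC_Os assignments) could be sketched roughly as follows. This is an illustration in Python, not the Perl script used for the actual processing; the function names and the sample list are made up for the example, and only the locus IDs are taken from this log.

```python
from collections import Counter


def number_unassigned(labels, placeholder="Unassigned"):
    """Append _1, _2, ... to each placeholder label so that every one is unique."""
    numbered = []
    counter = 0
    for label in labels:
        if label == placeholder:
            counter += 1
            numbered.append(f"{placeholder}_{counter}")
        else:
            numbered.append(label)
    return numbered


def find_duplicates(labels):
    """Return {label: count} for every label that occurs more than once."""
    return {label: n for label, n in Counter(labels).items() if n > 1}


# Hypothetical input: a short list standing in for the locus column of the data.
labels = [
    "LOC_Os09g00998",
    "Unassigned",
    "LOC_Os03g44620",
    "Unassigned",
    "LOC_Os09g00998",
]

unique_ready = number_unassigned(labels)
duplicates = find_duplicates(unique_ready)

print(unique_ready)
print(duplicates)
```

Any label reported by find_duplicates would need a distinguishing suffix, or removal of the extra records, before loading.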
In this case there were 56 genes with one or more identically named counterparts (the 56 includes the counterparts, so that the number of genes represented two or more times is less than 30). I did not try to match these with the probes not represented in the gff file, but it may be that those were omitted because of similarity or duplication with others. Several pairs of duplicates represent genes found by both Wang and Jia; the remainder, genes found multiple times by Jia. The first kind were given the Accession_id suffixes wGen and jGen and the second were reduced to a single gene by removal of these records:

312686  LOC_Os09g00998_mGen  9927457  JiaJasmine_downreg  O. sativa early proembryo mRNA, complete sequence.
312667  LOC_Os09g00998_mGen  9927457  JiaJasmine_downreg  A. thaliana At3g41950 mRNA sequence.
312253  LOC_Os08g02340_mGen  9927456  JiaJasmine_downreg  Rice mRNA, sequence homologous to acidic ribosomal protein P2 gene.
312393  LOC_Os05g15770_mGen  9927453  JiaJasmine_downreg  O. sativa (japonica cultivar-group) mRNA for chitinase, complete cds
312694  LOC_Os04g43680_mGen  9927452  JiaJasmine_downreg  O.sativa mRNA for myb factor, 1202 bp.
312166  LOC_Os03g44620_mGen  9927451  JiaJasmine_downreg  Unknown expressed protein
312150  LOC_Os03g27310_mGen  9927451  JiaJasmine_downreg  Rice mRNA for Histone H3.
312421  LOC_Os03g04410_mGen  9927451  JiaJasmine_downreg  Cucurbita cv. Kurokawa Amakuri mRNA for aconitase, complete cds
312215  LOC_Os03g03820_mGen  9927451  JiaJasmine_downreg  Unknown expressed protein
312313  LOC_Os02g48560_mGen  9927450  JiaJasmine_downreg  Unknown expressed protein
312263  LOC_Os02g41630_mGen  9927450  JiaJasmine_downreg  Unknown expressed protein
312572  LOC_Os02g07260_mGen  9927450  JiaJasmine_downreg  Wheat mRNA for cytosolic phosphoglycerate kinase (EC 2.7.2.3).
312632  LOC_Os03g44620_jGen  9927451  JiaJasmine_downreg  S.tuberosum mRNA for DnaJ protein.
312262  LOC_Os07g26690_jGen  9927455  JiaJasmine_downreg  O. sativa rwc-2 mRNA for water channel protein, partial cds
312644  LOC_Os09g00998_jGen  9927457  JiaJasmine_downreg  A. thaliana At3g41950 mRNA sequence.
312672  LOC_Os09g00998_jGen  9927457  JiaJasmine_downreg  A. thaliana At2g16590/F1P15.3 mRNA sequence.

NOTE that the alteration of the Accession_id represents hand-editing of the genetic file after processing by the Perl script. Or it could be done before, except that the script should then be altered to prevent appending “_Gen” to accession ids.

NOTE that in the config file I had to shorten the feature_type_accession JiaJasmine_highly_expressed to JiaJasmine_highexpr. If it’s not 20 chars or shorter the CMap lookup fails.
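A quick length check on the feature type names before loading would catch the 20-character problem noted above. The sketch below is in Python rather than the Perl used for the processing script; the 20-character limit is the one observed in this log and has not been checked against the CMap documentation, and the function name is made up for the example.

```python
MAX_LEN = 20  # limit observed in this log; assumed, not taken from CMap documentation


def too_long(names, max_len=MAX_LEN):
    """Return the names whose length exceeds max_len characters."""
    return [name for name in names if len(name) > max_len]


# Feature type names mentioned in this log.
names = [
    "JiaJasmine_highly_expressed",
    "JiaJasmine_highexpr",
    "JiaJasmine_downreg",
]

flagged = too_long(names)
for name in flagged:
    print(f"{name}: {len(name)} chars, exceeds {MAX_LEN}")
```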
