CN work log for microarray data prep, 3.07

2.26.07 The problem was to find genome coordinates for the differentially expressed Agilent
microarray genes. Received files from Melissa and Yulin Jia associating AK numbers with oligos
and original Agilent probe IDs. However, the microarray expression data supplied were
associated with AK numbers only, while the TIGR gff file uses ARxxx probe IDs and does not
associate them with AKs.
Kevin Childs supplied a mapping from ARxxx to AKxxx obtained from Agilent,
allowing extraction of probe genome positions from the gff file. In 54 cases the probes
could not be found in the gff file. For these cases I collected the oligos and BLASTed
them against the TIGR rice gene set. In all but 7 cases the LOC_Os assignment matched
that from BLASTing the AK sequence directly. In the 7 cases that did not match I elected
to use the oligo instead of the AK mapping.
In 42 cases the AK BLAST position assignments differed substantially from the direct
assignments made from oligo data. In these cases the oligo assignments were used.
Added to the GFF file the AK numbers corresponding to the AR numbers.
Features that could not be assigned TIGR locus numbers were labeled “Unassigned” in
the CMap .sql loading file. They were then made unique by the addition of _# at the end,
where # is an integer starting with 1 and incremented by 1.
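A minimal sketch of the numbering scheme just described, in Python rather than the Perl used for the actual processing. The function name and the convention that an empty or missing value means "no TIGR locus" are assumptions made for illustration, not part of the original script.

```python
from itertools import count


def label_unassigned(loci, placeholder="Unassigned"):
    """Replace empty locus names with Unassigned_1, Unassigned_2, ..."""
    counter = count(1)  # integer suffix starts at 1 and increments by 1
    labeled = []
    for locus in loci:
        if locus:
            # Feature already has a TIGR locus number; keep it unchanged.
            labeled.append(locus)
        else:
            # No locus could be assigned; build a unique placeholder name.
            labeled.append(f"{placeholder}_{next(counter)}")
    return labeled


# Hypothetical input: two features without a locus assignment.
example = label_unassigned(["LOC_Os08g02340", None, "LOC_Os05g15770", ""])
```

Each unassigned feature receives a distinct name, so the loader does not see repeated "Unassigned" entries.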
A suffix was added to the Accession_id text (initially jGen) because other features (SAGE
tags) with the same LOC_Os IDs are also present in the table and would prevent loading. Below
it will be seen that I split this suffix again to handle duplications between the two
microarray gene sets.
First load attempt: csv in SQL format; it broke after a minute with a memory allocation error.
So I modified the script (calcGenFromPhyCN_no_SQL.pl) so that it handles and writes .csv
files without any SQL code, and this worked very smoothly. NOTE: During processing,
check to identify LOC_Os assignments represented multiple times, since these will not
load. In this case there were 56 genes with one or more identically named counterparts
(the 56 includes the counterparts, so that the number of genes represented two or more
times is less than 30). I did not try to match these with the probes not represented in the
gff file, but it may be that those were omitted because of similarity or duplication with
others. Several pairs of duplicates represent genes found by both Wang and Jia; the
remainder, genes found multiple times by Jia. The first kind were given the Accession_id
suffixes wGen and jGen and the second were reduced to a single gene by removal of
these records:
312686  LOC_Os09g00998_mGen  9927457  JiaJasmine_downreg  O. sativa early proembryo mRNA, complete sequence.
312667  LOC_Os09g00998_mGen  9927457  JiaJasmine_downreg  A. thaliana At3g41950 mRNA sequence.
312253  LOC_Os08g02340_mGen  9927456  JiaJasmine_downreg  Rice mRNA, sequence homologous to acidic ribosomal protein P2 gene.
312393  LOC_Os05g15770_mGen  9927453  JiaJasmine_downreg  O. sativa (japonica cultivar-group) mRNA for chitinase, complete cds
312694  LOC_Os04g43680_mGen  9927452  JiaJasmine_downreg  O.sativa mRNA for myb factor, 1202 bp.
312166  LOC_Os03g44620_mGen  9927451  JiaJasmine_downreg  Unknown expressed protein
312150  LOC_Os03g27310_mGen  9927451  JiaJasmine_downreg  Rice mRNA for Histone H3.
312421  LOC_Os03g04410_mGen  9927451  JiaJasmine_downreg  Cucurbita cv. Kurokawa Amakuri mRNA for aconitase, complete cds
312215  LOC_Os03g03820_mGen  9927451  JiaJasmine_downreg  Unknown expressed protein
312313  LOC_Os02g48560_mGen  9927450  JiaJasmine_downreg  Unknown expressed protein
312263  LOC_Os02g41630_mGen  9927450  JiaJasmine_downreg  Unknown expressed protein
312572  LOC_Os02g07260_mGen  9927450  JiaJasmine_downreg  Wheat mRNA for cytosolic phosphoglycerate kinase (EC 2.7.2.3).
312632  LOC_Os03g44620_jGen  9927451  JiaJasmine_downreg  S.tuberosum mRNA for DnaJ protein.
312262  LOC_Os07g26690_jGen  9927455  JiaJasmine_downreg  O. sativa rwc-2 mRNA for water channel protein, partial cds
312644  LOC_Os09g00998_jGen  9927457  JiaJasmine_downreg  A. thaliana At3g41950 mRNA sequence.
312672  LOC_Os09g00998_jGen  9927457  JiaJasmine_downreg  A. thaliana At2g16590/F1P15.3 mRNA sequence.
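The check for LOC_Os assignments that appear more than once could be sketched as follows. This is an illustrative Python version, not the check in the Perl script; the column names feature_id and accession_id and the three-row sample are assumptions chosen to resemble the records listed above.

```python
import csv
import io
from collections import Counter


def duplicated_accessions(rows, column="accession_id"):
    """Return {accession: count} for accessions seen more than once."""
    counts = Counter(row[column] for row in rows)
    return {acc: n for acc, n in counts.items() if n > 1}


# Hypothetical .csv content; the header names are assumed, not taken from CMap.
sample_csv = io.StringIO(
    "feature_id,accession_id\n"
    "312644,LOC_Os09g00998_jGen\n"
    "312672,LOC_Os09g00998_jGen\n"
    "312262,LOC_Os07g26690_jGen\n"
)
dupes = duplicated_accessions(csv.DictReader(sample_csv))
```

Running a check like this on the .csv before loading would list the accessions that need either a distinguishing suffix or removal of the extra records.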
NOTE that the alteration of the Accession_id represents hand-editing of the genetic file
after processing by the Perl script. It could instead be done before processing, but the
script would then need to be altered so that it does not append “_Gen” to accession ids.
NOTE that in the config file I had to shorten the feature_type_accession
JiaJasmine_highly_expressed to JiaJasmine_highexpr. If it’s not 20 chars or shorter the
CMap lookup fails.
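A small pre-load check for over-long feature type accessions might look like the sketch below. The 20-character limit is the one observed in this log and has not been checked against CMap documentation; the function and constant names are hypothetical.

```python
MAX_LEN = 20  # limit observed in this log; assumed, not taken from CMap docs


def too_long(names, max_len=MAX_LEN):
    """Return the names whose length exceeds max_len characters."""
    return [name for name in names if len(name) > max_len]


offenders = too_long([
    "JiaJasmine_highly_expressed",  # original name, longer than the limit
    "JiaJasmine_highexpr",          # shortened name used in the config file
    "JiaJasmine_downreg",
])
```

Any name returned here would need to be shortened in the config file before loading.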