GP3xCLI: GenePix Post-Processing Program for Quality Assessment of Raw Microarray Data from CSIRO Livestock Industries

Antonio Reverter and Christina Pavlov
Bioinformatics Group, CSIRO Livestock Industries, Queensland Bioscience Precinct, 306 Carmody Rd, St Lucia, QLD 4067, Australia

ABSTRACT: We present GP3xCLI, an automated, unsupervised AWK-based script for assessing the quality of raw microarray data captured with the GenePix optical scanner. Input files are processed individually, and a 2-page Portable Document Format (PDF) report is generated for each. Although the interpreted AWK programming language is the main driver for filtering and manipulating the raw data, GP3xCLI also incorporates A2PS (a general-purpose PostScript-generating utility), GNUPLOT (an interactive plotting utility), and PS2PDF (a public-domain PostScript-to-PDF converter). On execution, GP3xCLI reports a series of summary statistics, including the total number of spots, anomalies in which the background expression is larger than the foreground, and the distribution of records by gene or open reading frame. Inaccurate microarray signals are further scrutinized by means of the percentage of data retained after each successive mean-to-median correlation elimination step, as well as by the joint distribution of intensity ratios and average intensities. Finally, diagnostic plots, including the empirical densities of dye channel intensities and intensity ratios, are produced to help distinguish readings of differing quality. GP3xCLI is intended to be installed on the server hosting the laboratory database, where users can invoke it remotely. Like GP3, an existing PERL-based program available at http://www.bch.msu.edu/~zacharet/microarray/GP3.html, GP3xCLI is not designed to process data for subsequent analysis, but rather to provide biologists with a simple, intuitive and effective means of assessing microarray data quality.

AWK Script:
echo " =-=-=-=-=-=-= INITIALIZATION =-=-=-=-=-=-=-="
filename=`ls -l $1 | awk '{print $NF}'`
echo "GPR Input:" $filename | awk '{print $1, $2, " ", $3}'
date | awk '{print "Processed on:", $1, $2, $3, $4, $5, $6}'
# Strip double quotes and keep only 43-field records whose first
# three fields are integers
sed 's/\"//g' $1 | \
awk 'NF==43 && $1==int($1) && $2==int($2) && $3==int($3) \
     {print $0}' > tempo0

echo " =-=-=-=-=-=-= IMAGE QUALITY =-=-=-=-=-=-=-="
T=`wc tempo0 | awk '{print $1}'`
echo "Total No. of Spots ------------------------>" $T
N=`awk '$NF==-50 {print $0}' tempo0 | wc | awk '{print $1}'`
echo "Spots with Flag = -50 -------------------->" $N
N=`awk '$NF==-100 {print $0}' tempo0 | wc | awk '{print $1}'`
echo "Spots with Flag = -100 -------------------->" $N
N=`awk '$12>=$9 {print $0}' tempo0 | wc | awk '{print $1}'`
echo "Red dye with Background >= Foreground --->" $N
N=`awk '$21>=$18 {print $0}' tempo0 | wc | awk '{print $1}'`
echo "Green dye with Background >= Foreground --->" $N

echo " =-=-=-= MEAN TO MEDIAN CORRELATION =-=-=-="
awk '{print $9, $10, $18, $19, log($9)/log(2), log($10)/log(2), \
     log($18)/log(2), log($19)/log(2)}' tempo0 > rg
# Ratio of the smaller to the larger value of each pair:
# rr, gr = red, green (raw); rl, gl = red, green (log2)
awk '$1>$2 {$9=$2/$1}; $1<=$2 {$9=$1/$2}; {print $9}' rg > rr
awk '$3>$4 {$9=$4/$3}; $3<=$4 {$9=$3/$4}; {print $9}' rg > gr
awk '$5>$6 {$9=$6/$5}; $5<=$6 {$9=$5/$6}; {print $9}' rg > rl
awk '$7>$8 {$9=$8/$7}; $7<=$8 {$9=$7/$8}; {print $9}' rg > gl
# Count the spots left above each successive ratio threshold
for minr in 0 0.2 0.4 0.6 0.8 0.85 0.9
do
  T1=`awk -v corr=$minr '$1>corr {print $0}' rr | wc | awk '{print $1}'`
  T2=`awk -v corr=$minr '$1>corr {print $0}' rl | wc | awk '{print $1}'`
  T3=`awk -v corr=$minr '$1>corr {print $0}' gr | wc | awk '{print $1}'`
  T4=`awk -v corr=$minr '$1>corr {print $0}' gl | wc | awk '{print $1}'`
  echo "> " $minr $T1 $T2 $T3 $T4
done

echo " =-=-=-= Log(R/G) vs 0.5*Log(R*G) =-=-=-=-="
awk '{print $3, $4}' rgma | awk '{ v1[NR]=$1; v2[NR]=$2}; \
  END{ min1=min2=99999; max1=max2=-99999; \
  for(i=1;i<=NR;i++){ if( v1[i] < min1 ) min1 = v1[i]; \
  if( v2[i] < min2 ) min2 = v2[i]; if( v1[i] > max1 ) max1 = v1[i]; \
  if( v2[i] > max2 ) max2 = v2[i]; s1 += v1[i]; ss1 += v1[i]*v1[i]; \
  s2 += v2[i]; ss2 += v2[i]*v2[i]; ss12 += v1[i]*v2[i] }; \
  mean1 = s1/NR; mean2 = s2/NR; \
  std1 = sqrt(( ss1 - (s1*s1)/NR ) / (NR-1)); \
  std2 = sqrt(( ss2 - (s2*s2)/NR ) / (NR-1)); \
  num = ( ss12 - (s1*s2)/NR ) / (NR-1); \
  den = std1 * std2; corr = num / den; \
  printf"%10s%11d%17d\n","N",NR,NR; \
  printf"%10s%11.3f%17.3f\n","Mean",mean1,mean2; \
  printf"%10s%11.3f%17.3f\n","Std",std1,std2; \
  printf"%10s%11.3f%17.3f\n","Min",min1,min2; \
  printf"%10s%11.3f%17.3f\n","Max",max1,max2; \
  printf"%18s%10.3f\n","Correlation",corr}'

echo " =-=-=-= EMPIRICAL DENSITIES =-=-=-="
# Histogram of sorted log2 values: at most 1000 equal-width intervals,
# printed as (interval midpoint, count)
awk '{print log($1)/log(2)}' rgma | sort -n | \
awk '{ data[NR] = $1 }; \
  END { min = data[1]; max = data[NR]; range = max - min; \
  n_int = 1000; if( int(NR*.1) <= n_int ) n_int = int(NR*.1); \
  size = range / n_int; \
  for(i=1; i<=NR; i++){ tot += data[i]; \
  aux = int((data[i] - min)/size) + 1; \
  q[aux]++; \
  }; \
  mn_int = min + size/2; \
  for(i=1; i<=n_int; i++){if( q[i] < 1 ) q[i] = 0; \
  print mn_int, q[i]; \
  mn_int += size } \
  }' > logr.d

Sample Output:

##############################################################
# GP3xCLI                                                    #
# GenePix Processing Program by CSIRO Livestock Industries   #
#                                                            #
# Enquiries: Tony.Reverter-Gomez@csiro.au                    #
# Copyright (c) 2003 CSIRO-LI                                #
##############################################################

GPR Input:     F12.gpr
Processed on:  Tue Apr 8 13:40:01 EST 2003

=-=-=-=-=-=-= IMAGE QUALITY =-=-=-=-=-=-=-=
Total No. of Spots ------------------------> 19200
Spots with Flag = -50  --------------------> 4720
Spots with Flag = -100 --------------------> 12
Red   dye with Background >= Foreground ---> 892
Green dye with Background >= Foreground ---> 915

Median to Mean Correlation Analysis:
                 DATA LEFT
              RED           GREEN
  Corr     Raw   Log2     Raw   Log2
______________________________________
> 0.00   19200  19200   19200  19200
> 0.20   19199  19200   19199  19200
> 0.40   19183  19200   19192  19200
> 0.60   19008  19200   19102  19200
> 0.80   17061  19199   18541  19198
> 0.85   14466  19193   17872  19196
> 0.90   10491  19137   15786  19181

=-=-=-=-=-=-= VALID SPOTS* =-=-=-=-=-=-=-=
Total No. of Valid Spots -----------------> 14433
Percentage of Valid Spots -----------------> 75.2
Total No. of Genes ------------------------> 7220
Mean  No. Repetitions -----> 2 for 6600 Genes
Min.  No. Repetitions -----> 1 for 580 Genes
Max.  No. Repetitions -----> 24 for 8 Genes

              Log(R/G)   vs   0.5*Log(R*G)
              ________        ____________
         N      14433            14433
      Mean     -0.017           10.327
       Std      0.617            2.079
       Min     -8.711            3.246
       Max      4.030           15.994
       Correlation     0.362

Log(R/G) across Intensity Values
Intensity    Spots    % <0    % >0
__________________________________
( 0 ,  4)        4   100.0     0.0
( 4 ,  8)     1499    74.1    25.9
( 8 , 12)     9847    40.4    59.6
(12 , 16)     3083    17.3    82.7
__________________________________

*NB: Valid Spots are defined as spots with Background < Foreground for both the Red and Green channels and with a Quality Flag of 0.
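The mean-to-median screening used in the script can also be sketched as a single AWK pass, which may make the logic easier to follow than the intermediate files rr, rl, gr and gl. This is only an illustrative sketch and not part of GP3xCLI: the function name ratio_retention and its two-column input (median and mean intensity of one channel, one spot per line) are assumptions made for the example. It uses the same thresholds as the script and adds a guard against division by zero that the original listing does not have.

```shell
# Sketch only (not part of GP3xCLI): count the spots retained at each
# mean-to-median ratio threshold in one pass.
# Assumed input: a hypothetical two-column file "median mean", one spot per line.
ratio_retention() {
    awk '
        # Ratio of the smaller to the larger value, so it never exceeds 1.
        # A pair of zeros is given ratio 0 to avoid dividing by zero.
        { r[NR] = ($1 > $2) ? $2 / $1 : (($2 == 0) ? 0 : $1 / $2) }
        END {
            n = split("0 0.2 0.4 0.6 0.8 0.85 0.9", t, " ")
            for (k = 1; k <= n; k++) {
                kept = 0
                for (i = 1; i <= NR; i++)
                    if (r[i] > t[k] + 0) kept++
                printf "> %.2f %d\n", t[k], kept
            }
        }
    ' "$1"
}
```

Calling `ratio_retention spots.txt` on such a file prints one line per threshold, in the same spirit as one column of the "DATA LEFT" table in the sample output.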
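The "Log(R/G) across Intensity Values" table in the sample output is not covered by the listing shown on the poster. One way such a table could be produced is sketched below, under several assumptions: the function name ma_bins is hypothetical, the input is a two-column file holding Log(R/G) and 0.5*Log(R*G) for each valid spot, the intensity bins are half-open intervals of width 4 between 0 and 16, and a log ratio of exactly zero is counted as neither negative nor positive. The actual GP3xCLI code may differ on any of these points.

```shell
# Sketch only (not part of GP3xCLI): tabulate the sign of the log ratio
# within bins of average log intensity.
# Assumed input: a hypothetical two-column file "M A" per valid spot,
# where M = Log(R/G) and A = 0.5*Log(R*G).
ma_bins() {
    awk '
        {
            if ($2 < 0 || $2 >= 16) next   # outside the tabulated range
            b = int($2 / 4)                # 0 for [0,4), 1 for [4,8), ...
            n[b]++
            if ($1 < 0) neg[b]++
            else if ($1 > 0) pos[b]++
        }
        END {
            for (b = 0; b <= 3; b++) {
                if (n[b] == 0) continue    # skip empty bins
                printf "(%2d , %2d) %6d %6.1f %6.1f\n", \
                    b * 4, b * 4 + 4, n[b], \
                    100 * neg[b] / n[b], 100 * pos[b] / n[b]
            }
        }
    ' "$1"
}
```

Each output line gives the bin limits, the number of spots in the bin, and the percentages of spots with negative and positive log ratios. A strong shift in these percentages from low to high intensity bins, as in the sample output, points to an intensity-dependent dye bias.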