GP3xCLI - Livestock Genomics

GP3xCLI: GenePix Post-Processing Program for
Quality Assessment of Raw Microarray Data from
CSIRO Livestock Industries
Antonio Reverter and Christina Pavlov
Bioinformatics Group
CSIRO Livestock Industries, Queensland Bioscience Precinct
306 Carmody Rd, St Lucia, QLD 4067, Australia
ABSTRACT: We present GP3xCLI, an automated, unsupervised AWK-based
script for assessing the quality of raw microarray data captured with the GenePix
optical scanner. Input files are processed individually, and a 2-page
portable document format (PDF) report is generated for each. Although the interpreted
AWK programming language is the main driver for filtering and manipulating the raw
data, GP3xCLI also incorporates tools such as A2PS (a general-purpose PostScript
generating utility), GNUPLOT (an interactive plotting utility), and PS2PDF (a
public-domain PostScript-to-PDF converter). On execution, GP3xCLI reports a series of
summary statistics, including the total number of spots, anomalies due to background
expression being larger than foreground, and the distribution of records by gene or
open reading frame. Inaccurate microarray signals are
further scrutinized by means of the percentage of data retained after each
successive mean-to-median correlation threshold, as well as by the joint
distribution of intensity ratios and average intensities. Finally, diagnostic plots,
including the empirical densities of dye channel intensities and intensity ratios, are
produced to help distinguish readings of differing quality. GP3xCLI is intended to
be installed on the server hosting the laboratory database, where users
can invoke it remotely. Like GP3, an existing PERL-based program available
at http://www.bch.msu.edu/~zacharet/microarray/GP3.html, GP3xCLI is not
designed to process data for subsequent analysis, but rather to provide
biologists with a simple, intuitive and effective means of assessing microarray data
quality.
AWK Script:
echo " =-=-=-=-=-=-= INITIALIZATION =-=-=-=-=-=-=-="
filename=`ls -l $1 | awk '{print $NF}'`
echo "GPR Input:" $filename | awk '{print $1, $2, " ", $3}'
date | awk '{print "Processed on:", $1, $2, $3, $4, $5, $6}'
sed 's/\"//g' $1 | \
awk 'NF==43 && $1==int($1) && $2==int($2) && $3==int($3) \
{print $0}' > tempo0
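The filter above keeps only the data rows of the GPR file: lines with exactly 43 whitespace-separated fields whose first three fields are integers. A minimal sketch of the same idea on a toy stream follows; the function name `keep_data_rows` and its field-count argument are illustrative assumptions, not part of GP3xCLI, and 5 fields are used instead of 43 to keep the example short.

```shell
# Sketch only: keep rows with the expected number of fields whose first
# three fields are integers. keep_data_rows is a hypothetical helper.
keep_data_rows() {
    # $1 = expected number of fields (43 in the script above)
    sed 's/"//g' | awk -v nf="$1" \
        'NF==nf && $1==int($1) && $2==int($2) && $3==int($3) {print $0}'
}

# Toy input: a header line, one short line, and two data rows.
printf '%s\n' \
    '"Block" "Column" "Row" "Name" "Flags"' \
    'ATF 1.0' \
    '1 1 1 "geneA" 0' \
    '1 2 1 "geneB" -50' | keep_data_rows 5
```

The header line has the right number of fields but is rejected because a non-numeric field such as `Block` does not compare equal to its integer part.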
for minr in 0 0.2 0.4 0.6 0.8 0.85 0.9
do
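T1=`awk -v corr=$minr '$1>corr {print $0}' rr | wc | awk '{print $1}'`
T2=`awk -v corr=$minr '$1>corr {print $0}' rl | wc | awk '{print $1}'`
T3=`awk -v corr=$minr '$1>corr {print $0}' gr | wc | awk '{print $1}'`
T4=`awk -v corr=$minr '$1>corr {print $0}' gl | wc | awk '{print $1}'`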
echo "> " $minr $T1 $T2 $T3 $T4
done
echo " =-=-=-=-=-=-= IMAGE QUALITY =-=-=-=-=-=-=-="
T=`wc tempo0 | awk '{print $1}'`
echo "Total No. of Spots ------------------------>" $T
N=`awk '$NF==-50 {print $0}' tempo0 | wc | awk '{print $1}'`
echo "Spots with Flag = -50 -------------------->" $N
N=`awk '$NF==-100 {print $0}' tempo0 | wc | awk '{print $1}'`
echo "Spots with Flag = -100 -------------------->" $N
N=`awk '$12>=$9 {print $0}' tempo0 | wc | awk '{print $1}'`
echo "Red   dye with Background >= Foreground --->" $N
N=`awk '$21>=$18 {print $0}' tempo0 | wc | awk '{print $1}'`
echo "Green dye with Background >= Foreground --->" $N
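The image-quality counts above read `tempo0` once per statistic. As a hedged alternative, the same tallies can be gathered in a single pass. In this sketch the function name `image_quality_counts` is hypothetical, and the column positions default to the field numbers used by the script (9 and 12 for the red foreground and background, 18 and 21 for green) but can be overridden so that a narrow toy table can be used.

```shell
# Sketch only: one-pass tally of total spots, flagged spots, and spots
# whose background is not below the foreground.
# Output order: total, flag -50, flag -100, red bg>=fg, green bg>=fg.
image_quality_counts() {
    awk -v rf="${1:-9}" -v rb="${2:-12}" -v gf="${3:-18}" -v gb="${4:-21}" '
        { total++ }
        $NF == -50  { bad50++ }
        $NF == -100 { bad100++ }
        $rb >= $rf  { red++ }
        $gb >= $gf  { green++ }
        END { printf "%d %d %d %d %d\n", total, bad50, bad100, red, green }'
}

# Toy table: red fg, red bg, green fg, green bg, flag.
printf '%s\n' \
    '500 100 400 90 0' \
    '80 100 300 90 -50' \
    '500 100 50 90 -100' \
    '200 200 300 90 0' | image_quality_counts 1 2 3 4
```

On the toy table this prints `4 1 1 2 1`.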
echo " =-=-=-= MEAN TO MEDIAN CORRELATION =-=-=-="
awk '{print $9, $10, $18, $19, log($9)/log(2), log($10)/log(2), \
log($18)/log(2), log($19)/log(2)}' tempo0 > rg
awk '$1>$2 {$9=$2/$1}; $1<=$2 {$9=$1/$2}; {print $9}' rg > rr
awk '$3>$4 {$9=$4/$3}; $3<=$4 {$9=$3/$4}; {print $9}' rg > gr
awk '$5>$6 {$9=$6/$5}; $5<=$6 {$9=$5/$6}; {print $9}' rg > rl
awk '$7>$8 {$9=$8/$7}; $7<=$8 {$9=$7/$8}; {print $9}' rg > gl
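As written, the quantity the script labels a mean-to-median correlation is a ratio: for each spot, the smaller of the two values is divided by the larger, so identical mean and median give 1 and strong disagreement gives a value near 0. The loop over `minr` then counts how many spots exceed each threshold. A minimal sketch of both steps on toy numbers follows; `mm_ratio` and `count_above` are hypothetical helper names, and positive intensities are assumed.

```shell
# Sketch only: smaller-over-larger ratio of two columns, one row per spot.
# Assumes both values are positive (a zero larger value would divide by 0).
mm_ratio() {
    awk '{ if ($1 > $2) print $2/$1; else print $1/$2 }'
}

# Count the input values strictly greater than the threshold given in $1.
count_above() {
    awk -v corr="$1" '$1 > corr { n++ } END { print n + 0 }'
}

# Toy (median, mean) pairs, then the retained counts per threshold.
ratios=$(printf '%s\n' '100 100' '80 100' '100 50' '30 100' | mm_ratio)
for minr in 0 0.4 0.9
do
    echo "> $minr $(printf '%s\n' "$ratios" | count_above "$minr")"
done
```

For the toy pairs the ratios are 1, 0.8, 0.5 and 0.3, so the loop reports 4, 3 and 1 retained spots.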
##############################################################
# GP3xCLI                                                    #
#                                                            #
# GenePix Processing Program by CSIRO Livestock Industries   #
#                                                            #
# Enquiries: [email protected]                               #
#                                                            #
# Copyright (c) 2003 CSIRO-LI                                #
#                                                            #
##############################################################
GPR Input:    F12.gpr
Processed on: Tue Apr 8 13:40:01 EST 2003
=-=-=-=-=-=-= IMAGE QUALITY =-=-=-=-=-=-=-=
Total No. of Spots ------------------------> 19200
Spots with Flag = -50 --------------------> 4720
Spots with Flag = -100 --------------------> 12
Red   dye with Background >= Foreground ---> 892
Green dye with Background >= Foreground ---> 915
Median to Mean Correlation Analysis:
                  DATA LEFT
            RED            GREEN
 Corr      Raw    Log2     Raw    Log2
______________________________________
> 0.00   19200   19200   19200   19200
> 0.20   19199   19200   19199   19200
> 0.40   19183   19200   19192   19200
> 0.60   19008   19200   19102   19200
> 0.80   17061   19199   18541   19198
> 0.85   14466   19193   17872   19196
> 0.90   10491   19137   15786   19181
=-=-=-=-=-=-= VALID SPOTS* =-=-=-=-=-=-=-=
Total No. of Valid Spots -----------------> 14433
Percentage of Valid Spots -----------------> 75.2
Total No. of Genes ------------------------> 7220
Mean  No. Repetitions ----->  2 for 6600 Genes
Min.  No. Repetitions ----->  1 for  580 Genes
Max.  No. Repetitions -----> 24 for    8 Genes
             Log(R/G) vs  0.5*Log(R*G)
             ________     ____________
         N      14433            14433
      Mean     -0.017           10.327
       Std      0.617            2.079
       Min     -8.711            3.246
       Max      4.030           15.994
       Correlation     0.362
Log(R/G) across Intensity Values
Intensity    Spots    % <0    % >0
__________________________________
( 0 ,  4)        4   100.0     0.0
( 4 ,  8)     1499    74.1    25.9
( 8 , 12)     9847    40.4    59.6
(12 , 16)     3083    17.3    82.7
__________________________________
*NB: Valid Spot defined as spots with Background < Foreground for
both Red and Green channels and with a Quality Flag of 0.
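The valid-spot definition in the footnote translates directly into a single AWK pattern. The sketch below is an assumption about how such a filter could be written, not the authors' code: `valid_spots` is a hypothetical name, and the column positions default to the field numbers used elsewhere in the script (9 and 12 for red, 18 and 21 for green, last field for the flag). It prints the number of valid spots and their percentage of all rows; in the report above, 14433 valid spots out of 19200 gives 75.2%.

```shell
# Sketch only: count spots with background < foreground in both channels
# and a quality flag of 0, and report the percentage of all rows.
valid_spots() {
    awk -v rf="${1:-9}" -v rb="${2:-12}" -v gf="${3:-18}" -v gb="${4:-21}" '
        $rb < $rf && $gb < $gf && $NF == 0 { valid++ }
        END { if (NR > 0) printf "%d %.1f\n", valid, 100 * valid / NR }'
}

# Toy table: red fg, red bg, green fg, green bg, flag.
printf '%s\n' \
    '500 100 400 90 0' \
    '80 100 300 90 -50' \
    '500 100 50 90 -100' \
    '200 200 300 90 0' | valid_spots 1 2 3 4
```

Only the first toy row qualifies, so the sketch prints `1 25.0`.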
# In the loop over minr, each of T1 to T4 is the number of records in one
# ratio file (rr, rl, gr, gl, in that order) whose ratio exceeds the threshold:
#   awk -v corr=$minr '$1>corr {print $0}' FILE | wc | awk '{print $1}'
echo " =-=-=-= Log(R/G) vs 0.5*Log(R*G) =-=-=-=-="
awk '{print $3, $4}' rgma | awk '{ v1[NR]=$1; v2[NR]=$2}; \
END{ min1=min2=99999; max1=max2=-99999; \
for(i=1;i<=NR;i++){ if( v1[i] < min1 ) min1 = v1[i]; \
if( v2[i] < min2 ) min2 = v2[i]; if( v1[i] > max1 ) max1 = v1[i]; \
if( v2[i] > max2 ) max2 = v2[i]; s1 += v1[i]; ss1 += v1[i]*v1[i]; \
s2 += v2[i]; ss2 += v2[i]*v2[i]; ss12 += v1[i]*v2[i] }; \
mean1 = s1/NR; mean2 = s2/NR; \
std1 = sqrt(( ss1 - (s1*s1)/NR ) / (NR-1)); \
std2 = sqrt(( ss2 - (s2*s2)/NR ) / (NR-1)); \
num = ( ss12 - (s1*s2)/NR ) / (NR-1); \
den = std1 * std2; corr = num / den; \
printf"%10s%11d%17d\n","N",NR,NR; \
printf"%10s%11.3f%17.3f\n","Mean",mean1,mean2; \
printf"%10s%11.3f%17.3f\n","Std",std1,std2; \
printf"%10s%11.3f%17.3f\n","Min",min1,min2; \
printf"%10s%11.3f%17.3f\n","Max",max1,max2; \
printf"%18s%10.3f\n","Correlation",corr}'
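The summary above relates the log intensity ratio, Log(R/G), to the average log intensity, 0.5*Log(R*G). How the `rgma` file is built is not shown in this listing, so the following is only a sketch under stated assumptions: `ma_values` and `pearson` are hypothetical helpers, the input is one pair of positive red and green intensities per line, and base-2 logarithms are assumed, as in the rest of the script. The correlation uses the same running sums as the script; the `NR-1` divisors cancel between numerator and denominator, so they are omitted.

```shell
# Sketch only: turn (R, G) intensities into log2(R/G) and 0.5*log2(R*G).
ma_values() {
    awk '$1 > 0 && $2 > 0 {
        printf "%.3f %.3f\n", log($1/$2)/log(2), 0.5*log($1*$2)/log(2) }'
}

# Pearson correlation of two columns from running sums.
pearson() {
    awk '{ n++; s1 += $1; s2 += $2
           ss1 += $1*$1; ss2 += $2*$2; ss12 += $1*$2 }
         END { if (n < 2) exit 1
               v1 = ss1 - s1*s1/n; v2 = ss2 - s2*s2/n
               if (v1 <= 0 || v2 <= 0) exit 1   # a constant column
               printf "%.3f\n", (ss12 - s1*s2/n) / sqrt(v1*v2) }'
}

# Toy intensities: R G per line.
printf '%s\n' '8 2' '2 8' '16 4' '4 4' | ma_values | pearson
```

For example, R=8 and G=2 give a log ratio of 2 and an average log intensity of 2.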
echo " =-=-=-= EMPIRICAL DENSITIES =-=-=-="
awk '{print log($1)/log(2)}' rgma | sort -n | \
awk '{ data[NR] = $1 }; \
END { min = data[1]; max = data[NR]; range = max - min; \
n_int = 1000; if( int(NR*.1) <= n_int ) n_int = int(NR*.1); \
size = range / n_int; \
for(i=1; i<=NR; i++){ tot += data[i]; \
aux = int((data[i] - min)/size) + 1; \
if( aux > n_int ) aux = n_int; \
q[aux]++; \
}; \
mn_int = min + size/2; \
for(i=1; i<=n_int; i++){if( q[i] < 1 ) q[i] = 0; \
print mn_int, q[i]; \
mn_int += size } \
}' > logr.d
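The empirical density above is an equal-width histogram: the sorted values are split into `n_int` intervals of width `size`, and each output line holds an interval midpoint and its count. The sketch below reproduces that binning on a handful of toy values. `density_bins` is a hypothetical name, the number of bins is passed as an argument rather than derived from the data, and, as an assumption of this sketch, the maximum value is placed in the last bin so that every value is counted.

```shell
# Sketch only: equal-width histogram of one numeric value per line.
# $1 = number of bins; output lines are "midpoint count".
density_bins() {
    sort -n | awk -v n_int="$1" '
        { data[NR] = $1 }
        END {
            if (NR == 0) exit 1
            min = data[1]; max = data[NR]
            size = (max - min) / n_int
            if (size == 0) size = 1          # all values identical
            for (i = 1; i <= NR; i++) {
                b = int((data[i] - min) / size) + 1
                if (b > n_int) b = n_int     # the maximum goes in the last bin
                q[b]++
            }
            for (b = 1; b <= n_int; b++)
                printf "%.2f %d\n", min + (b - 0.5) * size, q[b]
        }'
}

# Toy data: the integers 0 to 8 in four bins of width 2.
printf '%s\n' 8 0 1 2 3 4 5 6 7 | density_bins 4
```

The toy run yields midpoints 1, 3, 5 and 7 with counts 2, 2, 2 and 3; a two-column file of this shape is what a plotting tool such as GNUPLOT can read directly.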