Microarray Course
Affymetrix data handling with Excel
Enzo Medico
STEP 1: Data formatting and zero thresholding
- Open the txt file generated by the GeneChip Microarray Software
- Remove unwanted rows (experiment info), but keep the header row
- Remove unwanted columns, so that only three columns remain:
(A) Probe set
(B) Avg Diff
(C) Abs call
- Rename the Avg Diff and Abs Call headers, inserting the unique sample identifier (eg: “C-1 Avg Diff”, “C-1 Abs Call”)
- Sort data by ascending probe set (procedure: click the upper left cell, then Data --> Sort... --> sort by Probe Set, Ascending; in Italian Excel: Dati --> Ordina... --> ordina per Probe Set, Crescente)
- Threshold data: in the D2 cell write the following formula:
=IF(B2>0;B2;0)
Italian Excel:
=SE(B2>0;B2;0)
(some Excel versions require a comma instead of the semicolon)
- Write a header for column D such as “C-1 zero thr”
- Copy the formula down the whole of column D (procedure: click cell D2, right click --> copy; select all cells below D2, right click --> paste)
- Numeric call rendering: in the E2 cell write the following formula:
=IF(C2="P";2;IF(C2="M";1;0))
Italian Excel:
=SE(C2="P";2;SE(C2="M";1;0))
- Write a header for column E such as “C-1 call”
- Copy the formula down the whole of column E
- Save as an Excel file
- Once the first file has been completed, it can be used as a template for the other txt files: it is sufficient to copy the thresholding and numeric call columns to the new file. In this case, remember to always rename the headers!
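The formatting rules of STEP 1 can also be checked outside Excel. The following is a minimal Python sketch, not part of the original procedure: the function names and the probe set identifiers in the example are hypothetical, and the rows are assumed to be already reduced to the three columns (Probe set, Avg Diff, Abs call). Note that Excel compares text case-insensitively, whereas this sketch matches the call letters exactly.

```python
# Scores for the Abs call column; any other call (e.g. "A") scores 0.
CALL_SCORES = {"P": 2, "M": 1}

def zero_threshold(avg_diff):
    """Mirror of =IF(B2>0;B2;0): negative values are set to 0."""
    return avg_diff if avg_diff > 0 else 0

def numeric_call(abs_call):
    """Mirror of =IF(C2="P";2;IF(C2="M";1;0))."""
    return CALL_SCORES.get(abs_call, 0)

def format_sample(rows):
    """rows: iterable of (probe_set, avg_diff, abs_call) tuples.

    Returns the rows sorted by ascending probe set, with the
    zero-thresholded value and the numeric call appended.
    """
    return [
        (probe, avg, call, zero_threshold(avg), numeric_call(call))
        for probe, avg, call in sorted(rows, key=lambda r: r[0])
    ]

# Hypothetical example data, for illustration only.
example = [
    ("1001_at", -12.5, "A"),
    ("1000_at", 340.0, "P"),
    ("1002_f_at", 45.1, "M"),
]
formatted = format_sample(example)
for row in formatted:
    print(row)
```

Python's plain string sort may order some probe set names differently from Excel's sort, so the order should be verified before pasting columns side by side in STEP 2.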
STEP 2: data assembly and normalization
- Copy and paste all data sets into a single file (Paste Special --> Values)
- Check that the probe set columns are identical, then remove the redundant ones
- Final data structure:
  - Column (A): Probe Set
  - Columns (B,C,D,E,F,G): zero-thresholded values for C-1, C-2, H24-1, H24-2, E24-1 and E24-2, respectively
  - Columns (H,I,J,K,L,M): call values for C-1, C-2, H24-1, H24-2, E24-1 and E24-2, respectively
- Generate a Global Average Intensity (GAI) Column (N) by typing in cell N2:
=AVERAGE(B2:G2)
Italian Excel:
=MEDIA(B2:G2)
- Copy the N2 cell to all cells below N2
- Sort data by increasing GAI
- Save this file (eg: all_set.xls)
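The GAI column and the final sort of STEP 2 can be sketched in Python as follows. This is an illustrative sketch only: the function name `add_gai`, the list-of-lists layout, and the example values are assumptions, with each row laid out as in the final data structure above (probe set, six zero-thresholded values, six call values).

```python
from statistics import mean

def add_gai(table):
    """Append the Global Average Intensity to each row and sort by it.

    table: list of rows [probe_set, v1..v6, c1..c6].
    The GAI is the mean of the six zero-thresholded values,
    as in =AVERAGE(B2:G2).
    """
    with_gai = [row + [mean(row[1:7])] for row in table]
    return sorted(with_gai, key=lambda row: row[-1])

# Hypothetical example rows, for illustration only.
table = [
    ["probe_b", 100.0, 200.0, 300.0, 400.0, 500.0, 600.0, 2, 2, 2, 2, 2, 2],
    ["probe_a", 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 0, 1, 0, 2, 1, 0],
]
sorted_table = add_gai(table)
for row in sorted_table:
    print(row[0], row[-1])
```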
STEP 3: moving average normalization and thresholding to 20
- In the all_set.xls file, after data have been sorted by ascending GAI, type in the O2 cell the following formula:
=IF(B2/AVERAGE(B2:B200)*AVERAGE($N2:$N200)>20;B2/AVERAGE(B2:B200)*AVERAGE($N2:$N200);20)
Italian Excel:
=SE(B2/MEDIA(B2:B200)*MEDIA($N2:$N200)>20;B2/MEDIA(B2:B200)*MEDIA($N2:$N200);20)
- Copy the O2 cell to the first 100 cells below O2, then modify the formula in the O101 cell as follows:
=IF(B101/AVERAGE(B2:B200)*AVERAGE($N2:$N200)>20;B101/AVERAGE(B2:B200)*AVERAGE($N2:$N200);20)
Italian Excel:
=SE(B101/MEDIA(B2:B200)*MEDIA($N2:$N200)>20;B101/MEDIA(B2:B200)*MEDIA($N2:$N200);20)
- Copy cell O101 to all cells below O101, then copy column O to columns P-T
- Fill the headers of columns O-T with the identifiers of the experiment samples: C-1, C-2, H24-1, H24-2, E24-1, E24-2.
- Save this intermediate file (eg all_MA_thr.xls)
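The idea behind STEP 3 is that each value is rescaled by the ratio between the local average GAI and the local average of its own sample, computed over a window of neighbouring rows in the GAI-sorted table, and that results not exceeding 20 are set to 20. A minimal Python sketch of this idea follows. It makes assumptions that differ from the spreadsheet: the window is centred on each row (99 rows on either side, matching the 199-row range B2:B200 used in cell O101) and is simply clipped at the top and bottom of the table, and a window whose sample average is 0 yields the floor value, where Excel would return a division error. The function name and the example data are hypothetical.

```python
from statistics import mean

def moving_average_normalize(values, gai, half_window=99, floor=20.0):
    """Rescale one sample column against the GAI column.

    values: zero-thresholded values of one sample, sorted by ascending GAI.
    gai:    the GAI column, in the same row order.
    """
    n = len(values)
    normalized = []
    for i in range(n):
        lo = max(0, i - half_window)        # window clipped at the table top
        hi = min(n, i + half_window + 1)    # and at the table bottom
        local_sample = mean(values[lo:hi])
        local_gai = mean(gai[lo:hi])
        if local_sample == 0:
            # Assumption: avoid dividing by zero, fall back to the floor.
            normalized.append(floor)
            continue
        scaled = values[i] / local_sample * local_gai
        normalized.append(scaled if scaled > floor else floor)
    return normalized

# Tiny hypothetical example with a 3-row window (half_window=1).
values = [10.0, 20.0, 30.0]
gai = [20.0, 40.0, 60.0]
result = moving_average_normalize(values, gai, half_window=1)
print(result)
```

In the example the sample runs at half the global intensity, so each value is roughly doubled, and the first value, which does not exceed 20 after rescaling, is set to the floor.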
STEP 4: generation of the analysis spreadsheet
- From the all_MA_thr.xls file, copy all data as values in a new spreadsheet
- In this worksheet remove the original zero-thresholded data and re-order columns to obtain the following data structure:
  - Column (A): Probe Set
  - Columns (B,C,D,E,F,G): final values for C-1, C-2, H24-1, H24-2, E24-1 and E24-2, respectively
  - Columns (H,I,J,K,L,M): call values for C-1, C-2, H24-1, H24-2, E24-1 and E24-2, respectively
- Save this file before further modifications (eg analysis.xls)
- Insert in cell N2 the following formula for average log2 ratio calculation (H24 vs C):
=AVERAGE(LOG(D2/B2;2);LOG(D2/C2;2);LOG(E2/B2;2);LOG(E2/C2;2))
Italian Excel:
=MEDIA(LOG(D2/B2;2);LOG(D2/C2;2);LOG(E2/B2;2);LOG(E2/C2;2))
- Insert in cell O2 the following formula for average log2 ratio calculation (E24 vs C):
=AVERAGE(LOG(F2/B2;2);LOG(F2/C2;2);LOG(G2/B2;2);LOG(G2/C2;2))
- Insert in cell P2 the following formula for average log2 ratio calculation (E24 vs H24):
=AVERAGE(LOG(F2/D2;2);LOG(F2/E2;2);LOG(G2/D2;2);LOG(G2/E2;2))
- Insert in the cells Q2-S2 the following formulae for calculation of standard deviations (SD), respectively:
=STDEVP(LOG(D2/B2;2);LOG(D2/C2;2);LOG(E2/B2;2);LOG(E2/C2;2))
=STDEVP(LOG(F2/B2;2);LOG(F2/C2;2);LOG(G2/B2;2);LOG(G2/C2;2))
=STDEVP(LOG(F2/D2;2);LOG(F2/E2;2);LOG(G2/D2;2);LOG(G2/E2;2))
Italian Excel:
=DEV.ST.POP(LOG(D2/B2;2);LOG(D2/C2;2);LOG(E2/B2;2);LOG(E2/C2;2))
=DEV.ST.POP(LOG(F2/B2;2);LOG(F2/C2;2);LOG(G2/B2;2);LOG(G2/C2;2))
=DEV.ST.POP(LOG(F2/D2;2);LOG(F2/E2;2);LOG(G2/D2;2);LOG(G2/E2;2))
- Insert in the cell T2 the following formula for Root Mean Square SD calculation:
=SQRT(AVERAGE(Q2^2;R2^2;S2^2))
Italian Excel:
=RADQ(MEDIA(Q2^2;R2^2;S2^2))
- Insert in the cells U2-W2 the following formulae for calculation of call compatibility, respectively:
=IF(N2<0;IF(SUM(H2:I2)<3;0;1);IF(SUM(J2:K2)<3;0;1))
=IF(O2<0;IF(SUM(H2:I2)<3;0;1);IF(SUM(L2:M2)<3;0;1))
=IF(P2<0;IF(SUM(J2:K2)<3;0;1);IF(SUM(L2:M2)<3;0;1))
Italian Excel:
=SE(N2<0;SE(SOMMA(H2:I2)<3;0;1);SE(SOMMA(J2:K2)<3;0;1))
=SE(O2<0;SE(SOMMA(H2:I2)<3;0;1);SE(SOMMA(L2:M2)<3;0;1))
=SE(P2<0;SE(SOMMA(J2:K2)<3;0;1);SE(SOMMA(L2:M2)<3;0;1))
- Insert in the cells X2-Z2 the following formulae for the relevance test, respectively:
=IF(ABS(N2)-$AA$1*Q2<$AC$1;0;IF(ABS(N2)-$AB$1*$T2<$AC$1;0;1))
=IF(ABS(O2)-$AA$1*R2<$AC$1;0;IF(ABS(O2)-$AB$1*$T2<$AC$1;0;1))
=IF(ABS(P2)-$AA$1*S2<$AC$1;0;IF(ABS(P2)-$AB$1*$T2<$AC$1;0;1))
Italian Excel:
=SE(ASS(N2)-$AA$1*Q2<$AC$1;0;SE(ASS(N2)-$AB$1*$T2<$AC$1;0;1))
=SE(ASS(O2)-$AA$1*R2<$AC$1;0;SE(ASS(O2)-$AB$1*$T2<$AC$1;0;1))
=SE(ASS(P2)-$AA$1*S2<$AC$1;0;SE(ASS(P2)-$AB$1*$T2<$AC$1;0;1))
- Insert a number (eg 1) in each of the cells AA1, AB1 and AC1. These numbers set, respectively: (1) the multiplier for SD subtraction; (2) the multiplier for RMS-SD subtraction; (3) the threshold that the absolute average log2 ratio must still reach after each SD subtraction for the gene to be called relevantly regulated.
- Insert in the cells AD1, AE1 and AF1 the following formulae to count the number of genes that passed the statistical test for relevant regulation (H24vsC, E24vsC, and E24vsH24, respectively):
=SUM(X2:X1001)
=SUM(Y2:Y1001)
=SUM(Z2:Z1001)
Italian Excel:
=SOMMA(X2:X1001)
=SOMMA(Y2:Y1001)
=SOMMA(Z2:Z1001)
- Save this file (analysis.xls)
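The per-gene calculations of STEP 4 can be sketched in Python for a single row. This is only an illustration of the formulas above, not a replacement for the worksheet: the function names, the default multipliers and threshold (all set to 1, as suggested for cells AA1, AB1 and AC1), and the example values are assumptions.

```python
from math import log2, sqrt
from statistics import mean, pstdev

def log2_ratio_stats(numerators, denominators):
    """Mean and population SD of the four pairwise log2 ratios
    (columns N-P and Q-S; STDEVP is the population SD)."""
    ratios = [log2(n / d) for n in numerators for d in denominators]
    return mean(ratios), pstdev(ratios)

def rms_sd(sds):
    """Root mean square of the three SDs (column T)."""
    return sqrt(mean([sd ** 2 for sd in sds]))

def call_compatible(avg_log2, numerator_calls, denominator_calls):
    """Columns U-W: the more expressed group must have calls summing to >= 3."""
    calls = denominator_calls if avg_log2 < 0 else numerator_calls
    return 0 if sum(calls) < 3 else 1

def relevant(avg_log2, sd, rms, sd_mult=1.0, rms_mult=1.0, threshold=1.0):
    """Columns X-Z: the ratio must reach the threshold after subtracting
    the scaled SD and, separately, after subtracting the scaled RMS SD."""
    if abs(avg_log2) - sd_mult * sd < threshold:
        return 0
    if abs(avg_log2) - rms_mult * rms < threshold:
        return 0
    return 1

# Hypothetical values for one probe set (two replicates per condition).
c, h24, e24 = (100.0, 100.0), (400.0, 400.0), (100.0, 400.0)
c_calls, h24_calls = (2, 1), (2, 2)

h_vs_c = log2_ratio_stats(h24, c)
e_vs_c = log2_ratio_stats(e24, c)
e_vs_h = log2_ratio_stats(e24, h24)
rms = rms_sd([h_vs_c[1], e_vs_c[1], e_vs_h[1]])
flags = [relevant(avg, sd, rms) for avg, sd in (h_vs_c, e_vs_c, e_vs_h)]
compat = call_compatible(h_vs_c[0], h24_calls, c_calls)
print(h_vs_c, e_vs_c, e_vs_h, rms, flags, compat)
```

In this example only the H24 vs C comparison passes the relevance test: its replicates agree, so the SD is 0, while the two comparisons involving the discordant E24 replicates lose their whole average ratio to the SD subtraction.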
STEP 5: generation of the permutation worksheet
- In the analysis.xls file, open a new worksheet; rename the original worksheet “ANALYSIS” and the second worksheet “RANDOM”.
- Copy all contents of the ANALYSIS worksheet into the RANDOM worksheet
- In the RANDOM worksheet, replace the existing formulae in cells N2 to W2 with the following, which mix samples from different conditions within each group:
Cells N2-P2 (average log2 ratios):
=AVERAGE(LOG(D2/B2;2);LOG(D2/E2;2);LOG(C2/B2;2);LOG(C2/E2;2))
=AVERAGE(LOG(F2/B2;2);LOG(F2/G2;2);LOG(C2/B2;2);LOG(C2/G2;2))
=AVERAGE(LOG(F2/D2;2);LOG(F2/G2;2);LOG(E2/D2;2);LOG(E2/G2;2))
Cells Q2-S2 (standard deviations):
=STDEVP(LOG(D2/B2;2);LOG(D2/E2;2);LOG(C2/B2;2);LOG(C2/E2;2))
=STDEVP(LOG(F2/B2;2);LOG(F2/G2;2);LOG(C2/B2;2);LOG(C2/G2;2))
=STDEVP(LOG(F2/D2;2);LOG(F2/G2;2);LOG(E2/D2;2);LOG(E2/G2;2))
Cell T2 (Root Mean Square SD):
=SQRT(AVERAGE(Q2^2;R2^2;S2^2))
Cells U2-W2 (call compatibility):
=IF(N2<0;IF(SUM(H2;K2)<3;0;1);IF(SUM(I2:J2)<3;0;1))
=IF(O2<0;IF(SUM(H2;M2)<3;0;1);IF(SUM(I2;L2)<3;0;1))
=IF(P2<0;IF(SUM(J2;M2)<3;0;1);IF(SUM(K2;L2)<3;0;1))
Italian Excel, in the same order:
=MEDIA(LOG(D2/B2;2);LOG(D2/E2;2);LOG(C2/B2;2);LOG(C2/E2;2))
=MEDIA(LOG(F2/B2;2);LOG(F2/G2;2);LOG(C2/B2;2);LOG(C2/G2;2))
=MEDIA(LOG(F2/D2;2);LOG(F2/G2;2);LOG(E2/D2;2);LOG(E2/G2;2))
=DEV.ST.POP(LOG(D2/B2;2);LOG(D2/E2;2);LOG(C2/B2;2);LOG(C2/E2;2))
=DEV.ST.POP(LOG(F2/B2;2);LOG(F2/G2;2);LOG(C2/B2;2);LOG(C2/G2;2))
=DEV.ST.POP(LOG(F2/D2;2);LOG(F2/G2;2);LOG(E2/D2;2);LOG(E2/G2;2))
=RADQ(MEDIA(Q2^2;R2^2;S2^2))
=SE(N2<0;SE(SOMMA(H2;K2)<3;0;1);SE(SOMMA(I2:J2)<3;0;1))
=SE(O2<0;SE(SOMMA(H2;M2)<3;0;1);SE(SOMMA(I2;L2)<3;0;1))
=SE(P2<0;SE(SOMMA(J2;M2)<3;0;1);SE(SOMMA(K2;L2)<3;0;1))
- Copy the modified cells to all the cells below them in the respective columns
- Save the file containing the ANALYSIS and RANDOM worksheets (analysis.xls)
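The RANDOM worksheet differs from the ANALYSIS worksheet only in which samples are paired: each "group" mixes replicates from two different conditions, so any regulation it detects is expected to arise by chance. The Python sketch below illustrates this with the sample pairings read off the formulas above. Columns B-G are mapped to indices 0-5; the dictionary names, the function name, and the example values are hypothetical.

```python
from math import log2
from statistics import mean, pstdev

# (numerator indices, denominator indices) into the six value columns
# B..G = C-1, C-2, H24-1, H24-2, E24-1, E24-2 -> indices 0..5.
REAL_GROUPS = {
    "H24 vs C": ((2, 3), (0, 1)),
    "E24 vs C": ((4, 5), (0, 1)),
    "E24 vs H24": ((4, 5), (2, 3)),
}
# Pairings used in the RANDOM worksheet (cells N2, O2 and P2).
SHUFFLED_GROUPS = {
    "H24 vs C": ((2, 1), (0, 3)),
    "E24 vs C": ((4, 1), (0, 5)),
    "E24 vs H24": ((4, 3), (2, 5)),
}

def group_stats(values, groups):
    """Mean and population SD of the pairwise log2 ratios for each comparison."""
    stats = {}
    for name, (num_idx, den_idx) in groups.items():
        ratios = [log2(values[n] / values[d]) for n in num_idx for d in den_idx]
        stats[name] = (mean(ratios), pstdev(ratios))
    return stats

# Hypothetical probe set: induced 4-fold at H24, discordant at E24.
values = [100.0, 100.0, 400.0, 400.0, 100.0, 400.0]
real = group_stats(values, REAL_GROUPS)
shuffled = group_stats(values, SHUFFLED_GROUPS)
print(real["H24 vs C"], shuffled["H24 vs C"])
```

For this example the real H24 vs C comparison gives an average log2 ratio of 2 with no spread, while the shuffled pairing averages to 0 with a large SD, so it would fail the relevance test. Comparing the counts of genes that pass the test in the two worksheets gives a rough idea of how many calls could be expected by chance.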