6 Month Allelic Series RNAseq QC

1
QC summary
QC was performed on all 192 samples, focusing on identifying failed or outlier
samples. Four samples are recommended for omission from the final analysis
dataset based on evidence of RNA degradation, PCA, and model-based
gene outlier detection. Those four samples can be found on slide 19.
Additionally, two correctable issues were identified. First, one flowcell's worth of
samples was run an additional time to bring read depth up to the 100 million required.
This re-run was inadvertently run as 75-mers instead of 50-mers, so those samples
contain a mix of read lengths. Second, a subset of cortex samples (Q92 and Q140)
appears to contain a trace but detectable amount of liver tissue. The
overall dilution is 500-1000x, but given the extraordinary sensitivity of RNAseq this
is still measurable. We have recommended a simple filter to remove those liver
transcripts based on the fact that they have a recognizable correlation pattern
(listed on slide 29), but other methods may be more sensitive.
2
How does CHDI QC RNAseq data in general?
• Mostly we’re looking for outliers
• We also want to show that the overall experiment worked
• When we find outliers, we try to determine the cause
– Knowing the cause helps show that a sample is a technical outlier and not part of the biology
• Methods
– Principal Components Analysis
– RNA degradation plots
– Paired end insert size
– Read lengths
– Read mapping efficiency
– Repetitive sequences and their origin
– Highly expressed genes
– # gene outliers
3
PCA whole dataset
• Not surprisingly, tissues cluster
• Strong sex effect in liver
• Cortex is tightly clustered
4
PCA striatum
• Q lengths cluster, a good sign that the design worked
• Q92, Q111, Q140, and Q175 each form their own cluster
• They even stagger in Q length order
• A couple of potential outliers (outlined in red):
– 30811_718L_striatum_Q80_HET_F_L5.LB14_1.clipped
– 35481_833L_striatum_Q92_HET_M_L3.LB10_1.clipped
5
PCA cortex
• Only Q175 stands outside the main cluster
• Possible Q175 outliers, but hard to be certain:
– 21051_460L_cortex_Q175_HET_F_L3.LB6_1.clipped
– 20947_452L_cortex_Q175_HET_M_L8.LB2_1.clipped
6
PCA liver
• Strong sex clustering will need to be accounted for
• No strong Q clusters (sex masking?)
• One potential outlier: 450_Liver_Q175_HET_M_L8.LB1_1.clipped
7
Duplication in brain (representative examples)
• Duplication is consistent and hovers between 13% and 24%
• No red flags
• Generally higher in striatum than in cortex
• The majority of the duplicated sequences are mitochondrial in origin
[Figure: duplication-level histograms (x-axis: DuplicationLevel, 1 to 10+; y-axis: Percentage), one panel per clipped FASTQ file (reads _1 and _2), for samples 20914_1_449L_striatum_Q175_WT_M through 21018_1_457L_striatum_Q175_WT_F]
8
Liver duplication (representative examples)
• Liver duplication is much higher, 40-50%
• The major duplicated sequences are all mouse major urinary proteins (Mup1-21), which are pheromone carriers
• Hurts our true read depth, but nothing terrible
• Should keep in mind for future liver work
[Figure: duplication-level histograms (x-axis: DuplicationLevel, 1 to 10+; y-axis: Percentage), one panel per clipped FASTQ file (reads _1 and _2), for liver samples 520_Liver_Q111_HET_M through 528_Liver_Q111_HET_F and 642_Liver_Q20_HET_M through 656_Liver_Q20_HET_F]
9
5’ -> 3’ degradation charts (representative examples)
Displays the likelihood of getting full length transcripts for various mRNA lengths
• Very high quality samples in general
• Most samples show >70% of all mRNA molecules are >80% complete
• Liver on average more degraded
• Some samples have degradation in the longer mRNA species (one marked in red)
[Figure: 5’ -> 3’ degradation curves, one panel per sample from 20914_449L_striatum_Q175_WT_M_L1.LB1_1.clipped through 23346_513L_striatum_Q111_WT_M_L4.LB18_1.clipped; x-axis ticks: Bin1 to Bin86; y-axis: 0 to 1; colored by TranscriptBin (1-499, 500-999, 1000-1999, 2000-2999, 3000-3999, 4000-4999, 5000+)]
10
Suspect samples by 5’ -> 3’ degradation
454_Liver_Q175_HET_M_L3._1.clipped
456_Liver_Q175_HET_M_L7.LB4_1.clipped
20947_452L_cortex_Q175_HET_M_L8.LB2_1.clipped
845_Liver_Q92_HET_F_L6.LB25_1.clipped
522_Liver_Q111_HET_F_L8.LB13_1.clipped
452_Liver_Q175_HET_M_L1.LB2_1.clipped
776_Liver_Q140_HET_F_L8.LB13_1.clipped
716_Liver_Q80_HET_F_L7.LB6_1.clipped
843_Liver_Q92_HET_F_L6.LB23_1.clipped
21051_460L_cortex_Q175_HET_F_L3.LB6_1.clipped
11
GC content per read has a red flag
8 of the samples have a “shoulder” in the GC# chart
This is usually a really bad thing
• Suggests non-mouse or non-biological sequence
[Figure: per-read GC content histograms (x-axis: GC#; y-axis: Count), one panel per clipped FASTQ file (reads _1 and _2), for samples 20914_1_449L_striatum_Q175_WT_M through 21018_1_457L_striatum_Q175_WT_F]
12
Those same samples flag for read length as well
Those same samples have a mix of 50mer reads and 75mer reads
That’s very odd
At this point we asked our sequencing lab for clarification on what happened
[Figure: read-length histograms (x-axis: SequenceLength; y-axis: Count), one panel per clipped FASTQ file (reads _1 and _2), for samples 20914_1_449L_striatum_Q175_WT_M through 21018_1_457L_striatum_Q175_WT_F]
13
Our sequencing partner found the cause
The 8 suspect samples
264401  20921_1_450L_cortex_Q175_HET_M   20130523  V02604  VIRT  2
264416  23535_1_528L_cortex_Q111_HET_F   20130523  V02604  VIRT  3
264418  28243_1_644L_cortex_Q20_HET_M    20130523  V02604  VIRT  4
264397  35624_1_844L_striatum_Q92_WT_F   20130523  V02604  VIRT  1
264447  35631_1_845L_cortex_Q92_HET_F    20130604  V02761  VIRT  1
264448  35657_1_847L_cortex_Q92_HET_F    20130604  V02761  VIRT  2
264451  454_Liver_Q175_HET_M             20130604  V02761  VIRT  3
264455  462_Liver_Q175_HET_F             20130604  V02761  VIRT  4
For these 8 samples, the initial run didn’t get a full 100 million
reads. When that happens the lab runs the samples again and then
merges the run into a full “virtual run” of the full read depth we paid
for. That’s all good. The strange thing that happened to us this time
was that the run they added our 8 samples to (they add it to ongoing
flow cells) happened to be a 75mer run. Again no big problem usually,
and what they do is clip off 25 bases in their processing and all is
compatible. This specific time they forgot to trim, so we saw the ugly
intermediate state.
What this means is that the data for those 8 are fine. They are longer,
but still good reads from our samples.
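The check and the fix described above can be sketched in a few lines. This is only an illustration, not the sequencing lab's actual pipeline: the function names are hypothetical, the toy reads are made up, and it assumes the 25 extra bases are clipped from the 3’ end of each read (the slide does not say which end is clipped).

```python
from collections import Counter

def read_fastq(lines):
    """Yield (header, sequence, quality) from FASTQ text, 4 lines per record."""
    it = iter(lines)
    for header in it:
        seq = next(it).rstrip("\n")
        next(it)  # skip the '+' separator line
        qual = next(it).rstrip("\n")
        yield header.rstrip("\n"), seq, qual

def length_histogram(records):
    """Count how many reads there are of each length."""
    return Counter(len(seq) for _, seq, _ in records)

def trim_records(records, length=50):
    """Clip every read (and its quality string) to at most `length` bases.

    Assumption: bases are removed from the 3' end.
    """
    for header, seq, qual in records:
        yield header, seq[:length], qual[:length]

# Toy input (invented): one 50-mer read and one 75-mer read.
toy_lines = [
    "@read1\n", "A" * 50 + "\n", "+\n", "I" * 50 + "\n",
    "@read2\n", "C" * 75 + "\n", "+\n", "I" * 75 + "\n",
]

records = list(read_fastq(toy_lines))
before = length_histogram(records)   # mixed lengths flag the problem
trimmed = list(trim_records(records, length=50))
after = length_histogram(trimmed)    # a single length once trimmed
print(dict(before), dict(after))
```

A histogram with more than one read length, as in `before`, is the signature seen in the SequenceLength charts; after trimming, every read has the same length.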
14
Mitochondrial rate in brain
8-9% of the reads are mtRNA
nothing surprising there
15
Mitochondrial rate in liver
6-7% of reads are mtRNA
Again in line with expectations
16
Other QC parameters that looked great
• Insert sizes: All right around 175 as expected
• Sense/antisense sequence ratio: 1:1 as expected
• Sequence coverage
– 40% of mouse transcriptome detected in brain
– About 30% of mouse transcriptome detected in liver
• Mapped read rate in the upper 90s
– 98% for brain, 96% for liver
• 95-97% of our reads are mapped to known genes
– 3-5% intergenic regions
17
Model based outlier detection
Method by which we look for the number of genes that are
outliers after accounting for our modeled effects
• 2 samples stand out, and an additional 4-6 are suspect but probably
OK (Q92 Het males)
[Figure: number of outlier genes per sample (y-axis: 0 to 8000). Samples labeled in the chart: 30811_718L_striatum_Q80_HET_F_L5.LB14_1.clipped and 35481_833L_striatum_Q92_HET_M_L3.LB10_1.clipped]
18
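The idea of counting outlier genes per sample can be sketched as follows. The deck does not specify the model or the outlier threshold, so everything here is an assumption made for illustration: the "modeled effects" are reduced to a per-group median, the spread is estimated with the median absolute deviation (MAD), and the cutoff `k = 5` is arbitrary. The function name and the toy values are hypothetical.

```python
from collections import defaultdict
from statistics import median

def count_gene_outliers(expr, groups, k=5.0):
    """Count, for each sample, the genes whose residual is an outlier.

    expr:   dict mapping gene -> list of expression values, one per sample
    groups: list of group labels (e.g. genotype), one per sample
    k:      cutoff in robust standard deviations (assumed, not from the deck)
    """
    n = len(groups)
    counts = [0] * n
    members = defaultdict(list)
    for i, g in enumerate(groups):
        members[g].append(i)

    for values in expr.values():
        # Residual = value minus its group median (stand-in for a fitted model).
        resid = [0.0] * n
        for idx in members.values():
            centre = median(values[i] for i in idx)
            for i in idx:
                resid[i] = values[i] - centre
        mad = median(abs(r) for r in resid)
        if mad == 0:
            continue  # no usable spread estimate for this gene
        cutoff = k * 1.4826 * mad  # 1.4826 scales MAD to a normal SD
        for i, r in enumerate(resid):
            if abs(r) > cutoff:
                counts[i] += 1
    return counts

# Toy data (invented): the last sample is off in both genes.
groups = ["WT", "WT", "WT", "HET", "HET", "HET"]
expr = {
    "gene1": [10.0, 10.5, 9.5, 20.0, 20.5, 40.0],
    "gene2": [5.0, 5.2, 4.8, 8.0, 8.3, 20.0],
}
outlier_counts = count_gene_outliers(expr, groups)
print(outlier_counts)
```

Samples whose count sits far above the rest, like the last sample in the toy data, are the ones that would stand out in a chart of outlier genes per sample.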
Integrating the sample QC to choose omissions
A very simple way to determine what to throw out
is to look for multiple strikes against a sample
5’ -> 3’ charts
– 454_Liver_Q175_HET_M_L3._1.clipped
– 456_Liver_Q175_HET_M_L7.LB4_1.clipped
– 20947_452L_cortex_Q175_HET_M_L8.LB2_1.clipped
– 845_Liver_Q92_HET_F_L6.LB25_1.clipped
– 522_Liver_Q111_HET_F_L8.LB13_1.clipped
– 452_Liver_Q175_HET_M_L1.LB2_1.clipped
– 776_Liver_Q140_HET_F_L8.LB13_1.clipped
– 716_Liver_Q80_HET_F_L7.LB6_1.clipped
– 843_Liver_Q92_HET_F_L6.LB23_1.clipped
– 21051_460L_cortex_Q175_HET_F_L3.LB6_1.clipped
PCA outliers
– 30811_718L_striatum_Q80_HET_F_L5.LB14_1.clipped
– 35481_833L_striatum_Q92_HET_M_L3.LB10_1.clipped
– 20947_452L_cortex_Q175_HET_M_L8.LB2_1.clipped
– 21051_460L_cortex_Q175_HET_F_L3.LB6_1.clipped
– 450_Liver_Q175_HET_M_L8.LB1_1.clipped
Model based outliers
– 30811_718L_striatum_Q80_HET_F_L5.LB14_1.clipped
– 35481_833L_striatum_Q92_HET_M_L3.LB10_1.clipped
19
Final list of proposed samples for omission
30811_718L_striatum_Q80_HET_F_L5.LB14_1.clipped
20947_452L_cortex_Q175_HET_M_L8.LB2_1.clipped
35481_833L_striatum_Q92_HET_M_L3.LB10_1.clipped
21051_460L_cortex_Q175_HET_F_L3.LB6_1.clipped
20
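The "multiple strikes" rule can be written down as a short sketch. The three lists are the samples named in this deck under the 5’ -> 3’ degradation, PCA, and model-based checks; the function name and the threshold of two strikes are assumptions, since the slides only say "multiple".

```python
from collections import Counter

# Samples flagged by each QC method, as listed in the deck.
degradation = [
    "454_Liver_Q175_HET_M_L3._1.clipped",
    "456_Liver_Q175_HET_M_L7.LB4_1.clipped",
    "20947_452L_cortex_Q175_HET_M_L8.LB2_1.clipped",
    "845_Liver_Q92_HET_F_L6.LB25_1.clipped",
    "522_Liver_Q111_HET_F_L8.LB13_1.clipped",
    "452_Liver_Q175_HET_M_L1.LB2_1.clipped",
    "776_Liver_Q140_HET_F_L8.LB13_1.clipped",
    "716_Liver_Q80_HET_F_L7.LB6_1.clipped",
    "843_Liver_Q92_HET_F_L6.LB23_1.clipped",
    "21051_460L_cortex_Q175_HET_F_L3.LB6_1.clipped",
]
pca = [
    "30811_718L_striatum_Q80_HET_F_L5.LB14_1.clipped",
    "35481_833L_striatum_Q92_HET_M_L3.LB10_1.clipped",
    "20947_452L_cortex_Q175_HET_M_L8.LB2_1.clipped",
    "21051_460L_cortex_Q175_HET_F_L3.LB6_1.clipped",
    "450_Liver_Q175_HET_M_L8.LB1_1.clipped",
]
model = [
    "30811_718L_striatum_Q80_HET_F_L5.LB14_1.clipped",
    "35481_833L_striatum_Q92_HET_M_L3.LB10_1.clipped",
]

def multiple_strikes(*flag_lists, min_strikes=2):
    """Return samples flagged by at least `min_strikes` QC methods."""
    strikes = Counter()
    for flagged in flag_lists:
        # set() so a sample repeated within one method counts only once
        strikes.update(set(flagged))
    return sorted(s for s, n in strikes.items() if n >= min_strikes)

omit = multiple_strikes(degradation, pca, model)
for sample in omit:
    print(sample)
```

With a threshold of two strikes, the sketch selects the same four samples as the final omission list.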
Liver contamination in cortex Q140 and Q92?
While the sequencing lab was looking into the 75mer
issue I ran cortex through some basic statistical
modeling (omitting the samples mentioned
previously)
I found changes, but the pattern and biology were all
wrong
21
Albumin is the top hit?
Isn’t Albumin liver specific?
Every single change is an increase
Completely off in Q175, Q111, Q80, and Q20
On (but not that strongly) in Q140 and Q92
It makes no sense for Q111 to be skipped
and for Q175 to go back to normal
[Figure: Alb (ENSMUSG00000029368) expression per sample; y-axis: Logged FPKM (0 to 8); colored by Q length (Q20, Q80, Q92, Q111, Q140, Q175)]
22
Some of the other changed genes are suspicious
• Albumin
• ApoC3, C1
• Mup3, 10, 18, 19
• FABP1
• Urate oxidase
All reasonably solid liver markers
DAVID functional annotation also suggests the altered genes
are liver related (p < 10^-5)
23
A subset of genes with good correlation between liver and cortex
but shifted from the 1:1 axis
[Figure: log-log scatter plot of per-gene values; horizontal axis: 33030_768L_cortex_Q140_HET_M_L4.LB9_1.clipped; vertical axis: 768_Liver_Q140_HET_M_L3.LB9_1.clipped]
24
Same chart with the “significant” genes in red
[Figure: log-log scatter plot of per-gene values; horizontal axis: 33030_768L_cortex_Q140_HET_M_L4.LB9_1.clipped; vertical axis: 768_Liver_Q140_HET_M_L3.LB9_1.clipped]
25
Same chart and shading in Q111; notice the
lack of linear correlation
[Figure: log-log scatter plot of per-gene values; horizontal axis: 23353_514L_cortex_Q111_HET_M_L1.LB9_1.clipped; vertical axis: 514_Liver_Q111_HET_M_L8.LB9_1.clipped]
26
What we suspect happened
• The basic problem is that liver specific transcripts
should not have correlated expression in cortex
• A very small amount of liver contamination has
occurred. The shift is 500 to 1000 times lower than
normal liver expression
• What this means is only the absolute highest liver
expressed genes are detected at all
• The challenge is uniquely identifying the affected genes
FPKMs of albumin, which should not exist in brain
          Cortex   Striatum   Liver
Albumin   173      0.01       40979
27
Liver filter created as
• Liver mean count > 2000
• Mean ratio of liver to cortex > 500
• Cortex count > 0
Not a bad first approximation
[Figure: log-log scatter plot of per-gene values; horizontal axis: 33056_770L_cortex_Q140_HET_M_L7.LB10_1.clipped; vertical axis: 770_Liver_Q140_HET_M_L2.LB10_1.clipped]
28
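The three liver-filter criteria on this slide (liver mean count > 2000, mean ratio of liver to cortex > 500, cortex count > 0) can be sketched as below. Several details are assumptions rather than the analysis actually run: "mean ratio" is read as the ratio of the liver mean to the cortex mean, "cortex count > 0" is applied to the cortex mean, and the function name, gene names, and counts are invented for illustration.

```python
from statistics import mean

def liver_filter(liver_counts, cortex_counts,
                 min_liver_mean=2000, min_ratio=500):
    """Return genes that look like liver carry-over in cortex.

    liver_counts / cortex_counts: dict mapping gene -> list of per-sample counts.
    """
    flagged = []
    for gene, liver in liver_counts.items():
        cortex = cortex_counts.get(gene)
        if not cortex:
            continue  # gene not measured in cortex
        liver_mean = mean(liver)
        cortex_mean = mean(cortex)
        if cortex_mean <= 0:
            continue  # criterion: gene must be detected in cortex
        # Criteria: highly expressed in liver AND far above its cortex level.
        if liver_mean > min_liver_mean and liver_mean / cortex_mean > min_ratio:
            flagged.append(gene)
    return flagged

# Toy counts (invented) covering each branch of the filter.
liver = {
    "liver_marker": [900000, 1100000],   # very high in liver
    "housekeeping": [50000, 52000],      # similar in both tissues
    "low_in_liver": [1500, 1700],        # below the liver mean cutoff
    "absent_in_cortex": [400000, 380000],
}
cortex = {
    "liver_marker": [900, 1100],
    "housekeeping": [60000, 58000],
    "low_in_liver": [1, 1],
    "absent_in_cortex": [0, 0],
}
to_remove = liver_filter(liver, cortex)
print(to_remove)
```

Only the gene that is both highly expressed in liver and present in cortex at a roughly 1000-fold lower level is flagged; genes never detected in cortex are left alone because there is nothing to remove.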
Effect of filtering out the liver specific genes from the
cortex data
[Figure: bar chart of the number of hits per Q length (Q80, Q92, Q111, Q140, Q175), comparing "Hits pre-filter" with "Hits post-filter"; y-axis: 0 to 90]
29
Summary of QC
• All but 4 of the 192 samples can move forward to the
analysis
• A filter to clear out highly expressed liver genes is
needed for the cortex Q140 and Q92 sets
• Striatum PCA plots show that CAG length is the single
largest global element of variance!
30