[BARNA-197] Size Distribution Statistics
Created: 20/Jul/12  Updated: 09/Oct/15  Resolved: 09/Oct/15

Status: Resolved
Project: Barna Package
Component/s: Simulator
Affects Version/s: Simulator 1.0.3 (API 1.12)
Fix Version/s: Simulator 1.1 (API 1.15)
Type: Bug
Reporter: Micha Sammeth
Resolution: Won't Fix
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Description
LOAD_CODING	true
LOAD_NONCODING	true
NB_MOLECULES	8000000
EXPRESSION_K	-0.6
EXPRESSION_X0	9500.0
EXPRESSION_X1	9.025E7
TSS_MEAN	NaN
POLYA_SCALE	NaN
POLYA_SHAPE	NaN
RTRANSCRIPTION	true
RT_LOSSLESS	true
RT_MAX	5500
RT_MIN	50
RT_PRIMER	RH
FRAGMENTATION	true
FRAG_SUBSTRATE	RNA
FRAG_METHOD	UR
FRAG_UR_D0	1
FRAG_UR_DELTA	NaN
FILTERING	true
PCR_DISTRIBUTION	default
PAIRED_END	true
READ_LENGTH	76
ERR_FILE	76
READ_NUMBER	15000000
FASTA	true
Priority: Major
Assignee: Micha Sammeth
Votes: 0
When I use the parameter file listed at the very beginning (that is, no SIZE_DISTRIBUTION,
TSS_MEAN, POLYA_SCALE, POLYA_SHAPE), the run completes successfully. The UR
fragmentation results in an average fragment length of 188, a maximum of 555, and a total of 76
million molecules. When doing a size selection on this, the output average is 181 with a maximum
of 299, and I get 2.1 million molecules, which is approximately 3% of the total molecules from
fragmentation. How is this possible? The average of 76 million fragments is 188, yet at the end
only 2 million lie between 181 and 299…?? It just doesn't add up, does it? Or did I get it completely
wrong?
[LIBRARY] Configuration
Mode: RH
PWM: No
RT MIN: 50
RT MAX: 5500
Processing Fragments ********** OK (00:05:34)
76054743 mol: in 76027642, new 27101, out 76054743
avg Len 156.77885, maxLen 555
initializing Selected Size distribution
[LIBRARY] Segregating cDNA (Acceptance)
Processing Fragments ********** OK (00:02:05)
76054743 mol: in 76054743, new 0, out 2057527
avg Len 181.46268, maxLen 299
start amplification
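The "Segregating cDNA (Acceptance)" step in the log suggests a probabilistic size filter rather than a hard length window, which would explain why most fragments are discarded even when the mean length sits close to the selected range. A minimal sketch of such an acceptance filter, assuming a Gaussian acceptance probability (the target distribution, its parameters, and the input length model are all assumptions here, not the simulator's actual defaults):

```python
import math
import random
import statistics

def gaussian_acceptance(sizes, mu=200.0, sigma=30.0, rng=None):
    """Keep each fragment with probability equal to a Gaussian density
    at its length, normalised so the peak probability is 1."""
    rng = rng or random.Random(0)
    return [s for s in sizes
            if rng.random() < math.exp(-((s - mu) ** 2) / (2 * sigma ** 2))]

rng = random.Random(1)
# exponential-ish fragment lengths with mean ~156 nt, capped at 555
sizes = [min(int(rng.expovariate(1 / 156)) + 1, 555) for _ in range(100_000)]
kept = gaussian_acceptance(sizes, rng=random.Random(2))
frac = len(kept) / len(sizes)
# Most fragments miss the acceptance window, so only a small fraction
# survives, and the survivors' mean shifts toward mu.
print(f"kept {frac:.1%}, mean of kept = {statistics.mean(kept):.1f}")
```

Under these toy assumptions the bulk of a mean-156 population is rejected while the accepted fragments cluster near the target size, which is qualitatively the behaviour reported in the log above.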
Comments
Comment by Micha Sammeth [ 24/Jul/12 ]
This is most likely a misunderstanding: my guess is that the PCR amplification factor in the
fourth column of the .LIB file (cf. http://sammeth.net/confluence/display/SIM/.LIB+library) has
not been taken into account when counting the number of fragments in the interval
[181;299].
In a similar test case with ~51M reads, for which an average size of 181nt is reported, I find
~37M fragments with sizes between 181nt and 299nt (taking into account the multiplier in the
4th column), which however correspond to only ~1.5M unique fragments (disregarding the
multiplier in the last column).
The insert size histograms of both cases are practically identical.
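The multiplier argument can be made concrete with a short counting sketch. The exact .LIB column layout is an assumption here (fragment coordinates in the first two columns, the amplification factor in the 4th, as the comment describes); check the linked page for the real format.

```python
def count_in_interval(lib_lines, lo=181, hi=299):
    """Count fragments whose length falls in [lo, hi], both as unique
    fragments and weighted by a per-fragment amplification factor."""
    unique = 0      # each .LIB line counted once
    amplified = 0   # weighted by the assumed 4th-column multiplier
    for line in lib_lines:
        cols = line.rstrip("\n").split("\t")
        start, end = int(cols[0]), int(cols[1])  # assumed coordinate columns
        mult = int(cols[3])                      # assumed amplification factor
        if lo <= end - start <= hi:
            unique += 1
            amplified += mult
    return unique, amplified

# toy lines (made-up values, for illustration only)
toy = ["100\t300\tid1\t25", "0\t50\tid2\t3", "10\t250\tid3\t12"]
print(count_in_interval(toy))  # -> (2, 37)
```

Comparing `unique` against an amplified total mixes the two accountings, which is exactly the mismatch the comment above suspects.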
Comment by Micha Sammeth [ 24/Jul/12 ]
report seems to be a misunderstanding
Comment by Arunkumar Srinivasan [ 24/Jul/12 ]
Sorry, I still don't get it. Suppose an ideal scenario: there are 10 molecules, all of size 500, and
the uniform fragmentation size is 100. Under ideal conditions one would expect 5 fragments
of 100bp out of each molecule, and therefore 50 fragments. Now, if I were to size-select for
100bp, I should get all 50 of them.
Extending the analogy to this scenario, the question I had was about the step "before" amplification. I have 5
million molecules which are near-uniformly fragmented to an average of 156 and a maximum of 555,
76 million fragments in total. I'd expect "most" of the fragments to be around the desired
fragmentation length (that I provided). But the number of fragments after size selection seems to
be 2 million. What I understand is that, out of 76 million molecules with mean=156, there are 2
million fragments/molecules with mean=181 and max=299. So the remaining 74 million
fragments did NOT pass the size selection, meaning they are > 299. Then how is the average of
76 million = 156?
I'm sorry again if I am misunderstanding something, but following the protocol this seems
rather straightforward: the number of molecules you report after size selection (and
before PCR amplification) is the number of fragments that passed the size selection. I
find a discrepancy in those statistics. 76 million reads with an average of 156 result in 74 million reads
failing size selection at max=299bp, so those 74 million must be > 299. How can the average of the 76
million then be 156?
Comment by Micha Sammeth [ 24/Jul/12 ]
Careful, uniform random fragmentation does not produce uniformly distributed
fragment sizes; that is very often confused! I hope this will become clear from the manuscript
once it is available. However, this is not the point of discussion here.
If I understood correctly, in your example you are comparing 2 million UN-amplified fragments
that fall in a certain interval to the total of ~76 million amplified fragments. The ratio should be
more reasonable when comparing UN-amplified fragments within the interval to total UN-amplified
fragments, or both amplified values (multiplier in the 4th column of the .LIB file).
In the former case you would probably compare a few million reads, and in the latter case
dozens of millions of reads with each other.
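The first point above, that uniform random fragmentation does not produce uniformly distributed fragment sizes, can be illustrated with a short simulation (a sketch only; this simple breakpoint model is an assumption, not the simulator's actual UR implementation):

```python
import random
import statistics

def ur_fragment(length, n_breaks, rng):
    """Cut a molecule of `length` nt at `n_breaks` positions drawn
    uniformly at random, returning the resulting fragment lengths."""
    cuts = sorted(rng.randrange(1, length) for _ in range(n_breaks))
    bounds = [0] + cuts + [length]
    return [b - a for a, b in zip(bounds, bounds[1:])]

rng = random.Random(42)
frags = []
for _ in range(10_000):              # 10k molecules of 500 nt, 4 cuts each
    frags.extend(ur_fragment(500, 4, rng))

mean_len = statistics.mean(frags)
short = sum(f < 50 for f in frags) / len(frags)
near_mean = sum(90 <= f <= 110 for f in frags) / len(frags)
# The mean is ~100 nt (500 nt / 5 pieces), yet the sizes are far from
# uniform: roughly a third of all fragments are shorter than half the
# mean, while well under a fifth lie within +/-10% of the mean.
print(f"mean={mean_len:.1f}  <50nt={short:.1%}  90-110nt={near_mean:.1%}")
```

The spacings between uniformly placed breakpoints follow a Beta-like distribution that is heavily skewed toward short fragments, so a mean of 156 with a maximum of 555 is entirely compatible with only a modest fraction of fragments falling inside any given size window.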
Generated at Tue Feb 09 22:57:44 CET 2016 using JIRA 6.4#64014sha1:a67c3532e05d88bdc00cabc2cf5e0c6b82fa6023.