[BARNA-197] Size Distribution Statistics

Created: 20/Jul/12    Updated: 09/Oct/15    Resolved: 09/Oct/15

Status:              Resolved
Project:             Barna Package
Component/s:         Simulator
Affects Version/s:   Simulator 1.0.3 (API 1.12)
Fix Version/s:       Simulator 1.1 (API 1.15)
Type:                Bug
Reporter:            Micha Sammeth
Assignee:            Micha Sammeth
Priority:            Major
Resolution:          Won't Fix
Labels:              None
Votes:               0
Remaining Estimate:  Not Specified
Time Spent:          Not Specified
Original Estimate:   Not Specified

Description

Parameter file used:

    LOAD_CODING      true
    LOAD_NONCODING   true
    NB_MOLECULES     8000000
    EXPRESSION_K     -0.6
    EXPRESSION_X0    9500.0
    EXPRESSION_X1    9.025E7
    TSS_MEAN         NaN
    POLYA_SCALE      NaN
    POLYA_SHAPE      NaN
    RTRANSCRIPTION   true
    RT_LOSSLESS      true
    RT_MAX           5500
    RT_MIN           50
    RT_PRIMER        RH
    FRAGMENTATION    true
    FRAG_SUBSTRATE   RNA
    FRAG_METHOD      UR
    FRAG_UR_D0       1
    FRAG_UR_DELTA    NaN
    FILTERING        true
    PCR_DISTRIBUTION default
    PAIRED_END       true
    READ_LENGTH      76
    ERR_FILE         76
    READ_NUMBER      15000000
    FASTA            true

When I use the parameter file listed above (that is, with no SIZE_DISTRIBUTION, and with TSS_MEAN, POLYA_SCALE and POLYA_SHAPE set to NaN), the run completes successfully. The UR fragmentation produces an average fragment length of 188 with a maximum of 555 and a total of 76 million molecules. Size selection on this output then reports an average of 181 with a maximum of 299 and 2.1 million molecules, which is approximately 3% of the total molecules from fragmentation. How is this possible? The average over 76 million fragments is 188, yet in the end only 2 million fall between 181 and 299? It just doesn't add up, does it? Or did I get it completely wrong?
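The ~3% survival rate stems from the size-selection step, which works by acceptance sampling: each fragment is kept with a probability derived from a target size distribution, so even fragments near the target mean are frequently rejected. A minimal sketch of this mechanism, assuming a hypothetical Gaussian target centred on 200 nt (the simulator's actual selected-size distribution may differ):

```python
import math
import random

random.seed(0)

def size_select(frag_lengths, target_mean=200.0, target_sd=40.0):
    """Acceptance-sampling sketch: keep each fragment with probability
    given by a Gaussian target size density scaled to peak at 1.
    The target distribution here is a hypothetical stand-in, not the
    simulator's actual default."""
    kept = []
    for length in frag_lengths:
        p_accept = math.exp(-0.5 * ((length - target_mean) / target_sd) ** 2)
        if random.random() < p_accept:
            kept.append(length)
    return kept

# Skewed input lengths, loosely mimicking post-fragmentation sizes
# (mean ~157 nt, hard maximum 555 nt).
lengths = [min(555, int(random.expovariate(1 / 157.0)) + 1)
           for _ in range(100_000)]
kept = size_select(lengths)

frac = len(kept) / len(lengths)       # fraction surviving selection
mean_kept = sum(kept) / len(kept)     # survivors' mean shifts toward target
```

Even though the input mean is not far from the target, only a minority of fragments survive; how large that minority is depends entirely on the width of the target distribution relative to the input distribution.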
[LIBRARY] Configuration
Mode: RH  PWM: No  RT MIN: 50  RT MAX: 5500
Processing Fragments ********** OK (00:05:34)
76054743 mol: in 76027642, new 27101, out 76054743  avg Len 156.77885, maxLen 555
initializing Selected Size distribution
[LIBRARY] Segregating cDNA (Acceptance)
Processing Fragments ********** OK (00:02:05)
76054743 mol: in 76054743, new 0, out 2057527  avg Len 181.46268, maxLen 299
start amplification

Comments

Comment by Micha Sammeth [ 24/Jul/12 ]

This is most likely a misunderstanding: my guess is that the PCR amplification factor in the fourth column of the .LIB file (cf. http://sammeth.net/confluence/display/SIM/.LIB+library) has not been taken into account when counting the number of fragments in the interval [181;299]. In a similar test case with ~51M reads, for which an average size of 181 nt is reported, I find ~37M fragments with sizes between 181 nt and 299 nt (taking into account the multiplier in the 4th column), which however correspond to only ~1.5M unique fragments (disregarding the multiplier in the last column). The histograms of the insert size distribution in the two cases are virtually identical.

Comment by Micha Sammeth [ 24/Jul/12 ]

The report seems to be a misunderstanding.

Comment by Arunkumar Srinivasan [ 24/Jul/12 ]

Sorry, I still don't get it. Suppose an ideal scenario: there are 10 molecules, all of size 500, and the uniform fragmentation size is 100. Under ideal conditions, one would expect 5 fragments of 100 bp each from every molecule, and therefore 50 fragments in total. Now, if I were to size-select for 100 bp, I should get all 50 of them.

Extending the analogy to this scenario, my question concerns the stage "before" amplification. I have 5 million molecules which are near-uniformly fragmented to an average of 156 and a maximum of 555, 76 million fragments in total. I'd expect "most" of the fragments to be around the desired fragmentation length (that I provided). But the number of fragments after size selection seems to be 2 million.
What I understand is that, out of 76 million fragments with mean = 156, there are 2 million fragments with mean = 181 and max = 299. So the remaining 74 million fragments did NOT pass the size selection, meaning they are > 299. Then how can the average of the 76 million be 156? I'm sorry again if I am misunderstanding something, but following the protocol, this seems rather straightforward: the number of molecules you report after size selection (and before PCR amplification) is the number of fragments that passed the size selection. I find a discrepancy in these statistics. 76 million fragments with an average of 156 result in 74 million fragments failing a size selection with max = 299 bp, so those 74 million must be > 299. How can the average of the 76 million then be 156?

Comment by Micha Sammeth [ 24/Jul/12 ]

Careful, uniform random fragmentation does not produce uniformly distributed fragment sizes; the two are very often confounded. I hope this will become clear from the manuscript once it is available. However, this is not the point of discussion here. If I understood correctly, in your example you are comparing 2 million UN-amplified fragments that fall in a certain interval to the total of ~76 million amplified fragments. The ratio should be more reasonable when comparing UN-amplified fragments within the interval to the total of UN-amplified fragments, or when comparing both amplified values (multiplier in the 4th column of the .LIB file). In the former case you would probably be comparing a few million reads, and in the latter case dozens of millions of reads, with each other.

Generated at Tue Feb 09 22:57:44 CET 2016 using JIRA 6.4#64014-sha1:a67c3532e05d88bdc00cabc2cf5e0c6b82fa6023.
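Micha's caveat above, that uniform random fragmentation does not produce uniformly distributed fragment sizes, can be illustrated with a toy model based on the 10-molecules example from the thread: break a 500 nt molecule at four positions drawn uniformly at random, giving five fragments per molecule. The mean fragment length is exactly 100 nt by construction, yet only a small fraction of fragments land near that mean. (This sketches the general phenomenon, not the simulator's actual UR model.)

```python
import random

random.seed(42)

def fragment_uniform(length=500, n_breaks=4):
    """Break one molecule at n_breaks distinct positions drawn
    uniformly at random; return the resulting fragment lengths."""
    cuts = sorted(random.sample(range(1, length), n_breaks))
    bounds = [0] + cuts + [length]
    return [b - a for a, b in zip(bounds, bounds[1:])]

frags = []
for _ in range(10_000):   # 10,000 molecules of 500 nt each
    frags.extend(fragment_uniform())

mean_len = sum(frags) / len(frags)   # exactly 100.0, lengths are conserved
near_mean = sum(90 <= f <= 110 for f in frags) / len(frags)
```

The resulting lengths follow a strongly right-skewed, near-exponential distribution, so a narrow selection window around (or above) the mean discards most fragments. By the same logic, a mean of 156 over 76 million fragments is compatible with only ~2 million falling in [181;299]: most of the rejected fragments are shorter than 181, not longer than 299.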