DEFINITION AND CHARACTERIZATION OF PETROLEUM COMPOSITIONAL FAMILIES USING PRINCIPAL COMPONENT ANALYSIS

Nikos Pasadakis (1), Mark Obermajer (2), Kirk G. Osadetz (2)
(1) Technical University of Crete; (2) Natural Resources Canada, GSC Calgary

Background

The Williston Basin is a sub-circular epicratonic basin that preserves a Phanerozoic succession up to 5 km thick overlying the deeply eroded Archean Superior and Hearne/Wyoming cratons. It is an extensively studied, prolific petroleum province with producing horizons throughout the entire Phanerozoic succession. As a result of numerous geological and geochemical studies, the regional petroleum systems are relatively well defined, making this setting ideal for developing alternative means of identifying and describing petroleum systems. This study discusses improvements in petroleum-petroleum correlations that result from exploratory multivariate statistical analysis of many easily obtained, thermally persistent compounds. Specifically, we perform principal component analysis (PCA) on gasoline range and saturate fraction gas chromatographic data obtained from analyses of 171 oil samples produced from the Ordovician-Mississippian interval.

Low-abundance, high molecular weight, structurally complicated compounds, such as polycyclic alkanes, are the preferred basis of petroleum family definition because the biochemical, sedimentological and physical processes that singly or competitively affect their compositional variations are understood. However, such an approach ignores both the majority of the petroleum composition and most of the compositional data obtained from analytical protocols. Moreover, the effects of important processes, especially mixing, might not be discernible if absolute concentrations vary between mixed components. Such techniques neither define objective criteria for separating families nor describe the internal compositional variations within a single family.
It is desirable to determine whether abundant, simple compounds exhibit characteristics and variations consistent with an interpretation based on less abundant but more complicated compounds. In addition, the compounds that define a family may not persist through the complete thermal maturity range, making it necessary to use alternative components to relate samples that lack diagnostic compositional elements. Ideally, one would employ the most easily obtained, most abundant and most thermally persistent components that allow family definition. However, qualitative and semi-quantitative analyses of such fractions are commonly non-diagnostic of familial affinity.

PCA is an exploratory multivariate statistical method. Experiments or observations that yield data for many variables from many samples often contain valuable information, but the number of variables and samples, as well as covariance among the variables, can obscure its significance. Such data must be explored to determine whether most of the original information can be represented by a reduced number of derived variables. Petroleum systems are well suited to such exploration because sample molecular composition results from a complicated interaction of biological, environmental, geological and physical processes working competitively and simultaneously.

PCA derives a new set of uncorrelated variables, the principal components. Each principal component accounts for the largest possible portion of the remaining total variance, using linear combinations of the original variables. The first principal component passes through the centroid of the standardized data set and is oriented to maximize sample variance. Each successive principal component explains the largest possible variance while being orthogonal to all preceding principal components, so successive components explain progressively less of the original variance.
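The procedure described above can be sketched in a few lines of Python. This is a minimal illustration, assuming NumPy is available and using eigen-analysis of the correlation matrix on standardized data; the data matrix is synthetic and the function names `pca_correlation` and `project` are hypothetical, so it should not be read as the software or settings used in this study.

```python
import numpy as np

def pca_correlation(X):
    """PCA by eigen-analysis of the correlation matrix (rows = samples)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    Z = (X - mean) / std                      # standardized data, centroid at origin
    R = np.corrcoef(Z, rowvar=False)          # correlation matrix of the variables
    eigvals, eigvecs = np.linalg.eigh(R)      # eigh: R is symmetric
    order = np.argsort(eigvals)[::-1]         # sort by decreasing variance
    eigvals = eigvals[order]
    loadings = eigvecs[:, order]              # columns = principal components
    scores = Z @ loadings                     # sample positions in PC space
    explained = eigvals / eigvals.sum()       # fraction of total variance per PC
    return scores, loadings, explained, mean, std

def project(new_X, loadings, mean, std):
    """Scores of additional samples in an existing model."""
    return ((np.asarray(new_X, dtype=float) - mean) / std) @ loadings

# Synthetic stand-in data: 171 samples, 5 variables (NOT the study data).
rng = np.random.default_rng(0)
X = rng.normal(size=(171, 5))
X[:, 1] += 0.8 * X[:, 0]                      # introduce covariance between two variables

scores, loadings, explained, mean, std = pca_correlation(X)
```

Plotting the first two columns of `scores` against each other gives a scores plot, and the first two columns of `loadings` give the corresponding loadings plot. Using the covariance matrix instead of the correlation matrix would correspond to centring the data without scaling it.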
The number of derived variables equals the number of original variables. Sample scores describe position in principal component space, and each original variable has loadings that describe its contribution to each principal component. Mathematically, PCA is the eigen-analysis of either the covariance matrix or the correlation matrix, depending on how the data set is prepared. Displaying a subset of the principal components amounts to selecting a reduced variable set. Such visualizations illustrate characteristics of the data (clustering about a point in PC space), linear gradients in the data (correlated variations in PC space), or a non-linear relationship among the samples (horseshoe effect). This allows visualization of sample associations and elucidation of the role and importance of the original variables, based on their loadings. The interpretation of the principal components requires additional information from models that interpret variable loadings. It is also possible to refer to independent data, models and classifications, as is the case here. Finally, additional samples can be compared to the model by using the factor loadings to calculate their sample scores.

Principal component analysis has many applications to geological problems. In organic geochemistry it has been used to describe and classify both petroleum generation and secondary processes and to identify petroleum families while characterizing alteration pathways, and it has been shown to be efficient for discriminating petroleum sources. In this work we use PCA for variable reduction and classification purposes, maximizing the diagnostic characteristics of both fractions.

Source rock                                        pr/ph   m/l    d/r    23/30   Oil family (principal reservoir)
Lodgepole Fm. (L. Miss.)                           0.68    1.81   0.41   0.40    Family C (Madison)
Bakken Fm. (U. Dev.-Miss.)                         1.40    4.02   2.55   0.67    Family B (Bakken)
Winnipegosis Fm. (M. Dev.)                         0.87    2.11   1.52   0.07    Family D (Winnipegosis)
Winnipeg Gr. (M. Ord.) and Bighorn Gr. (U. Ord.)   1.10    9.07   2.48   0.04    Family A (Red River)

Generalized familial classification of Paleozoic oils in the Williston Basin (principal reservoirs in parentheses), with source rocks and critical biomarker criteria. Families A to D follow Osadetz et al. (1992, 1994). Biomarker ratios (average): pr/ph - pristane/phytane; m/l - C15-19/C21-25 n-alkanes; d/r - diasteranes/regular steranes; 23/30 - C23 tricyclic terpane/C30 hopane. The original table also compares earlier schemes: Type I (Ordovician-Silurian oils) and Type II (Devonian, Mississippian & Mesozoic oils) of Williams (1974), and Groups 1 to 4 of Zumberge (1983) and Leenheer & Zumberge (1987), whose principal reservoirs include Red River, Duperow, Nisku, Mission Canyon and Bakken; one entry is marked "not studied".

[Figure: SSE-NNE cross-section (depth in km, Cambrian to Paleocene) through the Nesson Anticline.] Stratigraphic section of Williston Basin, indicating the main stratigraphic features and position of effective Paleozoic petroleum source rocks (Yeoman, Winnipegosis, Bakken, Lodgepole).

[Figure: map.] Location map showing the main geological elements of Williston Basin and the distribution of petroleum provinces in the study area.

Biomarkers

[Figure: cross-plot of C34/C33 hopane versus C35/C34 hopane for oil families A, B, C and D, with fields of C34 prominence and C35 prominence.]

[Figure: gasoline range gas chromatograms (GRGC) and saturate fraction gas chromatograms (SFGC) of four oils.]
Family A - Lab. no. 3118; pool: Raymond, MT; reservoir: Red River (Ord)
Family B - Lab. no. 1402; pool: Squaw Gap, ND; reservoir: Bakken (Miss)
Family C - Lab. no. 1388; pool: Weyburn, SK; reservoir: Madison (Miss)
Family D - Lab. no. 1364; pool: Hitchcock, SK; reservoir: Winnipegosis (Dev)
GRGC peaks: 1 - 2,2-dimethylpentane; 2 - 2,4-dimethylpentane; 3 - 3,3-dimethylpentane; 4 - 2-methylhexane; 5 - 2,3-dimethylpentane; 6 - 1,1-dimethylcyclopentane; 7 - 3-methylhexane; 8 - 1,cis-3-dimethylcyclopentane; 9 - 1,trans-3-dimethylcyclopentane; 10 - heptane; 11 - methylcyclohexane; 12 - toluene; 13 - octane.
SFGC peaks: Pr - pristane; Ph - phytane; 15 - C15 n-alkane; 20 - C20 n-alkane; 25 - C25 n-alkane.
Representative GRGCs and SFGCs showing typical gasoline range and n-alkane distributions in each biomarker-defined Paleozoic oil family.

[Figure: GRCM sample scores (PC1 vs. PC2) for oil families A, B, C and D, and X-loadings (PC1 vs. PC2) of nC6, nC7, nC8, naphC6, naphC7, naphC8, C6, C7, C8, Benz and Tol.]
nC6 = hexane / sum of compounds eluting between 2,2-dimethylbutane and hexane;
nC7 = heptane / sum of compounds eluting between 2,2-dimethylpentane and heptane;
nC8 = octane / sum of compounds eluting between methylcyclohexane and octane;
CYC6 = cyclopentane/hexane;
CYC7 = (methylcyclopentane + cyclohexane + dimethylcyclopentane)/heptane;
CYC8 = (methylcyclohexane + ethylcyclopentane + 1,cis-4-dimethylcyclohexane)/octane;
C6 = (dimethylbutane + methylpentane)/hexane;
C7 = (trimethylbutane + dimethylpentane + methylhexane)/heptane;
C8 = (dimethylhexane + trimethylpentane + methylheptane)/octane;
Benz = benzene/heptane;
Tol = toluene/octane.

The GRCM classifies Families A, B and C unambiguously. Family C exhibits a significant PC2 compositional gradient. Family A samples are characterized by an enrichment of gasoline range n-alkanes, while Families B and C both have abundant branched and cyclic compounds, but Family C oils are enriched in aromatic compounds and cyclopentane compared to Family B.
Family D oils exhibit a wide variation in all the modelled gasoline range components, but they are neither as enriched in n-alkanes as Family A generally is, nor as enriched in aromatics and cyclopentane as about half of the Family C oils are. While Families B and C are distinguishable, together they display a PC2 variation suggesting either an alteration of Family C controlled by water washing, or significant mixing of pristine aromatic-enriched Family C oils with aromatic-poor Family B oils. This model alone cannot distinguish between these two alternatives.

[Figure: GRRM sample scores (PC1 vs. PC2) for oil families A, B, C and D, and X-loadings (PC1 vs. PC2) of PI 1, K1, bC7 and dmC5.]
PI 1 (isoheptane value) = (2-methylhexane + 3-methylhexane) / sum of 1,cis-3-, 1,trans-3- and 1,trans-2-dimethylcyclopentanes;
K1 (Mango parameter) = (2-methylhexane + 2,3-dimethylpentane) / (3-methylhexane + 2,4-dimethylpentane);
bC7 (weight % isoheptane) = (2-methylhexane + 2,3-dimethylpentane + 3-methylhexane) * 100 / sum of compounds eluting between 2-methylhexane and 2,2-dimethylhexane;
dmC5 = 2,4-dimethylpentane / 2,3-dimethylpentane.

The GRRM provides an improved characterization of Family D samples, but it is weaker than the GRCM at discriminating Family B from Family C in PC1 vs. PC2 space. A strong gradient among the four families is controlled by the loadings of both the K1 factor and the branched to total C7 compound ratio. This demonstrates that the K1 factor is primarily a source indicator, contrary to the initial interpretation of this parameter. An important feature is the sub-parallel orientation of the four familial gradients, indicating linear variations within each family. The similarity of these gradients suggests that a single dominant process controls these internal linear variations. Since the observed gradients are strongly controlled by Paraffin Index 1 (PI 1) loadings, that process is inferred to be thermal maturity.
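As an illustration of how the GRRM input variables follow from the definitions above, the sketch below evaluates PI 1, K1 and dmC5 from gasoline range peak areas. The peak areas are hypothetical round numbers chosen for easy checking, not measurements from this study, and the function names are illustrative. bC7 is omitted because it needs the full set of peaks eluting between 2-methylhexane and 2,2-dimethylhexane.

```python
# Hypothetical peak areas (arbitrary units) for one oil; NOT measured data.
areas = {
    "2-methylhexane": 4.0,
    "3-methylhexane": 5.0,
    "2,3-dimethylpentane": 2.0,
    "2,4-dimethylpentane": 1.0,
    "1,cis-3-dimethylcyclopentane": 1.5,
    "1,trans-3-dimethylcyclopentane": 1.0,
    "1,trans-2-dimethylcyclopentane": 2.0,
}

def paraffin_index_1(a):
    """PI 1 (isoheptane value): methylhexanes over three dimethylcyclopentanes."""
    dmcp = (a["1,cis-3-dimethylcyclopentane"]
            + a["1,trans-3-dimethylcyclopentane"]
            + a["1,trans-2-dimethylcyclopentane"])
    return (a["2-methylhexane"] + a["3-methylhexane"]) / dmcp

def mango_k1(a):
    """K1 (Mango parameter) as defined in the text."""
    return ((a["2-methylhexane"] + a["2,3-dimethylpentane"])
            / (a["3-methylhexane"] + a["2,4-dimethylpentane"]))

def dm_c5(a):
    """dmC5: 2,4-dimethylpentane over 2,3-dimethylpentane."""
    return a["2,4-dimethylpentane"] / a["2,3-dimethylpentane"]

grrm_variables = {
    "PI 1": paraffin_index_1(areas),
    "K1": mango_k1(areas),
    "dmC5": dm_c5(areas),
}
```

Computing one such row of variables per oil gives the kind of data matrix on which a ratio-based principal component model can be built.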
The almost orthogonal relationship between the loadings of PI 1 and the K1 factor suggests little impact of thermal maturity on the K1 parameter, further showing that K1 is an effective source indicator.

[Figure: SFCM sample scores (PC1 vs. PC2) for oil families A, B, C and D, and X-loadings (PC1 vs. PC2) of Pr, Ph and the C13 to C24 n-alkanes.]
Pr = pristane normalized to the highest peak; Ph = phytane normalized to the highest peak; nC13 to nC24 = C13 to C24 normal alkanes normalized to the highest peak.

Although oil families generally exhibit distinctive ranges of sample scores within the SFCM, this model is less effective in separating families B, C and D. The results are consistent with a predominance of lower molecular weight, odd carbon number n-alkanes in Family A oils that varies with thermal maturity. The general tendency for Family C oils to have the most negative PC1 scores indicates a lower relative concentration of Pr and more abundant even carbon numbered n-alkanes, compared to Family B samples. This is consistent with a lack of water column anoxia during deposition of Bakken source rocks compared with significant water column anoxia during the deposition of Lodgepole source rocks.

It is apparent that gasoline range compositional differences in Family C are accompanied by non-linear variations of sample scores of SFGC components. The non-linear relationship among Family C samples describes a continuous compositional variation, but the two orthogonal limbs provide a basis for subdividing this family into two subgroups. The subgroup with strongly negative PC1 scores includes samples enriched in benzene and toluene, while the samples with more positive PC1 scores are depleted in these compounds and overlap with aromatic-poor Family B oils.
While different processes might explain these internal Family C variations (water washing, biodegradation, compositional mixing or their combination), a mixing hypothesis is the most plausible explanation. The PC1 and PC2 scores of Family D samples overlap those of families B and C, so these samples cannot be distinguished from one another. Family D scores also exhibit a non-linear variation. Although there are too few Family D samples to confirm subcompositions, the similarity to the non-linear behaviour of Family C suggests that similar processes may be responsible for those variations.

[Figure: SFRM sample scores (PC1 vs. PC2) for oil families A, B, C and D, and X-loadings (PC1 vs. PC2) of Pr/Ph, Pr/C17, Ph/C18, CPI(14-20) and CPI(22-32).]
Pr/Ph = pristane/phytane ratio;
Pr/nC17 = pristane/C17 normal alkane ratio;
Ph/nC18 = phytane/C18 normal alkane ratio;
CPI 14-20 = ½{[(C15+C17+C19)/(C14+C16+C18)] + [(C15+C17+C19)/(C16+C18+C20)]};
CPI 22-32 = ½{[(C23+C25+C27+C29+C31)/(C22+C24+C26+C28+C30)] + [(C23+C25+C27+C29+C31)/(C24+C26+C28+C30+C32)]}.

In this model the gradient of Family A is distinctive, but its range cannot be attributed only to differences in thermal maturity, since this gradient is associated with original variable loadings commonly attributed to source rock depositional environment, such as Pr/Ph. More noticeable are the two general gradient trends between the Ordovician-sourced and Devono-Carboniferous-sourced oils, controlled by the loadings of Pr/C17 and the CPI for the light n-alkanes. The orthogonal gradient of Family D samples suggests that the composition of that oil family is affected by factors other than the source kerogen and thermal maturity variations that account for the internal variation of oil family composition in this and previous models.
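The two carbon preference indices defined above share one structure: the summed odd-numbered n-alkanes are divided by the even-numbered homologues one carbon lighter and one carbon heavier, and the two quotients are averaged. A minimal sketch follows, assuming peak areas keyed by carbon number; the function name `cpi` and the example areas are hypothetical and are not data from this study.

```python
def cpi(areas, first_odd, last_odd):
    """Carbon preference index over the odd n-alkanes first_odd..last_odd.

    areas maps carbon number -> n-alkane peak area. The odd-carbon sum is
    divided by the even homologues one carbon below and one carbon above,
    and the two quotients are averaged.
    """
    odd = range(first_odd, last_odd + 1, 2)
    odd_sum = sum(areas[n] for n in odd)
    even_low = sum(areas[n - 1] for n in odd)
    even_high = sum(areas[n + 1] for n in odd)
    return 0.5 * (odd_sum / even_low + odd_sum / even_high)

# Hypothetical n-alkane peak areas for C14 to C32 (NOT measured data).
flat = {n: 1.0 for n in range(14, 33)}                       # no carbon preference
odd_rich = {n: (2.0 if n % 2 else 1.0) for n in range(14, 33)}  # odd predominance

cpi_14_20_flat = cpi(flat, 15, 19)       # CPI 14-20 uses C15, C17, C19
cpi_22_32_odd = cpi(odd_rich, 23, 31)    # CPI 22-32 uses C23 ... C31
```

A profile with no odd-over-even preference gives an index of 1, and values above 1 indicate odd carbon number predominance over the chosen range.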
The distinction between Family B and Family C samples is controlled by the loadings of the Pr/Ph ratio and the heavy n-alkane CPI. Both are commonly interpreted as indications of source rock depositional environment. Compared to previous models, this model illustrates distinct and characteristic sample scores for these two families. If mixing of Family C and Family B end-members were the only, ubiquitous process responsible for the variation of Family C scores and their overlap with Family B, then a more significant overlap of sample scores in the SFRM might have been expected.

[Figure: CGSM sample scores (PC1 vs. PC2) for oil families A, B, C and D, and X-loadings (PC1 vs. PC2) of Pr/Ph, Pr/C17, Ph/C18, CPI(14-20), nC6, nC7, Benz and Tol.]
Pr/Ph = pristane/phytane ratio;
Pr/nC17 = pristane/C17 normal alkane ratio;
Ph/nC18 = phytane/C18 normal alkane ratio;
CPI 14-20 = ½{[(C15+C17+C19)/(C14+C16+C18)] + [(C15+C17+C19)/(C16+C18+C20)]};
nC6 = hexane / sum of compounds eluting between 2,2-dimethylbutane and hexane;
nC7 = heptane / sum of compounds eluting between 2,2-dimethylpentane and heptane;
Benz = benzene/heptane;
Tol = toluene/octane.

The CGSM is useful for the classification of families A and B, and for distinguishing either Family C or Family D in the absence of the other. Family A oils are distinctive and exhibit the smallest internal variation in PC1-PC2 space, due primarily to source. A similar source emphasis is seen in families B and C. If the internal range of individual family PC1 scores is typified by Family A, then the ranges of PC1 scores of families B and C would overlap, with or without mixing. Therefore, the PC1 compositions of these two families are likely due to source. Family D PC1 scores likely reflect a progressive change in source paleodepositional conditions during the open marine to mesohaline basin transition.
Therefore, these scores are interpreted as indicating changes in water column chemistry, specifically oxygenation and salinity, acting on differences in biomass. This is consistent with previously observed variations between families B and C, where the two were interpreted to have compositions indicative of dysaerobic and anaerobic water columns, respectively. PC2 scores, controlled by the loadings of benzene and toluene, and Pr/Ph, define a gradient between families B and C that is at least partly attributed to mixing.

[Figure: PC1 vs. PC2 sample scores for Models 1, 2, 4 and 5, and cross-plots of benzene, Ph/nC18 and CPI (14-20) against PC1 scores for Models 1 and 4.]

Summary

- Gasoline range and >210 °C boiling point saturate fraction hydrocarbons carry important petroleum system information, but they are affected by multiple processes simultaneously, which complicates compositional traits and limits their independent value for classification and interpretation.
- Principal components of these two fractions, interpreted separately or in combination, enhance the interpretation of petroleum systems, especially when combined with information from biological marker compounds.
- In Williston Basin, compositional variations of these fractions follow polycyclic terpane and sterane biomarker compositional characteristics interpreted previously as indicative of source rock and thermal maturity characteristics, but herein related to a more complicated set of processes.
- Only Family A oils, from Ordovician sources, have sufficiently distinctive compositions to be classified using reduced variable sets from principal component compositional analysis of these fractions.
  Families B, C and D oils, from Givetian-Tournaisian source rocks, also have characteristic compositions of these fractions, but the range of compositional variations precludes their use as a primary classification tool. Consideration of multiple models, pair-wise models and additional derived variables can minimize ambiguities. Pair-wise discrimination using multiple models assists the identification of families B, C and D.
- Principal component models of both the group of Williston Basin oils and the individual biomarker-defined families exhibit linear and non-linear compositional variations that are interpreted, using variable loadings and independent information, to result from kerogen composition, compositional mixing, source rock depositional environment and thermal maturity.
- Despite their lack of diagnostic capability for familial classification, these fractions can be better indicators for certain processes, such as compositional mixing (as shown here particularly), than more complicated compounds present in lower abundance.

Therefore, the classification and analysis of petroleum systems can benefit from a combination of biological marker analysis and principal component analysis of gasoline range and >210 °C boiling point saturate fraction hydrocarbons.

References

Leenheer, M.J., Zumberge, J.E., 1987. Correlation and thermal maturity of Williston Basin crude oils and Bakken source rocks using terpane biomarkers. In: Longman, M.W. (Ed.), Williston Basin: Anatomy of a Cratonic Oil Province. Rocky Mountain Association of Geologists, Denver, pp. 287-298.
Obermajer, M., Osadetz, K.G., Fowler, M.G., Snowdon, L.R., 2000. Light hydrocarbon (gasoline range) parameter refinement of biomarker-based oil-oil correlation studies: an example from Williston Basin. Organic Geochemistry 31, 959-976.
Osadetz, K.G., Brooks, P.W., Snowdon, L.R., 1992. Oil families and their sources in Canadian Williston Basin, (southeastern Saskatchewan and southwestern Manitoba). Bulletin of Canadian Petroleum Geology 40, 254-273.
Osadetz, K.G., Snowdon, L.R., Brooks, P.W., 1994. Oil families in Canadian Williston Basin southwestern Saskatchewan. Bulletin of Canadian Petroleum Geology 42, 155-177.
Williams, J.A., 1974. Characterization of oil types in Williston Basin. American Association of Petroleum Geologists Bulletin 58, 1243-1252.
Zumberge, J.E., 1983. Tricyclic diterpane distributions in the correlation of Paleozoic crude oils from the Williston Basin. In: Bjoroy, M. (Ed.), Advances in Organic Geochemistry 1981. John Wiley & Sons Ltd., New York, pp. 738-745.

We thank Sneh Achal & Laura Mulder, GSC-Calgary, for excellent technical assistance. The Saskatchewan Geological Survey and various oil companies are thanked for providing selected oil samples.