Journal Pre-proof
Shale lithology identification using stacking model combined with SMOTE from well
logs
Jinlu Yang, Min Wang, Ming Li, Yu Yan, Xin Wang, Haoming Shao, Changqi Yu, Yan
Wu, Dianshi Xiao
PII: S2666-5190(22)00011-5
DOI: https://doi.org/10.1016/j.uncres.2022.09.001
Reference: UNCRES 15
To appear in: Unconventional Resources
Received Date: 9 July 2022
Revised Date: 4 September 2022
Accepted Date: 5 September 2022
Please cite this article as: J. Yang, M. Wang, M. Li, Y. Yan, X. Wang, H. Shao, C. Yu, Y. Wu, D.
Xiao, Shale lithology identification using stacking model combined with SMOTE from well logs,
Unconventional Resources (2022), doi: https://doi.org/10.1016/j.uncres.2022.09.001.
This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition
of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of
record. This version will undergo additional copyediting, typesetting and review before it is published
in its final form, but we are providing this version to give early visibility of the article. Please note that,
during the production process, errors may be discovered which could affect the content, and all legal
disclaimers that apply to the journal pertain.
© 2022 Published by Elsevier B.V. on behalf of KeAi Communications Co., Ltd.
Shale lithology identification using stacking model combined
with SMOTE from well logs
Jinlu Yang1,2, Min Wang1,2*, Ming Li1,2, Yu Yan1,2, Xin Wang1,2, Haoming Shao1,2, Changqi Yu1,2,
Yan Wu1,2, Dianshi Xiao1,2
1. Laboratory of Deep Oil and Gas, China University of Petroleum (East China), Qingdao 266580,
China.
2. School of Geosciences, China University of Petroleum (East China), Qingdao 266580, China.
Abstract: Shale lithology identification is the basis of geological research and reservoir
characterization, and is an essential task for oil exploration. Recently, several machine learning
algorithms have been applied to improve the accuracy of lithology identification. However, stacking
models have rarely been used for lithology identification in existing studies, and little consideration
has been given to the imbalanced lithologies problem. In this study, we build a stacking model based
on random forest (RF), extreme gradient boosting (XGBoost) and linear regression (LR), use the
synthetic minority oversampling technique (SMOTE) to mitigate the imbalanced lithologies problem,
and then compare the stacking model with support vector machine (SVM), RF and XGBoost models
after adjusting model parameters using grid search and fivefold cross-validation. The authors
prepared a dataset consisting of logging data and core data from 13 wells in a depression in Junggar
basin, China, including a total of 2352 sample points marked with lithologic labels. The lithologies
identified in this study are mudstone (MS), dolomitic mudstone (DM), siltstone (S), dolomitic
siltstone (DS) and micritic dolomite (MD). The results show that (1) the overall identification
performance of the stacking model is better than that of the SVM, RF and XGBoost models. (2)
SMOTE algorithm can effectively improve the identification performance of the minority
lithologies. (3) Density log is the most important factor in identifying lithologies. The stacking
model combined with SMOTE proposed in this paper has high lithology identification performance,
which renders it practicable for lithology identification.
Keywords: Shale; Lithology identification; Minority lithologies; Stacking
1. Introduction
A clear and accurate understanding of the lithology of the formation rocks is of great
significance to reservoir evaluation, reservoir characterization and oil exploration (Horrocks et al.,
2015; Li et al., 2019; Saporetti et al., 2019). Among the traditional lithology analysis methods, the
most accurate one is core analysis. However, due to the small number of cores and high cost of
acquiring a core sample, it is impossible to core the whole section of a well. Therefore, indirect
lithology identification using information-rich logs has become an important tool for studying
reservoirs (Ghosh et al., 2016). However, the reservoir heterogeneity renders it hard to describe the
reservoir using linear logging response equations and empirical statistical formulations (Corina and
Hovda, 2018; Zheng et al., 2021).
In recent years, machine learning has been widely used by researchers to reduce logging
interpretation costs and improve analysis accuracy (Raeesi et al., 2012; Dong et al., 2016;
Chevitarese et al., 2018; Zhang et al., 2018; Chen et al., 2020; Li et al., 2020; Xu et al., 2021; Tian
et al., 2021; Sun et al., 2021). Machine learning has strong nonlinear mapping capability, high fault
tolerance, and excellent application prospects. K-means was applied to identify five types of
metamorphic lithologies, and the results are relatively consistent with the actual stratigraphy (Yang
et al., 2016). Although K-means can achieve good identification results, it often ends up with a local
optimum instead of a global optimum. Moreover, K-means is an unsupervised learning algorithm,
which wastes data without using sample labels during training. Therefore, researchers prefer
supervised learning, which can effectively utilize logging data and core data. SVM was applied to
the identification of shale lithology. Five types of lithofacies were successfully classified
(Bhattacharya et al., 2016). SVM has been proved suitable for cases with a small training set
(Sebtosheikh and Salehi, 2015). The results of applying RF to logging while drilling suggest that it
is faster and more accurate than SVM (Sun et al., 2019). A comparison among artificial neural
network (ANN), SVM, RF and XGBoost showed that XGBoost and RF outperform SVM and ANN (Xie
et al., 2018). A study further showed that XGBoost outperforms RF among the boosting family of
algorithms (Dev and Eden, 2019).
Although the methods listed above are effective for lithology identification, the stacking
method, which is used in high-level data mining competitions such as Kaggle to further improve the
data mining ability of single machine learning algorithms, has rarely been used in existing studies,
and the imbalanced lithologies problem has received little consideration. The imbalanced lithologies
problem biases a model's identification performance toward the majority lithologies (Deng et al.,
2017; Saporetti et al., 2018; He, Gu et al., 2020; Zhou et al., 2020). Therefore, in this study, we
build a stacking model based on RF, XGBoost and LR, and compare the stacking model with SVM,
RF and XGBoost. After improving the identification performance of the minority lithologies using
SMOTE algorithm, the effect of different oversampling ratios on the identification effect of models
is analyzed.
The purpose of this study is threefold. First, the stacking method is introduced to further
improve the performance of lithology identification by machine learning models. Second, the effect
of different oversampling ratios on the identification performance of different models is analyzed.
Third, we analyze the importance of logging parameters in lithology identification using Shapley
additive explanations (SHAP).
The rest of this study is organized as follows: Section 2 introduces the identification models
used in the research, and Section 3 performs exploratory data analysis on the collected logging and
core data and presents a comparative study of the performance of the stacking model, SVM, RF,
and XGBoost models. Section 3 also considers the importance of logging parameters and the effect
of minority-class oversampling, and performs lithology prediction for a single well. The paper is
concluded in Section 4.
2. Method
Machine learning algorithms use artificially provided example data to generate predictive
models and make predictions on unknown data. Supervised learning is a branch of machine learning
that generates predictive models based on labeled training data. The training data comprise a set of
feature vectors and a desired output value. The supervised learning algorithm analyzes the training
data, trains a model, and makes predictions for unknown data. This study evaluates the effectiveness
of the identification model by comparing and analyzing the identification performances of the
stacking model, SVM, RF, and XGBoost models.
2.1 SVM
SVM is a powerful machine learning model that can perform linear or non-linear classification,
regression, and even outlier detection. SVM was initially applied to binary classification tasks.
When the dataset is simple and linearly separable, a maximum edge hyperplane is constructed in
the training samples, which separates the samples as much as possible, maximizing the difference
between the two classes of data samples. When the dataset is linearly inseparable, SVM first
implicitly projects the training data into a high-dimensional feature space using its kernel function
and classifies the training data linearly in the high-dimensional feature space to find a hyperplane
to realize the classification tasks (Xiao et al., 2015). Common kernel functions include the linear
and radial basis functions.
SVM is simple to train, and its identification ability is mainly limited by the penalty factor C
and the kernel function parameter γ (Cherkassky and Ma, 2004). The penalty factor C allows SVM to
make misclassifications during the training process, so as to ensure that the decision boundary is
smoother. The kernel function parameter γ mainly defines the influence of a single sample on the
whole identification hyperplane. The effect of a single sample is positively correlated with γ.
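The roles of C and γ can be sketched with scikit-learn's `SVC` on synthetic stand-in data; the dataset, parameter values, and library choice here are illustrative assumptions, not the authors' exact setup:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for normalized well-log features and lithology labels.
X, y = make_classification(n_samples=400, n_features=8, n_informative=5,
                           n_classes=3, n_clusters_per_class=1, random_state=0)
X = MinMaxScaler().fit_transform(X)  # map each feature to [0, 1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

# RBF kernel; C controls the misclassification penalty, and gamma controls
# how far the influence of a single training sample reaches.
svm = SVC(kernel="rbf", C=10.0, gamma=0.5).fit(X_train, y_train)
print(f"test accuracy: {svm.score(X_test, y_test):.2f}")
```

Larger C penalizes training errors more heavily (tighter fit), while larger gamma makes each sample's influence more local, consistent with the description above.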
2.2 Ensemble Learning
In supervised learning algorithms, the training goal is to obtain a stable classification or
regression model with good performance, but the actual situation is rarely ideal, and sometimes only
multiple preferred models can be obtained. Ensemble learning combines multiple weak supervised
models in order to get a better and comprehensively stronger supervised model (Dong et al., 2020;
Sagi and Rokach, 2018). An ensemble method is a meta-algorithm that combines several weak
supervised models into a strong prediction model.
2.2.1 RF
RF integrates multiple decision trees through ensemble learning and can be used for
classification or regression. When used for classification, a decision tree is constructed by randomly
selecting samples and sample features from the dataset; this is repeated several times, with the
resulting decision trees being uncorrelated, after which the results of all decision trees are counted
as the final result (Biau, 2012; Biau and Scornet, 2016). When predicting new samples, the
identification results of each decision tree in the forest are counted, and the majority category is
selected as the prediction result. RF has strong robustness in cases of missing data and unbalanced
data, and is suitable for high-dimensional data with thousands of sample features. The introduction
of randomness makes the method less prone to overfitting, with low variance and high
generalization ability.
The decision tree contained in RF is a classification model in the form of a tree structure (Myles
et al., 2004; Song and Lu, 2015). The decision tree is constructed using a top-down recursive
partitioning method that splits the dataset into smaller subsets. This process is repeated until the
splitting stops. The non-partitionable nodes are called leaf nodes, whereas the remaining nodes are
called internal nodes. Each internal node corresponds to a feature value, and each leaf node
corresponds to a sample category.
When a new sample serves as input, the decision tree starts from the root node, determines the
corresponding feature attributes, and selects the output node based on the result until it reaches the
leaf node. The result of the decision is the category stored in the leaf node.
The performance of RF is mainly limited by the number and maximum depth of decision trees.
If the number of decision trees is insufficient, the model is underfitted. As the number of decision
trees increases, the classification performance of the model tends to level off. Furthermore, the
maximum depth of the decision trees is adjusted to improve the fitting ability of each decision tree.
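A minimal sketch of these two parameters with scikit-learn's `RandomForestClassifier`; the synthetic data and the specific parameter values are illustrative assumptions, not the tuned values from Table 1:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the eight normalized log curves and five lithology labels.
X, y = make_classification(n_samples=500, n_features=8, n_informative=6,
                           n_classes=5, n_clusters_per_class=1, random_state=0)

# n_estimators (number of trees) and max_depth are the two parameters
# discussed in the text: too few trees underfit, deeper trees fit more closely.
rf = RandomForestClassifier(n_estimators=200, max_depth=8,
                            random_state=0).fit(X, y)

# Each tree votes; the majority category becomes the prediction.
print(rf.predict(X[:3]))
```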
2.2.2 XGBoost
XGBoost is a boosting method based on the gradient boosting decision tree (GBDT) algorithm
with many algorithmic and engineering improvements and is more efficient, flexible and portable
than GBDT (Chen and Guestrin, 2016). It has been widely used in many machine learning
competitions and has achieved good results. The core idea of XGBoost is to grow a tree by splitting
features repeatedly. Each time a tree is added, a new function f(x) is learned to fit the residuals of
the last prediction. When the training is complete, if a new sample is to be predicted, the
corresponding leaf nodes are computed in each tree based on the sample characteristics, and the
scores of each leaf node are summed up as the predicted value of the sample.
XGBoost enumerates all possible splits to optimize the splitting threshold. When using
classification and regression tree (CART) as the base classifier, XGBoost explicitly adds a regular
term to control the complexity of the model, which helps prevent overfitting, improving the
generalization ability of the model. While traditional GBDT only uses the first-order derivatives of
the cost function in model training, XGBoost performs a second-order Taylor expansion of the cost
function, which enables a fast and accurate gradient descent. Additionally, traditional GBDT does
not deal with missing values; however, XGBoost can automatically learn the strategy of dealing
with missing values.
The performance of XGBoost is mainly limited by the maximum depth of the decision tree,
minimum leaf node sample weight, feature sampling method, and sample sampling method. The
larger the maximum depth, the more localized samples the model learns and the more easily it is
overfitted. The larger the minimum leaf node sample weight, the more likely the model is to be underfitted. The
feature and sample sampling methods affect the proportion of data and features extracted when the
tree is newly built.
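The four parameter families above can be sketched with scikit-learn's `GradientBoostingClassifier`, used here as a dependency-free stand-in for XGBoost; the parameter names in the comments give the rough XGBoost equivalents, and the whole mapping is an illustrative analogy rather than the authors' configuration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=8, n_informative=6,
                           n_classes=5, n_clusters_per_class=1, random_state=0)

gbdt = GradientBoostingClassifier(
    max_depth=4,          # tree depth (XGBoost: max_depth)
    min_samples_leaf=5,   # rough analogue of XGBoost's min_child_weight
    subsample=0.8,        # sample (row) sampling per tree (XGBoost: subsample)
    max_features=0.8,     # feature (column) sampling per split (XGBoost: colsample_bytree)
    n_estimators=100,
    random_state=0,
).fit(X, y)
print(f"training accuracy: {gbdt.score(X, y):.2f}")
```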
2.2.3 Stacking model
A stacking model is a multimodel hierarchical integration framework; it does not comprise
a single machine learning algorithm, but rather a combination of multiple machine learning
algorithms, and is therefore a heterogeneous model (Rokach, 2010). The algorithms that make up a
stacking model are referred to as base learners. By integrating different base learners, stacking can
form better models than the base learners alone. Unlike the training process of a single algorithm,
the training process of a stacking model is relatively more complex. In this study, we propose a
stacking model based on RF, XGBoost and LR (Fig. 1). The first layer comprises RF and XGBoost,
which are trained separately. We concatenate the new training set and test set generated by these
two models, as shown in Fig. 1, as the input of the second layer. The second layer model
is LR, and the new training set and test set obtained from the first layer are used to train it
to obtain the final prediction result. Linear regression is an analytical method that uses linear
methods to model the relationship between one or more independent variables and a dependent
variable. Generally speaking, in a stacking model the first layer has a strong ability to
extract nonlinear relationships and can easily overfit. To reduce the risk of
overfitting, the second layer tends to use simple models such as LR. After completing the training
of the stacking model, we evaluate it on the test set, analyze the lithology identification
effect, and compare the evaluation results with those of the SVM, RF and XGBoost models.
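The two-layer wiring can be sketched with scikit-learn's `StackingClassifier`. As assumptions for this sketch, `GradientBoostingClassifier` stands in for XGBoost and `LogisticRegression` serves as the simple second-layer learner (scikit-learn's classifier stacking takes a classifier, not plain linear regression, as the meta-learner); the architecture, not the exact learners, is the point:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=8, n_informative=6,
                           n_classes=5, n_clusters_per_class=1, random_state=0)

# First layer: two strong nonlinear learners trained separately; their
# cross-validated predictions are concatenated into the second layer's input.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("gbdt", GradientBoostingClassifier(n_estimators=50, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),  # simple second layer
    cv=5,
)
stack.fit(X, y)
print(f"training accuracy: {stack.score(X, y):.2f}")
```

Using cross-validated first-layer predictions as the meta-learner's input is what keeps the second layer from simply memorizing the first layer's training-set fit.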
3. Application case
3.1 Data preparation and preprocessing
The logging data and core data used in this study were obtained from 13 wells in Junggar Basin,
China with a total of 2352 sample points. According to the core data, the target lithologies identified
in this study are mudstone (MS), dolomitic mudstone (DM), siltstone (S), dolomitic siltstone (DS)
and micritic dolomite (MD). Some scholars have introduced and analyzed the above five lithologies
in detail (Pan et al., 2022).
The following eight log parameters are collected as sample attribute values: spontaneous
potential log (SP), natural gamma log (GR), microspherically focused log (RMSL), shallow
investigation lateral log (LLS), deep investigation lateral log (LLD), acoustic travel time log (AC),
compensated neutron log (CNL), and density log (DEN). Each sample comprises a 9-dimensional
vector: eight dimensions of log parameters and one dimension for the lithology label.
To eliminate the effect of different logging curve magnitudes on the identification performance
of the model, it is necessary to normalize the data. In this study, the values of each logging curve in
the original dataset are mapped to [0, 1] using linear function normalization, defined as follows:
π‘₯∗ =
π‘₯ − π‘₯π‘šπ‘–π‘›
π‘₯π‘šπ‘Žπ‘₯ − π‘₯π‘šπ‘–π‘›
(1)
Here π‘₯ ∗ represents normalized data, π‘₯π‘šπ‘–π‘› represents the minimum value of a logging data,
and π‘₯π‘šπ‘Žπ‘₯ represents the maximum value of a logging data.
After the normalization, the logging data is randomly divided into training and testing sets in
a ratio of 8:2. The training set is used to train and tune the lithology identification model, and the
testing set is used to evaluate the model.
3.2 Model training
Appropriate parameters are selected on the basis of the training set for the SVM, RF, and
XGBoost models by grid search and fivefold cross-validation. For the SVM, RF, and XGBoost
models, Table 1 presents the parameters to be adjusted and their search ranges. The fivefold cross-
validation randomly divides the training dataset into five equal subsamples, wherein one subsample
is used as the dataset for testing the model, while the other four subsamples are used for training.
This process is repeated five times, so that every subsample is used once as the test data. The average
of the five test results is used as the final estimate of the model performance.
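Grid search with fivefold cross-validation can be sketched with scikit-learn's `GridSearchCV`; the model, grid values, and data below are illustrative assumptions, not the actual search ranges of Table 1:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=8, n_informative=6,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

# Illustrative search ranges (the paper's actual ranges are given in Table 1).
param_grid = {"n_estimators": [50, 100], "max_depth": [4, 8]}

# cv=5 performs the fivefold cross-validation described above for every
# parameter combination; the best average score selects the parameters.
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```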
Considering that the numbers of different lithologies in the training set are not equal, the SMOTE
algorithm is used to oversample the minority lithologies. The effect of different oversampling ratios
on the identification performance of the model is considered. Table 3 shows that the proportions of
siltstone and dolomitic siltstone are low and close to each other. Therefore, these two classes are
selected for oversampling.
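The core of SMOTE is to synthesize new minority samples by interpolating between a minority sample and one of its nearest minority-class neighbours. The following is a minimal self-contained sketch of that idea in NumPy (not the library implementation the authors used, and the toy feature vectors are invented):

```python
import numpy as np

def smote(X_min, n_new, k=5, seed=0):
    """Minimal SMOTE sketch: interpolate between minority-class samples
    and their k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                  # a sample is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]    # k nearest per sample
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))             # pick a random minority sample
        j = rng.choice(neighbours[i])            # pick one of its neighbours
        gap = rng.random()                       # random interpolation fraction
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Toy minority-class feature vectors (e.g. dolomitic siltstone, made up).
X_min = np.random.default_rng(1).random((10, 8))
X_new = smote(X_min, n_new=20)
print(X_new.shape)
```

Because each synthetic sample lies on a segment between two real samples, oversampling densifies the minority region without simply duplicating points.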
The evaluation metrics in this study are precision, recall and F1-score. For a testing dataset
with two categories, TN, FN, FP and TP are defined according to the predicted results as shown in
Table 2. The number of positive samples is P and the number of negative samples is N (Table 2);
the evaluation metrics are then defined as follows:

Precision = TP / (TP + FP)    (2)

Recall = TP / (TP + FN)    (3)

F1-score = 2 × (Precision × Recall) / (Precision + Recall)    (4)
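The three metrics can be worked through on a small hypothetical count table (the counts are invented for illustration):

```python
# Hypothetical counts for one lithology treated as the positive class.
TP, FP, FN = 40, 10, 20

precision = TP / (TP + FP)                          # fraction of predicted positives that are correct
recall = TP / (TP + FN)                             # fraction of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.8 0.667 0.727
```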
3.3 Results and discussion
The main work done in this section is as follows: 1. We perform exploratory data analysis on
the core and logging data. 2. After determining the model parameters, the lithology identification
results of each model are compared. 3. The importance analysis of logging parameters is
implemented. 4. The effectiveness of the SMOTE algorithm is analyzed. 5. We predict the lithologies
of a single well using the stacking model, SVM, RF and XGBoost models.
3.3.1 Exploratory data analysis
Exploratory data analysis is performed on the data using a statistical table and a correlation
matrix plot. The statistical table contains the percentage of each lithology; the correlation matrix
plot represents the correlation between the logging parameters, which can be represented by scatter
points and scatter density contour plots.
The proportions of the different lithologies are not equally distributed in the dataset (Table 3).
The proportions of mudstone and dolomitic mudstone are both above 20%, while the proportions of
siltstone and dolomitic siltstone are only about 10%.
Fig. 2 shows the correlation matrix of different logging parameters, with eight logging
parameters on both horizontal and vertical axes, and different colors representing different
lithologies. The scatter plots of the logging parameters are shown in the upper right area of Fig. 2.
Scatter plots usually reveal correlations when the data volume is small, but when there
is a large amount of data the scatter points overlap each other, reducing the discrimination
ability. The lower left area of Fig. 2 shows the density contour plots, where a darker color
represents a denser scatter distribution. The density contour plot can show
the data distribution in the overlapping area, which helps the interpretation when the data volume is
large. The middle diagonal is the kernel density estimation plot.
The upper right area of Fig. 2 shows that most of the correlations among the logging
parameters are not obvious, and only a few logging parameters have obvious correlations, such as
the positive correlations among LLS, LLD and RMSL, the positive correlation between AC and CNL,
and the negative correlation between AC and DEN. The lower left area of Fig. 2 shows that the
distribution of logging parameters among different lithologies is in an overlapping state and there
is no obvious dividing line, which indicates the difficulty of model identification.
3.3.2 Analysis of model prediction results
Precision, recall and F1-score are used to evaluate the identification performance. Table 4
shows the identification performance of different models. The results show that the stacking model
achieved the best identification results, followed by XGBoost and RF, and the identification
performance of SVM is the worst. Ensemble models are more suitable than SVM for the lithology
identification problem, and better identification performance can be obtained by using the stacking
method.
Each model has different identification results for different lithologies. All models show high
identification accuracy for mudstone and dolomitic mudstone. The F1-score of SVM for identifying
dolomitic siltstone is only 0.06, which is the worst identification result among these models. The
stacking model performs best overall, slightly outperforming XGBoost in identifying dolomitic
mudstone, dolomitic siltstone, and micritic dolomite. The ability of RF to identify siltstone and
dolomitic siltstone is weaker than that of XGBoost. The above results show that the ensemble models
have a great advantage in the face of data with unequally distributed lithologies.
Fig. 3 shows the confusion matrices based on the comparison of predicted and actual lithologies,
which show the distribution of lithologies identified by the models for each lithology. Taking the
confusion matrix of the stacking model as an example, the main misclassifications are as follows.
1. 13% and 14% of siltstone samples are misclassified as mudstone and dolomitic siltstone
respectively. 2. More than half of dolomitic siltstone samples are misclassified as dolomitic
mudstone. 3. 39% of micritic dolomite samples are misclassified as dolomitic mudstone. The above
misclassification may be caused by a large number of overlapping logging features between
misclassified lithologies, which makes identification difficult.
3.3.3 Importance analysis of logging parameters
After the training of each lithology identification model is completed, the importance of each
logging parameter needs to be calculated in order to determine the importance of different logging
parameters on lithology identification. The SHAP algorithm was proposed by Lundberg and Lee
(2017), which can be used to explain the importance of different parameters on model predictions
in classification and regression models. The SHAP algorithm uses marginal contribution to calculate
parameter importance; unlike the feature importance method built into the RF model, it does not
overestimate parameters with more categories, and it can provide richer information such as the
importance of different parameters for each lithology.
Fig. 4 shows the importance of different logging parameters. The horizontal axis of Fig. 4
shows the average absolute value of SHAP, which can be divided into SHAP values for different
lithologies, using different colors to represent the importance for different lithologies, and the
vertical axis shows each logging parameter in order of importance, with the importance of logging
parameters increasing from the bottom to the top.
Fig. 4 shows that DEN is the most important factor in identifying lithologies; the importance
of AC, CNL, SP, RMSL, LLD and LLS for lithology identification gradually decreases; GR has the
least influence on lithology identification. The importance of logging parameters varies for different
lithologies; for example, to identify dolomitic mudstone, DEN is the most influential factor and GR
is the least influential factor.
3.3.4 Oversampling ratios
The SMOTE algorithm is used to oversample minority lithologies in the expectation of improving
their identification accuracy. With different oversampling ratios, the identification models give
different results for different minority lithologies. Fig. 5 shows the effect of different oversampling
ratios of siltstone and dolomitic siltstone on the performance of the different lithology identification
models.
The results show that for siltstone, the F1-score of siltstone identification is almost unchanged
with the increase of oversampling ratios, which may be because the oversampling does not
effectively widen the feature boundaries of siltstone, resulting in the model not being able to learn
new feature boundaries, and thus the identification effect cannot be improved. For dolomitic
siltstone, the model identification effect is improved with the increase of the oversampling ratios.
SVM has the most obvious improvement in identification, with the F1-score increasing from 0.06
to 0.31. For RF and XGBoost, the F1-score of the former increased from 0.24 to 0.31, and the
F1-score of the latter increased from 0.29 to 0.35. For the stacking model, the F1-score increased
from 0.30 to 0.33. The results also show that the identification effect of the majority lithologies is
almost unaffected after oversampling the minority lithologies.
3.3.5 Single well lithology prediction
Fig. 6 shows that the SVM, RF, XGBoost and stacking models can all identify shale lithologies,
but the stacking model has the highest accuracy, indicating that the stacking model has high usability
in shale lithology identification: it can not only obtain complete lithology distribution
characteristics, but also improve the efficiency of lithology identification.
4. Conclusions
In this study, exploratory data analysis, modeling, oversampling and logging parameters
interpretation analysis are performed using machine learning algorithms in Junggar Basin, China,
with the following conclusions:
(1) The exploratory data analysis, by means of a statistical table and a correlation matrix plot,
can analyze the characteristics of the lithology distribution and the correlation between logging
parameters.
(2) The machine learning algorithms can dig out the correlation between logging parameters
and lithologies, and the stacking model achieves the best identification performance.
(3) The SMOTE algorithm can improve the identification performance of minority lithologies,
and has no effect on majority lithologies.
(4) The SHAP algorithm is applied to quantify the importance of different logging parameters
in lithology identification, and the results show that the logging parameters in descending order of
importance are DEN, AC, CNL, SP, RMSL, LLD, LLS, and GR.
Credit author statement
Jinlu Yang: Writing-original draft, Writing-review & editing. Min Wang: Research idea design.
Ming Li: Experiment operation and data processing. Yu Yan: Experiment operation and data
processing. Xin Wang: Experiment operation and data processing. Haoming Shao: Experiment
operation and data processing. Changqi Yu: Experiment operation and data processing. Yan Wu:
Experiment operation and data processing. Dianshi Xiao: Experiment operation and data processing.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal
relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
Financial support from the National Natural Science Foundation of China (No. 41922015, No.
42072147) and Qingdao Postdoctoral (ZX20210070) is acknowledged.
References
Bhattacharya, S., Carr, T.R., Pal, M., 2016. Comparison of supervised and unsupervised approaches
for mudstone lithofacies classification: case studies from the Bakken and Mahantango–Marcellus
Shale, USA. J. Nat. Gas. Sci. Eng. 33, 1119–1133.
Biau, G., 2012. Analysis of a random forests model. J. Mach. Learn. Res. 13(1), 1063–1095.
Biau, G., Scornet, E., 2016. A random forest guided tour. TEST. 25, 197–227.
Chen, G., Chen, M., Hong, G.B., Lu, Y.H., Zhou, B., Gao, Y.F., 2020. A new method of lithology
classification based on convolutional neural network algorithm by utilizing drilling string vibration
data. Energies. 13(4), 888.
Chen, T.Q., Guestrin, C., 2016. XGBoost: a scalable tree boosting system. In: Proceedings of the
22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San
Francisco. ACM, pp. 785–794.
Cherkassky, V., Ma, Y.Q., 2004. Practical selection of SVM parameters and noise estimation for
SVM regression. Neural Networks. 17(1), 113–126.
Chevitarese, D.S., Szwarcman, D., e Silva, R.G., Brazil, E.V., 2018. Deep learning applied to
seismic facies classification: a methodology for training. In: Conference Proceedings, Saint
Petersburg 2018, Vol. 2018. EAGE, pp. 1–5.
Corina, A.N., Hovda, S., 2018. Automatic lithology prediction from well logging using kernel
density estimation. J. Pet. Sci. Eng. 170, 664–674.
Deng, C.X., Pan, H.P., Fang, S.N., Konaté, A.A., Qin, R.D., 2017. Support vector machine as an
alternative method for lithology classification of crystalline rocks. J. Geophys. Eng. 14(2), 341–349.
Dev, V.A., Eden, M.R., 2019. Formation lithology classification using scalable gradient boosted
decision trees. Comput. Chem. Eng. 128, 392–404.
Dong, S.Q., Wang, Z.Z., Zeng, L.B., 2016. Lithology identification using kernel fisher discriminant
analysis with well logs. J. Pet. Sci. Eng. 143, 95–102.
Dong, X.B., Yu, Z.W., Cao, W.M., Shi, Y.F., Ma, Q.L., 2020. A survey on ensemble learning. Front.
Comput. Sci. 14(2), 241–258.
Ghosh, S., Chatterjee, R., Shanker, P., 2016. Estimation of ash, moisture content and detection of
coal lithofacies from well logs using regression and artificial neural network modelling. Fuel. 177,
279–287.
He, M., Gu, H.M., Wan, H., 2020. Log interpretation for lithology and fluid identification using
deep neural network combined with MAHAKIL in a tight sandstone reservoir. J. Pet. Sci. Eng. 194,
107498.
Horrocks, T., Holden, E.J., Wedge, D., 2015. Evaluation of automated lithology classification
architectures using highly–sampled wireline logs for coal exploration. Comput. Geosci. 83, 209–
218.
Li, P.X., Feng, X.T., Feng, G.L., Xiao, Y.X., Chen, B.R., 2019. Rockburst and microseismic
characteristics around lithological interfaces under different excavation directions in deep tunnels.
Engineering Geology. 260, 105209.
Li, Z.R., Kang, Y., Feng, D.Y., Wang, X.M., Lv, W.J., Chang, J., Zheng, W.X., 2020. Semi-supervised learning for lithology identification using Laplacian support vector machine. J. Pet. Sci. Eng. 195, 107510.
Lundberg, S.M., Lee, S.I., 2017. A unified approach to interpreting model predictions. In:
Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS
2017), Vol. 30. Curran Associates Inc.
Myles, A.J., Feudale, R.N., Liu, Y., Woody, N.A., Brown, S.D., 2004. An introduction to decision tree modeling. Journal of Chemometrics. 18(6), 275–285.
Pan, Y.H., Huang, Z.L., Guo, X.B., Li, T.J., Zhao, J., Li, Z.Y., Qu, T., Wang, B.R., Fan, T.G., Xu, X.F., 2022. Lithofacies types, reservoir characteristics, and hydrocarbon potential of the lacustrine organic-rich fine-grained rocks affected by tephra of the Permian Lucaogou Formation, Santanghu Basin, western China. J. Pet. Sci. Eng. 208, 109631.
Raeesi, M., Moradzadeh, A., Ardejani, F.D., Rahimi, M., 2012. Classification and identification of hydrocarbon reservoir lithofacies and their heterogeneity using seismic attributes, logs data and artificial neural networks. J. Pet. Sci. Eng. 82–83, 151–165.
Rokach, L., 2010. Ensemble–based classifiers. Artif. Intell. Rev. 33, 1–39.
Sagi, O., Rokach, L., 2018. Ensemble learning, a survey. Wires Data Min. Knowl. 8(4), e1249.
Saporetti, C.M., da Fonseca, L.G., Pereira, E., 2019. A lithology identification approach based on machine learning with evolutionary parameter tuning. IEEE Geosci. Remote. Sens. Lett. 16, 1819–1823.
Saporetti, C.M., da Fonseca, L.G., Pereira, E., de Oliveira, L.C., 2018. Machine learning approaches
for petrographic classification of carbonate-siliciclastic rocks using well logs and textural
information. J. Appl. Geophys. 155, 217–225.
Sebtosheikh, M.A., Salehi, A., 2015. Lithology prediction by support vector classifiers using
inverted seismic attributes data and petrophysical logs as a new approach and investigation of
training data set size effect on its performance in a heterogeneous carbonate reservoir. J. Pet. Sci.
Eng. 134, 143–149.
Song, Y.Y., Lu, Y., 2015. Decision tree methods, applications for classification and prediction.
Shanghai Arch. Psychiatry. 27(2), 130–135.
Sun, J., Chen, M.Q., Li, Q., Ren, L., Dou, M.Y., Zhang, J.X., 2021. A new method for predicting formation lithology while drilling at horizontal well bit. J. Pet. Sci. Eng. 196, 107955.
Sun, J., Li, Q., Chen, M.Q., Ren, L., Huang, G.H., Li, C.Y., Zhang, Z.X., 2019. Optimization of models for a rapid identification of lithology while drilling – a win-win strategy based on machine learning. J. Pet. Sci. Eng. 176, 321–341.
Tian, M., Omre, H.N., Xu, H.M., 2021. Inversion of well logs into lithology classes accounting for spatial dependencies by using hidden Markov models and recurrent neural networks. J. Pet. Sci. Eng. 196, 107598.
Xiao, Y.C., Wang, H.G., Xu, W.L., 2015. Parameter selection of Gaussian kernel for one-class SVM. IEEE T. Cybernetics. 45(5), 927–939.
Xie, Y.X., Zhu, C.Y., Zhou, W., Li, Z.D., Liu, X., Tu, M., 2018. Evaluation of machine learning methods for formation lithology identification, a comparison of tuning processes and model performances. J. Pet. Sci. Eng. 160, 182–193.
Xu, T., Chang, J., Feng, D.Y., Lv, W.J., Kang, Y., Liu, H.N., Li, J., Li, Z.R., 2021. Evaluation of active learning algorithms for formation lithology identification. J. Pet. Sci. Eng. 206, 108999.
Yang, H.J., Pan, H.P., Ma, H.L., Konaté, A.A., Yao, J., Guo, B., 2016. Performance of the synergetic wavelet transform and modified k-means clustering in lithology classification using nuclear log. J. Pet. Sci. Eng. 144, 1–9.
Zhang, G.Y., Wang, Z.Z., Chen, Y.K., 2018. Deep learning for seismic lithology prediction. Geophys. J. Int. 215(2), 1368–1387.
Zheng, W.H., Tian, F., Di, Q.Y., Xin, W., Cheng, F.Q., Shan, X.C., 2021. Electrofacies classification of deeply buried carbonate strata using machine learning methods, a case study on Ordovician paleokarst reservoirs in Tarim Basin. Mar. Pet. Geol. 123, 104720.
Zhou, K.B., Zhang, J.Y., Ren, Y.S., Huang, Z., Zhao, L.X., 2020. A gradient boosting decision tree
algorithm combining synthetic minority over–sampling technique for lithology identification.
Geophysics. 85(4), WA147-WA158.
Table 1 Parameter tuning for the SVM, RF, and XGBoost models

Model     Parameter          Search range   Optimal value
SVM       C                  0.0001-100     33
          Ξ³                  0.0001-1       0.01
RF        Estimators         10-500         300
          Max depth          3-20           10
XGBoost   Estimators         10-400         47
          Max depth          3-10           7
          Min child weight   1-6            4
          Gamma              0.1-1          0.5
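Search ranges and optima of the kind summarized in Table 1 can be explored with a simple exhaustive grid search. The sketch below is illustrative only, not the authors' code: `score` is a hypothetical stand-in for the cross-validated accuracy that would normally rank parameter combinations, constructed here so that its maximum coincides with the SVM optimum reported in Table 1.

```python
import itertools

# Hypothetical grid over the SVM search ranges from Table 1.
grid = {
    "C": [0.0001, 0.01, 1, 33, 100],         # search range 0.0001-100
    "gamma": [0.0001, 0.001, 0.01, 0.1, 1],  # search range 0.0001-1
}

def score(params):
    # Stand-in for cross-validated accuracy; by construction it peaks
    # at the optimal values reported in Table 1 (C=33, gamma=0.01).
    return -abs(params["C"] - 33) - abs(params["gamma"] - 0.01)

# Enumerate every combination and keep the best-scoring one.
candidates = [dict(zip(grid, values)) for values in itertools.product(*grid.values())]
best = max(candidates, key=score)
print(best)  # {'C': 33, 'gamma': 0.01}
```

In practice the same loop structure applies with `score` replaced by k-fold cross-validation of the fitted model.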
Table 2 The definition of the confusion matrix

                   Predicted Negative     Predicted Positive
Actual Negative    True Negative (TN)     False Positive (FP)
Actual Positive    False Negative (FN)    True Positive (TP)
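From these four counts the per-class metrics reported later (precision, recall, F1-score) follow directly. A minimal sketch of the standard formulas, with hypothetical counts not taken from the paper's dataset:

```python
def precision_recall_f1(tp, fp, fn):
    """Standard per-class metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)   # of predicted positives, fraction that are real
    recall = tp / (tp + fn)      # of actual positives, fraction that are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Illustrative counts (hypothetical):
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.8 0.8 0.8
```

For example, a class with precision 0.51 and recall 0.21 has F1 β‰ˆ 2 Β· 0.51 Β· 0.21 / (0.51 + 0.21) β‰ˆ 0.30, which is how the F1-score columns in Table 4 follow from the precision and recall columns.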
Table 3 Distribution of samples of different lithologies in the dataset

Lithology      MS     DM     S      DS     MD     Total
Number         595    799    252    244    462    2352
Proportion/%   25.30  33.97  10.71  10.37  19.64  100
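The imbalance visible in Table 3 (siltstone and dolomitic siltstone each near 10% of samples) is what SMOTE addresses. Its core step interpolates between a minority sample and one of its minority-class nearest neighbors. A minimal sketch of that step, illustrative rather than the authors' implementation, with hypothetical two-feature vectors:

```python
import random

def smote_sample(x, neighbor, rng=None):
    """One synthetic minority sample: x + lam * (neighbor - x), lam ~ U[0, 1).

    `x` and `neighbor` are feature vectors (e.g. log readings) drawn from
    the same minority lithology class; the synthetic point lies on the
    segment between them.
    """
    rng = rng or random.Random()
    lam = rng.random()
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]

# Example with two hypothetical minority-class feature vectors:
synthetic = smote_sample([60.0, 220.0], [80.0, 240.0], rng=random.Random(0))
```

Repeating this step until the minority classes reach a chosen oversampling ratio yields the balanced training sets evaluated in Fig. 5.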
Table 4 Comparison of predicted results from different lithology identification models

Model               Metric     MS    DM    S     DS    MD
The stacking model  Precision  0.91  0.77  0.50  0.51  0.81
                    Recall     0.93  0.91  0.66  0.21  0.60
                    F1-score   0.92  0.83  0.57  0.30  0.69
SVM                 Precision  0.88  0.76  0.51  0.48  0.86
                    Recall     0.95  0.91  0.62  0.16  0.58
                    F1-score   0.91  0.83  0.57  0.24  0.69
RF                  Precision  0.91  0.77  0.55  0.39  0.67
                    Recall     0.93  0.81  0.67  0.23  0.69
                    F1-score   0.92  0.79  0.61  0.29  0.68
XGBoost             Precision  0.88  0.71  0.52  0.75  0.77
                    Recall     0.96  0.92  0.49  0.03  0.52
                    F1-score   0.92  0.80  0.51  0.06  0.62
Figure 1. Working process of the stacking model based on XGBoost, RF, and LR.
Figure 2. Correlation matrix between GR, SP, RMSL, LLS, LLD, AC, CNL and DEN in the dataset.
Figure 3. Comparison of confusion matrices for the stacking model, SVM, RF, and XGBoost.
Figure 4. The importance of logging parameters in lithology identification.
Figure 5. Effect of different oversampling ratios of siltstone and dolomitic siltstone on the performance of different lithology
identification models.
Figure 6. Comparison of predicted results from the stacking model, SVM, RF, and XGBoost models for well X1.
Fig. 1
Fig. 2
Fig. 3
[Panels (a)–(d): confusion matrices for the four models.]
Fig. 4
Fig. 5
[Panels (a) Stacking model, (b) SVM, (c) RF, (d) XGBoost: F1-score (0–1) versus oversampling ratio (0–2); legend: Mudstone, Dolomitic mudstone, Siltstone, Dolomitic siltstone, Micritic dolomite.]
Fig. 6
1. We build a stacking model based on random forest (RF), extreme gradient boosting (XGBoost) and linear regression (LR), and use the synthetic minority oversampling technique (SMOTE) to mitigate the imbalanced-lithology problem.
2. The identification performance of the stacking model is better than that of the SVM, RF and XGBoost models.
3. The SMOTE algorithm can effectively improve the identification performance on the minority lithologies.
4. The stacking model combined with SMOTE proposed in this paper achieves high lithology identification performance, which makes it practical for lithology identification.
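The stacking idea summarized in highlight 1 can be sketched in a few lines. This is a toy illustration under assumed components, not the authors' model: two stand-in base learners emit class probabilities, and a fixed weighted combination plays the role of the level-1 (meta) learner that would normally be fit on out-of-fold base-model predictions.

```python
def base_a(x):
    # Stand-in for a fitted RF: probability that x belongs to class 1.
    return 1.0 if x[0] > 0.5 else 0.0

def base_b(x):
    # Stand-in for a fitted XGBoost model.
    return 1.0 if x[1] > 0.5 else 0.0

def meta(p_a, p_b, w_a=0.6, w_b=0.4):
    # Stand-in for the meta-learner: a fixed linear combination of the
    # base probabilities with a 0.5 decision threshold (weights arbitrary).
    return 1 if w_a * p_a + w_b * p_b >= 0.5 else 0

def stacking_predict(x):
    return meta(base_a(x), base_b(x))

print(stacking_predict((0.9, 0.9)))  # 1: both base learners agree
```

In the actual pipeline the meta-learner's weights come from fitting on base-model predictions, which is what lets stacking outperform any single base model.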
Declaration of Interest Statement
β˜’ The authors declare that they have no known competing financial interests or personal
relationships that could have appeared to influence the work reported in this paper.
☐The authors declare the following financial interests/personal relationships which may be
considered as potential competing interests: