Journal Pre-proof

Shale lithology identification using stacking model combined with SMOTE from well logs

Jinlu Yang, Min Wang, Ming Li, Yu Yan, Xin Wang, Haoming Shao, Changqi Yu, Yan Wu, Dianshi Xiao

PII: S2666-5190(22)00011-5
DOI: https://doi.org/10.1016/j.uncres.2022.09.001
Reference: UNCRES 15
To appear in: Unconventional Resources
Received Date: 9 July 2022
Revised Date: 4 September 2022
Accepted Date: 5 September 2022

Please cite this article as: J. Yang, M. Wang, M. Li, Y. Yan, X. Wang, H. Shao, C. Yu, Y. Wu, D. Xiao, Shale lithology identification using stacking model combined with SMOTE from well logs, Unconventional Resources (2022), doi: https://doi.org/10.1016/j.uncres.2022.09.001.

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

© 2022 Published by Elsevier B.V. on behalf of KeAi Communications Co., Ltd.

Shale lithology identification using stacking model combined with SMOTE from well logs

Jinlu Yang1,2, Min Wang1,2*, Ming Li1,2, Yu Yan1,2, Xin Wang1,2, Haoming Shao1,2, Changqi Yu1,2, Yan Wu1,2, Dianshi Xiao1,2
1. Laboratory of Deep Oil and Gas, China University of Petroleum (East China), Qingdao 266580, China.
2. School of Geosciences, China University of Petroleum (East China), Qingdao 266580, China.

Abstract: Shale lithology identification is the basis of geological research and reservoir characterization, and is an essential task for oil exploration.
Recently, several machine learning algorithms have been applied to improve the accuracy of lithology identification. However, the stacking model has seldom been used for lithology identification in existing studies, and little consideration has been given to the problem of imbalanced lithologies. In this study, we build a stacking model based on random forest (RF), extreme gradient boosting (XGBoost) and linear regression (LR), use the synthetic minority oversampling technique (SMOTE) to alleviate the imbalanced-lithology problem, and then compare the stacking model with support vector machine (SVM), RF and XGBoost models after adjusting model parameters using grid search and fivefold cross-validation. The authors prepared a dataset consisting of logging data and core data from 13 wells in a depression of the Junggar Basin, China, including a total of 2352 sample points marked with lithologic labels. The lithologies identified in this study are mudstone (MS), dolomitic mudstone (DM), siltstone (S), dolomitic siltstone (DS) and micritic dolomite (MD). The results show that (1) the overall identification performance of the stacking model is better than that of the SVM, RF and XGBoost models; (2) the SMOTE algorithm can effectively improve the identification performance for the minority lithologies; and (3) the density log is the most important factor in identifying lithologies. The stacking model combined with SMOTE proposed in this paper achieves high lithology identification performance, which renders it practicable for lithology identification.

Keywords: Shale; Lithology identification; Minority lithologies; Stacking

1. Introduction

A clear and accurate understanding of the lithology of the formation rocks is of great significance to reservoir evaluation, reservoir characterization and oil exploration (Horrocks et al., 2015; Li et al., 2019; Saporetti et al., 2019). Among the traditional lithology analysis methods, the most accurate one is core analysis.
However, due to the small number of cores and the high cost of acquiring a core sample, it is impossible to core the whole section of a well. Therefore, indirect lithology identification using information-rich logs has become an important tool for studying reservoirs (Ghosh et al., 2016). However, reservoir heterogeneity makes it hard to describe the reservoir using linear logging response equations and empirical statistical formulations (Corina and Hovda, 2018; Zheng et al., 2021). In recent years, machine learning has been widely used by researchers to reduce logging interpretation costs and improve analysis accuracy (Raeesi et al., 2012; Dong et al., 2016; Chevitarese et al., 2018; Zhang et al., 2018; Chen et al., 2020; Li et al., 2020; Xu et al., 2021; Tian et al., 2021; Sun et al., 2021). Machine learning has strong nonlinear mapping capability, high fault tolerance, and excellent application prospects. K-means was applied to identify five types of metamorphic lithologies, and the results are relatively consistent with the actual stratigraphy (Yang et al., 2016). Although K-means can achieve good identification results, it often ends up with a local optimum instead of a global optimum. Moreover, K-means is an unsupervised learning algorithm, which wastes data by not using sample labels during training. Therefore, researchers prefer supervised learning, which can effectively utilize logging data and core data. SVM was applied to the identification of shale lithology, and five types of lithofacies were successfully classified (Bhattacharya et al., 2016). SVM has been proved suitable for cases with a small training set (Sebtosheikh and Salehi, 2015). The results of applying RF to logging while drilling suggest that it is faster and more accurate than SVM (Sun et al., 2019). A comparison among artificial neural network (ANN), SVM, RF and XGBoost showed that XGBoost and RF outperform SVM and ANN (Xie et al., 2018).
A study further showed that XGBoost outperforms RF among the boosting family of algorithms (Dev and Eden, 2019). Although the methods listed above have been effectively used in lithology identification, the stacking method, which is used in high-level data mining competitions such as Kaggle to further improve the data mining ability of single machine learning algorithms, has seldom been used in existing studies, and the imbalanced-lithology problem has received little consideration. The imbalanced-lithology problem biases the identification performance of a model in favor of the majority lithologies (Deng et al., 2017; Saporetti et al., 2018; He et al., 2020; Zhou et al., 2020). Therefore, in this study, we build a stacking model based on RF, XGBoost and LR, and compare the stacking model with SVM, RF and XGBoost. After improving the identification performance for the minority lithologies using the SMOTE algorithm, the effect of different oversampling ratios on the identification performance of the models is analyzed. The purpose of this study is threefold. First, the stacking method is introduced to further improve the performance of lithology identification by machine learning models. Second, the effect of different oversampling ratios on the identification performance of different models is analyzed. Third, we analyze the importance of logging parameters in lithology identification using Shapley additive explanations (SHAP). The rest of this study is organized as follows: Section 2 introduces the identification models used in the research. Section 3 performs exploratory data analysis on the collected logging and core data and presents a comparative study of the performance of the stacking, SVM, RF, and XGBoost models; the importance of logging parameters and the effect of minority-class oversampling are also considered, and the lithology prediction of a single well is performed. The paper is concluded in Section 4.

2.
Method

Machine learning algorithms use artificially provided example data to generate predictive models and make predictions on unknown data. Supervised learning is a branch of machine learning that generates predictive models based on labeled training data. The training data comprise a set of feature vectors and a desired output value. The supervised learning algorithm analyzes the training data, trains a model, and makes predictions for unknown data. This study evaluates the effectiveness of the identification model by comparing and analyzing the identification performances of the stacking, SVM, RF, and XGBoost models.

2.1 SVM

SVM is a powerful machine learning model that can perform linear or non-linear classification, regression, and even outlier detection. SVM was initially applied to binary classification tasks. When the dataset is simple and linearly separable, a maximum-margin hyperplane is constructed from the training samples, which separates the samples as much as possible, maximizing the difference between the two classes of data samples. When the dataset is linearly inseparable, SVM first implicitly projects the training data into a high-dimensional feature space using its kernel function and then classifies the training data linearly in that space, finding a hyperplane that realizes the classification task (Xiao et al., 2015). Common kernel functions include the linear and radial basis functions. SVM is simple to train, and its identification ability is mainly limited by the penalty factor C and the kernel function parameter γ (Cherkassky and Ma, 2004). The penalty factor C allows SVM to make misclassifications during the training process, so as to ensure that the decision boundary is smoother. The kernel function parameter γ mainly defines the influence of a single sample on the whole identification hyperplane; the effect of a single sample is positively correlated with γ.
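To illustrate the role of γ, a minimal sketch of the radial basis function (RBF) kernel follows; the sample values are hypothetical normalized log readings, not data from this study:

```python
import math

def rbf_kernel(x, z, gamma):
    """RBF kernel value exp(-gamma * ||x - z||^2) for two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

# Two hypothetical normalized log samples.
a = [0.2, 0.5, 0.8]
b = [0.3, 0.4, 0.9]

# Small gamma: a sample still influences distant points on the hyperplane.
print(rbf_kernel(a, b, gamma=0.1))
# Large gamma: the influence of a sample decays quickly with distance.
print(rbf_kernel(a, b, gamma=500.0))
```

A larger γ confines each sample's influence to its immediate neighborhood, which tightens the decision boundary around the training data and can lead to overfitting.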
2.2 Ensemble Learning

In supervised learning, the training goal is to obtain a stable classification or regression model with good performance, but the actual situation is rarely ideal, and sometimes only multiple moderately good models can be obtained. Ensemble learning combines multiple weak supervised models in order to obtain a better and more comprehensive strong model (Dong et al., 2020; Sagi and Rokach, 2018). It is a meta-algorithm that combines several weak models into a strong prediction model.

2.2.1 RF

RF integrates multiple decision trees through ensemble learning and can be used for classification or regression. When used for classification, a decision tree is constructed by randomly selecting samples and sample features from the dataset; this is repeated several times, with the resulting decision trees being uncorrelated, after which the results of all decision trees are aggregated as the final result (Biau, 2012; Biau and Scornet, 2016). When predicting new samples, the identification results of all decision trees in the forest are counted, and the majority category is selected as the prediction result. RF has strong robustness in cases of missing data and unbalanced data, and is suitable for high-dimensional data with thousands of sample features. The introduction of randomness makes the method less prone to overfitting, with low variance and high generalization ability. The decision tree contained in RF is a classification model in the form of a tree structure (Myles et al., 2004; Song and Lu, 2015). The decision tree is constructed using a top-down recursive partitioning method that splits the dataset into smaller subsets; this process is repeated until the splitting stops. The nodes that can no longer be partitioned are called leaf nodes, whereas the remaining nodes are called internal nodes. Each internal node corresponds to a test on a feature value, and each leaf node corresponds to a sample category.
When a new sample serves as input, the decision tree starts from the root node, evaluates the corresponding feature attributes, and selects the next node based on the result until it reaches a leaf node. The result of the decision is the category stored in the leaf node. The performance of RF is mainly limited by the number and maximum depth of the decision trees. If the number of decision trees is insufficient, the model is underfitted; as the number of decision trees increases, the classification performance of the model tends to level off. Furthermore, the maximum depth of the decision trees is adjusted to improve the fitting ability of each decision tree.

2.2.2 XGBoost

XGBoost is a boosting method based on the gradient boosting decision tree (GBDT) algorithm with many algorithmic and engineering improvements, and it is more efficient, flexible and portable than GBDT (Chen and Guestrin, 2016). It has been widely used in many machine learning competitions and has achieved good results. The core idea of XGBoost is to grow a tree by repeatedly splitting features. Each time a tree is added, a new function f(x) is learned to fit the residuals of the last prediction. When the training is complete and a new sample is to be predicted, the corresponding leaf node is computed in each tree based on the sample characteristics, and the scores of these leaf nodes are summed as the predicted value of the sample. XGBoost enumerates all possible splits to optimize the splitting threshold. When using classification and regression trees (CART) as the base classifier, XGBoost explicitly adds a regular term to control the complexity of the model, which helps prevent overfitting and improves the generalization ability of the model. While traditional GBDT only uses the first-order derivatives of the cost function in model training, XGBoost performs a second-order Taylor expansion of the cost function, which enables a fast and accurate gradient descent.
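The second-order expansion described above yields the objective that XGBoost minimizes at iteration t; a standard rendering (after Chen and Guestrin, 2016) is:

```latex
\mathrm{Obj}^{(t)} \simeq \sum_{i=1}^{n}\left[ g_i f_t(x_i) + \tfrac{1}{2}\, h_i f_t^{2}(x_i) \right] + \Omega(f_t),
\qquad
g_i = \partial_{\hat{y}_i^{(t-1)}}\, l\!\left(y_i, \hat{y}_i^{(t-1)}\right),
\quad
h_i = \partial^{2}_{\hat{y}_i^{(t-1)}}\, l\!\left(y_i, \hat{y}_i^{(t-1)}\right)
```

where $g_i$ and $h_i$ are the first- and second-order derivatives of the cost function $l$, and the regular term $\Omega(f_t) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^{2}$ penalizes the number of leaves $T$ and the leaf scores $w$ of the newly added tree $f_t$.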
Additionally, traditional GBDT does not deal with missing values; however, XGBoost can automatically learn a strategy for dealing with missing values. The performance of XGBoost is mainly limited by the maximum depth of the decision trees, the minimum leaf-node sample weight, the feature sampling method, and the sample sampling method. The larger the maximum depth, the more localized samples the model learns and the more easily it is overfitted. The larger the minimum leaf-node sample weight, the more likely the model is to be underfitted. The feature and sample sampling methods affect the proportion of data and features extracted when a new tree is built.

2.2.3 Stacking model

The stacking model is a multimodel hierarchical integration framework; it does not comprise a single machine learning algorithm but rather a combination of multiple machine learning algorithms, and it is therefore a heterogeneous model (Rokach, 2010). The algorithms that make up the stacking model are referred to as base learners. By integrating different base learners, stacking can form better models than the base learners themselves. Unlike the training process of a single algorithm, the training process of the stacking model is relatively complex. In this study, we propose a stacking model based on RF, XGBoost and LR (Fig. 1). The first layer comprises RF and XGBoost, which are trained separately. We concatenate the new training set and test set generated by these two models, as shown in Fig. 1, as the input of the second-layer model. The second-layer model is LR, and the new training set and test set obtained from the first-layer models are used to train the model to obtain the final prediction result. Linear regression is an analytical method that uses linear methods to model the relationship between one or more independent variables and dependent variables.
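The two-layer data flow described above can be sketched with scikit-learn's StackingClassifier. This is an illustrative sketch, not the authors' implementation: the data are synthetic, GradientBoostingClassifier stands in for XGBoost, and logistic regression plays the role of the linear second-layer model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for 8 normalized log curves and 5 lithology classes.
X, y = make_classification(n_samples=400, n_features=8, n_informative=5,
                           n_classes=5, n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
                ("gb", GradientBoostingClassifier(n_estimators=50,
                                                  random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,  # out-of-fold first-layer predictions form the second-layer inputs
)
stack.fit(X_train, y_train)
accuracy = stack.score(X_test, y_test)
print(accuracy)
```

With cv=5, each base learner's predictions fed to the second layer are produced on held-out folds, which limits the risk of the second layer simply reproducing the first layer's fit to the training set.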
Generally speaking, in the stacking model, the first-layer models have a strong ability to extract nonlinear relationships and can easily enter a state of overfitting. To reduce the risk of overfitting, the second layer tends to use simple models such as LR. After completing the training of the stacking model, we evaluate the model with the test set, analyze the lithology identification effect, and compare the evaluation results with those of the SVM, RF and XGBoost models.

3. Application case

3.1 Data preparation and preprocessing

The logging data and core data used in this study were obtained from 13 wells in the Junggar Basin, China, with a total of 2352 sample points. According to the core data, the target lithologies identified in this study are mudstone (MS), dolomitic mudstone (DM), siltstone (S), dolomitic siltstone (DS) and micritic dolomite (MD). Some scholars have introduced and analyzed the above five lithologies in detail (Pan et al., 2022). The following eight log parameters are collected as sample attribute values: spontaneous potential log (SP), natural gamma log (GR), microspherically focused log (RMSL), shallow investigation lateral log (LLS), deep investigation lateral log (LLD), acoustic travel time log (AC), compensated neutron log (CNL), and density log (DEN). Each sample is a nine-dimensional vector: eight dimensions for the different log parameters and one dimension for the lithology label. To eliminate the effect of the different logging curve magnitudes on the identification performance of the model, it is necessary to normalize the data. In this study, the values of each logging curve in the original dataset are mapped to [0, 1] using linear function normalization, defined as follows:

x* = (x − x_min) / (x_max − x_min)   (1)

Here x* represents the normalized value, x_min represents the minimum value of a logging curve, and x_max represents the maximum value of a logging curve.
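Eq. (1) can be sketched in a few lines; the DEN values below are hypothetical, for illustration only:

```python
def min_max_normalize(values):
    """Map a logging curve onto [0, 1] via (x - x_min) / (x_max - x_min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical density (DEN) readings in g/cm3.
den = [2.30, 2.45, 2.60, 2.75]
print(min_max_normalize(den))  # the endpoints map to 0.0 and 1.0
```

Note that min-max normalization is applied per curve, since each log has its own magnitude and units.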
After normalization, the logging data are randomly divided into training and testing sets in a ratio of 8:2. The training set is used to train and tune the lithology identification models, and the testing set is used to evaluate them.

3.2 Model training

Appropriate parameters are selected on the basis of the training set for the SVM, RF, and XGBoost models by grid search and fivefold cross-validation. For the SVM, RF, and XGBoost models, Table 1 presents the parameters to be adjusted and their search ranges. Fivefold cross-validation randomly divides the training dataset into five equal subsamples, wherein one subsample is used as the dataset for testing the model, while the other four subsamples are used for training. This process is repeated five times, so that every subsample is used once as the test data. The average of the five test results is used as the final estimate of the model performance. Because the numbers of samples of the different lithologies in the training set are not equal, the SMOTE algorithm is used to oversample the minority lithologies, and the effect of different oversampling ratios on the identification performance of the models is considered. Table 3 shows that the proportions of siltstone and dolomitic siltstone are low and close to each other; therefore, these two classes are selected for oversampling. The evaluation metrics in this study are precision, recall and F1-score. For a testing dataset of two categories, TN, FN, FP and TP are defined according to the predicted results as shown in Table 2, with P positive samples and N negative samples (Table 2). The evaluation metrics are then defined as follows:

Precision = TP / (TP + FP)   (2)

Recall = TP / (TP + FN)   (3)

F1-score = 2 × Precision × Recall / (Precision + Recall)   (4)

3.3 Results and discussion

The main work done in this section is as follows: 1. we perform exploratory data analysis on the core and logging data. 2.
After determining the model parameters, the lithology identification results of each model are compared. 3. The importance analysis of logging parameters is implemented. 4. The effectiveness of the SMOTE algorithm is analyzed. 5. We predict the lithologies of a single well using the stacking, SVM, RF and XGBoost models.

3.3.1 Exploratory data analysis

Exploratory data analysis is performed on the data using a statistical table and a correlation matrix plot. The statistical table contains the percentage of each lithology; the correlation matrix plot represents the correlation between the logging parameters, which can be shown by scatter points and scatter density contour plots.

The proportions of the different lithologies are not equally distributed in the dataset (Table 3). The proportions of mudstone and dolomitic mudstone are both above 20%, while the proportions of siltstone and dolomitic siltstone are only about 10%. Fig. 2 shows the correlation matrix of the different logging parameters, with the eight logging parameters on both the horizontal and vertical axes and different colors representing different lithologies. The scatter plots of the logging parameters are shown in the upper right area of Fig. 2. Scatter plots usually reveal correlations when the data volume is small, but when there is a large amount of data, the scatter points overlap each other, resulting in reduced discrimination ability. The lower left area of Fig. 2 shows the density contour plots, where darker colors represent denser scatter distributions. The density contour plot can show the data distribution in overlapping areas, which helps interpretation when the data volume is large. The middle diagonal shows the kernel density estimation plots. The upper right area of Fig.
2 shows that most of the correlations among the logging parameters are not obvious; only a few logging parameters have obvious correlations, such as the positive correlations among LLS, LLD and RMSL, the positive correlation between AC and CNL, and the negative correlation between AC and DEN. The lower left area of Fig. 2 shows that the distributions of the logging parameters among the different lithologies overlap with no obvious dividing line, which indicates the difficulty of model identification.

3.3.2 Analysis of model prediction results

Precision, recall and F1-score are used to evaluate the identification performance. Table 4 shows the identification performance of the different models. The results show that the stacking model achieved the best identification results, followed by XGBoost and RF, while the identification performance of SVM is the worst. Ensemble models are more suitable than SVM for the lithology identification problem, and better identification performance can be obtained by using the stacking method.

Each model has different identification results for the different lithologies. All models show high identification accuracy for mudstone and dolomitic mudstone. The F1-score of SVM for identifying dolomitic siltstone is only 0.06, which is the worst identification result among these models. The stacking model performs best overall, slightly outperforming XGBoost in identifying dolomitic mudstone, dolomitic siltstone, and micritic dolomite. The ability of RF to identify siltstone and dolomitic siltstone is weaker than that of XGBoost. The above results show that the ensemble models have a great advantage in the face of data with unequally distributed lithologies. Fig. 3 shows the confusion matrices based on the comparison of predicted and actual lithologies, which present the distribution of the lithologies identified by the models for each lithology.
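The per-lithology precision, recall and F1-score discussed here can be computed directly from a confusion matrix; a minimal sketch with a hypothetical three-class matrix (not the matrices of Fig. 3):

```python
def per_class_metrics(cm, k):
    """Precision, recall and F1-score for class k, given a confusion matrix
    cm where cm[i][j] counts samples of true class i predicted as class j."""
    n = len(cm)
    tp = cm[k][k]
    fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, actually not k
    fn = sum(cm[k][j] for j in range(n)) - tp  # actually k, predicted not k
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical 3-class confusion matrix (rows: actual, columns: predicted).
cm = [[50, 5, 5],
      [10, 30, 10],
      [0, 5, 35]]
print(per_class_metrics(cm, 0))
```

In the multi-class setting, each lithology is treated in turn as the positive class and all others as negative, which is how per-lithology scores such as those in Table 4 are obtained.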
Taking the confusion matrix of the stacking model as an example, the main misclassifications are as follows: 1. 13% and 14% of siltstone samples are misclassified as mudstone and dolomitic siltstone, respectively. 2. More than half of the dolomitic siltstone samples are misclassified as dolomitic mudstone. 3. 39% of micritic dolomite samples are misclassified as dolomitic mudstone. These misclassifications may be caused by the large number of overlapping logging features between the misclassified lithologies, which makes identification difficult.

3.3.3 Importance analysis of logging parameters

After the training of each lithology identification model is completed, the importance of each logging parameter needs to be calculated in order to determine the influence of the different logging parameters on lithology identification. The SHAP algorithm, proposed by Lundberg and Lee (2017), can be used to explain the importance of different parameters to model predictions in classification and regression models. The SHAP algorithm uses marginal contributions to calculate parameter importance; compared with the feature importance method built into the RF model, it does not overestimate parameters with more categories, and it can provide richer information, such as the importance of different parameters for each lithology. Fig. 4 shows the importance of the different logging parameters. The horizontal axis of Fig. 4 shows the average absolute SHAP value, which can be divided into SHAP values for the different lithologies, with different colors representing the importance for different lithologies; the vertical axis shows each logging parameter in order of importance, with the importance of the logging parameters increasing from the bottom to the top. Fig.
4 shows that DEN is the most important factor in identifying lithologies; the importance of AC, CNL, SP, RMSL, LLD and LLS to lithology identification decreases in that order, and GR has the least influence on lithology identification. The importance of the logging parameters varies for different lithologies; for example, for identifying dolomitic mudstone, DEN is the most influential factor and GR is the least influential factor.

3.3.4 Oversampling ratios

The SMOTE algorithm is used to oversample the minority lithologies in the expectation of improving their identification accuracy. With different oversampling ratios, the identification models give different results for the different minority lithologies. Fig. 5 shows the effect of different oversampling ratios of siltstone and dolomitic siltstone on the different lithology identification models. The results show that for siltstone, the F1-score is almost unchanged as the oversampling ratio increases, which may be because the oversampling does not effectively widen the feature boundaries of siltstone, so the model cannot learn new feature boundaries and the identification effect cannot be improved. For dolomitic siltstone, the model identification effect improves as the oversampling ratio increases. SVM shows the most obvious improvement, with its F1-score increasing from 0.06 to 0.31. For RF and XGBoost, the F1-score of the former increased from 0.24 to 0.31, and the F1-score of the latter increased from 0.29 to 0.35. For the stacking model, the F1-score increased from 0.30 to 0.33. The results also show that the identification of the majority lithologies is almost unaffected by oversampling the minority lithologies.

3.3.5 Single well lithology prediction

Fig.
6 shows that the SVM, RF, XGBoost and stacking models can all identify shale lithologies, but the stacking model has the highest accuracy, indicating that the stacking model is highly usable in shale lithology identification: it can not only obtain complete lithology distribution characteristics but also improve the efficiency of lithology identification.

4. Conclusions

In this study, exploratory data analysis, modeling, oversampling and logging parameter interpretation are performed using machine learning algorithms in the Junggar Basin, China, with the following conclusions:

(1) Exploratory data analysis, by means of a statistical table and a correlation matrix plot, can reveal the characteristics of the lithology distribution and the correlation between logging parameters.

(2) Machine learning algorithms can uncover the correlation between logging parameters and lithologies, and the stacking model achieves the best identification performance.

(3) The SMOTE algorithm can improve the identification performance for minority lithologies and has no effect on the majority lithologies.

(4) The SHAP algorithm is applied to quantify the importance of the different logging parameters in lithology identification, and the results show that the logging parameters in descending order of importance are DEN, AC, CNL, SP, RMSL, LLD, LLS, and GR.

Credit author statement

Jinlu Yang: Writing-original draft, Writing-review & editing. Min Wang: Research idea design. Ming Li: Experiment operation and data processing. Yu Yan: Experiment operation and data processing. Xin Wang: Experiment operation and data processing. Haoming Shao: Experiment operation and data processing. Changqi Yu: Experiment operation and data processing. Yan Wu: Experiment operation and data processing. Dianshi Xiao: Experiment operation and data processing.
Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

Financial support from the National Natural Science Foundation of China (No. 41922015, No. 42072147) and the Qingdao Postdoctoral Program (ZX20210070) is acknowledged.

References

Bhattacharya, S., Carr, T.R., Pal, M., 2016. Comparison of supervised and unsupervised approaches for mudstone lithofacies classification: case studies from the Bakken and Mahantango–Marcellus Shale, USA. J. Nat. Gas. Sci. Eng. 33, 1119–1133.
Biau, G., 2012. Analysis of a random forests model. J. Mach. Learn. Res. 13(1), 1063–1095.
Biau, G., Scornet, E., 2016. A random forest guided tour. TEST. 25, 197–227.
Chen, G., Chen, M., Hong, G.B., Lu, Y.H., Zhou, B., Gao, Y.F., 2020. A new method of lithology classification based on convolutional neural network algorithm by utilizing drilling string vibration data. Energies. 13(4), 888.
Chen, T.Q., Guestrin, C., 2016. XGBoost, a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco. ACM, pp. 785–794.
Cherkassky, V., Ma, Y.Q., 2004. Practical selection of SVM parameters and noise estimation for SVM regression. Neural Networks. 17(1), 113–126.
Chevitarese, D.S., Szwarcman, D., e Silva, R.G., Brazil, E.V., 2018. Deep learning applied to seismic facies classification, a methodology for training. In: Conference Proceedings, Saint Petersburg 2018, Vol. 2018. EAGE, pp. 1–5.
Corina, A.N., Hovda, S., 2018. Automatic lithology prediction from well logging using kernel density estimation. J. Pet. Sci. Eng. 170, 664–674.
Deng, C.X., Pan, H.P., Fang, S.N., Konaté, A.A., Qin, R.D., 2017. Support vector machine as an alternative method for lithology classification of crystalline rocks. J. Geophys. Eng. 14(2), 341–349.
Dev, V.A., Eden, M.R., 2019. Formation lithology classification using scalable gradient boosted decision trees. Comput. Chem. Eng. 128, 392–404.
Dong, S.Q., Wang, Z.Z., Zeng, L.B., 2016. Lithology identification using kernel Fisher discriminant analysis with well logs. J. Pet. Sci. Eng. 143, 95–102.
Dong, X.B., Yu, Z.W., Cao, W.M., Shi, Y.F., Ma, Q.L., 2020. A survey on ensemble learning. Front. Comput. Sci. 14(2), 241–258.
Ghosh, S., Chatterjee, R., Shanker, P., 2016. Estimation of ash, moisture content and detection of coal lithofacies from well logs using regression and artificial neural network modelling. Fuel. 177, 279–287.
He, M., Gu, H.M., Wan, H., 2020. Log interpretation for lithology and fluid identification using deep neural network combined with MAHAKIL in a tight sandstone reservoir. J. Pet. Sci. Eng. 194, 107498.
Horrocks, T., Holden, E.J., Wedge, D., 2015. Evaluation of automated lithology classification architectures using highly-sampled wireline logs for coal exploration. Comput. Geosci. 83, 209–218.
Li, P.X., Feng, X.T., Feng, G.L., Xiao, Y.X., Chen, B.R., 2019. Rockburst and microseismic characteristics around lithological interfaces under different excavation directions in deep tunnels. Engineering Geology. 260, 105209.
Li, Z.R., Kang, Y., Feng, D.Y., Wang, X.M., Lv, W.J., Chang, J., Zheng, W.X., 2020. Semi-supervised learning for lithology identification using Laplacian support vector machine. J. Pet. Sci. Eng. 195, 107510.
Lundberg, S.M., Lee, S.I., 2017. A unified approach to interpreting model predictions. In: Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS 2017), Vol. 30. Curran Associates Inc.
Myles, A.J., Feudale, R.N., Liu, Y., Woody, N.A., Brown, S.D., 2004. An introduction to decision tree modeling. Journal of Chemometrics. 18(6), 275–285.
Pan, Y.H., Huang, Z.L., Guo, X.B., Li, T.J., Zhao, J., Li, Z.Y., Qu, T., Wang, B.R., Fan, T.G., Xu, X.F., 2022.
Lithofacies types, reservoir characteristics, and hydrocarbon potential of the lacustrine organic-rich fine-grained rocks affected by tephra of the Permian Lucaogou Formation, Santanghu Basin, western China. J. Pet. Sci. Eng. 208, 109631.
Raeesi, M., Moradzadeh, A., Ardejani, F.D., Rahimi, M., 2012. Classification and identification of hydrocarbon reservoir lithofacies and their heterogeneity using seismic attributes, logs data and artificial neural networks. J. Pet. Sci. Eng. 82–83, 151–165.
Rokach, L., 2010. Ensemble-based classifiers. Artif. Intell. Rev. 33, 1–39.
Sagi, O., Rokach, L., 2018. Ensemble learning: a survey. Wires Data Min. Knowl. 8(4), e1249.
Saporetti, C.M., da Fonseca, L.G., Pereira, E., 2019. A lithology identification approach based on machine learning with evolutionary parameter tuning. IEEE Geosci. Remote Sens. Lett. 16, 1819–1823.
Saporetti, C.M., da Fonseca, L.G., Pereira, E., de Oliveira, L.C., 2018. Machine learning approaches for petrographic classification of carbonate-siliciclastic rocks using well logs and textural information. J. Appl. Geophys. 155, 217–225.
Sebtosheikh, M.A., Salehi, A., 2015. Lithology prediction by support vector classifiers using inverted seismic attributes data and petrophysical logs as a new approach and investigation of training data set size effect on its performance in a heterogeneous carbonate reservoir. J. Pet. Sci. Eng. 134, 143–149.
Song, Y.Y., Lu, Y., 2015. Decision tree methods: applications for classification and prediction. Shanghai Arch. Psychiatry 27(2), 130–135.
Sun, J., Chen, M.Q., Li, Q., Ren, L., Dou, M.Y., Zhang, J.X., 2021. A new method for predicting formation lithology while drilling at horizontal well bit. J. Pet. Sci. Eng. 196, 107955.
Sun, J., Li, Q., Chen, M.Q., Ren, L., Huang, G.H., Li, C.Y., Zhang, Z.X., 2019. Optimization of models for a rapid identification of lithology while drilling – a win-win strategy based on machine learning. J. Pet. Sci. Eng. 176, 321–341.
Tian, M., Omre, H.N., Xu, H.M., 2021. Inversion of well logs into lithology classes accounting for spatial dependencies by using hidden Markov models and recurrent neural networks. J. Pet. Sci. Eng. 196, 107598.
Xiao, Y.C., Wang, H.G., Xu, W.L., 2015. Parameter selection of Gaussian kernel for one-class SVM. IEEE T. Cybernetics 45(5), 927–939.
Xie, Y.X., Zhu, C.Y., Zhou, W., Li, Z.D., Liu, X., Tu, M., 2018. Evaluation of machine learning methods for formation lithology identification: a comparison of tuning processes and model performances. J. Pet. Sci. Eng. 160, 182–193.
Xu, T., Chang, J., Feng, D.Y., Lv, W.J., Kang, Y., Liu, H.N., Li, J., Li, Z.R., 2021. Evaluation of active learning algorithms for formation lithology identification. J. Pet. Sci. Eng. 206, 108999.
Yang, H.J., Pan, H.P., Ma, H.L., Konaté, A.A., Yao, J., Guo, B., 2016. Performance of the synergetic wavelet transform and modified k-means clustering in lithology classification using nuclear log. J. Pet. Sci. Eng. 144, 1–9.
Zhang, G.Y., Wang, Z.Z., Chen, Y.K., 2018. Deep learning for seismic lithology prediction. Geophys. J. Int. 215(2), 1368–1387.
Zheng, W.H., Tian, F., Di, Q.Y., Xin, W., Cheng, F.Q., Shan, X.C., 2021. Electrofacies classification of deeply buried carbonate strata using machine learning methods: a case study on Ordovician paleokarst reservoirs in Tarim Basin. Mar. Pet. Geol. 123, 104720.
Zhou, K.B., Zhang, J.Y., Ren, Y.S., Huang, Z., Zhao, L.X., 2020. A gradient boosting decision tree algorithm combining synthetic minority over-sampling technique for lithology identification. Geophysics 85(4), WA147–WA158.
Table 1 Parameter tuning for the SVM, RF, and XGBoost models

| Model   | Parameter        | Search range | Optimal value |
|---------|------------------|--------------|---------------|
| SVM     | C                | 0.0001–100   | 33            |
|         | γ                | 0.0001–1     | 0.01          |
| RF      | Estimators       | 10–500       | 300           |
|         | Max depth        | 3–20         | 10            |
| XGBoost | Estimators       | 10–400       | 47            |
|         | Max depth        | 3–10         | 7             |
|         | Min child weight | 1–6          | 4             |
|         | Gamma            | 0.1–1        | 0.5           |

Table 2 The definition of the confusion matrix

|                 | Predicted Negative  | Predicted Positive  |
|-----------------|---------------------|---------------------|
| Actual Negative | True Negative (TN)  | False Positive (FP) |
| Actual Positive | False Negative (FN) | True Positive (TP)  |

Table 3 Distribution of samples of different lithologies in the dataset

| Lithology    | MS    | DM    | S     | DS    | MD    | Total |
|--------------|-------|-------|-------|-------|-------|-------|
| Number       | 595   | 799   | 252   | 244   | 462   | 2352  |
| Proportion/% | 25.30 | 33.97 | 10.71 | 10.37 | 19.64 | 100   |

Table 4 Comparison of predicted results from different lithology identification models

| Model              | Metric    | MS   | DM   | S    | DS   | MD   |
|--------------------|-----------|------|------|------|------|------|
| The stacking model | Precision | 0.91 | 0.77 | 0.50 | 0.51 | 0.81 |
|                    | Recall    | 0.93 | 0.91 | 0.66 | 0.21 | 0.60 |
|                    | F1-score  | 0.92 | 0.83 | 0.57 | 0.30 | 0.69 |
| SVM                | Precision | 0.88 | 0.76 | 0.51 | 0.48 | 0.86 |
|                    | Recall    | 0.95 | 0.91 | 0.62 | 0.16 | 0.58 |
|                    | F1-score  | 0.91 | 0.83 | 0.57 | 0.24 | 0.69 |
| RF                 | Precision | 0.88 | 0.71 | 0.52 | 0.75 | 0.77 |
|                    | Recall    | 0.96 | 0.92 | 0.49 | 0.03 | 0.52 |
|                    | F1-score  | 0.92 | 0.80 | 0.51 | 0.06 | 0.62 |
| XGBoost            | Precision | 0.91 | 0.77 | 0.55 | 0.39 | 0.67 |
|                    | Recall    | 0.93 | 0.81 | 0.67 | 0.23 | 0.69 |
|                    | F1-score  | 0.92 | 0.79 | 0.61 | 0.29 | 0.68 |

Figure 1. The working process of the stacking model based on XGBoost, RF, and LR.
Figure 2. Correlation matrix between GR, SP, RMSL, LLS, LLD, AC, CNL and DEN in the dataset.
Figure 3. Comparison of confusion matrices for the stacking model, SVM, RF, and XGBoost.
Figure 4. The importance of logging parameters in lithology identification.
Figure 5. Effect of different oversampling ratios of siltstone and dolomitic siltstone on the performance of different lithology identification models.
Figure 6. Comparison of predicted results from the stacking model, SVM, RF, and XGBoost models for well X1.
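The stacking architecture summarized in Table 1 and Figure 1 can be sketched with scikit-learn. This is a minimal illustration under stated assumptions, not the authors' code: `GradientBoostingClassifier` stands in for XGBoost so the example needs only scikit-learn, the well-log features and five lithology classes are replaced by synthetic data, and the hyperparameters simply echo Table 1's optimal values.

```python
# Sketch of a stacking classifier: RF and gradient boosting as base learners,
# logistic regression as the meta-learner, five-fold CV to build the
# out-of-fold meta-features (as in the paper's grid-search/CV setup).
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 8 well-log curves and 5 lithology classes,
# with class weights mimicking Table 3's imbalance (MS, DM, S, DS, MD).
X, y = make_classification(n_samples=600, n_features=8, n_informative=6,
                           n_classes=5, weights=[.25, .34, .11, .10, .20],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=300, max_depth=10,
                                      random_state=0)),
        ("gb", GradientBoostingClassifier(n_estimators=47, max_depth=7,
                                          random_state=0)),  # XGBoost stand-in
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,  # five-fold cross-validation for the meta-features
)
stack.fit(X_tr, y_tr)
print("macro F1:", round(f1_score(y_te, stack.predict(X_te),
                                  average="macro"), 3))
```

In practice the `"gb"` base learner would be replaced by `xgboost.XGBClassifier` with the Table 1 parameters (min child weight 4, gamma 0.5), and the minority classes would be oversampled with SMOTE before fitting.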
Fig. 1 [figure]
Fig. 2 [figure]
Fig. 3 [figure; panels (a)–(d)]
Fig. 4 [figure]
Fig. 5 [figure; panels (a) Stacking model, (b) SVM, (c) RF, (d) XGBoost: F1-score versus oversampling ratio (0–2) for mudstone, dolomitic mudstone, siltstone, dolomitic siltstone, and micritic dolomite]
Fig. 6 [figure]

Highlights
1. We build a stacking model based on random forest (RF), extreme gradient boosting (XGBoost) and linear regression (LR), and use the synthetic minority oversampling technique (SMOTE) to mitigate the imbalanced-lithology problem.
2. The identification performance of the stacking model is better than that of the SVM, RF and XGBoost models.
3. The SMOTE algorithm can effectively improve the identification performance of the minority lithologies.
4. The stacking model combined with SMOTE proposed in this paper has high lithology identification performance, which renders it practicable for lithology identification.

Declaration of Interest Statement
☒ The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
☐ The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: