Utilising GAN-Generated Synthetic Data to Enhance Machine Learning Models for Predicting Adsorbent Performance in Heavy Metal Removal
by J.D. Johannes 221038124

1. Project Background
2. Problem statement
3. Aims and Objectives
4. Literature Review
5. Methodology
6. Results and Discussion
7. Conclusion
8. Recommendation

Project Background
• Heavy metal pollution in wastewater is a major global concern, including in Namibia, affecting water quality, public health, and environmental sustainability (Jadoun et al., 2023).
• Metals such as chromium, iron, copper, zinc, and lead enter the environment through mining, smelting, and other industrial processes (Zhang et al., 2023).
• Existing methods for heavy metal removal are often inefficient, expensive, and environmentally challenging.
• The performance of adsorbents depends on various factors, making it difficult to optimise without precise models.
• Machine Learning (ML) has the potential to accurately predict adsorbent performance under various conditions, streamlining the process of heavy metal removal (Uhlig et al., 2023).
• By reducing the need for extensive parametric optimisation experiments, ML models can cut costs and time.
• In the Namibian context, the lack of readily available adsorption data poses an additional challenge.
• Waiting for enough data to accumulate from industry is not a viable option, hence the critical need for alternative approaches.
• This is where Generative Adversarial Networks (GANs) come in: they can generate synthetic data from a small existing dataset.
• There are no documented applications of such models for heavy metal adsorption in Namibia.
• This project therefore aims to address that gap by developing ML models, trained with synthetic data, for predicting adsorbent performance for heavy metal removal from wastewater.

Problem statement
In the context of wastewater treatment, predicting adsorbent performance for heavy metal removal remains a complex and costly task.
The limited size of available datasets in Namibia adds to this challenge. This research project seeks to address this issue by employing machine learning techniques to generate synthetic data from an existing dataset, aiming to create a predictive model for adsorbent performance.

Aims and Objectives
Aim
• To develop and optimise a machine learning model to accurately predict the removal of heavy metals from wastewater.
Objectives
• Generate synthetic data from the available dataset using Generative Adversarial Networks.
• Develop and refine a machine learning model using the generated synthetic dataset.

Literature Review
• Adsorption is a process in which a solid or liquid substance, the adsorbent, accumulates molecules of a gas, liquid, or dissolved solid on its surface, forming a film of the adsorbate (Dutta et al., 2019).
Machine Learning
• Machine Learning (ML) enables computers to learn from data through algorithms, creating models for prediction without explicit instructions (Abbasi et al., 2022).
• Generative Adversarial Networks (GANs) are a class of artificial intelligence algorithms used in unsupervised machine learning.
• They consist of two models, a generative model that creates data samples and a discriminative model that evaluates them, working in tandem to improve the authenticity of the generated data (Goodfellow et al., 2014).
• There are different kinds of GAN architectures, such as Deep Convolutional GAN (DCGAN), Gaussian Mixture Model Conditional GAN (GMM CGAN), and Vanilla GAN.
Previous Works
• Machine Learning (ML) models have shown promising results in predicting adsorption efficiency in various studies (Zhao et al., 2023; Dashti et al., 2023).
• Li-juan and Chao-bo (2008) used a regression support vector machine (SVM) to build a prediction model for the effluent quality of a sewage treatment plant.
• Yu et al. (2023) utilised a GAN based on the Gaussian Mixture Model (GMM_GAN) to generate synthetic datasets for small samples.
• Aziira et al. (2020) used GANs and Conditional Generative Adversarial Networks (CGANs) to generate synthetic continuous numerical data.
• Their study found that the CGAN-generated data achieved 63% accuracy in mimicking real data.

Technical background
Equation 1
Figure 1: The overall block diagram of the methodology (Yu et al., 2023).

\[
\begin{aligned}
\text{Minimize} \quad & \frac{1}{2}\lVert \mathbf{w} \rVert^{2} + C \sum_{i=1}^{n} \xi_i \\
\text{Subject to} \quad & y_i \left( \mathbf{w} \cdot \mathbf{x}_i + b \right) \ge 1 - \xi_i \quad \text{and} \quad \xi_i \ge 0
\end{aligned}
\]
where \(\mathbf{w}\) is the weight vector, \(b\) is the bias term, \(\xi_i\) are the slack variables, \(C\) is the penalty parameter, and \(y_i\) and \(\mathbf{x}_i\) are the labels and data points, respectively.
Figure 3: Support vector machine (SVM) classifier (Taoufik et al., 2022).

Evaluation parameters for ML models:
\[
\text{Root Mean Square Error} = \mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( \frac{y_{i,\mathrm{Exp}} - y_{i,\mathrm{Pre}}}{y_{i,\mathrm{Pre}}} \right)^{2}}
\]
\[
\text{Mean Relative Error} = \mathrm{MRE} = \frac{1}{N} \sum_{i=1}^{N} \left| \frac{y_{i,\mathrm{Exp}} - y_{i,\mathrm{Pre}}}{y_{i,\mathrm{Pre}}} \right|
\]
where \(y_{i,\mathrm{Exp}}\) and \(y_{i,\mathrm{Pre}}\) are the experimental and predicted values and \(N\) is the number of data points.
Evaluation setup for GANs: Kolmogorov-Smirnov test

Methodology
• The dataset has three inputs: solid loading, adsorption time, and temperature. The outputs are the concentrations, in ppm, of Cr, Fe, Cu, Zn, and Pb. It contains 108 data points.
• This study employs a Conditional Wasserstein Generative Adversarial Network (CWGAN) architecture integrated with a Gaussian Mixture Model (GMM).
• The input features were standardised using the StandardScaler from scikit-learn.
• The data is then partitioned into training, validation, and test sets, with an 80-20 split between training and testing.
• A GMM is fitted on the training data to sample noise for the generator.
• The models are trained using the Adam optimizer with a learning rate of 0.0002. The batch size is set to 128, and training is halted at 500 epochs.
• A Wasserstein loss function was utilised for training both the generator and the discriminator. Specifically, cross-entropy was used as the loss function.
• Qualitative Assessment: Distribution plots are used to visually compare the distributions of the real and synthetic data.
• Quantitative Assessment: A Kolmogorov-Smirnov test is employed to statistically assess the similarity between the distributions of the real and synthetic data.
• The GAN model was implemented using Python's TensorFlow and Keras libraries.
• A dataset of 1000 synthetic samples was generated with the trained model.

Methodology (ML Model)
• The input features were standardised using the StandardScaler from scikit-learn.
• The data is then partitioned into training, validation, and test sets, with an 80-20 split between training and testing.
• A support vector machine (SVM) with a radial basis function (RBF) kernel was used.
• The SVM model was trained on the training set.
• The model’s performance was evaluated on the test set.
• The regression metrics mean squared error (MSE), root mean squared error (RMSE), and R-squared were used to evaluate model performance.

Results and Discussion
Figure 2: Comparative distributions of real and synthetic data after 500 training epochs.
• The histograms display the distributional characteristics of the real and synthetic input data.
• The blue histogram corresponds to the real data, and the orange histogram represents the synthetic data generated by the GAN model.
• Visually, the GAN model shows a promising overlap between real and synthetic feature distributions, indicating that some learning has taken place.
• The slight variations suggest room for improvement through model fine-tuning or additional training epochs.
• The Kolmogorov-Smirnov test, however, does not support the visual impression: it gives a KS statistic of 0.80 with a p-value < 0.05.
• KS statistic: a value of 0.80 is high (the statistic ranges from 0 to 1), which suggests that the empirical cumulative distribution functions (ECDFs) of the real and synthetic data sets are not close.
• A p-value < 0.05 means that we reject the null hypothesis that both samples come from the same distribution, indicating a statistically significant difference between the distributions of the real and synthetic data.
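
The KS comparison described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used in the project: it relies on `scipy.stats.ks_2samp`, runs one two-sample test per input feature, and uses randomly generated placeholder arrays in place of the real 108-point dataset and the GAN output. The function name `compare_distributions`, the array names, and the feature labels are hypothetical.

```python
import numpy as np
from scipy.stats import ks_2samp


def compare_distributions(real, synthetic, feature_names, alpha=0.05):
    """Run a two-sample Kolmogorov-Smirnov test for each feature column.

    Returns a dict mapping feature name to the KS statistic, the p-value,
    and whether the null hypothesis (same distribution) is rejected.
    """
    report = {}
    for j, name in enumerate(feature_names):
        result = ks_2samp(real[:, j], synthetic[:, j])
        report[name] = {
            "statistic": float(result.statistic),
            "pvalue": float(result.pvalue),
            "reject_h0": bool(result.pvalue < alpha),
        }
    return report


# Placeholder data (assumption): standardised inputs drawn at random,
# standing in for the real dataset and for GAN-generated samples.
rng = np.random.default_rng(0)
feature_names = ["solid_loading", "adsorption_time", "temperature"]
real_inputs = rng.normal(loc=0.0, scale=1.0, size=(108, 3))
close_synthetic = rng.normal(loc=0.0, scale=1.0, size=(1000, 3))
shifted_synthetic = rng.normal(loc=3.0, scale=1.0, size=(1000, 3))

close_report = compare_distributions(real_inputs, close_synthetic, feature_names)
shifted_report = compare_distributions(real_inputs, shifted_synthetic, feature_names)

for name in feature_names:
    print(name, close_report[name], shifted_report[name])
```

A statistic near 0 with a large p-value means the two ECDFs are close, while a statistic near 1 with a small p-value, as with the deliberately shifted placeholder sample here, means the synthetic data does not follow the real distribution.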
Results and Discussion
• RMSE (0.25): this value is the square root of the MSE and approximates the standard deviation of the prediction errors. It suggests moderate predictive accuracy.
• MSE (0.0625): this value means that, on average, the square of the error between the predicted and actual values is 0.0625.
• R² (0.4): an R² of 0.4 indicates that approximately 40% of the variance in the dependent variable is predictable from the independent variables. This suggests that there is room for improvement.
• The p-values for RMSE, MSE, and R² are 0.05, which places the model’s performance at the borderline of statistical significance.
• This implies that while there is an indication of the model capturing some patterns in the data, the level of confidence in these results is not very high.
• The modest R² value and the borderline significant p-values could be attributed to the poor quality of the synthetic data.

Conclusion
• Although the GAN model reached convergence, the synthetic dataset it produced was of suboptimal quality.
• This was evidenced by the difference in distributions between real and synthetic data and the high KS statistic of 0.80 with a p-value < 0.05.
• An RMSE of 0.25 and an R² of 0.4 show that the SVM model has moderate predictive ability.
• The p-values of 0.05 for RMSE, MSE, and R² indicate that the model’s performance is only borderline statistically significant.
• In conclusion, while the GAN model showed potential in generating synthetic data, the current limitations highlight the need for a more nuanced approach.
• By addressing these limitations, there is potential to significantly improve the quality of the generated synthetic data, thereby making it more useful for predictive modeling.

Recommendations
• Generate a higher-quality synthetic dataset for more reliable and robust modeling by adjusting the GAN model architecture and refining the training process.
• Consider using different types of generative models, such as variational autoencoders (VAEs), or different GAN architectures that might be better suited to the specific characteristics of the dataset.
• Engage with experts in data science, machine learning, and chemical engineering for a multi-faceted approach.

Project timeline
Figure 4. Gantt chart for work breakdown (project start: 19 June 2023; tasks: project consultation with supervisor, proposal write-up, proposal presentation, proposal submission, progress presentation, final project presentation, final project submission).

Thank you!
Questions

References
• Dashti, A., Raji, M., Riasat Harami, H., Zhou, J. L., & Asghari, M. (2023). Biochar performance evaluation for heavy metals removal from industrial wastewater based on machine learning: Application for environmental protection. Separation and Purification Technology, 312, 123399. https://doi.org/10.1016/j.seppur.2023.123399
• Rajendran, S., Priya, A. K., Senthil Kumar, P., Hoang, T. K. A., Sekar, K., Chong, K. Y., Khoo, K. S., Ng, H. S., & Show, P. L. (2022). A critical and recent developments on adsorption technique for removal of heavy metals from wastewater-A review. Chemosphere, 303, 135146. https://doi.org/10.1016/J.CHEMOSPHERE.2022.135146
• Uhlig, S., Alkhasli, I., Schubert, F., Tschöpe, C., & Wolff, M. (2023). A review of synthetic and augmented training data for machine learning in ultrasonic non-destructive evaluation. Ultrasonics, 134, 107041.
https://doi.org/10.1016/J.ULTRAS.2023.107041
• Zhang, W., Huang, W., Tan, J., Huang, D., Ma, J., & Wu, B. (2023). Modeling, optimization and understanding of adsorption process for pollutant removal via machine learning: Recent progress and future perspectives. Chemosphere, 311, 137044. https://doi.org/10.1016/j.chemosphere.2022.137044
• Yu, H., Wang, Q. F., & Shi, J. Y. (2023). Data Augmentation Generated by Generative Adversarial Network for Small Sample Datasets Clustering. Neural Processing Letters. https://doi.org/10.1007/s11063-023-11315-z
• Motamed, S., Rogalla, P., & Khalvati, F. (2021). Data augmentation using Generative Adversarial Networks (GANs) for GAN-based detection of Pneumonia and COVID-19 in chest X-ray images. Informatics in Medicine Unlocked, 27, 100779. https://doi.org/10.1016/j.imu.2021.100779
• Mulé, S., Lawrance, L., Belkouchi, Y., Vilgrain, V., Lewin, M., Trillaud, H., Hoeffel, C., Laurent, V., Ammari, S., Morand, E., Faucoz, O., Tenenhaus, A., Cotten, A., Meder, J. F., Talbot, H., Luciani, A., & Lassau, N. (2023). Generative adversarial networks (GAN)-based data augmentation of rare liver cancers: The SFR 2021 Artificial Intelligence Data Challenge. Diagnostic and Interventional Imaging, 104(1), 43–48. https://doi.org/10.1016/j.diii.2022.09.005
• Aziira, A., Setiawan, N., & Soesanti, I. (2020). Generation of Synthetic Continuous Numerical Data Using Generative Adversarial Networks. Journal of Physics: Conference Series, 1577, 012027. https://doi.org/10.1088/1742-6596/1577/1/012027
• Dutta, S., & Sharma, R. K. (2019). In Separation Science and Technology.
• Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative Adversarial Networks. arXiv. https://arxiv.org/abs/1406.2661
• Li-juan, W., & Chao-bo, C. (2008). Support Vector Machine Applying in the Prediction of Effluent Quality of Sewage Treatment Plant with Cyclic Activated Sludge System Process. 2008 IEEE International Symposium on Knowledge Acquisition and Modeling Workshop, Wuhan, China, 647–650. https://doi.org/10.1109/KAMW.2008.4810572