Uploaded by Daniel Edwin

Project Final Presentation

Utilising GAN-Generated synthetic data to enhance Machine
Learning Models for Predicting Adsorbent Performance in Heavy
Metal Removal
by
J.D. Johannes
221038124
1. Project Background
2. Problem statement
3. Aims and Objectives
4. Literature Review
5. Methodology
6. Results and Discussion
7. Conclusion
8. Recommendations
Project Background
• Heavy metal pollution in wastewater is a major global concern,
including in Namibia, affecting water quality, public health, and
environmental sustainability (Jadoun et al., 2023).
• Metals like Chromium, Iron, Copper, Zinc, and Lead enter the
environment through mining, smelting, and other industrial
processes (Zhang et al., 2023).
• Existing methods for heavy metal removal are often inefficient,
expensive, and environmentally challenging.
• The performance of adsorbents depends on various factors,
making it complex to optimize without precise models.
• Machine Learning (ML) has the potential to accurately predict
adsorbent performance under various conditions, streamlining
the process of heavy metal removal (Uhlig et al., 2023).
• By reducing the need for extensive parametric optimization
experiments, ML models can cut costs and time.
• In the Namibian context, the lack of readily available
adsorption data poses an additional challenge.
• Waiting for enough data to accumulate from the industry
is not a viable option, hence the critical need for
alternative approaches.
• This is where Generative Adversarial Networks (GANs)
are used.
• There are no documented applications of such models for heavy
metal adsorption in Namibia.
• This project therefore aims to fill that gap by developing ML
models, trained with synthetic data, for predicting adsorbent
performance for heavy metal removal from wastewater.
Problem statement
In the context of wastewater treatment, predicting adsorbent
performance for heavy metal removal remains a complex and
costly task. The limited size of available datasets in Namibia adds
to this challenge. This research project seeks to address this issue
by employing machine learning techniques to generate synthetic
data from an existing dataset, aiming to create a predictive model
for adsorbent performance.
Aims and Objectives
Aim
• To develop and optimise a machine learning model to
accurately predict the removal of heavy metals from
wastewater.
Objectives
• Generate synthetic data from the available dataset using
Generative Adversarial Networks.
• Develop and refine a machine learning model using the
generated synthetic dataset.
Literature Review
• Adsorption is a process where a solid or liquid substance, an
adsorbent, accumulates molecules of a gas, liquid, or dissolved solids
on its surface, leading to a film of the adsorbate (Dutta et al., 2019).
Machine Learning
• Machine Learning (ML) enables computers to learn from data through
algorithms, creating models for prediction without explicit
instructions (Abbasi et al., 2022).
• Generative Adversarial Networks (GANs) are a class of artificial
intelligence algorithms used in unsupervised machine learning.
• They consist of two models, a generative model that creates data
samples and a discriminative model that evaluates them, working in
tandem to improve the authenticity of the generated data
(Goodfellow et al., 2014).
• There are different kinds of GAN architectures, such as the Deep
Convolutional GAN (DCGAN), the Gaussian Mixture Model Conditional
GAN (GMM-CGAN), and the Vanilla GAN.
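The interplay between the two models can be written as the minimax objective given by Goodfellow et al. (2014). The notation here is not from the slides: G is the generator, D the discriminator, x a real sample drawn from the data distribution, and z a noise sample fed to the generator.

```latex
\[
\min_{G} \max_{D} V(D, G) =
\mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D(x)\right]
+ \mathbb{E}_{z \sim p_{z}(z)}\left[\log\left(1 - D(G(z))\right)\right]
\]
```

The discriminator tries to increase V by scoring real samples high and generated samples low, while the generator tries to decrease it by producing samples the discriminator cannot tell apart from real ones.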
Previous Works
• Machine Learning (ML) models have shown promising results in predicting
adsorption efficiency in various studies (Zhao et al., 2023; Dashti et al.,
2023).
• Li-juan and Chao-bo (2008) used a regression support vector machine (SVM) to
build a prediction model for the effluent quality of a sewage treatment plant.
• Yu et al. (2023) utilised a GAN based on the Gaussian Mixture Model
(GMM_GAN) to generate synthetic datasets for small samples.
• Aziira et al. (2020) used GANs and Conditional Generative Adversarial
Networks (CGANs) to generate synthetic continuous numerical data. Their study
revealed that the CGAN-generated data achieved 63% accuracy in mimicking real data.
Technical background
Equation 1
Figure 1: The overall block diagram of the methodology (Yu et al., 2023).
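\[
\begin{aligned}
\text{Minimize} \quad & \frac{1}{2}\lVert \mathbf{w} \rVert^{2} + C \sum_{i=1}^{n} \xi_i \\
\text{Subject to} \quad & y_i\left(\mathbf{w} \cdot \mathbf{x}_i + b\right) \ge 1 - \xi_i \ \text{and} \ \xi_i \ge 0
\end{aligned}
\]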
Where 𝒘 is the weight vector, 𝑏 is the bias
term, 𝜉𝑖 are the slack variables, 𝐶 is the penalty
parameter, and 𝑦𝑖 and 𝒙𝑖 are the labels and
data points, respectively.
Figure 3: Support vector machine (SVM) classifier (Taoufik et al., 2022)
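As a rough illustration of the soft-margin formulation above, the sketch below evaluates the objective for a fixed linear classifier. The function name, the toy points, and the chosen values of w, b, and C are all hypothetical and are not taken from the project; each slack variable is set to the smallest value that satisfies both constraints.

```python
def soft_margin_objective(w, b, X, y, C):
    """Return (objective, slacks) for a linear soft-margin SVM at a given (w, b)."""
    slacks = []
    for x_i, y_i in zip(X, y):
        # Functional margin y_i (w . x_i + b) of this point
        margin = y_i * (sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b)
        # Smallest xi_i with y_i (w . x_i + b) >= 1 - xi_i and xi_i >= 0
        slacks.append(max(0.0, 1.0 - margin))
    objective = 0.5 * sum(w_j * w_j for w_j in w) + C * sum(slacks)
    return objective, slacks


# Hypothetical toy data: two points labelled +1, one labelled -1
X = [(2.0, 0.0), (0.5, 0.0), (-2.0, 0.0)]
y = [1, 1, -1]
objective, slacks = soft_margin_objective(w=(1.0, 0.0), b=0.0, X=X, y=y, C=10.0)
print(objective, slacks)
```

Only the second point falls inside the margin, so it is the only one that contributes a non-zero slack; a larger C makes such violations more expensive.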
Evaluation Parameters for ML models:
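\[
\text{Root Mean Square Error} = \text{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_{i,\text{Exp}} - y_{i,\text{Pre}}\right)^{2}}
\]
\[
\text{Mean Relative Error} = \text{MRE} = \frac{1}{N}\sum_{i=1}^{N}\left|\frac{y_{i,\text{Exp}} - y_{i,\text{Pre}}}{y_{i,\text{Pre}}}\right|
\]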
Evaluation setup for GANs:
Kolmogorov-Smirnov test
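The two ML evaluation parameters can be computed in a few lines. This is a minimal sketch that assumes the standard RMSE definition and an MRE taken as the mean absolute error relative to the predicted value, with strictly positive predictions; the function names and the toy values are hypothetical.

```python
import math


def rmse(y_exp, y_pre):
    """Root mean square error between experimental and predicted values."""
    n = len(y_exp)
    return math.sqrt(sum((e - p) ** 2 for e, p in zip(y_exp, y_pre)) / n)


def mre(y_exp, y_pre):
    """Mean relative error, relative to the predicted value (assumed positive)."""
    n = len(y_exp)
    return sum(abs(e - p) / p for e, p in zip(y_exp, y_pre)) / n


# Hypothetical experimental and predicted concentrations (ppm)
y_exp = [1.0, 2.0, 4.0]
y_pre = [2.0, 2.0, 2.0]
rmse_value = rmse(y_exp, y_pre)
mre_value = mre(y_exp, y_pre)
print(rmse_value, mre_value)
```

RMSE is expressed in the same units as the output, whereas MRE is dimensionless, which makes it easier to compare across metals with different concentration ranges.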
Methodology
• The dataset has three inputs: solid loading, adsorption time,
and temperature. The outputs are the concentrations (in ppm) of
Cr, Fe, Cu, Zn, and Pb. It has 108 data points.
• This study employs a Conditional Wasserstein Generative
Adversarial Network (CWGAN) architecture integrated
with a Gaussian Mixture Model (GMM).
• The dataset was preprocessed using the StandardScaler
from scikit-learn to standardize the features (inputs) to zero
mean and unit variance.
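The preprocessing step might look like the sketch below. The array values are hypothetical placeholders for the three inputs (solid loading, adsorption time, temperature), not rows from the project dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: solid loading, adsorption time, temperature
X = np.array([
    [0.5, 10.0, 25.0],
    [1.0, 30.0, 35.0],
    [1.5, 60.0, 45.0],
    [2.0, 90.0, 55.0],
])

scaler = StandardScaler()
# Each column becomes (x - column mean) / column standard deviation
X_scaled = scaler.fit_transform(X)

# Scaled values (for example generated samples) can be mapped back to physical units
X_restored = scaler.inverse_transform(X_scaled)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

Keeping the fitted scaler around matters for a generative workflow, because synthetic samples produced in the scaled space need `inverse_transform` before they can be read as real operating conditions.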
• The data is then partitioned into training, validation, and test sets
with an 80-20 split for training and testing.
• A GMM is fitted on the training data to sample noise for the
generator.
• The models are trained using the Adam optimizer with a learning rate
of 0.0002. The batch size is set to 128, and the training is halted at
500 epochs.
• A Wasserstein loss function was utilised for training both the
generator and the discriminator, rather than the cross-entropy loss
of a standard GAN.
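The data split, the GMM noise source, and the Wasserstein losses described above can be sketched as follows. Everything specific here is an assumption for illustration: the random placeholder data, the three mixture components, a noise vector with the same dimension as a data row, and plain NumPy loss functions in place of the TensorFlow/Keras training loop. The Lipschitz constraint a Wasserstein critic needs (weight clipping or a gradient penalty) is not shown.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

# Settings quoted on the slide
LEARNING_RATE = 0.0002
BATCH_SIZE = 128
EPOCHS = 500

rng = np.random.default_rng(0)
# Placeholder for the 108 scaled rows (3 inputs + 5 outputs); not real data
data = rng.normal(size=(108, 8))

# 80-20 split into training and test rows
train, test = train_test_split(data, test_size=0.2, random_state=0)

# Fit a GMM on the training data and draw one batch of generator noise from it
gmm = GaussianMixture(n_components=3, random_state=0).fit(train)
noise, _ = gmm.sample(BATCH_SIZE)


def critic_loss(real_scores, fake_scores):
    """Wasserstein critic loss: minimised when real rows score above fake rows."""
    return np.mean(fake_scores) - np.mean(real_scores)


def generator_loss(fake_scores):
    """Wasserstein generator loss: minimised when fake rows score high."""
    return -np.mean(fake_scores)


example_critic = critic_loss(np.array([1.0, 3.0]), np.array([0.0, 1.0]))
example_generator = generator_loss(np.array([0.0, 1.0]))
print(train.shape, test.shape, noise.shape, example_critic, example_generator)
```

Unlike a cross-entropy discriminator, the critic outputs unbounded scores rather than probabilities, so its loss is a difference of means and can be negative.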
• Qualitative Assessment: Distribution plots are used to visually
compare the distributions of the real and synthetic data.
• Quantitative Assessment: A Kolmogorov-Smirnov test is employed to
statistically assess the similarity between the distributions of the real
and synthetic data.
• The GAN model was implemented using Python's TensorFlow and
Keras libraries.
• A dataset of 1,000 synthetic samples was generated with the trained model.
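To make the quantitative assessment concrete, the sketch below computes the two-sample Kolmogorov-Smirnov statistic for a single feature directly from the empirical cumulative distribution functions. The function name and the sample values are hypothetical; in practice `scipy.stats.ks_2samp` returns the same statistic together with a p-value.

```python
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Largest absolute gap between the ECDFs of two one-dimensional samples."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    n, m = len(real_sorted), len(synth_sorted)
    d_max = 0.0
    # The ECDFs only change at observed values, so checking those is enough
    for x in real_sorted + synth_sorted:
        f_real = bisect_right(real_sorted, x) / n
        f_synth = bisect_right(synth_sorted, x) / m
        d_max = max(d_max, abs(f_real - f_synth))
    return d_max


# Hypothetical values of one feature
real = [1.0, 2.0, 3.0, 4.0]
ks_same = ks_statistic(real, [1.0, 2.0, 3.0, 4.0])
ks_shifted = ks_statistic(real, [3.0, 4.0, 5.0, 6.0])
ks_disjoint = ks_statistic(real, [11.0, 12.0, 13.0, 14.0])
print(ks_same, ks_shifted, ks_disjoint)
```

The statistic ranges from 0 (identical ECDFs) to 1 (no overlap at all), so values near 1 indicate that the synthetic feature does not follow the real distribution.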
Methodology (ML Model)
• The features (inputs) were standardized using the StandardScaler from
scikit-learn.
• The data is then partitioned into training, validation, and test sets
with an 80-20 split for training and testing.
• An SVM with a radial basis function (RBF) kernel was used.
• The SVM model was trained on the training set.
• The model’s performance was evaluated on the test set.
• The regression metrics mean squared error (MSE), root mean square error
(RMSE), and R-squared were used to evaluate model performance.
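The steps above can be sketched with scikit-learn as follows. This is only an illustration under stated assumptions: the data are random placeholders standing in for the synthetic samples, a single output is modelled (scikit-learn's `SVR` predicts one target at a time), and the default `SVR` hyperparameters are used because the slides do not report any.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
# Hypothetical stand-in for 1,000 synthetic samples with three inputs
X = rng.uniform(size=(1000, 3))
# Hypothetical single output with a little noise
y = 2.0 * X[:, 0] - X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.05, size=1000)

# 80-20 split for training and testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The pipeline fits the scaler on the training rows only, then the RBF-kernel SVR
model = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
rmse = float(np.sqrt(mse))
r2 = r2_score(y_test, y_pred)
print(f"MSE={mse:.4f} RMSE={rmse:.4f} R2={r2:.3f}")
```

Wrapping the scaler and the regressor in one pipeline keeps test-set statistics out of the scaling step, which avoids an optimistic bias in the reported metrics.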
Results and Discussion
Figure 2: Comparative distributions of real and synthetic data after 500 training epochs.
• The histograms display the distributional characteristics of
both real and synthetic data of the inputs.
• The blue histogram corresponds to the real data, and the
orange histogram represents the synthetic data generated by
the GAN model.
• The GAN model shows a promising overlap between real and
synthetic feature distributions indicating effective learning.
• The slight variations suggest room for improvement
through model fine-tuning or additional training epochs.
• The Kolmogorov-Smirnov test results, however, do not support
the visual inspection: the KS statistic is 0.80 with a p-value
< 0.05.
• High KS statistic: a KS statistic of 0.80 is high, which
suggests that the empirical cumulative distribution
functions (ECDFs) of the real and synthetic data sets are
not close.
• A p-value < 0.05 means that we reject the null hypothesis that
both samples come from the same distribution,
indicating that there is a statistically significant difference
between the distributions of the real and synthetic data.
• RMSE (0.25): this value reflects the typical magnitude of the
prediction errors and suggests moderate predictive
accuracy.
• MSE (0.0625): this value means that, on average, the square of
the error between the predicted and actual values is 0.0625.
• R² (0.4): an R² of 0.4 indicates that approximately 40% of the variance
in the dependent variable is predictable from the independent
variables. This suggests that there is room for improvement.
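The reported values are linked, which gives a quick consistency check. The symbols below are notation introduced here rather than taken from the slides: y_i is an actual value, \hat{y}_i the corresponding prediction, and \bar{y} the mean of the actual values.

```latex
\[
\text{MSE} = \text{RMSE}^{2} = 0.25^{2} = 0.0625
\]
\[
R^{2} = 1 - \frac{\sum_{i}\left(y_i - \hat{y}_i\right)^{2}}{\sum_{i}\left(y_i - \bar{y}\right)^{2}} = 0.4
\quad\Longrightarrow\quad
\frac{\sum_{i}\left(y_i - \hat{y}_i\right)^{2}}{\sum_{i}\left(y_i - \bar{y}\right)^{2}} = 0.6
\]
```

In other words, the residual sum of squares is still 60% of the total sum of squares, so the model removes less than half of the variation that a constant mean prediction would leave.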
• The p-values for RMSE, MSE, and R² are 0.05, which sits at the
conventional significance threshold, so the model’s performance is only
borderline statistically significant.
• This implies that while there is an indication of the model capturing
some patterns in the data, the level of confidence in these results is
not very high.
• The modest R² value and the borderline p-values could be
attributed to the poor quality of the synthetic data.
Conclusion
• Despite the model reaching convergence, the synthetic dataset it produced was
of suboptimal quality.
• This was evidenced by the difference in distributions between real and synthetic
data and the high KS statistic of 0.80 with a p-value < 0.05.
• An RMSE of 0.25 and an R² of 0.4 show that the model has only moderate predictive power.
• The p-values of 0.05 for RMSE, MSE, and R² suggest that the model’s
performance is only borderline statistically significant.
• In conclusion, while the GAN model showed potential in generating synthetic
data, the current limitations highlight the need for a more nuanced approach.
• By addressing these limitations, there is potential to significantly improve the quality
of the generated synthetic data, thereby making it more useful for predictive
modeling.
Recommendations
• It is recommended to generate a higher-quality dataset for more
reliable and robust modeling by adjusting the GAN model architecture
and refining the training process.
• Consider using different types of generative models like variational
autoencoders (VAEs) or different GAN architectures that might be
better suited for the specific characteristics of the dataset.
• Engage with experts in data science, machine learning, and chemical
engineering for a multi-faceted approach.
Project timeline
Project start: Mon, 6/19/2023
TASK                                   PROGRESS   START       END
Project consultation with supervisor   100%       19-Jun-23   5-Jul-23
Proposal write up                      100%       8-Jul-23    8-Aug-23
Proposal Presentation                  100%       14-Aug-23   18-Aug-23
Proposal Submission                    70%        18-Aug-23   24-Aug-23
Progress Presentation                  0%         24-Aug-23   20-Oct-23
Final Project Presentation             0%         20-Oct-23   10-Nov-23
Final Project Submission               0%         10-Nov-23   17-Nov-23
Figure 4. Gantt chart for work breakdown
Thank you!
Questions
References
• Dashti, A., Raji, M., Riasat Harami, H., Zhou, J. L., & Asghari, M. (2023b). Biochar performance evaluation for
heavy metals removal from industrial wastewater based on machine learning: Application for environmental
protection. Separation and Purification Technology, 312, 123399.
https://doi.org/10.1016/j.seppur.2023.123399
• Rajendran, S., Priya, A. K., Senthil Kumar, P., Hoang, T. K. A., Sekar, K., Chong, K. Y., Khoo, K. S., Ng, H. S., &
Show, P. L. (2022). A critical and recent developments on adsorption technique for removal of heavy metals
from wastewater-A review. Chemosphere, 303, 135146.
https://doi.org/10.1016/J.CHEMOSPHERE.2022.135146
• Uhlig, S., Alkhasli, I., Schubert, F., Tschöpe, C., & Wolff, M. (2023). A review of synthetic and augmented
training data for machine learning in ultrasonic non-destructive evaluation. Ultrasonics, 134, 107041.
https://doi.org/10.1016/J.ULTRAS.2023.107041
• Zhang, W., Huang, W., Tan, J., Huang, D., Ma, J., & Wu, B. (2023). Modeling, optimization and understanding
of adsorption process for pollutant removal via machine learning: Recent progress and future perspectives.
Chemosphere, 311, 137044. https://doi.org/10.1016/j.chemosphere.2022.137044
• Yu, H., Wang, Q. F., & Shi, J. Y. (2023). Data augmentation generated by generative adversarial network for small
sample datasets clustering. Neural Processing Letters. https://doi.org/10.1007/s11063-023-11315-z
• Motamed, S., Rogalla, P., & Khalvati, F. (2021). Data augmentation using
Generative Adversarial Networks (GANs) for GAN-based detection of
Pneumonia and COVID-19 in chest X-ray images. Informatics in medicine
unlocked, 27, 100779. https://doi.org/10.1016/j.imu.2021.100779
• Mulé, S., Lawrance, L., Belkouchi, Y., Vilgrain, V., Lewin, M., Trillaud, H.,
Hoeffel, C., Laurent, V., Ammari, S., Morand, E., Faucoz, O., Tenenhaus,
A., Cotten, A., Meder, J. F., Talbot, H., Luciani, A., & Lassau, N. (2023).
Generative adversarial networks (GAN)-based data augmentation of rare
liver cancers: The SFR 2021 Artificial Intelligence Data
Challenge. Diagnostic and interventional imaging, 104(1), 43–48.
https://doi.org/10.1016/j.diii.2022.09.005
• Aziira, A., Setiawan, N., & Soesanti, I. (2020). Generation of synthetic continuous
numerical data using generative adversarial networks. Journal of Physics:
Conference Series, 1577, 012027. https://doi.org/10.1088/1742-6596/1577/1/012027
• Dutta, S., & Sharma, R. K. (2019). In Separation Science and
Technology.
• Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D.,
Ozair, S., Courville, A., & Bengio, Y. (2014). Generative Adversarial
Networks. arXiv. https://arxiv.org/abs/1406.2661
• Li-juan, W., & Chao-bo, C. (2008). Support vector machine applying
in the prediction of effluent quality of sewage treatment plant
with cyclic activated sludge system process. 2008 IEEE
International Symposium on Knowledge Acquisition and
Modeling Workshop, Wuhan, China, 647-650.
https://doi.org/10.1109/KAMW.2008.4810572