AI for Photocatalysis
In this study, we performed data-driven modeling of the photocatalysis process. The objective was
to build a machine learning (ML) model that predicts the first-order rate constant k from the
experimental conditions (time, solution pH, light intensity, light source distance, dye concentration
loading), the elemental composition of the catalyst (C, Fe, Al, Ni, Mo, S, Bi, Ag, Pd, Pt),
the physicochemical properties of the catalyst (volume, surface area, pore size, pore volume),
and the parameters of the pollutant (solubility, molecular weight, H-bond acceptor and donor counts).
The dataset consisted of 1527 samples
and 32 features, all collected experimentally. It was divided into a training set of 1068 samples (70%)
and a test set of 459 samples (30%). In the first notebook (1. Exploratory Data Analysis), we performed exploratory
data analysis. Next, in 2. Experiments, we trained the available machine learning algorithms (over 30)
on the training set and compared their performance on the test set.
The purpose was to get an idea of which ML algorithm would suit our problem best.
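
The split and screening step can be sketched as follows. This is a minimal illustration, not the code from the notebooks: the data are synthetic stand-ins with the same shape as the real dataset (1527 samples, 32 features), the three regressors are a hypothetical shortlist rather than the 30+ algorithms that were screened, and the use of R2 as the comparison metric is an assumption.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset (1527 samples, 32 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(1527, 32))
# Placeholder target; it is NOT a real rate constant.
y = 2.0 * X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=1527)

# 70/30 split; a fixed random_state keeps the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Hypothetical shortlist of candidate algorithms.
candidates = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

# Train each candidate on the training set and score it on the test set.
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))

# Print the candidates from best to worst test-set R2.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>18s}  R2 = {score:.3f}")
```

With `test_size=0.3`, scikit-learn rounds the test-set size up, so 1527 samples split into 1068 training and 459 test samples, matching the counts given above.
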
After that, we performed feature selection using various feature selection methods
in the 3. Feature Selection notebook. The final features
were selected using the Boruta-SHAP method. After selecting
the algorithm and features, we performed hyperparameter optimization using k-fold cross-validation
in 5. hyperparameter optimization.
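
Hyperparameter optimization with k-fold cross-validation can be sketched as below. The search strategy, the model, the grid, and the number of folds are all assumptions made for illustration: an exhaustive grid search over a random forest with 5 folds is used here as a stand-in, and the data are synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic training data standing in for the selected features.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 8))
y_train = X_train[:, 0] ** 2 + X_train[:, 1] + 0.1 * rng.normal(size=300)

# Hypothetical search space: 2 x 2 = 4 candidate configurations.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

# 5-fold cross-validation; shuffling with a fixed seed keeps folds reproducible.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Every configuration is scored by its mean R2 across the 5 folds,
# using the training set only, so the test set stays untouched.
search = GridSearchCV(
    RandomForestRegressor(random_state=0), param_grid, cv=cv, scoring="r2"
)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("mean cross-validated R2:", round(search.best_score_, 3))
```

Because the search only ever sees the training set, the test-set score reported afterwards remains an unbiased check of the tuned model.
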
Then we built and trained our model on the training set and checked its prediction performance on the test set. Some plots
depicting the prediction performance and error analysis were also produced here. After
that, we interpreted the machine learning model using various post-hoc interpretation methods. These
include SHAP, Partial Dependence Plots,
and Accumulated Local Effects. Finally, we checked the robustness
of our model by quantifying the uncertainty in its predictions. We used
conformal analysis for this purpose and
compared several conformal analysis methods.
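
To give a feel for how conformal methods turn point predictions into intervals, here is a minimal sketch of split conformal prediction, one common variant. Which conformal methods the notebooks actually use is not stated here, so this choice is an assumption, as are the model, the 10% miscoverage level `alpha`, the calibration-set size, and the synthetic data.

```python
import math

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data; sizes loosely mirror the 1068/459 split (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(1527, 5))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + 0.2 * rng.normal(size=1527)

# Hold out part of the training data as a calibration set.
X_fit, y_fit = X[:768], y[:768]
X_cal, y_cal = X[768:1068], y[768:1068]
X_test, y_test = X[1068:], y[1068:]

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_fit, y_fit)

# Nonconformity scores: absolute residuals on the calibration set.
alpha = 0.1  # target miscoverage, i.e. roughly 90% coverage
residuals = np.sort(np.abs(y_cal - model.predict(X_cal)))
n_cal = len(residuals)

# Finite-sample corrected quantile: the ceil((n + 1)(1 - alpha))-th smallest
# score. Capping the rank at n_cal is a simplification for tiny calibration
# sets, where the exact method would return an unbounded interval.
rank = min(math.ceil((n_cal + 1) * (1 - alpha)), n_cal)
q_hat = residuals[rank - 1]

# Symmetric prediction intervals of constant half-width q_hat.
pred = model.predict(X_test)
lower, upper = pred - q_hat, pred + q_hat
coverage = np.mean((y_test >= lower) & (y_test <= upper))

print(f"interval half-width: {q_hat:.3f}")
print(f"empirical test coverage: {coverage:.3f}")
```

The empirical coverage on the test set should land close to `1 - alpha`; a large gap would suggest that the calibration and test data are not exchangeable.
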
Reproducibility
The results presented in these notebooks are completely (~100%) reproducible. All you need is
to use the same computational environment that was used to create these results. The names and
versions of the Python packages used in this project are given in the requirements.txt file.
Furthermore, the exact versions of some of the Python packages are also printed at the start of
each notebook. You will have to install these packages, preferably in a new conda environment.
Then make sure that you have copied all the code from the utils notebook into a utils.py file saved in the
same directory/folder as the other Python scripts. The data file is expected to be
in the data folder.
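
Copying the utils code by hand is error-prone, so the step can also be scripted. The sketch below reads a notebook as JSON and writes its code cells to a script using only the standard library. The file name `utils.ipynb` and the function name `export_code_cells` are assumptions for illustration; adjust them to the actual notebook name.

```python
import json
from pathlib import Path


def export_code_cells(notebook_path, script_path):
    """Write the code cells of a Jupyter notebook into one Python script.

    Returns the number of code cells that were exported.
    """
    notebook = json.loads(Path(notebook_path).read_text(encoding="utf-8"))
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", [])
        # A cell's source is stored as a list of lines or as a single string.
        text = "".join(source) if isinstance(source, list) else source
        # Drop IPython magics and shell escapes, which are not valid Python.
        # Caveat: this simple filter also drops any ordinary line that
        # happens to start with "%" or "!", e.g. inside a multi-line string.
        lines = [
            line
            for line in text.splitlines()
            if not line.lstrip().startswith(("%", "!"))
        ]
        if any(line.strip() for line in lines):
            chunks.append("\n".join(lines))
    Path(script_path).write_text("\n\n".join(chunks) + "\n", encoding="utf-8")
    return len(chunks)


if __name__ == "__main__":
    # Hypothetical file name; run this from the folder holding the notebooks.
    if Path("utils.ipynb").exists():
        count = export_code_cells("utils.ipynb", "utils.py")
        print(f"wrote {count} code cells to utils.py")
```

If Jupyter is installed, `jupyter nbconvert --to script` offers a similar export from the command line.
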