AI for Photocatalysis

In this study we performed data-driven modeling of photocatalysis process. The objective was to build a machine learning (ML) model to predict first order rate constant k using the experimental conditions (Time, solution pH, Light intensity, Light source distance, dye concentration loading), elemental composition of catalyst (C, Fe, Al, Ni, Mo, S, Bi, Ag, Pd, Pt) physio-chemical properties of the catalyst (Volume, surface area, pore size, pore volume) and parameters of pollutant (solubility, molecular weight, H-bond acceptor and donor counts). Total data consisted of 1527 samples and 32 features, which were collected by experimentation. This dataset was divided into 1068 (70%) training set and 459 (30%) test set. In the first notebook (1. Exploratory Data Analysis) we performed exploratory data analysis. After this we checked the performance of avaialble (over 30) machine learning algorithms on test set of our data in 2. Experiments after training them on training set. The purpose was to get an idea that which ML algorithm will be best for our problem. After that, we performed feature selection using various feature selection methods in 3. Feature Selection notebook. The final features were selected using Boruta-shap method. After selecting the algorithm and features, we performed hyperparameter optimization using k-fold cross validation in 5. hyperparameter optimization. Then we built and trained our model on training set and checked its prediction performance on test set. Some plots depicting analysis of prediction performance and error anlaysis were also plotted here. After that we interpreted the machine learning model using various post-hoc interpretation methods. This includes SHAP, Partial Dependence Plots and Accumulated Local Effects. Finally we checked the robustness of our model by quantifying uncertainty in the prediction of machine learning model. We used conformal analysis for this purpose and analyzed the robustness of our model by employing various conformal anlaysis methods.

Reproducibility

The results presented in these notebooks are completely (~100%) reproducible. All you need is to use same computational environment which was used to create these results. The names and versions of the python packages used in this project are given in requirements.txt file. Furthermore, the exact version of some of the python packages is also printed at the start of each notebook. The user will have to install these packages, preferably in a new conda environment. Then make sure that you have copied all the code in utils notebook in a utils.py file and saved in the same direcotry/folder where other python scripts are present. The data file is expected to be in the data folder.