.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/brouta_feature_selection.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_brouta_feature_selection.py:


============================
3. Feature Selection
============================
Now that we know which model works best for our problem,
we perform feature selection using the best model.

.. GENERATED FROM PYTHON SOURCE LINES 8-45

.. code-block:: Python


    import numpy as np
    # restore the alias for libraries that still use np.NaN (removed in NumPy 2.0)
    np.NaN = np.nan
    #np.bool = np.bool_  # for compatibility with older versions of NumPy

    import seaborn as sns
    import matplotlib.pyplot as plt
    from easy_mpl import bar_chart

    ##### Monkey-patch SciPy before importing BorutaShap: BorutaShap calls the
    ##### old name binom_test, while newer SciPy releases only provide
    ##### binomtest (added in SciPy 1.7.0).
    import scipy.stats as stats

    # Add the old name if SciPy only has the new one
    if not hasattr(stats, "binom_test") and hasattr(stats, "binomtest"):
        def binom_test(x, n=None, p=0.5, alternative="two-sided"):
            return stats.binomtest(int(x), n=n, p=p, alternative=alternative).pvalue
        stats.binom_test = binom_test
    #####

    from BorutaShap import BorutaShap

    from sklearn.tree import DecisionTreeRegressor
    from sklearn.feature_selection import RFE
    from sklearn.feature_selection import SelectKBest
    from sklearn.feature_selection import f_regression
    from sklearn.feature_selection import SelectFromModel
    from sklearn.feature_selection import VarianceThreshold
    from sklearn.feature_selection import mutual_info_regression
    from sklearn.feature_selection import SequentialFeatureSelector

    from utils import set_rcParams
    from utils import version_info
    from utils import prepare_data, LABEL_MAP, SAVE


.. GENERATED FROM PYTHON SOURCE LINES 46-49

.. code-block:: Python


    for lib, ver in version_info().items():
        print(lib, ver)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    python 3.12.10 (main, May 6 2025, 10:49:23) [GCC 11.4.0]
    os posix
    ai4water 1.07
    lightgbm 4.6.0
    catboost 1.2.10
    xgboost 3.2.0
    easy_mpl 0.21.5
    SeqMetrics 2.0.0
    numpy 1.26.4
    pandas 2.2.3
    matplotlib 3.10.8
    h5py 3.16.0
    sklearn 1.3.1
    optuna 4.8.0
    skopt 0.10.2
    plotly 6.6.0
    seaborn 0.13.2
    crepes 0.9.0
    mapie 0.9.2
    shap 0.49.1
    scipy 1.17.1


.. GENERATED FROM PYTHON SOURCE LINES 50-65

.. code-block:: Python


    set_rcParams()

    TOP_K = 10

    df, _ = prepare_data(outputs="k")

    df = df.rename(columns=LABEL_MAP)

    feature_names = df.columns.tolist()[0:-1]

    X, y = df.iloc[:, 0:-1], df.iloc[:, -1].values

    print(X.shape, y.shape)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    (1527, 34) (1527,)


.. GENERATED FROM PYTHON SOURCE LINES 66-68

Information gain
--------------------

.. GENERATED FROM PYTHON SOURCE LINES 68-82

.. code-block:: Python


    importances = mutual_info_regression(X, y)

    bar_chart(
        importances,
        feature_names,
        color="teal",
        sort=True,
        show=False,
    )
    plt.tight_layout()
    plt.show()


.. image-sg:: /auto_examples/images/sphx_glr_brouta_feature_selection_001.png
   :alt: brouta feature selection
   :srcset: /auto_examples/images/sphx_glr_brouta_feature_selection_001.png, /auto_examples/images/sphx_glr_brouta_feature_selection_001_2_00x.png 2.00x
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 83-85

Chi-squared
------------
Because the target is continuous, each feature is scored here with
``f_regression`` (a univariate F-test); the chi-squared statistic itself
applies to classification targets.

.. GENERATED FROM PYTHON SOURCE LINES 85-95

.. code-block:: Python


    chi2_features = SelectKBest(f_regression, k=10)
    X_kbest_features = chi2_features.fit_transform(X, y)

    chi2_features = np.array(feature_names)[chi2_features.get_support()]
    print(chi2_features)

    chi2_features = chi2_features[0:TOP_K].tolist()


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    ['Band Gap (eV)' 'O (At%)' 'Al (At%)' 'Ni (At%)' 'Volume (L)'
     'HB donor count' 'Solubility (g/L)' 'M.W. (g/mol)' 'pka2' 'Solution pH']


.. GENERATED FROM PYTHON SOURCE LINES 96-98

Variance Threshold
-------------------

.. GENERATED FROM PYTHON SOURCE LINES 98-118

.. code-block:: Python


    v_threshold = VarianceThreshold(threshold=0)
    v_threshold.fit(X)
    v_threshold.get_support()

    bar_chart(
        v_threshold.variances_,
        feature_names,
        color="teal",
        sort=True,
        show=False
    )
    plt.tight_layout()
    plt.show()

    # map each feature name to its variance
    vt_features = dict(zip(feature_names, v_threshold.variances_))
    # sort by variance, largest first
    vt_features = dict(sorted(vt_features.items(), key=lambda item: item[1], reverse=True))
    # keep the names of the TOP_K features with the highest variance
    vt_features = list(vt_features.keys())[0:TOP_K]


.. image-sg:: /auto_examples/images/sphx_glr_brouta_feature_selection_002.png
   :alt: brouta feature selection
   :srcset: /auto_examples/images/sphx_glr_brouta_feature_selection_002.png, /auto_examples/images/sphx_glr_brouta_feature_selection_002_2_00x.png 2.00x
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 119-122

Forward Feature Selection
---------------------------
Start with an empty feature set and add features one at a time, each time
keeping the feature that most improves the cross-validated score.

.. GENERATED FROM PYTHON SOURCE LINES 122-133

.. code-block:: Python


    rgr = DecisionTreeRegressor()
    sfs_forward = SequentialFeatureSelector(
        rgr, n_features_to_select=TOP_K, direction="forward"
    ).fit(X, y)

    ffs_features = np.array(feature_names)[sfs_forward.get_support()]
    print(ffs_features)

    ffs_features = ffs_features.tolist()


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    ['Al (At%)' 'Ni (At%)' 'Pt (At%)' 'Light Int. (W)' 'Light Dist. (cm)'
     'HB acceptor count' 'pka2' 'Initial Conc. (mg/L)' 'Solution pH'
     'HA (mg/L)']


.. GENERATED FROM PYTHON SOURCE LINES 134-139

Backward feature elimination
-----------------------------
Start with the full set of features and remove them one at a time,
measuring the drop in performance after each removal. The features are
then ranked by the drop in performance that their removal causes.

.. GENERATED FROM PYTHON SOURCE LINES 139-149

.. code-block:: Python


    sfs_backward = SequentialFeatureSelector(
        rgr, n_features_to_select=TOP_K, direction="backward"
    ).fit(X, y)

    bfe_features = np.array(feature_names)[sfs_backward.get_support()]
    print(bfe_features)

    bfe_features = bfe_features.tolist()


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    ['Catalyst' 'Al (At%)' 'Pt (At%)' 'Surface Area (m2/g)' 'Light Dist. (cm)'
     'Rxn Time (min)' 'Dyes' 'log Kow' 'HB donor count' 'Initial Conc. (mg/L)']


.. GENERATED FROM PYTHON SOURCE LINES 150-153

Recursive Feature Elimination
-------------------------------
It is similar to backward feature elimination, except that at each step the
feature to drop is chosen from the fitted model's feature importances rather
than from the change in cross-validated score.

.. GENERATED FROM PYTHON SOURCE LINES 153-158

.. code-block:: Python


    rfe = RFE(DecisionTreeRegressor(), n_features_to_select=TOP_K, step=1)
    rfe.fit(X, y)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    RFE(estimator=DecisionTreeRegressor(), n_features_to_select=10)

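How RFE orders features can be seen on a small synthetic problem. The sketch
below is illustrative only: the ``make_regression`` data and the names
``X_demo``, ``y_demo``, ``demo_names`` and ``rfe_demo`` are assumptions
introduced here and are not part of this example's dataset.

.. code-block:: Python

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.feature_selection import RFE
    from sklearn.tree import DecisionTreeRegressor

    # Synthetic stand-in for (X, y): 8 features, 3 of which carry signal.
    X_demo, y_demo = make_regression(
        n_samples=200, n_features=8, n_informative=3, random_state=0
    )
    demo_names = np.array([f"f{i}" for i in range(X_demo.shape[1])])

    rfe_demo = RFE(
        DecisionTreeRegressor(random_state=0), n_features_to_select=3, step=1
    ).fit(X_demo, y_demo)

    # support_ is a boolean mask of the kept features.
    kept = demo_names[rfe_demo.support_]
    # ranking_ is 1 for kept features; the larger the rank,
    # the earlier the feature was eliminated.
    first_dropped_to_kept = demo_names[np.argsort(rfe_demo.ranking_)[::-1]]

    print("kept:", kept)
    print("first dropped -> kept:", first_dropped_to_kept)

With ``step=1`` exactly one feature is removed per refit, so ``ranking_``
gives a complete elimination order rather than only the final subset.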
.. GENERATED FROM PYTHON SOURCE LINES 159-164

.. code-block:: Python


    rfe_features = np.array(feature_names)[rfe.get_support()]
    print(rfe_features)

    rfe_features = rfe_features[0:TOP_K].tolist()


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    ['O (At%)' 'Al (At%)' 'Pore Vol. (cm3/g)' 'Cat. Loading (g/L)'
     'Light Int. (W)' 'Rxn Time (min)' 'Initial Conc. (mg/L)' 'Solution pH'
     'HA (mg/L)' 'Anions']


.. GENERATED FROM PYTHON SOURCE LINES 165-167

Tree based method
------------------
``SelectFromModel`` keeps the features whose importance exceeds a threshold
(by default the mean importance), so the number of selected features is not
fixed in advance.

.. GENERATED FROM PYTHON SOURCE LINES 167-176

.. code-block:: Python


    rgr = DecisionTreeRegressor().fit(X, y)
    model = SelectFromModel(rgr, prefit=True)
    tb_features = np.array(feature_names)[model.get_support()]
    print(tb_features)

    tb_features = tb_features[0:TOP_K].tolist()


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    ['Al (At%)' 'Ni (At%)' 'Cat. Loading (g/L)' 'Light Int. (W)'
     'Rxn Time (min)' 'Initial Conc. (mg/L)' 'Solution pH' 'Anions']


.. GENERATED FROM PYTHON SOURCE LINES 177-192

Boruta shap
-------------
The purpose of Boruta is to find the subset of the given features that is
relevant for the task. It creates a copy of each feature, called a shadow
feature, and shuffles it. The model is trained on the original features
together with the shuffled shadow features. The importance of the original
and shadow features is then calculated using SHAP. If the SHAP importance of
a shadow feature is higher than that of an original feature, the original
feature is rejected. The intuition is that, if a feature is important, its
shuffled version should not have more importance than the original feature.
Finally, the Boruta shap method groups the features as confirmed important,
confirmed rejected or tentative. Since Boruta involves training the model
again and again, it can be extremely costly if model training is time
consuming. For theory see `this article `_ and this `kaggle notebook `_ .

.. GENERATED FROM PYTHON SOURCE LINES 192-223

.. code-block:: Python


    class MyBoruta(BorutaShap):

        def box_plot(self, data, X_rotation, X_size, y_scale, figsize):

            # shift the values so that they are strictly positive on a log axis
            if y_scale=='log':
                minimum = data['value'].min()
                if minimum <= 0:
                    data['value'] += abs(minimum) + 0.01

            order = data.groupby(by=["Methods"])["value"].mean().sort_values(ascending=False).index
            my_palette = self.create_mapping_of_features_to_attribute(
                maps= ['#B17BB2', '#EE9E9D', '#00ABAC', '#B9E6FB'])
            # 'yellow', 'red', 'green', 'blue'
            # Use a color palette
            plt.figure(figsize=(10, 7))
            ax = sns.boxplot(x=data["Methods"], y=data["value"],
                             order=order, palette=my_palette)

            if y_scale == 'log':
                ax.set(yscale="log")

            ax.set_xticklabels(ax.get_xticklabels(), rotation=X_rotation, size=14)
            ax.tick_params(labelsize=14)
            ax.grid(visible=True, ls='--', color='lightgrey')
            ax.set_ylabel('Z-Score', fontsize=14)
            ax.set_xlabel('Features', fontsize=14,)
            plt.tight_layout()
            if SAVE:
                plt.savefig("results/figures/boruta_shap.png", dpi=600, bbox_inches="tight")
            return


.. GENERATED FROM PYTHON SOURCE LINES 224-227

.. code-block:: Python


    model = DecisionTreeRegressor()


.. GENERATED FROM PYTHON SOURCE LINES 228-233

.. code-block:: Python


    Feature_Selector = MyBoruta(model=model,
                                importance_measure='shap',
                                classification=False)


.. GENERATED FROM PYTHON SOURCE LINES 234-242

We observed that the total number of confirmed important and tentative
features stayed the same beyond 50 ``n_trials``; at 50 ``n_trials`` there
were 12 such potential features. Increasing ``n_trials`` further, up to 400,
only moved features from the 'tentative' category to 'confirmed important'.
Because of computational constraints on readthedocs, we set ``n_trials``
to 100 here.
The ``z_score`` on the y-axis is a measure of importance, so the
boxplots display the distribution of importance.

.. GENERATED FROM PYTHON SOURCE LINES 242-252

.. code-block:: Python


    Feature_Selector.fit(
        X=X,
        y=y,
        n_trials=100,
        sample=False,
        train_or_test = 'test',
        normalize=True,
        verbose=True
    )


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    0%| | 0/100 [00:00

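The shadow-feature idea behind Boruta can be sketched without the BorutaShap
package. The code below is a simplified illustration under stated assumptions,
not BorutaShap's implementation: it uses synthetic ``make_regression`` data,
swaps SHAP for the tree's impurity-based ``feature_importances_``, and only
counts how often each real feature beats the best shadow feature. The
statistical test that turns such counts into confirmed, rejected and tentative
groups is omitted. All names (``X_demo``, ``hits``, ``n_demo_trials``) are
hypothetical.

.. code-block:: Python

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)

    # Synthetic stand-in for (X, y): 6 features, 2 of which carry signal.
    X_demo, y_demo = make_regression(
        n_samples=300, n_features=6, n_informative=2, random_state=0
    )
    n_features = X_demo.shape[1]
    n_demo_trials = 20
    hits = np.zeros(n_features, dtype=int)

    for _ in range(n_demo_trials):
        # Shadow features: a copy of every column, each shuffled independently,
        # which breaks any relation between the column and the target.
        X_shadow = rng.permuted(X_demo, axis=0)
        X_both = np.hstack([X_demo, X_shadow])

        tree = DecisionTreeRegressor(random_state=0).fit(X_both, y_demo)
        importances = tree.feature_importances_
        real_imp = importances[:n_features]
        shadow_imp = importances[n_features:]

        # A real feature scores a "hit" when it beats the best shadow feature.
        hits += real_imp > shadow_imp.max()

    print("hits per feature out of", n_demo_trials, "trials:", hits)

Features that rarely or never beat the shadows are the candidates for
rejection, while features that win in almost every trial are the candidates
for confirmation.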