Data-Driven Modeling of a Sine Function¶

This notebook demonstrates the core functionality of the Data-Driven module using a simple analytical sine function as a toy model. The goal is to approximate a known function:

$$ y = a \sin(bt + c) $$

using machine learning techniques based on synthetic data. The input parameters $\mathbf{x} = [a, b, c]^T$ are sampled from predefined distributions, and the outputs $\mathbf{y}$ are computed at $N = 10$ discrete time steps $t_0, \dots, t_9$. Gaussian noise is added to simulate observational uncertainty.
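
For concreteness, a dataset of this shape can be generated along the following lines; the CSV files loaded below were produced elsewhere, and the distributions, sample count, and time grid in this sketch are assumptions:

# Hypothetical data-generation sketch (assumed distributions and time grid)
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_samples = 1000
t = np.linspace(0, 2 * np.pi, 10)        # 10 time steps t_0, ..., t_9

a = rng.uniform(0.8, 1.2, n_samples)     # amplitude
b = rng.uniform(0.9, 1.1, n_samples)     # angular frequency
c = rng.normal(0.0, 0.1, n_samples)      # phase shift
x_df = pd.DataFrame({'a': a, 'b': b, 'c': c})

y = a[:, None] * np.sin(b[:, None] * t[None, :] + c[:, None])
y += rng.normal(0.0, 0.02, y.shape)      # Gaussian observation noise
y_df = pd.DataFrame(y, columns=[f't_{i}' for i in range(t.size)])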

The notebook then walks through the following steps:

  • A gPCE (Generalized Polynomial Chaos Expansion) model is trained to approximate the mapping from input parameters to function outputs.

  • Sobol sensitivity indices are computed to assess the global influence of each input parameter on the response.

  • SHAP (SHapley Additive exPlanations) values are computed to explain the contributions of individual input features to model predictions.

  • The effects of selected input parameters are isolated and subtracted from the model response to better understand their influence.

Imports¶

In [1]:
import sys
import os
parent_dir = os.path.abspath("../../libraries/surrogate_modelling")
sys.path.insert(0, parent_dir)
In [2]:
from distributions import *
from simparameter import SimParameter
from simparameter_set import SimParamSet
from surrogate_model import SurrogateModel
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from digital_twin import DigitalTwin
import utils

import warnings
warnings.filterwarnings("ignore", category=RuntimeWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

Load data¶

In [3]:
x_df = pd.read_csv('../data/sine_data/example_x_df.csv')
y_df = pd.read_csv('../data/sine_data/example_y_df.csv')

For training and sensitivity analysis¶

In [4]:
# Build the input parameter set from the sampled data
Q = utils.get_simparamset_from_data(x_df)
# One quantity of interest (QoI) per time step
QoI_names = y_df.columns.to_list()

# Maximum interaction order for the Sobol sensitivity analysis
max_index = 2
# QoI used in the single-QoI plots below
QoI_param = 't_3'

# Input parameters whose effects are subtracted in the 'Effects' section
subtracted_effects = ['a', 'b', 'c']

For measurements¶

In [5]:
# Measured outputs and the corresponding observation noise levels
z_m_df = pd.read_csv('../data/sine_data/example_z_m_df.csv')
sigma = pd.read_csv('../data/sine_data/example_sigma_df.csv')

For update¶

In [6]:
# MCMC settings: ensemble walkers, burn-in steps, and sampling iterations
nwalkers = 64
nburn = 800
niter = 200

# Measurement-error parameter set built from the noise levels
E = utils.generate_stdrn_simparamset(sigma.values.reshape(-1, 1))

# Reference parameter sample used in the SHAP and effect plots below
q = np.load('../data/sine_data/example_q.npy')
q_df = pd.read_csv('../data/sine_data/example_q_df.csv')

Train model¶

Choosing the Model¶

In this notebook, the surrogate model can be built with one of the following methods: gPCE, DNN, GBT, or LinReg.

  • gPCE (Generalized Polynomial Chaos Expansion): Models uncertainty by expressing the output as a polynomial expansion in the random inputs, with coefficients determined by projecting the system's response onto orthogonal polynomials (see the minimal sketch after this list).

  • DNN (Deep Neural Networks): Machine learning models composed of interconnected neurons, capable of learning complex, non-linear patterns from data.

  • GBT (Gradient Boosting Trees): Combines weak decision trees into a strong predictive model, iteratively correcting errors from previous trees for robust performance.

  • LinReg (Linear Regression): A statistical method that models the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data.
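
To make the gPCE idea concrete, here is a minimal one-dimensional sketch, independent of the surrogate library: it expands $\sin(\xi)$ for a standard normal $\xi$ in probabilists' Hermite polynomials by least squares, then reads the mean and variance directly off the coefficients. The degree matches the gPCE configuration below; everything else is illustrative.

# Minimal 1D gPCE sketch, independent of the surrogate library:
# approximate f(xi) = sin(xi), xi ~ N(0, 1), with probabilists' Hermite
# polynomials He_n, which are orthogonal under the standard normal weight.
import numpy as np
from numpy.polynomial.hermite_e import hermevander
from math import factorial

rng = np.random.default_rng(0)
p = 5                                  # polynomial degree, as in the gPCE config below
xi = rng.standard_normal(2000)         # samples of the standard normal germ
y = np.sin(xi)                         # model evaluations

V = hermevander(xi, p)                 # regression matrix [He_0(xi), ..., He_p(xi)]
coef, *_ = np.linalg.lstsq(V, y, rcond=None)

# Orthogonality yields the moments directly from the coefficients:
# E[y] = c_0 and Var[y] = sum_{n>=1} c_n^2 * n!  (since E[He_n^2] = n!)
mean = coef[0]
var = sum(c**2 * factorial(n) for n, c in enumerate(coef) if n > 0)
print(mean, var)                       # ~0.0 and ~0.43 = Var[sin(xi)]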

In [7]:
method = "gPCE"
In [8]:
# Model configurations
match method:
    # DNN model configurations
    case "DNN":
        config = {
            'init_config' : {
                'layers': [
                    {'neurons': 512, 'activation': 'relu', 'dropout': 0.2},
                    {'neurons': 256, 'activation': 'sigmoid', 'dropout': 0.2},
                    {'neurons': 128, 'activation': 'relu', 'dropout': None},
                ],
                'outputAF': 'tanh'        # output activation function
            },
            'train_config' : {
                'optimizer': 'Adam',
                'loss': 'MSE',
                'epochs': 100,
                'batch_size': 32,
                'k_fold': None,
                'early_stopping': {
                    'patience': 25,
                    'min_delta': 0.0001
                }
            }
        }
    # gPCE model configurations
    case "gPCE":
        config = {
            'init_config' : {
                'p' : 5                   # total polynomial degree of the expansion
            },
            'train_config' : {
                'k_fold': 5               # 5-fold cross-validation
            }
        }
    # GBT model configurations
    case "GBT":
        config = {
            'init_config' : {
                'gbt_method': 'xgboost'
            },
            'train_config' : {
                'max_depth': 3,
                'num_of_iter': 250,
                'k_fold': 5
            }
        }
    # LinReg (and any other method) needs no extra configuration
    case _:
        config = {'init_config': {}, 'train_config': {}}
In [9]:
split_config = {
    'train_test_ratio': 0.2, 
    'random_seed': 1997,
    'split_type': 'no_shuffle'
    }
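A plausible reading of this configuration, sketched as a hypothetical helper (whether 'train_test_ratio' is the train or the test fraction is not stated here; this sketch assumes the test fraction, and 'no_shuffle' is taken to mean the original row order is kept, so the random seed is unused):

# Hypothetical stand-in for the library's splitting logic
def no_shuffle_split(x_df, y_df, test_ratio=0.2):
    n_test = int(len(x_df) * test_ratio)
    return (x_df.iloc[:-n_test], x_df.iloc[-n_test:],
            y_df.iloc[:-n_test], y_df.iloc[-n_test:])

X_train, X_test, y_train, y_test = no_shuffle_split(x_df, y_df)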
In [10]:
# Initialize surrogate model
# This creates an instance of the SurrogateModel class using the provided sampling data (Q)
model = SurrogateModel(Q, QoI_names, method, **config['init_config'])
# Split data into training and testing sets
model.train_test_split(x_df, y_df, **split_config)
# Train the model
model.train(model.X_train, model.y_train, **config['train_config'])
----- Training started for 'gPCE' model -----
Fold 1/5
Fold 2/5
Fold 3/5
Fold 4/5
Fold 5/5
Average train loss: 0.00000000000000, Average valid loss: 0.00000000000000
----- Training ended for 'gPCE' model -----
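The fold-by-fold log above corresponds to a standard k-fold cross-validation loop. As a rough, library-independent illustration of the mechanics (the surrogate library's internals may differ), with scikit-learn and a stand-in linear model:

# Rough k-fold sketch with scikit-learn (illustrative stand-in model)
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
y = X @ np.array([1.0, -2.0, 0.5])     # noise-free toy target

losses = []
for k, (tr, va) in enumerate(KFold(n_splits=5).split(X), start=1):
    print(f"Fold {k}/5")
    reg = LinearRegression().fit(X[tr], y[tr])
    losses.append(np.mean((reg.predict(X[va]) - y[va]) ** 2))
print(f"Average valid loss: {np.mean(losses):.14f}")   # ~0 for this exact-fit toy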
In [11]:
# Mean and variance of the surrogate model output over the input distribution
mean, var = model.get_mean_and_var()
mean, var
Out[11]:
(tensor([5.7904e-04, 3.4249e-01, 6.4277e-01, 8.6492e-01, 9.8207e-01, 9.8027e-01,
         8.6017e-01, 6.3689e-01, 3.3810e-01, 5.3171e-04], dtype=torch.float64),
 tensor([9.7192e-05, 4.6615e-04, 1.0875e-03, 1.0816e-03, 3.9445e-04, 5.1472e-04,
         3.8145e-03, 1.1676e-02, 2.2656e-02, 3.2258e-02], dtype=torch.float64))

Sobol sensitivities¶

In [12]:
# Compute partial variances and Sobol indices up to interaction order max_index (set to 2 above)
partial_variance, sobol_index = model.get_sobol_sensitivity(max_index)
In [13]:
# Plot Sobol sensitivity indices
# The 'plot_sobol_sensitivity' method plots the Sobol sensitivity indices, which quantify
# the contribution of each input parameter to the output uncertainty.
# Called without arguments it uses its defaults; the optional arguments
# - max_index (maximum interaction order) and
# - param_name (the quantity of interest to plot, 't_3' here)
# are used in the next cell.
fig = model.plot_sobol_sensitivity()
[Figure: Sobol sensitivity indices]
In [14]:
fig = model.plot_sobol_sensitivity(max_index=max_index, param_name=QoI_param)
[Figure: Sobol sensitivity indices for QoI 't_3' up to interaction order 2]
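
For intuition, the first-order Sobol indices of this toy model can also be estimated directly by Monte Carlo with the Saltelli pick-freeze scheme; the fixed time and the uniform input ranges below are assumptions, not the distributions stored in x_df:

# Monte Carlo first-order Sobol indices via the Saltelli (2010) pick-freeze
# estimator for y = a*sin(b*t + c) at one fixed time t (assumed input ranges)
import numpy as np

rng = np.random.default_rng(1997)
t = 0.3
f = lambda X: X[:, 0] * np.sin(X[:, 1] * t + X[:, 2])

N = 100_000
lo = np.array([0.8, 0.9, -0.2])        # assumed lower bounds for (a, b, c)
hi = np.array([1.2, 1.1, 0.2])         # assumed upper bounds
A = rng.uniform(lo, hi, (N, 3))
B = rng.uniform(lo, hi, (N, 3))

yA, yB = f(A), f(B)
var = np.var(np.concatenate([yA, yB]))
for i, name in enumerate(['a', 'b', 'c']):
    ABi = A.copy()
    ABi[:, i] = B[:, i]                # vary only the i-th input
    S_i = np.mean(yB * (f(ABi) - yA)) / var
    print(f"S_{name} = {S_i:.3f}")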

SHAP values¶

In [15]:
# Compute SHAP (SHapley Additive exPlanations) values for the test data.
# SHAP values attribute each prediction of the model to the individual input
# features, explaining the impact of each parameter on the output.
shap_values = await model.get_shap_values(model.X_test)
Message: sample size for shap values is set to 100.
  0%|          | 0/100 [00:00<?, ?it/s]
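Conceptually, SHAP values are Shapley values of a cooperative game over the input features. For a toy three-feature model they can be computed exactly by enumerating coalitions; the model and background data below are assumptions, and this is not the library's SHAP backend:

# Exact Shapley values by coalition enumeration for a tiny 3-feature model
import numpy as np
from itertools import combinations
from math import factorial

f = lambda X: 2 * X[:, 0] + X[:, 1] * X[:, 2]   # assumed toy model
rng = np.random.default_rng(0)
background = rng.standard_normal((500, 3))      # reference distribution
x = np.array([1.0, 0.5, -2.0])                  # instance to explain

def v(S):
    """Coalition value: E[f(X)] with the features in S fixed to x."""
    Xs = background.copy()
    Xs[:, list(S)] = x[list(S)]
    return f(Xs).mean()

n = 3
phi = np.zeros(n)
for i in range(n):
    others = [j for j in range(n) if j != i]
    for k in range(n):
        for S in combinations(others, k):
            w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            phi[i] += w * (v(S + (i,)) - v(S))

print(phi)                                      # per-feature contributions
print(phi.sum(), f(x[None]).item() - v(()))     # both equal f(x) - E[f(X)]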
In [16]:
# Plot SHAP Single Waterfall Plot
# The 'plot_shap_single_waterfall' method generates a SHAP waterfall plot for a single test sample, 
# visualizing how each feature contributes to the model's prediction for that sample.
# - q: A DataFrame containing the sample data (e.g., a specific parameter set) for which the SHAP values are calculated.
# - param_name: The name of the quantity of interest (QoI) being analyzed ('t_3' in this case).
# The SHAP waterfall plot shows the cumulative effect of each feature on the model's prediction, helping to interpret individual predictions.
fig = await model.plot_shap_single_waterfall(q=q_df, param_name=QoI_param)
  0%|          | 0/1 [00:00<?, ?it/s]
[Figure: SHAP waterfall plot for QoI 't_3']
In [17]:
# SHAP waterfall plots for the sample q_df
fig = await model.plot_shap_multiple_waterfalls(q=q_df)
  0%|          | 0/1 [00:00<?, ?it/s]
[Figure: SHAP waterfall plots for the sample q_df]
In [18]:
# Plot SHAP beeswarm plot
# The 'plot_shap_beeswarm' method visualizes the distribution of SHAP values
# across many test samples, showing the impact of each feature on the model's predictions.
# - param_name: the quantity of interest (QoI) for which SHAP values are shown ('t_3' in this case).
# The beeswarm plot summarizes global feature importance and how feature values
# relate to their contributions across samples.
fig = await model.plot_shap_beeswarm(param_name=QoI_param)
[Figure: SHAP beeswarm plot for QoI 't_3']

Effects¶

In [19]:
# Isolate the effects of 'a', 'b', and 'c' on the first 100 test samples
# and subtract them from the model response
effects = await model.subtract_effects(model.X_test.iloc[:100], model.y_test.iloc[:100], subtracted_effects)
fig = await model.plot_effects(effects)
Message: sample size for shap values is set to 100, which is the number of samples.
  0%|          | 0/100 [00:00<?, ?it/s]
[Figure: effects of 'a', 'b', and 'c' on the model response]
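
The idea behind effect subtraction can be seen on a linear model, where the Shapley contribution of each feature relative to a background mean is exact and can simply be removed from the response; all weights and values below are illustrative:

# Effect subtraction on a linear model, where Shapley values are exact:
# phi_i = w_i * (x_i - mu_i) relative to the background mean mu
import numpy as np

w = np.array([2.0, -1.0, 0.5])         # assumed weights for (a, b, c)
mu = np.array([1.0, 1.0, 0.0])         # background means
x = np.array([1.2, 0.9, 0.3])          # instance to explain
y = w @ x

phi = w * (x - mu)                     # exact Shapley values here
subtracted = ['b', 'c']                # e.g. the b- and c-effects, as in the next cell
idx = [['a', 'b', 'c'].index(s) for s in subtracted]
remaining = y - phi[idx].sum()         # response with those effects removed
print(y, remaining)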
In [20]:
# Subtract the effects of 'b' and 'c' from the measured response and raise an alert on outliers
figure, alert, remaining_effects, error_ratio, outliers = await model.plot_subtract_effects_and_alert(q_df, z_m_df, subtracted_effects[1:])
Message: sample size for shap values is set to 200, which is the number of samples.
  0%|          | 0/200 [00:00<?, ?it/s]
Message: sample size for shap values is set to 1, which is the number of samples.
  0%|          | 0/1 [00:00<?, ?it/s]
[Figure: measured response after subtracting the effects of 'b' and 'c']

Update¶

In [21]:
# Plot the prior distributions of the input parameters
fig = utils.plot_prior(model)
[Figure: prior distributions of the input parameters]
In [22]:
# Build the digital twin from the surrogate model and the error model,
# then update the input distribution with the measurements via MCMC
DT = DigitalTwin(model, E)
DT.update(z_m_df, nwalkers=nwalkers, nburn=nburn, niter=niter)
DT.get_mean_and_var_of_posterior()
gpc_map = DT.get_MAP()  # maximum a posteriori estimate
fig = utils.plot_MCMC(model, DT, nwalkers=nwalkers, map_point=gpc_map)
MCMC creating
Burning period
100%|██████████| 800/800 [01:07<00:00, 11.77it/s]
MCMC running
100%|██████████| 200/200 [00:16<00:00, 11.90it/s]
--- 84.87607264518738 seconds ---
[Figure: MCMC posterior samples with the MAP estimate]
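
The walker, burn-in, and iteration settings together with the progress output are consistent with an affine-invariant ensemble sampler. As an assumption (the DigitalTwin internals are not shown here), the same kind of update can be sketched directly with emcee, inferring q = (a, b, c) from noisy observations of the sine model:

# Assumed emcee-based sketch of the Bayesian update for q = (a, b, c)
import numpy as np
import emcee

rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 10)
q_true = np.array([1.0, 1.0, 0.1])
sigma_n = 0.05
z_m = (q_true[0] * np.sin(q_true[1] * t + q_true[2])
       + sigma_n * rng.standard_normal(t.size))       # synthetic measurements

def log_prob(q):
    a, b, c = q
    if not (0 < a < 2 and 0 < b < 2 and -1 < c < 1):  # flat box prior (assumed)
        return -np.inf
    resid = z_m - a * np.sin(b * t + c)
    return -0.5 * np.sum((resid / sigma_n) ** 2)      # Gaussian likelihood

nwalkers, ndim, nburn, niter = 64, 3, 800, 200
p0 = q_true + 1e-2 * rng.standard_normal((nwalkers, ndim))
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob)
sampler.run_mcmc(p0, nburn + niter, progress=True)
posterior = sampler.get_chain(discard=nburn, flat=True)
print(posterior.mean(axis=0), posterior.var(axis=0))  # posterior mean and variance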