Build (build.py)

This module provides the ModelBuilder abstract base class, which defines a standard interface for building, fitting, saving, and loading PyMC models, inspired by the scikit-learn API.

ModelBuilder

class ModelBuilder(ABC):
    """Abstract Base Class for building PyMC models with a scikit-learn-like API."""

ModelBuilder gives PyMC models a consistent, scikit-learn-like interface. Subclasses must implement the abstract properties and methods listed below, which cover model definition, data handling, and configuration.

Attributes

  • _model_type (str): Class attribute identifying the specific model type (e.g., “BaseMMM”). Should be overridden by subclasses.

  • version (str): Class attribute specifying the version of the model class. Should be overridden by subclasses.

  • X (Optional[pd.DataFrame]): Holds the feature data after preprocessing within the instance. Initialized as None.

  • y (Optional[Union[pd.Series, np.ndarray]]): Holds the target data after preprocessing within the instance. Initialized as None.

  • model (Optional[pm.Model]): The PyMC model object. Created by the build_model method.

  • idata (Optional[az.InferenceData]): Stores the results of the model fitting process (posterior samples, observed data, sample stats, etc.). Generated by the fit method.

  • is_fitted_ (bool): A flag indicating whether the fit method has been successfully called. Initialized as False.

  • sampler_config (Dict): Dictionary holding the configuration parameters for the PyMC sampler (e.g., draws, tune, chains).

  • model_config (Dict): Dictionary holding the configuration parameters for the model structure (e.g., priors, hyperparameters).

Initialization

    def __init__(
        self,
        model_config: Optional[Dict] = None,
        sampler_config: Optional[Dict] = None,
    ) -> None:

Initializes the ModelBuilder and stores model_config and sampler_config on the instance. When either argument is None, the corresponding abstract property (default_model_config or default_sampler_config) supplies the value.

Parameters:

  • model_config (Optional[Dict], optional): Configuration for the model structure. Defaults to self.default_model_config.

  • sampler_config (Optional[Dict], optional): Configuration for the sampler. Defaults to self.default_sampler_config.
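
The fallback to default configurations can be illustrated with a minimal sketch. It does not reproduce the actual ModelBuilder code: ConfigDefaultsSketch and ToyModel are hypothetical stand-ins, and the example default values are assumptions chosen for illustration.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class ConfigDefaultsSketch(ABC):
    """Hypothetical stand-in for ModelBuilder; shows only the config fallback."""

    def __init__(
        self,
        model_config: Optional[Dict] = None,
        sampler_config: Optional[Dict] = None,
    ) -> None:
        # Fall back to the subclass-provided defaults when None is passed.
        self.model_config = (
            self.default_model_config if model_config is None else model_config
        )
        self.sampler_config = (
            self.default_sampler_config if sampler_config is None else sampler_config
        )

    @property
    @abstractmethod
    def default_model_config(self) -> Dict: ...

    @property
    @abstractmethod
    def default_sampler_config(self) -> Dict: ...


class ToyModel(ConfigDefaultsSketch):
    """Hypothetical subclass with illustrative default values."""

    @property
    def default_model_config(self) -> Dict:
        return {"intercept": {"mu": 0.0, "sigma": 1.0}}

    @property
    def default_sampler_config(self) -> Dict:
        return {"draws": 1000, "tune": 1000, "chains": 4}


default_model = ToyModel()
custom_model = ToyModel(sampler_config={"draws": 200, "chains": 2})
```

Here default_model uses both defaults, while custom_model keeps the default model_config and replaces only the sampler settings.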

Abstract Properties

(These must be implemented by subclasses)

output_var

    @property
    @abstractmethod
    def output_var(self) -> str:

Should return the name of the target variable within the PyMC model (the observed variable in the likelihood function).

default_model_config

    @property
    @abstractmethod
    def default_model_config(self) -> Dict:

Should return a dictionary defining the default structure and parameters (e.g., priors) for the specific model implementation.

default_sampler_config

    @property
    @abstractmethod
    def default_sampler_config(self) -> Dict:

Should return a dictionary defining the default parameters for the PyMC sampler (e.g., draws, tune, chains).

_serializable_model_config

    @property
    @abstractmethod
    def _serializable_model_config(self) -> Dict:

Should return a version of self.model_config where all values are JSON-serializable (e.g., converting tuples or NumPy arrays to lists). This is used when saving the model’s metadata.
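
One way a subclass might produce such a dictionary is a small recursive converter. The helper name to_serializable and the example configuration are assumptions for illustration; a subclass is free to build its serializable config by hand instead.

```python
import json
from typing import Any


def to_serializable(value: Any) -> Any:
    """Recursively convert a config value into JSON-friendly types."""
    if isinstance(value, dict):
        return {key: to_serializable(item) for key, item in value.items()}
    if isinstance(value, (tuple, list)):
        return [to_serializable(item) for item in value]
    if hasattr(value, "tolist"):
        # NumPy arrays and scalars expose tolist(); duck typing avoids
        # importing NumPy in this sketch.
        return to_serializable(value.tolist())
    return value


# Hypothetical config containing tuples, which JSON would otherwise
# silently turn into lists without the caller being aware of it.
model_config = {"beta": {"dims": ("channel",), "mu": 0.0, "sigma": (1.0, 2.0)}}
serializable = to_serializable(model_config)
encoded = json.dumps(serializable)
```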

Abstract Methods

(These must be implemented by subclasses)

_data_setter

    @abstractmethod
    def _data_setter(
        self,
        X: Union[np.ndarray, pd.DataFrame],
        y: Optional[Union[np.ndarray, pd.Series]] = None
    ) -> None:

Should update the data containers within the already built PyMC model (self.model) using pm.set_data(). This is essential for making predictions on new data.

_generate_and_preprocess_model_data

    @abstractmethod
    def _generate_and_preprocess_model_data(
        self,
        X: Union[np.ndarray, pd.DataFrame],
        y: Union[np.ndarray, pd.Series]
    ) -> None:

Should perform any necessary preprocessing on the input data X and y, store the results (e.g., in self.X, self.y), and define the model’s coordinates using self.model.add_coords() if applicable. This method is called internally by fit before build_model.

build_model

    @abstractmethod
    def build_model(
        self,
        X: Union[np.ndarray, pd.DataFrame],
        y: Union[np.ndarray, pd.Series],
        **kwargs: Any,
    ) -> None:

Should define the complete PyMC model structure within a pm.Model() context, using the (potentially preprocessed) data X and y, and assign the created model to self.model.

Concrete Methods

_validate_data

    def _validate_data(self, X: Any, y: Optional[Any] = None) -> Union[Tuple[Any, Any], Any]:

Internal helper that validates input data with scikit-learn’s check_X_y (when y is given) or check_array (when only X is given). If scikit-learn is not installed, the data is returned unchanged.
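
The optional-dependency pattern described here might look like the following sketch. The function name validate_data and the exact arguments passed to the scikit-learn helpers are assumptions; the real method may pass additional options.

```python
from typing import Any, Optional

try:
    from sklearn.utils.validation import check_array, check_X_y
except ImportError:  # scikit-learn is treated as optional in this sketch
    check_array = None
    check_X_y = None


def validate_data(X: Any, y: Optional[Any] = None) -> Any:
    """Validate with scikit-learn when installed; otherwise pass data through."""
    if check_X_y is None or check_array is None:
        return X if y is None else (X, y)
    if y is None:
        return check_array(X)
    return check_X_y(X, y)


X = [[1.0, 2.0], [3.0, 4.0]]
y = [0.5, 1.5]
X_checked, y_checked = validate_data(X, y)  # tuple when y is supplied
X_only = validate_data(X)                   # single object otherwise
```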

set_idata_attrs

    def set_idata_attrs(self, idata: Optional[az.InferenceData] = None) -> az.InferenceData:

Adds standard metadata (model type, version, configurations, ID) to the attributes of an ArviZ InferenceData object.
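
As a rough sketch of the kind of metadata involved, the function below collects the listed fields into a flat dictionary of strings. The function name, the attribute key names, and the choice to JSON-encode the configurations are assumptions for illustration; a plain dictionary stands in for the attrs of an InferenceData object.

```python
import json
from typing import Any, Dict


def build_idata_attrs(
    model_id: str,
    model_type: str,
    version: str,
    model_config: Dict[str, Any],
    sampler_config: Dict[str, Any],
) -> Dict[str, str]:
    """Collect metadata as strings so it can be stored as file attributes."""
    return {
        "id": model_id,
        "model_type": model_type,
        "version": version,
        # Nested dictionaries are JSON-encoded so every attribute is a string.
        "model_config": json.dumps(model_config),
        "sampler_config": json.dumps(sampler_config),
    }


# All values below are made up for the example.
attrs = build_idata_attrs(
    model_id="0123456789abcdef",
    model_type="BaseMMM",
    version="0.1.0",
    model_config={"beta": {"dims": ["channel"], "sigma": 1.0}},
    sampler_config={"draws": 1000, "tune": 1000, "chains": 4},
)
```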

save

    def save(self, fname: str) -> None:

Saves the InferenceData object (self.idata) to a NetCDF file using idata.to_netcdf(). Requires the model to be fitted first.

load (classmethod)

    @classmethod
    def load(cls, fname: str) -> "ModelBuilder":

Loads a previously saved model instance from a NetCDF file. It reads the InferenceData, extracts configurations, reinstantiates the model class (cls), rebuilds the PyMC model structure, and sets the idata.

_model_config_formatting (classmethod)

    @classmethod
    def _model_config_formatting(cls, model_config: Dict) -> Dict:

Internal helper used by load to convert list representations in the loaded JSON configuration back into tuples (for ‘dims’) or NumPy arrays (for other lists).
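
The conversion can be sketched as below, assuming the configuration is a dictionary of per-parameter dictionaries (one level of nesting). The function name format_loaded_config and the to_array parameter are hypothetical; to_array defaults to list so the sketch runs without NumPy, and passing np.array instead would yield NumPy arrays as described above.

```python
import json
from typing import Any, Callable, Dict


def format_loaded_config(
    model_config: Dict[str, Any],
    to_array: Callable[[list], Any] = list,
) -> Dict[str, Any]:
    """Restore tuples for 'dims' and array-likes for other lists."""
    formatted: Dict[str, Any] = {}
    for name, entry in model_config.items():
        if not isinstance(entry, dict):
            formatted[name] = entry
            continue
        formatted[name] = {}
        for key, value in entry.items():
            if isinstance(value, list):
                # 'dims' becomes a tuple; any other list goes through to_array.
                value = tuple(value) if key == "dims" else to_array(value)
            formatted[name][key] = value
    return formatted


loaded = json.loads('{"beta": {"dims": ["channel"], "mu": [0.0, 1.0], "sigma": 2.0}}')
restored = format_loaded_config(loaded)
```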

fit

    def fit(
        self,
        X: Union[np.ndarray, pd.DataFrame],
        y: Optional[Union[np.ndarray, pd.Series]] = None,
        progressbar: bool = True,
        random_seed: Optional[int] = None,
        **kwargs: Any,
    ) -> az.InferenceData:

Fits the model to the provided data X and y.

  1. Calls _generate_and_preprocess_model_data to prepare data and coordinates.

  2. Calls build_model to define the PyMC model structure.

  3. Calls pm.sample() using sampler_config and any additional kwargs.

  4. Stores the results in self.idata.

  5. Adds the original fitting data to idata under the fit_data group.

  6. Sets standard attributes on idata using set_idata_attrs.

  7. Sets self.is_fitted_ to True.

Returns the InferenceData object.
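
The order of these steps can be sketched with a stand-in class that records each call. Nothing here touches PyMC: FitFlowSketch and its _sample method are hypothetical, plain dictionaries replace InferenceData, and letting kwargs override sampler_config is an assumption about how the two are merged.

```python
from typing import Any, Dict, List


class FitFlowSketch:
    """Hypothetical stand-in that records the order of the steps fit performs."""

    def __init__(self) -> None:
        self.calls: List[str] = []
        self.X: Any = None
        self.y: Any = None
        self.idata: Dict[str, Any] = {}
        self.is_fitted_ = False
        self.sampler_config = {"draws": 100, "chains": 2}

    def _generate_and_preprocess_model_data(self, X: Any, y: Any) -> None:
        self.calls.append("preprocess")
        self.X, self.y = X, y

    def build_model(self, X: Any, y: Any) -> None:
        self.calls.append("build_model")

    def _sample(self, **sampler_kwargs: Any) -> Dict[str, Any]:
        # Placeholder for pm.sample(); returns a dict instead of InferenceData.
        self.calls.append("sample")
        return {"posterior": {}, "sampler_kwargs": sampler_kwargs}

    def set_idata_attrs(self, idata: Dict[str, Any]) -> Dict[str, Any]:
        self.calls.append("set_idata_attrs")
        idata["attrs"] = {"sampler_config": self.sampler_config}
        return idata

    def fit(self, X: Any, y: Any, **kwargs: Any) -> Dict[str, Any]:
        self._generate_and_preprocess_model_data(X, y)       # step 1
        self.build_model(self.X, self.y)                     # step 2
        sampler_kwargs = {**self.sampler_config, **kwargs}   # assumed merge order
        self.idata = self._sample(**sampler_kwargs)          # steps 3 and 4
        self.idata["fit_data"] = {"X": self.X, "y": self.y}  # step 5
        self.set_idata_attrs(self.idata)                     # step 6
        self.is_fitted_ = True                               # step 7
        return self.idata


model = FitFlowSketch()
result = model.fit([[1.0], [2.0]], [1.0, 2.0], draws=50)
```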

predict

    def predict(
        self,
        X_pred: Union[np.ndarray, pd.DataFrame],
        extend_idata: bool = True,
        **kwargs: Any,
    ) -> np.ndarray:

Generates point predictions for new data X_pred by calculating the mean of the posterior predictive distribution. Calls sample_posterior_predictive internally.
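
The averaging step can be illustrated without PyMC. In the sketch below, a nested list stands in for posterior predictive samples (one row per draw, one column per observation); the function name and the sample values are made up for the example.

```python
from statistics import fmean
from typing import List


def posterior_predictive_mean(samples: List[List[float]]) -> List[float]:
    """Average posterior predictive draws separately for each observation."""
    n_obs = len(samples[0])
    return [fmean(draw[i] for draw in samples) for i in range(n_obs)]


# Three hypothetical draws for two observations.
draws = [
    [1.0, 10.0],
    [2.0, 12.0],
    [3.0, 14.0],
]
point_predictions = posterior_predictive_mean(draws)  # [2.0, 12.0]
```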

sample_prior_predictive

    def sample_prior_predictive(
        self,
        X_pred: Union[np.ndarray, pd.DataFrame],
        y_pred: Optional[Union[np.ndarray, pd.Series]] = None,
        samples: Optional[int] = None,
        extend_idata: bool = False,
        combined: bool = True,
        **kwargs: Any,
    ) -> xr.DataArray:

Draws samples from the model’s prior predictive distribution given input data X_pred. Builds the model if not already built. Uses _data_setter to provide the new data to the model context.

sample_posterior_predictive

    def sample_posterior_predictive(
        self,
        X_pred: Union[np.ndarray, pd.DataFrame],
        extend_idata: bool = True,
        combined: bool = True,
        **kwargs: Any
    ) -> xr.DataArray:

Draws samples from the model’s posterior predictive distribution given new input data X_pred. Requires the model to be fitted first. Uses _data_setter to provide the new data to the model context. Returns the entire posterior predictive group from the InferenceData.

get_params

    def get_params(self, deep: bool = True) -> Dict:

Returns a dictionary containing the model’s initialization parameters (model_config, sampler_config). Compatible with scikit-learn’s API.

set_params

    def set_params(self, **params: Any) -> "ModelBuilder":

Sets the model’s initialization parameters (model_config, sampler_config). Compatible with scikit-learn’s API. Returns the instance.
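
A minimal sketch of this get_params and set_params pair follows. ParamsSketch is a hypothetical stand-in, and rejecting unknown parameter names with ValueError is an assumption borrowed from scikit-learn's own estimators rather than something documented here.

```python
from typing import Any, Dict, Optional


class ParamsSketch:
    """Hypothetical stand-in exposing scikit-learn-style parameter access."""

    def __init__(
        self,
        model_config: Optional[Dict] = None,
        sampler_config: Optional[Dict] = None,
    ) -> None:
        self.model_config = model_config or {}
        self.sampler_config = sampler_config or {}

    def get_params(self, deep: bool = True) -> Dict[str, Any]:
        # `deep` is accepted for scikit-learn compatibility; unused in this sketch.
        return {
            "model_config": self.model_config,
            "sampler_config": self.sampler_config,
        }

    def set_params(self, **params: Any) -> "ParamsSketch":
        for name, value in params.items():
            if name not in ("model_config", "sampler_config"):
                raise ValueError(f"Invalid parameter: {name!r}")
            setattr(self, name, value)
        return self  # returning the instance allows call chaining


model = ParamsSketch(sampler_config={"draws": 500})
updated = model.set_params(sampler_config={"draws": 1000, "chains": 4})
```

Because set_params returns the instance, updated and model refer to the same object.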

predict_proba

    def predict_proba(
        self,
        X_pred: Union[np.ndarray, pd.DataFrame],
        extend_idata: bool = True,
        combined: bool = False,
        **kwargs: Any,
    ) -> xr.DataArray:

Alias for predict_posterior: returns the full posterior predictive distribution samples for the output variable. The only difference in the signature is that combined defaults to False here rather than True.

predict_posterior

    def predict_posterior(
        self,
        X_pred: Union[np.ndarray, pd.DataFrame],
        extend_idata: bool = True,
        combined: bool = True,
        **kwargs: Any,
    ) -> xr.DataArray:

Generates the full posterior predictive distribution samples for the output_var given new input data X_pred. Calls sample_posterior_predictive internally and extracts the samples for the specific output_var.

Concrete Properties

id

    @property
    def id(self) -> str:

Generates a 16-character ID derived from a SHA256 hash of the sorted model_config, version, and _model_type. The ID is deterministic, so the same configuration always yields the same ID, which makes it useful for identifying specific model configurations.
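
A sketch of how such an ID could be computed is shown below. The function name model_id, the use of JSON with sorted keys to serialize the config, the truncation of the hex digest to its first 16 characters, and the example version strings are all assumptions; the actual property may serialize its inputs differently.

```python
import hashlib
import json
from typing import Any, Dict


def model_id(model_config: Dict[str, Any], version: str, model_type: str) -> str:
    """Hash the sorted config, version, and model type into a 16-character ID."""
    hasher = hashlib.sha256()
    # sort_keys makes the ID independent of dictionary insertion order.
    hasher.update(json.dumps(model_config, sort_keys=True).encode("utf-8"))
    hasher.update(version.encode("utf-8"))
    hasher.update(model_type.encode("utf-8"))
    return hasher.hexdigest()[:16]


config_a = {"alpha": {"mu": 0.0}, "beta": {"sigma": 1.0}}
config_b = {"beta": {"sigma": 1.0}, "alpha": {"mu": 0.0}}  # same content, new order
id_a = model_id(config_a, "0.1.0", "BaseMMM")
id_b = model_id(config_b, "0.1.0", "BaseMMM")
id_c = model_id(config_a, "0.2.0", "BaseMMM")  # different version
```

In this sketch id_a and id_b are identical because key order is normalized, while id_c differs because the version changed.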