Build (build.py)¶
This module provides the ModelBuilder abstract base class, which defines a standard interface for building, fitting, saving, and loading PyMC models, inspired by the scikit-learn API.
ModelBuilder¶
class ModelBuilder(ABC):
    """Abstract Base Class for building PyMC models with a scikit-learn-like API."""
An abstract base class designed to provide a consistent, scikit-learn-like interface for PyMC models. Subclasses must implement several abstract methods and properties related to model definition, data handling, and configuration.
Attributes¶
_model_type (str): Class attribute identifying the specific model type (e.g., “BaseMMM”). Should be overridden by subclasses.
version (str): Class attribute specifying the version of the model class. Should be overridden by subclasses.
X (Optional[pd.DataFrame]): Holds the feature data after preprocessing within the instance. Initialised as None.
y (Optional[Union[pd.Series, np.ndarray]]): Holds the target data after preprocessing within the instance. Initialised as None.
model (Optional[pm.Model]): The PyMC model object. Created by the build_model method.
idata (Optional[az.InferenceData]): Stores the results of the model fitting process (posterior samples, observed data, sample stats, etc.). Generated by the fit method.
is_fitted_ (bool): A flag indicating whether the fit method has been successfully called. Initialised as False.
sampler_config (Dict): Dictionary holding the configuration parameters for the PyMC sampler (e.g., draws, tune, chains).
model_config (Dict): Dictionary holding the configuration parameters for the model structure (e.g., priors, hyperparameters).
Initialization¶
def __init__(
    self,
    model_config: Optional[Dict] = None,
    sampler_config: Optional[Dict] = None,
) -> None:
Initialises the ModelBuilder. Sets up the model_config and sampler_config, using defaults defined by abstract properties if None is provided.
Parameters:
model_config (Optional[Dict], optional): Configuration for the model structure. Defaults to self.default_model_config.
sampler_config (Optional[Dict], optional): Configuration for the sampler. Defaults to self.default_sampler_config.
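The fallback to defaults can be pictured with a small, dependency-free sketch. The class names `ConfigurableBase` and `ToyModel` and the config values are hypothetical; only the "None means use the default property" pattern is taken from the description above.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class ConfigurableBase(ABC):
    """Hypothetical stand-in for ModelBuilder; it only shows config defaulting."""

    def __init__(
        self,
        model_config: Optional[Dict] = None,
        sampler_config: Optional[Dict] = None,
    ) -> None:
        # Fall back to the subclass-provided defaults when nothing is passed.
        self.model_config = (
            self.default_model_config if model_config is None else model_config
        )
        self.sampler_config = (
            self.default_sampler_config if sampler_config is None else sampler_config
        )

    @property
    @abstractmethod
    def default_model_config(self) -> Dict: ...

    @property
    @abstractmethod
    def default_sampler_config(self) -> Dict: ...


class ToyModel(ConfigurableBase):
    @property
    def default_model_config(self) -> Dict:
        # Illustrative prior settings, not taken from any real model.
        return {"intercept": {"mu": 0.0, "sigma": 1.0}}

    @property
    def default_sampler_config(self) -> Dict:
        return {"draws": 1000, "tune": 1000, "chains": 4}


default_model = ToyModel()
custom_model = ToyModel(sampler_config={"draws": 200, "tune": 200, "chains": 2})
```

Here `custom_model` overrides only the sampler settings and still picks up the default model configuration.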
Abstract Properties¶
(These must be implemented by subclasses)
output_var¶
@property
@abstractmethod
def output_var(self) -> str:
Should return the name of the target variable within the PyMC model (the observed variable in the likelihood function).
default_model_config¶
@property
@abstractmethod
def default_model_config(self) -> Dict:
Should return a dictionary defining the default structure and parameters (e.g., priors) for the specific model implementation.
default_sampler_config¶
@property
@abstractmethod
def default_sampler_config(self) -> Dict:
Should return a dictionary defining the default parameters for the PyMC sampler (e.g., draws, tune, chains).
_serializable_model_config¶
@property
@abstractmethod
def _serializable_model_config(self) -> Dict:
Should return a version of self.model_config where all values are JSON-serializable (e.g., converting tuples or NumPy arrays to lists). This is used when saving the model’s metadata.
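One way such a conversion might look is sketched below. The helper name `to_serializable` and the example configuration are assumptions made for illustration; a real subclass is free to build its serializable config differently.

```python
import json
from typing import Any


def to_serializable(value: Any) -> Any:
    """Recursively convert a config value into JSON-friendly types."""
    if isinstance(value, dict):
        return {key: to_serializable(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_serializable(item) for item in value]
    if hasattr(value, "tolist"):
        # NumPy arrays expose tolist(); duck typing avoids importing NumPy here.
        return value.tolist()
    return value


# Hypothetical model configuration containing a tuple.
model_config = {"beta": {"dims": ("channel",), "mu": 0.0, "sigma": 2.0}}
serializable = to_serializable(model_config)
encoded = json.dumps(serializable)  # succeeds because only JSON types remain
```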
Abstract Methods¶
(These must be implemented by subclasses)
_data_setter¶
@abstractmethod
def _data_setter(
    self,
    X: Union[np.ndarray, pd.DataFrame],
    y: Optional[Union[np.ndarray, pd.Series]] = None
) -> None:
Should update the data containers within the already built PyMC model (self.model) using pm.set_data(). This is essential for making predictions on new data.
_generate_and_preprocess_model_data¶
@abstractmethod
def _generate_and_preprocess_model_data(
    self,
    X: Union[np.ndarray, pd.DataFrame],
    y: Union[np.ndarray, pd.Series]
) -> None:
Should perform any necessary preprocessing on the input data X and y, store the results (e.g., in self.X, self.y), and define the model’s coordinates using self.model.add_coords() if applicable. This method is called internally by fit before build_model.
build_model¶
@abstractmethod
def build_model(
    self,
    X: Union[np.ndarray, pd.DataFrame],
    y: Union[np.ndarray, pd.Series],
    **kwargs: Any,
) -> None:
Should define the complete PyMC model structure within a pm.Model() context, using the (potentially preprocessed) data X and y, and assign the created model to self.model.
Concrete Methods¶
_validate_data¶
def _validate_data(self, X: Any, y: Optional[Any] = None) -> Union[Tuple[Any, Any], Any]:
Internal helper to validate input data using scikit-learn’s check_X_y or check_array if available, otherwise returns the data unchanged.
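The "use scikit-learn if available" behaviour can be sketched with an optional import. The function name `validate_data` is hypothetical, and the real helper may handle more cases; `check_X_y` and `check_array` are the scikit-learn utilities named above.

```python
from typing import Any, Optional

try:
    # scikit-learn is treated as an optional dependency in this sketch.
    from sklearn.utils import check_array, check_X_y
except ImportError:
    check_array = None
    check_X_y = None


def validate_data(X: Any, y: Optional[Any] = None) -> Any:
    """Validate with scikit-learn when installed, otherwise pass data through."""
    if check_X_y is None or check_array is None:
        return X if y is None else (X, y)
    if y is None:
        return check_array(X)
    return check_X_y(X, y)


X_checked, y_checked = validate_data([[1.0], [2.0]], [0.5, 1.5])
```

With scikit-learn installed the results are NumPy arrays; without it, the original lists come back unchanged.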
set_idata_attrs¶
def set_idata_attrs(self, idata: Optional[az.InferenceData] = None) -> az.InferenceData:
Adds standard metadata (model type, version, configurations, ID) to the attributes of an ArviZ InferenceData object.
save¶
def save(self, fname: str) -> None:
Saves the InferenceData object (self.idata) to a NetCDF file using idata.to_netcdf(). Requires the model to be fitted first.
load (classmethod)¶
@classmethod
def load(cls, fname: str) -> "ModelBuilder":
Loads a previously saved model instance from a NetCDF file. It reads the InferenceData, extracts configurations, reinstantiates the model class (cls), rebuilds the PyMC model structure, and sets the idata.
_model_config_formatting (classmethod)¶
@classmethod
def _model_config_formatting(cls, model_config: Dict) -> Dict:
Internal helper used by load to convert list representations in the loaded JSON configuration back into tuples (for ‘dims’) or NumPy arrays (for other lists).
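A rough sketch of this reverse conversion follows. The function name `restore_config` and its `array_factory` parameter are hypothetical: passing `numpy.asarray` would give the NumPy behaviour described above, while the default of `list` keeps the sketch free of dependencies.

```python
from typing import Any, Callable, Dict


def restore_config(
    config: Dict, array_factory: Callable[[list], Any] = list
) -> Dict:
    """Turn 'dims' lists into tuples and pass other lists to array_factory."""
    restored: Dict = {}
    for key, value in config.items():
        if isinstance(value, dict):
            # Recurse into nested sections such as per-parameter settings.
            restored[key] = restore_config(value, array_factory)
        elif isinstance(value, list):
            restored[key] = tuple(value) if key == "dims" else array_factory(value)
        else:
            restored[key] = value
    return restored


# Hypothetical configuration as it might look after being read back from JSON.
loaded = {"beta": {"dims": ["channel"], "mu": [0.0, 1.0], "sigma": 2.0}}
restored = restore_config(loaded)
```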
fit¶
def fit(
    self,
    X: Union[np.ndarray, pd.DataFrame],
    y: Optional[Union[np.ndarray, pd.Series]] = None,
    progressbar: bool = True,
    random_seed: Optional[int] = None,
    **kwargs: Any,
) -> az.InferenceData:
Fits the model to the provided data X and y.
1. Calls _generate_and_preprocess_model_data to prepare data and coordinates.
2. Calls build_model to define the PyMC model structure.
3. Calls pm.sample() using sampler_config and any additional kwargs.
4. Stores the results in self.idata.
5. Adds the original fitting data to idata under the fit_data group.
6. Sets standard attributes on idata using set_idata_attrs.
7. Sets self.is_fitted_ to True.
Returns the InferenceData object.
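The order of operations in fit can be illustrated with a toy class that records each step. Everything here is a stand-in: `ToyFitFlow`, its `_sample` placeholder, and the plain dictionary used in place of an InferenceData object are assumptions, and no actual sampling takes place.

```python
from typing import Any, Dict, List


class ToyFitFlow:
    """Hypothetical stand-in that records the order of the steps in fit."""

    def __init__(self) -> None:
        self.steps: List[str] = []
        self.idata: Dict[str, Any] = {}
        self.is_fitted_ = False

    def _generate_and_preprocess_model_data(self, X: Any, y: Any) -> None:
        self.steps.append("preprocess")
        self.X, self.y = X, y

    def build_model(self, X: Any, y: Any) -> None:
        self.steps.append("build_model")

    def _sample(self, **sampler_kwargs: Any) -> Dict[str, Any]:
        # Placeholder for pm.sample(); returns a dict instead of InferenceData.
        self.steps.append("sample")
        return {"posterior": {}, "sampler_kwargs": sampler_kwargs}

    def fit(self, X: Any, y: Any, **kwargs: Any) -> Dict[str, Any]:
        self._generate_and_preprocess_model_data(X, y)
        self.build_model(self.X, self.y)
        self.idata = self._sample(**kwargs)
        self.idata["fit_data"] = {"X": X, "y": y}  # keep the original fitting data
        self.idata["attrs"] = {"model_type": "ToyFitFlow"}  # stands in for set_idata_attrs
        self.is_fitted_ = True
        return self.idata


flow = ToyFitFlow()
result = flow.fit([[1.0], [2.0]], [1.0, 2.0], draws=10)
```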
predict¶
def predict(
    self,
    X_pred: Union[np.ndarray, pd.DataFrame],
    extend_idata: bool = True,
    **kwargs: Any,
) -> np.ndarray:
Generates point predictions for new data X_pred by calculating the mean of the posterior predictive distribution. Calls sample_posterior_predictive internally.
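Taking the mean of the posterior predictive distribution amounts to averaging over every chain and draw for each observation. The sketch below shows that reduction on made-up numbers stored in nested lists; the real method works on the sampled output instead, and the name `point_predictions` is hypothetical.

```python
from statistics import fmean
from typing import List

# Hypothetical posterior predictive draws, indexed as [chain][draw][observation].
posterior_predictive: List[List[List[float]]] = [
    [[1.0, 10.0], [3.0, 14.0]],  # chain 0, two draws
    [[2.0, 12.0], [2.0, 12.0]],  # chain 1, two draws
]


def point_predictions(draws: List[List[List[float]]]) -> List[float]:
    """Average over every chain and draw, leaving one value per observation."""
    n_obs = len(draws[0][0])
    return [
        fmean(draw[i] for chain in draws for draw in chain)
        for i in range(n_obs)
    ]


predictions = point_predictions(posterior_predictive)
```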
sample_prior_predictive¶
def sample_prior_predictive(
    self,
    X_pred: Union[np.ndarray, pd.DataFrame],
    y_pred: Optional[Union[np.ndarray, pd.Series]] = None,
    samples: Optional[int] = None,
    extend_idata: bool = False,
    combined: bool = True,
    **kwargs: Any,
) -> xr.DataArray:
Draws samples from the model’s prior predictive distribution given input data X_pred. Builds the model if not already built. Uses _data_setter to provide the new data to the model context.
sample_posterior_predictive¶
def sample_posterior_predictive(
    self,
    X_pred: Union[np.ndarray, pd.DataFrame],
    extend_idata: bool = True,
    combined: bool = True,
    **kwargs: Any
) -> xr.DataArray:
Draws samples from the model’s posterior predictive distribution given new input data X_pred. Requires the model to be fitted first. Uses _data_setter to provide the new data to the model context. Returns the entire posterior predictive group from the InferenceData.
get_params¶
def get_params(self, deep: bool = True) -> Dict:
Returns a dictionary containing the model’s initialization parameters (model_config, sampler_config). Compatible with scikit-learn’s API.
set_params¶
def set_params(self, **params: Any) -> "ModelBuilder":
Sets the model’s initialization parameters (model_config, sampler_config). Compatible with scikit-learn’s API. Returns the instance.
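The pair of methods follows the usual scikit-learn convention, which can be sketched as below. `ParamsSketch` is a hypothetical class, and rejecting unknown parameter names with a `ValueError` is an assumption of this sketch rather than documented behaviour.

```python
from typing import Any, Dict, Optional


class ParamsSketch:
    """Hypothetical class showing scikit-learn-style parameter access."""

    def __init__(
        self,
        model_config: Optional[Dict] = None,
        sampler_config: Optional[Dict] = None,
    ) -> None:
        self.model_config = model_config or {}
        self.sampler_config = sampler_config or {}

    def get_params(self, deep: bool = True) -> Dict[str, Any]:
        # `deep` is accepted for API compatibility; this sketch ignores it.
        return {
            "model_config": self.model_config,
            "sampler_config": self.sampler_config,
        }

    def set_params(self, **params: Any) -> "ParamsSketch":
        for name, value in params.items():
            if name not in ("model_config", "sampler_config"):
                # Assumed behaviour: refuse names that are not init parameters.
                raise ValueError(f"Unknown parameter: {name}")
            setattr(self, name, value)
        return self  # returning the instance allows chaining


estimator = ParamsSketch().set_params(sampler_config={"draws": 500})
clone = ParamsSketch(**estimator.get_params())
```

Because get_params returns exactly the constructor arguments, its output can be unpacked to create an equivalent instance, which is what scikit-learn utilities rely on.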
predict_proba¶
def predict_proba(
    self,
    X_pred: Union[np.ndarray, pd.DataFrame],
    extend_idata: bool = True,
    combined: bool = False,
    **kwargs: Any,
) -> xr.DataArray:
Alias for predict_posterior. Returns the full posterior predictive distribution samples for the output variable. Unlike predict_posterior, combined defaults to False here.
predict_posterior¶
def predict_posterior(
    self,
    X_pred: Union[np.ndarray, pd.DataFrame],
    extend_idata: bool = True,
    combined: bool = True,
    **kwargs: Any,
) -> xr.DataArray:
Generates the full posterior predictive distribution samples for the output_var given new input data X_pred. Calls sample_posterior_predictive internally and extracts the samples for the specific output_var.
Properties (Concrete)¶
id¶
@property
def id(self) -> str:
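Generates a 16-character ID derived from a SHA256 hash of the sorted model_config, version, and _model_type. Two instances with the same configuration, version, and model type produce the same ID, which makes it useful for identifying specific model configurations.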
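A minimal sketch of such an ID is shown below. The function name `model_id`, the use of `json.dumps(..., sort_keys=True)` to encode the sorted configuration, and truncating the hex digest to its first 16 characters are all assumptions; the actual property may encode its inputs differently.

```python
import hashlib
import json
from typing import Dict


def model_id(model_config: Dict, version: str, model_type: str) -> str:
    """Hash the configuration, version, and model type into a short ID."""
    hasher = hashlib.sha256()
    # sort_keys makes the hash independent of insertion order (assumed encoding).
    hasher.update(json.dumps(model_config, sort_keys=True).encode())
    hasher.update(version.encode())
    hasher.update(model_type.encode())
    return hasher.hexdigest()[:16]  # keep the first 16 hex characters


id_a = model_id({"a": 1, "b": 2}, "0.1.0", "BaseMMM")
id_b = model_id({"b": 2, "a": 1}, "0.1.0", "BaseMMM")  # same config, other order
id_c = model_id({"a": 1, "b": 2}, "0.2.0", "BaseMMM")  # different version
```

In this sketch `id_a` and `id_b` match because the keys are sorted before hashing, while the version change gives `id_c` a different value.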