# Data To Fit (`data_to_fit.py`)

This module defines the `DataToFit` class, which represents the final data structure prepared and scaled for input into the ABACUS MMM fitting process. It also includes utility functions related to data serialization.

## Functions

### `switch_to_msgpack_numpy`

```python
def switch_to_msgpack_numpy() -> None:
```

Configures the `msgpack` library to use `msgpack_numpy` for serialization, enabling efficient serialization of NumPy arrays. This is typically called before saving or loading `DataToFit` objects using `msgpack`.

## `DataToFit`

This class encapsulates the input data after it has been processed, split into training and testing sets, and scaled. This scaled data is directly used by the model fitting procedures.

```python
class DataToFit:
    """
    Represents input data transformed to be suitable for fitting a model.

    The data undergoes the following transformations:
    - Split into train and test data sets
    - Scaled to smaller values for better accuracy from the Bayesian model
    """
```

### Attributes

- `date_strs` (`Union[List[str], np.ndarray, jnp.ndarray]`): List or array of date strings corresponding to the observations.
- `time_granularity` (`str`): Time granularity (e.g., `cnst.GRANULARITY_DAILY`, `cnst.GRANULARITY_WEEKLY`).
- `has_test_dataset` (`bool`): Flag indicating whether a train/test split was performed (i.e., if test data exists).
- `media_data_train_scaled` (`jnp.ndarray`): Scaled media data for the training set `[time, channel]`.
- `media_data_test_scaled` (`jnp.ndarray`): Scaled media data for the test set `[time, channel]`. Empty if `has_test_dataset` is `False`.
- `media_scaler` (`SerializableScaler`): The scaler object (from `abacus.prepro.scaler`) used to scale/unscale `media_data`.
- `media_costs_scaled` (`jnp.ndarray`): Scaled total media costs `[channel]`. Scaled using `media_costs_scaler`.
- `media_cost_priors_scaled` (`jnp.ndarray`): Scaled media cost priors `[channel]`. Scaled using `media_costs_scaler`.
- `learned_media_priors` (`jnp.ndarray`): Learned media priors `[channel]` (unscaled).
- `media_costs_by_row_train_scaled` (`jnp.ndarray`): Scaled media costs per observation for the training set `[time, channel]`. Scaled using `media_costs_scaler`.
- `media_costs_by_row_test_scaled` (`jnp.ndarray`): Scaled media costs per observation for the test set `[time, channel]`. Empty if `has_test_dataset` is `False`. Scaled using `media_costs_scaler`.
- `media_costs_scaler` (`SerializableScaler`): The scaler object used to scale/unscale `media_costs`, `media_cost_priors`, and `media_costs_by_row`. Fitted on the original `media_cost_priors`.
- `media_names` (`List[str]`): List of media channel names.
- `extra_features_train_scaled` (`jnp.ndarray`): Scaled extra features data for the training set `[time, feature]`.
- `extra_features_test_scaled` (`jnp.ndarray`): Scaled extra features data for the test set `[time, feature]`. Empty if `has_test_dataset` is `False`.
- `extra_features_scaler` (`SerializableScaler`): The scaler object used to scale/unscale `extra_features_data`.
- `extra_features_names` (`List[str]`): List of extra feature names.
- `target_train_scaled` (`jnp.ndarray`): Scaled target variable data for the training set `[time]`.
- `target_test_scaled` (`jnp.ndarray`): Scaled target variable data for the test set `[time]`. Empty if `has_test_dataset` is `False`.
- `target_is_log_scale` (`bool`): Flag indicating if the original target data was log-scaled *before* scaling by `target_scaler`.
- `target_scaler` (`SerializableScaler`): The scaler object used to scale/unscale `target_data`.
- `target_name` (`str`): Name of the target variable.

### Static Methods

#### `from_input_data`

```python
@staticmethod
def from_input_data(
    input_data: InputData, config: Dict[str, Any]
) -> "DataToFit":
```

Factory method to create a `DataToFit` instance from a raw `InputData` object and a configuration dictionary.
It performs the train/test split based on `config["train_test_ratio"]` and fits scalers (`SerializableScaler`) to the full dataset before transforming the train and test splits.

**Parameters:**

- `input_data` (`InputData`): The raw input data object.
- `config` (`Dict[str, Any]`): Configuration dictionary, primarily used to get the `train_test_ratio`.

**Returns:**

- `DataToFit`: A new instance containing the split and scaled data.

#### `from_dict`

```python
@staticmethod
def from_dict(input_dict: Dict[str, Any]) -> "DataToFit":
```

Factory method to recreate a `DataToFit` object from a dictionary representation (likely produced by `to_dict`). It reconstructs the scaler objects from their dictionary form.

**Parameters:**

- `input_dict` (`Dict[str, Any]`): The dictionary containing the data, display info, scalers, and config.

**Returns:**

- `DataToFit`: The reconstructed `DataToFit` object.

#### `from_file`

```python
@staticmethod
def from_file(input_file: str) -> "DataToFit":
```

Factory method to load a `DataToFit` object from a file, typically a `.gz` file saved using `msgpack` and `gzip` via the `dump` method.

**Parameters:**

- `input_file` (`str`): Path to the input file (e.g., `data_to_fit.gz`).

**Returns:**

- `DataToFit`: The loaded `DataToFit` object.
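The fit-then-transform order in `from_input_data` matters: scalers are fitted on the full series before either split is transformed, so train and test data share one scale. A minimal NumPy sketch of that behaviour (the helper name and the mean-based scale factor are illustrative stand-ins, not the module's actual `SerializableScaler` logic):

```python
import numpy as np


def split_and_scale(values: np.ndarray, train_test_ratio: float):
    # Hypothetical helper sketching from_input_data's split-then-scale step.
    # The scale factor is fitted on the FULL series (here: its mean, standing
    # in for SerializableScaler), then applied to both splits identically.
    n_train = int(len(values) * train_test_ratio)
    scale = values.mean()
    scaled = values / scale
    return scaled[:n_train], scaled[n_train:], scale


train, test, scale = split_and_scale(np.arange(1.0, 11.0), train_test_ratio=0.8)
# 10 observations with an 0.8 ratio -> 8 train rows, 2 test rows;
# the rescaled full series has mean 1.0.
```

Because the scale is fitted before splitting, unscaling test-set predictions with the same scaler recovers values comparable to the training data.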
### Instance Methods

#### `__init__`

```python
def __init__(
    self,
    date_strs: Union[List[str], np.ndarray, jnp.ndarray],
    time_granularity: str,
    has_test_dataset: bool,
    media_data_train_scaled: jnp.ndarray,
    media_data_test_scaled: jnp.ndarray,
    media_scaler: SerializableScaler,
    media_costs_scaled: jnp.ndarray,
    media_cost_priors_scaled: jnp.ndarray,
    learned_media_priors: jnp.ndarray,
    media_costs_by_row_train_scaled: jnp.ndarray,
    media_costs_by_row_test_scaled: jnp.ndarray,
    media_costs_scaler: SerializableScaler,
    media_names: List[str],
    extra_features_train_scaled: jnp.ndarray,
    extra_features_test_scaled: jnp.ndarray,
    extra_features_scaler: SerializableScaler,
    extra_features_names: List[str],
    target_train_scaled: jnp.ndarray,
    target_test_scaled: jnp.ndarray,
    target_is_log_scale: bool,
    target_scaler: SerializableScaler,
    target_name: str,
) -> None:
```

Initialises a `DataToFit` object directly with pre-split, pre-scaled data and fitted scaler objects. Typically used internally by the factory methods (`from_input_data`, `from_dict`, `from_file`).

**(Parameters match the Attributes described above)**

#### `to_dict`

```python
def to_dict(self) -> Dict[str, Any]:
```

Converts the `DataToFit` object into a dictionary representation suitable for serialization (e.g., with `msgpack`). JAX arrays are converted to standard NumPy arrays, and scaler objects are converted using their `to_dict` methods.

**Returns:**

- `Dict[str, Any]`: A dictionary containing nested dictionaries for `data`, `display`, `scalers`, and `config`.

#### `dump`

```python
def dump(self, results_dir: Union[str, Path]) -> None:
```

Serializes the `DataToFit` object using `msgpack` (with `msgpack_numpy` enabled) and saves it to a compressed gzip file named `data_to_fit.gz` within the specified directory.

**Parameters:**

- `results_dir` (`Union[str, Path]`): The directory where `data_to_fit.gz` will be saved.
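`dump` and `from_file` form a serialize/compress round trip. A dependency-free sketch of that round trip (the helper names are hypothetical, and `pickle` stands in for `msgpack` + `msgpack_numpy` purely to keep the sketch self-contained; the real methods use msgpack after `switch_to_msgpack_numpy` has been called):

```python
import gzip
import pickle
from pathlib import Path

import numpy as np


def dump_blob(obj, results_dir) -> Path:
    # Mirrors DataToFit.dump: serialize, gzip-compress, and write the result
    # as data_to_fit.gz inside the given results directory.
    path = Path(results_dir) / "data_to_fit.gz"
    with gzip.open(path, "wb") as f:
        f.write(pickle.dumps(obj))
    return path


def load_blob(input_file):
    # Mirrors DataToFit.from_file: decompress the file, then deserialize.
    with gzip.open(input_file, "rb") as f:
        return pickle.loads(f.read())
```

Round-tripping a dictionary of NumPy arrays through these two helpers returns an equal dictionary, which is the behaviour `from_file(dump(...))` relies on.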
#### `to_data_frame`

```python
def to_data_frame(self, unscaled: bool = False) -> Tuple[pd.DataFrame, pd.DataFrame]:
```

Converts the scaled data back into pandas DataFrames for easier viewing and analysis. Can optionally inverse-transform the data back to its original scale.

**Parameters:**

- `unscaled` (`bool`, optional): If `True`, applies the `inverse_transform` method of the stored scalers to return data in the original scale. Defaults to `False` (returns scaled data).

**Returns:**

- `Tuple[pd.DataFrame, pd.DataFrame]`: A tuple containing:
  - `per_observation_df`: DataFrame indexed by datetime, containing time-series data (media impressions, media costs per row, extra features, target).
  - `per_channel_df`: DataFrame indexed by media channel name, containing channel-level data (total cost, cost prior, learned prior).
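The two returned frames have different index axes: dates for the per-observation frame, channel names for the per-channel frame. A small sketch of that shape with made-up inputs (all array values and column labels here are illustrative; the module's actual column names may differ):

```python
import numpy as np
import pandas as pd

# Illustrative inputs; the shapes match the attribute docs above.
dates = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"])
media_names = ["tv", "search"]
media_scaled = np.array([[1.0, 0.5], [0.8, 0.6], [1.2, 0.4]])  # [time, channel]
target_scaled = np.array([2.0, 1.9, 2.1])                      # [time]
media_costs_scaled = np.array([3.0, 1.5])                      # [channel]

# Per-observation frame: one row per date, one column per time series.
per_observation_df = pd.DataFrame(media_scaled, index=dates, columns=media_names)
per_observation_df["target"] = target_scaled

# Per-channel frame: one row per media channel.
per_channel_df = pd.DataFrame({"total_cost": media_costs_scaled}, index=media_names)
```

With `unscaled=True`, the same layout is produced after running each array through its scaler's `inverse_transform`, so the frames carry original-scale values instead.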