modnet.models.ensemble module

This submodule implements the EnsembleMODNetModel, an extension of the vanilla MODNetModel that bootstraps uncertainties from multiple MODNet models, which can be trained in parallel.

class modnet.models.ensemble.EnsembleMODNetModel(*args, n_models=100, bootstrap=True, models=None, modnet_models=None, **kwargs)

Bases: MODNetModel

Container class for n_models (bootstrapped) MODNetModels that handles setting up the architecture, activations, training and learning curve.

n_feat

The number of features used in the model.

weights

The relative loss weights for each target.

optimal_descriptors

The list of column names used in training the model.

model

The keras.models.Model of the network itself.

target_names

The list of target names that the model was trained for.

Parameters:
  • *args – See MODNetModel

  • n_models – number of inner MODNetModels. Each model has the same architecture, defined by the args and kwargs.

  • bootstrap – whether to bootstrap the samples for each inner MODNet fit.

  • models – List of user-provided MODNetModels, which allows the inner models to have different architectures. n_models is ignored in this case.

  • modnet_models – Deprecated. Same argument as models. For backward compatibility only.

  • **kwargs – See MODNetModel
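As a rough illustration of the bootstrap idea behind n_models and bootstrap, the sketch below fits several small models on resamples drawn with replacement and reads the spread of their predictions as an uncertainty. It is not MODNet code: fit_line, fit_bootstrap_ensemble and predict_with_uncertainty are hypothetical helpers, a least-squares line stands in for each inner MODNetModel, and the use of mean and standard deviation as the aggregate is an assumption.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (stand-in for one inner model)."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:  # degenerate resample: fall back to a constant predictor
        return 0.0, mean_y
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


def fit_bootstrap_ensemble(xs, ys, n_models=100, bootstrap=True, seed=0):
    """Fit n_models lines; with bootstrap=True each sees its own resample."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        if bootstrap:
            # Draw n sample indices with replacement.
            idx = [rng.randrange(n) for _ in range(n)]
        else:
            idx = list(range(n))
        models.append(fit_line([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def predict_with_uncertainty(models, x):
    """Return (mean, standard deviation) of the inner models' predictions."""
    preds = [a * x + b for a, b in models]
    return statistics.fmean(preds), statistics.pstdev(preds)


# Synthetic data: y = 2x + 1 plus Gaussian noise.
data_rng = random.Random(42)
xs = [i / 10 for i in range(50)]
ys = [2.0 * x + 1.0 + data_rng.gauss(0, 0.3) for x in xs]

models = fit_bootstrap_ensemble(xs, ys, n_models=20)
mean, unc = predict_with_uncertainty(models, 2.5)
```

With this deterministic stand-in, bootstrap=False makes every inner model identical and the spread collapses to zero. Neural networks would typically still differ through random initialization, so that observation does not carry over directly.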

can_return_uncertainty = True
fit(training_data, n_jobs=1, **kwargs)

Train the model on the passed training MODData object.

Parameters match those of MODNetModel.fit.

Parameters:

training_data (MODData) –

Return type:

None

predict(test_data, return_unc=False, return_prob=False)

Predict the target values for the passed MODData.

Parameters:
  • test_data (MODData) – A featurized and feature-selected MODData object containing the descriptors used in training.

  • return_prob – For a classification task only: whether to return the probability of each class OR only return the most probable class.

  • return_unc – whether to return a second dataframe containing the uncertainties

Returns:

A pandas.DataFrame containing the predicted values of the targets. If return_unc is True, a second dataframe containing the uncertainties is also returned.

Return type:

pandas.DataFrame
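The return_unc and return_prob options can be pictured with a small aggregation sketch over per-model outputs. This is an assumption about how an ensemble might combine its members (mean and standard deviation for regression, averaged class probabilities for classification), not a description of MODNet internals. aggregate_regression and aggregate_classification are hypothetical names, and plain lists are used in place of dataframes.

```python
import statistics


def aggregate_regression(member_preds, return_unc=False):
    """member_preds[m][i] is model m's prediction for sample i."""
    per_sample = list(zip(*member_preds))
    mean = [statistics.fmean(p) for p in per_sample]
    if not return_unc:
        return mean
    # Spread across ensemble members, used here as the uncertainty.
    unc = [statistics.pstdev(p) for p in per_sample]
    return mean, unc


def aggregate_classification(member_probs, return_prob=False):
    """member_probs[m][i][c] is model m's probability of class c for sample i."""
    n_models = len(member_probs)
    n_samples = len(member_probs[0])
    averaged = []
    for i in range(n_samples):
        n_classes = len(member_probs[0][i])
        averaged.append([
            sum(member_probs[m][i][c] for m in range(n_models)) / n_models
            for c in range(n_classes)
        ])
    if return_prob:
        return averaged
    # Otherwise return only the index of the most probable class.
    return [max(range(len(p)), key=p.__getitem__) for p in averaged]


# Two models, two samples (regression).
reg_mean, reg_unc = aggregate_regression([[1.0, 2.0], [3.0, 2.0]], return_unc=True)

# Two models, one sample, two classes (classification).
probs = [[[0.9, 0.1]], [[0.3, 0.7]]]
labels = aggregate_classification(probs)
avg_probs = aggregate_classification(probs, return_prob=True)
```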

evaluate(test_data)

Evaluates the target values for the passed MODData by returning the corresponding loss.

Parameters:

test_data (MODData) – A featurized and feature-selected MODData object containing the descriptors used in training.

Returns:

Loss score

Return type:

pandas.DataFrame

fit_preset(data, presets=None, val_fraction=0.15, verbose=0, classification=False, refit=False, fast=False, nested=5, callbacks=None, n_jobs=1)

Chooses the MODNet model with the best-performing hyperparameters from a set of presets.

This function implements the “inner loop” of a cross-validation workflow. By modifying the nested argument, it can be run in full nested mode (i.e. train n_fold * n_preset models) or just with a simple random hold-out set.

The data is first fitted with several well-performing MODNet presets, each evaluated on a validation set (a val_fraction share of the provided data, 0.15 by default, when a simple hold-out split is used).

Sets the self.models attribute to the model with the lowest mean validation loss across all folds.

Note: Inner models (presets) are 5-model bootstraps. The final (refit) model will be a self.n_models bootstrap.
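The selection rule described above (lowest mean validation loss across all folds) can be sketched in a few lines. select_best_preset is a hypothetical helper and the loss values are made up for illustration. The real method also trains the models and records learning curves, which this sketch leaves out.

```python
import statistics


def select_best_preset(val_losses):
    """val_losses[p][f] is the validation loss of preset p on fold f.

    Returns the index of the preset with the lowest mean loss,
    together with the per-preset mean losses.
    """
    mean_losses = [statistics.fmean(fold_losses) for fold_losses in val_losses]
    best = min(range(len(mean_losses)), key=mean_losses.__getitem__)
    return best, mean_losses


# Three presets evaluated on three folds (illustrative numbers only).
losses = [
    [0.30, 0.34, 0.32],
    [0.25, 0.45, 0.20],
    [0.40, 0.41, 0.39],
]
best_index, mean_losses = select_best_preset(losses)
```

Averaging over folds means a preset can win overall even when it is the worst on one fold, as the second preset is here.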

Parameters:
  • data (MODData) – MODData object containing training and validation samples.

  • presets (List[Dict[str, Any]]) – A list of dictionaries containing custom presets.

  • verbose (int) – The verbosity level to pass to tf.keras

  • val_fraction (float) – The fraction of the data to use for validation.

  • classification (bool) – Whether or not we are performing classification.

  • refit (bool) – Whether or not to refit the final model for each fold with the best-performing settings.

  • fast (bool) – Used for debugging. If True, only fit the first 2 presets, use 1-model ensembles and reduce the number of epochs.

  • nested (int) – integer specifying whether or not to perform a full nested CV. If 0, a simple validation split is performed based on the val_fraction argument. If a positive integer, this number of inner CV folds is used and the val_fraction argument is ignored. Note: If set to 1, the value will be overwritten with the default of 5 folds.

  • n_jobs (int) – number of concurrent processes to use when multiprocessing

  • callbacks (List[Any]) –
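The interaction between nested and val_fraction can be illustrated with a small index-splitting sketch. make_splits is a hypothetical helper that only mirrors the rules stated in the parameter descriptions (0 gives a single hold-out split, 1 is replaced by 5 folds, any other value gives that many folds). How MODNet actually shuffles or partitions the data is not covered by this page and is assumed here.

```python
import random


def make_splits(n_samples, nested=5, val_fraction=0.15, seed=0):
    """Return a list of (train_indices, val_indices) pairs."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)

    if nested == 0:
        # Simple hold-out: a single split sized by val_fraction.
        n_val = max(1, round(val_fraction * n_samples))
        return [(indices[n_val:], indices[:n_val])]

    # A value of 1 is overwritten with the default of 5 folds;
    # val_fraction is ignored in this branch.
    n_folds = 5 if nested == 1 else nested
    folds = [indices[k::n_folds] for k in range(n_folds)]
    splits = []
    for k in range(n_folds):
        val = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        splits.append((train, val))
    return splits


holdout = make_splits(100, nested=0)
five_fold = make_splits(100, nested=1)
three_fold = make_splits(100, nested=3)
```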

Returns:

  • A list of length num_outer_folds containing lists of MODNet models of length num_inner_folds.

  • A list of validation losses achieved by the best model for each fold during validation (excluding refit).

  • The learning curve of the final (refitted) model (or None if refit is False)

  • A nested list of learning curves for each trained model, with dimensions (num_outer_folds, num_inner_folds).

  • The settings of the best-performing preset.

Return type:

Tuple[List[List[Any]], numpy.ndarray, Optional[List[float]], List[List[float]], Dict[str, Any]]