modnet.models.ensemble module
This submodule implements the EnsembleMODNetModel, an extension of the vanilla model that bootstraps uncertainties from multiple MODNet models, trained in parallel.
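As a rough illustration of the bootstrapping idea described above, the following sketch uses only the Python standard library and does not call MODNet. The toy `fit_mean_model`, the helper names, and the use of the ensemble mean and standard deviation as prediction and uncertainty are assumptions made for this example, not the library's implementation.

```python
import random
import statistics


def fit_mean_model(samples):
    """Toy 'model' (hypothetical): predicts the mean of its training targets."""
    return statistics.fmean(samples)


def bootstrap_ensemble(targets, n_models=100, bootstrap=True, random_state=None):
    """Fit n_models toy models, each on a resample drawn with replacement."""
    rng = random.Random(random_state)
    models = []
    for _ in range(n_models):
        if bootstrap:
            # Bootstrap: draw len(targets) samples with replacement.
            sample = rng.choices(targets, k=len(targets))
        else:
            # Without bootstrapping every model sees the same data.
            sample = list(targets)
        models.append(fit_mean_model(sample))
    return models


def predict_with_uncertainty(models):
    """Assumed aggregation: mean as prediction, spread as uncertainty."""
    return statistics.fmean(models), statistics.pstdev(models)


targets = [1.0, 2.0, 3.0, 4.0, 5.0]
models = bootstrap_ensemble(targets, n_models=50, random_state=0)
prediction, uncertainty = predict_with_uncertainty(models)
```

In this toy setting, disabling `bootstrap` makes every inner model identical, so the spread collapses to zero. A real neural-network ensemble would still differ through random weight initialisation.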
- class modnet.models.ensemble.EnsembleMODNetModel(*args, n_models=100, bootstrap=True, models=None, modnet_models=None, random_state=None, **kwargs)
Bases: MODNetModel
Container class for n_models (bootstrapped) MODNetModels that handles setting up the architecture, activations, training and learning curve.
- n_feat
The number of features used in the model.
- weights
The relative loss weights for each target.
- optimal_descriptors
The list of column names used in training the model.
- model
The keras.models.Model of the network itself.
- target_names
The list of target names that the model was trained for.
- Parameters:
*args – See MODNetModel
n_models – number of inner MODNetModels; each model has the same architecture, defined by the args and kwargs.
bootstrap – whether to bootstrap the samples for each inner MODNet fit.
models – list of user-provided MODNetModels, which allows the inner models to have different architectures. n_models is ignored in this case.
random_state (Optional[int]) – fix a random state for use with this model.
modnet_models – Deprecated. Same argument as models; kept for backward compatibility only.
**kwargs – See MODNetModel
- can_return_uncertainty = True
- fit(training_data, n_jobs=1, **kwargs)
Train the model on the passed training MODData object. Parameters match those of MODNetModel.fit.
- Parameters:
training_data (MODData) –
- Return type:
None
- predict(test_data, return_unc=False, return_prob=False, remap_out_of_bounds=True)
Predict the target values for the passed MODData.
- Parameters:
test_data (MODData) – A featurized and feature-selected MODData object containing the descriptors used in training.
return_prob (bool) – For a classification task only: whether to return the probability of each class OR only return the most probable class.
return_unc (bool) – whether to return a second dataframe containing the uncertainties
remap_out_of_bounds (bool) – whether to remap out-of-bounds values to the nearest bound.
- Returns:
A pandas.DataFrame containing the predicted values of the targets.
- Return type:
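To illustrate what remapping out-of-bounds values to the nearest bound can look like, here is a minimal standard-library sketch. It assumes the bounds are the minimum and maximum of the training targets; that choice, the function below, and the sample numbers are hypothetical and are not taken from the MODNet source.

```python
def remap_out_of_bounds(predictions, lower, upper):
    """Clamp each prediction to the nearest bound of [lower, upper]."""
    return [min(max(p, lower), upper) for p in predictions]


train_targets = [0.2, 1.5, 3.1]   # hypothetical training targets
lower, upper = min(train_targets), max(train_targets)  # assumed bounds
raw = [-0.4, 1.0, 5.0]            # hypothetical raw predictions
remapped = remap_out_of_bounds(raw, lower, upper)
```

Values already inside the interval pass through unchanged; only the first and last entries of `raw` are moved to a bound.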
- evaluate(test_data)
Evaluates the target values for the passed MODData by returning the corresponding loss.
- fit_preset(data, presets=None, val_fraction=0.15, verbose=0, classification=False, refit=False, fast=False, nested=5, callbacks=None, n_jobs=1)
Chooses the MODNet model with the best hyperparameters from a set of presets.
This function implements the “inner loop” of a cross-validation workflow. By modifying the nested argument, it can be run in full nested mode (i.e. train n_fold * n_preset models) or just with a simple random hold-out set.
The data is first fitted on several well-performing MODNet presets with a validation set (15% of the supplied data by default, set by val_fraction).
Sets the self.models attribute to the model with the lowest mean validation loss across all folds.
Note: Inner models (presets) are 5-model bootstraps. The final (refit) model will be a self.n_models bootstrap.
- Parameters:
data (MODData) – MODData object containing training and validation samples.
presets (List[Dict[str, Any]]) – A list of dictionaries containing custom presets.
verbose (int) – The verbosity level to pass to tf.keras
val_fraction (float) – The fraction of the data to use for validation.
classification (bool) – Whether or not we are performing classification.
refit (bool) – Whether or not to refit the final model for each fold with the best-performing settings.
fast (bool) – Used for debugging. If True, only fit the first 2 presets, use 1-model ensembles and reduce the number of epochs.
nested (int) – integer specifying whether or not to perform a full nested CV. If 0, a simple validation split is performed based on the val_fraction argument. If an integer, use this number of inner CV folds, ignoring the val_fraction argument. Note: If set to 1, the value will be overwritten to a default of 5 folds.
n_jobs (int) – number of concurrent processes to use when multiprocessing
- Returns:
A list of length num_outer_folds containing lists of MODNet models of length num_inner_folds.
A list of validation losses achieved by the best model for each fold during validation (excluding refit).
The learning curve of the final (refitted) model (or None if refit is False).
A nested list of learning curves for each trained model of lengths (num_outer_folds, num_inner_folds).
The settings of the best-performing preset.
- Return type:
Tuple[List[List[Any]], numpy.ndarray, Optional[List[float]], List[List[float]], Dict[str, Any]]
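The selection logic described for fit_preset (evaluate each preset on either inner CV folds or a single hold-out split, then keep the preset with the lowest mean validation loss) can be sketched with the standard library alone. Everything here is an assumption for illustration: the toy `validation_loss`, the `scale` preset key, and the helper names are hypothetical, and no MODNet code is called.

```python
import random
import statistics


def validation_loss(preset, train, val):
    """Toy stand-in for training: predict scale * mean(train), score by MAE."""
    guess = preset["scale"] * statistics.fmean(train)
    return statistics.fmean([abs(y - guess) for y in val])


def make_splits(targets, nested, val_fraction, seed):
    """Return (train, val) pairs: k-fold if nested > 0, else one hold-out."""
    data = list(targets)
    random.Random(seed).shuffle(data)
    if nested == 1:
        nested = 5  # the documentation states that 1 is overwritten to 5 folds
    if nested:
        folds = [data[i::nested] for i in range(nested)]
        return [
            ([y for j, f in enumerate(folds) if j != k for y in f], folds[k])
            for k in range(nested)
        ]
    # nested == 0: a single random hold-out split sized by val_fraction.
    n_val = max(1, int(val_fraction * len(data)))
    return [(data[n_val:], data[:n_val])]


def select_preset(targets, presets, nested=5, val_fraction=0.15, seed=0):
    """Pick the preset with the lowest mean validation loss over all splits."""
    splits = make_splits(targets, nested, val_fraction, seed)
    mean_losses = [
        statistics.fmean([validation_loss(p, tr, va) for tr, va in splits])
        for p in presets
    ]
    best_index = min(range(len(presets)), key=lambda i: mean_losses[i])
    return presets[best_index], mean_losses


targets = [10.0 + 0.1 * i for i in range(20)]
presets = [{"scale": 0.5}, {"scale": 1.0}, {"scale": 1.5}]  # hypothetical
best_cv, losses_cv = select_preset(targets, presets, nested=5)
best_holdout, losses_holdout = select_preset(targets, presets, nested=0)
```

With `nested=5` each preset is scored on five folds, so `val_fraction` plays no role; with `nested=0` a single split of that size is used instead.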