modnet.models.ensemble module¶

This submodule implements the EnsembleMODNetModel, an extension of the vanilla model that bootstraps uncertainties from multiple MODNet models, trained in parallel.

class modnet.models.ensemble.EnsembleMODNetModel(*args, n_models=100, bootstrap=True, models=None, modnet_models=None, random_state=None, **kwargs)¶

Bases: MODNetModel

Container class for n_models (bootstrapped) MODNetModels that handles setting up the architecture, activations, training and learning curve.

n_feat¶: The number of features used in the model.

weights¶: The relative loss weights for each target.

optimal_descriptors¶: The list of column names used in training the model.

model¶: The keras.models.Model of the network itself.

target_names¶: The list of target names that the model was trained for.

Parameters:

*args – See MODNetModel
n_models – number of inner MODNetModels; each model has the same architecture, defined by the args and kwargs.
bootstrap – whether to bootstrap the samples for each inner MODNet fit.
models – List of user-provided MODNetModels. Allows the inner models to have different architectures. n_models is ignored in this case.
random_state (Optional[int]) – fix a random state for use with this model.
modnet_models – Deprecated. Same argument as models. For backward compatibility only.
**kwargs – See MODNetModel
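
The role of n_models, bootstrap and random_state can be pictured with a minimal sketch in plain Python. This is not MODNet's actual code: the helpers bootstrap_sample and fit_mean_model are hypothetical stand-ins, and using the mean and standard deviation across members as prediction and uncertainty is an assumption made for illustration.

```python
import random
import statistics


def bootstrap_sample(values, rng):
    """Draw len(values) items from values with replacement."""
    return [rng.choice(values) for _ in values]


def fit_mean_model(sample):
    """Hypothetical stand-in for an inner model: always predicts the sample mean."""
    mean = statistics.fmean(sample)
    return lambda: mean


rng = random.Random(0)  # plays the role of random_state
targets = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
n_models = 100

# Fit one stand-in model per bootstrap resample of the training targets.
members = [fit_mean_model(bootstrap_sample(targets, rng)) for _ in range(n_models)]
predictions = [member() for member in members]

ensemble_prediction = statistics.fmean(predictions)   # central estimate
ensemble_uncertainty = statistics.stdev(predictions)  # spread across members

print(f"prediction={ensemble_prediction:.2f} uncertainty={ensemble_uncertainty:.2f}")
```

Because every member sees a slightly different resample, the members disagree, and that disagreement is what an ensemble can report as uncertainty.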

can_return_uncertainty = True¶

fit(training_data, n_jobs=1, **kwargs)¶

Train the model on the passed training MODData object.

Parameters match those of MODNetModel.fit.

Parameters: training_data (MODData) – The MODData object to train on.
Return type: None

predict(test_data, return_unc=False, return_prob=False, remap_out_of_bounds=True, voting_type='soft')¶

Predict the target values for the passed MODData.

Parameters:

test_data (MODData) – A featurized and feature-selected MODData object containing the descriptors used in training.
return_prob (bool) – For a classification task only: whether to return the probability of each class OR only return the most probable class.
return_unc (bool) – whether to return a second DataFrame containing the uncertainties.
remap_out_of_bounds (bool) – whether to remap out-of-bounds values to the nearest bound.
voting_type (str) – If classification task and return_prob is False, determines if soft or hard ensemble voting is performed.

Returns:

A pandas.DataFrame containing the predicted values of the targets.

Return type:

pandas.DataFrame
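
The difference between the two voting_type options can be illustrated with a small standalone sketch. It follows the common definitions of soft and hard ensemble voting and is not taken from the MODNet source; the function names soft_vote and hard_vote and the probabilities are hypothetical.

```python
from collections import Counter

# Class probabilities from three hypothetical ensemble members for one sample;
# the two columns are classes 0 and 1.
member_probs = [
    [0.45, 0.55],
    [0.45, 0.55],
    [0.90, 0.10],
]


def soft_vote(probs):
    """Average the members' probabilities, then pick the most probable class."""
    n_classes = len(probs[0])
    mean_probs = [sum(p[c] for p in probs) / len(probs) for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: mean_probs[c])


def hard_vote(probs):
    """Let each member vote for its own most probable class, then take the majority."""
    votes = [max(range(len(p)), key=lambda c: p[c]) for p in probs]
    return Counter(votes).most_common(1)[0][0]


print(soft_vote(member_probs))  # 0: the mean probabilities are [0.60, 0.40]
print(hard_vote(member_probs))  # 1: two of the three members vote for class 1
```

The two schemes disagree here because one confident member outweighs two marginal ones under soft voting, but counts as a single vote under hard voting.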

evaluate(test_data)¶

Evaluates the target values for the passed MODData by returning the corresponding loss.

Parameters: test_data (MODData) – A featurized and feature-selected MODData object containing the descriptors used in training.
Returns: Loss score
Return type: pandas.DataFrame

fit_preset(data, presets=None, val_fraction=0.15, verbose=0, classification=False, refit=False, fast=False, nested=5, callbacks=None, n_jobs=1)¶

Chooses the MODNet model with the best-performing hyperparameters from a set of presets.

This function implements the “inner loop” of a cross-validation workflow. By modifying the nested argument, it can be run in full nested mode (i.e. train n_fold * n_preset models) or just with a simple random hold-out set.

The data is first fitted on several well-performing MODNet presets with a validation set (a fraction val_fraction of the supplied data, 15% by default).

Sets the self.models attribute to the model with the lowest mean validation loss across all folds.

Note: Inner models (presets) are 5-model bootstraps. The final (refit) model will be a self.n_models bootstrap.
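
The selection rule described above (lowest mean validation loss across all folds) can be sketched as follows. The preset names and loss values are hypothetical, and the snippet only illustrates the bookkeeping, not MODNet's training loop.

```python
import statistics

# Hypothetical validation losses: one list per preset, one entry per inner fold.
val_losses = {
    "preset_a": [0.30, 0.28, 0.35],
    "preset_b": [0.25, 0.27, 0.26],
    "preset_c": [0.20, 0.40, 0.33],
}

# Average each preset's loss over the folds and keep the lowest mean.
mean_losses = {name: statistics.fmean(losses) for name, losses in val_losses.items()}
best_preset = min(mean_losses, key=mean_losses.get)

# Full nested mode trains one model per (fold, preset) pair: n_fold * n_preset.
n_trained = sum(len(losses) for losses in val_losses.values())

print(best_preset, n_trained)  # preset_b 9
```

Note that preset_c has the single best fold (0.20) but is not selected, because the choice is based on the mean over folds.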

Parameters:

data (MODData) – MODData object containing training and validation samples.
presets (List[Dict[str, Any]]) – A list of dictionaries containing custom presets.
verbose (int) – The verbosity level to pass to tf.keras
val_fraction (float) – The fraction of the data to use for validation.
classification (bool) – Whether or not we are performing classification.
refit (bool) – Whether or not to refit the final model for each fold with the best-performing settings.
fast (bool) – Used for debugging. If True, only fit the first 2 presets, use 1-model ensembles and reduce the number of epochs.
nested (int) – Whether to perform a full nested CV. If 0, a simple validation split is performed based on the val_fraction argument. If a positive integer, that number of inner CV folds is used and the val_fraction argument is ignored. Note: If set to 1, the value will be overwritten with the default of 5 folds.
n_jobs (int) – number of concurrent processes to use when multiprocessing
callbacks (List[Any]) –
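
How nested and val_fraction interact can be summarised in a short sketch. The helper resolve_validation_scheme is hypothetical and only encodes the rules stated in the parameter descriptions above; it does not exist in MODNet.

```python
def resolve_validation_scheme(nested, val_fraction=0.15):
    """Hypothetical helper mirroring the documented rules for `nested`."""
    if nested == 0:
        # Simple random hold-out split sized by val_fraction.
        return ("holdout", val_fraction)
    if nested == 1:
        # A single fold is not meaningful; fall back to the default of 5 folds.
        nested = 5
    # Inner k-fold CV; val_fraction is ignored in this mode.
    return ("kfold", nested)


print(resolve_validation_scheme(0))  # ('holdout', 0.15)
print(resolve_validation_scheme(1))  # ('kfold', 5)
print(resolve_validation_scheme(3))  # ('kfold', 3)
```

Only the cases described in the documentation are covered; the sketch makes no assumption about how other values are handled.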

Returns:

A list of length num_outer_folds containing lists of MODNet models of length num_inner_folds.
A list of validation losses achieved by the best model for each fold during validation (excluding refit).
The learning curve of the final (refitted) model (or None if refit is False).
A nested list of learning curves for each trained model, with dimensions (num_outer_folds, num_inner_folds).
The settings of the best-performing preset.

Return type:

Tuple[List[List[Any]], numpy.ndarray, Optional[List[float]], List[List[float]], Dict[str, Any]]