modnet.models package

Submodules

Module contents

class modnet.models.MODNetModel(targets, weights, num_neurons=([64], [32], [16], [16]), num_classes=None, multi_label=False, n_feat=64, act='relu', out_act='linear')

Bases: object

Container class for the underlying tf.keras Model that handles setting up the architecture, activations, training and learning curve.

n_feat

The number of features used in the model.

weights

The relative loss weights for each target.

optimal_descriptors

The list of column names used in training the model.

model

The tf.keras.models.Model of the network itself.

target_names

The list of target names that the model was trained for.

Initialise the model on the passed targets with the desired architecture, feature count, loss functions and activation functions.

Parameters:
  • targets (List) – A nested list of target names that defines the hierarchy of the output layers.

  • weights (Dict[str, float]) – The relative loss weights to apply for each target.

  • num_classes (Optional[Dict[str, int]]) – Dictionary defining the target types (classification or regression). Should be constructed as follows: key: string giving the target name; value: integer n, with n=0 for regression and n>=2 for classification with n the number of classes.

  • multi_label (Optional[bool]) – Whether the problem (if classification) is multi-label. In this case the softmax output-activation is replaced by a sigmoid.

  • num_neurons – A specification of the model layers, as a 4-tuple of lists of integers. Hidden layers are split into four blocks of tf.keras.layers.Dense, with neuron count specified by the elements of the num_neurons argument.

  • n_feat (Optional[int]) – The number of features to use as model inputs.

  • act (str) – A string defining a tf.keras activation function to use in the tf.keras.layers.Dense layers.

  • out_act (str) – A string defining a tf.keras activation function to use for the last output layer (regression only).
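
A minimal sketch of how the constructor arguments relate to each other, using only plain Python containers. The target names (e_form, is_metal) and the helpers flatten and target_kinds are hypothetical, and the three-level nesting of targets is an assumption about the hierarchy format, which this page does not spell out.

```python
# Hypothetical target names; the nesting depth is an assumption.
targets = [[["e_form"], ["is_metal"]]]
weights = {"e_form": 1.0, "is_metal": 0.5}
# 0 = regression, n >= 2 = classification with n classes
num_classes = {"e_form": 0, "is_metal": 2}


def flatten(nested):
    """Return the leaf strings of an arbitrarily nested list, in order."""
    if isinstance(nested, str):
        return [nested]
    return [name for item in nested for name in flatten(item)]


def target_kinds(num_classes):
    """Map each target to its type following the n=0 / n>=2 rule."""
    kinds = {}
    for name, n in num_classes.items():
        if n == 0:
            kinds[name] = "regression"
        elif n >= 2:
            kinds[name] = "classification"
        else:
            raise ValueError(f"{name}: n must be 0 or >= 2, got {n}")
    return kinds


names = flatten(targets)
# Every target should have a loss weight and a declared type.
assert set(names) == set(weights) == set(num_classes)
kinds = target_kinds(num_classes)

# With modnet installed, the arguments would be passed as (not executed here):
# model = MODNetModel(targets, weights, num_classes=num_classes, n_feat=64)
```

Checking the three containers against each other before constructing the model catches a misspelled target name early.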

can_return_uncertainty = False

build_model(targets, n_feat, num_neurons, num_classes=None, multi_label=False, act='relu', out_act='linear')

Builds the tf.keras model and sets the self.model attribute.

Parameters:
  • targets (List) – A nested list of target names that defines the hierarchy of the output layers.

  • n_feat (int) – The number of features to use as model inputs.

  • num_neurons (Tuple[List[int], List[int], List[int], List[int]]) – A specification of the model layers, as a 4-tuple of lists of integers. Hidden layers are split into four blocks of tf.keras.layers.Dense, with neuron count specified by the elements of the num_neurons argument.

  • num_classes (Optional[Dict[str, int]]) – Dictionary defining the target types (classification or regression). Should be constructed as follows: key: string giving the target name; value: integer n, with n=0 for regression and n>=2 for classification with n the number of classes.

  • multi_label (Optional[bool]) – Whether the problem (if classification) is multi-label. In this case the softmax output-activation is replaced by a sigmoid.

  • act (str) – A string defining a tf.keras activation function to use in the tf.keras.layers.Dense layers.

  • out_act (str) – A string defining a tf.keras activation function to use for the last output layer (regression only).
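
The multi_label switch between softmax and sigmoid can be illustrated numerically. This is a plain-Python sketch of the two activation functions themselves, not of the layers that build_model creates; the logits are made-up values.

```python
import math


def softmax(logits):
    """Mutually exclusive classes: outputs are positive and sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits):
    """Independent labels: each output lies in (0, 1) on its own."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


logits = [2.0, 2.0, -1.0]  # made-up output-layer pre-activations
single_label = softmax(logits)  # probabilities compete with each other
multi_label = sigmoid(logits)   # several labels can be "on" at once
```

With softmax the two equally strong classes must share the probability mass, whereas with sigmoid both can exceed 0.5 simultaneously, which is what a multi-label problem needs.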

fit(training_data, custom_data=None, val_fraction=0.0, val_key=None, val_data=None, lr=0.001, epochs=200, batch_size=128, xscale='minmax', impute_missing=0, xscale_before_impute=True, metrics=['mae'], callbacks=None, verbose=0, loss=None, **fit_params)

Train the model on the passed training MODData object.

Parameters:
  • training_data (MODData) – A MODData that has been featurized and feature selected. The first self.n_feat entries in training_data.get_optimal_descriptors() will be used for training.

  • custom_data (np.ndarray) – Optional array of shape (n_samples, n_custom_props) that will be appended to the targets (column-wise). This can be useful for defining custom loss functions.

  • val_fraction (float) – The fraction of the training data to use as a validation set for tracking model performance during training.

  • val_key (Optional[str]) – The target name to track on the validation set during training, if performing multi-target learning.

  • lr (float) – The learning rate.

  • epochs (int) – The maximum number of epochs to train for.

  • batch_size (int) – The batch size to use for training.

  • xscale (Optional[str]) – The feature scaler to use, either None, 'minmax' or 'standard'.

  • impute_missing (Optional[Union[float, str]]) – Determines how the NaN features are treated. If str, defines the strategy used in the scikit-learn SimpleImputer, e.g., “mean” sets the NaNs to the mean of their feature column. If a float is provided, and if xscale_before_impute is False, this float is used to replace NaNs in the original dataset. If a float is provided but xscale_before_impute is True, the float is not used and standard values are used. If you want to do something more sophisticated, make your own modifications to MODData.df_featurized before fitting the model.

  • xscale_before_impute (bool) – whether to first scale the input and then impute values, or first impute values and then scale the inputs.

  • metrics (List[str]) – A list of tf.keras metrics to pass to compile(...).

  • loss (str) – The built-in tf.keras loss to pass to compile(...).

  • fit_params – Any additional parameters to pass to fit(...); these will be overwritten by the explicit keyword arguments above.

  • val_data (Optional[MODData]) –

  • callbacks (List[Callable]) –

  • verbose (int) –
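
The effect of xscale_before_impute can be seen on a single feature column. The following is a plain-Python sketch of the two orderings, not MODNet's implementation: the helpers minmax_scale and impute are hypothetical, the [0, 1] scaling range is an assumption, and the placeholder used in the scale-first branch is an arbitrary stand-in for the "standard values" mentioned above.

```python
import math

NAN = float("nan")


def minmax_scale(column):
    """Scale finite values to [0, 1]; NaNs pass through untouched."""
    finite = [v for v in column if not math.isnan(v)]
    lo, hi = min(finite), max(finite)
    span = (hi - lo) or 1.0  # avoid division by zero for constant columns
    return [v if math.isnan(v) else (v - lo) / span for v in column]


def impute(column, fill):
    """Replace every NaN with the given fill value."""
    return [fill if math.isnan(v) else v for v in column]


column = [2.0, NAN, 4.0, 6.0]

# Scale first: the range is computed from observed values only, then NaNs
# receive a placeholder (hypothetical value, chosen outside the scaled range).
scale_then_impute = impute(minmax_scale(column), fill=-1.0)

# Impute first: the user-supplied float (here 0.0) takes part in the scaling
# and therefore shifts the minimum of the column.
impute_then_scale = minmax_scale(impute(column, fill=0.0))
```

In the second ordering the imputed value becomes the new column minimum, so every observed value is scaled differently than in the first ordering.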

Return type:

None

fit_preset(data, presets=None, val_fraction=0.15, verbose=0, classification=False, refit=True, fast=False, nested=5, callbacks=None, n_jobs=None, **fit_params)

Chooses the MODNet model with the best-performing hyperparameters from a set of presets.

This function implements the “inner loop” of a cross-validation workflow. By modifying the nested argument, it can be run in full nested mode (i.e. train n_fold * n_preset models) or just with a simple random hold-out set.

The data is first fitted on several well-performing MODNet presets with a validation set (a fraction val_fraction of the supplied data, 0.15 by default, when nested is 0).

Sets the self.model attribute to the model with the lowest mean validation loss across all folds.

Parameters:
  • data (MODData) – MODData object containing training and validation samples.

  • presets (List[Dict[str, Any]]) – A list of dictionaries containing custom presets.

  • verbose (int) – The verbosity level to pass to tf.keras

  • val_fraction (float) – The fraction of the data to use for validation.

  • classification (bool) – Whether or not we are performing classification.

  • refit (bool) – Whether or not to refit the final model for each fold with the best-performing settings.

  • fast (bool) – Used for debugging. If True, only fit the first 2 presets and reduce the number of epochs.

  • nested (int) – integer specifying whether or not to perform a full nested CV. If 0, a simple validation split is performed based on val_fraction argument. If an integer, use this number of inner CV folds, ignoring the val_fraction argument. Note: If set to 1, the value will be overwritten to a default of 5 folds.

  • n_jobs – number of jobs for multiprocessing

  • callbacks (List[Any]) –

Returns:

  • A list of length num_outer_folds containing lists of MODNet models of length num_inner_folds.

  • A list of validation losses achieved by the best model for each fold during validation (excluding refit).

  • The learning curve of the final (refitted) model (or None if refit is False)

  • A nested list of learning curves for each trained model of lengths (num_outer_folds, num_inner_folds).

  • The settings of the best-performing preset.

Return type:

Tuple[List[List[Any]], numpy.ndarray, Optional[List[float]], List[List[float]], Dict[str, Any]]
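
The selection rule described above (lowest mean validation loss across folds) can be sketched in a few lines. The preset names and loss values are invented for illustration, and the code only mirrors the selection logic, not the training itself.

```python
from statistics import mean

# Hypothetical validation losses: one entry per inner fold for each preset.
fold_losses = {
    "preset_a": [0.30, 0.28, 0.35],
    "preset_b": [0.25, 0.27, 0.26],
    "preset_c": [0.20, 0.45, 0.40],
}

n_preset = len(fold_losses)
n_fold = len(next(iter(fold_losses.values())))
n_trained = n_fold * n_preset  # models trained in full nested mode

# Average over folds, then keep the preset with the lowest mean loss.
mean_losses = {name: mean(losses) for name, losses in fold_losses.items()}
best_preset = min(mean_losses, key=mean_losses.get)

# With modnet installed, the five return values would be unpacked as
# (not executed here; variable names are illustrative):
# models, val_losses, learning_curve, learning_curves, best_settings = \
#     model.fit_preset(data, nested=5)
```

Note that preset_c has the single best fold (0.20) but loses on the mean, which is why averaging over folds matters.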

predict(test_data, return_prob=False, remap_out_of_bounds=True)

Predict the target values for the passed MODData.

Parameters:
  • test_data (MODData) – A featurized and feature-selected MODData object containing the descriptors used in training.

  • return_prob (bool) – For a classification task only: whether to return the probability of each class OR only return the most probable class.

  • remap_out_of_bounds (bool) – Whether to remap out-of-bounds predictions to the training data distribution.

Returns:

A pandas.DataFrame containing the predicted values of the targets.

Return type:

pandas.DataFrame
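
For classification, the difference between the two return_prob settings amounts to an argmax over the class probabilities. A plain-Python sketch with invented probabilities (the helper most_probable_class is hypothetical and not part of modnet):

```python
# Hypothetical class probabilities: three samples, three classes.
probabilities = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
]


def most_probable_class(row):
    """Index of the largest probability in one row."""
    return max(range(len(row)), key=row.__getitem__)


# What "only return the most probable class" reduces each row to.
labels = [most_probable_class(row) for row in probabilities]
```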

evaluate(test_data, loss='mae')

Evaluates predictions on the passed MODData by returning the corresponding score, averaged over the targets when multi-target:
  • for regression: the loss function provided in the loss argument (defaults to 'mae').

  • for classification: negative ROC AUC.

Parameters:
  • test_data (MODData) – A featurized and feature-selected MODData object containing the descriptors used in training.

  • loss (Union[str, Callable]) –

Returns:

The score defined above.

Return type:

pandas.DataFrame
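
Both scores can be computed by hand on toy data. This sketch assumes a plain unweighted mean over targets and uses a pairwise-ranking definition of ROC AUC; the helpers mae and roc_auc and all numbers are illustrative, not taken from modnet.

```python
def mae(y_true, y_pred):
    """Mean absolute error of one target."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Regression: two targets, score averaged over targets.
truth = {"t1": [1.0, 2.0, 3.0], "t2": [0.0, 4.0, 8.0]}
preds = {"t1": [1.5, 2.0, 2.5], "t2": [1.0, 4.0, 6.0]}
regression_score = sum(mae(truth[t], preds[t]) for t in truth) / len(truth)

# Classification: the AUC is negated so that lower is better,
# matching the convention of a loss.
classification_score = -roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Negating the AUC lets regression and classification scores be minimised with the same selection logic.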

save(filename)

Save the MODNetModel to filename.

If the filename ends in “tgz”, “bz2” or “zip”, the pickle will be compressed accordingly by pandas.DataFrame.to_pickle().

Parameters:

filename (str) – The base filename to save to.

Return type:

None

static load(filename)

Load a MODNetModel object pickled by the MODNetModel.save() method.

If the filename ends in “tgz”, “bz2” or “zip”, the pickle will be decompressed accordingly by pandas.read_pickle().

Returns:

The loaded MODNetModel object.

Parameters:

filename (str) –

Return type:

MODNetModel
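
The compressed-pickle round trip behind save and load can be sketched with the standard library alone. This example uses bz2 and pickle directly as a stand-in for the pandas functions named above, and a plain dictionary as a stand-in for a fitted model; the file name is arbitrary.

```python
import bz2
import os
import pickle
import tempfile

# Stand-in for a fitted model: any picklable object.
state = {"n_feat": 64, "target_names": ["e_form"]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl.bz2")
    # Write a bz2-compressed pickle ...
    with bz2.open(path, "wb") as fh:
        pickle.dump(state, fh)
    # ... and read it back.
    with bz2.open(path, "rb") as fh:
        restored = pickle.load(fh)

# With modnet installed, the equivalent calls would be (not executed here):
# model.save("model.pkl.bz2")
# model = MODNetModel.load("model.pkl.bz2")
```

As with any pickle, only load files from sources you trust.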

get_params(deep=True)

Get parameters for this estimator. Taken from sklearn.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object. Taken from sklearn.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance
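
The get_params / set_params convention can be shown on a toy class. TinyEstimator is hypothetical and only mimics the scikit-learn pattern for flat (non-nested) parameters; it is not how MODNetModel is implemented.

```python
import inspect


class TinyEstimator:
    """Hypothetical estimator following the scikit-learn parameter convention."""

    def __init__(self, n_feat=64, act="relu"):
        self.n_feat = n_feat
        self.act = act

    def get_params(self, deep=True):
        # Parameter names are read from the constructor signature.
        # `deep` is accepted for API compatibility but unused here,
        # because this toy class holds no nested estimators.
        names = list(inspect.signature(self.__init__).parameters)
        return {name: getattr(self, name) for name in names}

    def set_params(self, **params):
        valid = self.get_params()
        for name, value in params.items():
            if name not in valid:
                raise ValueError(f"Invalid parameter {name!r}")
            setattr(self, name, value)
        return self  # returning self allows chaining


est = TinyEstimator().set_params(n_feat=32)
# The parameter dict is enough to build an unfitted copy.
clone = TinyEstimator(**est.get_params())
```

This round trip (parameters out, new instance in) is what lets scikit-learn tools such as grid search clone and reconfigure an estimator.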

class modnet.models.EnsembleMODNetModel(*args, n_models=100, bootstrap=True, models=None, modnet_models=None, random_state=None, **kwargs)

Bases: MODNetModel

Container class for n_models (bootstrapped) MODNetModels that handles setting up the architecture, activations, training and learning curve.

n_feat

The number of features used in the model.

weights

The relative loss weights for each target.

optimal_descriptors

The list of column names used in training the model.

model

The tf.keras.models.Model of the network itself.

target_names

The list of target names that the model was trained for.

Parameters:
  • *args – See MODNetModel

  • n_models – number of inner MODNetModels; each model has the same architecture, defined by the args and kwargs.

  • bootstrap – whether to bootstrap the samples for each inner MODNet fit.

  • models – List of user-provided MODNetModels. This allows the inner models to have different architectures. n_models is discarded in this case.

  • random_state (Optional[int]) – fix a random state for use with this model.

  • modnet_models – Deprecated. Same argument as models. For backward compatibility only.

  • **kwargs – See MODNetModel
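
Bootstrapping here means that each inner model sees a resample of the training set drawn with replacement. A minimal standard-library sketch follows; bootstrap_indices is a hypothetical helper, and how modnet actually derives per-model seeds from random_state is not documented on this page.

```python
import random


def bootstrap_indices(n_samples, n_models, random_state=None):
    """Draw one with-replacement index sample per inner model."""
    rng = random.Random(random_state)  # a fixed seed gives reproducible draws
    return [rng.choices(range(n_samples), k=n_samples) for _ in range(n_models)]


draws = bootstrap_indices(n_samples=10, n_models=3, random_state=0)
repeat = bootstrap_indices(n_samples=10, n_models=3, random_state=0)
```

Each draw has the same size as the original data set but typically repeats some rows and omits others, which is what makes the inner models differ.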

can_return_uncertainty = True

fit(training_data, n_jobs=1, **kwargs)

Train the model on the passed training MODData object.

Parameters match those of MODNetModel.fit.

Parameters:

training_data (MODData) –

Return type:

None

predict(test_data, return_unc=False, return_prob=False, remap_out_of_bounds=True)

Predict the target values for the passed MODData.

Parameters:
  • test_data (MODData) – A featurized and feature-selected MODData object containing the descriptors used in training.

  • return_prob (bool) – For a classification task only: whether to return the probability of each class OR only return the most probable class.

  • return_unc (bool) – whether to return a second dataframe containing the uncertainties

  • remap_out_of_bounds (bool) – whether to remap out-of-bounds values to the nearest bound.

Returns:

A pandas.DataFrame containing the predicted values of the targets.

Return type:

pandas.DataFrame
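
One common way to turn ensemble member outputs into a prediction and an uncertainty is the mean and standard deviation across members. The sketch below assumes exactly that (with the sample standard deviation); this page does not state which statistic return_unc reports, and the numbers are invented.

```python
from statistics import mean, stdev

# Hypothetical predictions: one row per inner model, one column per sample.
member_predictions = [
    [1.0, 2.0],
    [1.2, 2.0],
    [0.8, 2.0],
]

# Regroup so that each tuple holds all member outputs for one sample.
per_sample = list(zip(*member_predictions))

prediction = [mean(values) for values in per_sample]
uncertainty = [stdev(values) for values in per_sample]  # assumption: sample std
```

Samples on which the members agree (the second column) get zero uncertainty, while disagreement between members shows up as a larger spread.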

evaluate(test_data)

Evaluates the target values for the passed MODData by returning the corresponding loss.

Parameters:

test_data (MODData) – A featurized and feature-selected MODData object containing the descriptors used in training.

Returns:

Loss score

Return type:

pandas.DataFrame

fit_preset(data, presets=None, val_fraction=0.15, verbose=0, classification=False, refit=False, fast=False, nested=5, callbacks=None, n_jobs=1)

Chooses the MODNet model with the best-performing hyperparameters from a set of presets.

This function implements the “inner loop” of a cross-validation workflow. By modifying the nested argument, it can be run in full nested mode (i.e. train n_fold * n_preset models) or just with a simple random hold-out set.

The data is first fitted on several well-performing MODNet presets with a validation set (a fraction val_fraction of the supplied data, 0.15 by default, when nested is 0).

Sets the self.models attribute to the model with the lowest mean validation loss across all folds.

Note: Inner models (presets) are 5-model bootstraps. The final (refit) model will be a self.n_models bootstrap.

Parameters:
  • data (MODData) – MODData object containing training and validation samples.

  • presets (List[Dict[str, Any]]) – A list of dictionaries containing custom presets.

  • verbose (int) – The verbosity level to pass to tf.keras

  • val_fraction (float) – The fraction of the data to use for validation.

  • classification (bool) – Whether or not we are performing classification.

  • refit (bool) – Whether or not to refit the final model for each fold with the best-performing settings.

  • fast (bool) – Used for debugging. If True, only fit the first 2 presets, use 1-model ensembles and reduce the number of epochs.

  • nested (int) – integer specifying whether or not to perform a full nested CV. If 0, a simple validation split is performed based on val_fraction argument. If an integer, use this number of inner CV folds, ignoring the val_fraction argument. Note: If set to 1, the value will be overwritten to a default of 5 folds.

  • n_jobs (int) – number of concurrent processes to use when multiprocessing

  • callbacks (List[Any]) –

Returns:

  • A list of length num_outer_folds containing lists of MODNet models of length num_inner_folds.

  • A list of validation losses achieved by the best model for each fold during validation (excluding refit).

  • The learning curve of the final (refitted) model (or None if refit is False)

  • A nested list of learning curves for each trained model of lengths (num_outer_folds, num_inner_folds).

  • The settings of the best-performing preset.

Return type:

Tuple[List[List[Any]], numpy.ndarray, Optional[List[float]], List[List[float]], Dict[str, Any]]