modnet.models package
Submodules
- modnet.models.bayesian module
- modnet.models.ensemble module
EnsembleMODNetModel
EnsembleMODNetModel.n_feat
EnsembleMODNetModel.weights
EnsembleMODNetModel.optimal_descriptors
EnsembleMODNetModel.model
EnsembleMODNetModel.target_names
EnsembleMODNetModel.can_return_uncertainty
EnsembleMODNetModel.fit()
EnsembleMODNetModel.predict()
EnsembleMODNetModel.evaluate()
EnsembleMODNetModel.fit_preset()
- modnet.models.vanilla module
MODNetModel
MODNetModel.n_feat
MODNetModel.weights
MODNetModel.optimal_descriptors
MODNetModel.model
MODNetModel.target_names
MODNetModel.can_return_uncertainty
MODNetModel.build_model()
MODNetModel.fit()
MODNetModel.fit_preset()
MODNetModel.predict()
MODNetModel.evaluate()
MODNetModel.save()
MODNetModel.load()
MODNetModel.get_params()
MODNetModel.set_params()
Module contents
- class modnet.models.MODNetModel(targets, weights, num_neurons=([64], [32], [16], [16]), num_classes=None, multi_label=False, n_feat=64, act='relu', out_act='linear')
Bases: object
Container class for the underlying tf.keras Model that handles setting up the architecture, activations, training and learning curve.
- n_feat
The number of features used in the model.
- weights
The relative loss weights for each target.
- optimal_descriptors
The list of column names used in training the model.
- model
The tf.keras.Model of the network itself.
- target_names
The list of target names that the model was trained for.
Initialise the model on the passed targets with the desired architecture, feature count, loss functions and activation functions.
- Parameters:
targets (List) – A nested list of target names that defines the hierarchy of the output layers.
weights (Dict[str, float]) – The relative loss weights to apply for each target.
num_classes (Optional[Dict[str, int]]) – Dictionary defining the target types (classification or regression). Each key is a string giving the target name; each value is an integer n, with n=0 for regression and n>=2 for classification with n classes.
multi_label (Optional[bool]) – Whether the problem (if classification) is multi-label. In this case the softmax output activation is replaced by a sigmoid.
num_neurons – A specification of the model layers, as a 4-tuple of lists of integers. Hidden layers are split into four blocks of tf.keras.layers.Dense, with neuron counts specified by the elements of the num_neurons argument.
n_feat (Optional[int]) – The number of features to use as model inputs.
act (str) – A string defining a tf.keras activation function to use in the tf.keras.layers.Dense layers.
out_act (str) – A string defining a tf.keras activation function to use for the last output layer (regression only).
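As a minimal sketch of how these arguments fit together, the snippet below builds the targets, weights, num_classes and num_neurons structures for two hypothetical targets and checks that they are consistent with each other. The target names, the three-level nesting of targets and the helper flatten_targets are illustrative assumptions, not part of modnet; the constructor call appears only as a comment.

```python
from typing import Any, Dict, List


def flatten_targets(targets: List[Any]) -> List[str]:
    """Return the leaf target names of an arbitrarily nested list (hypothetical helper)."""
    names: List[str] = []
    for item in targets:
        if isinstance(item, str):
            names.append(item)
        else:
            names.extend(flatten_targets(item))
    return names


# Hypothetical targets: one regression target and one 3-class classification target.
targets = [[["band_gap"], ["crystal_class"]]]
weights: Dict[str, float] = {"band_gap": 1.0, "crystal_class": 0.5}
num_classes: Dict[str, int] = {"band_gap": 0, "crystal_class": 3}  # 0 = regression
num_neurons = ([64], [32], [16], [16])  # one list of layer widths per block

target_names = flatten_targets(targets)

# Every target needs a loss weight and a valid class count.
assert set(target_names) == set(weights) == set(num_classes)
assert all(n == 0 or n >= 2 for n in num_classes.values())
assert len(num_neurons) == 4

# With modnet installed, these would be passed on as (not executed here):
# model = MODNetModel(targets, weights, num_neurons=num_neurons,
#                     num_classes=num_classes, n_feat=64)
print(target_names)
```

Checking the dictionaries against the flattened target list before constructing the model catches a misspelled target name early, before any network is built.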
- can_return_uncertainty = False
- build_model(targets, n_feat, num_neurons, num_classes=None, multi_label=False, act='relu', out_act='linear')
Builds the tf.keras model and sets the self.model attribute.
- Parameters:
targets (List) – A nested list of target names that defines the hierarchy of the output layers.
n_feat (int) – The number of features to use as model inputs.
num_neurons (Tuple[List[int], List[int], List[int], List[int]]) – A specification of the model layers, as a 4-tuple of lists of integers. Hidden layers are split into four blocks of tf.keras.layers.Dense, with neuron counts specified by the elements of the num_neurons argument.
num_classes (Optional[Dict[str, int]]) – Dictionary defining the target types (classification or regression). Each key is a string giving the target name; each value is an integer n, with n=0 for regression and n>=2 for classification with n classes.
multi_label (Optional[bool]) – Whether the problem (if classification) is multi-label. In this case the softmax output activation is replaced by a sigmoid.
act (str) – A string defining a tf.keras activation function to use in the tf.keras.layers.Dense layers.
out_act (str) – A string defining a tf.keras activation function to use for the last output layer (regression only).
- fit(training_data, custom_data=None, val_fraction=0.0, val_key=None, val_data=None, lr=0.001, epochs=200, batch_size=128, xscale='minmax', impute_missing=0, xscale_before_impute=True, metrics=['mae'], callbacks=None, verbose=0, loss=None, **fit_params)
Train the model on the passed training MODData object.
- Parameters:
training_data (MODData) – A MODData that has been featurized and feature selected. The first self.n_feat entries in training_data.get_optimal_descriptors() will be used for training.
custom_data (np.ndarray) – Optional array of shape (n_samples, n_custom_props) that will be appended to the targets (column-wise). This can be useful for defining custom loss functions.
val_fraction (float) – The fraction of the training data to use as a validation set for tracking model performance during training.
val_key (Optional[str]) – The target name to track on the validation set during training, if performing multi-target learning.
lr (float) – The learning rate.
epochs (int) – The maximum number of epochs to train for.
batch_size (int) – The batch size to use for training.
xscale (Optional[str]) – The feature scaler to use, either None, 'minmax' or 'standard'.
impute_missing (Optional[Union[float, str]]) – Determines how NaN features are treated. If a str, it defines the strategy used in the scikit-learn SimpleImputer, e.g., “mean” sets the NaNs to the mean of their feature column. If a float is provided and xscale_before_impute is False, this float is used to replace NaNs in the original dataset. If a float is provided but xscale_before_impute is True, the float is not used and standard values are used. If you want to do something more sophisticated, make your own modifications to MODData.df_featurized before fitting the model.
xscale_before_impute (bool) – Whether to first scale the inputs and then impute values, or first impute values and then scale the inputs.
metrics (List[str]) – A list of tf.keras metrics to pass to compile(...).
loss (str) – The built-in tf.keras loss to pass to compile(...).
fit_params – Any additional parameters to pass to fit(...); these will be overwritten by the explicit keyword arguments above.
verbose (int) –
- Return type:
None
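The difference between the two orderings controlled by xscale_before_impute can be seen on a single feature column. The following is a plain-Python sketch, not modnet's implementation: the helpers minmax_scale and impute and the fill value 0.0 are assumptions made for illustration, and the sketch does not reproduce the "standard values" that the text above says replace a float fill when scaling happens first.

```python
import math
from typing import List


def minmax_scale(column: List[float]) -> List[float]:
    """Scale finite values to [0, 1]; NaN entries are left as NaN."""
    finite = [v for v in column if not math.isnan(v)]
    lo, hi = min(finite), max(finite)
    span = (hi - lo) or 1.0  # avoid division by zero for constant columns
    return [v if math.isnan(v) else (v - lo) / span for v in column]


def impute(column: List[float], fill: float) -> List[float]:
    """Replace NaN entries with a constant fill value."""
    return [fill if math.isnan(v) else v for v in column]


column = [2.0, math.nan, 4.0, 6.0]

# Scale first: the NaN does not influence the min/max used for scaling.
scale_then_impute = impute(minmax_scale(column), 0.0)

# Impute first: the fill value becomes the new minimum and shifts every scaled value.
impute_then_scale = minmax_scale(impute(column, 0.0))

print(scale_then_impute)
print(impute_then_scale)
```

In the first ordering the imputed entry lands on the scaled minimum; in the second, the fill value itself takes part in computing the range, so the observed values are scaled differently.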
- fit_preset(data, presets=None, val_fraction=0.15, verbose=0, classification=False, refit=True, fast=False, nested=5, callbacks=None, n_jobs=None, **fit_params)
Chooses the MODNet model with the best hyperparameters from a set of presets.
This function implements the “inner loop” of a cross-validation workflow. By modifying the nested argument, it can be run in full nested mode (i.e. train n_fold * n_preset models) or just with a simple random hold-out set.
Several well-performing MODNet presets are first fitted to the data, each with a validation set (a fraction val_fraction of the supplied data, 0.15 by default).
Sets the self.model attribute to the model with the lowest mean validation loss across all folds.
- Parameters:
data (MODData) – MODData object containing training and validation samples.
presets (List[Dict[str, Any]]) – A list of dictionaries containing custom presets.
verbose (int) – The verbosity level to pass to tf.keras.
val_fraction (float) – The fraction of the data to use for validation.
classification (bool) – Whether or not we are performing classification.
refit (bool) – Whether or not to refit the final model for each fold with the best-performing settings.
fast (bool) – Used for debugging. If True, only fit the first 2 presets and reduce the number of epochs.
nested (int) – Integer specifying whether or not to perform a full nested CV. If 0, a simple validation split is performed based on the val_fraction argument. If any other integer, use this number of inner CV folds, ignoring the val_fraction argument. Note: If set to 1, the value will be overwritten to a default of 5 folds.
n_jobs – Number of jobs for multiprocessing.
- Returns:
A list of length num_outer_folds containing lists of MODNet models of length num_inner_folds.
A list of validation losses achieved by the best model for each fold during validation (excluding refit).
The learning curve of the final (refitted) model (or None if refit is False).
A nested list of learning curves for each trained model, of lengths (num_outer_folds, num_inner_folds).
The settings of the best-performing preset.
- Return type:
Tuple[List[List[Any]], numpy.ndarray, Optional[List[float]], List[List[float]], Dict[str, Any]]
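The selection rule described above (keep the preset with the lowest mean validation loss across folds) can be sketched in a few lines. This is a simplified illustration, not the library's code: the preset names, the loss values and the helper select_best_preset are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def select_best_preset(val_losses: Dict[str, List[float]]) -> str:
    """Return the preset name with the lowest mean validation loss over folds."""
    return min(val_losses, key=lambda name: mean(val_losses[name]))


# Hypothetical validation losses: one entry per preset, one value per inner fold.
val_losses = {
    "preset_a": [0.42, 0.40, 0.45],
    "preset_b": [0.38, 0.41, 0.37],
    "preset_c": [0.50, 0.36, 0.49],
}

best = select_best_preset(val_losses)
print(best)
```

Note that preset_c achieves the single lowest loss on one fold but is not selected, because the criterion is the mean over all folds rather than the best individual fold.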
- predict(test_data, return_prob=False, remap_out_of_bounds=True)
Predict the target values for the passed MODData.
- Parameters:
test_data (MODData) – A featurized and feature-selected MODData object containing the descriptors used in training.
return_prob (bool) – For classification tasks only: whether to return the probability of each class or only the most probable class.
remap_out_of_bounds (bool) – Whether to remap out-of-bounds predictions to the training data distribution.
- Returns:
A pandas.DataFrame containing the predicted values of the targets.
- Return type:
pandas.DataFrame
- evaluate(test_data, loss='mae')
Evaluates predictions on the passed MODData by returning the corresponding score:
for regression: the loss function provided in the loss argument (defaults to mae);
for classification: negative ROC AUC.
The score is averaged over the targets when multi-target.
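For the regression case, averaging over targets can be sketched as follows. This is an illustrative re-implementation with the standard library only; the target names, the values and the helpers mae and multi_target_score are assumptions and are not taken from modnet.

```python
from statistics import mean
from typing import Dict, List


def mae(y_true: List[float], y_pred: List[float]) -> float:
    """Mean absolute error for a single target."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def multi_target_score(
    y_true: Dict[str, List[float]], y_pred: Dict[str, List[float]]
) -> float:
    """Average the per-target MAE over all targets."""
    return mean(mae(y_true[name], y_pred[name]) for name in y_true)


# Hypothetical reference values and predictions for two regression targets.
y_true = {"band_gap": [1.0, 2.0, 3.0], "formation_energy": [0.0, -1.0, -2.0]}
y_pred = {"band_gap": [1.5, 2.0, 2.5], "formation_energy": [0.0, -1.0, -1.0]}

score = multi_target_score(y_true, y_pred)
print(round(score, 4))
```

An unweighted average like this one mixes targets with different units and scales, which is worth keeping in mind when comparing scores between multi-target models.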
- save(filename)
Save the MODNetModel to filename.
If the filename ends in “tgz”, “bz2” or “zip”, the pickle will be compressed accordingly by pandas.DataFrame.to_pickle().
- Parameters:
filename (str) – The base filename to save to.
- Return type:
None
- static load(filename)
Load a MODNetModel object pickled by the MODNetModel.save() method.
If the filename ends in “tgz”, “bz2” or “zip”, the pickle will be decompressed accordingly by pandas.read_pickle().
- Parameters:
filename (str) –
- Returns:
The loaded MODNetModel object.
- Return type:
MODNetModel
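The idea of choosing compression from the filename suffix can be illustrated with a standard-library round trip. This sketch is an assumption-laden stand-in: it uses pickle and bz2 directly rather than pandas, handles only the “bz2” suffix, and pickles a plain dictionary instead of a MODNetModel; the helpers save_pickle and load_pickle are hypothetical.

```python
import bz2
import os
import pickle
import tempfile


def save_pickle(obj, filename: str) -> None:
    """Pickle obj, compressing with bz2 when the filename ends in 'bz2'."""
    opener = bz2.open if filename.endswith("bz2") else open
    with opener(filename, "wb") as handle:
        pickle.dump(obj, handle)


def load_pickle(filename: str):
    """Load an object written by save_pickle, decompressing if needed."""
    opener = bz2.open if filename.endswith("bz2") else open
    with opener(filename, "rb") as handle:
        return pickle.load(handle)


# Stand-in for model state; a real MODNetModel would be saved with model.save(...).
state = {"n_feat": 64, "target_names": ["band_gap"]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl.bz2")
    save_pickle(state, path)
    restored = load_pickle(path)

print(restored == state)
```

As with any pickle-based format, files should only be loaded from trusted sources, since unpickling can execute arbitrary code.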
- get_params(deep=True)
Get parameters for this estimator. Taken from sklearn.
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object. Taken from sklearn.
- Parameters:
**params (dict) – Estimator parameters.
- Returns:
self – Estimator instance.
- Return type:
estimator instance
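The <component>__<parameter> convention can be sketched with two toy classes. This is a simplified illustration of the naming scheme, not the scikit-learn or modnet implementation (which also validates parameter names); the classes Component and Wrapper are hypothetical.

```python
class Component:
    """Toy estimator with a single parameter."""

    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha

    def set_params(self, **params):
        for key, value in params.items():
            name, sep, sub_name = key.partition("__")
            if sep:
                # "<component>__<parameter>": delegate to the named component.
                getattr(self, name).set_params(**{sub_name: value})
            else:
                setattr(self, name, value)
        return self


class Wrapper(Component):
    """Toy nested object holding another component."""

    def __init__(self, inner: Component, beta: int = 0):
        self.inner = inner
        self.beta = beta


wrapper = Wrapper(Component(alpha=1.0), beta=0)
result = wrapper.set_params(beta=2, inner__alpha=0.1)
print(wrapper.beta, wrapper.inner.alpha)
```

Returning self from set_params is what allows calls to be chained, for example directly followed by a fit call.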
- class modnet.models.EnsembleMODNetModel(*args, n_models=100, bootstrap=True, models=None, modnet_models=None, random_state=None, **kwargs)
Bases: MODNetModel
Container class for n_models (bootstrapped) MODNetModels that handles setting up the architecture, activations, training and learning curve.
- n_feat
The number of features used in the model.
- weights
The relative loss weights for each target.
- optimal_descriptors
The list of column names used in training the model.
- model
The tf.keras.Model of the network itself.
- target_names
The list of target names that the model was trained for.
- Parameters:
*args – See MODNetModel.
n_models – Number of inner MODNetModels. Each model has the same architecture, defined by the args and kwargs.
bootstrap – Whether to bootstrap the samples for each inner MODNet fit.
models – List of user-provided MODNetModels. This allows the inner models to have different architectures. n_models is discarded in this case.
random_state (Optional[int]) – Fix a random state for use with this model.
modnet_models – Deprecated. Same argument as models. For backward compatibility only.
**kwargs – See MODNetModel.
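Bootstrapping here means that each inner model is trained on a resample of the training set drawn with replacement. The sketch below shows one way to generate such resamples reproducibly with the standard library; the helper bootstrap_indices and the sample counts are assumptions, and the library's own sampling code may differ.

```python
import random
from typing import List, Optional


def bootstrap_indices(
    n_samples: int, n_models: int, random_state: Optional[int] = None
) -> List[List[int]]:
    """Draw one index resample (with replacement) per inner model."""
    rng = random.Random(random_state)
    return [rng.choices(range(n_samples), k=n_samples) for _ in range(n_models)]


# Hypothetical sizes: 10 training samples, 3 inner models.
resamples = bootstrap_indices(n_samples=10, n_models=3, random_state=42)

for indices in resamples:
    # Each resample has the original size; some rows usually repeat, others are left out.
    print(sorted(indices))
```

Fixing the seed makes the resamples, and therefore the ensemble members' training sets, repeatable between runs.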
- can_return_uncertainty = True
- fit(training_data, n_jobs=1, **kwargs)
Train the model on the passed training MODData object.
Parameters match those of MODNetModel.fit.
- Parameters:
training_data (MODData) –
- Return type:
None
- predict(test_data, return_unc=False, return_prob=False, remap_out_of_bounds=True)
Predict the target values for the passed MODData.
- Parameters:
test_data (MODData) – A featurized and feature-selected MODData object containing the descriptors used in training.
return_prob (bool) – For classification tasks only: whether to return the probability of each class or only the most probable class.
return_unc (bool) – Whether to return a second dataframe containing the uncertainties.
remap_out_of_bounds (bool) – Whether to remap out-of-bounds values to the nearest bound.
- Returns:
A pandas.DataFrame containing the predicted values of the targets.
- Return type:
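One common way for an ensemble to produce a prediction together with an uncertainty is to take the mean and the spread of the member predictions. The sketch below assumes exactly that; the text above does not state how the uncertainties are computed, so the use of the population standard deviation, the helper aggregate and the numbers are all illustrative assumptions.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def aggregate(member_predictions: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Combine per-model predictions into per-sample means and spreads."""
    per_sample = list(zip(*member_predictions))  # regroup values by sample
    means = [mean(values) for values in per_sample]
    spreads = [pstdev(values) for values in per_sample]
    return means, spreads


# Hypothetical predictions: three inner models (rows), two test samples (columns).
member_predictions = [
    [1.0, 2.0],
    [2.0, 2.0],
    [3.0, 2.0],
]

means, spreads = aggregate(member_predictions)
print(means)
print(spreads)
```

The two samples share the same mean prediction, but only the first carries a non-zero spread, because the members disagree on it.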
- evaluate(test_data)
Evaluates the target values for the passed MODData by returning the corresponding loss.
- fit_preset(data, presets=None, val_fraction=0.15, verbose=0, classification=False, refit=False, fast=False, nested=5, callbacks=None, n_jobs=1)
Chooses the MODNet model with the best hyperparameters from a set of presets.
This function implements the “inner loop” of a cross-validation workflow. By modifying the nested argument, it can be run in full nested mode (i.e. train n_fold * n_preset models) or just with a simple random hold-out set.
Several well-performing MODNet presets are first fitted to the data, each with a validation set (a fraction val_fraction of the supplied data, 0.15 by default).
Sets the self.models attribute to the model with the lowest mean validation loss across all folds.
Note: Inner models (presets) are 5-model bootstraps. The final (refit) model will be a bootstrap of n_models models.
- Parameters:
data (MODData) – MODData object containing training and validation samples.
presets (List[Dict[str, Any]]) – A list of dictionaries containing custom presets.
verbose (int) – The verbosity level to pass to tf.keras.
val_fraction (float) – The fraction of the data to use for validation.
classification (bool) – Whether or not we are performing classification.
refit (bool) – Whether or not to refit the final model for each fold with the best-performing settings.
fast (bool) – Used for debugging. If True, only fit the first 2 presets, use 1-model ensembles and reduce the number of epochs.
nested (int) – Integer specifying whether or not to perform a full nested CV. If 0, a simple validation split is performed based on the val_fraction argument. If any other integer, use this number of inner CV folds, ignoring the val_fraction argument. Note: If set to 1, the value will be overwritten to a default of 5 folds.
n_jobs (int) – Number of concurrent processes to use when multiprocessing.
- Returns:
A list of length num_outer_folds containing lists of MODNet models of length num_inner_folds.
A list of validation losses achieved by the best model for each fold during validation (excluding refit).
The learning curve of the final (refitted) model (or None if refit is False).
A nested list of learning curves for each trained model, of lengths (num_outer_folds, num_inner_folds).
The settings of the best-performing preset.
- Return type:
Tuple[List[List[Any]], numpy.ndarray, Optional[List[float]], List[List[float]], Dict[str, Any]]
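The rules given for the nested argument can be restated as a small function. This is a sketch of the documented behaviour only, under the assumption that nested is a non-negative integer; the helpers resolve_inner_folds and describe_split are hypothetical and not part of modnet.

```python
def resolve_inner_folds(nested: int) -> int:
    """Return the number of inner CV folds; 0 means a simple hold-out split."""
    if nested == 1:
        # A single fold is not a valid CV setting; the documented fallback is 5 folds.
        return 5
    return nested


def describe_split(nested: int, val_fraction: float = 0.15) -> str:
    """Describe which validation scheme a given nested value selects."""
    folds = resolve_inner_folds(nested)
    if folds == 0:
        return f"hold-out split with val_fraction={val_fraction}"
    return f"{folds}-fold inner CV (val_fraction ignored)"


for nested in (0, 1, 3, 5):
    print(nested, "->", describe_split(nested))
```

With the default nested=5 in the signature, val_fraction only takes effect when nested is explicitly set to 0.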