modnet.preprocessing module

This module defines the MODData class, featurizer functions and functions to compute normalized mutual information (NMI) and relevance redundancy (RR) between descriptors.

class modnet.preprocessing.CompositionContainer(composition)

Bases: object

A simple compatibility wrapper class for structure-less pymatgen `Structure`s.

modnet.preprocessing.compute_mi(x=None, y=None, x_name=None, y_name=None, random_state=None, n_neighbors=3)

Computes the mutual information between the arrays x and y.

Parameters:
  • x (numpy.ndarray) –

  • y (numpy.ndarray) –

  • x_name (str) –

  • y_name (str) –

modnet.preprocessing.map_mi(kwargs)
modnet.preprocessing.nmi_target(df_feat, df_target, task_type='regression', drop_constant_features=True, drop_duplicate_features=True, **kwargs)

Computes the Normalized Mutual Information (NMI) between a list of input features and a target variable.

Parameters:
  • df_feat (pandas.DataFrame) – Dataframe containing the input features for which the NMI with the target variable is to be computed.

  • df_target (pandas.DataFrame) – Dataframe containing the target variable. This DataFrame should contain only one column and have the same size as df_feat.

  • task_type (str) – 'regression' or 'classification', depending on the nature of the target variable.

  • drop_constant_features (bool) – If True, the features that are constant across the entire data set will be dropped.

  • drop_duplicate_features (bool) – If True, the features that have exactly the same values across the entire data set will be dropped.

  • **kwargs – Keyword arguments to be passed down to the mutual_info_regression() function from scikit-learn. This can be useful e.g. for testing purposes.

Returns:

Dataframe containing the NMI between each of the input features and the target variable.

Return type:

pandas.DataFrame
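A minimal usage sketch (the feature and target values below are random toy data, and the column names are purely illustrative):

    import numpy as np
    import pandas as pd

    from modnet.preprocessing import nmi_target

    rng = np.random.default_rng(seed=0)
    df_feat = pd.DataFrame({"feat_a": rng.random(50), "feat_b": rng.random(50)})
    df_target = pd.DataFrame({"my_target": rng.random(50)}, index=df_feat.index)

    # One NMI value per input feature, indexed by feature name.
    nmi = nmi_target(df_feat, df_target)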

modnet.preprocessing.get_cross_nmi(df_feat, drop_thr=0.2, return_entropy=False, n_jobs=None, **kwargs)

Computes the Normalized Mutual Information (NMI) between input features.

Parameters:
  • df_feat (pandas.DataFrame) – Dataframe containing the input features for which the cross NMI is to be computed.

  • drop_thr (float) – Features whose information entropy (or self mutual information) is below this threshold will be dropped.

  • return_entropy (bool) – If set to True, the information entropy of each feature is also returned.

  • **kwargs – Keyword arguments to be passed down to the mutual_info_regression() function from scikit-learn. This can be useful e.g. for testing purposes.

  • n_jobs (int) – max. number of processes to use when computing the cross NMI.

Returns:

Dataframe containing the Normalized Mutual Information between each pair of input features. If return_entropy=True, a tuple (mutual_info, diag) is returned instead, where diag is a dictionary with all features as keys and their information entropy as values.

Return type:

pandas.DataFrame
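A corresponding sketch for the cross NMI, reusing the toy df_feat from the nmi_target() example above:

    from modnet.preprocessing import get_cross_nmi

    # Square dataframe indexed by feature names on both axes.
    cross_nmi = get_cross_nmi(df_feat, drop_thr=0.2, n_jobs=1)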

modnet.preprocessing.get_rr_p_parameter_default(nn)

Returns p for the default expression outlined in arXiv:2004.14766.

Parameters:

nn (int) – number of features currently in chosen subset.

Returns:

the value for p.

Return type:

float

modnet.preprocessing.get_rr_c_parameter_default(nn)

Returns c for the default expression outlined in arXiv:2004.14766.

Parameters:

nn (int) – number of features currently in chosen subset.

Returns:

the value for c.

Return type:

float
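A sketch of what these two helpers evaluate, assuming the default expressions quoted under get_features_relevance_redundancy() below (the function names here are illustrative, not part of the API):

    def rr_p_default(nn: int) -> float:
        # p = max{0.1, 4.5 - n^0.4}
        return max(0.1, 4.5 - nn**0.4)

    def rr_c_default(nn: int) -> float:
        # c = 10^-6 * n^3
        return 1e-6 * nn**3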

modnet.preprocessing.get_features_relevance_redundancy(target_nmi, cross_nmi, n_feat=None, rr_parameters=None, return_pc=False)

Select features from the Relevance Redundancy (RR) score between the input features and the target output.

The RR is defined following Equation 2 of De Breuck et al., arXiv:2004.14766, with default values

\[ p = \max\{0.1,\ 4.5 - n^{0.4}\} \]

and

\[ c = 10^{-6}\, n^{3}, \]

where \(n\) is the number of features in the “chosen” subset for that iteration. These values can be overridden with the rr_parameters dictionary argument.

Parameters:
  • target_nmi (pandas.DataFrame) – dataframe containing the Normalized Mutual Information (NMI) between a list of input features and a target variable, as computed from nmi_target().

  • cross_nmi (pandas.DataFrame) – dataframe containing the NMI between the input features, as computed from get_cross_nmi().

  • n_feat (int) – Number of features for which the RR score needs to be computed (default: all features).

  • rr_parameters (dict) – Allows tuning of p and c parameters. Currently allows fixing of p and c to constant values instead of using the dynamical evaluation. Expects to find keys "p" and "c", containing either a callable that takes n as an argument and returns the desired p or c, or another dictionary containing the key "value" that stores a constant value of p or c.

  • return_pc (bool) – Whether to return p and c values in the output dictionaries.

Returns:

List of dictionaries containing the results of the relevance-redundancy selection algorithm.

Return type:

list
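A usage sketch showing how rr_parameters can pin p and c to constant values instead of the dynamic defaults (nmi and cross_nmi are assumed to have been computed as in the sketches above):

    from modnet.preprocessing import get_features_relevance_redundancy

    selected = get_features_relevance_redundancy(
        nmi,        # target NMI from nmi_target()
        cross_nmi,  # cross NMI from get_cross_nmi()
        n_feat=2,
        rr_parameters={"p": {"value": 0.1}, "c": {"value": 1e-6}},
        return_pc=True,
    )
    # `selected` is a list of dicts, one per selected feature.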

modnet.preprocessing.get_features_dyn(n_feat, cross_nmi, target_nmi)
modnet.preprocessing.merge_ranked(lists)

For multiple lists of ranked feature names/IDs (e.g. for different targets), work through the lists and merge them such that each feature is included once according to its highest rank across each list.

Parameters:

lists (List[List[Hashable]]) – the list of lists to merge.

Returns:

list of merged and ranked feature names/IDs.

Return type:

List[Hashable]
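For example, assuming the merge proceeds rank by rank across the lists and keeps only the first occurrence of each feature:

    from modnet.preprocessing import merge_ranked

    merged = merge_ranked([["f1", "f2", "f3"], ["f2", "f4", "f1"]])
    # Expected under that assumption: ["f1", "f2", "f4", "f3"]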

class modnet.preprocessing.MODData(materials=None, targets=None, target_names=None, structure_ids=None, num_classes=None, df_featurized=None, featurizer=None, structures=None)

Bases: object

The MODData class takes a list of pymatgen Structure objects and creates a pandas.DataFrame that contains many matminer features per structure. It then uses mutual information between features and targets, and between the features themselves, to perform feature selection using relevance-redundancy indices.

df_structure

dataframe storing the pymatgen Structure representations for each structure, indexed by ID.

Type:

pd.DataFrame

df_targets

dataframe storing the prediction targets per structure, indexed by ID.

Type:

pd.DataFrame

df_featurized

dataframe with columns storing all computed features per structure, indexed by ID.

Type:

pd.DataFrame

optimal_features

if feature selection has been performed this attribute stores a list of the selected features.

Type:

List[str]

optimal_features_by_target

If feature selection has been performed this attribute stores a list of the selected features, broken down by target property.

Type:

Dict[str, List[str]]

featurizer

the class used to featurize the data.

Type:

MODFeaturizer

__modnet_version__

The MODNet version number used to create the object.

Type:

str

cross_nmi

If feature selection has been performed, this attribute stores the normalized mutual information between all features.

Type:

pd.DataFrame

feature_entropy

Information entropy of all features. Only populated once the cross NMI has been computed.

Type:

Dictionary

num_classes

Dictionary defining the target types (classification or regression). Should be constructed as follows: key: string giving the target name; value: integer n, with n=0 for regression and n>=2 for classification with n the number of classes.

Type:

Dictionary

Initialise the MODData object either from a list of structures or from an already featurized dataframe. Prediction targets per structure can be specified as lists or an array alongside their target names. A list of unique IDs can be provided to label the structures.

Parameters:
  • materials (Optional[List[Union[Structure, Composition]]]) – list of structures or compositions to featurize and predict.

  • targets (Optional[Union[List[float], np.ndarray]]) – optional list of targets corresponding to each structure. When learning on multiple targets, this is an ndarray where each column corresponds to a target, i.e. of shape (n_materials, n_targets).

  • target_names (Optional[Iterable]) – optional Iterable (e.g. list) of names of target properties to use in the dataframe.

  • structure_ids (Optional[Iterable]) – optional Iterable of unique IDs to use instead of generated integers.

  • num_classes (Optional[Dict[str, int]]) –

    Dictionary defining the target types (classification or regression). Should be constructed as follows: key: string giving the target name; value: integer n, with n=0 for regression and n>=2 for classification with n the number of classes.

  • df_featurized (Optional[pd.DataFrame]) – optional featurized dataframe to use instead of featurizing a new one. Should be passed without structures.

  • featurizer (Optional[Union[MODFeaturizer, str]]) – optional MODFeaturizer object to use for featurization, or string preset to look up in presets dictionary.

  • structures (Optional[List[Union[Structure, Composition]]]) – deprecated alias of materials, kept for backwards compatibility; do not use.
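A minimal construction sketch (the compositions and target values are made up):

    from pymatgen.core import Composition

    from modnet.preprocessing import MODData

    data = MODData(
        materials=[Composition("SiO2"), Composition("Al2O3"), Composition("MgO")],
        targets=[1.2, 3.4, 2.1],
        target_names=["band_gap"],
    )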

featurize(fast=False, db_file=None, n_jobs=None, drop_allnan=True)

For the input structures, construct many matminer features and save a featurized dataframe. If db_file is specified, this method will try to load previous feature calculations for each structure ID instead of recomputing.

Sets the self.df_featurized attribute.

Parameters:
  • fast (bool) – whether or not to load precomputed features from the Materials Project database. Please be sure to have provided the mp-ids in the MODData structure_ids keyword. Note: the database will be downloaded in this case and takes around 2 GB of space on your drive!

  • db_file – deprecated; do not use this anymore.

  • drop_allnan (bool) – if True, features that are entirely NaN will be removed.
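For example, continuing from the data object constructed above:

    # Populates data.df_featurized with matminer features for each material.
    data.featurize(n_jobs=2)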

feature_selection(n=1500, cross_nmi=None, use_precomputed_cross_nmi=False, n_samples=6000, drop_thr=0.2, n_jobs=None, ignore_names=[])

Compute the mutual information between features and targets, then apply relevance-redundancy rankings to choose the top n features.

Sets the self.optimal_features attribute to a list of feature names.

Parameters:
  • n (int) – number of desired features.

  • cross_nmi (Optional[pandas.DataFrame]) – specify the cross NMI between features as a dataframe.

  • use_precomputed_cross_nmi (bool) – whether or not to use the cross NMI that was computed on Materials Project features, instead of recomputing it on the current data.

  • n_jobs (int) – max. number of processes to use when calculating cross NMI.

  • ignore_names (List) – Optional list of property names to ignore during feature selection. Feature selection will be performed w.r.t. all properties except the ones in ignore_names.

  • drop_thr (float) – features whose information entropy is below this threshold will be dropped during the cross NMI computation (see get_cross_nmi()).
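For example, continuing from the featurized data object above:

    # Ranks features and sets data.optimal_features to the top 100 names.
    data.feature_selection(n=100, n_jobs=2)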

shuffle()
rebalance()

Rebalancing classification data by oversampling.

property structures: List[Union[pymatgen.core.Structure, CompositionContainer]]

Returns the list of pymatgen Structure objects.

property compositions: List[Union[pymatgen.core.Structure, CompositionContainer]]

Returns the list of materials as pymatgen Composition objects.

property targets: numpy.ndarray

Returns a ndarray of prediction targets.

property names: List[str]

Returns the list of prediction target field names.

property target_names: List[str]

Returns the list of prediction target field names.

property structure_ids: List[str]

Returns the list of structure IDs used to index the data.

save(filename)

Pickle the contents of the MODData object so that it can be loaded in with MODData.load().

If the filename ends in “tgz”, “bz2” or “zip”, the pickle will be compressed accordingly by pandas.to_pickle(...).

Parameters:

filename (str) –

static load(filename)

Load MODData object pickled by the .save(...) method.

If the filename ends in “tgz”, “bz2” or “zip”, the pickle will be decompressed accordingly by pandas.read_pickle(...).

Parameters:

filename (Union[str, Path]) –

Return type:

MODData
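A save/load round trip might look as follows (the filename is hypothetical; the “zip” extension triggers compression):

    data.save("my_moddata.zip")
    reloaded = MODData.load("my_moddata.zip")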

classmethod load_precomputed(dataset_name)

Load a MODData object from a pre-computed dataset.

Note

Datasets may require significant (~10 GB) amounts of memory to load.

Parameters:
  • dataset_name (str) – the name of the precomputed dataset to load. Currently available: ‘MP_2018.6’.

Returns:

the precomputed dataset.

Return type:

MODData
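For example, loading the currently available dataset:

    data = MODData.load_precomputed("MP_2018.6")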

get_structure_df()
get_target_df()
get_featurized_df()
get_optimal_descriptors()
get_optimal_df()
split(train_test_split)

Create two new MODData objects that contain only the data corresponding to the indices passed in the train_test_split tuple.

Parameters:

train_test_split (Tuple[List[int], List[int]]) – A tuple containing two lists of integers: the indices of the training data and test data respectively.

Returns:

The training MODData and the test MODData as a tuple.

Return type:

Tuple[MODData, MODData]
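For example, with hypothetical index lists (rows 0 and 1 for training, row 2 for testing):

    train, test = data.split(([0, 1], [2]))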

from_indices(indices)

Create a new MODData that contains only the data at the given row indices.

Parameters:

indices (List[int]) – The list of integers corresponding to the rows.

Returns:

A MODData containing only the rows passed.

Return type:

MODData
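For example, keeping only rows 0 and 2 (hypothetical indices):

    subset = data.from_indices([0, 2])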