modnet.preprocessing module¶
This module defines the MODData class, featurizer functions, and functions to compute normalized mutual information (NMI) and relevance-redundancy (RR) between descriptors.
- class modnet.preprocessing.CompositionContainer(composition)¶
Bases:
object
A simple compatibility wrapper class for structure-less pymatgen `Structure`s.
- modnet.preprocessing.compute_mi(x=None, y=None, x_name=None, y_name=None, random_state=None, n_neighbors=3)¶
- modnet.preprocessing.map_mi(kwargs)¶
- modnet.preprocessing.nmi_target(df_feat, df_target, task_type='regression', drop_constant_features=True, drop_duplicate_features=True, **kwargs)¶
Computes the Normalized Mutual Information (NMI) between a list of input features and a target variable.
- Parameters:
df_feat (pandas.DataFrame) – Dataframe containing the input features for which the NMI with the target variable is to be computed.
df_target (pandas.DataFrame) – Dataframe containing the target variable. This DataFrame should contain only one column and have the same size as df_feat.
task_type (integer) – 0 for regression, 1 for classification.
drop_constant_features (bool) – If True, the features that are constant across the entire data set will be dropped.
drop_duplicate_features (bool) – If True, the features that have exactly the same values across the entire data set will be dropped.
**kwargs – Keyword arguments to be passed down to the
mutual_info_regression()
function from scikit-learn. This can be useful e.g. for testing purposes.
- Returns:
- Dataframe containing the NMI between each of
the input features and the target variable.
- Return type:
pandas.DataFrame
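A minimal usage sketch (the toy dataframes and feature names are illustrative; random_state is simply forwarded through **kwargs to scikit-learn's mutual_info_regression, as described above):

```python
import numpy as np
import pandas as pd

from modnet.preprocessing import nmi_target

rng = np.random.default_rng(42)
x = rng.uniform(size=200)
df_feat = pd.DataFrame(
    {
        "informative": x,                # monotonically related to the target
        "noise": rng.uniform(size=200),  # unrelated to the target
    }
)
df_target = pd.DataFrame({"y": x**2})

nmi = nmi_target(df_feat, df_target, random_state=42)
print(nmi)  # one NMI value per feature; "informative" should score far above "noise"
```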
- modnet.preprocessing.get_cross_nmi(df_feat, drop_thr=0.2, return_entropy=False, n_jobs=None, **kwargs)¶
Computes the Normalized Mutual Information (NMI) between input features.
- Parameters:
df_feat (pandas.DataFrame) – Dataframe containing the input features for which the NMI is to be computed.
drop_thr (float) – Features having an information entropy (or self mutual information) below this threshold will be dropped.
return_entropy (bool) – If set to True, the information entropy of each feature is also returned.
**kwargs – Keyword arguments to be passed down to the
mutual_info_regression()
function from scikit-learn. This can be useful e.g. for testing purposes.
n_jobs (int) – max. number of processes to use when computing the cross NMI.
- Returns:
mutual_info: pandas.DataFrame containing the Normalized Mutual Information between features. If return_entropy=True, a tuple (mutual_info, diag) is returned instead, where diag is a dictionary with all features as keys and their information entropy as values.
- Return type:
pandas.DataFrame, or a tuple (mutual_info, diag) if return_entropy=True
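A short, hedged example on synthetic features (the feature names are made up; n_jobs=1 keeps everything in a single process):

```python
import numpy as np
import pandas as pd

from modnet.preprocessing import get_cross_nmi

rng = np.random.default_rng(0)
a = rng.normal(size=300)
df_feat = pd.DataFrame(
    {
        "a": a,
        "a_rescaled": 3.0 * a + 1.0,  # redundant copy of "a"
        "b": rng.normal(size=300),    # independent of "a"
    }
)

cross_nmi = get_cross_nmi(df_feat, drop_thr=0.2, n_jobs=1)
print(cross_nmi)  # symmetric feature-by-feature matrix; NMI("a", "a_rescaled") should be close to 1
```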
- modnet.preprocessing.get_rr_p_parameter_default(nn)¶
Returns p for the default expression outlined in arXiv:2004.14766.
- modnet.preprocessing.get_rr_c_parameter_default(nn)¶
Returns c for the default expression outlined in arXiv:2004.14766.
- modnet.preprocessing.get_features_relevance_redundancy(target_nmi, cross_nmi, n_feat=None, rr_parameters=None, return_pc=False)¶
Select features from the Relevance Redundancy (RR) score between the input features and the target output.
The RR is defined following Equation 2 of De Breuck et al., arXiv:2004.14766, with default values
\(p = \max\{0.1,\; 4.5 - n^{0.4}\}\)
and
\(c = 10^{-6}\, n^{3}\),
where \(n\) is the number of features in the “chosen” subset for that iteration. These values can be overridden with the rr_parameters dictionary argument.
- Parameters:
target_nmi (pandas.DataFrame) – dataframe containing the Normalized Mutual Information (NMI) between a list of input features and a target variable, as computed from nmi_target().
cross_nmi (pandas.DataFrame) – dataframe containing the NMI between the input features, as computed from get_cross_nmi().
n_feat (int) – Number of features for which the RR score needs to be computed (default: all features).
rr_parameters (dict) – Allows tuning of the p and c parameters. Currently allows fixing p and c to constant values instead of using the dynamical evaluation. Expects to find keys "p" and "c", each containing either a callable that takes n as an argument and returns the desired p or c, or another dictionary containing the key "value" that stores a constant value of p or c.
return_pc (bool) – Whether to return the p and c values in the output dictionaries.
- Returns:
List of dictionaries containing the results of the relevance-redundancy selection algorithm.
- Return type:
List[Dict]
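For illustration, a minimal sketch of the default \(p(n)\) and \(c(n)\) expressions quoted above, next to an rr_parameters dictionary that pins both to constants. The helper names p_default and c_default are hypothetical; the library's own defaults live in get_rr_p_parameter_default() and get_rr_c_parameter_default().

```python
# Default dynamical parameters from Eq. 2 of De Breuck et al., arXiv:2004.14766
# (sketch mirroring get_rr_p_parameter_default / get_rr_c_parameter_default):
def p_default(n: int) -> float:
    return max(0.1, 4.5 - n**0.4)


def c_default(n: int) -> float:
    return 1e-6 * n**3


# Passing callables keeps the dynamical evaluation...
rr_parameters = {"p": p_default, "c": c_default}

# ...while nested {"value": ...} dictionaries fix p and c to constants instead:
rr_parameters_fixed = {"p": {"value": 0.1}, "c": {"value": 1e-6}}
```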
- modnet.preprocessing.get_features_dyn(n_feat, cross_nmi, target_nmi)¶
- modnet.preprocessing.merge_ranked(lists)¶
For multiple lists of ranked feature names/IDs (e.g. for different targets), work through the lists and merge them such that each feature is included once according to its highest rank across each list.
- Parameters:
lists (List[List[Hashable]]) – the list of lists to merge.
- Returns:
list of merged and ranked feature names/IDs.
- Return type:
List[Hashable]
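A small, hedged example with made-up feature names (the exact interleaving of features that share a rank depends on the implementation):

```python
from modnet.preprocessing import merge_ranked

# Two per-target rankings; each feature keeps its highest rank and appears only once.
ranking_target_a = ["feat_mean_Z", "feat_density", "feat_volume"]
ranking_target_b = ["feat_density", "feat_band_center", "feat_mean_Z"]

merged = merge_ranked([ranking_target_a, ranking_target_b])
print(merged)
# Possible output: ['feat_mean_Z', 'feat_density', 'feat_band_center', 'feat_volume']
```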
- class modnet.preprocessing.MODData(materials=None, targets=None, target_names=None, structure_ids=None, num_classes=None, df_featurized=None, featurizer=None, structures=None)¶
Bases:
object
The MODData class takes a list of pymatgen
Structure
objects and creates a pandas.DataFrame
that contains many matminer features per structure. It then uses mutual information between features and targets, and between the features themselves, to perform feature selection using relevance-redundancy indices.
- df_structure¶
dataframe storing the pymatgen
Structure
representations for each structure, indexed by ID.
- Type:
pd.DataFrame
- df_targets¶
dataframe storing the prediction targets per structure, indexed by ID.
- Type:
pd.DataFrame
- df_featurized¶
dataframe with columns storing all computed features per structure, indexed by ID.
- Type:
pd.DataFrame
- optimal_features¶
if feature selection has been performed this attribute stores a list of the selected features.
- Type:
List[str]
- optimal_features_by_target¶
If feature selection has been performed this attribute stores a list of the selected features, broken down by target property.
- featurizer¶
the class used to featurize the data.
- Type:
MODFeaturizer
- cross_nmi¶
If feature selection has been performed, this attribute stores the normalized mutual information between all features.
- Type:
pd.DataFrame
- feature_entropy¶
Information entropy of all features. Only populated once the cross NMI has been computed.
- Type:
Dictionary
- num_classes¶
Defining the target types (classification or regression). Should be constructed as follows: key: string giving the target name; value: integer n, with n=0 for regression and n>=2 for classification with n the number of classes.
- Type:
Dictionary
Initialise the MODData object either from a list of structures or from an already featurized dataframe. Prediction targets per structure can be specified as lists or an array alongside their target names. A list of unique IDs can be provided to label the structures.
- Parameters:
materials (Optional[List[Union[Structure, Composition]]]) – list of structures or compositions to featurize and predict.
targets (Optional[Union[List[float], np.ndarray]]) – optional list of targets corresponding to each structure. When learning on multiple targets this is an ndarray where each column corresponds to a target, i.e. of shape (n_materials, n_targets).
target_names (Optional[Iterable]) – optional Iterable (e.g. list) of names of target properties to use in the dataframe.
structure_ids (Optional[Iterable]) – optional Iterable of unique IDs to use instead of generated integers.
num_classes (Optional[Dict[str, int]]) –
Dictionary defining the target types (classification or regression). Should be constructed as follows: key: string giving the target name; value: integer n,
with n=0 for regression and n>=2 for classification with n the number of classes.
df_featurized (Optional[pd.DataFrame]) – optional featurized dataframe to use instead of featurizing a new one. Should be passed without structures.
featurizer (Optional[Union[MODFeaturizer, str]]) – optional MODFeaturizer object to use for featurization, or string preset to look up in presets dictionary.
structures (Optional[List[Union[Structure, Composition]]]) – deprecated alias of materials, kept for backward compatibility; do not use this.
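A minimal construction sketch from compositions; the property name, IDs and target numbers below are purely illustrative, not real data:

```python
from pymatgen.core import Composition

from modnet.preprocessing import MODData

compositions = [Composition("SiO2"), Composition("GaAs"), Composition("NaCl")]
band_gaps = [8.9, 1.4, 5.0]  # placeholder values for illustration only

data = MODData(
    materials=compositions,
    targets=band_gaps,
    target_names=["gap"],
    structure_ids=["id-0", "id-1", "id-2"],
    num_classes={"gap": 0},  # 0 => regression (see num_classes above)
)
```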
- featurize(fast=False, db_file=None, n_jobs=None, drop_allnan=True)¶
For the input structures, construct many matminer features and save a featurized dataframe. If db_file is specified, this method will try to load previous feature calculations for each structure ID instead of recomputing.
Sets the self.df_featurized attribute.
- Parameters:
fast (bool) – whether or not to load pre-computed features from the Materials Project database. Please be sure to have provided the mp-ids in the MODData structure_ids keyword. Note: the database will be downloaded in this case and takes around 2 GB of space on your drive!
db_file – Deprecated. Do not use this anymore.
drop_allnan (bool) – if True, features that are fully NaNs will be removed.
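A hedged call, reusing the data object from the constructor example above (featurization relies on matminer and can take a while):

```python
# Compute matminer features for every material; drop all-NaN features (the default).
data.featurize(n_jobs=2, drop_allnan=True)

print(data.df_featurized.shape)        # (n_materials, n_features)
print(data.get_featurized_df().head())
```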
- feature_selection(n=1500, cross_nmi=None, use_precomputed_cross_nmi=False, n_samples=6000, drop_thr=0.2, n_jobs=None, ignore_names=[], random_state=None)¶
Compute the mutual information between features and targets, then apply relevance-redundancy rankings to choose the top n features.
Sets the self.optimal_features attribute to a list of feature names.
- Parameters:
n (int) – number of desired features.
cross_nmi (Optional[pandas.DataFrame]) – specify the cross NMI between features as a dataframe.
use_precomputed_cross_nmi (bool) – Whether or not to use the cross NMI that was computed on Materials Project features, instead of precomputing.
n_jobs (int) – max. number of processes to use when calculating cross NMI.
ignore_names (List) – Optional list of property names to ignore during feature selection. Feature selection will be performed w.r.t. all properties except the ones in ignore_names.
random_state (int) – Seed used to compute the NMI.
drop_thr (float) – information entropy threshold below which features are dropped (see get_cross_nmi()).
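A hedged sketch, again reusing the featurized data object from above; note that the three-row toy dataset used earlier is far too small for meaningful NMI estimates, so treat this purely as a syntax example and choose n according to your own feature budget:

```python
# Rank all features by relevance-redundancy and keep the best 50.
data.feature_selection(n=50, n_jobs=2, random_state=42)

print(data.optimal_features[:10])       # top-ranked feature names
print(data.optimal_features_by_target)  # per-target breakdown
```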
- shuffle()¶
- rebalance()¶
Rebalancing classification data by oversampling.
- property structures: List[Union[pymatgen.core.Structure, CompositionContainer]]¶
Returns the list of pymatgen
Structure
objects.
- property compositions: List[Union[pymatgen.core.Structure, CompositionContainer]]¶
Returns the list of materials as pymatgen
Composition
objects.
- property targets: numpy.ndarray¶
Returns a ndarray of prediction targets.
- save(filename)¶
Pickle the contents of the MODData object so that it can be loaded in with MODData.load().
If the filename ends in “tgz”, “bz2” or “zip”, the pickle will be compressed accordingly by pandas.to_pickle(...).
- Parameters:
filename (str) – the file path to save the object to.
- static load(filename)¶
Load a MODData object pickled by the .save(...) method.
If the filename ends in “tgz”, “bz2” or “zip”, the pickle will be decompressed accordingly by pandas.read_pickle(...).
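A round-trip sketch reusing the data object from the examples above; the filename is arbitrary, and the “tgz” suffix triggers compression as described:

```python
from modnet.preprocessing import MODData

data.save("my_moddata.pkl.tgz")  # compressed pickle via pandas.to_pickle
reloaded = MODData.load("my_moddata.pkl.tgz")

assert reloaded.df_targets.equals(data.df_targets)
```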
- classmethod load_precomputed(dataset_name)¶
Load a MODData object from a pre-computed dataset.
Note: Datasets may require significant (~10 GB) amounts of memory to load.
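A hedged example; the dataset name "MP_2018.6" is an assumption based on the pre-computed MODNet datasets and may differ for your installed version:

```python
from modnet.preprocessing import MODData

# Loads a pre-computed, featurized dataset; may need on the order of ~10 GB of memory.
mp_data = MODData.load_precomputed("MP_2018.6")  # dataset name assumed, not guaranteed
print(mp_data.df_featurized.shape)
```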
- get_structure_df()¶
- get_target_df()¶
- get_featurized_df()¶
- get_optimal_descriptors()¶
- get_optimal_df()¶
- split(train_test_split)¶
Create two new MODData objects that contain only the data corresponding to the indices passed in the
train_test_split
tuple.
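A hedged sketch, assuming split returns the two new MODData objects as a (train, test) pair and reusing the data object from the earlier examples:

```python
import numpy as np

# Hypothetical 80/20 split over the material indices.
n_samples = len(data.df_targets)
shuffled = np.random.default_rng(0).permutation(n_samples)
n_train = int(0.8 * n_samples)

train_data, test_data = data.split((shuffled[:n_train], shuffled[n_train:]))
print(len(train_data.df_targets), len(test_data.df_targets))
```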