wadi.mapper module

class wadi.mapper.MapperDict(dict=None, /, **kwargs)

Bases: UserDict

Instances of this class are meant to be used as dictionaries that translate names and units into their aliases. The class derives from the UserDict wrapper and extends a regular Python dict with class methods that create an instance from json files or PubChem queries. The dictionary can also be saved to a json file, and it provides translation methods for creating a dictionary that translates between languages (currently not working due to issues with Google Translate).

classmethod from_file(file_path)

This method initializes the class by reading a dictionary from a json file.

Parameters:

file_path (str) – The json file to be read.

Returns:

result – An instance of the UserDict class containing the imported data.

Return type:

class instance
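The pattern described above can be sketched with the standard library alone. This is a minimal illustration, not the WaDI source: the class name SketchMapperDict is a hypothetical stand-in for MapperDict, and the UTF-8 encoding is an assumption.

```python
import json
import os
import tempfile
from collections import UserDict


class SketchMapperDict(UserDict):
    """Hypothetical stand-in for MapperDict, for illustration only."""

    @classmethod
    def from_file(cls, file_path):
        # Read a JSON object from disk and wrap it in the class.
        with open(file_path, "r", encoding="utf-8") as f:
            return cls(json.load(f))


# Write a small alias file to a temporary folder, then load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "aliases.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"Chloride": "Cl", "Natrium": "Na"}, f)
    loaded = SketchMapperDict.from_file(path)

print(loaded["Natrium"])
```

Because the class method calls cls(...), a subclass that inherits from_file gets back an instance of itself rather than of the base class.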

classmethod default_dict(keys, values)

This method initializes the class by reading a dictionary from the file default_feature_map.json. The function arguments are the column names that should become the dictionary keys and values, respectively.

Parameters:
  • keys (str) – The name of the column whose values should become the dictionary keys.

  • values (str) – The name of the column whose values should become the dictionary values.

Returns:

result – An instance of the UserDict class containing the imported data.

Return type:

class instance
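As a rough illustration of turning two named columns into keys and values: the layout of default_feature_map.json is not described here, so the sketch assumes a list of row records. The column names "Name" and "Symbol" and the helper dict_from_columns are hypothetical.

```python
from collections import UserDict

# Hypothetical rows; the real default_feature_map.json layout may differ.
rows = [
    {"Name": "chloride", "Symbol": "Cl"},
    {"Name": "sodium", "Symbol": "Na"},
]


def dict_from_columns(rows, keys, values):
    """Build a mapping from the column `keys` to the column `values`."""
    return UserDict({row[keys]: row[values] for row in rows})


mapping = dict_from_columns(rows, keys="Name", values="Symbol")
reverse = dict_from_columns(rows, keys="Symbol", values="Name")
print(mapping["chloride"], reverse["Na"])
```

Swapping the two arguments yields the reverse mapping, which is why both column names are passed explicitly.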

classmethod _create_hgc_units_dict()

This method initializes the class by reading the csv files in the folder hgc_constants, which contain the feature names and their target units.

Returns:

result – An instance of the UserDict class containing the imported data.

Return type:

class instance

classmethod pubchem_cas_dict(strings, src_lang=None, max_attempts=10)

This method creates a dictionary with CAS numbers for the names in ‘strings’ using the PubChem REST API.

Parameters:
  • strings (list or list-like) – A list of the names that should be looked up.

  • src_lang (str, optional) – String that specifies the original language of the names in ‘strings’. Default: None.

  • max_attempts (int, optional) – The maximum number of attempts to connect to the Google Translate API. Default: 10.

Returns:

result – An instance of the UserDict class in which the elements of ‘strings’ are the keys and the CAS numbers the values. An empty dictionary is returned if translation of the strings failed for some reason.

Return type:

class instance

classmethod pubchem_cid_dict(strings, src_lang=None, max_attempts=10)

This method creates a dictionary with CID numbers for the names in ‘strings’ using the PubChem REST API.

Parameters:
  • strings (list or list-like) – A list of the names that should be looked up.

  • src_lang (str, optional) – String that specifies the original language of the names in ‘strings’. Default: None.

  • max_attempts (int, optional) – The maximum number of attempts to connect to the Google Translate API. Default: 10.

Returns:

result – An instance of the UserDict class in which the elements of ‘strings’ are the keys and the CID numbers the values. An empty dictionary is returned if translation of the strings failed for some reason.

Return type:

class instance

classmethod translation_dict(strings, src_lang='NL', dst_lang='EN', max_attempts=10)

This method attempts to create a mapping dictionary with ‘strings’ being the keys and their translations being the values.

Parameters:
  • strings (list) – List with the strings to translate.

  • src_lang (str, optional) – String that specifies the language to translate from. Default: “NL”.

  • dst_lang (str, optional) – String that specifies the language to translate to. Default: “EN”.

  • max_attempts (int, optional) – The maximum number of attempts to connect to the Google Translate API. Default: 10.
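The role of max_attempts can be sketched as a simple retry loop. The sketch below is an assumption about the general pattern, not the WaDI implementation: the translate callable is hypothetical, and returning an empty dictionary after the last failed attempt mirrors what is documented for the PubChem class methods, which may not hold for this method.

```python
def translate_with_retries(strings, translate, max_attempts=10):
    """Map each string to its translation, retrying on connection errors.

    `translate` is a hypothetical callable that takes a list of strings
    and returns a list of translations in the same order.
    """
    for _ in range(max_attempts):
        try:
            return dict(zip(strings, translate(strings)))
        except ConnectionError:
            continue  # try again until the attempts run out
    return {}  # assumption: give up with an empty mapping


# Fake translator that fails twice before succeeding.
calls = {"n": 0}


def flaky_translate(strings):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("service unavailable")
    lookup = {"ijzer": "iron", "zuurstof": "oxygen"}
    return [lookup[s] for s in strings]


def always_down(strings):
    raise ConnectionError("service unavailable")


result = translate_with_retries(["ijzer", "zuurstof"], flaky_translate)
failed = translate_with_retries(["ijzer"], always_down, max_attempts=2)
print(result, failed)
```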

to_file(file_path)

This method saves the contents of the dictionary as a json file.

Parameters:

file_path (str) – The json file to be written.
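A minimal sketch of saving a UserDict subclass as json, assuming the contents are json-serializable. The class name SketchMapperDict is a hypothetical stand-in for MapperDict, and the indentation and encoding settings are illustrative choices.

```python
import json
import os
import tempfile
from collections import UserDict


class SketchMapperDict(UserDict):
    """Hypothetical stand-in for MapperDict, for illustration only."""

    def to_file(self, file_path):
        # UserDict keeps its contents in the plain dict `self.data`.
        with open(file_path, "w", encoding="utf-8") as f:
            json.dump(self.data, f, indent=2)


d = SketchMapperDict({"Chloride": "Cl"})
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "out.json")
    d.to_file(path)
    # Read the file back to check the round trip.
    with open(path, encoding="utf-8") as f:
        restored = json.load(f)

print(restored)
```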

translate_keys(src_lang='NL', dst_lang='EN', max_attempts=10)

This method attempts to translate the dictionary keys from src_lang to dst_lang using Google Translate.

Parameters:
  • src_lang (str, optional) – String that specifies the language to translate from. Default: “NL”.

  • dst_lang (str, optional) – String that specifies the language to translate to. Default: “EN”.

  • max_attempts (int, optional) – The maximum number of attempts to connect to the Google Translate API. Default: 10.

class wadi.mapper.Mapper(key_1)

Bases: WadiBaseClass

WaDI class that implements the operations to map feature names and units to their aliases.

__init__(key_1)

Class initialization method.

Parameters:

key_1 (str) – The value of key_1 is either ‘name’ or ‘unit’ depending on what needs to be mapped. Note that these names must correspond to their respective keys in the dict_1 elements of the InfoTable, hence the attribute name ‘key_1’.

__call__(m_dict=None, match_method=None, regex_map=<wadi.unitconverter.UnitRegexMapper object>, replace_strings=None, remove_strings=None)

This method provides an interface for the user to set the attributes that determine the Mapper object behavior.

Parameters:
  • m_dict (MapperDict or dict) – The dictionary that will map the names to their alias.

  • match_method (str or list) – One or more names of the match method(s) to be used to find feature and unit names. Valid values include ‘exact’, ‘ascii’, ‘regex’, ‘fuzzy’, ‘pubchem’. Default is ‘exact’ for name mapping or ‘regex’ for unit mapping.

  • regex_map (UnitRegexMapper) – UnitRegexMapper object to be used for mapping when match_method = ‘regex’.

  • replace_strings (dict) – Dictionary for searching and replacing string values in the feature names or units. The keys are the search strings and the values the replacement strings. Default: DEFAULT_STR2REPLACE.

  • remove_strings (list) – List of strings that need to be deleted from the feature names or units. Default: DEFAULT_STR2REMOVE.
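How replace_strings and remove_strings might act on a name or unit can be sketched as plain string operations. The helper clean_string is hypothetical; the order of the two steps, the final whitespace strip, and the example values are assumptions, and the contents of DEFAULT_STR2REPLACE and DEFAULT_STR2REMOVE are not shown here.

```python
def clean_string(s, replace_strings=None, remove_strings=None):
    """Apply search-and-replace pairs, then delete unwanted substrings."""
    for search, replacement in (replace_strings or {}).items():
        s = s.replace(search, replacement)
    for unwanted in remove_strings or []:
        s = s.replace(unwanted, "")
    return s.strip()


cleaned = clean_string(
    "Fe, total [mg/L]",
    replace_strings={"mg/L": "mg/l"},
    remove_strings=[", total"],
)
print(cleaned)
```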

_df2excel(df)

This method creates an ExcelWriter instance that will either append a worksheet to an existing Excel file (for example when units are mapped after names) or create a new Excel file when it does not yet exist (when mapping is performed for the first time). Any sheets in an already-existing file will get overwritten through the use of if_sheet_exists='replace'.

Parameters:

df (DataFrame) – The DataFrame to be saved to the Excel file.

Raises:

FileNotFoundError – When the file exists but has a size of zero bytes. This error is caught internally and the ‘mode’ attribute is changed from ‘a’ to ‘w’ to overwrite the existing file.

_match_exact(strings, m_dict)

This method returns the values of the items in m_dict for the elements in ‘strings’ that match exactly with a key.

Parameters:
  • strings (list) – A list with strings to be matched to the keys in m_dict.

  • m_dict (dict) – Dictionary to look up the elements of ‘strings’ in.

Returns:

result – A nested list which for each element in ‘strings’ contains a two-item list, the first element being the key that was matched, the second element being the corresponding value from m_dict.

Return type:

list
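The documented return shape can be illustrated with a dictionary lookup. The function below is a sketch, not the WaDI source; in particular, returning [None, None] for strings without a matching key is an assumption.

```python
def match_exact(strings, m_dict):
    """Return [key, value] for each string that is a key in m_dict."""
    # Assumption: strings without a match yield [None, None].
    return [[s, m_dict[s]] if s in m_dict else [None, None] for s in strings]


result = match_exact(["Chloride", "Unknown"], {"Chloride": "Cl"})
print(result)
```

Keeping one two-item list per input string means the result stays aligned with ‘strings’ by position.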

_match_regex(strings)

This method returns the strings produced by the RegexMapper, which tries to match and parse the elements in ‘strings’ using a regular expression.

Parameters:

strings (list) – A list with strings to be parsed.

Returns:

result – A nested list which for each element in ‘strings’ contains a two-item list, the first element being the string produced by the RegexMapper, the second element being None.

Return type:

list

_match_fuzzy(strings, m_dict)

This method returns the values in m_dict for the keys that match the elements in ‘strings’ with a fuzzy score above a certain threshold. The score is calculated by fuzzywuzzy’s extractOne method.

Parameters:
  • strings (list) – A list with strings to be matched to the keys in m_dict.

  • m_dict (dict) – Dictionary to look up the elements of ‘strings’ in.

Returns:

result – A nested list which for each element in ‘strings’ contains a two-item list, the first element being the key that was matched (along with the score), the second element being the corresponding value from m_dict.

Return type:

list
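Threshold-based fuzzy matching can be sketched without third-party packages. Note that this sketch swaps in difflib.SequenceMatcher from the standard library in place of fuzzywuzzy’s extractOne, so its scores will differ from those WaDI computes. The threshold of 80, the lowercasing, the score formatting and the [None, None] result for unmatched strings are all assumptions.

```python
from difflib import SequenceMatcher


def match_fuzzy(strings, m_dict, min_score=80):
    """Return the best-scoring key and its value for each string."""
    results = []
    for s in strings:
        best_key, best_score = None, 0
        for key in m_dict:
            # Similarity ratio scaled to 0..100 (difflib, not fuzzywuzzy).
            ratio = SequenceMatcher(None, s.lower(), key.lower()).ratio()
            score = round(100 * ratio)
            if score > best_score:
                best_key, best_score = key, score
        if best_key is not None and best_score >= min_score:
            results.append([f"{best_key} (score: {best_score})", m_dict[best_key]])
        else:
            results.append([None, None])  # assumption for unmatched strings
    return results


results = match_fuzzy(["Chlorid", "xyz"], {"Chloride": "Cl", "Sodium": "Na"})
print(results)
```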

_match_pubchem(strings)

This method tries to look up the first compound returned by a call to PubChem’s autocomplete API and its synonym.

Parameters:

strings (list) – A list with strings for which to look up the PubChem compound and synonym.

Returns:

result – A nested list which for each element in ‘strings’ contains a two-item list, the first element being the first compound returned by the PubChem autocomplete API, the second element being the corresponding synonym.

Return type:

list

_execute(columns, strings)

This method calls the match methods specified by the user and saves a summary of the results in an Excel file. The results are passed back to WaDI’s DataObject in a dictionary.

Parameters:
  • columns (list) – A list with the names of the features (for ‘stacked’ data) or columns (for ‘wide’ data). In both cases they correspond to the ‘key_0’ items in the InfoTable.

  • strings (list) – A list with strings to be matched. These can be (feature) names or units.

Returns:

result – A dictionary with the aliases (for names) or parsed strings (for units) for each element in ‘columns’.

Return type:

dict

match(strings)