Creating mapping dictionaries

When mapping feature names to their aliases, WaDI uses a dictionary in which the keys are the feature names and the values are the corresponding aliases. There are several ways to create a mapping dictionary, which are demonstrated in the examples below.
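
Conceptually, a mapping dictionary behaves like a lookup table from feature names to aliases. The minimal sketch below uses a plain Python dict as a stand-in and is not WaDI's own dictionary class. The helper apply_mapping and the feature name onbekend are hypothetical, and writing the CIDs as strings is an assumption.

```python
# Plain-dict stand-in for a mapping dictionary (not the WaDI class itself).
# Keys are feature names, values are aliases (here: PubChem CIDs as strings,
# which is an assumption about their type).
feature_map = {
    "chloorbenzeen": "7964",
    "naftaleen": "931",
}


def apply_mapping(names, mapping):
    """Return the alias for each name, or None when the name is unknown."""
    return [mapping.get(name) for name in names]


# 'onbekend' is a made-up name that does not occur in the dictionary.
aliases = apply_mapping(["chloorbenzeen", "naftaleen", "onbekend"], feature_map)
print(aliases)
```

Names that do not occur in the dictionary yield None, so they are easy to spot before any mapping is applied.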

Using the default dictionary

The first option is to use the default dictionary, which is bundled with WaDI and was developed by Martin van der Schans at KWR. It contains a curated collection of feature names, CAS numbers and PubChem Compound Identifiers (CIDs) for a large number of chemical substances. The dictionary is stored in the file default_feature_map.json. Specifically, it contains substance names or identifiers according to the following systems:

  • SIKB

  • NVWA

  • REWAB

  • HGC

  • PubChem CIDs

  • CAS numbers

In addition, the dictionary contains English and Dutch synonyms for substances that have multiple names. The default dictionary is created with the following command:

In [1]: import wadi as wd

In [2]: names_dict = wd.mapper.MapperDict.default_dict('REWAB', 'ValidCid')

As can be inferred from the arguments passed to the default_dict function, the names according to the REWAB system are used to search for features, and they are converted to PubChem CIDs. Information about the dictionary contents can be obtained by printing it to the screen:

In [3]: print(names_dict)
This dictionary contains 2413 elements.
Only the first 10 elements are shown.
This mapping dictionary contains the following names and their aliases:
 - borneol --> 10049
 - flurenol --> 10087
 - chloorfenvinfos (alfa-) --> 10107
 - chloorfenvinfos (beta-) --> 10107
 - cis-chloorfenvinfos --> 10107
 - trans-chloorfenvinfos --> 10107
 - halfenprox --> 10140464
 - ethyl-2-pentenoaat --> 102263
 - binapacryl --> 10234
 - amisulbrom --> 10238657
 - dimethylbenzylether --> 102408100
 - fluorenon --> 10241

Note that there are four synonyms for chloorfenvinfos that will all map to the same CID (10107).
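
To find out which aliases are shared by several names, the dictionary can be inverted. The sketch below works on a plain dict that copies some of the entries printed above. It assumes the CIDs can be treated as strings and does not use the WaDI dictionary class itself.

```python
from collections import defaultdict

# Entries copied from the printed listing above; a plain dict is used here
# as a stand-in for the WaDI mapping dictionary.
listing = {
    "borneol": "10049",
    "chloorfenvinfos (alfa-)": "10107",
    "chloorfenvinfos (beta-)": "10107",
    "cis-chloorfenvinfos": "10107",
    "trans-chloorfenvinfos": "10107",
    "binapacryl": "10234",
}

# Invert the mapping: collect all names that point to the same CID.
synonyms_by_cid = defaultdict(list)
for name, cid in listing.items():
    synonyms_by_cid[cid].append(name)

# Keep only the CIDs that are shared by more than one name.
shared = {cid: names for cid, names in synonyms_by_cid.items() if len(names) > 1}
print(shared)
```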

The use of mapping dictionaries is demonstrated with the spreadsheet file mapping_example.xlsx. The following code tells WaDI which columns to look for in the spreadsheet file in order to import the data. For the purpose of this demonstration, the directory with the data files is located with get_data_dir:

In [4]: from wadi.documentation_helpers import get_data_dir

In [5]: DATA_DIRECTORY = get_data_dir()

# Create an instance of a WaDI DataObject, specify the log file name
In [6]: wdo = wd.DataObject(log_fname='mapping_example.log', silent=True)

# Import the data. The 'c_dict' dictionary specifies the column names
# for the sample identifiers, feature names, concentrations and units.
In [7]: wdo.file_reader(DATA_DIRECTORY / 'mapping_example.xlsx',
   ...:     format='stacked',
   ...:     c_dict={'SampleId': 'sample_code',
   ...:             'Features': 'parameter',
   ...:             'Units': 'dimensie',
   ...:             'Values': 'waarde',
   ...:     },
   ...: )
   ...: 

Let’s print the names of the imported features and create a for loop to see what aliases will be used if a name appears in the dictionary.

In [8]: names = wdo.get_imported_dataframe()["parameter"]

In [9]: print(names)
0    chloorbenzeen
1      carbendazim
2        naftaleen
3            koper
4           pyreen
Name: parameter, dtype: object

In [10]: for name in names:
   ....:     print(name, "will be mapped to", names_dict.get(name))
   ....: 
chloorbenzeen will be mapped to 7964
carbendazim will be mapped to 25429
naftaleen will be mapped to 931
koper will be mapped to 23978
pyreen will be mapped to 31423

To actually perform the mapping, a name_map must be added to the WaDI DataObject wdo. The result of the mapping operation can be inspected by calling get_converted_dataframe and printing the returned DataFrame:

In [11]: wdo.name_map(m_dict=names_dict)

In [12]: df = wdo.get_converted_dataframe()

In [13]: print(df.head())
               7964 25429    931 23978 31423
               µg/l  µg/l   µg/l  µg/l  µg/l
sample_code                                 
1            < 0.01  0.01  0.043  3.97  0.04

Instead of the original feature names, the column names in the converted DataFrame are now the PubChem CIDs.

Querying PubChem for CIDs

The above example works for features that are contained in WaDI’s default dictionary. When importing features that are not in there, CIDs can be looked up directly in the online PubChem database by creating a mapping dictionary with the pubchem_cid_dict function. The first argument of this function is a list of strings, in this case names, which holds the original feature names. For each of the strings, WaDI tries to obtain the CID by contacting the PubChem online database.

Because PubChem uses English names, feature names in another language, in this case Dutch, must be translated first. The source language can therefore be specified with the src_lang argument, and WaDI will use the Google Translate API to determine the English feature name. However, translations may be unreliable and may not yield the desired result. In this example, the feature name koper is Dutch for the element copper, but Google Translate returns the English word buyer, which is another, equally valid, meaning of the Dutch word koper (see Creating a translation dictionary). The user should therefore proceed with extreme caution when using this functionality!

In [14]: names_dict = wd.mapper.MapperDict.pubchem_cid_dict(names, src_lang="NL")

In [15]: print(names_dict)
This dictionary contains 5 elements.
This mapping dictionary contains the following names and their aliases:
 - chloorbenzeen --> 7964
 - carbendazim --> 25429
 - naftaleen --> 931
 - koper --> None
 - pyreen --> 31423


In [16]: wdo.name_map(m_dict=names_dict)

In [17]: df = wdo.get_converted_dataframe()

In [18]: print(df.head())
               7964 25429    931 23978 31423
               µg/l  µg/l   µg/l  µg/l  µg/l
sample_code                                 
1            < 0.01  0.01  0.043  3.97  0.04

As in the previous example, the column names are now the PubChem CIDs, except for copper (koper), for which no CID was found.
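
Entries for which no CID could be found have the value None, as for koper above. The sketch below lists such entries and fills them in by hand. It assumes that the dictionary contents can be represented as a plain dict with the CIDs as strings. The variable names are illustrative, and the CID 23978 for koper is the value returned by the default dictionary in the first example.

```python
# Plain-dict copy of the lookup result printed above (CIDs written as strings
# here, which is an assumption about their type).
lookup_result = {
    "chloorbenzeen": "7964",
    "carbendazim": "25429",
    "naftaleen": "931",
    "koper": None,
    "pyreen": "31423",
}

# Names for which the lookup returned nothing.
unresolved = [name for name, cid in lookup_result.items() if cid is None]
print(unresolved)  # ['koper']

# Manually verified values for the unresolved names.
manual_cids = {"koper": "23978"}

# Keep every CID that was found; use the manual value only where it is missing.
completed = {
    name: cid if cid is not None else manual_cids.get(name)
    for name, cid in lookup_result.items()
}
print(completed)
```

Filling the gaps explicitly keeps a record of which aliases came from the online lookup and which were supplied by the user.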

Querying PubChem for CAS numbers

Just as the function pubchem_cid_dict can be used to look up CIDs, the function pubchem_cas_dict can be used to look up CAS numbers in PubChem.

In [19]: names_dict = wd.mapper.MapperDict.pubchem_cas_dict(names, src_lang="NL")

In [20]: print(names_dict)
This dictionary contains 5 elements.
This mapping dictionary contains the following names and their aliases:
 - chloorbenzeen --> 108-90-7
 - carbendazim --> 10605-21-7
 - naftaleen --> 91-20-3
 - koper --> None
 - pyreen --> 129-00-0


In [21]: wdo.name_map(m_dict=names_dict)

In [22]: df = wdo.get_converted_dataframe()

In [23]: print(df.head())
               7964 25429    931 23978 31423
               µg/l  µg/l   µg/l  µg/l  µg/l
sample_code                                 
1            < 0.01  0.01  0.043  3.97  0.04

The column names are now the CAS numbers that could be retrieved. When no CAS number could be determined (such as for koper), the original feature name is retained as the column heading.
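
The fallback rule described above can be illustrated with a plain dict. This is a sketch of the described behavior only, not WaDI's implementation, and the helper column_heading is hypothetical.

```python
# Plain-dict copy of the CAS lookup result printed above.
cas_lookup = {
    "chloorbenzeen": "108-90-7",
    "carbendazim": "10605-21-7",
    "naftaleen": "91-20-3",
    "koper": None,
    "pyreen": "129-00-0",
}


def column_heading(name, mapping):
    """Return the alias for name, or name itself when no alias is available."""
    alias = mapping.get(name)
    return name if alias is None else alias


headings = [column_heading(name, cas_lookup) for name in cas_lookup]
print(headings)
```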

Creating a translation dictionary

In the previous examples, the translation of the original feature names to English was done internally by WaDI. This functionality can also be used to create a mapping dictionary that translates feature names from one language into another. The function that creates this dictionary is translation_dict, which is demonstrated in the following code snippet:

In [24]: names_dict = wd.mapper.MapperDict.translation_dict(names,
   ....:     src_lang="NL",
   ....:     dst_lang="EN",
   ....: )
   ....: 

In [25]: print(names_dict)
This dictionary contains 5 elements.
This mapping dictionary contains the following names and their aliases:
 - chloorbenzeen --> chlorobenzene
 - carbendazim --> carbendazim
 - naftaleen --> naphthalene
 - koper --> buyer
 - pyreen --> pyrene


In [26]: wdo.name_map(m_dict=names_dict)

In [27]: df = wdo.get_converted_dataframe()

In [28]: print(df.head())
               7964 25429    931 23978 31423
               µg/l  µg/l   µg/l  µg/l  µg/l
sample_code                                 
1            < 0.01  0.01  0.043  3.97  0.04

The new column names are now the English names that Google Translate provided. The Dutch feature name koper has been translated to buyer, which strictly speaking is correct but is obviously not the desired result from a chemical point of view. Future versions of WaDI will incorporate a better translation service when it becomes available. Until then, the user must proceed with extreme caution when using the WaDI features that require translation.
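
One way to exercise this caution is to review the translations and override the incorrect ones by hand. The sketch below does this with plain dicts. The corrections dictionary is a hypothetical, user-maintained list, and whether WaDI accepts the resulting plain dict in name_map is not confirmed here.

```python
# Plain-dict copy of the translation result printed above.
translated = {
    "chloorbenzeen": "chlorobenzene",
    "carbendazim": "carbendazim",
    "naftaleen": "naphthalene",
    "koper": "buyer",
    "pyreen": "pyrene",
}

# Manually verified translations that should take precedence.
corrections = {"koper": "copper"}

# Later entries win when dicts are merged, so the corrections override
# the automatic translations.
reviewed = {**translated, **corrections}

# Record which automatic translations were actually changed.
changed = sorted(
    name for name, value in corrections.items() if translated.get(name) != value
)
print(reviewed)
print(changed)
```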