Creating mapping dictionaries
To map feature names to their aliases, WaDI uses a dictionary in which the keys are the feature names and the values are the corresponding aliases. There are several ways to create a mapping dictionary, which are demonstrated in the examples below.
Using the default dictionary
The first option is to use the default dictionary, which is bundled with WaDI and was developed by Martin van der Schans at KWR. It contains a curated collection of feature names, CAS numbers and PubChem Compound Identifiers (CIDs) for a large number of chemical substances. The dictionary is stored in the file default_feature_map.json. Specifically, it contains substance names and identifiers according to the following systems:
SIKB
NVWA
REWAB
HGC
PubChem CIDs
CAS numbers
In addition to this, there are also English and Dutch synonyms for substances that have multiple names. Creating the default dictionary is done with the following command:
In [1]: import wadi as wd
In [2]: names_dict = wd.mapper.MapperDict.default_dict('REWAB', 'ValidCid')
As can be inferred from the arguments passed to the default_dict function, the names according to the REWAB system are used to search for features, and each match is converted to a PubChem CID. Information about the dictionary contents can be obtained by printing it to the screen.
In [3]: print(names_dict)
This dictionary contains 2413 elements.
Only the first 10 elements are shown.
This mapping dictionary contains the following names and their aliases:
- borneol --> 10049
- flurenol --> 10087
- chloorfenvinfos (alfa-) --> 10107
- chloorfenvinfos (beta-) --> 10107
- cis-chloorfenvinfos --> 10107
- trans-chloorfenvinfos --> 10107
- halfenprox --> 10140464
- ethyl-2-pentenoaat --> 102263
- binapacryl --> 10234
- amisulbrom --> 10238657
- dimethylbenzylether --> 102408100
- fluorenon --> 10241
Note that there are four synonyms for chloorfenvinfos that will all map to the same CID (10107).
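The many-to-one relation between names and aliases can be illustrated without WaDI. The following is a minimal sketch that assumes the mapping behaves like an ordinary Python dict; the variable names excerpt and by_cid are chosen for this illustration only, and the entries are copied from the printout above.

```python
from collections import defaultdict

# A few entries copied from the printed default dictionary (name -> CID).
# A plain dict is assumed here as a stand-in for the WaDI mapping dictionary.
excerpt = {
    "borneol": 10049,
    "flurenol": 10087,
    "chloorfenvinfos (alfa-)": 10107,
    "chloorfenvinfos (beta-)": 10107,
    "cis-chloorfenvinfos": 10107,
    "trans-chloorfenvinfos": 10107,
}

# Invert the mapping so that each CID lists all names that map to it.
by_cid = defaultdict(list)
for name, cid in excerpt.items():
    by_cid[cid].append(name)

# Report only the CIDs that are shared by more than one name.
for cid, synonyms in by_cid.items():
    if len(synonyms) > 1:
        print(cid, "<--", ", ".join(synonyms))
```

Inverting the dictionary in this way is a quick check for features that will end up under the same alias after mapping.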
The use of mapping dictionaries is demonstrated with the spreadsheet file mapping_example.xlsx. The following code tells WaDI which columns to look for in the spreadsheet file in order to import the data. For the purposes of this documentation, the directory containing the data files is located with get_data_dir.
In [4]: from wadi.documentation_helpers import get_data_dir
In [5]: DATA_DIRECTORY = get_data_dir()
# Create an instance of a WaDI DataObject, specify the log file name
In [6]: wdo = wd.DataObject(log_fname='mapping_example.log', silent=True)
# Import the data. The 'c_dict' dictionary specifies the column names
# for the sample identifiers, feature names, concentrations and units.
In [7]: wdo.file_reader(DATA_DIRECTORY / 'mapping_example.xlsx',
...:                 format='stacked',
...:                 c_dict={'SampleId': 'sample_code',
...:                         'Features': 'parameter',
...:                         'Units': 'dimensie',
...:                         'Values': 'waarde',
...:                         },
...:                 )
...:
Let’s print the names of the imported features and create a for loop to see what aliases will be used if a name appears in the dictionary.
In [8]: names = wdo.get_imported_dataframe()["parameter"]
In [9]: print(names)
0 chloorbenzeen
1 carbendazim
2 naftaleen
3 koper
4 pyreen
Name: parameter, dtype: object
In [10]: for name in names:
....:     print(name, "will be mapped to", names_dict.get(name))
....:
chloorbenzeen will be mapped to 7964
carbendazim will be mapped to 25429
naftaleen will be mapped to 931
koper will be mapped to 23978
pyreen will be mapped to 31423
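Before mapping, it can be useful to see which imported names have no entry in the dictionary at all. The sketch below is a standalone illustration that uses a plain Python dict as a stand-in for the mapping dictionary (an assumption); the feature name "onbekende stof" is hypothetical and was added only to show a name without a match.

```python
# Plain dict standing in for the mapping dictionary (values from the output above).
lookup = {
    "chloorbenzeen": 7964,
    "carbendazim": 25429,
    "naftaleen": 931,
    "koper": 23978,
    "pyreen": 31423,
}

# "onbekende stof" is a hypothetical feature name that is not in the dictionary.
feature_names = ["chloorbenzeen", "koper", "onbekende stof"]

# Split the names into those with an alias and those without one.
matched = {n: lookup[n] for n in feature_names if n in lookup}
unmatched = [n for n in feature_names if n not in lookup]

print("matched:", matched)
print("unmatched:", unmatched)
```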
To actually perform the mapping, a name_map must be added to the WaDI DataObject wdo. By calling get_converted_dataframe and printing the returned DataFrame, the result of the mapping operation can be inspected.
In [11]: wdo.name_map(m_dict=names_dict)
In [12]: df = wdo.get_converted_dataframe()
In [13]: print(df.head())
7964 25429 931 23978 31423
µg/l µg/l µg/l µg/l µg/l
sample_code
1 < 0.01 0.01 0.043 3.97 0.04
Instead of the original feature names, the column names in the converted DataFrame are now the PubChem CIDs.
Querying PubChem for CIDs
The above example works for features that are contained in WaDI’s default database. When importing features that are not in there, CIDs can be looked up directly in the online PubChem database by creating a mapping dictionary with the pubchem_cid_dict function. The first argument for this function is a list of strings, in this case the list names with the original feature names. For each of the strings, WaDI tries to obtain the CID by contacting the PubChem online database. Because PubChem uses English names, translation is necessary for feature names in another language, in this case Dutch. Therefore the source language may be specified with the src_lang argument. WaDI will use the Google Translate API to determine the English feature name. However, translations may be unreliable and may not yield the desired result. In this example, the feature name koper is Dutch for the element copper but Google Translate finds the English word buyer, which is another, equally valid, meaning of the Dutch word koper (see Creating a translation dictionary).
The user should therefore proceed with extreme caution when using this functionality!
In [14]: names_dict = wd.mapper.MapperDict.pubchem_cid_dict(names, src_lang="NL")
In [15]: print(names_dict)
This dictionary contains 5 elements.
This mapping dictionary contains the following names and their aliases:
- chloorbenzeen --> 7964
- carbendazim --> 25429
- naftaleen --> 931
- koper --> None
- pyreen --> 31423
In [16]: wdo.name_map(m_dict=names_dict)
In [17]: df = wdo.get_converted_dataframe()
In [18]: print(df.head())
7964 25429 931 23978 31423
µg/l µg/l µg/l µg/l µg/l
sample_code
1 < 0.01 0.01 0.043 3.97 0.04
As in the previous example, the column names are now the PubChem CIDs, except for copper (koper), for which no CID could be retrieved.
Querying PubChem for CAS numbers
Just like the function pubchem_cid_dict can be used to look up CIDs, the function pubchem_cas_dict can be invoked to look up CAS numbers in PubChem.
In [19]: names_dict = wd.mapper.MapperDict.pubchem_cas_dict(names, src_lang="NL")
In [20]: print(names_dict)
This dictionary contains 5 elements.
This mapping dictionary contains the following names and their aliases:
- chloorbenzeen --> 108-90-7
- carbendazim --> 10605-21-7
- naftaleen --> 91-20-3
- koper --> None
- pyreen --> 129-00-0
In [21]: wdo.name_map(m_dict=names_dict)
In [22]: df = wdo.get_converted_dataframe()
In [23]: print(df.head())
7964 25429 931 23978 31423
µg/l µg/l µg/l µg/l µg/l
sample_code
1 < 0.01 0.01 0.043 3.97 0.04
The column names are now the CAS numbers that could be retrieved. When no CAS number could be determined (such as for koper), the original feature name is retained as the column heading.
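The fallback described above can be sketched in a few lines of plain Python. This is an illustration of the described behavior under stated assumptions, not WaDI's actual implementation: a plain dict stands in for the mapping dictionary, and the helper column_heading is a hypothetical name introduced for this example.

```python
# Aliases as printed above; None marks a name for which no CAS number was found.
cas_lookup = {
    "chloorbenzeen": "108-90-7",
    "carbendazim": "10605-21-7",
    "naftaleen": "91-20-3",
    "koper": None,
    "pyreen": "129-00-0",
}


def column_heading(name, mapping):
    """Return the alias for name, or name itself when no alias is available."""
    alias = mapping.get(name)
    return name if alias is None else alias


# Build the headings in the order of the original feature names.
headings = [column_heading(n, cas_lookup) for n in cas_lookup]
print(headings)
```

Testing explicitly for None, rather than for truthiness, keeps aliases such as 0 or an empty string from being mistaken for a failed lookup.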
Creating a translation dictionary
In the previous examples, the translation of the original feature names to English was done internally by WaDI. This functionality can also be used to create a mapping dictionary that translates feature names from one language into another. The function to create this dictionary is translation_dict and is demonstrated in the following code snippet.
In [24]: names_dict = wd.mapper.MapperDict.translation_dict(names,
....:     src_lang="NL",
....:     dst_lang="EN",
....: )
....:
In [25]: print(names_dict)
This dictionary contains 5 elements.
This mapping dictionary contains the following names and their aliases:
- chloorbenzeen --> chlorobenzene
- carbendazim --> carbendazim
- naftaleen --> naphthalene
- koper --> buyer
- pyreen --> pyrene
In [26]: wdo.name_map(m_dict=names_dict)
In [27]: df = wdo.get_converted_dataframe()
In [28]: print(df.head())
7964 25429 931 23978 31423
µg/l µg/l µg/l µg/l µg/l
sample_code
1 < 0.01 0.01 0.043 3.97 0.04
The new column names are now the English names that Google Translate provided. The Dutch feature name koper has been translated to buyer, which strictly speaking is correct but is obviously not the desired result from a chemical point of view. Future versions of WaDI will incorporate a better translation service when it becomes available. Until then, the user must proceed with extreme caution when using the WaDI features that require translation.
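One way to guard against such mistranslations is to review the automatic result and overwrite known problem cases by hand. The sketch below shows the idea with plain Python dicts; manual_overrides is a hypothetical, user-maintained dictionary, and whether the resulting plain dict can be passed to name_map in place of a WaDI mapping dictionary is an assumption that is not confirmed here.

```python
# Automatic translations as printed above.
translated = {
    "chloorbenzeen": "chlorobenzene",
    "carbendazim": "carbendazim",
    "naftaleen": "naphthalene",
    "koper": "buyer",
    "pyreen": "pyrene",
}

# Hand-checked corrections (hypothetical helper dict, maintained by the user).
manual_overrides = {"koper": "copper"}

# Later entries win, so the manual corrections replace the automatic ones.
corrected = {**translated, **manual_overrides}

for name, alias in corrected.items():
    print(name, "-->", alias)
```

Keeping the corrections in a separate dictionary makes it easy to see which aliases were set manually and to reapply them after the automatic translation is rerun.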