Example 1: Import stacked data
This example is an extension of the minimal working example and demonstrates how to import an Excel file with stacked data and perform some basic transformations. It is intended to illustrate a typical WaDI workflow:
Note
Instructions for importing ‘wide’ data are provided in example 2.
Step 1: Initialize the DataObject class
The very first thing to do is to import the WaDI library. In the code example below it is renamed to wd.
In [1]: import wadi as wd
The DataObject class is now available as wd.DataObject
. This
class must always be initialized. It can be called without arguments,
but the optional arguments log_fname
and output_dir
make it possible to change the name and the output directory of the
log file that WaDI creates.
In [2]: wdo = wd.DataObject(log_fname='WaDI_example.log', silent=True)
The DataObject class instance is stored as wdo
, which will be used
in the next step to read the data from the spreadsheet file.
Note
The log file contains information about all the steps it takes to process the data. Users must always inspect the contents of this file to ensure that the WaDI methods resulted in the expected behavior.
Step 2: Read the data
Once the DataObject class has been initialized, its file_reader
method must be called to specify the name and the structure of the file
that is to be imported. For the purpose of the demonstration in this
documentation, find the directory with the data files with get_data_dir
In [3]: from wadi.documentation_helpers import get_data_dir
In [4]: DATA_DIRECTORY = get_data_dir()
In [5]: wdo.file_reader(DATA_DIRECTORY / 'stacked_data.xlsx',
...: format='stacked',
...: c_dict={'SampleId': 'Sample number',
...: 'Features': 'Parameter description',
...: 'Units': 'Unit description',
...: 'Values': 'Reported value',
...: },
...: )
...:
The format
argument is to indicate if the data are in
‘stacked’ format. This means that four compulsory columns must
be present in the spreadsheet file
SampleId: A unique sample identifier
Features: The names of the features
Units: The (chemical concentration) units
Values: The measured data
Inspection of the spreadsheet file ‘stacked_data.xlsx’ shows that
these columns are called ‘Sample number’, ‘Parameter description’,
‘Unit description’ and ‘Reported value’. Therefore, for WaDI to be
able to find the compulsory columns, their names must be mapped to
the column names listed above. This is why the c_dict
argument (the c is shorthand for column) is used: It takes a
dictionary with the compulsory column names as keys, and each
corresponding value contains the column name as it appears in the
spreadsheet file.
Note
By default, file_reader
will set the Pandas method
read_excel
as the importer for the file contents. With the keyword
argument pd_reader
the name of any Pandas reader function
(for example read_csv
) can be used instead (note that WaDI
has been designed to work with read_excel
and
read_csv
, other functions are not guaranteed to work).
The contents of the imported DataFrame can be displayed by calling the
get_imported_dataframe()
method of the wdo
object. Note
that the imported DataFrame has nine rows of data.
In [6]: df = wdo.get_imported_dataframe()
In [7]: df.head(9)
Out[7]:
Sample location ... Unit description
0 PW1100 ... µg/l
1 PW1100 ... mg/l Cl
2 PW1100 ... µg/l
3 PW1100 ... mg/l SO4
4 OW1005 ... µg/l
5 OW1005 ... µg/l
6 OW1203 ... mg/l
7 OW1203 ... mS/m
8 OW1203 ... mg/l Cl
[9 rows x 8 columns]
Inspection of the parameter names shows that sulphate was wrongly spelled as Sulpate and that the name for calcium also includes the laboratory method (ICP-AES). Issues such as these can be remedied by mapping the names to new values, which will be demonstrated in the next step.
Step 3: Map the names
Mapping involves ‘translating’ the feature names and the units to a desired format. To illustrate the principle, the following mapping operations will be performed
‘Chloride’ will be mapped to ‘Cl’
‘Sulpate’ will be mapped to ‘SO4’
The text string ‘(ICP-AES)’ will be removed and ‘Calcium’ will be mapped to ‘Ca’.
By assigning the text string ‘(ICP-AES)’ to the remove_strings
argument (note that this must be within a list, as there could be
multiple text strings that need removing), it will be deleted from the
feature name. The name mapping is accomplished by defining a dictionary called
name_mapper
, which is passed as the m_dict
argument
of the name_map
method. The keys of m_dict
are the feature
names to be matched, which will be replaced by the corresponding values.
In [8]: name_mapper = {'Chloride': 'Cl',
...: 'Calcium': 'Ca',
...: 'Sulphate': 'SO4',
...: }
...:
In [9]: wdo.name_map(m_dict=name_mapper,
...: match_method=['exact', 'fuzzy'],
...: remove_strings=['(ICP-AES)'],
...: )
...:
Both the ‘exact’ and ‘fuzzy’ mapping methods are used to match feature names
to the keys in m_dict
. The fuzzy search algorithm finds a match if two
terms are sufficiently close based on score between 0 and 100 percent. This
match method will therefore result in a match for the misspelled feature name
‘Sulpate’. The ‘exact’ match method will find ‘Chloride’ and ‘Calcium’. The
organic substances are not in name_mapper
, so their names will remain
unchanged.
Note
More information on creating mapping dictionaries can be found here
Step 4: Harmonize the data
Harmonizing the data can involve several operations (combining
features, deleting features, converting units). Here a harmonizer
object it will be added to the WaDI DataObject wdo
to convert
the data format from ‘stacked’ to ‘wide’ and to convert the chemical
concentrations to mmol/l by setting convert_units
to True
(the default is False).
In [10]: df = wdo.harmonizer(convert_units=True,
....: target_units='mmol/l',
....: )
....:
Finally the result of the operations defined above can be obtained by
calling the get_converted_dataframe
method.
In [11]: df = wdo.get_converted_dataframe()
- Could not retrieve molar mass 1,2-Dichloroethane with molmass library. Trying PubChem...
- Could not retrieve molar mass 1,2-Dimethylbenzene with molmass library. Trying PubChem...
- Could not retrieve molar mass EC 20degC with molmass library. Trying PubChem...
In [12]: df.head()
Out[12]:
1,2-Dichloroethane Cl ... Ca EC 20degC
mmol/l mmol/l ... mmol/l mS/m
Sample number ...
23010701 < 5.052546483427648e-07 2.820644 ... NaN NaN
22122401 < 5.052546483427648e-07 NaN ... NaN NaN
22122402 NaN 0.282064 ... 0.948151 26.0
[3 rows x 6 columns]
The mapping results are summarized in the file ‘mapping_results_WaDI_example.xlsx’ in the folder named ‘WaDI_output’. In this file it can be seen that a match was found for Chloride, Sulpate and Calcium (ICP-AES). All the other features will keep their original names.
WaDI uses the molmass package, which tries to calculate the molar mass from chemical formulas. If the molmass package is unable to determine the molar mass, WaDI tries to find it in the online PubChem library. In some cases unit conversion fails and the imported data will remain in their original units. This is the case here for the electrical conductivity (as expected this cannot be converted to molar concentration units) and the original numbers are simply kept. Concentrations that were below the detection limit (values with a ‘<’ symbol) were originally reported with a comma as a decimal separator. In the converted DataFrame the decimal separator is replaced with a dot.