Example 1: Import stacked data

This example is an extension of the minimal working example and demonstrates how to import an Excel file with stacked data and perform some basic transformations. It is intended to illustrate a typical WaDI workflow:

Note

Instructions for importing ‘wide’ data are provided in example 2.

Step 1: Initialize the DataObject class

The very first thing to do is to import the WaDI library. In the code example below it is renamed to wd.

In [1]: import wadi as wd

The DataObject class is now available as wd.DataObject. This class must always be initialized. It can be called without arguments, but the optional arguments log_fname and output_dir make it possible to change the name and the output directory of the log file that WaDI creates.

In [2]: wdo = wd.DataObject(log_fname='WaDI_example.log', silent=True)

The DataObject class instance is stored as wdo, which will be used in the next step to read the data from the spreadsheet file.

Note

The log file contains information about all the steps it takes to process the data. Users must always inspect the contents of this file to ensure that the WaDI methods resulted in the expected behavior.

Step 2: Read the data

Once the DataObject class has been initialized, its file_reader method must be called to specify the name and the structure of the file that is to be imported. For the purpose of the demonstration in this documentation, find the directory with the data files with get_data_dir

In [3]: from wadi.documentation_helpers import get_data_dir

In [4]: DATA_DIRECTORY = get_data_dir()

In [5]: wdo.file_reader(DATA_DIRECTORY / 'stacked_data.xlsx',
   ...:     format='stacked',
   ...:     c_dict={'SampleId': 'Sample number',
   ...:             'Features': 'Parameter description',
   ...:             'Units': 'Unit description',
   ...:             'Values': 'Reported value',
   ...:     },
   ...: )
   ...: 

The format argument is to indicate if the data are in ‘stacked’ format. This means that four compulsory columns must be present in the spreadsheet file

  1. SampleId: A unique sample identifier

  2. Features: The names of the features

  3. Units: The (chemical concentration) units

  4. Values: The measured data

Inspection of the spreadsheet file ‘stacked_data.xlsx’ shows that these columns are called ‘Sample number’, ‘Parameter description’, ‘Unit description’ and ‘Reported value’. Therefore, for WaDI to be able to find the compulsory columns, their names must be mapped to the column names listed above. This is why the c_dict argument (the c is shorthand for column) is used: It takes a dictionary with the compulsory column names as keys, and each corresponding value contains the column name as it appears in the spreadsheet file.

Note

By default, file_reader will set the Pandas method read_excel as the importer for the file contents. With the keyword argument pd_reader the name of any Pandas reader function (for example read_csv) can be used instead (note that WaDI has been designed to work with read_excel and read_csv, other functions are not guaranteed to work).

The contents of the imported DataFrame can be displayed by calling the get_imported_dataframe() method of the wdo object. Note that the imported DataFrame has nine rows of data.

In [6]: df = wdo.get_imported_dataframe()

In [7]: df.head(9)
Out[7]: 
  Sample location  ... Unit description
0          PW1100  ...             µg/l
1          PW1100  ...          mg/l Cl
2          PW1100  ...             µg/l
3          PW1100  ...         mg/l SO4
4          OW1005  ...             µg/l
5          OW1005  ...             µg/l
6          OW1203  ...             mg/l
7          OW1203  ...             mS/m
8          OW1203  ...          mg/l Cl

[9 rows x 8 columns]

Inspection of the parameter names shows that sulphate was wrongly spelled as Sulpate and that the name for calcium also includes the laboratory method (ICP-AES). Issues such as these can be remedied by mapping the names to new values, which will be demonstrated in the next step.

Step 3: Map the names

Mapping involves ‘translating’ the feature names and the units to a desired format. To illustrate the principle, the following mapping operations will be performed

  • ‘Chloride’ will be mapped to ‘Cl’

  • ‘Sulpate’ will be mapped to ‘SO4’

  • The text string ‘(ICP-AES)’ will be removed and ‘Calcium’ will be mapped to ‘Ca’.

By assigning the text string ‘(ICP-AES)’ to the remove_strings argument (note that this must be within a list, as there could be multiple text strings that need removing), it will be deleted from the feature name. The name mapping is accomplished by defining a dictionary called name_mapper, which is passed as the m_dict argument of the name_map method. The keys of m_dict are the feature names to be matched, which will be replaced by the corresponding values.

In [8]: name_mapper = {'Chloride': 'Cl',
   ...:     'Calcium': 'Ca',
   ...:     'Sulphate': 'SO4',
   ...: }
   ...: 

In [9]: wdo.name_map(m_dict=name_mapper,
   ...:     match_method=['exact', 'fuzzy'],
   ...:     remove_strings=['(ICP-AES)'],
   ...: )
   ...: 

Both the ‘exact’ and ‘fuzzy’ mapping methods are used to match feature names to the keys in m_dict. The fuzzy search algorithm finds a match if two terms are sufficiently close based on score between 0 and 100 percent. This match method will therefore result in a match for the misspelled feature name ‘Sulpate’. The ‘exact’ match method will find ‘Chloride’ and ‘Calcium’. The organic substances are not in name_mapper, so their names will remain unchanged.

Note

More information on creating mapping dictionaries can be found here

Step 4: Harmonize the data

Harmonizing the data can involve several operations (combining features, deleting features, converting units). Here a harmonizer object it will be added to the WaDI DataObject wdo to convert the data format from ‘stacked’ to ‘wide’ and to convert the chemical concentrations to mmol/l by setting convert_units to True (the default is False).

In [10]: df = wdo.harmonizer(convert_units=True,
   ....:     target_units='mmol/l',
   ....: )
   ....: 

Finally the result of the operations defined above can be obtained by calling the get_converted_dataframe method.

In [11]: df = wdo.get_converted_dataframe()
 - Could not retrieve molar mass 1,2-Dichloroethane with molmass library. Trying PubChem...
 - Could not retrieve molar mass 1,2-Dimethylbenzene with molmass library. Trying PubChem...
 - Could not retrieve molar mass EC 20degC with molmass library. Trying PubChem...

In [12]: df.head()
Out[12]: 
                    1,2-Dichloroethane        Cl  ...        Ca EC 20degC
                                mmol/l    mmol/l  ...    mmol/l      mS/m
Sample number                                     ...                    
23010701       < 5.052546483427648e-07  2.820644  ...       NaN       NaN
22122401       < 5.052546483427648e-07       NaN  ...       NaN       NaN
22122402                           NaN  0.282064  ...  0.948151      26.0

[3 rows x 6 columns]

The mapping results are summarized in the file ‘mapping_results_WaDI_example.xlsx’ in the folder named ‘WaDI_output’. In this file it can be seen that a match was found for Chloride, Sulpate and Calcium (ICP-AES). All the other features will keep their original names.

WaDI uses the molmass package, which tries to calculate the molar mass from chemical formulas. If the molmass package is unable to determine the molar mass, WaDI tries to find it in the online PubChem library. In some cases unit conversion fails and the imported data will remain in their original units. This is the case here for the electrical conductivity (as expected this cannot be converted to molar concentration units) and the original numbers are simply kept. Concentrations that were below the detection limit (values with a ‘<’ symbol) were originally reported with a comma as a decimal separator. In the converted DataFrame the decimal separator is replaced with a dot.