Example 2: Using blocks to import messy data

By using Pandas’ exhaustive functionality for reading files, WaDI can handle almost any spreadsheet or comma separated values file, even if its contents are poorly organized. Inspection of the spreadsheet file ‘messy_data.xlsx’ shows that its contents are truly a mess. For example, some concentration data are spread over two columns. Also the rows with column headers are not nicely aligned, there are empty rows and columns and there are trailing rows with data that should be skipped. This example demonstrates how WaDI uses the keyword arguments of the read_excel function to import separate blocks of data (called ‘blocks’) into a single DataFrame.

Step 1: Initialize the WaDI DataObject class

Nothing special needs to be done in this step, except to provide the log file name.

In [1]: import wadi as wd

In [2]: wdo = wd.DataObject(
   ...:     log_fname='messy_data.log',
   ...:     silent=True
   ...: )
   ...: 

Step 2: Read the data

Because the required data in the file are scattered, three so-called ‘blocks’ are defined. Each takes a number of keyword arguments. Basically all keyword arguments of a Pandas reader function, in this case read_excel, are allowed. In this case, they are skiprows, usecols and na_values. The latter is used to filter out cell that have ‘999’ as values.

The keyword argument skiprows accepts a list of (zero-based) row numbers to be skipped, and for each block this list is stored in rows2skip before it is passed as an argument value for skiprows. The row numbers, unfortunately, must be counted manually in the spreadsheet file. Note how usecols is used to grab only a selection of columns for each block.

There are also a few keyword arguments that are specific to the read_data function. These are datatype, which tells WaDI what kind of data is being read. Valid values are sampleinfo, for sample data that are not going to be modified (for example, sample coordinates) in any way, and feature, for data that are going to be mapped and/or harmonized.

The keyword argument units_row tells read_data on which row in the spreadsheet it has to look for the (chemical concentration) units. As with skiprows, the number has to be looked up by the user by inspecting the contents of the spreadsheet file.

In [3]: rows2skip0 = list(range(8)) + [21, 22] + list(range(24, 34))

In [4]: df0_kwargs = {
   ...:     'skiprows': rows2skip0,
   ...:     'usecols': "B:D",
   ...:     'datatype': 'sampleinfo',
   ...: }
   ...: 

In [5]: rows2skip1 = list(range(2)) + list(range(3, 9)) + [21, 22]

In [6]: df1_kwargs = {
   ...:     'skiprows': rows2skip1,
   ...:     'usecols': "E:AE",
   ...:     'units_row': 4,
   ...:     'datatype': 'feature',
   ...:     'na_values': [999],
   ...: }
   ...: 

In [7]: rows2skip2 = list(range(7)) + [8, 21, 22]

In [8]: df2_kwargs = {
   ...:     'skiprows': rows2skip2,
   ...:     'usecols': "AF",
   ...:     'datatype': 'sampleinfo',
   ...: }
   ...: 

Now that the arguments for the three data blocks have been defined, they can be passed on to the file_reader method. Additionally, the file name and the data format is passed. In this case, the concentrations are organized in columns, so the format argument should become ‘wide’. For the purpose of the demonstration in this documentation, find the directory with the data files with get_data_dir

In [9]: from wadi.documentation_helpers import get_data_dir

In [10]: DATA_DIRECTORY = get_data_dir()

In [11]: wdo.file_reader(
   ....:     file_path=DATA_DIRECTORY / 'messy_data.xlsx',
   ....:     format='wide',
   ....:     blocks=[df0_kwargs, df1_kwargs, df2_kwargs],
   ....: )
   ....: 

Once the data have been read, the contents of the DataFrame that was imported can be inspected. Note how Pandas (not WaDI) has automatically numbered the second column of duplicate columns by appending ‘.1’.

In [12]: import sys

In [13]: sys.version
Out[13]: '3.11.6 (main, Oct 18 2023, 21:49:15) [GCC 11.3.0]'

In [14]: df = wdo.get_imported_dataframe()

In [15]: df.head()
Out[15]: 
         Sampling Date LocationID  ... wrong_feature.1 Sample Number
0  2001-05-02 00:00:00         #2  ...             NaN        1002.0
1           02-05-2001         #2  ...             NaN        1005.0
2  2001-05-02 00:00:00          2  ...             NaN        1008.0
3  2001-05-03 00:00:00          2  ...             NaN        1011.0
4           03-05-2001          2  ...             NaN        1014.0

[5 rows x 31 columns]

Note how a mistake has creeped in the spelling of ‘Nitrate’ in the spreadsheet: by accident the Dutch word was typed in the first column.

Step 3: Map the names and units

The feature names are mapped using a dictionary that matches the original column names to the desired column names. In the code below, this dictionary is created manually and is stored as feature_dict. The feature_dict is assigned to the m_dict keyword argument of the name_map method.

The match methods are grouped in a list that is assigned to the match_method keyword argument. Both the ‘exact’ and ‘fuzzy’ match methods are included. The latter will be able to match ‘Nitraat’ to ‘Nitrate’, which will automatically fix the translation mistake for this feature.

In [16]: feature_dict = wd.MapperDict({
   ....:         'Phosphate': 'PO4',
   ....:         'Nitrate': 'NO3',
   ....:         'Nitrite': 'NO2',
   ....:         'Ammonium': 'NH4',
   ....:         'Silica': 'SiO2',
   ....:         'Sulphate': 'SO4',
   ....:         'Sodium': 'Na',
   ....:         'Calcium': 'Ca',
   ....:         'Arsenic': 'As',
   ....:     }
   ....: )
   ....: 

In [17]: wdo.name_map(
   ....:     m_dict=feature_dict,
   ....:     match_method=['exact', 'fuzzy'],
   ....: )
   ....: 

The way units are mapped can be controlled with the unit_map method. In this case the preferred match method is ‘regex’, which uses the specialized WaDI search method (based on regular expressions) that tries to decipher the units strings. For example, it can tell the difference between ‘mg N/l’, ‘mg N/l NO3’ or ‘mg/l NO3’, and knows what molecular weight to use when concentrations reported in mass units are to be converted to molar units.

In this case there are also a few symbols that need to be replaced for the unit mapping to be successful. These are passed as a dictionary with the keyword arguments replace_strings.

In [18]: wdo.unit_map(
   ....:     match_method=['regex'],
   ....:     replace_strings={'μ': 'u', '-': ' ', '%': 'percent'},
   ....: )
   ....: 

Step 4: Harmonize the data

The harmonizer method is used below to define which columns are to be merged or deleted and to specify how measurement data are to be converted from one unit to another.

The columns to be combined must be grouped in a list of at least two column names. The data in the first column in the list will be overwritten with data from the next column where the values in the first column are NaN (not a number). More than two columns are allowed, WaDI will simply try to fill up as many NaN values as possible. Note that eight column pairs are selected for merging and that these are grouped in a list. This means that the value passed for merge_columns must always be a nested list, even if only one set of column names is passed.

The harmonizer method also has a keyword argument drop_columns, which takes a list of column names that will be deleted.

The target_units keyword argument specifies what (concentration) units will be used for the feature data. In this case the values will be converted to ‘mmol/l’. This value can be overridden for individual columns with the override_units keyword argument. The code example below shows how this option can be used to convert only the arsenic concentrations to μmol/l.

Note

Behind the scenes, WaDI relies on Pint to convert the units. Any problems that it encountered will be recorded in the log file (messy_data.log in this example). It is strongly advised that the user always checks the log file to verify that no unexpected behavior occurred.

In [19]: drop_cols = [
   ....:     "SampleID",
   ....:     "Unnamed: 17",
   ....: ]
   ....: 

In [20]: override_units = {
   ....:     "Arsenic": "umol/l",
   ....:     "Arsenic.1": "umol/l",
   ....:     "ec": "µS/cm",
   ....: }
   ....: 

In [21]: merge_cols = [
   ....:     ['Phosphate', 'Phosphate.1'],
   ....:     ['Nitraat', 'Nitrate'],
   ....:     ['Nitrite', 'Nitrite.1'],
   ....:     ['Ammonium', 'Ammonium.1'],
   ....:     ['Silica', 'Silica.1'],
   ....:     ['Sulphate', 'Sulphate.1'],
   ....:     ['Calcium', 'Calcium.1'],
   ....:     ['Arsenic', 'Arsenic.1'],
   ....: ]
   ....: 

In [22]: df = wdo.harmonizer(
   ....:     merge_columns=merge_cols,
   ....:     drop_columns=drop_cols,
   ....:     convert_units=True,
   ....:     target_units="mmol/l",
   ....:     override_units=override_units,
   ....: )
   ....: 

In [23]: df = wdo.get_converted_dataframe()

In [24]: df.head()
Out[24]: 
         Sampling Date LocationID    PO4  ... E.coli wrong_feature Sample Number
                                          ...                                   
0  2001-05-02 00:00:00         #2  < 1.0  ...    NaN           NaN        1002.0
1           02-05-2001         #2  < 1.0  ...    NaN           NaN        1005.0
2  2001-05-02 00:00:00          2  < 1.0  ...    NaN           NaN        1008.0
3  2001-05-03 00:00:00          2  < 1.0  ...    NaN           NaN        1011.0
4           03-05-2001          2    1.0  ...    NaN           NaN        1014.0

[5 rows x 21 columns]

Finally, the resulting DataFrame can be saved to an Excel file using the to_excel function.

In [25]: df.to_excel('tidied_data.xlsx')