Example 2: Using blocks to import messy data
By using Pandas’ exhaustive functionality for reading files,
WaDI can handle almost any spreadsheet or comma separated values
file, even if its contents are poorly organized. Inspection of the
spreadsheet file ‘messy_data.xlsx’ shows that its contents are truly
a mess. For example, some concentration data are spread over two
columns. Also the rows with column headers are not nicely aligned,
there are empty rows and columns and there are trailing rows with
data that should be skipped. This example demonstrates how WaDI uses
the keyword arguments of the read_excel
function to import
separate blocks of data (called ‘blocks’) into a single DataFrame.
Step 1: Initialize the WaDI DataObject class
Nothing special needs to be done in this step, except to provide the log file name.
In [1]: import wadi as wd
In [2]: wdo = wd.DataObject(
...: log_fname='messy_data.log',
...: silent=True
...: )
...:
Step 2: Read the data
Because the required data in the file are scattered,
three so-called ‘blocks’ are defined. Each takes a number of keyword
arguments. Basically all keyword arguments of a Pandas reader function,
in this case read_excel
, are allowed. In this case, they are
skiprows
, usecols
and na_values
. The latter is
used to filter out cell that have ‘999’ as values.
The keyword argument skiprows
accepts a list of (zero-based) row
numbers to be skipped, and for each block this list is stored in
rows2skip
before it is passed as an argument value for
skiprows
. The row numbers, unfortunately, must be counted manually
in the spreadsheet file. Note how usecols
is used to grab only a
selection of columns for each block.
There are also a few keyword arguments that are specific to the
read_data
function. These are datatype
, which tells
WaDI what kind of data is being read. Valid values are sampleinfo
,
for sample data that are not going to be modified (for example, sample
coordinates) in any way, and feature
, for data that are going to be
mapped and/or harmonized.
The keyword argument units_row
tells read_data
on which row
in the spreadsheet it has to look for the (chemical concentration) units.
As with skiprows
, the number has to be looked up by the user by
inspecting the contents of the spreadsheet file.
In [3]: rows2skip0 = list(range(8)) + [21, 22] + list(range(24, 34))
In [4]: df0_kwargs = {
...: 'skiprows': rows2skip0,
...: 'usecols': "B:D",
...: 'datatype': 'sampleinfo',
...: }
...:
In [5]: rows2skip1 = list(range(2)) + list(range(3, 9)) + [21, 22]
In [6]: df1_kwargs = {
...: 'skiprows': rows2skip1,
...: 'usecols': "E:AE",
...: 'units_row': 4,
...: 'datatype': 'feature',
...: 'na_values': [999],
...: }
...:
In [7]: rows2skip2 = list(range(7)) + [8, 21, 22]
In [8]: df2_kwargs = {
...: 'skiprows': rows2skip2,
...: 'usecols': "AF",
...: 'datatype': 'sampleinfo',
...: }
...:
Now that the arguments for the three data blocks have been defined,
they can be passed on to the file_reader
method.
Additionally, the file name and the data format is passed. In this case, the
concentrations are organized in columns, so the format
argument
should become ‘wide’. For the purpose of the demonstration in this
documentation, find the directory with the data files with get_data_dir
In [9]: from wadi.documentation_helpers import get_data_dir
In [10]: DATA_DIRECTORY = get_data_dir()
In [11]: wdo.file_reader(
....: file_path=DATA_DIRECTORY / 'messy_data.xlsx',
....: format='wide',
....: blocks=[df0_kwargs, df1_kwargs, df2_kwargs],
....: )
....:
Once the data have been read, the contents of the DataFrame that was imported can be inspected. Note how Pandas (not WaDI) has automatically numbered the second column of duplicate columns by appending ‘.1’.
In [12]: import sys
In [13]: sys.version
Out[13]: '3.11.6 (main, Oct 18 2023, 21:49:15) [GCC 11.3.0]'
In [14]: df = wdo.get_imported_dataframe()
In [15]: df.head()
Out[15]:
Sampling Date LocationID ... wrong_feature.1 Sample Number
0 2001-05-02 00:00:00 #2 ... NaN 1002.0
1 02-05-2001 #2 ... NaN 1005.0
2 2001-05-02 00:00:00 2 ... NaN 1008.0
3 2001-05-03 00:00:00 2 ... NaN 1011.0
4 03-05-2001 2 ... NaN 1014.0
[5 rows x 31 columns]
Note how a mistake has creeped in the spelling of ‘Nitrate’ in the spreadsheet: by accident the Dutch word was typed in the first column.
Step 3: Map the names and units
The feature names are mapped using a dictionary that matches the original
column names to the desired column names. In the code below, this dictionary
is created manually and is stored as feature_dict
. The
feature_dict
is assigned to the m_dict
keyword argument of
the name_map
method.
The match methods are grouped in a list that is assigned to the
match_method
keyword argument. Both the ‘exact’ and ‘fuzzy’
match methods are included. The latter will be able to match ‘Nitraat’
to ‘Nitrate’, which will automatically fix the translation mistake for
this feature.
In [16]: feature_dict = wd.MapperDict({
....: 'Phosphate': 'PO4',
....: 'Nitrate': 'NO3',
....: 'Nitrite': 'NO2',
....: 'Ammonium': 'NH4',
....: 'Silica': 'SiO2',
....: 'Sulphate': 'SO4',
....: 'Sodium': 'Na',
....: 'Calcium': 'Ca',
....: 'Arsenic': 'As',
....: }
....: )
....:
In [17]: wdo.name_map(
....: m_dict=feature_dict,
....: match_method=['exact', 'fuzzy'],
....: )
....:
The way units are mapped can be controlled with the unit_map
method. In this case the preferred match method is ‘regex’, which uses the
specialized WaDI search method (based on regular expressions) that tries to
decipher the units strings. For example, it can tell the difference between
‘mg N/l’, ‘mg N/l NO3’ or ‘mg/l NO3’, and knows what molecular weight to use
when concentrations reported in mass units are to be converted to molar units.
In this case there are also a few symbols that need to be replaced for the
unit mapping to be successful. These are passed as a dictionary with the
keyword arguments replace_strings
.
In [18]: wdo.unit_map(
....: match_method=['regex'],
....: replace_strings={'μ': 'u', '-': ' ', '%': 'percent'},
....: )
....:
Step 4: Harmonize the data
The harmonizer
method is used below to define which columns
are to be merged or deleted and to specify how measurement data are
to be converted from one unit to another.
The columns to be combined must be grouped in a list of at least two
column names. The data in the first column in the list will be
overwritten with data from the next column where the values in the
first column are NaN (not a number). More than two columns are
allowed, WaDI will simply try to fill up as many NaN values as
possible. Note that eight column pairs are selected for merging
and that these are grouped in a list. This means that the value
passed for merge_columns
must always be a nested list,
even if only one set of column names is passed.
The harmonizer
method also has a keyword argument
drop_columns
, which takes a list of column names that will
be deleted.
The target_units
keyword argument specifies what
(concentration) units will be used for the feature data. In this
case the values will be converted to ‘mmol/l’. This value can be
overridden for individual columns with the override_units
keyword argument. The code example below shows how this option can
be used to convert only the arsenic concentrations to μmol/l.
Note
Behind the scenes, WaDI relies on Pint to convert the units. Any problems that it encountered will be recorded in the log file (messy_data.log in this example). It is strongly advised that the user always checks the log file to verify that no unexpected behavior occurred.
In [19]: drop_cols = [
....: "SampleID",
....: "Unnamed: 17",
....: ]
....:
In [20]: override_units = {
....: "Arsenic": "umol/l",
....: "Arsenic.1": "umol/l",
....: "ec": "µS/cm",
....: }
....:
In [21]: merge_cols = [
....: ['Phosphate', 'Phosphate.1'],
....: ['Nitraat', 'Nitrate'],
....: ['Nitrite', 'Nitrite.1'],
....: ['Ammonium', 'Ammonium.1'],
....: ['Silica', 'Silica.1'],
....: ['Sulphate', 'Sulphate.1'],
....: ['Calcium', 'Calcium.1'],
....: ['Arsenic', 'Arsenic.1'],
....: ]
....:
In [22]: df = wdo.harmonizer(
....: merge_columns=merge_cols,
....: drop_columns=drop_cols,
....: convert_units=True,
....: target_units="mmol/l",
....: override_units=override_units,
....: )
....:
In [23]: df = wdo.get_converted_dataframe()
In [24]: df.head()
Out[24]:
Sampling Date LocationID PO4 ... E.coli wrong_feature Sample Number
...
0 2001-05-02 00:00:00 #2 < 1.0 ... NaN NaN 1002.0
1 02-05-2001 #2 < 1.0 ... NaN NaN 1005.0
2 2001-05-02 00:00:00 2 < 1.0 ... NaN NaN 1008.0
3 2001-05-03 00:00:00 2 < 1.0 ... NaN NaN 1011.0
4 03-05-2001 2 1.0 ... NaN NaN 1014.0
[5 rows x 21 columns]
Finally, the resulting DataFrame can be saved to an Excel file using the
to_excel
function.
In [25]: df.to_excel('tidied_data.xlsx')