Creating HGC compatible DataFrames

WaDI is intended to be a generic processor of hydrochemical data but also offers special support for HGC, which is a package for correction, validation and analysis of groundwater quality samples. An example workflow for reading data from a spreadsheet file in wide format is provided below. It demonstrates two key features

  • how to use WaDI’s default dictionary to map the feature names to HGC feature names

  • automatic unit conversion to HGC compatible units

As always, the first step is to define the WaDI DataObject that will perform the required steps. As can be seen from the function arguments, the name of the log file will be ‘hgc_compatible.log’ and no messages will be printed to the screen because the silent keyword argument is set to True.

In [1]: import wadi as wd

In [2]: wdo = wd.DataObject(log_fname='hgc_compatible.log',
   ...:     silent=True,
   ...: )
   ...: 

Second, the file_reader of the DataObject must understand the structure of the spreadsheet file. As explained in example 2, this is done by creating dictionaries with keyword arguments which enable importing multiple different ‘blocks’ of data from the file. Here, the general sample data (sample code, location, sampling date) are found in columns A through E and the feature data in columns F through N. A separate dictionary needs to be created for both.

In [3]: df0_kwargs = {
   ...:     "skiprows": [1],
   ...:     "usecols": "A:E",
   ...:     "units_row": 1,
   ...:     "datatype": "sampleinfo",
   ...: }
   ...: 

In [4]: df1_kwargs = {
   ...:     "skiprows": [1],
   ...:     "usecols": "F:N",
   ...:     "units_row": 1,
   ...:     "datatype": "feature",
   ...: }
   ...: 

As can be seen from the items in both dictionaries defined above, "skiprows": [1] is provided so that the second row of the spreadsheet is not read when reading the concentration (or sample info) data. This is because the second row contains the units. This row must be read separately, which is accomplished by also passing "units_row": 1.

With these dictionaries defined, the file_reader can be initialized and the result of the import operation can be visualized by printing the contents of the imported DataFrame

For the purpose of the demonstration in this documentation, find the folder with the data files

In [5]: from wadi.documentation_helpers import get_data_dir

In [6]: DATA_DIRECTORY = get_data_dir()
In [7]: wdo.file_reader(
   ...:     file_path=DATA_DIRECTORY / "hgc_example.xlsx",
   ...:     format="wide",
   ...:     blocks=[df0_kwargs, df1_kwargs],
   ...: )
   ...: 

In [8]: print(wdo.get_imported_dataframe().head())
            id       x       y  ...    Nikkel zuurgraad lab  zuurgraad veld
0   Prec. 1987  105440  522880  ...  0.000001             5             5.3
1  14C.041+6.5  105330  525660  ...  0.000001             7             7.6
2  14C.041+5.5  105330  525660  ...  0.000001             8             8.2

[3 rows x 14 columns]

The next step is to define a mapping dictionary. WaDI’s default dictionary already contains the HGC names for features. In this case, the feature names provided in the spreadsheet are according to the Dutch SIKB standard, which are also known to WaDI’s default database (see example 2).

The mapping dictionary is created by calling the default_dict function of the MapperDict object, with the arguments ‘SIKB’ and ‘HGC’. The resulting dictionary is stored under the variable name names_dict, which is then used to initialize the name_map of the WaDI DataObject. The selected match methods are ‘exact’ and ‘fuzzy’, with the latter being included because it is case insensitive.

In [9]: names_dict = wd.mapper.MapperDict.default_dict('SIKB', 'HGC')

In [10]: wdo.name_map(
   ....:     m_dict=names_dict,
   ....:     match_method=["exact", "fuzzy"],
   ....: )
   ....: 

The final step before the data can be converted is to define the harmonizer. The convert_units argument needs to be set to True and instead of specifying chemical concentration units, the target_units are set to ‘hgc’. WaDI then understands that it must convert the feature units to values that are prescribed in HGC, which are different for different species (for example, mg/l for chloride, but ug/l for bromide).

In [11]: df = wdo.harmonizer(
   ....:     convert_units=True,
   ....:     target_units="hgc",
   ....: )
   ....: 

The data can now be converted and displayed on the screen.

In [12]: df = wdo.get_converted_dataframe()

In [13]: print(df.head())
            id       x       y     z  ...        Br     Ni   ph zuurgraad veld
             -       m       m m+ref  ...      µg/l   µg/l    -              -
0   Prec. 1987  105440  522880   8.0  ...   50000.0  0.001  5.0            5.3
1  14C.041+6.5  105330  525660   6.5  ...  224000.0  0.001  7.0            7.6
2  14C.041+5.5  105330  525660   5.5  ...  340000.0  0.001  8.0            8.2

[3 rows x 14 columns]

The user should always check the contents of the DataFrame created by WaDI to ensure that the mapping and harmonzing operations yielded the desired results. This is why it is critically important to inspect the conversion results, especially the column names, the units and the concentrations before proceeding with doing any calculations in HGC!

HGC requires a DataFrame without the units. This can be created by setting the include_units argument of the get_converted_dataframe function to False.

In [14]: df_hgc = wdo.get_converted_dataframe(include_units=False)

In [15]: print(df_hgc.head())
            id       x       y    z  ...        Br     Ni   ph  zuurgraad veld
0   Prec. 1987  105440  522880  8.0  ...   50000.0  0.001  5.0             5.3
1  14C.041+6.5  105330  525660  6.5  ...  224000.0  0.001  7.0             7.6
2  14C.041+5.5  105330  525660  5.5  ...  340000.0  0.001  8.0             8.2

[3 rows x 14 columns]