wadi.harmonizer module

class wadi.harmonizer.Harmonizer

Bases: WadiBaseClass

WaDI class for transforming the data to a harmonized format.

__init__()

Class initialization method.

__call__(convert_units=False, target_units='mg/l', override_units=None, drop_columns=None, merge_columns=None, detection_limit_symbols='<>', decimal_str=',')

This method provides an interface for the user to set the attributes that determine the Harmonizer object behavior.

Parameters:
  • convert_units (bool, optional) – When True, the units will be converted to the specified target units. Default: False

  • target_units (str, optional) – The desired target units. Must be a string in a format that Pint can parse. Default: mg/l

  • override_units (dict, optional) – A dictionary that maps column names to the units to use for those columns instead of ‘target_units’.

  • drop_columns (list, optional) – A list of columns that should not appear in the harmonized DataFrame.

  • merge_columns (list, optional) – A nested list of column names that will be merged into a single column. Each list element can contain any number of column names. The first name in each list element is the column to keep. Any NaN values that appear in this column will be replaced by non-NaN values from the subsequent columns in the list element.

  • detection_limit_symbols (str, optional) – A string that contains the symbols that are used for measurement values beyond (below or above) the detection limit. Default: ‘<>’ (both values below the detection limit, e.g. < 0.1, and above the detection limit, e.g. > 0.1 will be identified this way).

  • decimal_str (str, optional) – The character used as the decimal separator when a number is represented as a string in the original data, typically when it is below the detection limit, for example ‘< 0,5’. Default: ‘,’.

Returns:

result – A DataFrame with the transformed data.

Return type:

DataFrame
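The merge rule described for merge_columns can be illustrated with a minimal, library-independent sketch. It is not WaDI's implementation: the function name merge_columns_sketch, the helper is_missing and the dict-of-lists table are hypothetical stand-ins for the DataFrame, and dropping the non-kept columns after the merge is an assumption based on the phrase "merged into a single column".

```python
import math


def is_missing(value):
    """Return True for NaN floats, the only 'missing' marker used in this sketch."""
    return isinstance(value, float) and math.isnan(value)


def merge_columns_sketch(table, merge_columns):
    """Merge groups of columns in `table` (a dict of column name -> list of values).

    Hypothetical helper: for each group, the first column is kept and its NaN
    entries are filled from the subsequent columns, in the order given.
    """
    # Copy the lists so the caller's table is left untouched.
    merged = {name: list(values) for name, values in table.items()}
    for group in merge_columns:
        keep, *others = group
        for other in others:
            for i, value in enumerate(merged[keep]):
                if is_missing(value):
                    merged[keep][i] = merged[other][i]
            # Assumption: the source columns are dropped once merged.
            del merged[other]
    return merged


table = {
    "Fe": [1.0, math.nan, math.nan],
    "Fe_total": [9.0, 2.0, math.nan],
    "iron": [8.0, 7.0, 3.0],
    "pH": [7.1, 7.2, 7.3],
}
result = merge_columns_sketch(table, [["Fe", "Fe_total", "iron"]])
print(result)
```

In this example the kept column "Fe" takes its second value from "Fe_total" and its third from "iron", because "Fe_total" is also NaN in that row.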

_convert_values(v, conversion_factor)

Convert cell values using the unit conversion factor. This function is used in _execute with the Pandas ‘apply’ function and is needed because (i) values with a detection limit symbol cannot be converted simply by multiplying and (ii) some cell values are imported as strings even though they are numbers.

Parameters:
  • v (float or str) – The value to be converted.

  • conversion_factor (float) – The unit conversion factor.

Returns:

result – The converted value.

Return type:

float
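The two complications named above (detection limit symbols and numbers stored as strings) can be sketched as follows. This is an illustration under stated assumptions, not the actual body of _convert_values: the name convert_value_sketch and the extra keyword arguments are hypothetical, and returning a string that keeps the detection limit symbol is an assumption, since the source only lists float as the return type.

```python
def convert_value_sketch(v, conversion_factor,
                         detection_limit_symbols="<>", decimal_str=","):
    """Hypothetical per-cell conversion, suitable for use with an apply-style call."""
    # Plain numbers can simply be multiplied.
    if isinstance(v, (int, float)):
        return v * conversion_factor

    text = str(v).strip()

    # Split off a leading detection limit symbol, if present.
    symbol = ""
    if text and text[0] in detection_limit_symbols:
        symbol = text[0]
        text = text[1:].strip()

    # Normalize the decimal separator before parsing the number.
    number = float(text.replace(decimal_str, ".")) * conversion_factor

    # Assumption: keep the symbol so the censoring information is not lost.
    if symbol:
        return f"{symbol} {number}"
    return number


print(convert_value_sketch(2.0, 1000.0))
print(convert_value_sketch("1,5", 2.0))
print(convert_value_sketch("< 0,5", 1000.0))
```

A string that cannot be parsed as a number raises ValueError in this sketch; how the library treats such cells is not stated in the source.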

_execute(infotable)

This function performs the harmonization operations: it converts the units to the target units, deletes any undesired columns and merges columns.

Parameters:

infotable (InfoTable) – The DataObject’s InfoTable.

Returns:

df – DataFrame with the transformed data.

Return type:

DataFrame