Sorter

class bonobo_trans.sorter.Sorter(*args, **kwargs)

The Sorter transformation sorts rows and can de-duplicate data.

Configuration options

Required:

  • keys_sort (dict) {key:direction}

Optional:

  • name (str)
  • distinct (int) Default: SRT_DUP_KEEP
  • keys_dedup (list of str)
  • case_sensitive (bool) Default: False
  • null_is_last (bool) Default: True

Description of the options:

keys_sort

The keys_sort option is a dictionary whose keys refer to the keys in the incoming row. The value for each key is the direction, which indicates an ascending or descending sort.

Direction can be one of the following:

  • ‘ASC’, ‘ASCENDING’, True, 1
  • ‘DESC’, ‘DESCENDING’, False, any number except 1

Example:

{'year':'ASC', 'month':'DESC', 'day':'ASC'}
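
The direction rules above can be sketched in plain Python. This is only an illustration, not the library's code: `is_ascending` is a hypothetical helper name, and treating strings case-insensitively is an assumption the documentation does not confirm.

```python
def is_ascending(direction):
    """Hypothetical helper: True for an ascending direction, False for descending."""
    if isinstance(direction, str):
        value = direction.upper()  # assumption: case-insensitive matching
        if value in ("ASC", "ASCENDING"):
            return True
        if value in ("DESC", "DESCENDING"):
            return False
        raise ValueError(f"Unknown sort direction: {direction!r}")
    # True and 1 mean ascending; False and any other number mean descending.
    return direction == 1


keys_sort = {"year": "ASC", "month": "DESC", "day": 1}
directions = {key: is_ascending(d) for key, d in keys_sort.items()}
print(directions)
```

Because `True == 1` in Python, the boolean and numeric forms fall out of the same comparison.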

name

Name of the transformation. Mainly used for identification in logging.

distinct, keys_dedup

The sorter transformation allows for removal of duplicate rows. There are different strategies to choose from:

distinct               Description
SRT_DUP_KEEP           Don’t remove duplicates
SRT_DUP_DISTINCT_ROW   Remove identical rows
SRT_DUP_KEY_FIRST      Remove duplicate key, keep first
SRT_DUP_KEY_LAST       Remove duplicate key, keep last

By default duplicates are not removed (SRT_DUP_KEEP).

SRT_DUP_DISTINCT_ROW

Remove identical rows. This is similar to the SQL “DISTINCT” keyword. This setting removes every row whose values are all identical to those of a row that has already been kept.
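
As a rough illustration of what "identical rows" means, the following plain-Python sketch removes rows whose key/value pairs all match an earlier row. It does not use bonobo_trans; `distinct_rows` is a hypothetical name, and the sketch assumes all row values are hashable.

```python
def distinct_rows(rows):
    """Keep the first occurrence of each fully identical row (dict)."""
    seen = set()
    result = []
    for row in rows:
        # A sorted tuple of items is a hashable fingerprint of the whole row.
        fingerprint = tuple(sorted(row.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            result.append(row)
    return result


rows = [
    {"year": 2019, "month": 2, "day": 15},
    {"year": 2019, "month": 2, "day": 15},  # identical to the first row
    {"year": 2019, "month": 2, "day": 16},
]
unique = distinct_rows(rows)
print(unique)
```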

SRT_DUP_KEY_FIRST, SRT_DUP_KEY_LAST

Remove rows that have duplicate keys. This behaviour is akin to an aggregator’s FIRST and LAST functions: of all rows that share an identical key, only one is kept. You can choose whether to keep the first or the last row.

You can specify the de-duplication key as a subset of the sort key using the keys_dedup option. It accepts a list of keys (str). If you don’t specify keys_dedup, the first row will be kept, but this gives you less control and certainty, because the result depends on the order in which the rows enter this transformation.

Example:

'distinct'   = SRT_DUP_KEY_FIRST
'keys_sort'  = {'year':'ASC', 'month':'ASC', 'day':'ASC'}
'keys_dedup' = ['year', 'month']

Input rows:
        2019,02,15,'Friday'
        2019,02,16,'Saturday'
        2019,02,17,'Sunday'

Output rows:
        2019,02,15,'Friday'
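
The example above can be reproduced with a small standard-library sketch. It only mimics the documented behaviour and is not how the Sorter is implemented; the `weekday` field name is an assumption, since the example rows do not name their fourth column.

```python
from itertools import groupby
from operator import itemgetter

rows = [
    {"year": 2019, "month": 2, "day": 17, "weekday": "Sunday"},
    {"year": 2019, "month": 2, "day": 15, "weekday": "Friday"},
    {"year": 2019, "month": 2, "day": 16, "weekday": "Saturday"},
]

sort_key = itemgetter("year", "month", "day")  # mirrors keys_sort, all ascending
dedup_key = itemgetter("year", "month")        # mirrors keys_dedup

# Sort first; rows with equal de-duplication keys are then adjacent,
# because the de-duplication key is a leading subset of the sort key.
ordered = sorted(rows, key=sort_key)

# Keep-first and keep-last variants of the key-based de-duplication.
first_per_key = [next(group) for _, group in groupby(ordered, key=dedup_key)]
last_per_key = [list(group)[-1] for _, group in groupby(ordered, key=dedup_key)]

print(first_per_key)
print(last_per_key)
```

With a keep-first strategy the Friday row survives; with keep-last it would be the Sunday row.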

case_sensitive

TODO!

null_is_last

This option determines whether None/Null values appear at the top or at the bottom of the sorted output. The default is True, which places None values at the bottom.
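
One common way to get this behaviour in plain Python is to sort on a tuple whose first element flags None values. The sketch below is an assumption about the general technique, not the Sorter's actual code; `sort_with_nulls` is a hypothetical name.

```python
def sort_with_nulls(values, null_is_last=True):
    """Sort values ascending, placing None at the end or at the start."""
    def key(value):
        is_null = value is None
        # The flag decides the block (nulls vs. non-nulls); the second
        # element orders values inside a block. None is replaced by 0 so
        # it is never compared directly with a real value.
        flag = is_null if null_is_last else not is_null
        return (flag, 0 if is_null else value)
    return sorted(values, key=key)


print(sort_with_nulls([3, None, 1]))
print(sort_with_nulls([3, None, 1], null_is_last=False))
```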
ToDo:
  • [Q] (How) could we create a Deduplicator Class transformation as subclass of the sorter? Would that be nice?
Args:
  • d_row_in (dict)

d_row_in is a dictionary containing row data.

Returns:
  • d_row_out (dict)

d_row_out contains all the keys of the incoming dictionary without any changes or additions.

Only the order of the rows will change.

Parameters:
  • keys_sort (dict)
  • name (str)
  • distinct (int)
  • keys_dedup (list)
  • case_sensitive (bool)
  • null_is_last (bool)