Sorter

class bonobo_trans.sorter.Sorter(*args, **kwargs)

The Sorter transformation sorts rows and can de-duplicate data.

Configuration options

Required:

keys_sort (dict) {key:direction}

Optional:

name (str)

distinct (int) Default: SRT_DUP_KEEP

keys_dedup (list of str)

case_sensitive (bool) Default: False

null_is_last (bool) Default: True

Description of the options:
keys_sort
The keys_sort option is a dictionary whose keys refer to the keys in the incoming row. The value for each key is the direction, which indicates an ascending or descending sort.

Direction can be one of the following:

'ASC', 'ASCENDING', True, 1

'DESC', 'DESCENDING', False, any number except 1

Example:
{'year':'ASC', 'month':'DESC', 'day':'ASC'}
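
The direction rules above can be sketched in plain Python. This is a minimal illustration, not the library's own code: the helper name is_ascending is hypothetical, and the case-insensitive string matching and the ValueError for unknown strings are assumptions.

```python
def is_ascending(direction):
    """Hypothetical helper: True for an ascending direction, False for descending."""
    if isinstance(direction, str):
        token = direction.upper()  # assumption: strings are matched case-insensitively
        if token in ("ASC", "ASCENDING"):
            return True
        if token in ("DESC", "DESCENDING"):
            return False
        # assumption: unknown strings are rejected rather than silently accepted
        raise ValueError("Unknown sort direction: {!r}".format(direction))
    # True and 1 compare equal in Python; False and any other number mean descending.
    return direction == 1


keys_sort = {'year': 'ASC', 'month': 'DESC', 'day': 'ASC'}
directions = {key: is_ascending(value) for key, value in keys_sort.items()}
```

Here directions maps each sort key to a boolean, which could be passed on as the reverse flag (negated) of a sort call.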
name

Name of the transformation. Mainly used for identification in logging.

distinct, keys_dedup
The sorter transformation allows for removal of duplicate rows. There are different strategies to choose from:

distinct                Description
SRT_DUP_KEEP            Don't remove duplicates
SRT_DUP_DISTINCT_ROW    Remove identical rows
SRT_DUP_KEY_FIRST       Remove duplicate key, keep first
SRT_DUP_KEY_LAST        Remove duplicate key, keep last

By default duplicates are not removed (SRT_DUP_KEEP).

SRT_DUP_DISTINCT_ROW

Remove identical rows. This is similar to the SQL "DISTINCT" keyword. With this setting, a row is removed when all of its values are identical to those of another row.

SRT_DUP_KEY_FIRST, SRT_DUP_KEY_LAST

Remove rows that have duplicate keys. This behaviour is more akin to an aggregator's FIRST and LAST functions: of all rows with an identical key, only one is kept. You can choose to keep the first or the last row.

You can specify the de-duplication key as a subset of the sort key using the keys_dedup option. It accepts a list of keys (str). If you don't specify keys_dedup, the first row will be kept, but this gives you less control and predictability, because the result depends on the order in which the rows enter this transformation.

Example:
distinct   = SRT_DUP_KEY_FIRST
keys_sort  = {'year':'ASC', 'month':'ASC', 'day':'ASC'}
keys_dedup = ['year', 'month']

Input rows:
        2019,02,15,'Friday'
        2019,02,16,'Saturday'
        2019,02,17,'Sunday'

Output rows:
        2019,02,15,'Friday'
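
The keep-first and keep-last strategies from the example above can be sketched in plain Python. This is a rough illustration under stated assumptions, not the Sorter's actual implementation: the function sort_and_dedup and its keep parameter are hypothetical, rows are assumed to be dicts, and only ascending sorts are handled.

```python
def sort_and_dedup(rows, keys_sort, keys_dedup, keep="first"):
    """Hypothetical sketch: sort ascending on keys_sort, then keep one row per dedup key."""
    ordered = sorted(rows, key=lambda row: tuple(row[k] for k in keys_sort))
    kept = {}
    for row in ordered:
        dedup_key = tuple(row[k] for k in keys_dedup)
        # keep="first": only the first row seen for a key is stored.
        # keep="last": every later row with the same key overwrites the stored one.
        if keep == "last" or dedup_key not in kept:
            kept[dedup_key] = row
    return list(kept.values())


rows = [
    {"year": 2019, "month": 2, "day": 17, "weekday": "Sunday"},
    {"year": 2019, "month": 2, "day": 15, "weekday": "Friday"},
    {"year": 2019, "month": 2, "day": 16, "weekday": "Saturday"},
]

first = sort_and_dedup(rows, ["year", "month", "day"], ["year", "month"], keep="first")
last = sort_and_dedup(rows, ["year", "month", "day"], ["year", "month"], keep="last")
```

Because the rows are sorted on the full sort key before de-duplication, which row counts as "first" or "last" within a dedup key no longer depends on the input order.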
case_sensitive

TODO!

null_is_last

This option determines whether None/Null values are placed at the top or at the bottom of the sorted output. By default it is True, so None values end up at the bottom.
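
One common way to get this behaviour in plain Python is a sort key that ranks None separately from real values. The sketch below is only an illustration of the idea; the helper null_sort_key is hypothetical and is not taken from the Sorter's source.

```python
def null_sort_key(value, null_is_last=True):
    """Hypothetical sort key: places None after (or before) all other values."""
    is_null = value is None
    # Rank False sorts before rank True, so nulls get the higher rank when they go last.
    rank = is_null if null_is_last else not is_null
    # None is replaced by a placeholder so it is never compared with a real value.
    return (rank, 0 if is_null else value)


values = [3, None, 1]
nulls_last = sorted(values, key=null_sort_key)
nulls_first = sorted(values, key=lambda v: null_sort_key(v, null_is_last=False))
```

Sorting the raw list without such a key would raise a TypeError in Python 3, since None cannot be ordered against integers.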
ToDo:

[Q] (How) could we create a Deduplicator Class transformation as subclass of the sorter? Would that be nice?

Args:

d_row_in (dict)

d_row_in is a dictionary containing row data.

Returns:

d_row_out (dict)

d_row_out contains all the keys of the incoming dictionary without any changes or additions.

Only the order of the rows will change.

Parameters:

keys_sort (dict)

name (str)

distinct (int)

keys_dedup (list)

case_sensitive (bool)

null_is_last (bool)