Aggregator

class bonobo_trans.aggregator.Aggregator(*args, **kwargs)

The Aggregator transformation provides aggregates functions on row data.

Important

All input MUST BE SORTED prior to sending it to this transformation!

Configuration options

Required:

  • group (list of str)
  • aggregations (list of dict)

Optional:

  • name (str)
  • null_is_zero (bool) Default: False
  • return_all_rows (bool) Default: False

Option descriptions:

name
Name of the transformation, for identification and logging purposes.
null_is_zero
Set to true to treat NULL as zero.
return_all_rows
Set to True to return all incoming rows. If False (default) the transformation will only return the key group on which was aggregated.
return_all_cols

This setting will be ignored when return_all_rows is True.

When False the transformation will return the key columns plus the requested aggregations. When set to True, all columns will be returned, the values for these columns will be of the last row of the group.

group
A list of the columns to aggregate on. The incoming rows must have been sorted on these keys.
aggregations

A list of the aggregations (dict). An aggregation is a dictionary object of which the key is the output key to be appended to the outgoing row and the value is the aggregation, and must be one of the following:

  • AGG_MAX, AGG_MIN
  • AGG_FIRST, AGG_LAST
  • AGG_MEAN, AGG_MEAN_HARMONIC
  • AGG_MEDIAN, AGG_MEDIAN_HIGH, AGG_MEDIAN_LOW
  • AGG_PERCENTILE
  • AGG_SUM
  • AGG_STDDEV_S, AGG_STDDEV_P
  • AGG_VARIANCE_S, AGG_VARIANCE_P
  • AGG_COUNT

MAX, MIN

MAX returns the highest number, newest date or alphabetically last string. MIN does the reverse.

Example:

{ 'high_key': { AGG_MAX: 'col1' } }

FIRST, LAST

FIRST returns the first row of the group. LAST does the reverse.

Example:

{ 'last_key': { AGG_LAST: 'col1' } }

MEAN, HARMONIC MEAN

MEAN returns the average of all numeric values in specified column in the group. MEAN_HARMONIC returns the harmonic (subcontrary) mean. Values less than zero are not allowed for the harmonic mean.

Example:

{ 'sales_avg': { AGG_MEAN: 'sales_usd' } }

MEDIAN (HIGH/LOW)

MEDIAN returns the median of all numeric values in specified column in the group. If there is an odd number of values, the median is the middle number. If there is an even number of values, the median is the average of the middle two values when all values are placed ordinally on a number line. Use MEDIAN_HIGH or MEDIAN_LOW to not return the average middle (in case of an even number of values, but instead return the highest or lowest of the two middle values.

Example:

{ 'sales_med': { AGG_MED: 'sales_usd' } }

MODE

MODE returns the most common value, if any. If there is no exact most common value, None is returned. Values can be non-numeric.

PERCENTILE (not implemented yet)

Calculates the value that falls at a given percentile in specified column in the group. Column must be numeric.

Example:

{ 'percentile': { AGG_PERCENTILE: 'transaction_id', 'percentile': 25 } }

SUM

Returns the total of all numeric values in specified column in the group.

Example:

{ 'sales_total': { AGG_SUM: 'sales_usd' } }

STDDEV (SAMPLE/POPULATION)

Returns the standard deviation of the numeric values of the specifed column in the group. STDDEV is used to analyze statistical data. This aggregation will return None if less than two values provided. Use STDDEV_S or STDDEV_P to return either a sample or population standard deviation.

Example:

{ 'score_stdev': { AGG_STDDEV_S: 'score' } }

VARIANCE (SAMPLE/POPULATION)

Returns the variance of the numeric values of the specified column in the group. VARIANCE is used to analyze statistical data. This aggregation will return None if less than two values provided. Use VARIANCE_S or VARIANCE_P to return either a sample or population standard deviation.

Example:

{ 'score_var': { AGG_VARIANCE_S: 'score' } }

COUNT

Returns the number of records in the group. The specification is slightly different, instead of a dictionary, you only specify the aggregation, as shown in the exmple:

Example:

{ 'nr_of_transactions': AGG_COUNT }
Todo:
  • Filter conditions. E.g. SUM where summed > 100.
  • ABS(), ROUND() options
  • Do we need a null_is_last option for the first and last functions?
  • Percentile
Args:
  • d_row_in (dict)

d_row_in is a dictonary containing sorted row data.

Returns:
  • d_row_out (dict)

d_row_out contains all the keys specified in the group and aggregations options. If return_all_cols is set to True it will also include the non-key columns of incoming dictionary.

Attention

Any keys with the same name as the specified aggregation keys will be overwritten.

If return_all_rows is specified, all rows will be returned, else only one row per group.

Parameters:
  • group (list) –
  • aggregations (dict) –
  • name (str) –
  • null_is_zero (bool) –
  • return_all_rows (bool) –
  • return_all_cols (bool) –