Aggregator¶
-
class
bonobo_trans.aggregator.
Aggregator
(*args, **kwargs)¶ The Aggregator transformation provides aggregates functions on row data.
Important
All input MUST BE SORTED prior to sending it to this transformation!
Option descriptions:
- name
- Name of the transformation, for identification and logging purposes.
- null_is_zero
- Set to true to treat NULL as zero.
- return_all_rows
- Set to True to return all incoming rows. If False (default) the transformation will only return the key group on which was aggregated.
- return_all_cols
This setting will be ignored when
return_all_rows
is True.When False the transformation will return the key columns plus the requested aggregations. When set to True, all columns will be returned, the values for these columns will be of the last row of the group.
- group
- A list of the columns to aggregate on. The incoming rows must have been sorted on these keys.
- aggregations
A list of the aggregations (dict). An aggregation is a dictionary object of which the key is the output key to be appended to the outgoing row and the value is the aggregation, and must be one of the following:
- AGG_MAX, AGG_MIN
- AGG_FIRST, AGG_LAST
- AGG_MEAN, AGG_MEAN_HARMONIC
- AGG_MEDIAN, AGG_MEDIAN_HIGH, AGG_MEDIAN_LOW
- AGG_PERCENTILE
- AGG_SUM
- AGG_STDDEV_S, AGG_STDDEV_P
- AGG_VARIANCE_S, AGG_VARIANCE_P
- AGG_COUNT
MAX, MIN
MAX returns the highest number, newest date or alphabetically last string. MIN does the reverse.
Example:
{ 'high_key': { AGG_MAX: 'col1' } }
FIRST, LAST
FIRST returns the first row of the group. LAST does the reverse.
Example:
{ 'last_key': { AGG_LAST: 'col1' } }
MEAN, HARMONIC MEAN
MEAN returns the average of all numeric values in specified column in the group. MEAN_HARMONIC returns the harmonic (subcontrary) mean. Values less than zero are not allowed for the harmonic mean.
Example:
{ 'sales_avg': { AGG_MEAN: 'sales_usd' } }
MEDIAN (HIGH/LOW)
MEDIAN returns the median of all numeric values in specified column in the group. If there is an odd number of values, the median is the middle number. If there is an even number of values, the median is the average of the middle two values when all values are placed ordinally on a number line. Use MEDIAN_HIGH or MEDIAN_LOW to not return the average middle (in case of an even number of values, but instead return the highest or lowest of the two middle values.
Example:
{ 'sales_med': { AGG_MED: 'sales_usd' } }
MODE
MODE returns the most common value, if any. If there is no exact most common value, None is returned. Values can be non-numeric.
PERCENTILE (not implemented yet)
Calculates the value that falls at a given percentile in specified column in the group. Column must be numeric.
Example:
{ 'percentile': { AGG_PERCENTILE: 'transaction_id', 'percentile': 25 } }
SUM
Returns the total of all numeric values in specified column in the group.
Example:
{ 'sales_total': { AGG_SUM: 'sales_usd' } }
STDDEV (SAMPLE/POPULATION)
Returns the standard deviation of the numeric values of the specifed column in the group. STDDEV is used to analyze statistical data. This aggregation will return None if less than two values provided. Use STDDEV_S or STDDEV_P to return either a sample or population standard deviation.
Example:
{ 'score_stdev': { AGG_STDDEV_S: 'score' } }
VARIANCE (SAMPLE/POPULATION)
Returns the variance of the numeric values of the specified column in the group. VARIANCE is used to analyze statistical data. This aggregation will return None if less than two values provided. Use VARIANCE_S or VARIANCE_P to return either a sample or population standard deviation.
Example:
{ 'score_var': { AGG_VARIANCE_S: 'score' } }
COUNT
Returns the number of records in the group. The specification is slightly different, instead of a dictionary, you only specify the aggregation, as shown in the exmple:
Example:
{ 'nr_of_transactions': AGG_COUNT }
- Todo:
- Filter conditions. E.g. SUM where summed > 100.
- ABS(), ROUND() options
- Do we need a null_is_last option for the first and last functions?
- Percentile
- Args:
- d_row_in (dict)
d_row_in is a dictonary containing sorted row data.
- Returns:
- d_row_out (dict)
d_row_out contains all the keys specified in the
group
andaggregations
options. Ifreturn_all_cols
is set to True it will also include the non-key columns of incoming dictionary.Attention
Any keys with the same name as the specified aggregation keys will be overwritten.
If
return_all_rows
is specified, all rows will be returned, else only one row per group.
Parameters: - group (list) –
- aggregations (dict) –
- name (str) –
- null_is_zero (bool) –
- return_all_rows (bool) –
- return_all_cols (bool) –