Lookup

TODO list:
  • Implement LKP_VAL_EXP and LKP_VAL_REX
  • Implement case (in)sensitivity, both in comparison and in dict keys (PEP455)
  • Issue a WARNING if a key already exists and will be overwritten.
  • Prevent “SAWarning: Textual column expression”
  • Use bind parameters in SQL Order By
  • Stop on warnings?
  • Decide if we want to support an SQL-override which is an SQLAlchemy Select
class bonobo_trans.lookup.DbLookup(*args, **kwargs)

The DbLookup transformation looks up data in a SQL database.

Note

Database connectivity is provided by SQLAlchemy. The engine must be provided via a Bonobo Service, containing an ‘sqlalchemy.engine’.
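For example, a minimal Bonobo graph could wire the lookup and the engine service together as in the sketch below. The service key 'sqlalchemy.engine', the example table and the extract/load nodes are illustrative assumptions:

    import bonobo
    import sqlalchemy

    from bonobo_trans.lookup import DbLookup

    def extract():
        # Hypothetical upstream node yielding rows as dicts.
        yield {'user_id': 1}
        yield {'user_id': 2}

    def load(d_row):
        # Hypothetical downstream node.
        print(d_row)

    def get_graph():
        graph = bonobo.Graph()
        graph.add_chain(
            extract,
            DbLookup(
                table_name='users',
                comparison=[{'user_id': 'user_id'}],  # match on the incoming 'user_id' value
            ),
            load,
        )
        return graph

    def get_services():
        # The service key 'sqlalchemy.engine' is an assumption; the engine itself
        # is a regular SQLAlchemy engine.
        return {'sqlalchemy.engine': sqlalchemy.create_engine('sqlite:///example.db')}

    if __name__ == '__main__':
        bonobo.run(get_graph(), services=get_services())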

Configuration options

Required:

  • comparison (list)

Optional:

  • table_name (str)
  • sql_override (str or sqlalchemy.sql.Select)
  • sql_override_cols (list of sqlalchemy.sql.schema.Column)
  • order_by (str, int, dict, list of str/int/dict)
  • ret_fields (str, dict, list of str)
  • ret_prefix (str)
  • name (str)
  • verbose_init (bool)
  • verbose_sql (bool)
  • verbose_data (bool)
  • multiple_match (int)
  • caching (int)

Description of the options:

Note

Not all options can be used together. When an invalid combination is specified, the transformation will generally issue a warning or an error, depending on the severity.

Note

The lookup transformation starts by loading the lookup data ("source-data") into memory, unless the caching option is set to LKP_CACHING_DISABLED. If caching is disabled, the 'sql_*' options are ignored.

table_name, sql_override
If the sql_override option is used, the table_name option is ignored. The sql_override can be either a SQL statement (str) or a SQLAlchemy Select object. The SQL override provides some flexibility, but it also has downsides. One of them is that SQLAlchemy database reflection is no longer limited to a single table, which can cause an error if the database contains an otherwise unrelated, invalid view.
sql_override_cols

If an sql_override is provided as a SQL string (as opposed to an SQLAlchemy Select object), there is no column list available to create the SQLite memory table with. To enable caching using an SQLite memory table, the sql_override_cols option can be used to provide this list of columns.

This option must be a list of SQLAlchemy columns including datatype.

order_by

An ORDER BY is useful when more than one row is expected, in conjunction with LKP_MM_FST or LKP_MM_LST. It may also be useful in conjunction with LKP_MM_LOV and LKP_MM_ALL to sort the output of multiple rows. The 'order_by' option can be specified as follows:

  • column position (int)

  • column name (str)

  • column name/position + direction (dict)

  • a list of any of the above, for example:

    [ 1, 'user_id', {'user_type': 'DESC'}, {6: 'ASC'} ]
    

Caution

Don’t specify more than one column in a single dictionary, as Python does not guarantee the order of dict keys.

comparison

The search comparison lies at the core of the lookup. It consists of a list of lists of dictionaries. Example:

[ [ {d1}, {d2} ],    <- inner list il1
  [ {d3} ]       ]   <- inner list il2

The two inner lists (il1 and il2), each containing condition dicts, are AND-grouped together. The conditions within an inner list are OR-grouped together. The example lookup would be created as follows:

WHERE (d1=X OR d2=Y) AND d3=Z

Each condition is a dictionary containing a key-value pair, where the key represents a column in the lookup table and the value is the value to compare with. There are also "special keys", as described below.

  • Special key: _compare_with_

    The condition value can be of four different types:

    value        Description
    LKP_VAL_COL  The 'value' is a column name available in the d_row (default)
    LKP_VAL_VAL  The 'value' is hardcoded (default fallback, if 'value' is not a valid column)
    LKP_VAL_EXP  The 'value' is an expression [not implemented]
    LKP_VAL_REX  The 'value' is a regular expression [not implemented]
  • Special key: _operator_

    By default the comparison operator is 'equal to', but a different operator can be used by adding an _operator_ key containing one of the following (string) values (see the example after this list):

    ==, !=, <>      Equal / not equal
    <, >, <=, >=    Less than / greater than
    in, not in      Set membership
    
  • Special key: _case_sensitive_

    (not implemented)
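For instance, a single condition combining these special keys could look like the sketch below (the import path of the LKP_VAL_* constants is an assumption):

    from bonobo_trans.lookup import DbLookup, LKP_VAL_VAL  # import path assumed

    comparison = [
        {
            'user_type': 'ADMIN',           # compare column 'user_type' ...
            '_compare_with_': LKP_VAL_VAL,  # ... with the hardcoded value 'ADMIN' ...
            '_operator_': '!=',             # ... using 'not equal' instead of the default
        }
    ]

    lookup = DbLookup(table_name='users', comparison=comparison)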

More examples:

Simple OR-comparison (note one group!):

[{'target_field_name1':'matching_value', 'target_field_name2':'matching_value','_operator_':'<>'}]

Doing a between:

[{'target_field_name1':'matching_value', '_operator_':'>'},{'target_field_name1':'matching_value', '_operator_':'<'}]

This will not work, because groups are AND-ed together; use IN instead:

[{'target_field_name1':'matching_value'},{'target_field_name1':'matching_value'}]

Do this instead:

[{'target_field_name1':'(matching_value1, matching_value2)', '_operator_':'in'}]

AND-comparison (note two groups!):

[{'target_field_name1':'matching_value'}
,{'target_field_name2':'matching_value'}]

Composite AND+OR-comparison:

[ {'target_field_name1':'matching_value'
  ,'target_field_name2':'matching_value'}
, {'target_field_name3':'matching_value'}
, {'target_field_name4':'matching_value'}]

Multiple blocks evaluate as: ({1} OR {2}) AND {3} AND {4}
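As a rough sketch (not the library's actual implementation) of how these grouping rules translate into a WHERE clause, using SQLAlchemy directly and treating the values as literals (LKP_VAL_VAL behaviour); table and column names below are made up:

    from sqlalchemy import Column, Integer, MetaData, String, Table, and_, or_, select

    users = Table(
        'users', MetaData(),
        Column('user_id', Integer),
        Column('user_type', String(20)),
        Column('user_role', String(20)),
        Column('active', Integer),
    )

    # One dict per AND-group; keys within a dict are OR-ed: ({1} OR {2}) AND {3}
    comparison = [
        {'user_type': 'ADMIN', 'user_role': 'ROOT'},
        {'active': 1},
    ]

    groups = [or_(*(users.c[col] == val for col, val in grp.items())) for grp in comparison]
    stmt = select(users).where(and_(*groups))  # SQLAlchemy 1.4+ style select()
    print(stmt)
    # SELECT ... FROM users
    # WHERE (user_type = :user_type_1 OR user_role = :user_role_1) AND active = :active_1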

ret_fields

List of column(s) to return from the lookup table. If not specified, all columns will be returned. Use a dictionary to output an alias, for example: {'lkp_user_id': 'user_id'} will return the value of column 'user_id' under the key 'lkp_user_id' (instead of 'user_id').

It is allowed to return the same value multiple times, for example:

{'user_id_1':'user_id', 'user_id_2':'user_id'}

The ‘ret_fields’ option can be specified as follows:

  • None (return all fields)

  • column name (str)

  • alias(es) + column name(s) (dict, key=alias, val=source column)

  • a list of any of the above, for example:

    [ 'user_id', {'user_type_A': 'user_type'}, {'user_type_B': 'user_type', 'user_description': 'user_desc'} ]
    
ret_prefix
Prefix to apply to the output column keys. For example: 'lkp_users' would output column 'user_id' as 'lkp_users.user_id'. Note the separating dot.
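A hypothetical combination of ret_fields aliasing and ret_prefix (whether the prefix is applied on top of the alias is an assumption):

    from bonobo_trans.lookup import DbLookup

    lookup = DbLookup(
        table_name='users',
        comparison=[{'user_id': 'user_id'}],
        ret_fields=[{'lkp_user_id': 'user_id'}, 'user_name'],  # alias + plain column
        ret_prefix='lkp_users',
    )
    # An incoming row {'user_id': 42, ...} could then come out with the added keys
    # 'lkp_users.lkp_user_id' and 'lkp_users.user_name'.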
name

If this option is specified, an extra special key will be added to the output, called __LKP_<name>__. This key is a dict containing misc. lookup details.

CACHING           Caching strategy
MULTIPLE_MATCH    Multiple match setting
LKP_ROWS_SOURCE   Number of rows in the source-data
LKP_ROWS_MATCHED  Number of rows matched (before selecting first or last)
LKP_ROW_NR        Row index, in case of LKP_MM_ALL
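For example, with name='users' an outgoing row might carry a metadata dict along these lines (all values below are made up for illustration):

    d_row_out = {
        'user_id': 42,
        'user_name': 'Alice',
        '__LKP_users__': {
            'CACHING': 2,             # caching strategy in effect
            'MULTIPLE_MATCH': 1,      # multiple match setting in effect
            'LKP_ROWS_SOURCE': 1500,  # rows in the source-data
            'LKP_ROWS_MATCHED': 1,    # rows matched
            'LKP_ROW_NR': 0,          # row index, only relevant for LKP_MM_ALL
        },
    }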
verbose_init
If True, initialization details are printed to the output.
verbose_sql
If True, the generated SQL statement for the source data and lookups will be printed.
verbose_data
If True, every outgoing row is printed.

Note

Notes on multiple match

The DbLookup can return either a single row or multiple rows. However, often only one row is required or expected. The 'multiple_match' setting decides what happens when more than one row is received from the database.

Setting     Result
LKP_MM_ALL  Return all rows; this will generate additional rows in the Bonobo chain!
LKP_MM_LOV  Return all rows as one list (List of Values)
LKP_MM_ANY  Return first received row
LKP_MM_FST  Return first row (more efficient than returning the last)
LKP_MM_LST  Return last row (use first instead, if possible)
LKP_MM_ERR  Raise an error
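For example, to keep only the most recent matching row, the source data can be ordered and only the first row kept (constant import path, table and column names are assumptions):

    from bonobo_trans.lookup import DbLookup, LKP_MM_FST  # import path assumed

    newest_order = DbLookup(
        table_name='orders',
        comparison=[{'customer_id': 'customer_id'}],
        order_by=[{'order_date': 'DESC'}],  # newest first
        multiple_match=LKP_MM_FST,          # keep only the first row of the ordered result
    )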

Note

Notes on caching

The DbLookup first pulls in the table on which to perform the lookup. It stores this in one of the data structures below.

  • None / Disabled: Don't cache data. A query is run on the lookup table for every row passing through the lookup. Useful when the lookup table data changes during processing. The "source-data" 'sql_*' options cannot be used in this mode.

  • SQLite memory table: Requires column datatypes and therefore cannot be used in combination with a 'raw' SQL statement, unless sql_override_cols is provided (see the sketch at the end of this section).

  • Pandas DataFrame

    SQLite vs Pandas: https://github.com/thedataincubator/data-science-blogs/blob/master/sqlite-vs-pandas.md

    Caching can be enabled via the ‘caching’-option. It accepts the following values:

    caching                 Description
    LKP_CACHING_DISABLED    Don't use caching
    LKP_CACHING_ENABLED     Use the default caching strategy
    LKP_CACHING_SQLITE_MEM  Force SQLite memory table
    LKP_CACHING_PANDAS      Force Pandas DataFrame

    Tip

    To improve performance and reduce memory usage, limit the source selection as much as possible, using the ret_fields option (limit columns) and a WHERE clause (limit rows) whenever possible.
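As a sketch, a raw SQL override can still be cached in an SQLite memory table when sql_override_cols supplies the column datatypes (constant import path, column names and types are illustrative):

    from sqlalchemy import Column, Integer, String

    from bonobo_trans.lookup import DbLookup, LKP_CACHING_SQLITE_MEM  # import path assumed

    lookup = DbLookup(
        sql_override="SELECT user_id, user_name FROM users WHERE active = 1",
        sql_override_cols=[
            Column('user_id', Integer),
            Column('user_name', String(50)),
        ],
        comparison=[{'user_id': 'user_id'}],
        caching=LKP_CACHING_SQLITE_MEM,
    )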

Todo:

Issues, to-dos and outstanding questions

    1. How much data can we handle before starting to run into problems? Consider implementing Dask.
    2. What problems can we expect when the source data is big?
  • TODO) Cleaner import statements (what’s the best practice?)
  • TODO) Implement LKP_VAL_EXP and LKP_VAL_REX
Args:
  • d_row_in (dict)

d_row_in is a dictionary containing row data. It must contain all columns specified in the 'comparison'.

Returns:
  • d_row_out (dict)

d_row_out contains all the keys of the incoming dictionary plus keys (columns) specified in ‘ret_fields’ (or all columns if not specified). Any keys in the incoming record that are identical to return columns are overwritten.

If ‘multiple_match’ = LKP_MM_ALL, the transformation may yield more than one row.

Parameters:
  • engine (str) – SQLAlchemy engine, provided as a Bonobo service (see Note above).
  • comparison (list) – Search comparison; a list of condition groups (see above).
  • table_name (str) – Name of the lookup table.
  • ret_fields (str, dict, list) – Column(s) to return from the lookup table.
  • ret_prefix (str) – Prefix to apply to the output column keys.
  • sql_override (str or sqlalchemy.sql.Select) – SQL statement overriding table_name.
  • sql_override_cols (list) – SQLAlchemy columns describing the sql_override result set.
  • order_by (str, int, dict, list) – ORDER BY specification for the source data.
  • name (str) – Name used in the __LKP_<name>__ output key.
  • multiple_match (int) – Multiple match setting (LKP_MM_*).
  • caching (int) – Caching strategy (LKP_CACHING_*).
  • verbose_init (bool) – If True, print initialization details.
  • verbose_sql (bool) – If True, print generated SQL statements.
  • verbose_data (bool) – If True, print every outgoing row.