Lookup¶
- TODO list:
- Implement EXP and REX
- Implement case (in)sensitivity, both in comparison and in dict keys (PEP455)
- Issue a WARNING if a key already exists and will be overwritten.
- Prevent “SAWarning: Textual column expression”
- Use bind parameters in SQL Order By
- Stop on warnings?
- Decide if we want to support an SQL-override which is an SQLAlchemy Select
-
class
bonobo_trans.lookup.
DbLookup
(*args, **kwargs)¶ The DbLookup transformation looks up data in a SQL database.
Note
Database connectivity is provided by SQLAlchemy. The engine must be provided via a Bonobo Service, containing an ‘sqlalchemy.engine’.
Description of the options:
Note
Not all options can be used together, generally, the transformation will issue a warning or error, depending on the severity, when invalid combinations are issued.
Note
The lookup transformation starts by loading the lookup data (“source-data”) in memory, unless the
caching
setting is set to LKP_CACHING_DISABLED. If disabled, the ‘sql_*’-options will be ignored.- table_name, sql_override
- If the
sql_override
option is used, thetable_name
option will be ignored. Thesql_override
can either be a SQL statement (str) or a SQLAlchemy sql object. The SQL-override provides for some flexibility, however it also has downsides. One of them is that the SQLAlchemy database reflection is not longer limited to one table, causing an error if there is an, otherwise unrelated, invalid view in the database. - sql_override_cols
If an sql_override is provided as an SQL string (opposed to an SQLAlchemy Select object) we will not have a column list to create an SQLite memory table with. To enable caching using a SQLite memory table the
sql_override_cols
-option can be used to provide this list of columns.This option must be a list of SQLAlchemy columns including datatype.
- order_by
An ORDER BY is useful in a scenario where more than one row is expected and in conjunction with LKP_MM_FST or LKP_MM_LST. It may also be useful in conjunction with LKP_MM_LOV and LKP_MM_ALL to sort the output of multiple rows. The ‘order_by’ option can be specified as follows:
column position (int)
column name (str)
column name/position + direction (dict)
a list of any of the above, for example:
[ 1,'user_id',{'user_type','DESC'},{6,'ASC'} ]
Caution
Don’t specify more than one column in a single dictionary, as Python does not guarantee the order of dict keys.
- comparison
The search comparison lies at the core of the lookup. It consists of a list of lists of dictionaries. Example:
[ol1 [il1 {d1},{d2} ], [il2 {d3} ] ]
The two ‘inner lists’ (il1 and il2), containing the dicts, are AND- grouped together. The conditions within the inner lists are OR-grouped together. The example lookup would be created as follows:
WHERE (d1=X OR d2=Y) AND d3=Z
Each condition is a dictionary containing a key-value pair where the key represents a column in the lookup table and the value the value to compare with. There are also “special keys”, as described below.
Special key: _compare_with_
The condition value can be of three different types:
value Description LKP_VAL_COL The ‘value’ is actually a column name available in the d_row (default) LKP_VAL_VAL The ‘value’ is hardcoded (default fallback, if ‘value’ is not a valid column) LKP_VAL_EXP The ‘value’ is an expression [not implemented] LKP_VAL_REX The ‘value’ is a regular expression [not implemented] Special key: _operator_
By default the comparison operator is an ‘equal to’, but a different operator can be used by adding an _operator_ key, containing one of the following (string) values:
== or != Equality <, >, <=, >= or <> Larger or smaller in or not in Sets
Special key: _case_sensitive_
(not implemented)
More examples:
Simple OR-comparison (note one group!):
[{'target_field_name1':'matching_value', 'target_field_name2':'matching_value','_operator_':'<>'}]
Doing a between:
[{'target_field_name1':'matching_value', '_operator_':'>'},{'target_field_name1':'matching_value','operator':'<'}]
This will not work, because groups are AND-ed together, use IN instead:
[{'target_field_name1':'matching_value'},{'target_field_name1':'matching_value'}]
Do this instead:
[{'target_field_name1':'(matching_value1, matching_value2)','operator':'IN'}}]
AND-comparison (note two groups!):
[{'target_field_name1':'matching_value'} ,{'target_field_name2':'matching_value'}]
Composite AND+OR-comparison:
[ {'target_field_name1':'matching_value' ,'target_field_name2':'matching_value'} , {'target_field_name3':'matching_value'} , {'target_field_name4':'matching_value'}]
Multiple blocks result = ({1} OR {2}) AND {3} AND {4}
- ret_fields
List of column(s) to return from lookup table. If not specified, all columns will be returned. Use a dictionary to output an alias, for example: {‘lkp_user_id’,’user_id’} will return a dict with the key ‘lkp_user_id’ (instead of ‘user_id’) and value of ‘user_id’.
It is allowed to return the same value multiple times, for example:
{'user_id_1':'user_id', 'user_id_2':'user_id'}
The ‘ret_fields’ option can be specified as follows:
None (return all fields)
column name (str)
alias(es) + column name(s) (dict, key=alias, val=source column)
a list of any of the above, for example:
[ 'user_id', {'user_type_A','user_type'}, {'user_type_B':'user_type', 'user_description','user_desc'} ]
- column_prefix
- Prefix to apply to output column keys. For example: ‘lkp_users’ would output column ‘user_id’ as ‘lkp_users.user_id’. Note the separation dot.
- name
If this option is specified, an extra special key will be added to the output, called __LKP_<name>__. This key is a dict containing misc. lookup details.
CACHING Caching strategy MULTIPLE_MATCH Multiple match setting LKP_ROWS_SOURCE Number of rows in the source-data LKP_ROWS_MATCHED Number of rows matched (before selecting first or last) LKP_ROW_NR Row index, in case of LKP_MM_ALL - verbose_init
- If True, all interesting details are printed to the output.
- verbose_sql
- If True, the generated SQL statement for the source data and lookups will be printed.
- verbose_data
- If True, every outgoing row is printed.
Note
Notes on multiple match
The DbLookup can either return a single or multiple rows. However, often only one row is required or expected. This setting decides what happens when more than one row is received from the database.
Setting Result LKP_MM_ALL Return all rows, this will generate additional rows in the Bonobo chain! LKP_MM_LOV Return all rows as one list (List of Values) LKP_MM_ANY Return first received row LKP_MM_FST Return first row (more efficient than return last) LKP_MM_LST Return last row (use first instead, if possible) LKP_MM_ERR Raise error Note
Notes on caching
The DbLookup first pulls in the table on which to perform the lookup. It stores this in one of the data structures below.
None / Disabled Don’t cache data. Do a query on the lookup table for every row passing through the lookup. Useful when the lookup table data changes during processing. You cannot specify any “source-data” ‘sql_*’-options.
SQLite memory table Requires datatypes, thus cannot be used in combination with a ‘raw’ SQL statement.
Pandas DataFrame
SQLite vs Pandas: https://github.com/thedataincubator/data-science-blogs/blob/master/sqlite-vs-pandas.md
Caching can be enabled via the ‘caching’-option. It accepts the following values:
caching
Description LKP_CACHING_DISABLED Don’t use caching LKP_CACHING_ENABLED Will select the default LKP_CACHING_SQLITE_MEM Force SQLite memory table LKP_CACHING_PANDAS Force Pandas DataFrame Tip
To improve performance and reduce memory usage, always limit the source selection as much as possible using the
ret_fields
(limit columns) option and a where clauses (limit rows) whenever possible.
- Todo:
Issues, todo’s and outstanding questions
- How much data can we handle before starting to run into problems?
Consider implementing Dask.- What problems can we expect when the source data is big?
- TODO) Cleaner import statements (what’s the best practice?)
- TODO) Implement LKP_VAL_EXP and LKP_VAL_REX
- Args:
- d_row_in (dict)
d_row_in is a dictonary containing row data. It must contain all columns specified in the ‘comparison’.
- Returns:
- d_row_out (dict)
d_row_out contains all the keys of the incoming dictionary plus keys (columns) specified in ‘ret_fields’ (or all columns if not specified). Any keys in the incoming record that are identical to return columns are overwritten.
If ‘multiple_match’ = LKP_MM_ALL, the transformation may yield more than one row.
Parameters: - engine (str) –
- comparison (list) –
- table_name (str) –
- ret_fields –
- ret_prefix (str) –
- sql_override (str) –
- sql_override_cols (list) –
- order_by –
- name (str) –
- multiple_match (int) –
- caching (int) –
- verbose_init (bool) –
- verbose_sql (bool) –
- verbose_data (bool) –