Intersect Function
Users can use intersection function to calculate the value of the intersection of two data sets, with some same dimensions and one varied dimension, to analyze the retention or conversion rates.
kylin supports the following intersection function,
INTERSECT_COUNT
Description
- Returns the distinct count of the intersection of multiple result sets in different conditions
Syntax
intersect_count(column_to_count, column_to_filter, filter_value_list)
Parameters
column_to_count
, the column to be calculated and applied on distinct count, required to be added as Precise count distinct measurecolumn_to_filter
, the varied dimensionfilter_value_list
, the value of the varied dimensions listed inarray[]
, Whencolumn_to_filter
is of type varchar, A single element in an array can map multiple values. By default, the '|' is split. You can setkylin.query.intersect.separator
inkylin.properties
to configure the separator, Can take value '|' or ',', default is '|'(Currently this parameter does not support the use of subquery results as parameters).
Note: When the data type of a varied dimension is not varchar or integer, the values in 'filter_value_list' need to be explicitly cast, for example:
select intersect_count(column_to_count, column_to_filter, array[cast(3.53 as double), cast(5.79 as double)]) from TEST_TABLE
ORselect intersect_count(column_to_count, column_to_filter, array[TIMESTAMP'2012-01-02 11:23:45', TIMESTAMP'2012-01-01 11:23:45']) from TEST_TABLE;
Query Example 1
Take the sample dataset provided by kylin as an example, table
SSB.P_LINEORDER
simulates the online transaction data, and the following query can return the percentile of sellers who are trading day by day during 1992.01.01 to 1992.01.03.select LO_SHIPMODE,
intersect_count(LO_CUSTKEY, LO_ORDERDATE, array[date'1992-01-01']) as first_day,
intersect_count(LO_CUSTKEY, LO_ORDERDATE, array[date'1992-01-02']) as second_day,
intersect_count(LO_CUSTKEY, LO_ORDERDATE, array[date'1992-01-03']) as third_day,
intersect_count(LO_CUSTKEY, LO_ORDERDATE, array[date'1992-01-01',date'1992-01-02']) as retention_oneday,
intersect_count(LO_CUSTKEY, LO_ORDERDATE, array[date'1992-01-01',date'1992-01-02',date'1992-01-03']) as retention_twoday
from SSB.P_LINEORDER
where LO_ORDERDATE in (date'1992-01-01',date'1992-01-02',date'1992-01-03')
group by LO_SHIPMODEResponse Example 1
The result shows that there is no seller keeps trading constantly during this period.
Query Example 2
select
intersect_count(LO_CUSTKEY, LO_SHIPMODE, array['RAIL|SHIP|TRUCK', 'TRUCK']) as test_column
from SSB.P_LINEORDERResponse Example 2
INTERSECT_VALUE
Description
- Returns the values of the intersection of multiple result sets in different conditions. If the returned result is large, it may cause the analysis page browser to crash.
Syntax
intersect_value(column_to_count, column_to_filter, filter_value_list)
Parameters
column_to_count
, the column to be calculated and applied on distinct value required to be added as Precise count distinct measure. And only columns of type tinyint, smallint, or integer are supported.column_to_filter
, the varied dimensionfilter_value_list
, the value of the varied dimensions listed inarray[]
, Whencolumn_to_filter
is of type varchar, A single element in an array can map multiple values. By default, the '|' is split. You can setkylin.query.intersect.separator
inkylin.properties
to configure the separator, Can take value '|' or ',', default is '|'(Currently this parameter does not support the use of subquery results as parameters).
Note: When the data type of a varied dimension is not varchar or integer, the values in 'filter_value_list' need to be explicitly cast, for example:
select intersect_value(column_to_count, column_to_filter, array[cast(3.53 as double), cast(5.79 as double)]) from TEST_TABLE
ORselect intersect_value(column_to_count, column_to_filter, array[TIMESTAMP'2012-01-02 11:23:45', TIMESTAMP'2012-01-01 11:23:45']) from TEST_TABLE;
Query Example 1
Fact table
SSB.P_LINEORDER
simulates the online transaction data. And data type ofLO_CUSTKEY
column isinteger
. The following query can return the ids of sellers who are trading day by day during 1992.01.01 to 1992.01.03.select LO_SHIPMODE,
intersect_value(LO_CUSTKEY, LO_ORDERDATE, array[date'1992-01-01']) as first_day,
intersect_value(LO_CUSTKEY, LO_ORDERDATE, array[date'1992-01-02']) as second_day,
intersect_value(LO_CUSTKEY, LO_ORDERDATE, array[date'1992-01-03']) as third_day,
intersect_value(LO_CUSTKEY, LO_ORDERDATE, array[date'1992-01-01',date'1992-01-02']) as retention_oneday,
intersect_value(LO_CUSTKEY, LO_ORDERDATE, array[date'1992-01-01',date'1992-01-02',date'1992-01-03']) as retention_twoday
from SSB.P_LINEORDER
where PART_DT in (date'1992-01-01',date'1992-01-02',date'1992-01-03')
group by LO_SHIPMODEResponse Example 1
The result shows that set of keeping trading constantly's sellerId during this period.
Query Example 2
select
intersect_count(LO_CUSTKEY, LO_SHIPMODE, array['RAIL|SHIP|TRUCK', 'TRUCK']) as test_column
from SSB.P_LINEORDERResponse Example 2
INTERSECT_COUNT_V2
- Description
- Returns the distinct count of the intersection of multiple result sets in different conditions. Support Regexp in condition.
- Syntax
intersect_count_v2(column_to_count, column_to_filter, filter_value_list, filter_type)
- Parameters
column_to_count
, the column to be calculated and applied on distinct count, required to be added as Precise count distinct measurecolumn_to_filter
, the varied dimensionfilter_value_list
, the value of the varied dimensions listed inarray[]
,filter_type
, the data type is String, which identifies the filter mode. There are currently two optional values "RAWSTRING" and "REGEXP". When the parameter value is "RAWSTRING", the filtering mode is precise filtering. Whencolumn_to_filter
is a Varchar type, A single element in the array can be mapped with multiple values. By default, it is separated by '|'. You can usekylin.query.intersect.separator
to configure the separator. And only support configuration in thekylin.properties
file. (currently this parameter does not support using the results of subqueries as parameters). When the parameter value is "REGEXP", the filtering mode is regular matching, and only the value of the regular expression in column_to_filter that can match the filter_value_list will be filtered.
Note: When the filter_type is "RAWSTRING" and the data type of a varied dimension is not varchar or integer, the values in 'filter_value_list' need to be explicitly cast, for example:
select intersect_count_v2(column_to_count, column_to_filter, array[cast(3.53 as double), cast(5.79 as double)], 'RAWSTRING') from TEST_TABLE
ORselect intersect_count_v2(column_to_count, column_to_filter, array[TIMESTAMP'2012-01-02 11:23:45', TIMESTAMP'2012-01-01 11:23:45'], 'RAWSTRING') from TEST_TABLE;
Query Example 1
select intersect_count_v2(
LO_CUSTKEY, LO_SHIPMODE,
array['R*L', 'TRU.*'], 'SHIP')
from SSB.P_LINEORDERResponse Example 1
INTERSECT_VALUE_V2
- Description
- Returns the values of the intersection of multiple result sets in different conditions. If the returned result is large, it may cause the analysis page browser to crash. Support Regexp in condition.
- Syntax
intersect_value_v2(column_to_count, column_to_filter, filter_value_list, filter_type)
- Parameters
column_to_count
, the column to be calculated and applied on distinct value required to be added as Precise count distinct measure. Only when type of column is one of integer family(tinyint、smallint、integer、bigint) and override model propertieskylin.query.skip-encode-integer-enabled=true
, the values returned is actual. Otherwise, encoded value will be returned.column_to_filter
, the varied dimensionfilter_value_list
, the value of the varied dimensions listed inarray[]
,filter_type
, the data type is String, which identifies the filter mode. There are currently two optional values "RAWSTRING" and "REGEXP". When the parameter value is "RAWSTRING", the filtering mode is precise filtering. Whencolumn_to_filter
is a Varchar type, A single element in the array can be mapped with multiple values. By default, it is separated by '|'. You can usekylin.query.intersect.separator
to configure the separator. And only support configuration in thekylin.properties
file. (currently this parameter does not support using the results of subqueries as parameters). When the parameter value is "REGEXP", the filtering mode is regular matching, and only the value of the regular expression in column_to_filter that can match the filter_value_list will be filtered.
Note: When the filter_type is "RAWSTRING" and the data type of a varied dimension is not varchar or integer, the values in 'filter_value_list' need to be explicitly cast, for example:
select intersect_value_v2(column_to_count, column_to_filter, array[cast(3.53 as double), cast(5.79 as double)], 'RAWSTRING') from TEST_TABLE
ORselect intersect_value_v2(column_to_count, column_to_filter, array[TIMESTAMP'2012-01-02 11:23:45', TIMESTAMP'2012-01-01 11:23:45'], 'RAWSTRING') from TEST_TABLE;
Query Example 1
select intersect_value_v2(
LO_CUSTKEY, LO_SHIPMODE,
array['R*L', 'TRU.*'], 'SHIP')
from SSB.P_LINEORDERResponse Example 1
Limitations
- All the above functions don't support pushdown query
- All the above functions don't support detailed index answers (even with the switch kylin.query.use-tableindex-answer-non-raw-query = true)