Data Structure
Blur is a table-based query system, so within a single shard cluster there can be many different tables, each with a different schema, shard size, analyzers, etc. Each table contains Rows. A Row contains a row id (a Lucene StringField internally) and many Records. A Record has a record id (a Lucene StringField internally), a family (a Lucene StringField internally), and many Columns. A Column contains a name and a value; both are Strings in the API, but the value can be interpreted as different types. All base Lucene Field types are supported: Text, String, Long, Int, Double, and Float.
We will start with the most basic structure and build on it.
Columns
Columns contain a name and a value; both are strings in the API but can be interpreted as an Integer, Float, Long, Double, String, or Text. All Column types default to Text and are analyzed during the indexing process.
Column {"name" => "value"}
Records
A Record contains a record id, a family, and one or more Columns.
Record {
"recordId" => "1234",
"family" => "family1",
"columns" => [
Column {"column1" => "value1"},
Column {"column2" => "value2"},
Column {"column2" => "value3"},
Column {"column3" => "value4"}
]
}
Quick Tip!
The column names do not have to be unique within a Record, so you can treat multiple Columns with the same name as an array of values. The order of the values is maintained.
Rows
Rows contain a row id and a list of Records.
Row {
"id" => "r-5678",
"records" => [
Record {
"recordId" => "1234",
"family" => "family1",
"columns" => [
Column {"column1" => "value1"},
Column {"column2" => "value2"},
Column {"column2" => "value3"},
Column {"column3" => "value4"}
]
},
Record {
"recordId" => "9012",
"family" => "family1",
"columns" => [
Column {"column1" => "value1"}
]
},
Record {
"recordId" => "4321",
"family" => "family2",
"columns" => [
Column {"column16" => "value1"}
]
}
]
}
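The structure above can be sketched in plain Java. Note these are illustrative stand-in classes, not Blur's actual Thrift-generated API; they only mirror the shape of the data, including duplicate column names within a Record.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java stand-ins for Row/Record/Column; the real Blur API uses
// Thrift-generated classes, so these names and constructors are illustrative only.
class Column {
    final String name;
    final String value;
    Column(String name, String value) { this.name = name; this.value = value; }
}

class Record {
    final String recordId;
    final String family;
    final List<Column> columns; // duplicate names allowed; order is preserved
    Record(String recordId, String family, List<Column> columns) {
        this.recordId = recordId;
        this.family = family;
        this.columns = columns;
    }
}

class Row {
    final String id;
    final List<Record> records;
    Row(String id, List<Record> records) { this.id = id; this.records = records; }
}

public class RowExample {
    public static Row buildExampleRow() {
        Record r1 = new Record("1234", "family1", Arrays.asList(
                new Column("column1", "value1"),
                new Column("column2", "value2"),   // "column2" appears twice:
                new Column("column2", "value3"),   // an ordered array of values
                new Column("column3", "value4")));
        Record r2 = new Record("9012", "family1",
                Arrays.asList(new Column("column1", "value1")));
        Record r3 = new Record("4321", "family2",
                Arrays.asList(new Column("column16", "value1")));
        return new Row("r-5678", Arrays.asList(r1, r2, r3));
    }

    public static void main(String[] args) {
        Row row = buildExampleRow();
        System.out.println(row.id + " has " + row.records.size() + " records");
    }
}
```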
Querying
All queries follow the basic Lucene query syntax; see http://lucene.apache.org/core/4_4_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html for an extensive explanation of the syntax.
All queries can use boolean logic, for example:
+docs.body:hadoop +docs.author:jon
Which is the same as:
docs.body:hadoop AND docs.author:jon
Row Queries
Row queries allow you to execute queries across Records within the same Row; they are similar in concept to an inner join. Let's say you want to find all the Rows that contain a Record with the family "author" whose "name" Column contains the term "Jon", and another Record with the family "docs" whose "body" Column contains the term "Hadoop".
+<author.name:Jon> +<docs.body:Hadoop>
Text
Text fields are analyzed with Lucene's standard analyzers, which means the string is broken down into terms, capitalization is removed from the terms, and special punctuation is stripped. See Lucene's documentation for further explanation.
Examples:
To run a query to find all the rows that contain a column with a term of "hadoop" where the family is "docs" and the column is "body".
docs.body:hadoop
To run a query to find all the rows that contain a column with a term of "hadoop" and "awesome" where the family is "docs" and the column is "body".
docs.body:(+hadoop +awesome)
To run a query to find all the rows that contain a column with a phrase of "hadoop is awesome" where the family is "docs" and the column is "body".
docs.body:"hadoop is awesome"
To run a query to find all the rows that contain a column with a term of "hadoop", allowing for misspellings (a fuzzy query), where the family is "docs" and the column is "body".
docs.body:hadoop~
To run a query to find all the rows that contain a column with a word that matches a wildcard pattern of "h*d?op" where the family is "docs" and the column is "body".
docs.body:h*d?op
String
String fields are indexed Columns that are not analyzed; they are indexed as is. Do not use String fields for large amounts of text, as this will increase the size of your index and probably not give you the desired behavior. Given the strings "Hadoop" and "hadoop", these will be indexed as two different terms because the String field is case sensitive. Also, if the string contains "The cow jumps over the moon.", the single term placed into the index is "The cow jumps over the moon." as one string. This field type is normally used for id or type lookups.
Examples:
To run a query to find all the rows that contain a column with a term of "Hadoop" where the family is "docs" and the column is "type".
docs.type:Hadoop
Numeric
The numeric types are:
- int
- long
- float
- double
Numeric types can perform two types of queries:
- Exact Match
- Range
Examples:
To run a query to find all the rows that contain a column with a value of "12345" where the family is "docs" and the column is "id".
docs.id:12345
To run a query to find all the rows that contain a column with a starting value of "12345" and an ending value of "54321" where the family is "docs" and the column is "id".
docs.id:[12345 TO 54321]
To run a query to find all the rows that contain a column with a value less than "12345" where the family is "docs" and the column is "id".
docs.id:[MIN TO 12345}
To run a query to find all the rows that contain a column with a value less than or equal to "12345" where the family is "docs" and the column is "id".
docs.id:[MIN TO 12345]
To run a query to find all the rows that contain a column with a value greater than "12345" where the family is "docs" and the column is "id".
docs.id:{12345 TO MAX]
To run a query to find all the rows that contain a column with a value greater than or equal to "12345" where the family is "docs" and the column is "id".
docs.id:[12345 TO MAX]
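The bracket semantics above ('[' and ']' for inclusive bounds, '{' and '}' for exclusive) can be captured in a small helper. This helper is our own illustration, not part of the Blur API; it simply builds the query strings shown in the examples.

```java
// Illustrative helper (not part of the Blur API) that builds range-query
// strings: '[' / ']' are inclusive bounds, '{' / '}' are exclusive.
public class RangeQueryBuilder {
    public static String range(String family, String column,
                               String from, boolean fromInclusive,
                               String to, boolean toInclusive) {
        return family + "." + column + ":"
                + (fromInclusive ? "[" : "{") + from + " TO " + to
                + (toInclusive ? "]" : "}");
    }

    public static void main(String[] args) {
        // Matches docs.id values from 12345 to 54321, both ends inclusive.
        System.out.println(range("docs", "id", "12345", true, "54321", true));
        // Matches docs.id values strictly less than 12345.
        System.out.println(range("docs", "id", "MIN", true, "12345", false));
    }
}
```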
Date
Date types are basically a long field type with a built-in date parser. The date type can perform two types of queries:
- Exact Match
- Range
Examples:
To run a query to find all the rows that contain a column with a value of "2012-09-11" where the family is "docs" and the column is "published_date".
docs.published_date:2012-09-11
To run a query to find all the rows that contain a column with a starting value of "2012-09-11" and an ending value of "2013-09-11" where the family is "docs" and the column is "published_date".
docs.published_date:[2012-09-11 TO 2013-09-11]
Spatial
Spatial queries are supported through the Lucene spatial module. There are currently three different strategies:
- Point Vector
- Term Query Prefix Tree
- Recursive Prefix Tree
Examples:
To run a query to find all the rows that contain a location within 10 km (0.089932 degrees, roughly 10 km) of the GIS coordinate "33.0, -88.0" where the family is "docs" and the column is "location".
docs.location:"Intersects(Circle(33.0, -88.0 d=0.089932))"
To run a query to find all the rows that contain a location within 10 km of the GIS coordinate "33.0, -88.0" where the family is "docs" and the column is "location".
docs.location:"Intersects(Circle(33.0, -88.0 d=10.0km))"
To run a query to find all the rows that contain a location within 10 miles of the GIS coordinate "33.0, -88.0" where the family is "docs" and the column is "location".
docs.location:"Intersects(Circle(33.0, -88.0 d=10.0m))"
Types
Text
The Text Type has the type name of:
text
Property Options:
- "stopWordPath" -Optional- default value is no stop words. This should be an HDFS path; the stop words in the file are loaded into the StandardAnalyzer for this field, one term per line.
- "analyzerClass" -Optional- default value is a standard analyzer with no stop words. This can be any Analyzer class that has a default constructor or one that takes a Lucene Version enum.
String
The String Type has the type name of:
string
Property Options:
- None
Long
The Long Type has the type name of:
long
Property Options:
- "numericPrecisionStep" -Optional- default value is "4"
Integer
The Integer Type has the type name of:
int
Property Options:
- "numericPrecisionStep" -Optional- default value is "4"
Float
The Float Type has the type name of:
float
Property Options:
- "numericPrecisionStep" -Optional- default value is "4"
Double
The Double Type has the type name of:
double
Property Options:
- "numericPrecisionStep" -Optional- default value is "4"
Date
The Date Type has the type name of:
date
Property Options:
- "dateFormat" -Required- Examples: "yyyy-MM-dd", "MM/dd/yyyy" or anything that SimpleDateFormat can parse.
- "timeUnit" -Optional- Default is SECONDS. Other options (DAYS, HOURS, MINUTES, SECONDS, MILLISECONDS)
- "numericPrecisionStep" -Optional- default value is "4"
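The "dateFormat" property is a pattern that Java's SimpleDateFormat can parse, and the parsed date ends up stored as a long in the configured "timeUnit". The sketch below illustrates this idea with stdlib classes only; it is our own simplification (assuming UTC for determinism), not Blur's internal implementation.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;
import java.util.concurrent.TimeUnit;

// Sketch of the idea behind the "date" type: the "dateFormat" property is a
// SimpleDateFormat pattern, and the parsed date is stored as a long in the
// configured "timeUnit" (SECONDS by default). Illustration only, not Blur's
// actual internal code.
public class DateTypeSketch {
    public static long toIndexedValue(String pattern, String value, TimeUnit unit) {
        SimpleDateFormat format = new SimpleDateFormat(pattern);
        format.setTimeZone(TimeZone.getTimeZone("UTC")); // assumption: UTC
        try {
            long millis = format.parse(value).getTime();
            return unit.convert(millis, TimeUnit.MILLISECONDS);
        } catch (ParseException e) {
            throw new RuntimeException("Value does not match dateFormat: " + value, e);
        }
    }

    public static void main(String[] args) {
        // "2012-09-11" parsed with "yyyy-MM-dd", stored in seconds since the epoch.
        System.out.println(toIndexedValue("yyyy-MM-dd", "2012-09-11", TimeUnit.SECONDS));
    }
}
```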
Stored
The Stored Type has the type name of:
stored
Property Options:
- None
Spatial
Point Vector
The Point Vector Spatial Type has the type name of:
geo-pointvector
Property Options:
- None
Supported Indexing Shapes:
- Point
Supported Querying Shapes:
- Circle
- Rectangle
Supported Querying Operations:
- Intersects
Term Prefix
The Term Prefix Spatial Type has the type name of:
geo-termprefix
Property Options:
- "spatialPrefixTree" can be either "GeohashPrefixTree" or "QuadPrefixTree"
- "maxLevels" -Optional- default value is "11"
Supported Indexing Shapes:
- Point
Supported Querying Shapes:
- Circle
- Rectangle
Supported Querying Operations:
- Intersects
Recursive Prefix
The Recursive Prefix Spatial Type has the type name of:
geo-recursiveprefix
Property Options:
- "spatialPrefixTree" can be either "GeohashPrefixTree" or "QuadPrefixTree"
- "maxLevels" -Optional- default value is "11"
Supported Indexing Shapes:
- Point
- Circle
- Rectangle
Supported Querying Shapes:
- Circle
- Rectangle
Supported Querying Operations:
- IsDisjointTo
- Intersects
- IsWithin
- Contains
Custom Types
Custom types in Blur allow you to create your own Lucene types as well as plug into the query parser so your custom type can be queried.
Creating
You will need to extend the "org.apache.blur.analysis.FieldTypeDefinition" class found in the blur-query module. If you need to use a different Analyzer than the StandardAnalyzer used in the "text" type, just extend "org.apache.blur.analysis.type.TextFieldTypeDefinition" and make the appropriate changes.
For types that require custom query parsing or custom "org.apache.lucene.index.IndexableField" manipulation without the use of an Analyzer, extend "org.apache.blur.analysis.type.CustomFieldTypeDefinition".
Example
Below is a simple type that is basically the same as the "string" type, but implemented by extending "org.apache.blur.analysis.type.CustomFieldTypeDefinition".

import java.util.Map;

import org.apache.blur.analysis.type.CustomFieldTypeDefinition;
import org.apache.blur.thrift.generated.Column;
import org.apache.hadoop.conf.Configuration;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class ExampleType extends CustomFieldTypeDefinition {
private String _fieldNameForThisInstance;
/**
* Get the name of the type.
*
* @return the name.
*/
@Override
public String getName() {
return "example";
}
/**
* Configures this instance for the type.
*
* @param fieldNameForThisInstance
* the field name for this instance.
* @param properties
* the properties passed into this type definition from the
* {@link Blur.Iface#addColumnDefinition(String, ColumnDefinition)}
* method.
*/
@Override
public void configure(String fieldNameForThisInstance, Map properties,
Configuration configuration) {
_fieldNameForThisInstance = fieldNameForThisInstance;
}
/**
* Create {@link Field}s for the index as well as for storing the original
* data for retrieval.
*
* @param family
* the family name.
* @param column
* the column that holds the name and value.
*
* @return the {@link Iterable} of {@link Field}s.
*/
@Override
public Iterable<? extends Field> getFieldsForColumn(String family, Column column) {
String name = family + "." + column.getName();
String value = column.getValue();
return makeIterable(new StringField(name, value, Store.YES));
}
/**
 * Create {@link Field}s for the index but do NOT store the data because this
 * is a sub column.
*
* @param family
* the family name.
* @param column
* the column that holds the name and value.
* @param subName
* the sub column name.
*
* @return the {@link Iterable} of {@link Field}s.
*/
@Override
public Iterable<? extends Field> getFieldsForSubColumn(String family, Column column,
String subName) {
String name = family + "." + column.getName() + "." + subName;
String value = column.getValue();
return makeIterable(new StringField(name, value, Store.NO));
}
/**
* Gets the query from the text provided by the query parser.
*
* @param text
* the text provided by the query parser.
* @return the {@link Query}.
*/
@Override
public Query getCustomQuery(String text) {
return new TermQuery(new Term(_fieldNameForThisInstance, text));
}
}
Distributing
Once you have created and tested your custom type, you will need to copy the jar file containing it to all the servers in the cluster. The jar file needs to be located in the $BLUR_HOME/lib directory. Once it is there, all the servers will need to be restarted so the jar file is picked up on the classpath.
In a later version of Blur we hope to have this be a dynamic operation that can be performed without restarting the cluster.
Using
You can either add your custom type to the entire cluster or per table.
Cluster Wide
For cluster wide configuration you will need to add the new field types into the blur-site.properties file on each server.
blur.fieldtype.customtype1=org.apache.blur.analysis.type.ExampleType1
blur.fieldtype.customtype2=org.apache.blur.analysis.type.ExampleType2
...
Please note that only the "blur.fieldtype." prefix of the property name is used, because the type gets its name from the internal "getName" method. However, the property names will need to be unique within the file.
Single Table
For a single table configuration you will need to add the new field types into the tableProperties map in the TableDescriptor as you define the table.
tableDescriptor.putToTableProperties("blur.fieldtype.customtype1",
"org.apache.blur.analysis.type.ExampleType1");
tableDescriptor.putToTableProperties("blur.fieldtype.customtype2",
"org.apache.blur.analysis.type.ExampleType2");
...
Please note that only the "blur.fieldtype." prefix of the property name is used, because the type gets its name from the internal "getName" method. However, the property names will need to be unique within the map.