Data Structure
Blur is a table-based query system, so within a single shard cluster there can be many different tables, each with a different schema, shard size, analyzers, etc. Each table contains Rows. A Row contains a row id (a Lucene StringField internally) and many Records. A Record has a record id (a Lucene StringField internally), a family (a Lucene StringField internally), and many Columns. A Column contains a name and a value; both are Strings in the API, but the value can be interpreted as different types. All base Lucene Field types are supported: Text, String, Long, Int, Double, and Float.
We will start with the most basic structure and build on it.
Columns
Columns contain a name and a value; both are strings in the API but can be interpreted as an Integer, Float, Long, Double, String, or Text. All Column types default to Text and are analyzed during the indexing process.
Column {"name" => "value"}
Records
A Record contains a record id, a family, and one or more Columns.
Record {
"recordId" => "1234",
"family" => "family1",
"columns" => [
Column {"column1" => "value1"},
Column {"column2" => "value2"},
Column {"column2" => "value3"},
Column {"column3" => "value4"}
]
}
Quick Tip!
The column names do not have to be unique within a Record, so you can treat multiple Columns with the same name as an array of values. The order of the values is maintained.
Rows
Rows contain a row id and a list of Records.
Row {
"id" => "r-5678",
"records" => [
Record {
"recordId" => "1234",
"family" => "family1",
"columns" => [
Column {"column1" => "value1"},
Column {"column2" => "value2"},
Column {"column2" => "value3"},
Column {"column3" => "value4"}
]
},
Record {
"recordId" => "9012",
"family" => "family1",
"columns" => [
Column {"column1" => "value1"}
]
},
Record {
"recordId" => "4321",
"family" => "family2",
"columns" => [
Column {"column16" => "value1"}
]
}
]
}
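The structure above can be sketched in plain Java. Note these are illustrative stand-in classes, not Blur's actual Thrift-generated API; they only mirror the shape of the data, including duplicate column names within a Record.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java stand-ins for Row/Record/Column; the real Blur API uses
// Thrift-generated classes, so these names and constructors are illustrative only.
class Column {
    final String name;
    final String value;
    Column(String name, String value) { this.name = name; this.value = value; }
}

class Record {
    final String recordId;
    final String family;
    final List<Column> columns; // duplicate names allowed; order is preserved
    Record(String recordId, String family, List<Column> columns) {
        this.recordId = recordId;
        this.family = family;
        this.columns = columns;
    }
}

class Row {
    final String id;
    final List<Record> records;
    Row(String id, List<Record> records) { this.id = id; this.records = records; }
}

public class RowExample {
    public static Row buildExampleRow() {
        Record r1 = new Record("1234", "family1", Arrays.asList(
                new Column("column1", "value1"),
                new Column("column2", "value2"),   // "column2" appears twice:
                new Column("column2", "value3"),   // an ordered array of values
                new Column("column3", "value4")));
        Record r2 = new Record("9012", "family1",
                Arrays.asList(new Column("column1", "value1")));
        Record r3 = new Record("4321", "family2",
                Arrays.asList(new Column("column16", "value1")));
        return new Row("r-5678", Arrays.asList(r1, r2, r3));
    }

    public static void main(String[] args) {
        Row row = buildExampleRow();
        System.out.println(row.id + " has " + row.records.size() + " records");
    }
}
```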
Querying
All queries follow the basic Lucene query syntax; see http://lucene.apache.org/core/4_4_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html for an extensive explanation of the syntax.
All queries can use boolean logic, for example:
+docs.body:hadoop +docs.author:jon
Which is the same as:
docs.body:hadoop AND docs.author:jon
Row Queries
Row queries allow you to execute queries across Records within the same Row; they are similar in concept to an inner join. Let's say you want to find all the Rows that contain a Record with the family "author" whose "name" Column contains the term "Jon", and another Record with the family "docs" whose "body" Column contains the term "Hadoop".
+<author.name:Jon> +<docs.body:Hadoop>
Text
Text fields are analyzed with Lucene's standard analyzers, which means the string is broken down into terms, capitalization is removed from the terms, and special punctuation is stripped. See Lucene's documentation for further explanation.
Examples:
To run a query to find all the rows that contain a column with a term of "hadoop" where the family is "docs" and the column is "body".
docs.body:hadoop
To run a query to find all the rows that contain a column with a term of "hadoop" and "awesome" where the family is "docs" and the column is "body".
docs.body:(+hadoop +awesome)
To run a query to find all the rows that contain a column with a phrase of "hadoop is awesome" where the family is "docs" and the column is "body".
docs.body:"hadoop is awesome"
To run a query to find all the rows that contain a column with a term of "hadoop", allowing for misspellings (a fuzzy query), where the family is "docs" and the column is "body".
docs.body:hadoop~
To run a query to find all the rows that contain a column with a word that matches a wildcard pattern of "h*d?op" where the family is "docs" and the column is "body".
docs.body:h*d?op
String
String fields are indexed Columns that are not analyzed; they are indexed as is. Do not use String fields for large amounts of text, as this will increase the size of your index and probably not give you the desired behavior. Given the strings "Hadoop" and "hadoop", these will be indexed as two different terms because the String field is case sensitive. Also, if the string contains "The cow jumps over the moon.", the single term placed into the index is "The cow jumps over the moon." as one string. This field type is normally used for id or type lookups.
Examples:
To run a query to find all the rows that contain a column with a term of "Hadoop" where the family is "docs" and the column is "type".
docs.type:Hadoop
Numeric
The numeric types are:
- int
- long
- float
- double
Numeric types can perform two types of queries:
- Exact Match
- Range
Examples:
To run a query to find all the rows that contain a column with a value of "12345" where the family is "docs" and the column is "id".
docs.id:12345
To run a query to find all the rows that contain a column with a starting value of "12345" and an ending value of "54321" where the family is "docs" and the column is "id".
docs.id:[12345 TO 54321]
To run a query to find all the rows that contain a column with a value less than "12345" where the family is "docs" and the column is "id".
docs.id:[MIN TO 12345}
To run a query to find all the rows that contain a column with a value less than or equal to "12345" where the family is "docs" and the column is "id".
docs.id:[MIN TO 12345]
To run a query to find all the rows that contain a column with a value greater than "12345" where the family is "docs" and the column is "id".
docs.id:{12345 TO MAX]
To run a query to find all the rows that contain a column with a value greater than or equal to "12345" where the family is "docs" and the column is "id".
docs.id:[12345 TO MAX]
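The bracket semantics above ('[' and ']' for inclusive bounds, '{' and '}' for exclusive) can be captured in a small helper. This helper is our own illustration, not part of the Blur API; it simply builds the query strings shown in the examples.

```java
// Illustrative helper (not part of the Blur API) that builds range-query
// strings: '[' / ']' are inclusive bounds, '{' / '}' are exclusive.
public class RangeQueryBuilder {
    public static String range(String family, String column,
                               String from, boolean fromInclusive,
                               String to, boolean toInclusive) {
        return family + "." + column + ":"
                + (fromInclusive ? "[" : "{") + from + " TO " + to
                + (toInclusive ? "]" : "}");
    }

    public static void main(String[] args) {
        // Matches docs.id values from 12345 to 54321, both ends inclusive.
        System.out.println(range("docs", "id", "12345", true, "54321", true));
        // Matches docs.id values strictly less than 12345.
        System.out.println(range("docs", "id", "MIN", true, "12345", false));
    }
}
```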
Date
Date types are basically a long field type with a built-in date parser. The date type can perform two types of queries:
- Exact Match
- Range
Examples:
To run a query to find all the rows that contain a column with a value of "2012-09-11" where the family is "docs" and the column is "published_date".
docs.published_date:2012-09-11
To run a query to find all the rows that contain a column with a starting value of "2012-09-11" and an ending value of "2013-09-11" where the family is "docs" and the column is "published_date".
docs.published_date:[2012-09-11 TO 2013-09-11]
Spatial
Spatial queries are supported through the Lucene spatial module. There are currently three different strategies:
- Point Vector
- Term Query Prefix Tree
- Recursive Prefix Tree
Examples:
To run a query to find all the rows that contain a location within 10 km (0.089932 degrees, roughly 10 km) of the GIS coordinate "33.0, -88.0" where the family is "docs" and the column is "location".
docs.location:"Intersects(Circle(33.0, -88.0 d=0.089932))"
To run a query to find all the rows that contain a location within 10 km of the GIS coordinate "33.0, -88.0" where the family is "docs" and the column is "location".
docs.location:"Intersects(Circle(33.0, -88.0 d=10.0km))"
To run a query to find all the rows that contain a location within 10 miles of the GIS coordinate "33.0, -88.0" where the family is "docs" and the column is "location".
docs.location:"Intersects(Circle(33.0, -88.0 d=10.0m))"
Types
Text
The Text Type has the type name of:
text
Property Options:
- "stopWordPath" -Optional- default value is no stop words. This should be an HDFS path; the stop words in the file are loaded into the StandardAnalyzer for this field, one term per line.
- "analyzerClass" -Optional- default value is a standard analyzer with no stop words. This can be any Analyzer class that has a default constructor or one that takes a Lucene Version enum.
String
The String Type has the type name of:
string
Property Options:
- None
Long
The Long Type has the type name of:
long
Property Options:
- "numericPrecisionStep" -Optional- default value is "4"
Integer
The Integer Type has the type name of:
int
Property Options:
- "numericPrecisionStep" -Optional- default value is "4"
Float
The Float Type has the type name of:
float
Property Options:
- "numericPrecisionStep" -Optional- default value is "4"
Double
The Double Type has the type name of:
double
Property Options:
- "numericPrecisionStep" -Optional- default value is "4"
Date
The Date Type has the type name of:
date
Property Options:
- "dateFormat" -Required- Examples: "yyyy-MM-dd", "MM/dd/yyyy" or anything that SimpleDateFormat can parse.
- "timeUnit" -Optional- Default is SECONDS. Other options (DAYS, HOURS, MINUTES, SECONDS, MILLISECONDS)
- "numericPrecisionStep" -Optional- default value is "4"
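The "dateFormat" property is a pattern that Java's SimpleDateFormat can parse, and the parsed date ends up stored as a long in the configured "timeUnit". The sketch below illustrates this idea with stdlib classes only; it is our own simplification (assuming UTC for determinism), not Blur's internal implementation.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;
import java.util.concurrent.TimeUnit;

// Sketch of the idea behind the "date" type: the "dateFormat" property is a
// SimpleDateFormat pattern, and the parsed date is stored as a long in the
// configured "timeUnit" (SECONDS by default). Illustration only, not Blur's
// actual internal code.
public class DateTypeSketch {
    public static long toIndexedValue(String pattern, String value, TimeUnit unit) {
        SimpleDateFormat format = new SimpleDateFormat(pattern);
        format.setTimeZone(TimeZone.getTimeZone("UTC")); // assumption: UTC
        try {
            long millis = format.parse(value).getTime();
            return unit.convert(millis, TimeUnit.MILLISECONDS);
        } catch (ParseException e) {
            throw new RuntimeException("Value does not match dateFormat: " + value, e);
        }
    }

    public static void main(String[] args) {
        // "2012-09-11" parsed with "yyyy-MM-dd", stored in seconds since the epoch.
        System.out.println(toIndexedValue("yyyy-MM-dd", "2012-09-11", TimeUnit.SECONDS));
    }
}
```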
Stored
The Stored Type has the type name of:
stored
Property Options:
- None
Spatial
Point Vector
The Point Vector Spatial Type has the type name of:
geo-pointvector
Property Options:
- None
Supported Indexing Shapes:
- Point
Supported Querying Shapes:
- Circle
- Rectangle
Supported Querying Operations:
- Intersects
Term Prefix
The Term Prefix Spatial Type has the type name of:
geo-termprefix
Property Options:
- "spatialPrefixTree" can be either "GeohashPrefixTree" or "QuadPrefixTree"
- "maxLevels" -Optional- default value is "11"
Supported Indexing Shapes:
- Point
Supported Querying Shapes:
- Circle
- Rectangle
Supported Querying Operations:
- Intersects
Recursive Prefix
The Recursive Prefix Spatial Type has the type name of:
geo-recursiveprefix
Property Options:
- "spatialPrefixTree" can be either "GeohashPrefixTree" or "QuadPrefixTree"
- "maxLevels" -Optional- default value is "11"
Supported Indexing Shapes:
- Point
- Circle
- Rectangle
Supported Querying Shapes:
- Circle
- Rectangle
Supported Querying Operations:
- IsDisjointTo
- Intersects
- IsWithin
- Contains
Custom Types
Custom types in Blur allow you to create your own Lucene types as well as plug into the query parser so your custom type can be queried.
Creating
You will need to extend the "org.apache.blur.analysis.FieldTypeDefinition" class found in the blur-query module. If you need to use a different Analyzer than the StandardAnalyzer used in the "text" type, just extend "org.apache.blur.analysis.type.TextFieldTypeDefinition" and make the appropriate changes.
For types that require custom query parsing or custom "org.apache.lucene.index.IndexableField" manipulation without the use of an Analyzer, extend "org.apache.blur.analysis.type.CustomFieldTypeDefinition".
Example
Below is a simple type that is basically the same as the "string" type, but implemented by extending "org.apache.blur.analysis.type.CustomFieldTypeDefinition".

import java.util.Map;

import org.apache.blur.analysis.type.CustomFieldTypeDefinition;
import org.apache.blur.thrift.generated.Column;
import org.apache.hadoop.conf.Configuration;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class ExampleType extends CustomFieldTypeDefinition {
private String _fieldNameForThisInstance;
/**
* Get the name of the type.
*
* @return the name.
*/
@Override
public String getName() {
return "example";
}
/**
* Configures this instance for the type.
*
* @param fieldNameForThisInstance
* the field name for this instance.
* @param properties
* the properties passed into this type definition from the
* {@link Blur.Iface#addColumnDefinition(String, ColumnDefinition)}
* method.
*/
@Override
public void configure(String fieldNameForThisInstance, Map properties,
Configuration configuration) {
_fieldNameForThisInstance = fieldNameForThisInstance;
}
/**
* Create {@link Field}s for the index as well as for storing the original
* data for retrieval.
*
* @param family
* the family name.
* @param column
* the column that holds the name and value.
*
* @return the {@link Iterable} of {@link Field}s.
*/
@Override
public Iterable<? extends Field> getFieldsForColumn(String family, Column column) {
String name = family + "." + column.getName();
String value = column.getValue();
return makeIterable(new StringField(name, value, Store.YES));
}
/**
 * Create {@link Field}s for the index but do NOT store the data because this
 * is a sub column.
*
* @param family
* the family name.
* @param column
* the column that holds the name and value.
* @param subName
* the sub column name.
*
* @return the {@link Iterable} of {@link Field}s.
*/
@Override
public Iterable<? extends Field> getFieldsForSubColumn(String family, Column column,
String subName) {
String name = family + "." + column.getName() + "." + subName;
String value = column.getValue();
return makeIterable(new StringField(name, value, Store.NO));
}
/**
* Gets the query from the text provided by the query parser.
*
* @param text
* the text provided by the query parser.
* @return the {@link Query}.
*/
@Override
public Query getCustomQuery(String text) {
return new TermQuery(new Term(_fieldNameForThisInstance, text));
}
}
Distributing
Once you have created and tested your custom type, you will need to copy the jar file containing it to all the servers in the cluster. The jar file needs to be located in the $BLUR_HOME/lib directory. Once it is there, all the servers will need to be restarted so the jar file is picked up on the classpath.
In a later version of Blur we hope to have this be a dynamic operation that can be performed without restarting the cluster.
Using
You can either add your custom type to the entire cluster or per table.
Cluster Wide
For cluster wide configuration you will need to add the new field types into the blur-site.properties file on each server.
blur.fieldtype.customtype1=org.apache.blur.analysis.type.ExampleType1
blur.fieldtype.customtype2=org.apache.blur.analysis.type.ExampleType2
...
Please note that only the "blur.fieldtype." prefix of the property name is used, because the type gets its name from the internal "getName" method. However, the property names will need to be unique within the file.
Single Table
For a single table configuration you will need to add the new field types into the tableProperties map in the TableDescriptor as you define the table.
tableDescriptor.putToTableProperties("blur.fieldtype.customtype1",
"org.apache.blur.analysis.type.ExampleType1");
tableDescriptor.putToTableProperties("blur.fieldtype.customtype2",
"org.apache.blur.analysis.type.ExampleType2");
...
Please note that only the "blur.fieldtype." prefix of the property name is used, because the type gets its name from the internal "getName" method. However, the property names will need to be unique within the map.