Zebra Reference Guide
Zebra Types
Zebra supports simple types (int, long, float, double, string, bytes), complex types (record, collection, map), and the Boolean type (bool).
Zebra Type | Description |
---|---|
int | signed 32-bit integer |
long | signed 64-bit integer |
float | 32-bit floating point |
double | 64-bit floating point |
string | character array (string) in Unicode UTF-8 format |
bytes | byte array (blob) |
record | An ordered set of fields. A field can be any Zebra type. |
collection | A set of records. |
map | A set of key/value pairs. The key is type string; the value can be any Zebra type. |
bool | Boolean {0,1} false/true |
Zebra type names are chosen to be as “technology neutral” as possible and resemble the native types of modern programming languages.
Zebra | Pig | Avro | SQL |
---|---|---|---|
int | int | int | integer |
long | long | long | long |
float | float | float | float,real |
double | double | double | double precision |
string | chararray | string | varchar |
bytes | bytearray | bytes | raw |
record | tuple | record | hash |
collection | bag | array | list |
map | map | map | hash |
bool | boolean | boolean | bool |
Store Schema
Use the Zebra store schema to write or store Zebra columns and to specify column types. The schema supports data type compatibility and conversion between Zebra/Pig, Zebra/MapReduce, and Zebra/Streaming. (In a future release, the schema will also support type compatibility between Zebra/Pig-SQL and will guide the underlying serialization formats provided by Avro for projection, filtering, and so on.)
The basic format for the store schema is shown here. The type name is optional; if not specified, the column defaults to type bytes.
column_name[:type_name] [, column_name[:type_name] ... ]
Schemas for Simple Data Types
Simple data types include int, long, float, double, string, and bytes. The following syntax also applies to Booleans.
Syntax
field_alias[:type] [, field_alias[:type] …]
Terms
Term | Description |
---|---|
field_alias | The name assigned to the field column. |
:type | (Optional) The simple data type assigned to the field. The alias and type are separated by a colon ( : ). If the type is omitted, the field defaults to type bytes. |
, | Multiple fields are separated by commas. |
Examples
In this example the schema specifies names and types for 3 columns.
ZebraSchema.createZebraSchema(JobContext, "s1:string, f1:float, i1:int");
In this example the schema specifies names for 3 columns; all 3 columns default to type bytes.
ZebraSchema.createZebraSchema(JobContext, "f1, f2, f3");
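Typed and untyped columns can also be mixed in a single schema. The following sketch (column names are illustrative, and JobContext is the same placeholder used in the examples above) declares s1 as a string and i1 as an int, while f2 falls back to the default type bytes.
ZebraSchema.createZebraSchema(JobContext, "s1:string, f2, i1:int");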
Schemas for Records
A record is an ordered set of fields. A field can be any Zebra type.
Syntax
record_alias:record (field_alias[:type] [, field_alias[:type] …])
Terms
Term | Description |
---|---|
record_alias | The name assigned to the record column. |
:record | The record designator. |
( ) | The record notation, a set of parentheses. |
field_alias | The name assigned to the field. |
:type | (Optional) The type assigned to a field (can be any Zebra type). |
, | Multiple fields are separated by commas. |
Examples
In this example the schema specifies a record with two fields.
ZebraSchema.createZebraSchema(JobContext, "r1:record(f1:int,f2:long)");
In this example the schema specifies a record (r1) containing a nested record (r2) with two fields. Note that f2 will default to type bytes.
ZebraSchema.createZebraSchema(JobContext, "r1:record(r2:record(f1:int,f2))");
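Because a record field can be any Zebra type, a record can also carry a complex field such as a map. The following line is an illustrative sketch only, assuming the record and map grammars compose as described above; the field names are hypothetical.
ZebraSchema.createZebraSchema(JobContext, "r1:record(f1:int, m1:map(string))");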
Schemas for Collections
A collection is a set of records.
Syntax
collection_alias:collection ([record_alias:]record(...))
Terms
Term | Description |
---|---|
collection_alias | The name assigned to the collection. |
:collection | The collection designator. |
( ) | The collection notation, a set of parentheses. |
record_alias | The name assigned to the record. |
record | The record designator. The record can be specified with or without the record alias. |
Examples
In this example the schema specifies a collection of records, each consisting of two fields.
ZebraSchema.createZebraSchema(JobContext, "c1:collection(r1:record(f1:int,f2:long))");
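As noted in the Terms above, the record alias is optional. The following sketch declares the same collection without naming the inner record; the column and field names are illustrative.
ZebraSchema.createZebraSchema(JobContext, "c1:collection(record(f1:int,f2:long))");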
Schemas for Maps
A map is a set of key/value pairs.
Syntax
map_alias:map (type)
Terms
Term | Description |
---|---|
map_alias | The name assigned to the map column. |
:map | The map designator. |
( ) | The map notation, a set of parentheses. |
type | The type assigned to the map’s value (can be any Zebra type). Note that the map’s key is always type string and is not specified. |
Examples
In this example the schema specifies a map with value of type string.
ZebraSchema.createZebraSchema(JobContext, "m1:map(string)");
In this example the schema specifies a map with value of type map (with a value of type int).
ZebraSchema.createZebraSchema(JobContext, "m2:map(map(int))");
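Simple and complex columns can be combined in a single store schema. The following sketch (column names are illustrative) declares a string column, a map column with int values, and an untyped column that defaults to bytes.
ZebraSchema.createZebraSchema(JobContext, "s1:string, m1:map(int), b1");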
Storage Specification
Use the Zebra storage specification to define Zebra column groups. The storage specification, when combined with a STORE statement, describes the physical structure of a Zebra table. Suppose we have the following statement:
STORE A INTO '$PATH/mytable' USING org.apache.hadoop.zebra.pig.TableStorer('[a1, a2] AS cg1; [a3, a4, a5] AS cg2');
The statement describes a table that has two column groups; the first column group has two columns, the second column group has three columns. The statement can be interpreted as follows:
- $PATH/mytable - the table, a file path to a directory named mytable
- $PATH/mytable/cg1 - the first column group, a subdirectory named cg1 under directory mytable
- $PATH/mytable/cg1/part00001 - a file consisting, conceptually, of columns a1 and a2
- $PATH/mytable/cg2 - the second column group, a subdirectory named cg2 under directory mytable
- $PATH/mytable/cg2/part00001 - a file consisting, conceptually, of columns a3, a4, and a5
Specification
The basic format for the Zebra storage specification is shown here. In this specification, the square brackets [ ] designate a column group and the curly braces { } designate an optional syntax component.
Syntax
[column_name {, column_name ...} ] {AS column_group_name} {COMPRESS BY compressor_name} {SERIALIZE BY serializer_name}
{; [column_name {, column_name ...} ] {AS column_group_name} {COMPRESS BY compressor_name} {SERIALIZE BY serializer_name} ... }
Terms
Term | Description |
---|---|
[ ] | Brackets designate a column group. Multiple column groups are separated by semi-colons. |
AS column_group_name | Optional. The name of the column group. Column group names are unique within one table and are case sensitive: c1 and C1 are different. Column group names are used as the physical column group directory path names. If specified, the AS clause must immediately follow the column group [ ]. If not specified, Zebra will assign unique default names for the table: CG0, CG1, CG2, and so on. (If CGx is specified by the programmer, then it cannot be used by Zebra.) |
COMPRESS BY compressor_name | Optional. Valid values for compressor_name include gz (default) and lzo. If not specified, gz is used. |
SERIALIZE BY serializer_name | Optional. Valid values for serializer_name include pig (default). (In a future release, Avro will be available.) If not specified, pig is used. |
column_name | The name of one or more columns that form the column group. |
Examples
In this example, one column group is specified; the two statements are equivalent.
STORE A INTO '$PATH' USING org.apache.hadoop.zebra.pig.TableStorer('[c1]');
STORE A INTO '$PATH' USING org.apache.hadoop.zebra.pig.TableStorer('[c1] AS CG0 COMPRESS BY gz SERIALIZE BY pig;');
In this example, two column groups are specified. The first column group, C1, has two columns. The second column group, C2, has three columns.
STORE A INTO '$PATH' USING org.apache.hadoop.zebra.pig.TableStorer('[a1, a2] AS C1; [a3, a4, a5] AS C2');
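The optional clauses can be combined per column group. The following sketch (column and group names are illustrative, and the relation A is assumed to contain the listed columns) stores the first column group with lzo compression and the second with the defaults stated above (gz and pig).
STORE A INTO '$PATH' USING org.apache.hadoop.zebra.pig.TableStorer('[a1, a2] AS C1 COMPRESS BY lzo; [a3, a4, a5] AS C2 COMPRESS BY gz SERIALIZE BY pig');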
Load Schema
Use the Zebra load schema to load or read table columns.
Schema
The basic format for the Zebra load (read) schema is shown here. The columns can be of any Zebra type. If no columns are specified, the entire Zebra table is loaded.
column_name [, column_name ... ]
Terms
Term | Description |
---|---|
column_name | The column name. Multiple columns are separated by commas. |
Example
Three Pig examples are shown here.
-- All columns are loaded
A = LOAD '$PATH/tbl1' USING org.apache.hadoop.zebra.pig.TableLoader();
-- Two columns are projected
B = LOAD '$PATH/tbl2' USING org.apache.hadoop.zebra.pig.TableLoader('c1,c2');
-- Three columns are projected: a simple field, a map, and a record
C = LOAD '$PATH/tbl3' USING org.apache.hadoop.zebra.pig.TableLoader('c1,c2#{key1},col4.{f1}');
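As an end-to-end sketch (the paths and column names are hypothetical), a projected load can be combined with a store into a new table whose storage specification places the projected columns in one column group.
-- Project two columns, then store them as a single column group in a new table
D = LOAD '$PATH/tbl2' USING org.apache.hadoop.zebra.pig.TableLoader('c1,c2');
STORE D INTO '$PATH/tbl2_copy' USING org.apache.hadoop.zebra.pig.TableStorer('[c1, c2] AS CG0');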