Executing a Query in HDFS

1. Connecting VXQuery with HDFS

In order to read HDFS data, VXQuery needs access to the HDFS configuration directory, which contains:

core-site.xml hdfs-site.xml mapred-site.xml

Some systems may automatically set this directory as a system environment variable ("HADOOP_CONF_DIR"). If this is the case, VXQuery will retrieve this automatically when attempting to perform HDFS queries.

When this variable is not set, users will need to provide this directory as a Command Line Option when executing VXQuery: -hdfs-conf /path/to/hdfs/conf_folder

2. Running the Query

For files stored in HDFS there are 2 ways to access them from VXQuery.

  1. Reading them as whole files.
  2. Reading them block by block.

a. Reading them as whole files.

For this option you only need to change the path to files. To define that your file(s) exist and should be read from HDFS you must add "hdfs:/" in front of the path. VXQuery will read the path of the files you request in your query and try to locate them.

So in order to run a query that will read the input files from HDFS you need to make sure that

a) The environmental variable is set for "HADOOP_CONF_DIR" or you pass the directory location using -hdfs-conf

b) The path defined in your query begins with hdfs:// and the full path to the file(s).

c) The path exists on HDFS and the user that runs the query has read permission to these files.

Example

I want to find all the books that are published after 2004.

The file is located in HDFS in this path /user/hduser/store/books.xml

My query will look like this:

for $x in collection("hdfs://user/hduser/store")
where $x/year>2004
return $x/title

If I want only one file, the books.xml to be parsed from HDFS, my query will look like this:

for $x in doc("hdfs://user/hduser/store/books.xml")
where $x/year>2004
return $x/title

b. Reading them block by block

In order to use that option you need to modify your query. Instead of using the collection or doc function to define your input file(s) you need to use collection-with-tag.

collection-with-tag accepts two arguments, one is the path to the HDFS directory you have stored your input files, and the second is a specific tag that exists in the input file(s). This is the tag of the element that contains the fields that your query is looking for.

Other than these arguments, you do not need to change anything else in the query.

Note: since this strategy is optimized to read block by block, the result will include all elements with the given tag, regardless of depth within the xml tree.

Example

The same example, using collection-with-tag.

My input file books.xml:

<?xml version="1.0" encoding="UTF-8"?>
<bookstore>

<book>
  <title lang="en">Everyday Italian</title>
  <author>Giada De Laurentiis</author>
  <year>2005</year>
  <price>30.00</price>
</book>

<book>
  <title lang="en">Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>

<book>
  <title lang="en">XQuery Kick Start</title>
  <author>James McGovern</author>
  <author>Per Bothner</author>
  <author>Kurt Cagle</author>
  <author>James Linn</author>
  <author>Vaidyanathan Nagarajan</author>
  <year>2003</year>
  <price>49.99</price>
</book>

<book>
  <title lang="en">Learning XML</title>
  <author>Erik T. Ray</author>
  <year>2003</year>
  <price>39.95</price>
</book>

</bookstore>

My query will look like this:

for $x in collection-with-tag("hdfs://user/hduser/store","book")/book
where $x/year>2004
return $x/title

Take notice that I defined the path to the directory containing the file(s) and not the file, collection-with-tag expects the path to the directory. I also added the /book after the function. This is also needed, like collection and doc functions, for the query to be parsed correctly.