The hadoop-azure-datalake module provides support for integration with Azure Data Lake Store. The jar file is named azure-datalake-store.jar.
Some file system operations have only partial or no support.
The Azure Data Lake Storage access path syntax is:

    adl://<Account Name>.azuredatalakestore.net/
To get started with an Azure Data Lake account, see https://azure.microsoft.com/en-in/documentation/articles/data-lake-store-get-started-portal/
Use of Azure Data Lake Storage requires an OAuth2 bearer token to be present in the HTTPS header, per the OAuth2 specification. A valid OAuth2 bearer token must be obtained from Azure Active Directory for users who have access to the Azure Data Lake Storage account.
Azure Active Directory (Azure AD) is Microsoft's multi-tenant, cloud-based directory and identity management service. See https://azure.microsoft.com/en-in/documentation/articles/active-directory-whatis/
The following sections describe the OAuth2 configuration in core-site.xml.
Credentials can be configured using either a refresh token (associated with a user) or a client credential (analogous to a service principal).
Add the following properties to your core-site.xml
    <property>
      <name>dfs.adls.oauth2.access.token.provider.type</name>
      <value>RefreshToken</value>
    </property>
The application must set the client ID and the OAuth2 refresh token from Azure Active Directory associated with that client ID. See https://github.com/AzureAD/azure-activedirectory-library-for-java.
Do not share the client ID or the refresh token; they must be kept secret.
    <property>
      <name>dfs.adls.oauth2.client.id</name>
      <value></value>
    </property>

    <property>
      <name>dfs.adls.oauth2.refresh.token</name>
      <value></value>
    </property>
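For illustration only, a fully populated refresh-token configuration might look like the following; the client ID GUID and the token string are placeholders, not working credentials:

```xml
<property>
  <name>dfs.adls.oauth2.access.token.provider.type</name>
  <value>RefreshToken</value>
</property>
<property>
  <!-- Placeholder client ID: substitute the application (client) ID
       registered in your Azure Active Directory tenant. -->
  <name>dfs.adls.oauth2.client.id</name>
  <value>11111111-2222-3333-4444-555555555555</value>
</property>
<property>
  <!-- Placeholder token: substitute the refresh token issued by
       Azure Active Directory for that client ID. -->
  <name>dfs.adls.oauth2.refresh.token</name>
  <value>PLACEHOLDER-REFRESH-TOKEN</value>
</property>
```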
Add the following properties to your core-site.xml
    <property>
      <name>dfs.adls.oauth2.refresh.url</name>
      <value>TOKEN ENDPOINT FROM STEP 7 ABOVE</value>
    </property>

    <property>
      <name>dfs.adls.oauth2.client.id</name>
      <value>CLIENT ID FROM STEP 7 ABOVE</value>
    </property>

    <property>
      <name>dfs.adls.oauth2.credential</name>
      <value>PASSWORD FROM STEP 7 ABOVE</value>
    </property>
In many Hadoop clusters, the core-site.xml file is world-readable. To protect these credentials from prying eyes, it is recommended that you use the credential provider framework to securely store them and access them through configuration.
All ADLS credential properties can be protected by credential providers. For additional reading on the credential provider API, see Credential Provider API.
    % hadoop credential create dfs.adls.oauth2.refresh.token -value 123 \
        -provider localjceks://file/home/foo/adls.jceks

    % hadoop credential create dfs.adls.oauth2.credential -value 123 \
        -provider localjceks://file/home/foo/adls.jceks
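To confirm that the aliases were stored, the same CLI's list subcommand can be run against the provider; it prints alias names only, never the secret values:

```shell
# List the credential aliases stored in the local JCEKS keystore.
hadoop credential list -provider localjceks://file/home/foo/adls.jceks
```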
    <property>
      <name>hadoop.security.credential.provider.path</name>
      <value>localjceks://file/home/foo/adls.jceks</value>
      <description>Path to interrogate for protected credentials.</description>
    </property>
    % hadoop distcp \
        [-D hadoop.security.credential.provider.path=localjceks://file/home/user/adls.jceks] \
        hdfs://<NameNode Hostname>:9001/user/foo/007020615 \
        adl://<Account Name>.azuredatalakestore.net/testDir/
NOTE: You may optionally add the provider path property on the distcp command line instead of adding job-specific configuration to a generic core-site.xml. The square brackets above illustrate this capability.
For the ADL FileSystem to take effect, update core-site.xml with:
    <property>
      <name>fs.adl.impl</name>
      <value>org.apache.hadoop.fs.adl.AdlFileSystem</value>
    </property>

    <property>
      <name>fs.AbstractFileSystem.adl.impl</name>
      <value>org.apache.hadoop.fs.adl.Adl</value>
    </property>
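With the implementation classes and credentials configured, a quick sanity check is to list the root of the store; the account name below is a placeholder:

```shell
# Lists the root directory of the store if the OAuth2 configuration is valid.
hadoop fs -ls adl://<Account Name>.azuredatalakestore.net/
```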
After credentials are configured in core-site.xml, any Hadoop component may reference files in that Azure Data Lake Storage account by using URLs of the following format:
adl://<Account Name>.azuredatalakestore.net/<path>
The adl scheme identifies a URL on a file system backed by Azure Data Lake Storage. adl uses encrypted HTTPS access for all interaction with the Azure Data Lake Storage API.
For example, the following FileSystem Shell commands demonstrate access to a storage account named youraccount.
    > hadoop fs -mkdir adl://youraccount.azuredatalakestore.net/testDir

    > hadoop fs -put testFile adl://youraccount.azuredatalakestore.net/testDir/testFile

    > hadoop fs -cat adl://youraccount.azuredatalakestore.net/testDir/testFile
    test file content
The hadoop-azure-datalake module provides support for configuring how User/Group information is represented during getFileStatus/listStatus/getAclStatus.
Add the following properties to your core-site.xml
    <property>
      <name>adl.feature.ownerandgroup.enableupn</name>
      <value>true</value>
      <description>
        When true : User and Group in FileStatus/AclStatus response is
        represented as user friendly name as per Azure AD profile.

        When false (default) : User and Group in FileStatus/AclStatus
        response is represented by the unique identifier from Azure AD
        profile (Object ID as GUID).

        For optimal performance, the default value of false is recommended.
      </description>
    </property>
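The effect of this switch is easiest to see in a shell listing; the command below is illustrative, with the account name as a placeholder. With the default (false), the owner and group columns show Azure AD object IDs (GUIDs); with the property set to true, they show the corresponding user-friendly Azure AD names, at some extra lookup cost per request.

```shell
# Owner/group columns in this listing reflect the
# adl.feature.ownerandgroup.enableupn setting.
hadoop fs -ls adl://<Account Name>.azuredatalakestore.net/testDir
```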
The hadoop-azure-datalake module includes a full suite of unit tests. Most of the tests will run without additional configuration by running mvn test. This includes tests against mocked storage, which is an in-memory emulation of Azure Data Lake Storage.
A selection of tests can run against live Azure Data Lake Storage. To run these tests, create src/test/resources/auth-keys.xml with the ADL account information described in the sections above, plus the following properties.
    <property>
      <name>dfs.adl.test.contract.enable</name>
      <value>true</value>
    </property>

    <property>
      <name>test.fs.adl.name</name>
      <value>adl://youraccount.azuredatalakestore.net</value>
    </property>