Deploy Kylin on AWS EC2 without Hadoop
Compared with Kylin 3.x, Kylin 4.0 implements a new Spark build engine and Parquet storage, which makes it possible to deploy Kylin without a Hadoop environment. Compared with deploying Kylin 3.x on AWS EMR, deploying Kylin 4 directly on AWS EC2 instances has the following advantages:
- Cost saving. AWS EC2 nodes cost less than AWS EMR nodes.
- More flexible. On EC2 nodes, users are free to select and deploy only the services and components they need.
- No Hadoop dependency. The Hadoop ecosystem is heavy and carries a real maintenance cost; removing Hadoop brings Kylin closer to a cloud-native architecture.
After implementing support for building and querying in Spark Standalone mode, we tried deploying Kylin 4.0 without Hadoop on an AWS EC2 instance and successfully built a cube and ran queries against it.
Environment preparation
- Apply for AWS EC2 Linux instances as required
- Create Amazon RDS for MySQL as the Kylin and Hive metadata databases
- S3 as Kylin's storage
Component version information
The component versions listed here are the ones we selected during the test. If you need to deploy with other versions, you can substitute them yourself, making sure the versions are compatible with each other.
- JDK 1.8
- Hive 2.3.9
- Zookeeper 3.4.13
- Kylin 4.0 for spark3
- Spark 3.1.1
- Hadoop 3.2.0 (no startup required)
Deployment process
1 Configure environment variables
- Modify /etc/profile
vim /etc/profile
# Add the following at the end of the profile file
export JAVA_HOME=/usr/local/java/jdk1.8.0_291
export JRE_HOME=${JAVA_HOME}/jre
export HADOOP_HOME=/etc/hadoop/hadoop-3.2.0
export HIVE_HOME=/etc/hadoop/hive
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=$HIVE_HOME/bin:$HIVE_HOME/conf:${HADOOP_HOME}/bin:${JAVA_HOME}/bin:$PATH
# Execute after saving the contents of the above file
source /etc/profile
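As a quick, optional sanity check, you can confirm that the variables are visible in the current shell:
# Print a few of the exported variables; empty output means the profile was not sourced
echo $JAVA_HOME
echo $HADOOP_HOME
echo $HIVE_HOME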
2 Install JDK 1.8
- Download JDK 1.8 to the prepared EC2 instance and unzip it to the /usr/local/java directory:
mkdir /usr/local/java
tar -xvf java-1.8.0-openjdk.tar -C /usr/local/java
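If the unpacked directory matches the JAVA_HOME set in /etc/profile (adjust JAVA_HOME if the directory name differs), a version check should work as a quick verification:
# Print the Java version to confirm the JDK is usable
java -version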
3 Config Hadoop
- Download Hadoop and unzip it
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.0/hadoop-3.2.0.tar.gz
mkdir /etc/hadoop
tar -xvf hadoop-3.2.0.tar.gz -C /etc/hadoop
- Copy the jar packages required by S3 to the Hadoop class loading path, otherwise ClassNotFound errors may occur:
cd /etc/hadoop
cp hadoop-3.2.0/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.375.jar hadoop-3.2.0/share/hadoop/common/lib/
cp hadoop-3.2.0/share/hadoop/tools/lib/hadoop-aws-3.2.0.jar hadoop-3.2.0/share/hadoop/common/lib/
- Modify core-site.xml and configure the AWS account information and endpoint. The following is an example:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.s3a.access.key</name>
    <value>SESSION-ACCESS-KEY</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>SESSION-SECRET-KEY</value>
  </property>
  <property>
    <name>fs.s3a.endpoint</name>
    <value>s3.$REGION.amazonaws.com</value>
  </property>
</configuration>
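After saving core-site.xml, a simple way to confirm that Hadoop can reach S3 is to list your bucket through the s3a connector (the bucket name below is a placeholder; credential or endpoint problems will surface here):
# List the S3 bucket that Kylin will later use as its working directory
$HADOOP_HOME/bin/hadoop fs -ls s3a://bucket/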
4 Install Hive
- Download Hive and unzip it
wget https://downloads.apache.org/hive/hive-2.3.9/apache-hive-2.3.9-bin.tar.gz
tar -xvf apache-hive-2.3.9-bin.tar.gz -C /etc/hadoop
mv /etc/hadoop/apache-hive-2.3.9-bin /etc/hadoop/hive
- Configure environment variables
vim /etc/profile
# Add the following at the end of the profile file
export HIVE_HOME=/etc/hadoop/hive
export PATH=$PATH:$HIVE_HOME/bin:$HIVE_HOME/conf
# Execute after saving the contents of the above file
source /etc/profile
- Modify hive-site.xml:
vim ${HIVE_HOME}/conf/hive-site.xml
Please start the Amazon RDS for MySQL database in advance to obtain the MySQL connection URL, user name and password. Note: please configure the VPC and security group correctly so that the EC2 instance can access the database.
The sample content of hive-site.xml is as follows:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<configuration>
<!-- Hive Execution Parameters -->
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>password</value>
    <description>password to use against metastore database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://host-name:3306/hive?createDatabaseIfNotExist=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>admin</value>
    <description>Username to use against metastore database</description>
  </property>
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
    <description>
      Enforce metastore schema version consistency.
      True: Verify that version information stored in metastore matches with one from Hive jars. Also disable automatic
      schema migration attempt. Users are required to manually migrate schema after Hive upgrade which ensures
      proper metastore schema migration. (Default)
      False: Warn if the version information stored in metastore doesn't match with one from in Hive jars.
    </description>
  </property>
</configuration>
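Before initializing the metastore schema, you may want to confirm that the EC2 instance can actually reach the RDS instance. A minimal check, assuming the mysql client is installed and host-name is the RDS endpoint used in hive-site.xml:
# A connection failure here usually indicates a VPC or security group problem
mysql -h host-name -P 3306 -u admin -p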
- Hive metadata initialization
# Download the jar package of MySQL JDBC and place it in $HIVE_HOME/lib directory
cp mysql-connector-java-5.1.47.jar $HIVE_HOME/lib
$HIVE_HOME/bin/schematool -dbType mysql -initSchema
mkdir $HIVE_HOME/logs
nohup $HIVE_HOME/bin/hive --service metastore >> $HIVE_HOME/logs/hivemetastorelog.log 2>&1 &
Note: if the following error is reported in this step:
java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
it is caused by the guava version in Hive 2 being inconsistent with the guava version in Hadoop 3. Replace the guava jar in $HIVE_HOME/lib with the guava jar in $HADOOP_HOME/share/hadoop/common/lib/.
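Once the metastore service is running, a lightweight smoke test is to run a statement through the Hive CLI:
# Any statement that touches the metastore will do; errors here usually point back to the JDBC configuration
$HIVE_HOME/bin/hive -e "show databases;"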
- To prevent jar package conflicts in the subsequent steps, remove some Spark- and Scala-related jar packages from Hive's class loading path:
mkdir $HIVE_HOME/spark_jar
mv $HIVE_HOME/lib/spark-* $HIVE_HOME/spark_jar
mv $HIVE_HOME/lib/jackson-module-scala_2.11-2.6.5.jar $HIVE_HOME/spark_jar
Note: only the conflicting jars encountered during our test are listed here. If you run into similar jar conflicts, check the class loading path to determine which jars conflict and remove them. When the same jar exists in conflicting versions, it is recommended to keep the version found under the Spark class loading path.
5 Deploy Spark Standalone
- Download Spark 3.1.1 and unzip it
wget http://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
tar -xvf spark-3.1.1-bin-hadoop3.2.tgz -C /etc/hadoop
mv /etc/hadoop/spark-3.1.1-bin-hadoop3.2 /etc/hadoop/spark
export SPARK_HOME=/etc/hadoop/spark
- Copy the jar packages required by S3:
cp $HADOOP_HOME/share/hadoop/tools/lib/hadoop-aws-3.2.0.jar $SPARK_HOME/jars
cp $HADOOP_HOME/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.375.jar $SPARK_HOME/jars
cp mysql-connector-java-5.1.47.jar $SPARK_HOME/jars
- Copy hive-site.xml to the Spark conf directory:
cp $HIVE_HOME/conf/hive-site.xml $SPARK_HOME/conf
- Start the Spark master and worker (the start scripts are under sbin):
$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-worker.sh spark://hostname:7077
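To verify the standalone cluster, you can open the master web UI (port 8080 by default) or attach a spark-shell to the master and run a trivial job (hostname is the same placeholder as above):
# Attach an interactive shell to the standalone master
$SPARK_HOME/bin/spark-shell --master spark://hostname:7077
# Inside the shell, e.g.: sc.parallelize(1 to 100).count()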
6 Deploy Zookeeper
- Download Zookeeper and unzip it
wget http://archive.apache.org/dist/zookeeper/zookeeper-3.4.13/zookeeper-3.4.13.tar.gz
tar -xvf zookeeper-3.4.13.tar.gz -C /etc/hadoop
mv /etc/hadoop/zookeeper-3.4.13 /etc/hadoop/zookeeper
- Prepare the Zookeeper configuration files. Since only one EC2 node is used in the test, a Zookeeper pseudo cluster is deployed here.
cp /etc/hadoop/zookeeper/conf/zoo_sample.cfg /etc/hadoop/zookeeper/conf/zoo1.cfg
cp /etc/hadoop/zookeeper/conf/zoo_sample.cfg /etc/hadoop/zookeeper/conf/zoo2.cfg
cp /etc/hadoop/zookeeper/conf/zoo_sample.cfg /etc/hadoop/zookeeper/conf/zoo3.cfg
- Modify the above three configuration files in turn and add the following contents. Note that each file must point to its own dataDir and dataLogDir and use a different clientPort (for example zk1/zk2/zk3 and ports 2181/2182/2183):
server.1=localhost:2287:3387
server.2=localhost:2288:3388
server.3=localhost:2289:3389
dataDir=/tmp/zookeeper/zk1/data
dataLogDir=/tmp/zookeeper/zk1/log
clientPort=2181
- Create the required folders and files:
mkdir -p /tmp/zookeeper/zk1/data
mkdir -p /tmp/zookeeper/zk1/log
mkdir -p /tmp/zookeeper/zk2/data
mkdir -p /tmp/zookeeper/zk2/log
mkdir -p /tmp/zookeeper/zk3/data
mkdir -p /tmp/zookeeper/zk3/log
# Write the server id (1, 2 and 3 respectively) into the three myid files
vim /tmp/zookeeper/zk1/data/myid
vim /tmp/zookeeper/zk2/data/myid
vim /tmp/zookeeper/zk3/data/myid
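If you prefer a non-interactive way to write the server ids, echo works as well (assuming the same /tmp/zookeeper layout as above):
# Each myid file contains only the numeric id of its instance
echo 1 > /tmp/zookeeper/zk1/data/myid
echo 2 > /tmp/zookeeper/zk2/data/myid
echo 3 > /tmp/zookeeper/zk3/data/myid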
- Start the Zookeeper cluster
/etc/hadoop/zookeeper/bin/zkServer.sh start /etc/hadoop/zookeeper/conf/zoo1.cfg
/etc/hadoop/zookeeper/bin/zkServer.sh start /etc/hadoop/zookeeper/conf/zoo2.cfg
/etc/hadoop/zookeeper/bin/zkServer.sh start /etc/hadoop/zookeeper/conf/zoo3.cfg
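You can confirm that all three instances are up and have elected a leader by checking their status; each command should report either "leader" or "follower":
/etc/hadoop/zookeeper/bin/zkServer.sh status /etc/hadoop/zookeeper/conf/zoo1.cfg
/etc/hadoop/zookeeper/bin/zkServer.sh status /etc/hadoop/zookeeper/conf/zoo2.cfg
/etc/hadoop/zookeeper/bin/zkServer.sh status /etc/hadoop/zookeeper/conf/zoo3.cfg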
7 Setup Kylin
- Download the Kylin 4.0 binary package and unzip it
wget https://mirror-hk.koddos.net/apache/kylin/apache-kylin-4.0.0/apache-kylin-4.0.0-bin.tar.gz
tar -xvf apache-kylin-4.0.0-bin.tar.gz -C /etc/hadoop
export KYLIN_HOME=/etc/hadoop/apache-kylin-4.0.0-bin
mkdir $KYLIN_HOME/ext
cp mysql-connector-java-5.1.47.jar $KYLIN_HOME/ext
- Modify kylin.properties
vim $KYLIN_HOME/conf/kylin.properties
kylin.metadata.url=kylin_metadata@jdbc,url=jdbc:mysql://hostname:3306/kylin,username=root,password=password,maxActive=10,maxIdle=10
kylin.env.zookeeper-connect-string=hostname
kylin.engine.spark-conf.spark.master=spark://hostname:7077
kylin.engine.spark-conf.spark.submit.deployMode=client
kylin.env.hdfs-working-dir=s3://bucket/kylin
kylin.engine.spark-conf.spark.eventLog.dir=s3://bucket/kylin/spark-history
kylin.engine.spark-conf.spark.history.fs.logDirectory=s3://bucket/kylin/spark-history
kylin.engine.spark-conf.spark.yarn.jars=s3://bucket/spark2_jars/*
kylin.query.spark-conf.spark.master=spark://hostname:7077
kylin.query.spark-conf.spark.yarn.jars=s3://bucket/spark2_jars/*
- Execute
$KYLIN_HOME/bin/kylin.sh start
- Kylin may encounter ClassNotFound errors during startup. Refer to the following methods and then restart Kylin:
# Download commons-collections-3.2.2.jar
cp commons-collections-3.2.2.jar $KYLIN_HOME/tomcat/webapps/kylin/WEB-INF/lib/
# Download commons-configuration-1.3.jar
cp commons-configuration-1.3.jar $KYLIN_HOME/tomcat/webapps/kylin/WEB-INF/lib/
cp $HADOOP_HOME/share/hadoop/common/lib/aws-java-sdk-bundle-1.11.375.jar $KYLIN_HOME/tomcat/webapps/kylin/WEB-INF/lib/
cp $HADOOP_HOME/share/hadoop/common/lib/hadoop-aws-3.2.0.jar $KYLIN_HOME/tomcat/webapps/kylin/WEB-INF/lib/
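After a successful start, you can follow the Kylin log and open the web UI (Kylin listens on port 7070 by default; the default account is ADMIN with password KYLIN):
# Watch the startup log for remaining ClassNotFound or S3 errors
tail -f $KYLIN_HOME/logs/kylin.log
# Then visit http://<ec2-public-ip>:7070/kylin in a browser and log in with ADMIN / KYLIN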