Set Up Hadoop Fully Distributed Cluster On AWS -- Advanced

0. Overview for the first Blog of Hadoop Cluster set up.

1). Hadoop-env.xml,mapred-env.xml,yarn-en.xml (change JAVA_HOME)

2). core-site.xml,yarn-site,hdfs-site,mapred-site

core-site (fs.defaultFS = hdfs://address:8020

Hadoop.tmp.dir (Create the folder by yourself)

3). hdfs-site.xml

dfs.replication(default = 3)

dfs.permissions.enables = false (for testing only)

dfs.namenode.secondary.http-address = address:50090

4). slaves include all the hostname (copy from hosts and remove the ip address)

1. Add JobHistory

1). Modify mapred-site.xml

<property>
  <name>mapreduce.jobhistory.address</name>
  <value>http://ec2-52-54-83-82.compute-1.amazonaws.com:10020</value>
</property>
<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>http://ec2-52-54-83-82.compute-1.amazonaws.com:19888</value>
</property>

2). Modify yarn-site.xml

<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>

(Below is optional)
<property>
  <name>yarn.log-aggregation.retain-seconds</name>
  <value>106800</value>
</property>

Distributed all configuration to every node.

Use “sbin/mr-jobhistory-daemon.sh start historyserver” to start the jobHistory Server.

2. Install ZooKeeper

1). Download ZooKeeper

2). Upload zookeeper-3.4.10.tar.gz to the machine.

scp -i "HadoopTest.pem" zookeeper-3.4.10.tar.gz ubuntu@ec2-xx-xxx-xxx-xxx.compute-1.amazonaws.com:~/ (HadoopTest.pem is the keyPair file download from aws when the instance is launched.)

3). Rename zoo.smaple.cfg to conf/zoo.cfg

mv conf/zoo.smaple.cfg conf/zoo.cfg

4). Modify zoo.cfg

dataDir=/home/ubuntu/zookeeper-3.4.10/data/zkdata (This depends on youreslf)
server.1=ec2-xx-xx-xx-xx.compute-1.amazonaws.com:2888:3888
server.2=ec2-xx-xx-xx-xx.compute-1.amazonaws.com:2888:3888
server.3=ec2-xx-xx-xx-xx.compute-1.amazonaws.com:2888:3888

5). Create zkdata folder

Based on the configuration of zoo.cfd, create folder under path /home/ubuntu/zookeeper-3.4.10/data/zkdata

6). Add myid files

Create "myid" file under zkdata folder. and then add the corresponding number such as 1,2,3 to the corresponding machine.

3. Implement HA

0). The import part of HA Set up a. share edits

b. NameNode(Active, StandBy)

c. Client (Proxy)

d. Fence, There is only one NameNode which provides service for others.

1). How to set up

This is reference by the Hadoop Offical Website.

2). Modify hdfs-site.xml

<property>
  <name>dfs.nameservices</name>
  <value>ns1</value>
</property>

<property>
  <name>dfs.ha.namenodes.ns1</name>
  <value>nn1,nn2</value>
</property>

<property>
  <name>dfs.namenode.rpc-address.ns1.nn1</name>
  <value>machine1.example.com:8020</value>
</property>

<property>
  <name>dfs.namenode.rpc-address.ns1.nn2</name>
  <value>machine2.example.com:8020</value>
</property>

##Namenode HTTP WEB ADDRESS
<property>
  <name>dfs.namenode.http-address.ns1.nn1</name>
  <value>machine1.example.com:50070</value>
</property>
<property>
  <name>dfs.namenode.http-address.ns1.nn2</name>
  <value>machine2.example.com:50070</value>
</property>

##shared edits ADDRESS
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://node1.example.com:8485;node2.example.com:8485;node3.example.com:8485/ns1</value>
</property>

###CLIENT PROXY
<property>
  <name>dfs.client.failover.proxy.provider.ns1</name>  
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>

<property>
  	  <name>dfs.ha.fencing.methods</name>
  <value>shell(/bin/true)</value>
</property>

<property>
  <name>dfs.journalnode.edits.dir</name>
  <value>/path/to/journal/node/local/data</value>
</property>

3). Modify core-site.xml

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://ns1</value>
</property>

Distributed configuration files to all machines.

4). How to trigger the set up

step1:For every journalnode, input below command.

sbin/hadoop-daemon.sh start journalnode

step2:In nn1, format the namenode, and then start

bin/hdfs namenode -format
sbin/hadoop-daemon.sh start namenode

step3:In machine nn2, Synchronize the meta data information

bin/hdfs namenode –bootstrapStandby

step4:Start machine nn2

sbin/hadoop-daemon.sh start namenode

step5:Start all DataNode

step6:Choose one machine to be the active machine (manually start up)

bin/hdfs haadmin -transitionToActive nn1

5). Automatic Failover

After Start up, every machine is in standby status, and then choose one be the active by election.

Monitor by using ZKFC failover.

Modify the hdfs-site.xml

<property>
   <name>dfs.ha.automatic-failover.enabled</name>
   <value>true</value>
</property>

Modify the core-site.xml

<property>
  <name>ha.zookeeper.quorum</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>

How to start the set up

sbin/stop-dfs.sh

under zookeeper
bin/zkServer.sh start

under hadoop
bin/hdfs zkfc –formatZK
sbin/start-dfs.sh