Hadoop Cluster Setup

Installing a Hadoop cluster typically involves unpacking the software on all the machines in the cluster or installing it via a packaging system as appropriate for your operating system. It is important to divide up the hardware into functions.
Typically one machine in the cluster is designated as the NameNode and another machine as the ResourceManager, exclusively. These are the masters. Other services (such as Web App Proxy Server and MapReduce Job History server) are usually run either on dedicated hardware or on shared infrastructure, depending upon the load.

The rest of the machines in the cluster act as both DataNode and NodeManager. These are the workers.
HDFS daemons are NameNode, SecondaryNameNode, and DataNode. YARN daemons are ResourceManager, NodeManager, and WebAppProxy. If MapReduce is to be used, then the MapReduce Job History Server will also be running. For large installations, these are generally running on separate hosts.
Administrators should use the etc/hadoop/hadoop-env.sh and optionally the etc/hadoop/mapred-env.sh and etc/hadoop/yarn-env.sh scripts to do site-specific customization of the Hadoop daemons’ process environment.
For example, to configure the NameNode to use ParallelGC and a 4GB Java heap, add the following statement to hadoop-env.sh:
  export HDFS_NAMENODE_OPTS="-XX:+UseParallelGC -Xmx4g"
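The same pattern extends to the other daemons. The sketch below assumes the Hadoop 3 variable names HDFS_DATANODE_OPTS and YARN_RESOURCEMANAGER_OPTS; the heap sizes are illustrative placeholders, not recommendations from this article.

```shell
# Hypothetical additions to etc/hadoop/hadoop-env.sh.
# Heap sizes are placeholders (assumptions); tune them for your workload.
export HDFS_NAMENODE_OPTS="-XX:+UseParallelGC -Xmx4g"
export HDFS_DATANODE_OPTS="-Xmx2g"
export YARN_RESOURCEMANAGER_OPTS="-Xmx2g"
```

Each variable only affects the daemon it is named after, so a large NameNode heap does not have to be paid for on every worker.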

Many Hadoop components are rack-aware and take advantage of the network topology for performance and safety. It is highly recommended to configure rack awareness before starting HDFS.

The default web UI ports of the main daemons are:
NameNode: default HTTP port is 9870.
ResourceManager: default HTTP port is 8088.
MapReduce JobHistory Server: default HTTP port is 19888.
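As a rough sketch, the default ports can be captured in a small helper that builds the web UI address for a daemon: 9870 for the NameNode, 8088 for the ResourceManager and 19888 for the MapReduce JobHistory Server. The function names, the daemon keys (namenode, resourcemanager, jobhistory) and the hostname in the usage note are hypothetical.

```shell
# Map a daemon key (hypothetical names) to its default web UI port.
default_http_port() {
  case "$1" in
    namenode)        echo 9870 ;;
    resourcemanager) echo 8088 ;;
    jobhistory)      echo 19888 ;;
    *) echo "unknown daemon: $1" >&2; return 1 ;;
  esac
}

# Build the web UI URL for a daemon running on a given host.
# $1: hostname, $2: daemon key
web_ui_url() {
  port=$(default_http_port "$2") || return 1
  echo "http://$1:$port/"
}
```

For example, curl -sf "$(web_ui_url nn1.example.com namenode)" would probe the NameNode UI on a host named nn1.example.com, assuming the defaults have not been overridden in the site configuration.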

Before you start installing Hadoop, prepare each host as described below.

Virtual Memory swappiness

vm.swappiness is a Linux kernel parameter that controls how aggressively memory pages are swapped to disk. It can be set to a value between 0 and 100; the higher the value, the more aggressive the kernel is in seeking out inactive memory pages and swapping them to disk.

On most systems, vm.swappiness is set to 60 by default. This is not suitable for Hadoop clusters because processes are sometimes swapped even when enough memory is available. This can cause lengthy garbage collection pauses for important system daemons, affecting stability and performance.

To view the current vm.swappiness value, use the command below:

cat /proc/sys/vm/swappiness

Cloudera recommends setting this value to 1, or at most 10.
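A minimal sketch of a check against that recommendation follows. The function name check_swappiness is hypothetical; the file argument exists only so the check can be tried on a copy instead of the live /proc entry.

```shell
# Compare a swappiness value with an upper bound.
# $1: file holding the value (default: /proc/sys/vm/swappiness)
# $2: maximum acceptable value (default: 10)
check_swappiness() {
  file=${1:-/proc/sys/vm/swappiness}
  max=${2:-10}
  value=$(cat "$file") || return 2
  if [ "$value" -le "$max" ]; then
    echo "OK: vm.swappiness=$value"
  else
    echo "WARN: vm.swappiness=$value (want <= $max)"
    return 1
  fi
}
```

Called with no arguments, it reads the running kernel value and returns non-zero when the host still needs tuning.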

The command below stores the vm.swappiness setting in /etc/sysctl.conf. On its own, it takes effect only after the host is restarted:

echo "vm.swappiness = 10" >> /etc/sysctl.conf

To apply the value immediately, without a reboot, use the commands below. The first sets the running value directly; the second reloads /etc/sysctl.conf so that the stored value takes effect:

sudo sysctl vm.swappiness=10
sudo sysctl -p /etc/sysctl.conf    # reloads the file; no restart required
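Appending with echo adds a duplicate line every time it is run. One possible way around that is sketched below: a hypothetical helper, set_sysctl_conf, that updates an existing entry in place and appends only when the key is missing. It assumes GNU sed (for -i and -E), as found on RHEL.

```shell
# Set or update "key = value" in a sysctl-style file without duplicating lines.
# $1: key, $2: value, $3: file (default: /etc/sysctl.conf)
# Note: dots in the key are treated as regex wildcards, which is loose but
# harmless for keys such as vm.swappiness.
set_sysctl_conf() {
  key=$1
  value=$2
  file=${3:-/etc/sysctl.conf}
  if grep -Eq "^[[:space:]]*$key[[:space:]]*=" "$file" 2>/dev/null; then
    # Key already present: rewrite the existing line in place.
    sed -i -E "s|^[[:space:]]*$key[[:space:]]*=.*|$key = $value|" "$file"
  else
    # Key missing: append a new line.
    echo "$key = $value" >> "$file"
  fi
}
```

Run as root, set_sysctl_conf vm.swappiness 10 followed by sysctl -p /etc/sysctl.conf would store the value and apply it in one pass.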

Disable THP

Transparent Huge Pages (THP) is a Linux memory management system that reduces the overhead of Translation Lookaside Buffer (TLB) lookups on machines with large amounts of memory by using larger memory pages.

Cloudera recommends disabling THP. Use the commands below to disable it:

echo never > /sys/kernel/mm/transparent_hugepage/defrag
echo never > /sys/kernel/mm/transparent_hugepage/enabled

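To keep the setting after a system reboot, add the same commands to the rc.local file. On RHEL 7, /etc/rc.d/rc.local must also be executable for these lines to run at boot: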

echo "echo never > /sys/kernel/mm/transparent_hugepage/defrag" >> /etc/rc.local
echo "echo never > /sys/kernel/mm/transparent_hugepage/enabled" >> /etc/rc.local

To verify, run the cat commands below and make sure that [never] is the bracketed (active) value:

cat /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/defrag
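The kernel marks the active mode with square brackets, for example "always madvise [never]". A small sketch that extracts the bracketed word and checks it is shown below; the function names thp_mode and thp_disabled are hypothetical.

```shell
# Print the active THP mode: the bracketed word in a line such as
# "always madvise [never]".
thp_mode() {
  sed -n 's/.*\[\([a-z+]*\)].*/\1/p' "$1"
}

# Succeed only if every file passed as an argument reports "never".
thp_disabled() {
  for f in "$@"; do
    if [ "$(thp_mode "$f")" != "never" ]; then
      echo "THP still active in $f" >&2
      return 1
    fi
  done
}
```

A typical call would be thp_disabled /sys/kernel/mm/transparent_hugepage/enabled /sys/kernel/mm/transparent_hugepage/defrag.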

Disabling the Firewall

To disable the firewall on RHEL 7, use the commands below:

systemctl stop firewalld
systemctl disable firewalld

Enable an NTP Service

CDH requires that you configure a Network Time Protocol (NTP) service to synchronize the clocks of the different hosts in the cluster.

First install ntp using the command below, then start and enable the service:

yum -y install ntp
systemctl start ntpd
systemctl enable ntpd

To configure NTP, add the time servers to its configuration file:

server 0.pool.ntp.org
server 1.pool.ntp.org
server 2.pool.ntp.org
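To confirm which time sources a configuration file actually declares, the server lines can be listed with a short helper. This is a sketch; the function name ntp_servers is hypothetical, and the path of the configuration file is passed in as an argument.

```shell
# List the time sources declared on "server" lines, ignoring comments.
# $1: path to the NTP configuration file
ntp_servers() {
  awk '$1 == "server" { print $2 }' "$1"
}
```

Once ntpd is running, ntpq -p shows the peers it is really using, which is a useful cross-check against this list.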

Synchronize the hardware clock to the system clock using the command below:

hwclock --systohc

NOTE: To prepare the Hadoop cluster, perform all of the above steps on each host. Two more steps are also required: disabling SELinux and setting up passwordless SSH and sudo. For details on both, refer to the Linux section.
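Because every step has to be repeated on each host, a small loop can help once passwordless SSH is in place. The sketch below is one possible approach, not part of any Hadoop tooling: run_on_hosts, the RUNNER variable and the hosts file are all hypothetical names.

```shell
# Run a command on every host listed in a file (one hostname per line).
# RUNNER defaults to "ssh -n"; set RUNNER=echo for a dry run that only
# prints what would be executed.
run_on_hosts() {
  hosts_file=$1
  shift
  while IFS= read -r host; do
    # Skip blank lines and comment lines.
    case "$host" in
      ''|'#'*) continue ;;
    esac
    # ssh -n keeps ssh from consuming the rest of the hosts file on stdin.
    ${RUNNER:-ssh -n} "$host" "$@"
  done < "$hosts_file"
}
```

For example, run_on_hosts hosts.txt cat /proc/sys/vm/swappiness would print the current swappiness of every host named in hosts.txt.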

