Installing a Hadoop
cluster typically involves unpacking the software on all the machines in the
cluster or installing it via a packaging system as appropriate for your
operating system. It is important to divide up the hardware into functions.
Typically one machine in
the cluster is designated as the NameNode and another machine as the
ResourceManager, exclusively. These are the masters. Other services (such as
Web App Proxy Server and MapReduce Job History server) are usually run either
on dedicated hardware or on shared infrastructure, depending upon the load.
The rest of the machines
in the cluster act as both DataNode and NodeManager. These are the workers.
HDFS daemons are NameNode,
SecondaryNameNode, and DataNode. YARN daemons are ResourceManager, NodeManager,
and WebAppProxy. If MapReduce is to be used, then the MapReduce Job History
Server will also be running. For large installations, these are generally
running on separate hosts.
Administrators should use
the etc/hadoop/hadoop-env.sh and optionally the etc/hadoop/mapred-env.sh and etc/hadoop/yarn-env.sh scripts
to do site-specific customization of the Hadoop daemons’ process environment.
For example, to configure the NameNode to use the parallel garbage collector and a 4 GB Java heap, add the following line to hadoop-env.sh:
export HDFS_NAMENODE_OPTS="-XX:+UseParallelGC -Xmx4g"
Many Hadoop components are rack-aware and take advantage of the network topology for performance and safety. It is highly recommended to configure rack awareness prior to starting HDFS.
#              +-----------+
#              |core router|
#              +-----------+
#              /           \
#   +-----------+         +-----------+
#   |rack switch|         |rack switch|
#   +-----------+         +-----------+
#   | data node |         | data node |
#   +-----------+         +-----------+
#   | data node |         | data node |
#   +-----------+         +-----------+
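Rack awareness is typically enabled by pointing the net.topology.script.file.name property in core-site.xml at a topology script that maps each host to a rack path. The snippet below is a minimal sketch only; the script path, IP ranges, and rack names are assumptions to be replaced with your own data-center layout.

In core-site.xml:
<property>
  <name>net.topology.script.file.name</name>
  <value>/etc/hadoop/conf/topology.sh</value>
</property>

/etc/hadoop/conf/topology.sh (assumed path):
#!/bin/bash
# Hadoop passes one or more host names or IP addresses as arguments;
# print one rack path per argument, falling back to /default-rack.
for node in "$@"; do
  case "$node" in
    10.0.1.*) echo "/rack1" ;;
    10.0.2.*) echo "/rack2" ;;
    *)        echo "/default-rack" ;;
  esac
done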
Daemon                      | Notes
NameNode                    | Default HTTP port is 9870.
ResourceManager             | Default HTTP port is 8088.
MapReduce JobHistory Server | Default HTTP port is 19888.
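Once the cluster is up, these web UIs can be spot-checked from any host; the hostnames below are placeholders for your own master nodes:
curl -s http://namenode-host:9870/ | head
curl -s http://resourcemanager-host:8088/ | head
curl -s http://historyserver-host:19888/ | head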
Before you start installing Hadoop, the following OS-level setup needs to be prepared on every host.
Set vm.swappiness
vm.swappiness is a Linux kernel parameter that controls how aggressively memory pages are swapped to disk. It can be set to a value between 0 and 100; the higher the value, the more aggressively the kernel seeks out inactive memory pages and swaps them to disk.
On most systems, vm.swappiness is set to 60 by
default. This is not suitable for Hadoop clusters because processes are
sometimes swapped even when enough memory is available. This can cause lengthy
garbage collection pauses for important system daemons, affecting stability and
performance.
To view the current vm.swappiness value, run:
cat /proc/sys/vm/swappiness
Cloudera recommends setting this value to 1 (or at most 10).
The following command makes the setting persistent in /etc/sysctl.conf, but it only takes effect after the host is restarted (or after the file is reloaded with sysctl -p):
echo "vm.swappiness = 10" >> /etc/sysctl.conf
To apply the new value at runtime without a reboot, set it directly with sysctl (here matching the persistent value of 10); the change takes effect immediately:
sudo sysctl vm.swappiness=10
Alternatively, reload the settings from /etc/sysctl.conf without a restart:
sudo sysctl -p /etc/sysctl.conf
Disable THP
Transparent Huge Pages (THP) is a Linux memory management system
that reduces the overhead of Translation Lookaside Buffer (TLB) lookups on
machines with large amounts of memory by using larger memory pages.
Cloudera recommends disabling THP. You can use the following commands to disable it:
echo never > /sys/kernel/mm/transparent_hugepage/defrag
echo never > /sys/kernel/mm/transparent_hugepage/enabled
To make the setting persist across reboots, append the commands to /etc/rc.local:
echo "echo never > /sys/kernel/mm/transparent_hugepage/defrag" >> /etc/rc.local
echo "echo never > /sys/kernel/mm/transparent_hugepage/enabled" >> /etc/rc.local
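On RHEL 7 the rc.local mechanism runs at boot only when the file is executable, so the following step is typically also needed:
chmod +x /etc/rc.local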
To verify, use the following cat commands and make sure [never] is the selected value:
cat /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/defrag
Disabling the Firewall
To disable the firewall on RHEL 7, run:
systemctl stop firewalld
systemctl disable firewalld
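To confirm that the service is no longer running, an optional check is:
systemctl is-active firewalld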
Enable an NTP Service
CDH requires that you configure a Network Time Protocol (NTP) service to synchronize the clocks of the different hosts in the cluster.
First install ntp using the command below, then start and enable the service:
yum -y install ntp
systemctl start ntpd
systemctl enable ntpd
To configure NTP, edit /etc/ntp.conf and add server entries such as:
server 0.pool.ntp.org
server 1.pool.ntp.org
server 2.pool.ntp.org
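After editing the file, restart the service so the new server entries take effect:
systemctl restart ntpd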
Synchronize the hardware clock to the system clock with the following command:
hwclock --systohc
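To verify that the host is synchronizing against the configured servers, an optional check is:
ntpq -p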
NOTE: To prepare the Hadoop cluster, perform all of the above steps on each host. In addition, two more steps need to be performed: disabling SELinux and setting up passwordless SSH/sudo authentication. For details, please refer to the Linux section over here for SELinux and over here for passwordless SSH/sudo.
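As a quick reference only (a minimal sketch; the linked posts cover the full procedure), SELinux can be checked and relaxed as follows:
getenforce          # show the current mode
setenforce 0        # switch to permissive mode until reboot
For a permanent change, set SELINUX=disabled in /etc/selinux/config and reboot.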