Microsoft Azure Cloud | Big Data

Showing posts with the label hadoop

Configure Proxy for HiveServer2 and Impala

September 05, 2020

Here we will use HAProxy, an open-source HA load balancer and proxy server for TCP- and HTTP-based applications. Nginx is not recommended, as we do not have web server traffic to load balance. Let’s first install HAProxy on the proxy server. yum -y inst…

Real-Time Data Stream Processing In Azure Part: 1

July 11, 2020

There are multiple ways to do real-time analytics on Azure, depending on the source type and the analytics we want to perform. This article introduces the major Azure services involved in a real-time data solution and at the end will compare all to kno…

Lambda vs Delta Architecture - Realtime Analytics on Delta Lake

June 30, 2020

Before I go into the details of Delta Architecture, let's recap Lambda Architecture first; then you will be able to appreciate the beauty of Delta Architecture. Lambda Architecture is a popular technique where records are processed by a batch system and streaming system…

Capture a Network Trace

May 15, 2020

If you need to capture a network trace/TCP dump of a client or server, here are some simple ways I usually do this: Capture a Fiddler trace 1) Install Fiddler from http://www.telerik.com/download/fiddler/fiddler4 if not already done 2) Launch fid…

Hadoop Cluster Setup

March 07, 2020

Installing a Hadoop cluster typically involves unpacking the software on all the machines in the cluster or installing it via a packaging system as appropriate for your operating system. It is important to divide up the hardware into functions. Typically one machin…

Setting up SELinux mode

February 15, 2020

Security-Enhanced Linux (SELinux) is a Linux kernel security module that provides a mechanism for supporting access control security policies, including mandatory access controls (MAC). Without SELinux enabled, only traditional discretionary access control (DAC) …

Benchmark Hadoop

January 25, 2020

When we install Hadoop, we get a few JARs for testing the installation and for benchmarking. In the Cloudera distribution: /opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar /opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.…

HDFS ACL

February 23, 2019

Before you start using ACLs, make sure they are enabled. If you are using the Cloudera distribution, use the property below in the HDFS configuration: Alternatively, you can find it in hdfs-site.xml

Apache Sentry Basics and Setup

June 16, 2018

The RPC server stores authorization data in a relational database. Apache Sentry is a granular, role-based authorization module for Hadoop. Sentry provides the ability to control and enforce precise levels of privileges on data for authenticated users and applications on a Hadoo…

Kerberos and HDFS Encryption

April 21, 2018

User security contains three parts: authentication, authorization, and audit. Authentication simply means verifying who the user claims to be. There are three factors of authentication: who you are, what you know, and what you have. You may have heard the term two-facto…

SSH Key Authentication for Hadoop

July 20, 2017

Hadoop cluster setup requires SSH key-based authentication between master and slave nodes. With SSH key-based authentication, the master node can connect to slave nodes or secondary nodes to start/stop the daemons/processes without a password.

HDFS Erasure Coding and RAID basics

July 15, 2017

HDFS Replication is expensive – the default 3x replication scheme in HDFS has 200% overhead in storage space and other resources (e.g., network bandwidth). However, for warm and cold datasets with relatively low I/O activities, additional block replicas are rarely acc…
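The 200% figure above follows from simple arithmetic: every block is stored once plus two extra copies. A minimal Python sketch of that calculation is shown here. The helper name storage_overhead_pct is hypothetical, and the Reed-Solomon layout of 6 data units and 3 parity units is an assumption added only for comparison; it is not taken from this excerpt.

```python
def storage_overhead_pct(data_units: int, redundant_units: int) -> float:
    """Return extra storage as a percentage of the original data size."""
    return 100.0 * redundant_units / data_units


# 3x replication: one original block plus two additional replicas.
replication_3x = storage_overhead_pct(data_units=1, redundant_units=2)

# Assumed erasure-coding layout for comparison: 6 data units + 3 parity units.
rs_6_3 = storage_overhead_pct(data_units=6, redundant_units=3)

print(f"3x replication overhead: {replication_3x:.0f}%")
print(f"RS(6,3) overhead: {rs_6_3:.0f}%")
```

Under these assumptions the sketch reports 200% for 3x replication and 50% for the Reed-Solomon layout, which is why erasure coding is attractive for rarely accessed data.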

Hadoop Architecture, HA, Failover

May 26, 2017

MRv1 daemons: Namenode, Secondary namenode, Jobtracker, Datanode, Tasktracker. The jobtracker daemon had these two parts tightly coupled within itself and was responsible for managing the tasks and all its related operations by interacting with th…
