Introduction to Hadoop

Hadoop is an open-source software framework for distributed storage and distributed processing of large data sets on clusters of commodity computers.

Let us look at the parameters on which Hadoop is taking over the world of data storage and processing; a short MapReduce sketch follows the table.

Parameter    | Database                                               | Hadoop
-------------|--------------------------------------------------------|-------------------------------------------------------------
Volume       | Limited data storage                                   | Practically unlimited data storage (scales out across nodes)
Variety      | Structured data only                                   | Structured + unstructured data
Velocity     | Low-speed data ingestion                               | High-speed data ingestion
Architecture | Shared-disk architecture                               | Shared-nothing architecture
Rule         | Satisfies RDBMS rules + ACID properties                | Built around the CAP theorem (Consistency, Availability, Partition tolerance)
Processing   | Serial processing                                      | Massively parallel processing
Schema       | Predefined schema (schema on write)                    | Schema-free approach (schema on read)
Cost         | Software is costly; integration services are licensed | Software is free; integration services are free
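
To make the "Processing" and "Schema" rows above concrete, here is a minimal sketch of the classic Hadoop MapReduce word-count job in Java. It assumes a standard Hadoop client library on the classpath; the input and output paths are supplied on the command line, and the class names used here are only illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every word in its split of the input.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts emitted for each word by all the mappers.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Each mapper works on one split of the input in parallel, the framework groups the (word, 1) pairs by key, and the reducers add up the counts. Nothing about the input's structure has to be declared up front, which is what the schema-free row refers to.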

Origins of Hadoop:
Doug Cutting and Mike Cafarella set out to improve Nutch (an Apache web-crawler project). What they needed, as the foundation of the system, was a distributed storage layer that satisfied the following requirements (a minimal usage sketch follows the list):
Schemaless: no predefined structure, i.e. no rigid schema with tables and columns (or column types and sizes).
Durable: once data is written, it should never be lost.
Fault tolerant: capable of handling component failure (e.g. CPU, disk, memory, network, power supply, motherboard) without human intervention.
Automatically rebalanced: evens out disk-space consumption throughout the cluster.
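
This is not the original Nutch code, but a minimal sketch of how the storage layer that grew out of these requirements (HDFS) is used from Java today. The NameNode URI hdfs://namenode:9000 and the file path are placeholder values; in a real cluster the address comes from core-site.xml.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Normally picked up from core-site.xml; set here only for illustration.
    conf.set("fs.defaultFS", "hdfs://namenode:9000");
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/demo/hello.txt");

    // Write: the client streams raw bytes (no schema); HDFS splits them into
    // blocks and replicates each block across DataNodes, so a single disk or
    // node failure does not lose data.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("hello, hadoop".getBytes(StandardCharsets.UTF_8));
    }

    // Read the file back.
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      System.out.println(reader.readLine());
    }

    fs.close();
  }
}

The client never deals with individual disks or nodes; replication, failure handling, and rebalancing are the filesystem's job, which is exactly the division of labour the requirements above describe.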

Work on what became Hadoop started in 2002 as part of the Nutch project, moved with Doug Cutting to Yahoo!, and then to the Apache Software Foundation. The Hadoop 1 stable line (Hadoop MapReduce) was released in April 2009, and Hadoop 2 (Hadoop YARN) was released in May 2013.
