Introduction to Hadoop

Hadoop is an open-source software framework for distributed storage and distributed processing of large data sets on clusters of commodity computers.

Let us look at the parameters on which Hadoop is taking over the world of data storage and processing; a short MapReduce sketch follows the table.

Parameter    | Database                                               | Hadoop
-------------|--------------------------------------------------------|-------------------------------------------------------------
Volume       | Limited data storage                                   | Practically unlimited data storage (scales out across nodes)
Variety      | Structured data only                                   | Structured + unstructured data
Velocity     | Low-speed data ingestion                               | High-speed data ingestion
Architecture | Shared-disk architecture                               | Shared-nothing architecture
Rule         | Satisfies RDBMS rules + ACID properties                | Built around the CAP theorem (Consistency, Availability, Partition tolerance)
Processing   | Serial processing                                      | Massively parallel processing
Schema       | Predefined schema (schema on write)                    | Schema-free approach (schema on read)
Cost         | Software is costly; integration services are licensed | Software is free; integration services are free
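
To make the "Processing" and "Schema" rows above concrete, here is a minimal sketch of the classic Hadoop MapReduce word-count job in Java. It assumes a standard Hadoop client library on the classpath; the input and output paths are supplied on the command line, and the class names used here are only illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every word in its split of the input.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts emitted for each word by all the mappers.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Each mapper works on one split of the input in parallel, the framework groups the (word, 1) pairs by key, and the reducers add up the counts. Nothing about the input's structure has to be declared up front, which is what the schema-free row refers to.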

Origins of Hadoop:
Doug Cutting and Mike Cafarella set out to improve Nutch (an Apache web-crawler project). What they needed, as the foundation of the system, was a distributed storage layer that satisfied the following requirements (a minimal usage sketch follows the list):
Schemaless: no predefined structure, i.e. no rigid schema with tables and columns (or column types and sizes).
Durable: once data is written, it should never be lost.
Fault tolerant: capable of handling component failure (e.g. CPU, disk, memory, network, power supply, motherboard) without human intervention.
Automatically rebalanced: evens out disk-space consumption throughout the cluster.
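
This is not the original Nutch code, but a minimal sketch of how the storage layer that grew out of these requirements (HDFS) is used from Java today. The NameNode URI hdfs://namenode:9000 and the file path are placeholder values; in a real cluster the address comes from core-site.xml.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Normally picked up from core-site.xml; set here only for illustration.
    conf.set("fs.defaultFS", "hdfs://namenode:9000");
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/demo/hello.txt");

    // Write: the client streams raw bytes (no schema); HDFS splits them into
    // blocks and replicates each block across DataNodes, so a single disk or
    // node failure does not lose data.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("hello, hadoop".getBytes(StandardCharsets.UTF_8));
    }

    // Read the file back.
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      System.out.println(reader.readLine());
    }

    fs.close();
  }
}

The client never deals with individual disks or nodes; replication, failure handling, and rebalancing are the filesystem's job, which is exactly the division of labour the requirements above describe.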

Work on what became Hadoop started in 2002 as part of the Nutch project, moved with Doug Cutting to Yahoo!, and then to the Apache Software Foundation. The Hadoop 1 stable line (Hadoop MapReduce) was released in April 2009, and Hadoop 2 (Hadoop YARN) was released in May 2013.
