HDFS replication is expensive:
the default 3x replication scheme in HDFS has 200% overhead in storage space
and other resources (e.g., network bandwidth). However, for warm and cold
datasets with relatively low I/O activity, the additional block replicas are
rarely accessed during normal operations, yet they still consume the same
amount of resources as the first replica.
Therefore, a natural improvement
is to use Erasure Coding (EC) in place of replication, which provides the same
level of fault tolerance with much less storage space. In typical EC setups,
the storage overhead is no more than 50%. The replication factor of an EC file
is meaningless: it is always 1 and cannot be changed via the -setrep command.
In storage systems, the most
notable usage of EC is Redundant Array of Inexpensive Disks (RAID). RAID
implements EC through striping, which divides logically sequential data (such
as a file) into smaller units (such as bit, byte, or block) and stores
consecutive units on different disks.
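The striping idea described above can be sketched in a few lines of Python. This is only an illustration, not how any RAID controller or HDFS is implemented: the function names `stripe` and `unstripe`, the round-robin placement, and the 2-byte unit size are all assumptions chosen for readability.

```python
def stripe(data: bytes, num_disks: int, unit_size: int) -> list[list[bytes]]:
    """Split data into fixed-size units and deal them round-robin across disks."""
    disks = [[] for _ in range(num_disks)]
    for offset in range(0, len(data), unit_size):
        unit_index = offset // unit_size
        # Consecutive units land on consecutive disks.
        disks[unit_index % num_disks].append(data[offset:offset + unit_size])
    return disks


def unstripe(disks: list[list[bytes]]) -> bytes:
    """Reassemble the original bytes by reading the units back in round-robin order."""
    out = []
    longest = max(len(d) for d in disks)
    for row in range(longest):
        for d in disks:
            if row < len(d):
                out.append(d[row])
    return b"".join(out)


disks = stripe(b"ABCDEFGHIJ", num_disks=3, unit_size=2)
print(disks)  # [[b'AB', b'GH'], [b'CD', b'IJ'], [b'EF']]
print(unstripe(disks))  # b'ABCDEFGHIJ'
```

Because consecutive units sit on different disks, a sequential read can pull from all disks in parallel, which is where the performance benefit of striping comes from.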
Integrating EC with HDFS can
improve storage efficiency while still providing data durability similar to
that of traditional replication-based HDFS deployments. As an example, a 3x
replicated file with 6 blocks consumes 6*3 = 18 blocks of disk space. With an
EC (6 data, 3 parity) deployment, it consumes only 6+3 = 9 blocks of disk space.
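The arithmetic in this example can be checked with a short Python sketch. The helper names below are hypothetical, and the EC function assumes that every group of 6 data blocks gets its own 3 parity blocks, including a partially filled last group; that is a simplification for illustration, not a description of HDFS internals.

```python
def replicated_blocks(data_blocks: int, replication: int = 3) -> int:
    """Blocks on disk when every data block is stored `replication` times."""
    return data_blocks * replication


def erasure_coded_blocks(data_blocks: int, data_units: int = 6, parity_units: int = 3) -> int:
    """Blocks on disk under a (data_units, parity_units) scheme.

    Assumption: each group of `data_units` data blocks, even a partial one,
    gets a full set of `parity_units` parity blocks.
    """
    groups = -(-data_blocks // data_units)  # ceiling division
    return data_blocks + groups * parity_units


def overhead_percent(stored: int, data_blocks: int) -> float:
    """Extra storage beyond the raw data, as a percentage of the raw data."""
    return 100.0 * (stored - data_blocks) / data_blocks


rep = replicated_blocks(6)
ec = erasure_coded_blocks(6)
print(rep, overhead_percent(rep, 6))  # 18 200.0
print(ec, overhead_percent(ec, 6))    # 9 50.0
```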
Parity computations are used in
RAID drive arrays for fault tolerance by combining the data on two drives and
storing the result on a third. The parity is computed by XOR'ing a bit from
drive 1 with a bit from drive 2 and storing the result on drive 3. After a
failed drive is replaced, the RAID controller rebuilds the lost data from the
other two drives. RAID systems often have a "hot" spare drive ready and
waiting to replace a drive that fails.
An exclusive OR (XOR) is true if
exactly one of its two inputs is true, but not both.
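A minimal Python sketch of XOR parity and rebuild follows. The byte strings standing in for drive contents and the helper name `xor_bytes` are made up for illustration; a real controller works on whole stripes in hardware or in the kernel.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


# XOR truth table: the result is 1 only when exactly one input is 1.
print([(a, b, a ^ b) for a in (0, 1) for b in (0, 1)])
# [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]

drive1 = b"\x0f\xa5hello"
drive2 = b"\xf0\x5aworld"
parity = xor_bytes(drive1, drive2)  # what would be stored on drive 3

# Simulate losing drive 1: XOR the parity with the surviving drive.
rebuilt = xor_bytes(parity, drive2)
print(rebuilt == drive1)  # True
```

The rebuild works because XOR is its own inverse: (d1 XOR d2) XOR d2 equals d1, so any single missing block can be recovered from the other two.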
RAID is a disk or solid state drive (SSD) subsystem that increases performance or provides fault tolerance or both. In the past, RAID was also accomplished by software only but was much slower. In the late 1980s, the "I" in RAID stood for "inexpensive" but was later changed to "independent."
RAID 0 - Striping for Performance (Popular)
Widely used for gaming, striping
interleaves data across multiple drives for performance. However, there are no
safeguards against failure.
The more drives in a RAID 0 array,
the higher the probability of array failure.
RAID 1 - Mirroring for Fault Tolerance (Popular)
Widely used, RAID 1 writes the same data to two
drives at the same time. It provides the highest reliability but doubles the
number of drives needed.
RAID 10 combines RAID 1 mirroring with RAID 0 striping for both safety
and performance.
The more drives in a RAID 1 array,
the lower the probability of failure.
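The opposite trends for RAID 0 and RAID 1 can be illustrated with a simple probability model. This sketch assumes every drive fails independently with the same probability p over some period, which real drives do not strictly obey; the function names and the value p = 0.05 are hypothetical.

```python
def raid0_failure_probability(p: float, n: int) -> float:
    """A striped array is lost if ANY of its n drives fails."""
    return 1.0 - (1.0 - p) ** n


def raid1_failure_probability(p: float, n: int) -> float:
    """An n-way mirror is lost only if ALL n drives fail."""
    return p ** n


for n in (1, 2, 4):
    print(n, raid0_failure_probability(0.05, n), raid1_failure_probability(0.05, n))
```

Under this model the RAID 0 figure grows with every added drive while the RAID 1 figure shrinks, matching the two statements above.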
RAID 3 - Speed and Fault Tolerance
Data are striped across three or
more drives for performance, and parity is computed for safety. RAID 3 achieves
very high sequential data transfer rates because all drives operate in
parallel. It uses byte-level striping and stores the parity on a single
dedicated drive.
RAID 5 - Speed and Fault Tolerance (Popular)
Data are striped across three or
more drives for performance, and parity is computed for safety. RAID 5 is
similar to RAID 3, except that the parity is distributed to all drives.
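To make "parity is distributed to all drives" concrete, here is a small Python sketch that prints which drive holds the parity for each stripe. The rotation rule used (parity moves one drive to the left per stripe) is an assumed convention for illustration; actual RAID 5 implementations use several different layouts. The function name `raid5_layout` is hypothetical.

```python
def raid5_layout(num_drives: int, num_stripes: int) -> list[list[str]]:
    """Return one row per stripe, listing what each drive holds.

    Assumed convention: parity starts on the last drive and rotates one
    drive to the left with each stripe.
    """
    layout = []
    data_counter = 0
    for stripe in range(num_stripes):
        parity_drive = (num_drives - 1 - stripe) % num_drives
        row = []
        for drive in range(num_drives):
            if drive == parity_drive:
                row.append(f"P{stripe}")
            else:
                row.append(f"D{data_counter}")
                data_counter += 1
        layout.append(row)
    return layout


for row in raid5_layout(num_drives=3, num_stripes=3):
    print(row)
# ['D0', 'D1', 'P0']
# ['D2', 'P1', 'D3']
# ['P2', 'D4', 'D5']
```

In a RAID 3 style layout the parity column would stay on one drive for every stripe, so that drive is touched by every write; rotating the parity spreads that load across the array.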