Delta Lake in Apache Spark - Basics

Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads.
Delta Lake offers:
  • ACID Transactions
  • Scalable Metadata Handling
  • Time Travel (data versioning)
  • Open Format
  • Unified Batch and Streaming Source & Sink
  • Schema Enforcement
  • Schema Evolution
  • Updates and Deletes
Start Spark Shell with Delta
Just add the io.delta:delta-core_2.12:0.1.0 package and you are ready to use Delta:
spark-shell --packages io.delta:delta-core_2.12:0.1.0
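If you are building a standalone application instead of using the shell, the same artifact can be declared as a library dependency. A minimal build.sbt sketch (the Spark version here is an assumption; pick one compatible with your Delta release):

// build.sbt: minimal sketch, not a complete build definition
// Spark version is an assumption; use one compatible with your Delta release
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.2" % "provided"
libraryDependencies += "io.delta" %% "delta-core" % "0.1.0"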

Save as Delta format
To save data in Delta format, you just have to specify format("delta") when writing your DataFrame:
dataframe.write.format("delta").save("/path/to/delta")
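For example, a small end-to-end write might look like this (a sketch assuming Spark was started with the Delta package on the classpath; the sample columns are made up for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("delta-write-example")
  .getOrCreate()

import spark.implicits._

// Hypothetical sample data, just for illustration
val events = Seq(
  ("2019-01-01", "e1", "click", "payload-1"),
  ("2019-01-02", "e2", "view", "payload-2")
).toDF("date", "eventId", "eventType", "data")

// mode("overwrite") replaces any existing data at the path
events.write
  .format("delta")
  .mode("overwrite")
  .save("/path/to/delta")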

Read Delta format
To read a Delta table, you just have to specify format("delta") when loading:
spark.read.format("delta").load("/path/to/delta")
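The result is an ordinary DataFrame, so the usual operations apply (assuming the hypothetical eventType column from the write example above):

// Assumes an existing SparkSession named spark
val df = spark.read.format("delta").load("/path/to/delta")
df.printSchema()
df.filter("eventType = 'click'").show()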

Create Delta Table
To create a Delta table:
CREATE TABLE table_name (
  date DATE,
  eventId STRING,
  eventType STRING,
  data STRING)
USING DELTA
PARTITIONED BY (date)
LOCATION '/path/to/delta/table_name'
Or
CREATE TABLE table_name
USING DELTA
AS SELECT * FROM parquet.`/path/of/parquet_file/`
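Both statements can also be issued programmatically through spark.sql (a sketch; whether Delta DDL is supported in SQL depends on your Spark and Delta versions):

// Assumes an existing SparkSession named spark
spark.sql("""
  CREATE TABLE table_name (
    date DATE,
    eventId STRING,
    eventType STRING,
    data STRING)
  USING DELTA
  PARTITIONED BY (date)
  LOCATION '/path/to/delta/table_name'
""")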

Read a Delta table
You can access data in Delta tables either by specifying the path (/path/to/delta/table_name) or the table name (table_name):
SELECT * FROM table_name
or
SELECT * FROM delta.`/path/to/delta/table_name`
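The Scala equivalents would be (a sketch assuming an existing SparkSession named spark):

// By table name
spark.table("table_name").show()

// By path, without any metastore registration
spark.read.format("delta").load("/path/to/delta/table_name").show()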

Display table history
To view the history of a table, use the DESCRIBE HISTORY statement, which provides provenance information, including the table version, operation, user, and so on, for each write to a table.
DESCRIBE HISTORY table_name
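The same history is available programmatically through io.delta.tables.DeltaTable (a sketch; this API was added in a later Delta release than the 0.1.0 package shown above):

import io.delta.tables.DeltaTable

// Assumes an existing SparkSession named spark
val deltaTable = DeltaTable.forPath(spark, "/path/to/delta/table_name")
deltaTable.history().show(truncate = false)  // full history, newest first
deltaTable.history(1).show()                 // only the most recent operation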

Query an earlier version of the table (time travel) 
To query an older version of a table, specify a version or timestamp in a SELECT statement.
For example, to query version 0 of the table, use:
SELECT * FROM table_name VERSION AS OF 0
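The DataFrame reader supports the same thing through the versionAsOf and timestampAsOf options (a sketch using the path from the earlier examples):

// Assumes an existing SparkSession named spark

// Pin to a specific version
val v0 = spark.read
  .format("delta")
  .option("versionAsOf", "0")
  .load("/path/to/delta/table_name")

// Or pin to a point in time
val asOfDate = spark.read
  .format("delta")
  .option("timestampAsOf", "2019-01-01")
  .load("/path/to/delta/table_name")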

Follow here if you are interested in developing a Modern Data Warehouse solution using Delta Lake.
