Delta Tables in Azure Databricks
Delta Tables in Azure Databricks are built on Delta Lake, which provides ACID transactions, time travel, and other features for data reliability and performance. These tables can be created and managed within Databricks, stored on Azure Data Lake Storage (ADLS), and integrated into your broader data ecosystem.
Types of Delta Tables
- Managed Delta Table:
  - Databricks manages both the metadata and the underlying data files.
  - The data is stored in the workspace's default storage location; dropping the table also deletes the data.
- External Delta Table:
  - Databricks manages only the metadata; the data lives in external storage you control (e.g., ADLS), and dropping the table leaves the data files in place (see the sketch after this list).
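A minimal PySpark sketch of creating both table types, assuming an existing DataFrame df; the table names and the ADLS path are illustrative placeholders:
# Managed table: Databricks controls both the metadata and the data files
df.write.format("delta").saveAsTable("my_managed_table")
# External table: metadata in the metastore, data files at a path you control
df.write.format("delta") \
    .option("path", "abfss://<container>@<storage_account>.dfs.core.windows.net/<path>") \
    .saveAsTable("my_external_table")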
Creating a Delta Table
Method 1: Writing to a Delta Table
You can create a Delta Table by saving a DataFrame in the Delta format.
from pyspark.sql import SparkSession
# Create a Spark session (on Databricks, one is already available as `spark`)
spark = SparkSession.builder.appName("DeltaTableExample").getOrCreate()
# Create a small DataFrame with a single `id` column (values 0 through 4)
data = spark.range(0, 5)
# Write the DataFrame as a Delta Table
data.write.format("delta").save("abfss://<container>@<storage_account>.dfs.core.windows.net/<path>")
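By default, the save fails if data already exists at the target path. A short sketch of the common write modes, using the same placeholder path:
# Overwrite whatever is currently stored at the path
data.write.format("delta").mode("overwrite").save("abfss://<container>@<storage_account>.dfs.core.windows.net/<path>")
# Append new rows to the existing Delta Table
data.write.format("delta").mode("append").save("abfss://<container>@<storage_account>.dfs.core.windows.net/<path>")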
Method 2: SQL Syntax
CREATE TABLE delta_table
USING DELTA
AS SELECT * FROM parquet.`abfss://<container>@<storage_account>.dfs.core.windows.net/<path_to_parquet>`;
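The same CREATE TABLE ... AS SELECT statement can also be issued from PySpark; a minimal sketch, assuming the same placeholder path:
spark.sql("""
    CREATE TABLE delta_table
    USING DELTA
    AS SELECT * FROM parquet.`abfss://<container>@<storage_account>.dfs.core.windows.net/<path_to_parquet>`
""")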
Querying a Delta Table
You can use either Spark or SQL to query Delta Tables.
Using PySpark
delta_table = spark.read.format("delta").load("abfss://<container>@<storage_account>.dfs.core.windows.net/<path>")
delta_table.show()
Using SQL
SELECT * FROM delta.`abfss://<container>@<storage_account>.dfs.core.windows.net/<path>`;
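To mix the two approaches, you can register the loaded DataFrame as a temporary view and query it with SQL. A minimal sketch; the view name delta_view is illustrative:
# Expose the DataFrame to the SQL engine under a temporary name
delta_table.createOrReplaceTempView("delta_view")
spark.sql("SELECT COUNT(*) AS row_count FROM delta_view").show()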
Updating a Delta Table
Delta Lake allows for updates, merges, and deletes:
Update Example
UPDATE delta.`abfss://<container>@<storage_account>.dfs.core.windows.net/<path>`
SET column_name = new_value
WHERE condition;
Merge (Upsert) Example
MERGE INTO delta_table AS target
USING updates_table AS source
ON target.id = source.id
WHEN MATCHED THEN UPDATE SET target.value = source.value
WHEN NOT MATCHED THEN INSERT (id, value) VALUES (source.id, source.value);
Delete Example
DELETE FROM delta.`abfss://<container>@<storage_account>.dfs.core.windows.net/<path>`
WHERE condition;
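The same operations are also available from Python through the DeltaTable API in the delta-spark package. A hedged sketch against the placeholder path; the columns id and value are illustrative and must exist in your table:
from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "abfss://<container>@<storage_account>.dfs.core.windows.net/<path>")
# Update rows that match a condition
target.update(condition="id = 1", set={"value": "'new_value'"})
# Delete rows that match a condition
target.delete("id = 2")
# Merge (upsert) rows from a source DataFrame
updates_df = spark.createDataFrame([(1, "a"), (3, "c")], ["id", "value"])
(target.alias("t")
    .merge(updates_df.alias("s"), "t.id = s.id")
    .whenMatchedUpdate(set={"value": "s.value"})
    .whenNotMatchedInsert(values={"id": "s.id", "value": "s.value"})
    .execute())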
Time Travel
Delta Lake provides the ability to query historical versions of data.
Query a Previous Version
SELECT * FROM delta.`abfss://<container>@<storage_account>.dfs.core.windows.net/<path>` VERSION AS OF 1;
Query by Timestamp
SELECT * FROM delta.`abfss://<container>@<storage_account>.dfs.core.windows.net/<path>` TIMESTAMP AS OF '2023-11-01T00:00:00Z';
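From PySpark, the same snapshots are reachable through reader options, and DESCRIBE HISTORY lists the versions and timestamps you can travel to. A brief sketch using the placeholder path:
# Read a specific version of the table
v1 = spark.read.format("delta").option("versionAsOf", 1).load("abfss://<container>@<storage_account>.dfs.core.windows.net/<path>")
# Read the table as it was at a given timestamp
snapshot = spark.read.format("delta").option("timestampAsOf", "2023-11-01T00:00:00Z").load("abfss://<container>@<storage_account>.dfs.core.windows.net/<path>")
# Inspect the commit history to find valid versions and timestamps
spark.sql("DESCRIBE HISTORY delta.`abfss://<container>@<storage_account>.dfs.core.windows.net/<path>`").show()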
Best Practices
- Enable Delta Caching: Caches remote data on the cluster's local disks to speed up repeated reads.
- Partition Your Data: Partition by columns that queries frequently filter on so irrelevant files can be skipped (see the sketch after this list).
- Optimize Your Table: Use OPTIMIZE to compact many small files into fewer, larger ones.
OPTIMIZE delta_table;
- Vacuum Old Files: Use VACUUM to remove data files no longer referenced by Delta Lake (e.g., after updates or deletes); by default, only files older than the 7-day retention period are removed.
VACUUM delta.`abfss://<container>@<storage_account>.dfs.core.windows.net/<path>`;
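A minimal PySpark sketch of partitioned writes and the maintenance commands above; the DataFrame df and the partition column event_date are illustrative:
# Partition on write by a column that queries frequently filter on
df.write.format("delta").partitionBy("event_date").save("abfss://<container>@<storage_account>.dfs.core.windows.net/<path>")
# Compact small files into fewer, larger ones
spark.sql("OPTIMIZE delta.`abfss://<container>@<storage_account>.dfs.core.windows.net/<path>`")
# Remove data files no longer referenced and older than the retention period
spark.sql("VACUUM delta.`abfss://<container>@<storage_account>.dfs.core.windows.net/<path>`")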
Integrating with Azure Data Services
- Azure Synapse Analytics: Query Delta Tables via Synapse Spark or SQL pools.
- Power BI: Connect Power BI to Delta tables using Databricks SQL Analytics.