Delta Tables in Azure Databricks

Delta Tables in Azure Databricks are built on Delta Lake, an open-source storage layer that provides ACID transactions, time travel, and other features for data reliability and performance. They can be created and managed within Databricks, stored on Azure Data Lake Storage (ADLS), and integrated into your broader data ecosystem.

Types of Delta Tables

  1. Managed Delta Table:
    • Databricks manages both the metadata and the underlying data files.
    • The data is stored in the workspace's default storage location, and dropping the table deletes the data.
  2. External Delta Table:
    • Databricks manages only the metadata; the data files live in external storage (e.g., ADLS) at a path you specify, as shown in the sketch below.
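
The difference is easiest to see at creation time. A minimal sketch (the table names my_managed_table and my_external_table, and the ADLS path, are placeholders): dropping a managed table deletes its data files, while dropping an external table leaves the files in storage.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(0, 5)  # toy DataFrame

# Managed: Databricks chooses and owns the storage location.
df.write.format("delta").saveAsTable("my_managed_table")

# External: metadata in the metastore, data files at a path you own.
(df.write.format("delta")
    .option("path", "abfss://<container>@<storage_account>.dfs.core.windows.net/<path>")
    .saveAsTable("my_external_table"))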

Creating a Delta Table

Method 1: Writing to a Delta Table

You can create a Delta Table by saving a DataFrame in the Delta format.

from pyspark.sql import SparkSession

# On Databricks, a SparkSession named `spark` already exists in notebooks;
# building one explicitly keeps the example self-contained.
spark = SparkSession.builder.appName("DeltaTableExample").getOrCreate()

# Create a toy DataFrame with a single `id` column (values 0-4)
data = spark.range(0, 5)

# Write the DataFrame as a Delta Table
data.write.format("delta").save("abfss://<container>@<storage_account>.dfs.core.windows.net/<path>")
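
Note that the cluster must be able to authenticate to the storage account before it can write to an abfss:// path. One option is account-key authentication, sketched below with placeholder values (service principals and credential passthrough are common alternatives):

# Account-key authentication for ABFS (placeholder account name and key).
spark.conf.set(
    "fs.azure.account.key.<storage_account>.dfs.core.windows.net",
    "<storage_account_access_key>")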

Method 2: SQL Syntax

-- Create a Delta table from existing Parquet files
CREATE TABLE delta_table
USING DELTA
AS SELECT * FROM parquet.`abfss://<container>@<storage_account>.dfs.core.windows.net/<path_to_parquet>`;

Querying a Delta Table

You can use either Spark or SQL to query Delta Tables.

Using PySpark

delta_table = spark.read.format("delta").load("abfss://<container>@<storage_account>.dfs.core.windows.net/<path>")
delta_table.show()

Using SQL

SELECT * FROM delta.`abfss://<container>@<storage_account>.dfs.core.windows.net/<path>`;

Updating a Delta Table

Delta Lake allows for updates, merges, and deletes:

Update Example

UPDATE delta.`abfss://<container>@<storage_account>.dfs.core.windows.net/<path>`
SET column_name = new_value
WHERE condition;
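
The same update can be expressed with the DeltaTable Python API. A minimal sketch, assuming the table at the path above has a column named column_name (the predicate and values are placeholders):

from delta.tables import DeltaTable

delta_tbl = DeltaTable.forPath(
    spark, "abfss://<container>@<storage_account>.dfs.core.windows.net/<path>")

# Update rows matching the predicate; `set` maps columns to SQL expressions.
delta_tbl.update(
    condition="column_name = 'old_value'",  # placeholder predicate
    set={"column_name": "'new_value'"})     # placeholder assignment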

Merge (Upsert) Example

MERGE INTO delta_table AS target
USING updates_table AS source
ON target.id = source.id
WHEN MATCHED THEN UPDATE SET target.value = source.value
WHEN NOT MATCHED THEN INSERT (id, value) VALUES (source.id, source.value);
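
The equivalent upsert with the DeltaTable Python API, sketched with a placeholder updates_df standing in for updates_table:

from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "delta_table")
updates_df = spark.createDataFrame([(1, "v1")], ["id", "value"])  # placeholder changes

# Upsert: update matching rows, insert the rest.
(target.alias("target")
    .merge(updates_df.alias("source"), "target.id = source.id")
    .whenMatchedUpdate(set={"value": "source.value"})
    .whenNotMatchedInsert(values={"id": "source.id", "value": "source.value"})
    .execute())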

Delete Example

DELETE FROM delta.`abfss://<container>@<storage_account>.dfs.core.windows.net/<path>`
WHERE condition;
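
And the corresponding programmatic delete (the predicate is a placeholder):

from delta.tables import DeltaTable

delta_tbl = DeltaTable.forPath(
    spark, "abfss://<container>@<storage_account>.dfs.core.windows.net/<path>")

# Remove all rows matching the predicate.
delta_tbl.delete("column_name = 'obsolete_value'")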

Time Travel

Delta Lake provides the ability to query historical versions of data.

Query a Previous Version

SELECT * FROM delta.`abfss://<container>@<storage_account>.dfs.core.windows.net/<path>` VERSION AS OF 1;

Query by Timestamp

SELECT * FROM delta.`abfss://<container>@<storage_account>.dfs.core.windows.net/<path>` TIMESTAMP AS OF '2023-11-01T00:00:00Z';
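
Time travel is also available from PySpark through reader options; a sketch reusing the placeholder path:

path = "abfss://<container>@<storage_account>.dfs.core.windows.net/<path>"

# Read a specific version of the table.
v1 = spark.read.format("delta").option("versionAsOf", 1).load(path)

# Read the table as of a timestamp.
snapshot = (spark.read.format("delta")
    .option("timestampAsOf", "2023-11-01T00:00:00Z")
    .load(path))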

Best Practices

  1. Enable Disk Caching: The Databricks disk cache (formerly the Delta cache) keeps copies of remote data on the cluster's local SSDs, speeding up repeated reads.
  2. Partition Your Data: Partition on columns that queries frequently filter by, so irrelevant files can be skipped on large datasets.
  3. Optimize Your Table:
    • Use OPTIMIZE to compact many small files into fewer large ones.
    OPTIMIZE delta_table;

  4. Vacuum Old Files: VACUUM removes data files no longer referenced by the table's transaction log (e.g., after updates/deletes); by default, only files older than the 7-day retention period are deleted. A combined sketch follows this list.
    VACUUM delta.`abfss://<container>@<storage_account>.dfs.core.windows.net/<path>`;
    
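A combined sketch of practices 2-4 (the bucket column, the delta_table name, and the id Z-order column are placeholders; ZORDER BY is optional and most useful on high-cardinality filter columns):

from pyspark.sql import functions as F

df = spark.range(0, 1000).withColumn("bucket", F.col("id") % 10)  # toy data

# Write partitioned by a frequently filtered column.
(df.write.format("delta")
    .partitionBy("bucket")
    .save("abfss://<container>@<storage_account>.dfs.core.windows.net/<path>"))

# Compact small files and co-locate rows on a filter column.
spark.sql("OPTIMIZE delta_table ZORDER BY (id)")

# Remove unreferenced files older than the default 7-day retention.
spark.sql("VACUUM delta_table")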

Integrating with Azure Data Services

  • Azure Synapse Analytics: Query Delta Tables from Synapse Spark pools or serverless SQL pools.
  • Power BI: Connect Power BI to Delta tables using Databricks SQL (formerly SQL Analytics).
