Monday, 26 January 2026

How to Create a Delta Table from S3 Data in Databricks

Introduction

Delta tables are the preferred table format in Databricks because they provide ACID transactions, schema enforcement, and better query performance. This guide explains how to create a Delta table from CSV files stored in Amazon S3.

Step 1: Read the Source File from S3

# Read the source CSV from S3; inferSchema triggers an extra pass over the data
df = spark.read.option("header", "true").option("inferSchema", "true") \
  .csv("s3a://your-bucket-name/input/transactions.csv")
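The read above assumes the cluster already has access to the bucket. On Databricks, an instance profile attached to the cluster is the usual approach; if credentials must be supplied in code, one sketch (the secret scope and key names are illustrative) is:

```python
# Illustrative only: prefer instance profiles or Unity Catalog external
# locations over keys in code. Scope/key names here are assumptions.
access_key = dbutils.secrets.get(scope="aws", key="access-key")
secret_key = dbutils.secrets.get(scope="aws", key="secret-key")
spark.conf.set("fs.s3a.access.key", access_key)
spark.conf.set("fs.s3a.secret.key", secret_key)
```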

Step 2: Clean or Transform the Data

Apply any needed business rules before storing the data as Delta.

# Remove exact duplicate rows and drop rows missing the key column
clean_df = df.dropDuplicates().filter("transaction_id IS NOT NULL")

Step 3: Write the Data as Delta Format

Save the transformed data in Delta format, either to a path or a named table.

# Overwrite any existing data at the target path with the cleaned output
clean_df.write.format("delta").mode("overwrite") \
  .save("s3a://your-bucket-name/delta/transactions_delta")
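The snippet above is the path-based write; the named-table variant mentioned earlier can be sketched as follows. It writes the data and registers the table in the metastore in one step, so Step 4 is not needed for it (Databricks then manages the table's storage location):

```python
# Alternative: write straight to a metastore table instead of a path
clean_df.write.format("delta").mode("overwrite") \
  .saveAsTable("transactions_delta")
```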

Step 4: Register the Delta Table

Register the Delta location as an external table in the metastore so it can be queried by name in SQL.

CREATE TABLE transactions_delta
USING DELTA
LOCATION 's3a://your-bucket-name/delta/transactions_delta';

Step 5: Query the Delta Table

SELECT * FROM transactions_delta LIMIT 20;

Step 6: Benefit from Delta Features

Once stored as Delta, the table supports time travel (querying earlier versions), schema evolution, and ACID-safe upserts with MERGE.
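As a sketch of those features against the table registered above (the source table in the MERGE is a hypothetical staging table):

```sql
-- Inspect the table's version history
DESCRIBE HISTORY transactions_delta;

-- Time travel: query an earlier snapshot of the table
SELECT * FROM transactions_delta VERSION AS OF 0;

-- Upsert with MERGE (transactions_updates is illustrative)
MERGE INTO transactions_delta AS t
USING transactions_updates AS s
ON t.transaction_id = s.transaction_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```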

Conclusion

Creating Delta tables from S3 data is a Databricks best practice: it improves reliability through ACID guarantees, speeds up queries, and simplifies pipeline maintenance in real-world environments.
