How to Create a Delta Table from S3 Data in Databricks
Introduction
Delta tables are the preferred table format in Databricks because they provide ACID transactions, schema enforcement, and performance optimizations such as data skipping and file compaction. This guide walks through creating a Delta table from CSV files stored in Amazon S3.
Step 1: Read the Source File from S3
df = spark.read.option("header", "true").option("inferSchema", "true") \
    .csv("s3a://your-bucket-name/input/transactions.csv")
Step 2: Clean or Transform the Data
Apply any required business rules before writing the data as Delta, for example removing duplicate rows and records with missing keys.
clean_df = df.dropDuplicates().filter("transaction_id IS NOT NULL")
Step 3: Write the Data as Delta Format
Save the transformed data in Delta format, either to a path or a named table.
clean_df.write.format("delta").mode("overwrite") \
    .save("s3a://your-bucket-name/delta/transactions_delta")
Step 4: Register the Delta Table
Register a SQL table that points to the Delta location so the data can be queried by name.
CREATE TABLE transactions_delta
USING DELTA
LOCATION 's3a://your-bucket-name/delta/transactions_delta';
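To confirm the registration succeeded, you can inspect the table's metadata. A minimal check, assuming the table name used above, might look like:

```sql
-- Shows format, location, number of files, and size of the Delta table
DESCRIBE DETAIL transactions_delta;
```

The `location` column in the result should match the S3 path the data was written to.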
Step 5: Query the Delta Table
SELECT * FROM transactions_delta LIMIT 20;
Step 6: Benefit from Delta Features
Once stored as Delta, the table supports features like time travel, schema evolution, and optimized merges.
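As an illustrative sketch of these features, using the table from the steps above (note that `transactions_updates` is a hypothetical staging table, not one created in this guide):

```sql
-- Time travel: query the table as of an earlier version
SELECT * FROM transactions_delta VERSION AS OF 0;

-- Upsert: merge new or changed records from a staging table
MERGE INTO transactions_delta AS t
USING transactions_updates AS u
ON t.transaction_id = u.transaction_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```

Because every write creates a new table version, time travel lets you audit or roll back changes, and MERGE performs the upsert atomically.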
Conclusion
Creating Delta tables from S3 data is a best practice in Databricks because it improves reliability, query performance, and pipeline maintainability in real-world environments.