Thursday, 5 March 2026

End-to-End Databricks S3 Workflow: Connect, Create Tables, Archive, and Move Files

Introduction

An end-to-end Databricks S3 pipeline often includes four major tasks: connecting to S3, reading source files, creating tables, and archiving or moving processed files. This guide brings all of those steps together into one practical workflow.

Step 1: Connect Databricks to S3

Configure credentials securely using secret scopes or IAM-based access.
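As a concrete sketch, the s3a access and secret keys can be pulled from a secret scope and applied to the Spark session. The scope name "aws-creds" and the key names below are placeholders, not fixed Databricks names:

```python
# Hadoop configuration keys that grant s3a access (standard s3a property names)
ACCESS_KEY_CONF = "fs.s3a.access.key"
SECRET_KEY_CONF = "fs.s3a.secret.key"

def s3_credential_confs(access_key, secret_key):
    """Return the Spark/Hadoop configuration entries for s3a access."""
    return {ACCESS_KEY_CONF: access_key, SECRET_KEY_CONF: secret_key}

# In a Databricks notebook, the values would come from a secret scope:
# creds = s3_credential_confs(
#     dbutils.secrets.get("aws-creds", "access-key"),
#     dbutils.secrets.get("aws-creds", "secret-key"),
# )
# for conf_key, value in creds.items():
#     spark.conf.set(conf_key, value)
```

IAM instance profiles or Unity Catalog external locations avoid key handling entirely and are preferable where available.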

Step 2: Read Input Files

df = spark.read.option("header", "true").option("inferSchema", "true") \
  .csv("s3a://source-bucket/input/sales.csv")

Step 3: Validate and Transform the Data

clean_df = df.dropDuplicates().filter("sales_id IS NOT NULL")

Step 4: Create a Delta Table

clean_df.write.format("delta").mode("overwrite").saveAsTable("sales_delta_table")

Step 5: Archive the Source File

dbutils.fs.cp("s3a://source-bucket/input/sales.csv",
              "s3a://archive-bucket/sales-archive/sales.csv")

Step 6: Delete or Move the Original File

# Remove the original only after the archive copy from Step 5 exists
dbutils.fs.rm("s3a://source-bucket/input/sales.csv")
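Alternatively, dbutils.fs.mv performs the copy and delete in a single call. The helper below is a small illustration (not part of dbutils) that derives the archive destination from the source path:

```python
def archive_destination(source_path, archive_prefix):
    """Map a source object path to its archive location, keeping the file name."""
    file_name = source_path.rstrip("/").rsplit("/", 1)[-1]
    return archive_prefix.rstrip("/") + "/" + file_name

# archive_destination("s3a://source-bucket/input/sales.csv",
#                     "s3a://archive-bucket/sales-archive/")
# → "s3a://archive-bucket/sales-archive/sales.csv"
#
# In the notebook:
# src = "s3a://source-bucket/input/sales.csv"
# dbutils.fs.mv(src, archive_destination(src, "s3a://archive-bucket/sales-archive/"))
```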

Step 7: Schedule the Pipeline

Use Databricks Jobs or Workflows to run the entire notebook on a schedule.
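For reference, a scheduled run can also be defined through the Jobs API. The fragment below is a minimal sketch of a Jobs 2.1 payload; the job name, notebook path, and cron expression are placeholders:

```json
{
  "name": "s3-sales-pipeline",
  "schedule": {
    "quartz_cron_expression": "0 0 2 * * ?",
    "timezone_id": "UTC",
    "pause_status": "UNPAUSED"
  },
  "tasks": [
    {
      "task_key": "run_pipeline",
      "notebook_task": {"notebook_path": "/Pipelines/s3_sales_pipeline"}
    }
  ]
}
```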

Step 8: Monitor and Audit

Maintain logs for file names, table loads, archive location, and job status so the pipeline remains easy to support.

Conclusion

This end-to-end pattern is one of the most useful Databricks designs for cloud data engineering. It starts with S3 connectivity, transforms raw files into queryable tables, and ends with safe archival or movement of processed data.

Monday, 2 March 2026

Automating S3 File Cleanup and Archival in Databricks

Introduction

Manually moving and deleting files becomes error-prone and time-consuming when data arrives every hour or every day. Databricks workflows and notebooks can automate the cleanup and archive logic so pipelines stay consistent and low-maintenance.

Step 1: Create a Notebook for File Operations

Write a Databricks notebook that lists source files, copies them to archive, validates the copy, and then deletes the originals.

Step 2: Parameterize the Paths

Use notebook widgets or variables for source bucket, archive bucket, and process date.

dbutils.widgets.text("source_path", "s3a://source-bucket/input/")
dbutils.widgets.text("archive_path", "s3a://archive-bucket/input-archive/")

source_path = dbutils.widgets.get("source_path")
archive_path = dbutils.widgets.get("archive_path")

Step 3: Process Files with a Loop

files = dbutils.fs.ls(source_path)

for file in files:
    dbutils.fs.cp(file.path, archive_path + file.name)
    # Validate the copy before deleting the original (see Step 1)
    copied = dbutils.fs.ls(archive_path + file.name)
    if copied and copied[0].size == file.size:
        dbutils.fs.rm(file.path)

Step 4: Add Logging

Store moved file names, timestamps, and statuses in a Delta log table so every operation is traceable.
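One way to make that concrete: build a log row per file, then append the collected rows to a Delta table. The record schema and the table name file_ops_log below are assumptions for illustration:

```python
from datetime import datetime, timezone

def file_log_record(file_name, archive_path, status):
    """Build one audit row for a processed file."""
    return {
        "file_name": file_name,
        "archive_path": archive_path,
        "status": status,
        "logged_at": datetime.now(timezone.utc).isoformat(),
    }

# In the notebook, append the rows after the loop:
# records = [file_log_record(f.name, archive_path + f.name, "moved") for f in files]
# spark.createDataFrame(records).write.format("delta") \
#     .mode("append").saveAsTable("file_ops_log")
```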

Step 5: Schedule the Notebook

Create a Databricks Workflow or Job to run this notebook daily or hourly.

Step 6: Add Alerting

Enable notifications or error handling so failures are reported immediately to the support team.
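A minimal error-handling sketch, assuming the team supplies a notify callable (for example, one that posts to email or a chat webhook):

```python
def run_with_alert(step, notify):
    """Run a pipeline step; on any failure, send an alert and re-raise
    so the Databricks job run is still marked as failed."""
    try:
        return step()
    except Exception as exc:
        notify(f"File cleanup failed: {exc}")
        raise
```

Job-level email or webhook notifications configured on the Databricks Job itself can complement this in-notebook handling.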

Conclusion

Automating S3 archival and cleanup in Databricks reduces manual work, improves reliability, and creates a repeatable process for enterprise data pipelines.
