Monday, 2 March 2026

Automating S3 File Cleanup and Archival in Databricks

Introduction

Manually moving and deleting files becomes error-prone once data arrives every hour or every day. Databricks Workflows and notebooks can automate file cleanup and archival logic so pipelines stay consistent and low-maintenance.

Step 1: Create a Notebook for File Operations

Write a Databricks notebook that lists source files, copies them to archive, validates the copy, and then deletes the originals.

Step 2: Parameterize the Paths

Use notebook widgets or variables for source bucket, archive bucket, and process date.

dbutils.widgets.text("source_path", "s3a://source-bucket/input/")
dbutils.widgets.text("archive_path", "s3a://archive-bucket/input-archive/")
dbutils.widgets.text("process_date", "")

source_path = dbutils.widgets.get("source_path")
archive_path = dbutils.widgets.get("archive_path")
process_date = dbutils.widgets.get("process_date")

Step 3: Process Files with a Loop

files = dbutils.fs.ls(source_path)

for file in files:
    target = archive_path + file.name
    dbutils.fs.cp(file.path, target)
    # Validate the copy before removing the original
    if dbutils.fs.ls(target)[0].size == file.size:
        dbutils.fs.rm(file.path)

Step 4: Add Logging

Store moved file names, timestamps, and statuses in a Delta log table so every operation is traceable.
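As a minimal sketch of the log-record structure (the helper name and example file names are assumptions, not part of the pipeline above):

```python
from datetime import datetime, timezone

def make_log_row(file_name, status, error=None):
    """Build one audit record for a file operation."""
    return {
        "file_name": file_name,
        "moved_at": datetime.now(timezone.utc).isoformat(),
        "status": status,   # e.g. "ARCHIVED" or "FAILED"
        "error": error,     # exception text when status is "FAILED"
    }

rows = [
    make_log_row("events_2026-03-02.csv", "ARCHIVED"),
    make_log_row("bad_file.csv", "FAILED", error="copy size mismatch"),
]

# In the notebook, append the batch to a Delta log table, for example:
# spark.createDataFrame(rows).write.format("delta").mode("append").saveAsTable("ops.file_ops_log")
```

Writing one row per file, rather than one row per run, makes it possible to trace exactly which files were archived, when, and why any failed.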

Step 5: Schedule the Notebook

Create a Databricks Workflow or Job to run this notebook daily or hourly.
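As a sketch, a Jobs API 2.1 job definition for an hourly run might look like the following (the job name, notebook path, cluster ID placeholder, and cron expression are assumptions to adapt to your workspace):

```json
{
  "name": "s3-archive-cleanup",
  "tasks": [
    {
      "task_key": "archive_files",
      "notebook_task": {
        "notebook_path": "/Repos/ops/s3_archive_cleanup",
        "base_parameters": {
          "source_path": "s3a://source-bucket/input/",
          "archive_path": "s3a://archive-bucket/input-archive/"
        }
      },
      "existing_cluster_id": "<cluster-id>"
    }
  ],
  "schedule": {
    "quartz_cron_expression": "0 0 * * * ?",
    "timezone_id": "UTC"
  }
}
```

The `base_parameters` map feeds the notebook widgets from Step 2, so the same notebook can serve multiple buckets or schedules without code changes.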

Step 6: Add Alerting

Enable notifications or error handling so failures are reported immediately to the support team.
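One simple pattern is to wrap each step so failures are reported and then re-raised, which marks the job run as failed and lets Databricks job-level notifications fire. A minimal sketch (the wrapper name and the `notify` callback are assumptions; `notify` stands in for email, Slack, or any alerting hook):

```python
def run_with_alert(step_name, fn, notify):
    """Run a pipeline step; on failure, report it and re-raise so the
    job run is marked failed and notifications are triggered."""
    try:
        return fn()
    except Exception as exc:
        notify(f"{step_name} failed: {exc}")
        raise

# Usage example with a deliberately failing step and a list as the alert sink:
alerts = []
try:
    run_with_alert("archive", lambda: 1 / 0, alerts.append)
except ZeroDivisionError:
    pass
```

Re-raising is the important part: swallowing the exception would let the job finish "successfully" and silently skip the alerting configured on the Workflow.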

Conclusion

Automating S3 archival and cleanup in Databricks reduces manual work, improves reliability, and creates a repeatable process for enterprise data pipelines.
