Automating S3 File Cleanup and Archival in Databricks
Introduction
Manually moving and deleting files becomes error-prone once data arrives every hour or every day. Databricks workflows and notebooks can automate the cleanup and archival logic so pipelines stay consistent and low-maintenance.
Step 1: Create a Notebook for File Operations
Write a Databricks notebook that lists the source files, copies them to the archive location, validates each copy, and only then deletes the originals.
Step 2: Parameterize the Paths
Use notebook widgets or variables for source bucket, archive bucket, and process date.
dbutils.widgets.text("source_path", "s3a://source-bucket/input/")
dbutils.widgets.text("archive_path", "s3a://archive-bucket/input-archive/")
source_path = dbutils.widgets.get("source_path")
archive_path = dbutils.widgets.get("archive_path")
Step 3: Process Files with a Loop
files = dbutils.fs.ls(source_path)
for file in files:
    dest = archive_path + file.name
    dbutils.fs.cp(file.path, dest)
    # Validate the copy (size match) before deleting the original
    if dbutils.fs.ls(dest)[0].size == file.size:
        dbutils.fs.rm(file.path)
Step 4: Add Logging
Store moved file names, timestamps, and statuses in a Delta log table so every operation is traceable.
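A minimal logging sketch is shown below. The table name `ops.file_archive_log` and the three-column schema are assumptions for illustration; a pure helper builds the audit row so it can be tested independently of Spark:

```python
from datetime import datetime, timezone

def build_log_row(file_name, status):
    # Pure helper: one audit record (schema is an assumption)
    return (file_name, status, datetime.now(timezone.utc).isoformat())

def log_file_operation(spark, file_name, status, log_table="ops.file_archive_log"):
    # Append one traceable row per file operation to a Delta log table
    df = spark.createDataFrame(
        [build_log_row(file_name, status)],
        ["file_name", "status", "moved_at_utc"],
    )
    df.write.format("delta").mode("append").saveAsTable(log_table)
```

Call `log_file_operation(spark, file.name, "ARCHIVED")` inside the Step 3 loop, and log a failure status from the exception handler so every file has exactly one audit row.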
Step 5: Schedule the Notebook
Create a Databricks Workflow or Job to run this notebook daily or hourly.
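Jobs can be created in the UI, or programmatically via the Jobs API 2.1. The sketch below shows the shape of a job-settings payload for a daily 02:00 UTC run; the job name, notebook path, and bucket paths are illustrative assumptions:

```python
# Sketch of a Databricks Jobs API 2.1 settings payload (names/paths are assumptions)
job_settings = {
    "name": "s3-archive-cleanup",
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",  # daily at 02:00
        "timezone_id": "UTC",
        "pause_status": "UNPAUSED",
    },
    "tasks": [
        {
            "task_key": "archive_files",
            "notebook_task": {
                "notebook_path": "/Repos/data/archive_cleanup",
                "base_parameters": {
                    "source_path": "s3a://source-bucket/input/",
                    "archive_path": "s3a://archive-bucket/input-archive/",
                },
            },
        }
    ],
}
```

The `base_parameters` keys match the widget names from Step 2, so the scheduled run overrides the notebook defaults without any code changes.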
Step 6: Add Alerting
Enable notifications or error handling so failures are reported immediately to the support team.
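Job-level email or webhook notifications cover most cases, but wrapping the archive step also lets the notebook send its own alert before failing. A minimal sketch, where `notify` is any callable (for example, a function that posts to a team webhook):

```python
def archive_with_alerting(run_archive, notify):
    """Run the archive step; on failure, alert the support team and re-raise."""
    try:
        run_archive()
        return "SUCCESS"
    except Exception as exc:
        notify(f"Archive job failed: {exc}")
        raise  # re-raise so the Databricks job run is marked as Failed
```

Re-raising is important: swallowing the exception would report the run as successful and suppress the job's built-in failure notifications.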
Conclusion
Automating S3 archival and cleanup in Databricks reduces manual work, improves reliability, and creates a repeatable process for enterprise data pipelines.