Automating S3 File Cleanup and Archival in Databricks
Introduction
Manually moving and deleting files becomes error-prone once data arrives every hour or every day. Databricks workflows and notebooks can automate the cleanup and archival logic so pipelines stay consistent and low-maintenance.
Step 1: Create a Notebook for File Operations
Write a Databricks notebook that lists the source files, copies them to the archive location, validates each copy, and only then deletes the originals.
Step 2: Parameterize the Paths
Use notebook widgets or variables for source bucket, archive bucket, and process date.
dbutils.widgets.text("source_path", "s3a://source-bucket/input/")
dbutils.widgets.text("archive_path", "s3a://archive-bucket/input-archive/")
source_path = dbutils.widgets.get("source_path")
archive_path = dbutils.widgets.get("archive_path")
Step 3: Process Files with a Loop
files = dbutils.fs.ls(source_path)
for file in files:
    dest = archive_path + file.name
    dbutils.fs.cp(file.path, dest)
    # Validate the copy (size match) before deleting the original
    if dbutils.fs.ls(dest)[0].size == file.size:
        dbutils.fs.rm(file.path)
Step 4: Add Logging
Store moved file names, timestamps, and statuses in a Delta log table so every operation is traceable.
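A minimal logging sketch is shown below. The table name `ops.file_archive_log` and the three-column schema are assumptions for illustration; a pure helper builds the audit row so it can be tested independently of Spark:

```python
from datetime import datetime, timezone

def build_log_row(file_name, status):
    # Pure helper: one audit record (schema is an assumption)
    return (file_name, status, datetime.now(timezone.utc).isoformat())

def log_file_operation(spark, file_name, status, log_table="ops.file_archive_log"):
    # Append one traceable row per file operation to a Delta log table
    df = spark.createDataFrame(
        [build_log_row(file_name, status)],
        ["file_name", "status", "moved_at_utc"],
    )
    df.write.format("delta").mode("append").saveAsTable(log_table)
```

Call `log_file_operation(spark, file.name, "ARCHIVED")` inside the Step 3 loop, and log a failure status from the exception handler so every file has exactly one audit row.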
Step 5: Schedule the Notebook
Create a Databricks Workflow or Job to run this notebook daily or hourly.
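Jobs can be created in the UI, or programmatically via the Jobs API 2.1. The sketch below shows the shape of a job-settings payload for a daily 02:00 UTC run; the job name, notebook path, and bucket paths are illustrative assumptions:

```python
# Sketch of a Databricks Jobs API 2.1 settings payload (names/paths are assumptions)
job_settings = {
    "name": "s3-archive-cleanup",
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",  # daily at 02:00
        "timezone_id": "UTC",
        "pause_status": "UNPAUSED",
    },
    "tasks": [
        {
            "task_key": "archive_files",
            "notebook_task": {
                "notebook_path": "/Repos/data/archive_cleanup",
                "base_parameters": {
                    "source_path": "s3a://source-bucket/input/",
                    "archive_path": "s3a://archive-bucket/input-archive/",
                },
            },
        }
    ],
}
```

The `base_parameters` keys match the widget names from Step 2, so the scheduled run overrides the notebook defaults without any code changes.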
Step 6: Add Alerting
Enable notifications or error handling so failures are reported immediately to the support team.
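Job-level email or webhook notifications cover most cases, but wrapping the archive step also lets the notebook send its own alert before failing. A minimal sketch, where `notify` is any callable (for example, a function that posts to a team webhook):

```python
def archive_with_alerting(run_archive, notify):
    """Run the archive step; on failure, alert the support team and re-raise."""
    try:
        run_archive()
        return "SUCCESS"
    except Exception as exc:
        notify(f"Archive job failed: {exc}")
        raise  # re-raise so the Databricks job run is marked as Failed
```

Re-raising is important: swallowing the exception would report the run as successful and suppress the job's built-in failure notifications.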
Conclusion
Automating S3 archival and cleanup in Databricks reduces manual work, improves reliability, and creates a repeatable process for enterprise data pipelines.