How to Archive Processed Files from One S3 Bucket to Another Using Databricks
Introduction
Many organizations separate active and archived data into different S3 buckets. This approach improves data organization, cost management, and security. In this guide, we will move processed files from a source bucket into a dedicated archive bucket.
Step 1: Identify Processed Files
Processed files may be identified by folder structure, a file-name pattern, or a status marker written by a successful ETL run.
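As an illustration, suppose processed files carry a hypothetical ".done" suffix (the actual marker convention varies by pipeline); a small predicate can then filter a listing before archiving. This sketch runs on plain strings; in a notebook you would feed it names from dbutils.fs.ls.

```python
# Hypothetical convention: a file counts as processed when its name
# ends with ".done"; adjust the predicate to your own marker.
def select_processed(names, suffix=".done"):
    """Return only the names that carry the processed marker."""
    return [n for n in names if n.endswith(suffix)]

# In Databricks: select_processed([f.name for f in dbutils.fs.ls(processed_path)])
example = ["orders_2026.csv.done", "orders_tmp.csv", "customers.csv.done"]
print(select_processed(example))
```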
Step 2: Set Source and Archive Bucket Paths
processed_path = "s3a://raw-data-bucket/processed/"
archive_path = "s3a://archive-data-bucket/processed-archive/"
Step 3: Read the File List
processed_files = dbutils.fs.ls(processed_path)
Step 4: Copy Files into the Archive Bucket
for file in processed_files:
    dbutils.fs.cp(file.path, archive_path + file.name)
Step 5: Validate the Archive Operation
Confirm that every copied file exists in the archive bucket before deleting anything from the source.
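One way to validate is plain set logic over the two name listings, sketched here with example lists; in a notebook both lists would come from dbutils.fs.ls on the source and archive paths.

```python
def missing_from_archive(source_names, archive_names):
    """Names present in the source listing but absent from the archive."""
    return sorted(set(source_names) - set(archive_names))

# In Databricks, build each list as [f.name for f in dbutils.fs.ls(path)].
source = ["a.csv", "b.csv"]
archive = ["a.csv"]
missing = missing_from_archive(source, archive)
if missing:
    # Do not proceed to deletion while any file is unaccounted for.
    print("Do not delete yet; missing from archive:", missing)
```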
Step 6: Remove the Original Files
for file in processed_files:
    dbutils.fs.rm(file.path)
Step 7: Organize Archive by Date
A dated folder structure such as processed-archive/year=2026/month=03/day=06 simplifies traceability and makes it easy to apply retention rules later.
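A dated prefix can be derived from the run date. This sketch assumes the Hive-style year=/month=/day= layout suggested above; the base path and helper name are illustrative.

```python
from datetime import date

def dated_archive_path(base, d=None):
    """Build a year=/month=/day= partitioned archive prefix under base."""
    d = d or date.today()
    return f"{base}year={d.year}/month={d.month:02d}/day={d.day:02d}/"

print(dated_archive_path("s3a://archive-data-bucket/processed-archive/", date(2026, 3, 6)))
# Files would then be copied with dbutils.fs.cp(file.path, dated_path + file.name).
```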
Conclusion
Archiving processed files from one S3 bucket to another is a practical design for stable data engineering systems. It separates active workloads from historical storage while keeping recovery possible.