Monday, 23 February 2026

How to Archive Processed Files from One S3 Bucket to Another Using Databricks

Introduction

Many organizations separate active and archived data into different S3 buckets. This approach improves data organization, cost management, and security. In this guide, we will move processed files from a source bucket into a dedicated archive bucket.

Step 1: Identify Processed Files

Processed files may be identified by folder structure, a file-name pattern, or an ETL success marker (for example, a _SUCCESS file written by the job).
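As a minimal sketch of the name-pattern approach (the "_processed.parquet" suffix is an assumed convention, not something this pipeline mandates), the names returned by a listing can be filtered like this:

```python
# Filter a file listing down to processed files by name pattern.
# The "_processed.parquet" suffix is a hypothetical convention;
# replace it with whatever marker your ETL job uses.
def select_processed(file_names, suffix="_processed.parquet"):
    return [name for name in file_names if name.endswith(suffix)]

# In Databricks, file_names would come from [f.name for f in dbutils.fs.ls(...)]
names = ["orders_processed.parquet", "orders_raw.parquet", "_SUCCESS"]
print(select_processed(names))  # ['orders_processed.parquet']
```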

Step 2: Set Source and Archive Bucket Paths

processed_path = "s3a://raw-data-bucket/processed/"
archive_path = "s3a://archive-data-bucket/processed-archive/"

Step 3: Read the File List

# List every object under the processed prefix
processed_files = dbutils.fs.ls(processed_path)

Step 4: Copy Files into the Archive Bucket

# Copy each file into the archive bucket, keeping the original file name
for file in processed_files:
    dbutils.fs.cp(file.path, archive_path + file.name)

Step 5: Validate the Archive Operation

Confirm that every required file exists in the archive bucket before deleting anything from the source.
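One way to sketch this check, assuming the same file names should appear under both prefixes (in Databricks, each name list would come from dbutils.fs.ls on the corresponding path), is to compare the two sets of names:

```python
# Compare source and archive listings by file name and report anything
# that has not yet landed in the archive. Deletion should only proceed
# when this returns an empty list.
def missing_from_archive(source_names, archive_names):
    return sorted(set(source_names) - set(archive_names))

src = ["a.parquet", "b.parquet"]
arc = ["a.parquet"]
print(missing_from_archive(src, arc))  # ['b.parquet'] still needs archiving
```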

Step 6: Remove the Original Files

# Delete the originals only after the archive has been validated
for file in processed_files:
    dbutils.fs.rm(file.path)

Step 7: Organize Archive by Date

A dated folder structure such as processed-archive/year=2026/month=03/day=06 simplifies traceability and makes it easy to apply lifecycle rules to older partitions.
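The dated prefix can be built from the run date with the standard library; a minimal sketch, using the archive path from Step 2 as the base:

```python
from datetime import date

# Build a Hive-style dated prefix, e.g. .../year=2026/month=03/day=06/
def dated_archive_path(base, run_date):
    return (f"{base}year={run_date.year}/"
            f"month={run_date.month:02d}/day={run_date.day:02d}/")

print(dated_archive_path("s3a://archive-data-bucket/processed-archive/",
                         date(2026, 3, 6)))
# s3a://archive-data-bucket/processed-archive/year=2026/month=03/day=06/
```

The copy loop in Step 4 would then target this dated path instead of the flat archive_path.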

Conclusion

Archiving processed files from one S3 bucket to another is a practical design for stable data engineering systems. It separates active workloads from historical storage while keeping recovery possible.
