How to Archive Files in S3 Using Databricks
Introduction
Archiving is a safer alternative to immediate deletion. Instead of removing processed files, many teams move them into an archive folder or bucket for future recovery, auditing, or compliance. This guide explains how to archive files in S3 using Databricks.
Step 1: Define Source and Archive Paths
Start by defining where processed files currently live and where they should be archived. For example, you may have an input folder and an archive folder inside the same bucket.
source_path = "s3a://your-bucket-name/input/"
archive_path = "s3a://your-bucket-name/archive/"
Step 2: List Source Files
files = dbutils.fs.ls(source_path)
display(files)
Step 3: Copy Files to Archive Location
Databricks exposes file copy operations through the dbutils.fs utilities, so each source file can be copied into the archive path.
# Copy each listed file into the archive location
for file in files:
    dbutils.fs.cp(file.path, archive_path + file.name)
Step 4: Validate Archive Copy
Check that the files exist in the archive location before removing them from the source folder.
display(dbutils.fs.ls(archive_path))
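Rather than relying only on a visual check, you can compare the two listings programmatically before deleting anything. The snippet below is a minimal sketch; it assumes the files variable from Step 2 is still in scope, and archived_names and missing are illustrative names introduced here.

# Names currently present in the archive location
archived_names = {f.name for f in dbutils.fs.ls(archive_path)}

# Source files that did not make it into the archive
missing = [f.name for f in files if f.name not in archived_names]

if missing:
    raise Exception(f"Archive copy incomplete, missing files: {missing}")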
Step 5: Delete Source Files After Successful Archive
# Remove the source files only after the archive copy has been verified
for file in files:
    dbutils.fs.rm(file.path)
Step 6: Add Date-Based Archive Folders
A cleaner approach is to store archived files in date-based folders such as archive/2026/03/06/, which makes retrieval much easier.
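A minimal sketch of this layout is shown below. It reuses the archive_path and files variables from the earlier steps; dated_archive_path is an illustrative name, and the folder is derived from the current date.

from datetime import datetime

# Build a dated archive prefix such as archive/2026/03/06/
today = datetime.now().strftime("%Y/%m/%d")
dated_archive_path = f"{archive_path}{today}/"

# Copy each source file into the dated archive folder
for file in files:
    dbutils.fs.cp(file.path, dated_archive_path + file.name)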
Conclusion
Archiving files in S3 using Databricks improves data retention and reduces risk. It is especially useful in pipelines where source files should not be lost immediately after processing.