Monday, 9 February 2026

How to Archive Files in S3 Using Databricks

Introduction

Archiving is a safer alternative to immediate deletion. Instead of removing processed files, many teams move them into an archive folder or bucket for future recovery, auditing, or compliance. This guide explains how to archive files in S3 using Databricks.

Step 1: Define Source and Archive Paths

For example, you may have an input folder and an archive folder inside the same bucket.

source_path = "s3a://your-bucket-name/input/"
archive_path = "s3a://your-bucket-name/archive/"

Step 2: List Source Files

files = dbutils.fs.ls(source_path)
display(files)

Step 3: Copy Files to Archive Location

Databricks provides file copy operations through its built-in dbutils.fs filesystem utilities, which work against S3 paths mounted or accessed directly.

for file in files:
    dbutils.fs.cp(file.path, archive_path + file.name)

Step 4: Validate Archive Copy

Check that the files exist in the archive location before removing them from the source folder.

display(dbutils.fs.ls(archive_path))
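A simple validation is to confirm that every source file name now appears in the archive before any deletion runs. The sketch below shows the comparison logic in isolation; the source_names and archive_names lists are placeholders standing in for the names you would actually pull from dbutils.fs.ls:

```python
def archive_is_complete(source_names, archive_names):
    """Return True only when every source file name appears in the archive."""
    missing = set(source_names) - set(archive_names)
    return len(missing) == 0

# Placeholder names rather than a live dbutils.fs.ls call:
source_names = ["orders_001.csv", "orders_002.csv"]
archive_names = ["orders_001.csv", "orders_002.csv", "older_file.csv"]
print(archive_is_complete(source_names, archive_names))  # True
```

In a notebook you would feed it real listings, for example [f.name for f in dbutils.fs.ls(source_path)] on one side and the same expression for archive_path on the other, and only proceed to Step 5 when it returns True.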

Step 5: Delete Source Files After Successful Archive

for file in files:
    dbutils.fs.rm(file.path)

Step 6: Add Date-Based Archive Folders

A cleaner approach is to store archived files in date-partitioned folders such as archive/2026/03/06/, which makes retrieval, auditing, and lifecycle policies much easier to manage.
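One way to build such a path, assuming you want to partition by the current processing date in UTC, is with Python's standard datetime module:

```python
from datetime import datetime, timezone

archive_root = "s3a://your-bucket-name/archive/"

# Build a folder like archive/2026/03/06/ from the current UTC date.
today = datetime.now(timezone.utc)
dated_archive_path = f"{archive_root}{today:%Y/%m/%d}/"
print(dated_archive_path)
```

The copy loop from Step 3 can then target dated_archive_path instead of the flat archive_path, so each run of the pipeline lands its files under that day's folder.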

Conclusion

Archiving files in S3 using Databricks improves data retention and reduces risk. It is especially useful in pipelines where source files should not be lost immediately after processing.
