How to Move Files from One S3 Bucket to Another Using Databricks
Introduction
Moving files between S3 buckets is a common requirement in enterprise pipelines. For example, raw files may land in one bucket, then after validation they must be moved to a processed or archive bucket. Databricks can help automate this flow.
Step 1: Define Source and Destination Buckets
source_bucket = "s3a://source-bucket-name/input/"
target_bucket = "s3a://target-bucket-name/archive/"
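The later steps build each destination path by concatenating the target prefix with a file name, which silently breaks if the prefix is missing its trailing slash. A minimal sketch of a guard for that (the `normalize_prefix` helper is hypothetical, not part of Databricks):

```python
def normalize_prefix(path: str) -> str:
    """Ensure an S3 path prefix ends with exactly one trailing slash."""
    return path.rstrip("/") + "/"

# Both forms normalize to the same prefix, so later concatenation is safe.
source_bucket = normalize_prefix("s3a://source-bucket-name/input")
target_bucket = normalize_prefix("s3a://target-bucket-name/archive/")
```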
Step 2: List the Source Files
source_files = dbutils.fs.ls(source_bucket)
display(source_files)
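The listing returned by `dbutils.fs.ls` can include subdirectories as well as files; directory entries carry a trailing `/` in their `name`. If only data files should be moved, they can be filtered out first. A sketch of that filter, shown on plain name strings so it runs without a Databricks runtime:

```python
def is_data_file(name: str) -> bool:
    """Directory entries from dbutils.fs.ls end with '/', so skip those."""
    return not name.endswith("/")

# Example listing: two files and one subdirectory.
names = ["2024-01-01.csv", "subdir/", "2024-01-02.csv"]
files = [n for n in names if is_data_file(n)]
```

In a notebook the same predicate would be applied to `file.name` for each entry in `source_files`.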
Step 3: Copy Files to the Target Bucket
for file in source_files:
    dbutils.fs.cp(file.path, target_bucket + file.name)
Step 4: Validate the Target Bucket
Always confirm the copied files are available in the destination bucket.
display(dbutils.fs.ls(target_bucket))
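Rather than eyeballing the listing, validation can be automated by comparing the source and target file names before anything is deleted. A minimal sketch, operating on plain name lists so it runs anywhere (the helper name is hypothetical):

```python
def missing_after_copy(source_names, target_names):
    """Return source file names that did not arrive in the target listing."""
    return sorted(set(source_names) - set(target_names))

# An empty result means every source file is present in the target.
gaps = missing_after_copy(["a.csv", "b.csv"], ["a.csv", "b.csv"])
```

In a notebook, the name lists would come from `[f.name for f in dbutils.fs.ls(...)]` on each bucket.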
Step 5: Delete Files from the Source Bucket
Once validation confirms the copies, remove the originals to finish the move.
for file in source_files:
    dbutils.fs.rm(file.path)
Step 6: Add Logging and Error Handling
In production, wrap each copy and delete in try-except blocks, write audit logs, and compare file counts before deleting anything, so a partial failure never causes accidental data loss.
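One way to structure this is a move helper that deletes a source file only after its copy succeeds, and records failures instead of aborting. The sketch below is an assumption about how you might organize it, not a Databricks API: `copy_fn` and `delete_fn` stand in for `dbutils.fs.cp` and `dbutils.fs.rm`, which keeps the logic testable outside a notebook.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("s3_move")

def move_files(paths, copy_fn, delete_fn, target_prefix):
    """Copy each file, then delete the source only if the copy succeeded.

    copy_fn/delete_fn are injected stand-ins for dbutils.fs.cp / dbutils.fs.rm.
    Returns (moved, failed) path lists for the audit log.
    """
    moved, failed = [], []
    for path in paths:
        name = path.rsplit("/", 1)[-1]
        try:
            copy_fn(path, target_prefix + name)
            delete_fn(path)
            moved.append(path)
            log.info("moved %s", path)
        except Exception:
            failed.append(path)
            log.exception("failed to move %s", path)
    return moved, failed
```

In a notebook this would be called as `move_files([f.path for f in source_files], dbutils.fs.cp, dbutils.fs.rm, target_bucket)`, and a non-empty `failed` list would be the signal to investigate before rerunning.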
Conclusion
Moving files from one S3 bucket to another in Databricks is usually handled as a copy-then-delete operation (dbutils.fs.mv performs the same copy-then-delete under the hood for cross-filesystem moves). This pattern is reliable and works well for archive, backup, and multi-stage ingestion pipelines.