How to Delete Files from an S3 Bucket Using Databricks
Introduction
In many data pipelines, old files must be removed from S3 after processing. Databricks exposes filesystem utilities (dbutils.fs) that work directly against cloud object storage, including S3 buckets mounted or addressed via s3a:// paths. This guide shows the step-by-step process for deleting files from S3 and the precautions to take along the way.
Step 1: List Files Before Deletion
Always inspect the target path before deleting any file.
display(dbutils.fs.ls("s3a://your-bucket-name/archive-test/"))
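On Databricks, dbutils.fs.ls returns FileInfo entries with fields such as path, name, and size. When only certain files should be removed, it helps to filter the listing in plain Python first. A minimal sketch, with the FileInfo entries simulated as named tuples so it runs outside Databricks (the file names and sizes are made up for illustration):

```python
from collections import namedtuple

# Stand-in for the FileInfo entries returned by dbutils.fs.ls;
# on a cluster you would iterate the real listing instead.
FileInfo = namedtuple("FileInfo", ["path", "name", "size"])

def csv_files(listing):
    """Return only the paths of .csv entries from a directory listing."""
    return [f.path for f in listing if f.name.endswith(".csv")]

listing = [
    FileInfo("s3a://your-bucket-name/archive-test/file1.csv", "file1.csv", 120),
    FileInfo("s3a://your-bucket-name/archive-test/notes.txt", "notes.txt", 40),
]
print(csv_files(listing))  # ['s3a://your-bucket-name/archive-test/file1.csv']
```

The same filter can then drive the delete calls in the next steps, so nothing outside the selection is ever touched.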
Step 2: Identify the Exact File or Folder
Make sure you are pointing to the correct file path, especially in production environments.
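One cheap safeguard is a plain-Python check that the target sits strictly under the prefix you expect before any delete runs. A sketch of such a guard (the allowed prefix here is an assumption chosen for illustration):

```python
def is_safe_target(path, allowed_prefix):
    """True only if path lies strictly under allowed_prefix.

    The prefix itself is rejected, so a whole area can't be
    wiped by accidentally passing the parent folder.
    """
    prefix = allowed_prefix.rstrip("/") + "/"
    return path.startswith(prefix) and path.rstrip("/") != prefix.rstrip("/")

allowed = "s3a://your-bucket-name/archive-test/"
print(is_safe_target("s3a://your-bucket-name/archive-test/file1.csv", allowed))  # True
print(is_safe_target("s3a://your-bucket-name/archive-test/", allowed))           # False
print(is_safe_target("s3a://your-bucket-name/production/data.csv", allowed))    # False
```

Running the guard before every dbutils.fs.rm call turns a typo in the path into a refused operation instead of a lost file.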
Step 3: Delete a Single File
dbutils.fs.rm("s3a://your-bucket-name/archive-test/file1.csv", False)  # recurse=False: removes only this file
Step 4: Delete an Entire Folder
Use recursive deletion for folders.
dbutils.fs.rm("s3a://your-bucket-name/archive-test/old_files/", True)  # recurse=True: deletes the folder and everything inside it
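Because recursive deletion is the dangerous variant, it can be worth refusing to recurse unless the target sits at least a couple of levels below the bucket root. A sketch of that depth check in plain Python (the minimum depth of 2 is an arbitrary assumption; tune it to your layout):

```python
def deep_enough(path, min_depth=2):
    """Count path segments below the bucket root; refuse shallow targets.

    's3a://bucket/a/b/' has depth 2 ('a', 'b'); the bucket root has depth 0.
    """
    # Strip the scheme and bucket: 's3a://bucket/a/b/' -> ['a', 'b']
    without_scheme = path.split("://", 1)[-1]
    segments = [s for s in without_scheme.split("/")[1:] if s]
    return len(segments) >= min_depth

print(deep_enough("s3a://your-bucket-name/archive-test/old_files/"))  # True
print(deep_enough("s3a://your-bucket-name/"))                         # False
```

Only when the check passes would you go on to call dbutils.fs.rm with recurse set to True.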
Step 5: Recheck the Path
List files again to confirm the deletion worked as expected.
display(dbutils.fs.ls("s3a://your-bucket-name/archive-test/"))
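The recheck can also be turned into an explicit assertion, so a notebook fails loudly if the file is somehow still present. A sketch with the post-delete listing simulated as a list of names:

```python
def confirm_deleted(remaining_names, deleted_name):
    """Raise if the supposedly deleted entry still appears in the listing."""
    if deleted_name in remaining_names:
        raise RuntimeError(f"{deleted_name} still present after delete")
    return True

# After the delete, the listing no longer contains file1.csv:
print(confirm_deleted(["file2.csv", "old_files/"], "file1.csv"))  # True
```

On Databricks, remaining_names would come from the name fields of a fresh dbutils.fs.ls call on the same path.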
Important Precautions
- Never run recursive delete on the wrong root folder
- Test in non-production first
- Keep backups or archive copies before permanent removal
- Control delete permissions using IAM policies
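On the IAM side, delete permissions can be scoped to a single prefix so that even a buggy notebook cannot remove objects elsewhere. A hypothetical policy fragment (bucket name and prefix are placeholders to adapt):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:DeleteObject"],
      "Resource": "arn:aws:s3:::your-bucket-name/archive-test/*"
    }
  ]
}
```

Attached to the role the Databricks cluster assumes, this limits s3:DeleteObject to objects under archive-test/ only.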
Conclusion
Deleting S3 files from Databricks is straightforward, but it must be done carefully. A good practice is to archive files first and permanently delete them only after validation.
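The archive-first practice can be expressed as a small helper. It is written here with the copy and delete operations passed in as plain functions so the control flow is testable outside Databricks; on a cluster you would pass dbutils.fs.cp and dbutils.fs.rm. The helper itself is an illustrative sketch, not a Databricks API:

```python
def archive_then_delete(src, archive_dst, cp, rm, validate):
    """Copy src to an archive location, validate the copy, then delete src."""
    cp(src, archive_dst)            # e.g. dbutils.fs.cp on Databricks
    if not validate(archive_dst):   # caller-supplied check, e.g. size match
        raise RuntimeError(f"archive copy at {archive_dst} failed validation")
    return rm(src)                  # e.g. dbutils.fs.rm

# Usage with in-memory fakes standing in for the real filesystem calls:
store = {"s3a://b/data/file1.csv": b"rows"}

def fake_cp(s, d): store[d] = store[s]
def fake_rm(p): return store.pop(p, None) is not None

result = archive_then_delete(
    "s3a://b/data/file1.csv", "s3a://b/archive/file1.csv",
    cp=fake_cp, rm=fake_rm, validate=lambda p: p in store,
)
print(result)         # True
print(sorted(store))  # only the archive copy remains
```

Because the original is deleted only after the archive copy passes validation, a failed copy leaves the data untouched.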