Monday, 23 February 2026

How to Archive Processed Files from One S3 Bucket to Another Using Databricks


Introduction

Many organizations separate active and archived data into different S3 buckets. This approach improves data organization, cost management, and security. In this guide, we will move processed files from a source bucket into a dedicated archive bucket.

Step 1: Identify Processed Files

Processed files may be marked by folder structure, file name pattern, or a successful ETL status.
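
As one possible convention, a small predicate can encode the file-name pattern. The "processed_" prefix and ".csv" suffix below are hypothetical markers; substitute whatever convention your pipeline actually uses.

```python
def is_processed(name, prefix="processed_", suffix=".csv"):
    """Return True when a file name matches the (assumed) processed-file convention."""
    return name.startswith(prefix) and name.endswith(suffix)

# In a Databricks notebook you would filter the listing, e.g.:
# processed = [f for f in dbutils.fs.ls(processed_path) if is_processed(f.name)]
```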

Step 2: Set Source and Archive Bucket Paths

processed_path = "s3a://raw-data-bucket/processed/"
archive_path = "s3a://archive-data-bucket/processed-archive/"

Step 3: Read the File List

processed_files = dbutils.fs.ls(processed_path)

Step 4: Copy Files into the Archive Bucket

for file in processed_files:
    # Copy each file into the archive prefix under its original name
    dbutils.fs.cp(file.path, archive_path + file.name)

Step 5: Validate the Archive Operation

Check whether every required file exists in the archive bucket before deletion from the source.
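
One way to sketch this check is to compare the two listings by file name and abort if anything is missing. The helper below is a minimal, illustrative version; `missing_from_archive` is not a Databricks API.

```python
def missing_from_archive(source_names, archive_names):
    """Return source file names that have not yet appeared in the archive."""
    return sorted(set(source_names) - set(archive_names))

# In a notebook, before any deletion:
# src = {f.name for f in dbutils.fs.ls(processed_path)}
# arc = {f.name for f in dbutils.fs.ls(archive_path)}
# assert not missing_from_archive(src, arc), "archive incomplete - do not delete!"
```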

Step 6: Remove the Original Files

for file in processed_files:
    # Delete the source copy only after the archive has been validated
    dbutils.fs.rm(file.path)

Step 7: Organize Archive by Date

A dated folder structure such as processed-archive/year=2026/month=03/day=06 simplifies traceability and makes later retrieval easier.
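
A small helper can build that partition-style prefix. This is a sketch under the assumption that the archive uses the year=/month=/day= layout shown above; `dated_archive_path` is an illustrative name, not a built-in.

```python
from datetime import date

def dated_archive_path(base, d=None):
    """Build a partition-style archive prefix like .../year=2026/month=03/day=06/."""
    d = d or date.today()
    return f"{base}year={d.year}/month={d.month:02d}/day={d.day:02d}/"

# In a notebook:
# archive_path = dated_archive_path("s3a://archive-data-bucket/processed-archive/")
# dbutils.fs.cp(file.path, archive_path + file.name)
```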

Conclusion

Archiving processed files from one S3 bucket to another is a practical design for stable data engineering systems. It separates active workloads from historical storage while keeping recovery possible.

Monday, 16 February 2026

How to Move Files from One S3 Bucket to Another Using Databricks


Introduction

Moving files between S3 buckets is a common requirement in enterprise pipelines. For example, raw files may land in one bucket, then after validation they must be moved to a processed or archive bucket. Databricks can help automate this flow.

Step 1: Define Source and Destination Buckets

source_bucket = "s3a://source-bucket-name/input/"
target_bucket = "s3a://target-bucket-name/archive/"

Step 2: List the Source Files

source_files = dbutils.fs.ls(source_bucket)
display(source_files)

Step 3: Copy Files to the Target Bucket

for file in source_files:
    dbutils.fs.cp(file.path, target_bucket + file.name)

Step 4: Validate the Target Bucket

Always confirm the copied files are available in the destination bucket.

display(dbutils.fs.ls(target_bucket))

Step 5: Delete Files from the Source Bucket

Once validation is complete, remove the original files to finish the move.

for file in source_files:
    dbutils.fs.rm(file.path)

Step 6: Add Logging and Error Handling

In production, add try-except blocks, audit logs, and row/file counts to avoid accidental data loss.
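
A minimal sketch of that pattern is shown below. The `cp` and `rm` parameters stand in for dbutils.fs.cp and dbutils.fs.rm so the logic can be tested outside a notebook; the function name and structure are illustrative, not a Databricks API.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("s3_move")

def move_files(files, target, cp, rm):
    """Copy-then-delete each file, logging failures instead of aborting the batch.

    `files` are objects with .path and .name (like dbutils.fs.ls results);
    `cp` / `rm` stand in for dbutils.fs.cp / dbutils.fs.rm.
    """
    moved, failed = [], []
    for f in files:
        try:
            cp(f.path, target + f.name)
            rm(f.path)
            moved.append(f.name)
        except Exception as exc:
            log.error("failed to move %s: %s", f.name, exc)
            failed.append(f.name)
    log.info("moved %d file(s), %d failure(s)", len(moved), len(failed))
    return moved, failed

# In a notebook: move_files(source_files, target_bucket, dbutils.fs.cp, dbutils.fs.rm)
```

Returning the moved/failed lists makes it easy to write an audit record or retry only the failures.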

Conclusion

Moving files from one S3 bucket to another in Databricks is usually handled as a copy-then-delete operation. This pattern is reliable and works well for archive, backup, and multi-stage ingestion pipelines.

Tuesday, 10 February 2026

Databricks Security Best Practices


Introduction

Security is a critical aspect of modern data platforms. Databricks provides multiple layers of security including authentication, access control, data encryption, and governance.

Step 1: Enable Role-Based Access Control

Use role-based access control to limit access to data and compute resources.

  • Restrict cluster access
  • Limit notebook permissions
  • Use Unity Catalog permissions

Step 2: Secure Data Access

Use Unity Catalog to enforce table-level and column-level permissions.


GRANT SELECT ON TABLE sales_data TO analyst_role;

Step 3: Encrypt Data

Ensure encryption is enabled for both data at rest and data in transit.

Step 4: Monitor Access Logs

Audit logs help organizations track who accessed which datasets.

Conclusion

Implementing security best practices in Databricks helps organizations protect sensitive data while maintaining regulatory compliance.

Monday, 9 February 2026

How to Archive Files in S3 Using Databricks


Introduction

Archiving is a safer alternative to immediate deletion. Instead of removing processed files, many teams move them into an archive folder or bucket for future recovery, auditing, or compliance. This guide explains how to archive files in S3 using Databricks.

Step 1: Define Source and Archive Paths

For example, you may have an input folder and an archive folder inside the same bucket.

source_path = "s3a://your-bucket-name/input/"
archive_path = "s3a://your-bucket-name/archive/"

Step 2: List Source Files

files = dbutils.fs.ls(source_path)
display(files)

Step 3: Copy Files to Archive Location

Databricks exposes copy operations through its built-in filesystem utilities, dbutils.fs.

for file in files:
    dbutils.fs.cp(file.path, archive_path + file.name)

Step 4: Validate Archive Copy

Check that the files exist in the archive location before removing them from the source folder.

display(dbutils.fs.ls(archive_path))

Step 5: Delete Source Files After Successful Archive

for file in files:
    dbutils.fs.rm(file.path)

Step 6: Add Date-Based Archive Folders

A cleaner approach is to store archived files in folders like archive/2026/03/06/ so retrieval becomes easier.

Conclusion

Archiving files in S3 using Databricks improves data retention and reduces risk. It is especially useful in pipelines where source files should not be lost immediately after processing.

Monday, 2 February 2026

How to Delete Files from an S3 Bucket Using Databricks


Introduction

In many data pipelines, old files must be removed from S3 after processing. Databricks provides filesystem utilities that can help manage files stored in cloud buckets. This guide shows the step-by-step process for deleting files from S3.

Step 1: List Files Before Deletion

Always inspect the target path before deleting any file.

display(dbutils.fs.ls("s3a://your-bucket-name/archive-test/"))

Step 2: Identify the Exact File or Folder

Make sure you are pointing to the correct file path, especially in production environments.

Step 3: Delete a Single File

dbutils.fs.rm("s3a://your-bucket-name/archive-test/file1.csv", False)

Step 4: Delete an Entire Folder

Use recursive deletion for folders.

dbutils.fs.rm("s3a://your-bucket-name/archive-test/old_files/", True)

Step 5: Recheck the Path

List files again to confirm the deletion worked as expected.

display(dbutils.fs.ls("s3a://your-bucket-name/archive-test/"))

Important Precautions

  • Never run recursive delete on the wrong root folder
  • Test in non-production first
  • Keep backups or archive copies before permanent removal
  • Control delete permissions using IAM policies
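
The precautions above can be partly enforced in code with a guard that refuses recursive deletes outside an explicitly allowed prefix. This is a defensive sketch, not a Databricks feature; `safe_to_delete` is an illustrative helper name.

```python
def safe_to_delete(path, allowed_prefix):
    """Allow deletion only inside an explicit prefix, and never on the bucket root.

    Assumes s3a://bucket/... style paths.
    """
    bucket_root = "/".join(path.split("/")[:3]) + "/"  # e.g. "s3a://your-bucket-name/"
    return path.startswith(allowed_prefix) and path != bucket_root

# In a notebook, guard the destructive call:
# target = "s3a://your-bucket-name/archive-test/old_files/"
# if safe_to_delete(target, "s3a://your-bucket-name/archive-test/"):
#     dbutils.fs.rm(target, True)
```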

Conclusion

Deleting S3 files from Databricks is straightforward, but it must be done carefully. A good practice is to archive files first and permanently delete them only after validation.
