Sunday, 1 March 2026

Databricks Lakehouse Architecture Explained (Simple Guide)

The Lakehouse architecture introduced by Databricks is a modern approach that combines the low-cost flexibility of data lakes with the reliability and performance of data warehouses. It provides a single unified platform for analytics, BI, and machine learning.

Why Lakehouse Was Created

Traditional data lakes lacked reliability, while data warehouses were expensive and rigid. Lakehouse solves both problems by offering:

  • Low-cost storage
  • High-performance queries
  • ACID transactions
  • Unified governance

The Medallion Architecture (Bronze, Silver, Gold)

1. Bronze Layer – Raw Data

Stores unprocessed data as ingested from source systems.

2. Silver Layer – Clean & Refined Data

Data is cleaned, structured, and validated.

3. Gold Layer – Business-Ready Data

Used for dashboards, analytics, and ML models.
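The three layers can be illustrated with a small plain-Python sketch (no Spark needed); the sample records and validation rules here are hypothetical:

```python
# Minimal illustration of the Bronze -> Silver -> Gold flow using plain
# Python dicts instead of Delta tables. Field names are hypothetical.

# Bronze: raw records exactly as ingested (duplicates and bad values included)
bronze = [
    {"id": 1, "amount": "100.0", "country": "SG"},
    {"id": 1, "amount": "100.0", "country": "SG"},   # duplicate
    {"id": 2, "amount": "bad",   "country": "IN"},   # invalid amount
    {"id": 3, "amount": "250.5", "country": "IN"},
]

# Silver: deduplicate on id, cast types, drop rows that fail validation
seen, silver = set(), []
for row in bronze:
    try:
        amount = float(row["amount"])
    except ValueError:
        continue                      # drop/quarantine invalid rows
    if row["id"] not in seen:
        seen.add(row["id"])
        silver.append({"id": row["id"], "amount": amount, "country": row["country"]})

# Gold: business-ready aggregate (total sales per country)
gold = {}
for row in silver:
    gold[row["country"]] = gold.get(row["country"], 0.0) + row["amount"]

print(gold)  # {'SG': 100.0, 'IN': 250.5}
```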

Benefits of the Lakehouse

  • Seamless batch and real-time processing
  • Faster ETL performance
  • Simplified architecture with fewer tools
  • Better governance and quality control

Use Cases

  • Finance analytics
  • Marketing dashboards
  • Inventory forecasting
  • ML model feature stores

Conclusion

The Databricks Lakehouse is transforming how companies store and process data. Its combination of performance, cost efficiency, and reliability makes it the ideal architecture for modern data-driven organizations.

Thursday, 25 December 2025

AWS EC2 — Complete Beginner Guide (Instances, Pricing, Use Cases)

What Is EC2?

Amazon EC2 (Elastic Compute Cloud) provides virtual servers known as instances. It allows you to run applications without managing physical hardware.

Types of EC2 Instances

  • General Purpose: t3, t4g
  • Compute Optimized: c6i
  • Memory Optimized: r6g
  • GPU Instances: p4, g5
  • Storage Optimized: i3, i4i

EC2 Pricing Models

On-Demand

Pay per second or per hour with no commitment. Most flexible, but the highest per-unit price.

Reserved Instances

Commit to a 1- or 3-year term for up to 72% savings over On-Demand.

Spot Instances

Uses spare AWS capacity at up to 90% off On-Demand, but instances can be interrupted with a two-minute notice. Best for fault-tolerant batch jobs and ML training.
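As a rough sketch, here is how the three models compare for one instance-month; the hourly rate below is a hypothetical example, not a real AWS price:

```python
# Rough cost comparison of the three pricing models for one instance.
# The hourly rate and discount percentages are hypothetical examples,
# not current AWS prices.

on_demand_rate = 0.10          # $/hour, hypothetical
hours_per_month = 730

def monthly_cost(rate, hours=hours_per_month, discount=0.0):
    """Cost for one instance-month after applying a discount fraction."""
    return rate * hours * (1 - discount)

on_demand = monthly_cost(on_demand_rate)
reserved  = monthly_cost(on_demand_rate, discount=0.72)  # "up to 72% cheaper"
spot      = monthly_cost(on_demand_rate, discount=0.90)  # "up to 90% cheaper"

print(f"On-Demand: ${on_demand:.2f}  Reserved: ${reserved:.2f}  Spot: ${spot:.2f}")
# On-Demand: $73.00  Reserved: $20.44  Spot: $7.30
```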

Key EC2 Features

  • Security Groups
  • EBS Block Storage
  • Elastic Load Balancing
  • Auto Scaling

When to Use EC2?

  • Web applications
  • Backend APIs
  • Gaming servers
  • Databases
  • Machine learning workloads

Conclusion

EC2 is a core AWS service. Knowing its pricing and instance types is essential for cloud beginners.

Friday, 19 December 2025

Databricks Important Commands Cheat Sheet (SQL + Python)

This post is a quick Databricks commands cheat sheet for certification exam preparation. It covers the most important SQL and Python (PySpark) commands used with Delta Lake, Lakehouse, Unity Catalog, Auto Loader, Structured Streaming and optimization.

1. Basic Spark & DataFrame Commands (Python)

Start Spark Session (usually auto in Databricks)

# Spark session is usually available as `spark` in Databricks
spark.range(5).show()

Read CSV File

df = spark.read.option("header", "true").csv("/mnt/data/sales.csv")
df.show()

Write DataFrame as Parquet

df.write.mode("overwrite").parquet("/mnt/data/sales_parquet")

Display DataFrame in Notebook

display(df)

2. Delta Lake – Table Creation & Writes

Create Delta Table from DataFrame (Path)

df.write.format("delta").mode("overwrite").save("/mnt/delta/sales")

Create Delta Table as Managed Table

df.write.format("delta").mode("overwrite").saveAsTable("sales_delta")

SQL – Create Delta Table

CREATE TABLE sales_delta_sql (
  id BIGINT,
  amount DOUBLE,
  country STRING
)
USING DELTA;

SQL – Insert into Delta Table

INSERT INTO sales_delta_sql VALUES (1, 100.0, 'SG'), (2, 250.5, 'IN');

3. Delta Lake – Time Travel & History

View Table History

DESCRIBE HISTORY sales_delta_sql;

Time Travel by Version

SELECT * FROM sales_delta_sql VERSION AS OF 2;

Time Travel by Timestamp

SELECT * FROM sales_delta_sql TIMESTAMP AS OF '2026-02-28T10:00:00Z';
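Conceptually, every successful write creates a new numbered table version, and time travel simply reads an older one. A toy Python model of that idea (real Delta stores a transaction log of file actions, not full snapshots):

```python
# Toy model of Delta time travel: each commit appends a full snapshot
# tagged with a version number and timestamp. This is only the concept,
# not the real implementation.
import datetime

history = []  # list of (version, timestamp, snapshot)

def commit(snapshot):
    version = len(history)
    history.append((version, datetime.datetime.now(datetime.timezone.utc), snapshot))
    return version

def version_as_of(version):
    return history[version][2]

commit([{"id": 1, "amount": 100.0}])                              # version 0
commit([{"id": 1, "amount": 110.0}, {"id": 2, "amount": 250.5}])  # version 1

print(version_as_of(0))  # [{'id': 1, 'amount': 100.0}]
```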

4. Delta Lake – Update, Merge & Delete

SQL – UPDATE

UPDATE sales_delta_sql
SET amount = amount * 1.1
WHERE country = 'SG';

SQL – DELETE

DELETE FROM sales_delta_sql
WHERE amount < 50;

SQL – MERGE (Upsert)

MERGE INTO target t
USING source s
ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET t.amount = s.amount
WHEN NOT MATCHED THEN
  INSERT (id, amount, country) VALUES (s.id, s.amount, s.country);
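The MERGE semantics above can be sketched in plain Python: source rows whose key exists in the target are updated, the rest are inserted:

```python
# What MERGE (upsert) does, sketched with plain Python dicts keyed on id.
target = {1: {"amount": 100.0, "country": "SG"},
          2: {"amount": 250.5, "country": "IN"}}
source = {2: {"amount": 300.0, "country": "IN"},   # existing id -> UPDATE
          3: {"amount": 75.0,  "country": "MY"}}   # new id      -> INSERT

for key, row in source.items():
    if key in target:
        target[key]["amount"] = row["amount"]      # WHEN MATCHED THEN UPDATE
    else:
        target[key] = dict(row)                    # WHEN NOT MATCHED THEN INSERT

print(sorted(target))  # [1, 2, 3]
```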

5. Optimization – OPTIMIZE, Z-ORDER, VACUUM

OPTIMIZE Delta Table

OPTIMIZE sales_delta_sql;

OPTIMIZE with Z-ORDER

OPTIMIZE sales_delta_sql
ZORDER BY (country);

VACUUM to Remove Old Files

VACUUM sales_delta_sql RETAIN 168 HOURS;  -- 7 days

6. Auto Loader – Incremental Ingestion

Python – Auto Loader from Cloud Storage

from pyspark.sql.functions import col

df_auto = (spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "csv")
  .option("header", "true")
  .load("/mnt/raw/sales/"))

(df_auto
  .writeStream
  .format("delta")
  .option("checkpointLocation", "/mnt/checkpoints/sales_autoloader")
  .outputMode("append")
  .start("/mnt/delta/sales_autoloader"))

7. Structured Streaming with Delta

Read Stream from Delta

stream_df = (spark.readStream
  .format("delta")
  .load("/mnt/delta/sales_stream"))

Write Stream to Delta

(stream_df
  .writeStream
  .format("delta")
  .option("checkpointLocation", "/mnt/checkpoints/sales_stream_out")
  .outputMode("append")
  .start("/mnt/delta/sales_stream_out"))

SQL – Streaming Table (Delta Live Tables pipeline syntax)

CREATE OR REFRESH STREAMING LIVE TABLE sales_stream_silver
AS SELECT * FROM cloud_files("/mnt/raw/sales", "csv");

8. Delta Live Tables (DLT) – Basic Commands

Python DLT Example

import dlt

@dlt.table
def sales_bronze():
    return spark.readStream.format("cloudFiles") \
        .option("cloudFiles.format", "csv") \
        .load("/mnt/raw/sales")

@dlt.table
def sales_silver():
    return dlt.read("sales_bronze").select("id", "amount", "country")

9. Unity Catalog – Databases, Tables & Grants

List Catalogs

SHOW CATALOGS;

Set Current Catalog & Schema

USE CATALOG main;
USE SCHEMA sales_db;

Create Schema

CREATE SCHEMA IF NOT EXISTS main.sales_db;

Grant Permissions on Table

GRANT SELECT ON TABLE main.sales_db.sales_delta_sql TO `analyst_role`;

Revoke Permission

REVOKE SELECT ON TABLE main.sales_db.sales_delta_sql FROM `analyst_role`;

10. Useful Utility Commands for Exams

Describe Table

DESCRIBE EXTENDED sales_delta_sql;

Show Tables

SHOW TABLES IN main.sales_db;

Convert Parquet to Delta

CONVERT TO DELTA parquet.`/mnt/data/sales_parquet`;

Python – Convert to Delta Using spark.sql

spark.sql("""
  CONVERT TO DELTA parquet.`/mnt/data/sales_parquet`
""")

Conclusion

This Databricks commands cheat sheet covers the most frequently used SQL and Python snippets for Delta Lake, Lakehouse, Auto Loader, DLT, Unity Catalog and optimization. These commands are highly relevant for Databricks certification exams and real-world projects. Use this page as a quick reference while practicing in Databricks notebooks.

Thursday, 11 December 2025

Databricks Scenario-Based Q&A (Certification Point of View)

This post contains the most frequently asked Databricks scenario-based questions and answers useful for Databricks Data Engineer Associate, Professional Data Engineer, and Lakehouse platform exams. All scenarios are short, practical, and certification-focused.

1. Delta Lake & Data Quality Scenarios

Scenario 1:

Your raw data contains duplicate rows and schema mismatches. How do you load it safely?

Answer: Load into a Bronze Delta table with schema enforcement ON and use dropDuplicates() during the Silver transformation.

Scenario 2:

You received corrupt JSON files in storage. Your job fails during ingestion. What’s the best solution?

Answer: Use Auto Loader with a badRecordsPath (or read in PERMISSIVE mode and inspect the _rescued_data column) so corrupt records are quarantined instead of failing the job.

Scenario 3:

You want to track historical versions of a Delta table for audits. What feature do you use?

Answer: Use Delta Lake Time Travel with VERSION AS OF or TIMESTAMP AS OF.

2. Performance Optimization Scenarios

Scenario 4:

Your table has millions of small Parquet files causing slow queries. What should you do?

Answer: Run OPTIMIZE table_name to compact files.

Scenario 5:

Your WHERE queries on "country" column are extremely slow. What improves performance?

Answer: Use Z-Ordering: OPTIMIZE table ZORDER BY (country).

Scenario 6:

You want to reduce storage usage and clean up obsolete Delta files.

Answer: Run VACUUM table RETAIN 168 HOURS (default 7 days).

3. Streaming & Ingestion Scenarios

Scenario 7:

You need to incrementally ingest thousands of new files daily with schema evolution.

Answer: Use Auto Loader with cloudFiles.inferColumnTypes and cloudFiles.schemaEvolutionMode.

Scenario 8:

Your streaming job restarts and reprocesses old data. How to fix it?

Answer: Set a correct checkpointLocation for exactly-once processing.

Scenario 9:

Your batch job must be converted to streaming with minimal code.

Answer: Use Structured Streaming with readStream and writeStream.

4. Job & Workflow Scenarios

Scenario 10:

You want to run a notebook daily at 12 AM without manual intervention.

Answer: Create a Databricks Job with scheduled triggering.

Scenario 11:

Multiple tasks must run sequentially (Bronze → Silver → Gold). What do you use?

Answer: Use Workflows with task dependencies.

Scenario 12:

You want temporary compute that shuts down automatically after job completion.

Answer: Use a Job Cluster instead of All-Purpose Cluster.

5. Unity Catalog & Governance Scenarios

Scenario 13:

Your company wants centralized access control across multiple workspaces.

Answer: Use Unity Catalog with a single metastore.

Scenario 14:

You need to restrict a sensitive column from analysts.

Answer: Apply column-level permissions or dynamic views.

Scenario 15:

Audit team needs full change history of a table.

Answer: Use DESCRIBE HISTORY table.

6. Machine Learning & MLflow Scenarios

Scenario 16:

You want to track model parameters, metrics, and artifacts.

Answer: Use MLflow Tracking.

Scenario 17:

You want version-controlled models with Staging → Production workflow.

Answer: Use MLflow Model Registry.

Scenario 18:

Two data scientists want to collaborate on the same model codebase.

Answer: Use Repos with Git integration.

7. File System & Utilities Scenarios

Scenario 19:

You want to list files in DBFS.

Answer: Use dbutils.fs.ls("/mnt/...").

Scenario 20:

You need to remove a corrupted file from DBFS.

Answer: Use dbutils.fs.rm(path, recurse=True).

8. Exam-Oriented High-Value Scenarios (Must Know)

Scenario 21:

You want to merge CDC (change data capture) data efficiently.

Answer: Use MERGE INTO with Delta Lake.

Scenario 22:

Your logic requires ensuring no duplicates based on a key column.

Answer: Use dropDuplicates() (or MERGE on the key) during Silver processing; note that PRIMARY KEY constraints in Databricks are informational and not enforced.
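The dropDuplicates() approach can be sketched without Spark; a common variant keeps the latest record per key (the column names here are hypothetical):

```python
# Keep exactly one row per key column, preferring the latest event_time.
# Column names are hypothetical.
rows = [
    {"id": 1, "amount": 100.0, "event_time": "2025-01-01"},
    {"id": 1, "amount": 120.0, "event_time": "2025-01-02"},
    {"id": 2, "amount": 250.5, "event_time": "2025-01-01"},
]

latest = {}
for row in rows:
    key = row["id"]
    if key not in latest or row["event_time"] > latest[key]["event_time"]:
        latest[key] = row

deduped = list(latest.values())
print(len(deduped))  # 2
```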

Scenario 23:

The business requires hourly incremental refresh of dashboards.

Answer: Create a Workflow with scheduled SQL tasks.

Conclusion

These scenario-based Q&A examples are extremely useful for Databricks certification exams because the tests focus heavily on real-world data engineering decisions. The more scenarios you practice, the easier it becomes to choose the correct solution during the exam. Use this guide as a quick-revision reference before your exam.

Tuesday, 11 November 2025

Databricks vs Snowflake (2026 Comparison Guide)

Databricks vs Snowflake: Which Is Better in 2026?

Databricks and Snowflake are two of the most powerful cloud analytics platforms. While they may seem similar, they target different use cases.

Databricks Strengths

  • Best for Data Engineering & Machine Learning
  • Advanced notebook environment
  • Delta Lake for Lakehouse support
  • MLflow integration

Snowflake Strengths

  • Simple SQL-focused environment
  • No cluster management required
  • Automatic performance tuning
  • Excellent for BI dashboards

When to Use Which?

Choose Databricks if you need ML, AI, or large-scale ETL.

Choose Snowflake if you want simple, scalable SQL analytics.

Conclusion

Both platforms are excellent, but Databricks is more powerful for end-to-end workflows, whereas Snowflake excels in pure analytics and warehousing. Your choice depends on your team's skill set and business goals.

Saturday, 1 November 2025

Databricks Certification – Shortcut Notes (Exam Point of View)

In this post you will find short and clear Databricks questions and answers that are useful for Databricks certification exams such as Data Engineer, Data Analyst and Apache Spark based certifications. All answers are written in an exam-oriented, one-to-two-line format for quick revision.

1. Databricks Platform Basics

Q1. What is Databricks?

Databricks is a cloud-based unified analytics platform built on Apache Spark that allows teams to do data engineering, data analytics and machine learning in one workspace.

Q2. What is a Databricks Workspace?

A workspace is the UI environment where you manage notebooks, repos, data, jobs, clusters and other assets.

Q3. What is a Cluster in Databricks?

A cluster is a set of virtual machines used to run notebooks, jobs and workloads; it provides the compute for Spark and SQL operations.

Q4. Difference between All-Purpose Cluster and Job Cluster?

All-purpose clusters are interactive, multi-user and long-running; Job clusters are created for a specific job or workflow run and terminated after completion.

Q5. What is Databricks SQL?

Databricks SQL is a SQL-first environment with SQL warehouses (or endpoints) used to run dashboards, BI queries and ad-hoc SQL over Lakehouse data.

2. Lakehouse & Delta Lake

Q6. What is the Lakehouse architecture?

Lakehouse combines data lake flexibility with data warehouse reliability, using Delta Lake for ACID, governance and performance on low-cost storage.

Q7. What is Delta Lake?

Delta Lake is a storage layer that adds ACID transactions, schema enforcement, time travel and performance optimizations to data stored on cloud object storage.

Q8. What is Medallion (Bronze–Silver–Gold) Architecture?

It is a layered design where Bronze holds raw data, Silver holds cleaned and conformed data, and Gold holds business-ready, aggregated data for BI and ML.

Q9. What is Time Travel in Delta Lake?

Time Travel allows you to query or restore previous versions of a Delta table using a version number or timestamp.

Q10. What is Schema Enforcement vs Schema Evolution?

Schema enforcement blocks writes that do not match the table schema; schema evolution allows compatible schema changes such as adding new columns.
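A toy Python sketch of the difference (not Delta's actual mechanism): enforcement rejects rows with unexpected columns, while evolution widens the schema instead:

```python
# Toy model: the table schema is a set of allowed column names.
schema = {"id", "amount"}

def write(row, evolve=False):
    """Toy write path: reject extra columns unless schema evolution is on."""
    extra = set(row) - schema
    if extra and not evolve:
        raise ValueError(f"schema enforcement: unexpected columns {extra}")
    schema.update(extra)          # schema evolution: adopt the new columns
    return row

write({"id": 1, "amount": 100.0})                               # ok
write({"id": 2, "amount": 50.0, "country": "SG"}, evolve=True)  # schema grows
print(sorted(schema))  # ['amount', 'country', 'id']
```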

3. Ingestion, Auto Loader & DLT

Q11. What is Auto Loader?

Auto Loader is a Databricks feature that incrementally and efficiently ingests new files from cloud storage with schema inference and evolution support.

Q12. What are Delta Live Tables (DLT)?

Delta Live Tables is a framework for building reliable, declarative ETL pipelines with built-in data quality checks, lineage and automatic orchestration.

Q13. Benefits of DLT for production ETL?

DLT simplifies managing dependencies, handles retries, ensures data quality with expectations and automatically manages pipeline execution and monitoring.

4. Performance & Optimization

Q14. What is Z-Ordering in Delta?

Z-Ordering reorders data files based on specified columns to improve data skipping and speed up highly selective queries.
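Z-Ordering helps because Delta keeps per-file min/max statistics, so a query can skip files whose value range cannot match the filter. A toy model of that data skipping (file names and stats are hypothetical):

```python
# Per-file min/max stats on the "country" column after clustering.
# A query like WHERE country = 'SG' only opens files whose range can match.
files = [
    {"name": "part-0", "min": "AU", "max": "IN"},
    {"name": "part-1", "min": "JP", "max": "SG"},
    {"name": "part-2", "min": "TH", "max": "US"},
]

def files_to_scan(value):
    return [f["name"] for f in files if f["min"] <= value <= f["max"]]

print(files_to_scan("SG"))  # ['part-1']
```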

Q15. What does the OPTIMIZE command do?

OPTIMIZE compacts many small files into fewer large files, improving read performance and query efficiency.

Q16. What does VACUUM do in Delta Lake?

VACUUM removes old, unreferenced data files based on a retention period to free storage and maintain table health.

Q17. What is the Catalyst Optimizer?

The Catalyst Optimizer is Spark SQL’s query optimizer that generates efficient physical execution plans from logical SQL queries.

Q18. What is the Photon engine?

Photon is a vectorized, C++–based execution engine in Databricks that accelerates SQL and Delta Lake workloads, especially on Databricks SQL.

5. Jobs, Workflows & Scheduling

Q19. What is a Databricks Job?

A Job is a scheduled or on-demand execution of one or more tasks such as notebooks, JARs or DLT pipelines.

Q20. What is a Task in a Databricks Workflow?

A task is an individual step within a workflow, such as running a notebook, Python script, SQL query or DLT pipeline, optionally dependent on other tasks.

Q21. Why use task dependencies?

Task dependencies control order of execution, ensuring that downstream tasks only run after upstream tasks succeed.

Q22. Common best practices for Jobs in exams?

Use job clusters, enable retries, configure alerts, set timeouts, and separate development and production jobs.

6. Streaming Concepts

Q23. What is Structured Streaming?

Structured Streaming is Spark’s high-level streaming API that treats streaming data as an unbounded table and supports incremental processing.

Q24. Why are checkpoints important in streaming?

Checkpoints store progress and state so that streaming jobs can recover from failures and ensure exactly-once processing.
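The idea can be sketched as a toy offset store in plain Python (the real checkpoint format is different):

```python
# Toy model of streaming checkpoints: persist the last committed offset so
# a restarted job resumes where it left off instead of reprocessing.
import json, os, tempfile

checkpoint = os.path.join(tempfile.mkdtemp(), "offset.json")
events = ["e0", "e1", "e2", "e3"]

def load_offset():
    if os.path.exists(checkpoint):
        with open(checkpoint) as f:
            return json.load(f)["offset"]
    return 0

def run_once(batch_size):
    start = load_offset()
    processed = events[start:start + batch_size]
    with open(checkpoint, "w") as f:         # commit progress
        json.dump({"offset": start + len(processed)}, f)
    return processed

print(run_once(2))  # ['e0', 'e1']  first run
print(run_once(2))  # ['e2', 'e3']  restart resumes, no reprocessing
```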

Q25. Can Delta Lake be used for streaming?

Yes, Delta tables support both streaming reads and streaming writes with exactly-once guarantees.

7. Governance, Security & Unity Catalog

Q26. What is Unity Catalog?

Unity Catalog is a unified governance layer that manages data, schemas, tables, permissions, lineage and auditing across workspaces and clouds.

Q27. What is the hierarchy in Unity Catalog?

The typical hierarchy is Metastore → Catalog → Schema → Table/View/Function.

Q28. How is access control handled?

Access is managed using fine-grained permissions (GRANT/REVOKE) on catalogs, schemas, tables, views and functions.

Q29. What is row-level and column-level security?

Row-level security restricts which rows a user can see, while column-level security restricts access to specific columns such as PII fields.

8. MLflow & Machine Learning

Q30. What is MLflow?

MLflow is an open-source platform integrated with Databricks for managing the ML lifecycle, including experiment tracking, model registry and deployment.

Q31. What is an MLflow Run?

An MLflow run is a single execution of training or evaluation where parameters, metrics, tags and artifacts are logged.

Q32. What is the Model Registry?

The Model Registry is a centralized store for ML models with versioning, stages (Staging, Production) and governance.

9. Delta Table Details

Q33. What are Delta constraints?

Delta constraints such as NOT NULL and CHECK validate data on write and prevent invalid rows from being inserted.

Q34. What are identity columns?

Identity columns automatically generate sequential numeric values, often used as surrogate primary keys.

Q35. How to create a Delta table from a DataFrame?

You can use df.write.format("delta").save(path) or df.write.saveAsTable("table_name") with Delta configured as the default.

10. Exam Strategy & Tips

Q36. Which topics are most important for Databricks certifications?

Lakehouse concepts, Delta Lake features, Unity Catalog, Auto Loader, DLT, cluster types, jobs/workflows, Structured Streaming and optimization (OPTIMIZE, Z-ORDER, VACUUM).

Q37. Best way to prepare for scenario questions?

Focus on understanding when to use each feature: Auto Loader vs COPY INTO, job clusters vs all-purpose, DLT vs manual ETL, Unity Catalog for governance, and Delta for reliability.

Q38. How to quickly revise before exam?

Review core definitions, Medallion architecture, key commands (OPTIMIZE, VACUUM, DESCRIBE HISTORY, GRANT), and common design patterns for ingestion, transformation and serving.

Conclusion

Databricks certifications mainly test your understanding of Lakehouse concepts, Delta Lake behavior, governance with Unity Catalog, and correct design choices for real-world data engineering scenarios. Use this short Q&A as a quick revision sheet before your exam and revisit the topics where you feel less confident.

Tuesday, 21 October 2025

Databricks Jobs: Schedule ETL Pipelines

Databricks Jobs allow teams to automate notebook execution, schedule workflows, and manage production pipelines with ease.

Why Use Databricks Jobs?

  • Avoid manual execution
  • Automate daily/weekly ETL
  • Trigger ML model retraining
  • Send alerts on failure

Types of Jobs

  • Notebook Job
  • Multi-Task Workflow
  • Delta Live Tables Job

Best Practices

  • Enable retry on failure
  • Use notifications
  • Monitor job runs weekly
  • Optimize cluster configuration

Conclusion

Databricks Jobs are essential for enterprise-level automation. They ensure reliability, reduce manual errors, and help teams maintain consistent data pipelines.

Tuesday, 14 October 2025

Databricks Interview Questions & Answers

Top Databricks Interview Questions & Answers (2026)

Whether you're preparing for a data engineering or data analyst role, these Databricks interview questions will help you strengthen your fundamentals.

Basic Questions

  • What is Databricks?
  • What is Lakehouse Architecture?
  • Difference between Data Lake and Delta Lake?

Intermediate Questions

  • Explain what a Databricks Cluster is.
  • How does Time Travel work in Delta Lake?
  • What is the Spark Catalyst Optimizer?

Advanced Questions

  • Explain Medallion Architecture.
  • How do you optimize a Spark job?
  • What is Delta Live Tables?

Conclusion

Databricks has become a global standard for large-scale data engineering. Mastering its architecture, pipeline design, and Spark optimization techniques will significantly boost your career opportunities.

Monday, 13 October 2025

AWS VPC — Beginner-Friendly Explanation with Real Examples

What Is VPC?

A Virtual Private Cloud (VPC) is your own isolated network inside AWS. You control IP ranges, subnets, routing, and security.

Core Components of VPC

  • Subnets: Public & private
  • Route Tables
  • Internet Gateway
  • NAT Gateway
  • Security Groups
  • Network ACLs

Example VPC Architecture

  • Public subnet → EC2 + Load Balancer
  • Private subnet → Database
  • NAT Gateway → Internet access for private subnet
  • Security Groups → Allow specific ports
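Carving a VPC CIDR into subnets like this can be sketched with Python's standard ipaddress module (the 10.0.0.0/16 range is a hypothetical example):

```python
import ipaddress

# Hypothetical VPC range; split it into /24 subnets.
vpc = ipaddress.ip_network("10.0.0.0/16")
subnets = list(vpc.subnets(new_prefix=24))

public_subnet  = subnets[0]   # e.g. EC2 + Load Balancer
private_subnet = subnets[1]   # e.g. Database

print(public_subnet, private_subnet)  # 10.0.0.0/24 10.0.1.0/24
print(private_subnet.is_private)      # True
```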

Why Is VPC Important?

  • High security
  • Custom network control
  • Multi-layer architecture
  • Used in enterprise cloud setups

Conclusion

VPC is the backbone of AWS networking. Every cloud learner must understand its structure and components.

Wednesday, 1 October 2025

Databricks Delta Lake Explained (Complete Guide)

Delta Lake is an open-source storage layer that brings reliability, performance, and governance to data lakes. It provides ACID transactions and schema enforcement, solving common data reliability issues in big data workloads.

Delta Lake Key Features

  • ACID Transactions – Ensures data consistency
  • Time Travel – Access historical versions
  • Schema Enforcement – Prevents bad data
  • Optimized Storage – Faster reads and writes
  • Batch + Streaming Support

Example: Time Travel Query

SELECT * FROM delta.`/mnt/sales` VERSION AS OF 5;

Where Delta Lake Is Used

  • ETL pipelines
  • Data warehousing
  • Machine learning
  • Financial reporting
  • Government audits

Conclusion

Delta Lake is the backbone of the Lakehouse architecture and provides unmatched reliability for large-scale data pipelines. Its ACID guarantees and historical versioning make it a must-have for any modern data platform.

Friday, 26 September 2025

AWS Cloud Practitioner — 20 Most Expected Questions (With Answers)

  1. What is Cloud Computing? On-demand delivery of IT resources over the internet.
  2. What is EC2? Virtual servers in the cloud.
  3. What is S3 durability? 99.999999999% (11 nines).
  4. What is an Availability Zone? One or more isolated data centers within a Region.
  5. What is the root account? The primary admin account with full access.
  6. What is IAM? Identity and access management service.
  7. What is VPC? An isolated virtual network inside AWS.
  8. What is Lambda? Serverless compute.
  9. What is RDS? Managed relational database service.
  10. What is CloudFront? Content delivery network (CDN).
  11. What is Multi-AZ? Automatic failover for RDS across Availability Zones.
  12. What is Route 53? DNS and domain routing service.
  13. What is Auto Scaling? Adds/removes EC2 instances automatically.
  14. What is ELB? Distributes traffic across targets.
  15. What is Elastic Beanstalk? Simplified application deployment.
  16. What is KMS? Key Management Service for encryption keys.
  17. What is SNS? Pub/sub notification service.
  18. What is SQS? Managed message queue.
  19. What is Glacier? Low-cost archival storage.
  20. What is CloudTrail? API activity audit logs.

These questions will help you prepare for the AWS Cloud Practitioner exam with confidence.
