Thursday, 25 December 2025

AWS EC2 — Complete Beginner Guide (Instances, Pricing, Use Cases)

What Is EC2?

Amazon EC2 (Elastic Compute Cloud) provides resizable virtual servers, known as instances, so you can run applications without buying or managing physical hardware.

Types of EC2 Instances

  • General Purpose: t3, t4g
  • Compute Optimized: c6i
  • Memory Optimized: r6g
  • GPU Instances: p4, g5
  • Storage Optimized: i3, i4i

EC2 Pricing Models

On-Demand

Pay per second or per hour with no commitment. The most flexible option, but the highest unit price.

Reserved Instances

Commit to a 1- or 3-year term in exchange for a discount of up to 72% versus On-Demand. Best for steady, predictable workloads.

Spot Instances

Use spare AWS capacity at a discount of up to 90%. Instances can be reclaimed by AWS at short notice, so Spot is best for fault-tolerant batch jobs and ML training.
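As a back-of-the-envelope comparison (the $0.10/hour On-Demand rate is a made-up figure for illustration, and the discounts are the "up to" maximums, not guaranteed prices):

```python
# Rough monthly cost for one instance running 24/7.
# The hourly rate is hypothetical; discounts are "up to" maximums.
HOURS_PER_MONTH = 730
on_demand_rate = 0.10  # USD/hour (illustrative, not a real AWS price)

on_demand = on_demand_rate * HOURS_PER_MONTH
reserved = on_demand * (1 - 0.72)  # up to 72% cheaper
spot = on_demand * (1 - 0.90)      # up to 90% cheaper

print(f"On-Demand: ${on_demand:.2f}/month")
print(f"Reserved:  ${reserved:.2f}/month")
print(f"Spot:      ${spot:.2f}/month")
```

The gap between the three models is why long-running baseline capacity usually goes on Reserved pricing while interruptible batch work goes on Spot.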

Key EC2 Features

  • Security Groups
  • EBS Block Storage
  • Elastic Load Balancing
  • Auto Scaling

When to Use EC2?

  • Web applications
  • Backend APIs
  • Gaming servers
  • Databases
  • Machine learning workloads

Conclusion

EC2 is a core AWS service. Knowing its pricing and instance types is essential for cloud beginners.

Friday, 19 December 2025

Databricks Important Commands Cheat Sheet (SQL + Python)

This post is a quick Databricks commands cheat sheet for certification exam preparation. It covers the most important SQL and Python (PySpark) commands used with Delta Lake, Lakehouse, Unity Catalog, Auto Loader, Structured Streaming and optimization.

1. Basic Spark & DataFrame Commands (Python)

Start Spark Session (usually auto in Databricks)

# Spark session is usually available as `spark` in Databricks
spark.range(5).show()

Read CSV File

df = spark.read.option("header", "true").csv("/mnt/data/sales.csv")
df.show()

Write DataFrame as Parquet

df.write.mode("overwrite").parquet("/mnt/data/sales_parquet")

Display DataFrame in Notebook

display(df)

2. Delta Lake – Table Creation & Writes

Create Delta Table from DataFrame (Path)

df.write.format("delta").mode("overwrite").save("/mnt/delta/sales")

Create Delta Table as Managed Table

df.write.format("delta").mode("overwrite").saveAsTable("sales_delta")

SQL – Create Delta Table

CREATE TABLE sales_delta_sql (
  id BIGINT,
  amount DOUBLE,
  country STRING
)
USING DELTA;

SQL – Insert into Delta Table

INSERT INTO sales_delta_sql VALUES (1, 100.0, 'SG'), (2, 250.5, 'IN');

3. Delta Lake – Time Travel & History

View Table History

DESCRIBE HISTORY sales_delta_sql;

Time Travel by Version

SELECT * FROM sales_delta_sql VERSION AS OF 2;

Time Travel by Timestamp

SELECT * FROM sales_delta_sql TIMESTAMP AS OF '2026-02-28T10:00:00Z';

4. Delta Lake – Update, Merge & Delete

SQL – UPDATE

UPDATE sales_delta_sql
SET amount = amount * 1.1
WHERE country = 'SG';

SQL – DELETE

DELETE FROM sales_delta_sql
WHERE amount < 50;

SQL – MERGE (Upsert)

MERGE INTO target t
USING source s
ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET t.amount = s.amount
WHEN NOT MATCHED THEN
  INSERT (id, amount, country) VALUES (s.id, s.amount, s.country);
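Conceptually, the MERGE above is a keyed upsert. A plain-Python toy model (not Databricks code) of the matched/not-matched logic:

```python
# Toy model of MERGE INTO: match rows on id; update when matched, insert when not.
target = {1: {"amount": 100.0, "country": "SG"},
          2: {"amount": 250.5, "country": "IN"}}
source = [{"id": 2, "amount": 300.0, "country": "IN"},  # existing key -> update
          {"id": 3, "amount": 50.0, "country": "US"}]   # new key -> insert

for row in source:
    key = row["id"]
    if key in target:                          # WHEN MATCHED THEN UPDATE
        target[key]["amount"] = row["amount"]
    else:                                      # WHEN NOT MATCHED THEN INSERT
        target[key] = {"amount": row["amount"], "country": row["country"]}

print(sorted(target))  # [1, 2, 3]
```

Delta runs this logic transactionally across files, which is why MERGE is the standard tool for CDC loads.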

5. Optimization – OPTIMIZE, Z-ORDER, VACUUM

OPTIMIZE Delta Table

OPTIMIZE sales_delta_sql;

OPTIMIZE with Z-ORDER

OPTIMIZE sales_delta_sql
ZORDER BY (country);

VACUUM to Remove Old Files

VACUUM sales_delta_sql RETAIN 168 HOURS;  -- 7 days

6. Auto Loader – Incremental Ingestion

Python – Auto Loader from Cloud Storage

df_auto = (spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "csv")
  .option("header", "true")
  .load("/mnt/raw/sales/"))

(df_auto
  .writeStream
  .format("delta")
  .option("checkpointLocation", "/mnt/checkpoints/sales_autoloader")
  .outputMode("append")
  .start("/mnt/delta/sales_autoloader"))

7. Structured Streaming with Delta

Read Stream from Delta

stream_df = (spark.readStream
  .format("delta")
  .load("/mnt/delta/sales_stream"))

Write Stream to Delta

(stream_df
  .writeStream
  .format("delta")
  .option("checkpointLocation", "/mnt/checkpoints/sales_stream_out")
  .outputMode("append")
  .start("/mnt/delta/sales_stream_out"))

SQL – Streaming Table (Simplified)

CREATE OR REFRESH STREAMING LIVE TABLE sales_stream_silver
AS SELECT * FROM cloud_files("/mnt/raw/sales", "csv");

8. Delta Live Tables (DLT) – Basic Commands

Python DLT Example

import dlt

@dlt.table
def sales_bronze():
    return spark.readStream.format("cloudFiles") \
        .option("cloudFiles.format", "csv") \
        .load("/mnt/raw/sales")

@dlt.table
def sales_silver():
    return dlt.read("sales_bronze").select("id", "amount", "country")

9. Unity Catalog – Databases, Tables & Grants

List Catalogs

SHOW CATALOGS;

Set Current Catalog & Schema

USE CATALOG main;
USE SCHEMA main.sales_db;

Create Schema

CREATE SCHEMA IF NOT EXISTS main.sales_db;

Grant Permissions on Table

GRANT SELECT ON TABLE main.sales_db.sales_delta_sql TO `analyst_role`;

Revoke Permission

REVOKE SELECT ON TABLE main.sales_db.sales_delta_sql FROM `analyst_role`;

10. Useful Utility Commands for Exams

Describe Table

DESCRIBE EXTENDED sales_delta_sql;

Show Tables

SHOW TABLES IN main.sales_db;

Convert Parquet to Delta

CONVERT TO DELTA parquet.`/mnt/data/sales_parquet`;

Python – Convert to Delta Using Command

spark.sql("""
  CONVERT TO DELTA parquet.`/mnt/data/sales_parquet`
""")

Conclusion

This Databricks commands cheat sheet covers the most frequently used SQL and Python snippets for Delta Lake, Lakehouse, Auto Loader, DLT, Unity Catalog and optimization. These commands are highly relevant for Databricks certification exams and real-world projects. Use this page as a quick reference while practicing in Databricks notebooks.

Thursday, 11 December 2025

Databricks Scenario-Based Q&A (Certification Point of View)

This post contains the most frequently asked Databricks scenario-based questions and answers useful for Databricks Data Engineer Associate, Professional Data Engineer, and Lakehouse platform exams. All scenarios are short, practical, and certification-focused.

1. Delta Lake & Data Quality Scenarios

Scenario 1:

Your raw data contains duplicate rows and schema mismatches. How do you load it safely?

Answer: Load into a Bronze Delta table with schema enforcement on, then deduplicate (e.g. with dropDuplicates()) during the Silver transformation.

Scenario 2:

You received corrupt JSON files in storage. Your job fails during ingestion. What’s the best solution?

Answer: Ingest with Auto Loader and capture malformed input instead of failing: rely on the rescued data column (_rescued_data) for values that don't match the schema, or set badRecordsPath to quarantine corrupt files for later inspection.

Scenario 3:

You want to track historical versions of a Delta table for audits. What feature do you use?

Answer: Use Delta Lake Time Travel with VERSION AS OF or TIMESTAMP AS OF.

2. Performance Optimization Scenarios

Scenario 4:

Your table has millions of small Parquet files causing slow queries. What should you do?

Answer: Run OPTIMIZE table_name to compact files.

Scenario 5:

Your WHERE queries filtering on the "country" column are extremely slow. What improves performance?

Answer: Use Z-Ordering: OPTIMIZE table ZORDER BY (country).

Scenario 6:

You want to reduce storage usage and clean up obsolete Delta files.

Answer: Run VACUUM table RETAIN 168 HOURS (default 7 days).

3. Streaming & Ingestion Scenarios

Scenario 7:

You need to incrementally ingest thousands of new files daily with schema evolution.

Answer: Use Auto Loader with cloudFiles.inferColumnTypes and cloudFiles.schemaEvolutionMode.

Scenario 8:

Your streaming job restarts and reprocesses old data. How do you fix it?

Answer: Set a correct checkpointLocation for exactly-once processing.
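The checkpoint's role can be sketched in plain Python: a toy model (not Spark code) in which the committed offset is persisted, so a restarted job resumes where it left off instead of reprocessing:

```python
# Toy model of streaming checkpoints: the committed offset survives restarts,
# so a restarted query resumes instead of reprocessing from the beginning.
source = list(range(10))      # incoming records
checkpoint = {"offset": 0}    # stands in for what checkpointLocation persists

def process_batch(batch_size):
    start = checkpoint["offset"]
    batch = source[start:start + batch_size]
    checkpoint["offset"] = start + len(batch)   # commit the new offset
    return batch

first = process_batch(4)      # records 0-3
# ...the job restarts here; the committed offset is still 4, so no replay...
second = process_batch(4)     # records 4-7, not 0-3 again
print(first, second)          # [0, 1, 2, 3] [4, 5, 6, 7]
```

Losing or changing the checkpoint location is exactly like resetting `checkpoint["offset"]` to 0: the stream starts over.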

Scenario 9:

Your batch job must be converted to streaming with minimal code.

Answer: Use Structured Streaming with readStream and writeStream.

4. Job & Workflow Scenarios

Scenario 10:

You want to run a notebook daily at 12 AM without manual intervention.

Answer: Create a Databricks Job with scheduled triggering.

Scenario 11:

Multiple tasks must run sequentially (Bronze → Silver → Gold). What do you use?

Answer: Use Workflows with task dependencies.

Scenario 12:

You want temporary compute that shuts down automatically after job completion.

Answer: Use a Job Cluster instead of All-Purpose Cluster.

5. Unity Catalog & Governance Scenarios

Scenario 13:

Your company wants centralized access control across multiple workspaces.

Answer: Use Unity Catalog with a single metastore.

Scenario 14:

You need to restrict a sensitive column from analysts.

Answer: Apply a Unity Catalog column mask, or expose a dynamic view that redacts or omits the sensitive column.

Scenario 15:

Audit team needs full change history of a table.

Answer: Use DESCRIBE HISTORY table.

6. Machine Learning & MLflow Scenarios

Scenario 16:

You want to track model parameters, metrics, and artifacts.

Answer: Use MLflow Tracking.

Scenario 17:

You want version-controlled models with Staging → Production workflow.

Answer: Use MLflow Model Registry.

Scenario 18:

Two data scientists want to collaborate on the same model codebase.

Answer: Use Repos with Git integration.

7. File System & Utilities Scenarios

Scenario 19:

You want to list files in DBFS.

Answer: Use dbutils.fs.ls("/mnt/...").

Scenario 20:

You need to remove a corrupted file from DBFS.

Answer: Use dbutils.fs.rm(path, recurse=True).

8. Exam-Oriented High-Value Scenarios (Must Know)

Scenario 21:

You want to merge CDC (change data capture) data efficiently.

Answer: Use MERGE INTO with Delta Lake.

Scenario 22:

Your logic requires ensuring no duplicates based on a key column.

Answer: Use dropDuplicates() (or a MERGE keyed on the column) during Silver processing. Note that PRIMARY KEY constraints in Databricks are informational only and are not enforced.
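Key-based deduplication, as dropDuplicates(["id"]) performs, can be sketched in plain Python (a toy model, not Spark code; Spark keeps an arbitrary row per key, while this sketch keeps the first one seen):

```python
# Toy model of key-based dedup: keep one row per "id".
rows = [
    {"id": 1, "amount": 100.0},
    {"id": 1, "amount": 100.0},   # duplicate key
    {"id": 2, "amount": 250.5},
]

seen, deduped = set(), []
for row in rows:
    if row["id"] not in seen:     # first occurrence of this key wins
        seen.add(row["id"])
        deduped.append(row)

print(len(deduped))  # 2
```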

Scenario 23:

The business requires hourly incremental refresh of dashboards.

Answer: Create a Workflow with scheduled SQL tasks.

Conclusion

These scenario-based Q&A examples are extremely useful for Databricks certification exams because the tests focus heavily on real-world data engineering decisions. The more scenarios you practice, the easier it becomes to choose the correct solution during the exam. Use this guide as a quick-revision reference before your exam.

Monday, 8 December 2025

Databricks Performance Optimization Techniques

Introduction

Optimizing Databricks workloads improves query performance and reduces costs.

Step 1: OPTIMIZE Command

Compacts many small files into fewer, larger ones, cutting file-listing and scan overhead.

Step 2: Z-ORDER

Co-locates related values in the same files, so queries that filter on the Z-ordered columns skip more data.

Step 3: Partitioning

Splits a table physically by a column (such as date or country), so queries that filter on that column read only the matching partitions.
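Partition pruning can be sketched in plain Python: with rows bucketed by the partition column, the way df.write.partitionBy("country") lays them out in storage, a filter on that column touches only one bucket. A toy model, not Spark code:

```python
# Toy model of partition pruning: rows live in per-country buckets, as
# partitionBy("country") would arrange them on disk.
partitions = {
    "SG": [{"id": 1, "amount": 100.0}],
    "IN": [{"id": 2, "amount": 250.5}, {"id": 3, "amount": 40.0}],
}

# WHERE country = 'IN': only the "IN" bucket is read; "SG" is never scanned.
scanned = partitions["IN"]
print(len(scanned))  # 2
```

This is also why partitioning only helps when queries actually filter on the partition column; otherwise every bucket still gets read.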

Conclusion

Optimization techniques are essential for efficient big data workloads.
