Monday, 19 January 2026

How to Create a Table in Databricks from S3 Files

Introduction

Creating tables from S3 data is a core Databricks workflow. Instead of re-reading and re-parsing the raw files for every query, you can register the data as a managed or external table so analysts and engineers can query it with plain SQL.

Step 1: Load Data from S3

Read the S3 file into a Spark DataFrame.

df = spark.read.option("header", "true").option("inferSchema", "true") \
  .csv("s3a://your-bucket-name/input/products.csv")

Step 2: Review the Schema

Verify column names and data types before table creation.

df.printSchema()

Step 3: Create a Temporary View

A temporary view helps you validate the data with SQL before creating a permanent table.

df.createOrReplaceTempView("products_temp")

Step 4: Query the Temporary View

Run a quick sanity check against the view, either in a SQL cell or via spark.sql() from Python.

SELECT * FROM products_temp LIMIT 10;

Step 5: Create a Managed Table

If you want Databricks to manage both the table's data and its metadata (dropping the table also deletes the underlying files), create a managed table.

df.write.mode("overwrite").saveAsTable("products_table")

Step 6: Create an External Table

If you want the underlying files to remain in S3 (dropping the table removes only the metadata, not the data), create an external table pointing to that S3 path.

CREATE TABLE products_external
USING CSV
OPTIONS (
  path "s3a://your-bucket-name/input/products.csv",
  header "true"
);
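When the bucket or path varies by environment, the same DDL can be assembled in Python and executed with `spark.sql()`. A sketch, where `bucket` is a hypothetical placeholder:

```python
# Hypothetical placeholder; substitute your real bucket name.
bucket = "your-bucket-name"
source_path = f"s3a://{bucket}/input/products.csv"

create_stmt = f"""
CREATE TABLE IF NOT EXISTS products_external
USING CSV
OPTIONS (
  path "{source_path}",
  header "true"
)
"""

# spark.sql(create_stmt)  # run inside a Databricks notebook
```

Using IF NOT EXISTS makes the statement safe to re-run in scheduled jobs.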

Step 7: Query the Table

SELECT COUNT(*) FROM products_table;

Conclusion

Creating Databricks tables from S3 files makes data easier to manage, query, and govern. It is a practical step for building reusable analytics and ETL pipelines.
