How to Create a Table in Databricks from S3 Files
Introduction
Creating tables from S3 data is a core Databricks workflow. Instead of re-reading raw files for every job, you can register the data as a managed or external table so analysts and engineers can query it directly with SQL.
Step 1: Load Data from S3
Read the S3 file into a Spark DataFrame.
df = spark.read.option("header", "true").option("inferSchema", "true") \
    .csv("s3a://your-bucket-name/input/products.csv")
Step 2: Review the Schema
Verify column names and data types before table creation.
df.printSchema()
Step 3: Create a Temporary View
A temporary view helps you validate the data with SQL before creating a permanent table.
df.createOrReplaceTempView("products_temp")
Step 4: Query the Temporary View
SELECT * FROM products_temp LIMIT 10;
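Beyond eyeballing the first rows, the temporary view is a good place for quick sanity checks before anything is persisted. A small example, assuming a hypothetical price column:

```sql
-- Row count plus a null check on a hypothetical price column
SELECT COUNT(*) AS total_rows,
       COUNT(*) - COUNT(price) AS null_prices
FROM products_temp;
```

Unexpected nulls here often mean the header option or the inferred types need another look before Step 5.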
Step 5: Create a Managed Table
If you want Databricks to manage both the table metadata and the underlying data files, create a managed table.
df.write.mode("overwrite").saveAsTable("products_table")
Step 6: Create an External Table
If you want the underlying files to remain in S3, create an external table pointing to that S3 path.
CREATE TABLE products_external
USING CSV
OPTIONS (
  path "s3a://your-bucket-name/input/products.csv",
  header "true"
);
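A CSV-backed external table re-parses the files on every query. A common variant on Databricks is to write the DataFrame out in Delta format to an S3 prefix and register an external table over that location instead; this sketch assumes a hypothetical s3a://your-bucket-name/delta/products path you have first written with df.write.format("delta").save(...).

```sql
-- Hypothetical Delta variant: assumes the DataFrame was already written
-- to this prefix with df.write.format("delta").save(...)
CREATE TABLE products_delta
USING DELTA
LOCATION 's3a://your-bucket-name/delta/products';
```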
Step 7: Query the Table
SELECT COUNT(*) FROM products_table;
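One detail worth checking before cleanup: dropping a managed table deletes its data, while dropping an external table leaves the S3 files in place. You can confirm which kind you have from the table metadata:

```sql
DESCRIBE EXTENDED products_table;    -- Type row shows MANAGED or EXTERNAL, plus Location
-- DROP TABLE products_table;        -- managed: also deletes the underlying data
-- DROP TABLE products_external;     -- external: the S3 files are left untouched
```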
Conclusion
Creating Databricks tables from S3 files makes data easier to manage, query, and govern. It is a practical step for building reusable analytics and ETL pipelines.