End-to-End Databricks S3 Workflow: Connect, Create Tables, Archive, and Move Files
Introduction
An end-to-end Databricks S3 pipeline often includes four major tasks: connecting to S3, reading source files, creating tables, and archiving or moving processed files. This guide brings all of those steps together into one practical workflow.
Step 1: Connect Databricks to S3
Configure credentials securely using Databricks secret scopes or IAM-based access (such as an instance profile) rather than hard-coding keys in the notebook.
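If you use secret scopes, a minimal sketch of session-scoped S3 credentials looks like the following. The scope name "aws" and the key names are assumptions for illustration, and this only runs inside a Databricks notebook, where `dbutils` and `spark` are provided:

```python
# Read AWS keys from a secret scope (names here are illustrative)
# and set them as session-scoped Hadoop S3A configuration.
spark.conf.set("fs.s3a.access.key",
               dbutils.secrets.get(scope="aws", key="access-key"))
spark.conf.set("fs.s3a.secret.key",
               dbutils.secrets.get(scope="aws", key="secret-key"))
```

With IAM-based access via an instance profile attached to the cluster, no keys are set in code at all, which is generally the preferred option.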
Step 2: Read Input Files
df = spark.read.option("header", "true").option("inferSchema", "true") \
    .csv("s3a://source-bucket/input/sales.csv")
Step 3: Validate and Transform the Data
clean_df = df.dropDuplicates().filter("sales_id IS NOT NULL")
Step 4: Create a Delta Table
clean_df.write.format("delta").mode("overwrite").saveAsTable("sales_delta_table")
Step 5: Archive the Source File
dbutils.fs.cp("s3a://source-bucket/input/sales.csv",
              "s3a://archive-bucket/sales-archive/sales.csv")
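Copying to a fixed key means each run overwrites the previous archive. One common refinement is a date-stamped archive path so history is preserved; the helper below is a sketch (the function name `archived_path` and the date layout are illustrative assumptions, and the bucket names follow the example above):

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath

def archived_path(source_path, archive_prefix, run_date=None):
    """Build an archive URI like <prefix>/<YYYY-MM-DD>/<file name>."""
    # Default to today's UTC date when no run date is supplied.
    run_date = run_date or datetime.now(timezone.utc).strftime("%Y-%m-%d")
    # PurePosixPath handles the forward-slash S3 URI; .name is the file name.
    file_name = PurePosixPath(source_path).name
    return f"{archive_prefix}/{run_date}/{file_name}"

dest = archived_path("s3a://source-bucket/input/sales.csv",
                     "s3a://archive-bucket/sales-archive",
                     run_date="2024-05-01")
print(dest)  # s3a://archive-bucket/sales-archive/2024-05-01/sales.csv
```

The resulting path can then be passed as the destination of `dbutils.fs.cp` in place of the fixed archive key.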
Step 6: Delete or Move the Original File
dbutils.fs.rm("s3a://source-bucket/input/sales.csv")
Step 7: Schedule the Pipeline
Use Databricks Jobs or Workflows to run the entire notebook on a schedule.
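As a sketch of what the schedule portion of a job definition can look like, the Jobs API accepts a Quartz cron expression. The job name, notebook path, and cron value below are illustrative assumptions, not values from this pipeline:

```json
{
  "name": "s3-sales-pipeline",
  "tasks": [
    {
      "task_key": "run_pipeline",
      "notebook_task": { "notebook_path": "/Pipelines/s3_sales_pipeline" }
    }
  ],
  "schedule": {
    "quartz_cron_expression": "0 0 6 * * ?",
    "timezone_id": "UTC",
    "pause_status": "UNPAUSED"
  }
}
```

The same schedule can be configured interactively from the Workflows UI without writing JSON at all.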
Step 8: Monitor and Audit
Maintain logs for file names, table loads, archive location, and job status so the pipeline remains easy to support.
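A lightweight way to capture those details is to emit one structured audit record per run. The field names below (`file_name`, `table_name`, `archive_path`, `status`) are an illustrative assumption, not a Databricks API; the record could be appended to a Delta audit table or written to a log sink:

```python
import json
from datetime import datetime, timezone

def build_audit_record(file_name, table_name, archive_path, status):
    """Return a JSON-serializable audit entry for one pipeline run."""
    return {
        "file_name": file_name,
        "table_name": table_name,
        "archive_path": archive_path,
        "status": status,
        # UTC timestamp so records compare cleanly across regions.
        "logged_at": datetime.now(timezone.utc).isoformat(),
    }

record = build_audit_record(
    "sales.csv",
    "sales_delta_table",
    "s3a://archive-bucket/sales-archive/sales.csv",
    "SUCCESS",
)
print(json.dumps(record))
```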
Conclusion
This end-to-end pattern covers the full lifecycle of a file-based ingestion pipeline in Databricks: it starts with S3 connectivity, transforms raw files into queryable Delta tables, and ends with safe archival or removal of processed data, with scheduling and auditing keeping the whole flow repeatable and supportable.