How to Connect Databricks to an AWS S3 Bucket (Step-by-Step Guide)
Introduction
Connecting Databricks to an AWS S3 bucket is one of the most common tasks in modern data engineering. Once the connection is configured, Databricks can read raw files from S3, process them with Apache Spark, and write the output back to S3 or Delta tables. This guide walks through the connection process step by step.
Step 1: Understand the Basic Requirement
Databricks needs permission to access files stored in Amazon S3. This is usually granted through an IAM role, access keys, or an instance profile, depending on your cloud setup and security standards.
Step 2: Prepare the S3 Bucket
Create an S3 bucket in AWS and upload sample files such as CSV, JSON, or Parquet. Make sure the bucket policy allows the required Databricks access.
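Before creating the bucket, it can help to check the name against S3's core naming rules: 3-63 characters, lowercase letters, digits, hyphens, and dots, starting and ending with a letter or digit. A quick sketch (the example names are placeholders):

```python
import re

def is_valid_bucket_name(name: str) -> bool:
    """Rough check of the core S3 bucket-naming rules:
    3-63 characters, lowercase letters, digits, hyphens and dots,
    starting and ending with a letter or digit."""
    return bool(re.fullmatch(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]", name))

print(is_valid_bucket_name("my-data-bucket"))  # True
print(is_valid_bucket_name("My_Bucket"))       # False: uppercase and underscore
```

Note that this covers only the basic character rules; AWS applies a few extra restrictions (for example, names formatted like IP addresses are rejected).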
Step 3: Configure Credentials
You can configure AWS credentials in Databricks using Spark configuration or secrets. For example, teams often store access keys securely in a Databricks secret scope instead of hardcoding them inside notebooks.
# The secret scope and key names below are placeholders; store your own keys there first.
access_key = dbutils.secrets.get(scope="aws", key="s3-access-key")
secret_key = dbutils.secrets.get(scope="aws", key="s3-secret-key")
spark.conf.set("fs.s3a.access.key", access_key)
spark.conf.set("fs.s3a.secret.key", secret_key)
spark.conf.set("fs.s3a.endpoint", "s3.amazonaws.com")
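When one workspace reads from several buckets, Hadoop's S3A connector also supports per-bucket settings (fs.s3a.bucket.<bucket-name>.access.key and so on), which keeps each credential scoped to a single bucket. A minimal sketch, with placeholder bucket and key values:

```python
def per_bucket_s3a_conf(bucket: str, access_key: str, secret_key: str) -> dict:
    """Build per-bucket S3A config keys so credentials apply to one bucket only."""
    prefix = f"fs.s3a.bucket.{bucket}."
    return {prefix + "access.key": access_key,
            prefix + "secret.key": secret_key}

# "demo-bucket" and the credential values are placeholders.
for key in per_bucket_s3a_conf("demo-bucket", "ACCESS", "SECRET"):
    print(key)
```

In a notebook you would then apply each pair with spark.conf.set(key, value).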
Step 4: Test the Connection
Once the credentials are configured, test the connection by listing files from the bucket.
display(dbutils.fs.ls("s3a://your-bucket-name/"))
Step 5: Read Data from S3
After a successful connection, read the files into a Spark DataFrame.
df = spark.read.option("header", "true").csv("s3a://your-bucket-name/input/sales.csv")
display(df)
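Schema inference is fine for quick tests, but production reads are usually pinned to an explicit schema; Spark's .schema() accepts a DDL-style string. A sketch, with hypothetical column names:

```python
def ddl_schema(columns: list) -> str:
    """Build a Spark DDL schema string from (name, type) pairs,
    e.g. 'order_id INT, amount DOUBLE'."""
    return ", ".join(f"{name} {dtype}" for name, dtype in columns)

# Column names and types here are hypothetical examples.
schema = ddl_schema([("order_id", "INT"), ("region", "STRING"), ("amount", "DOUBLE")])
print(schema)  # order_id INT, region STRING, amount DOUBLE

# In a Databricks notebook (not run here):
# df = spark.read.option("header", "true").schema(schema).csv("s3a://your-bucket-name/input/sales.csv")
```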
Step 6: Validate the Data
Check the schema, row count, and sample records before using the data for downstream processing.
df.printSchema()
df.count()
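These checks can be bundled into a small helper. Since df.columns returns a plain list and df.count() a plain integer, the logic is sketched here in pure Python, with hypothetical column names:

```python
def validate(columns: list, row_count: int, required: set) -> list:
    """Return a list of human-readable problems; an empty list means the data passed."""
    problems = []
    missing = required - set(columns)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    if row_count == 0:
        problems.append("no rows were read")
    return problems

# "order_id", "amount" and "region" are placeholder column names.
print(validate(["order_id", "amount"], 120, {"order_id", "amount", "region"}))
```

In a notebook you would call it as validate(df.columns, df.count(), required_columns) and stop the pipeline if the result is non-empty.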
Best Practices
- Use secret scopes instead of hardcoding credentials
- Prefer IAM roles where possible
- Limit S3 permissions to only required paths
- Test with small files first
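To illustrate the path-scoping recommendation, a read-only IAM policy limited to one prefix might look like the following sketch (the bucket name and prefix are placeholders):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::your-bucket-name/input/*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::your-bucket-name",
      "Condition": {"StringLike": {"s3:prefix": ["input/*"]}}
    }
  ]
}
```

Writing output back to S3 would additionally need actions such as s3:PutObject on the output prefix.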
Conclusion
Connecting Databricks to S3 is the foundation for many cloud data engineering workflows. Once access is configured correctly, you can build ingestion pipelines, create tables, archive old files, and automate data movement across buckets with ease.