Monday, 12 January 2026

How to Read Files from S3 in Databricks

Introduction

After connecting Databricks to an S3 bucket, the next step is reading files for processing. Databricks supports multiple formats such as CSV, JSON, Parquet, and Delta. This guide shows how to load S3 data into Databricks step by step.

Step 1: Confirm S3 Connectivity

Before reading files, verify that the bucket path is accessible from Databricks.

display(dbutils.fs.ls("s3a://your-bucket-name/input/"))
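If the listing fails, the error usually points at missing credentials or an incorrect bucket name. As an optional sketch, the check below wraps the same call so a failure produces a clearer message (the bucket path is a placeholder):

# Optional: surface connectivity problems with a friendlier message.
try:
    files = dbutils.fs.ls("s3a://your-bucket-name/input/")
    print(f"Bucket path is reachable, {len(files)} objects found")
except Exception as e:
    print(f"Cannot access s3a://your-bucket-name/input/: {e}")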

Step 2: Read a CSV File

CSV is one of the most common formats for raw data ingestion.

df_csv = spark.read.option("header", "true").option("inferSchema", "true") \
  .csv("s3a://your-bucket-name/input/customer_data.csv")
display(df_csv)
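Relying on inferSchema forces Spark to scan the file an extra time and can guess types incorrectly. If the columns are known up front, defining the schema explicitly is safer; the sketch below is a drop-in alternative to the read above, with customer_id, name, and amount as assumed column names for illustration.

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Assumed columns for illustration; replace with the real layout of customer_data.csv.
customer_schema = StructType([
    StructField("customer_id", StringType(), True),
    StructField("name", StringType(), True),
    StructField("amount", DoubleType(), True),
])

df_csv = spark.read.option("header", "true") \
  .schema(customer_schema) \
  .csv("s3a://your-bucket-name/input/customer_data.csv")
display(df_csv)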

Step 3: Read a JSON File

JSON files are widely used in APIs and application logs.

df_json = spark.read.json("s3a://your-bucket-name/input/events.json")
display(df_json)
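By default, spark.read.json expects one JSON object per line. If a file instead contains pretty-printed records that span multiple lines, the multiLine option handles it; the file name events_pretty.json below is an assumption for illustration.

# For pretty-printed JSON where a single record spans several lines.
df_json_multiline = spark.read.option("multiLine", "true") \
  .json("s3a://your-bucket-name/input/events_pretty.json")
display(df_json_multiline)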

Step 4: Read a Parquet File

Parquet is a columnar format optimized for analytics.

df_parquet = spark.read.parquet("s3a://your-bucket-name/input/orders/")
display(df_parquet)
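The introduction also mentions Delta. If the S3 path holds a Delta table rather than plain Parquet files, a minimal sketch looks like the following; the orders_delta folder name is an assumption for illustration.

# Delta tables are read with the delta format instead of parquet.
df_delta = spark.read.format("delta").load("s3a://your-bucket-name/input/orders_delta/")
display(df_delta)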

Step 5: Inspect Schema and Quality

Always review the schema and check for null values before transforming the data.

df_csv.printSchema()
df_csv.describe().show()
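describe() gives summary statistics but does not show how many values are missing. A sketch like the one below counts nulls in every column of df_csv:

from pyspark.sql import functions as F

# Count missing values in each column to catch incomplete records early.
null_counts = df_csv.select(
    [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df_csv.columns]
)
display(null_counts)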

Step 6: Filter or Transform the Data

Once the file is loaded, you can apply filtering, joins, and aggregations using Spark.

filtered_df = df_csv.filter("amount > 1000")
display(filtered_df)
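Aggregations follow the same pattern. As a sketch, the example below totals and averages the amount column per customer; the customer_id and amount column names are assumptions based on the sample file.

from pyspark.sql import functions as F

# Total and average amount per customer (assumed column names).
agg_df = df_csv.groupBy("customer_id").agg(
    F.sum("amount").alias("total_amount"),
    F.avg("amount").alias("avg_amount"),
)
display(agg_df)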

Conclusion

Reading files from S3 in Databricks is simple once the connection is configured. The key is choosing the right file format and validating the data early so downstream tables and reports remain accurate.
