How to Read Files from S3 in Databricks
Introduction
After connecting Databricks to an S3 bucket, the next step is reading files for processing. Databricks supports multiple formats, including CSV, JSON, Parquet, and Delta. This guide shows how to load S3 data into Databricks step by step.
Step 1: Confirm S3 Connectivity
Before reading files, verify that the bucket path is accessible from Databricks.
display(dbutils.fs.ls("s3a://your-bucket-name/input/"))
Step 2: Read a CSV File
CSV is one of the most common formats for raw data ingestion.
df_csv = spark.read.option("header", "true").option("inferSchema", "true") \
.csv("s3a://your-bucket-name/input/customer_data.csv")
display(df_csv)
Step 3: Read a JSON File
JSON files are widely used in APIs and application logs.
df_json = spark.read.json("s3a://your-bucket-name/input/events.json")
display(df_json)
Step 4: Read a Parquet File
Parquet is a columnar format optimized for analytics.
df_parquet = spark.read.parquet("s3a://your-bucket-name/input/orders/")
display(df_parquet)
Step 5: Inspect Schema and Quality
Always review the schema and check for null values before transforming the data.
df_csv.printSchema()
df_csv.describe().show()
Step 6: Filter or Transform the Data
Once the file is loaded, you can apply filtering, joins, and aggregations using Spark.
filtered_df = df_csv.filter("amount > 1000")
display(filtered_df)
Conclusion
Reading files from S3 in Databricks is simple once the connection is configured. The key is choosing the right file format and validating the data early so downstream tables and reports remain accurate.