← Back to Projects
StreamingSparkDelta LakeReal-time
Real-Time Analytics Platform
2024
Project Overview
Designed and implemented a real-time analytics platform to process high-volume event streams with sub-second latency.
Business Requirements
- Process 10M+ events daily
- Sub-second latency for critical metrics
- 99.9% uptime SLA
- Real-time dashboards
- Historical data analysis
Architecture Design
Streaming Ingestion
- Apache Kafka for event streaming
- Structured Streaming for processing
- Auto Loader for batch catchup
Processing Pipeline
# Streaming aggregation example
from pyspark.sql.functions import *
events = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "broker:9092") \
.option("subscribe", "events") \
.load()
metrics = events \
.select(from_json(col("value").cast("string"), schema).alias("data")) \
.select("data.*") \
.withWatermark("timestamp", "10 seconds") \
.groupBy(
window("timestamp", "1 minute"),
"event_type",
"user_segment"
) \
.agg(
count("*").alias("event_count"),
avg("value").alias("avg_value"),
percentile_approx("value", 0.95).alias("p95_value")
)
metrics.writeStream \
.format("delta") \
.outputMode("append") \
.option("checkpointLocation", "/checkpoints/metrics") \
.table("gold.real_time_metrics")
Storage Strategy
- Hot path: Delta Lake (last 30 days)
- Warm path: Compressed Delta (31-365 days)
- Cold path: Azure Storage (>365 days)
Performance Optimization
Partitioning Strategy
CREATE TABLE gold.real_time_metrics
PARTITIONED BY (date DATE, hour INT)
AS SELECT * FROM streaming_metrics
Checkpointing
- Separate checkpoint locations per stream
- Azure Storage for durability
- Regular cleanup of old checkpoints
Results
Performance
- Average latency: 500ms (p95: 1.2s)
- Throughput: 12M events/day
- 99.95% uptime achieved
Business Impact
- Real-time fraud detection
- Dynamic pricing based on live metrics
- Instant customer insights
Tech Stack
- Streaming: Apache Kafka, Spark Structured Streaming
- Storage: Delta Lake
- Platform: Databricks on Azure
- Monitoring: Datadog, Grafana
- Orchestration: Airflow