Job Description
• Perform 24×7 monitoring of Databricks clusters, jobs, workflows, repos, and data pipelines.
• Monitor alerts and provide first-level resolution.
• First-level troubleshooting/analysis of:
  – Cluster failures and auto-scaling issues
  – Job failures (PySpark/Scala/Spark SQL/Delta Live Tables)
  – Workspace availability issues
• Debug Databricks notebook failures and job errors (Spark, SQL, Delta Lake)
• Rerun/retrigger failed jobs as per SOP.
• Monitor data ingestion pipelines (Streaming & Batch).
• Perform daily health checks
• Prepare incident summary reports and daily operational dashboards.
• Escalate high severity incidents to L3/Platform Engineering as per SLA.
• Handle workspace/user access requests as per RBAC policies
• Identify recurring issues and report to L3/Platform Engineering
• Perform first-level analysis of driver/executor logs.
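The first-level log analysis duty above can be sketched as a simple triage script. This is a minimal illustration, not an official Databricks taxonomy: the error signatures and cause labels below are assumptions chosen for demonstration, and a real runbook would use the patterns defined in the team's SOP.

```python
import re

# Illustrative first-level triage table: map common Spark driver/executor
# log signatures to a likely cause category. Patterns and labels here are
# examples only, not an exhaustive or official list.
ERROR_SIGNATURES = [
    (re.compile(r"java\.lang\.OutOfMemoryError"),
     "driver/executor out of memory"),
    (re.compile(r"FetchFailedException"),
     "shuffle fetch failure (possible executor loss)"),
    (re.compile(r"ExecutorLostFailure"),
     "executor lost (check autoscaling/spot evictions)"),
    (re.compile(r"AnalysisException"),
     "SQL/DataFrame analysis error (bad query or schema)"),
]


def triage_log(lines):
    """Scan log lines and return (line, likely_cause) pairs for known signatures."""
    findings = []
    for line in lines:
        for pattern, cause in ERROR_SIGNATURES:
            if pattern.search(line):
                findings.append((line.strip(), cause))
                break  # one cause per line is enough for first-level triage
    return findings
```

For example, feeding in a driver log containing `java.lang.OutOfMemoryError: Java heap space` would flag that line as a likely out-of-memory condition, giving the L1 engineer a starting point before escalating or rerunning per SOP.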
• 2–5 years of experience in Big Data / Cloud Data Platform support.
• Hands-on knowledge of the Databricks platform (clusters, jobs, repos, MLflow, SQL warehouses)
• Hands-on experience in UNIX, SQL, and shell scripting.
• Hands-on experience with the Spark UI and job debugging
• Understanding of CI/CD pipelines (Azure DevOps)
• Understanding of Apache Spark and Azure cloud services
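The rerun/retrigger duty listed above typically goes through the Databricks Jobs REST API (`POST /api/2.1/jobs/run-now`). The sketch below is a minimal illustration using only the Python standard library; the `host`, `token`, and `job_id` values are hypothetical placeholders, and a production SOP would more likely use the Databricks CLI or SDK with proper secret handling.

```python
import json
import urllib.request


def build_run_now_payload(job_id, notebook_params=None):
    """Build the request body for the Jobs 2.1 run-now endpoint."""
    payload = {"job_id": job_id}
    if notebook_params:
        payload["notebook_params"] = notebook_params
    return payload


def retrigger_job(host, token, job_id):
    # host/token/job_id are hypothetical; the endpoint path is the
    # documented Jobs 2.1 run-now endpoint.
    req = urllib.request.Request(
        f"{host}/api/2.1/jobs/run-now",
        data=json.dumps(build_run_now_payload(job_id)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        # On success the response body contains the new "run_id".
        return json.load(resp)
```

Separating the payload builder from the HTTP call keeps the request shape easy to verify in isolation, which matters when the same payload is reused across SOP scripts and ad-hoc reruns.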