Job Description
• Perform 24×7 monitoring of Databricks clusters, jobs, workflows, repos, and data pipelines.
• Monitor alerts and provide first-level resolution.
• First-level troubleshooting/analysis of:
  – Cluster failures and auto-scaling issues
  – Job failures (PySpark/Scala/Spark SQL/Delta Live Tables)
  – Workspace availability issues
• Debug Databricks notebook failures and job errors (Spark, SQL, Delta Lake)
• Rerun/retrigger failed jobs as per SOP.
• Monitor data ingestion pipelines (Streaming & Batch).
• Perform daily health checks
• Prepare incident summary reports and daily operational dashboards.
• Escalate high severity incidents to L3/Platform Engineering as per SLA.
• Handle workspace/user access requests as per RBAC policies
• Identify recurring issues and report to L3/Platform Engineering
• Perform first-level analysis of driver/executor logs.
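The first-level log analysis duty above can be sketched as a simple triage script. This is a minimal illustration, not an official Databricks taxonomy: the error signatures and cause labels below are assumptions chosen for demonstration, and a real runbook would use the patterns defined in the team's SOP.

```python
import re

# Illustrative first-level triage table: map common Spark driver/executor
# log signatures to a likely cause category. Patterns and labels here are
# examples only, not an exhaustive or official list.
ERROR_SIGNATURES = [
    (re.compile(r"java\.lang\.OutOfMemoryError"),
     "driver/executor out of memory"),
    (re.compile(r"FetchFailedException"),
     "shuffle fetch failure (possible executor loss)"),
    (re.compile(r"ExecutorLostFailure"),
     "executor lost (check autoscaling/spot evictions)"),
    (re.compile(r"AnalysisException"),
     "SQL/DataFrame analysis error (bad query or schema)"),
]


def triage_log(lines):
    """Scan log lines and return (line, likely_cause) pairs for known signatures."""
    findings = []
    for line in lines:
        for pattern, cause in ERROR_SIGNATURES:
            if pattern.search(line):
                findings.append((line.strip(), cause))
                break  # one cause per line is enough for first-level triage
    return findings
```

For example, feeding in a driver log containing `java.lang.OutOfMemoryError: Java heap space` would flag that line as a likely out-of-memory condition, giving the L1 engineer a starting point before escalating or rerunning per SOP.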
• 2–5 years of experience in Big Data / Cloud Data Platform support.
• Hands-on knowledge of the Databricks platform (clusters, jobs, repos, MLflow, SQL warehouses)
• Hands-on experience in UNIX, SQL, and shell scripting.
• Hands-on experience with the Spark UI and job debugging
• Understanding of CI/CD pipelines (Azure DevOps)
• Understanding of Apache Spark and Azure cloud services
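The rerun/retrigger duty listed above typically goes through the Databricks Jobs REST API (`POST /api/2.1/jobs/run-now`). The sketch below is a minimal illustration using only the Python standard library; the `host`, `token`, and `job_id` values are hypothetical placeholders, and a production SOP would more likely use the Databricks CLI or SDK with proper secret handling.

```python
import json
import urllib.request


def build_run_now_payload(job_id, notebook_params=None):
    """Build the request body for the Jobs 2.1 run-now endpoint."""
    payload = {"job_id": job_id}
    if notebook_params:
        payload["notebook_params"] = notebook_params
    return payload


def retrigger_job(host, token, job_id):
    # host/token/job_id are hypothetical; the endpoint path is the
    # documented Jobs 2.1 run-now endpoint.
    req = urllib.request.Request(
        f"{host}/api/2.1/jobs/run-now",
        data=json.dumps(build_run_now_payload(job_id)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        # On success the response body contains the new "run_id".
        return json.load(resp)
```

Separating the payload builder from the HTTP call keeps the request shape easy to verify in isolation, which matters when the same payload is reused across SOP scripts and ad-hoc reruns.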