VidPro Consultancy Services

Operations & Service Support Lead - Cloud Infrastructure

Job Location

Noida, India

Job Description

1. Role Overview

The Operations & Service Support Manager ensures 24/7 operational excellence and customer satisfaction for our cloud infrastructure offerings, including GPU-accelerated compute solutions. This role oversees day-to-day operations, manages support teams (Tier 2/3), and collaborates closely with product and engineering teams to maintain high availability, performance, and robust service for enterprise customers running AI, HPC, or other mission-critical workloads.

2. Key Responsibilities

  1. Operational Oversight & Service Management
  - Lead and coordinate daily operations in multi-cloud or hybrid environments (e.g., AWS, Azure, GCP, on-prem HPC).
  - Maintain operational dashboards (uptime, ticket volumes, SLAs) and proactively address performance or capacity bottlenecks.
  - Ensure adherence to ITIL or other standard frameworks for incident, change, and problem management.

  2. Team Leadership & Support Structure
  - Manage support tiers (L2, L3) and operations staff (NOC, monitoring specialists) to handle escalations, incident triage, and root-cause analysis.
  - Set clear KPIs and SOPs for the team, focusing on quick resolution times, high first-contact resolution rates, and continuous improvement.
  - Coordinate training, runbooks, and knowledge transfer to ensure each tier has the expertise needed for AI/GPU workloads.

  3. Incident & Problem Management
  - Oversee major incidents and ensure timely resolution of critical outages or severe performance degradations, especially in GPU-based clusters.
  - Chair regular post-incident reviews (RCAs), track corrective actions, and drive improvements to reduce recurrence.
  - Maintain strong collaboration with product and SRE/engineering teams to address underlying code or architectural issues.

  4. Service Assurance & Continuous Improvement
  - Proactively monitor metrics and logs (e.g., GPU utilization, HPC job performance, cost anomalies) to spot potential issues before they escalate.
  - Drive automation initiatives (in partnership with DevOps or SRE) to reduce manual toil, improve deployment flows, and streamline maintenance tasks.
  - Champion reliability best practices and risk-mitigation strategies aligned with organizational SLAs and error budgets.

  5. Stakeholder & Customer Engagement
  - Act as a liaison between support/ops teams and key customers, ensuring visibility into operational performance and planned maintenance windows.
  - Support customer success teams by providing insights on usage trends, capacity needs, and support ticket data.
  - Escalate customer concerns and feedback to the product roadmap when recurring patterns emerge.

  6. Resource & Vendor Management
  - Manage relationships with external vendors and partners (e.g., GPU hardware providers, colocation/DC hosts, cloud service providers).
  - Ensure optimal allocation of resources, whether GPU nodes, high-speed storage, or other HPC components, to meet service demands and cost targets.
  - Track operational budgets, negotiate contracts, and control OPEX/CAPEX in alignment with company goals.

  7. Compliance & Security
  - Implement and enforce security policies (access controls, patching, vulnerability management) for HPC/GPU clusters and cloud environments.
  - Work with InfoSec teams to maintain compliance (SOC 2, ISO 27001, etc.) and manage data governance or audit requirements.

3. Qualifications & Skills

  1. Education & Experience
  - Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
  - 8 years in operations/support management roles, with 3 years in cloud infrastructure or HPC/AI environments.

  2. Technical & Domain Expertise
  - Strong understanding of cloud computing concepts (IaaS, PaaS, containers, virtualization) and GPU-accelerated computing (NVIDIA GPUs, HPC schedulers).
  - Familiarity with infrastructure automation (Terraform, Ansible) and observability tools (Prometheus, Grafana, Datadog, etc.).
  - Knowledge of distributed systems, HPC clusters, performance tuning, and relevant DevOps/SRE practices.

  3. Operations Management
  - Proven track record implementing ITIL/ITSM frameworks for incident, change, and problem management at scale.
  - Experience running 24/7 support teams, establishing SLAs, and delivering on operational KPIs.

  4. Leadership & Communication
  - Excellent people management and coaching skills; able to motivate diverse teams across geographies and time zones.
  - Strong communication skills; capable of engaging both senior executive stakeholders and frontline support engineers.
  - Adept at crisis management; calm under pressure and systematic in following escalation protocols.

  5. Analytical & Continuous Improvement
  - Experience analyzing operational metrics to identify trends, reduce MTTR (Mean Time to Recovery), and drive reliability enhancements.
  - Ability to propose and execute process improvements or automation that reduce operational overhead and costs.

  6. Preferred Certifications
  - ITIL Foundation/Intermediate, or PMP/PRINCE2 for project oversight.
  - Exposure to cloud certifications (AWS, Azure, GCP) or HPC vendor certifications is beneficial.

(ref:hirist.tech)

Location: Noida, IN

Posted Date: 5/1/2025

Contact Information

Contact Human Resources
VidPro Consultancy Services

