Open to Lead · Principal · Staff · DevOps/SRE Manager roles

LEAD DEVOPS · SRE · PLATFORM · OPEN TO PRINCIPAL & MANAGER ROLES

Building reliable platforms for thousands of engineers.

8+ years across Paytm, Airtel, PayU and Sprinklr, spanning payments, telecom and SaaS. I design, scale and operate Kubernetes fleets (1000+ clusters, 5000+ engineers), petabyte-scale observability, multi-cloud platforms (AWS · GCP · Azure), and AI-assisted SRE against 99.99% SLOs.

  • Kubernetes @ scale
  • SRE · 99.99% SLO
  • Prometheus · Thanos · VM
  • ArgoCD · GitOps
  • Terraform · Ansible
  • AWS · GCP · Azure
  • AI-assisted SRE
  • Team Lead · Mentor
Vivek Raj
Lead DevOps Engineer
@ Sprinklr
1000+
Clusters operated
5000+
Platform users
99.99%
Uptime SLO
8+
Years in trenches
ABOUT

I turn fragile infrastructure into boring, predictable platforms — and grow the teams that run them.

I'm Vivek — a Lead DevOps / SRE engineer with 8+ years shipping production systems across payments (Paytm, PayU), telecom (Airtel) and SaaS (Sprinklr). I've owned cloud migrations (Ali Cloud → AWS, on-prem DC → AWS), shipped Kubernetes Operators that automate toil, and built observability stacks that page humans only when humans are actually needed.

I lead from the keyboard: I write the design doc, ship the first version, then mentor the team that owns it. I've operated platforms at 1000+ cluster / 5000+ engineer scale, run on-call rotations, hardened security and compliance for regulated payment workloads, and driven 30%+ cloud cost reductions through right-sizing and FinOps.

Looking ahead, I'm open to Lead, Principal, Staff, or DevOps/SRE Manager roles where reliability, platform engineering and AI-assisted operations are first-class problems.

Platform at scale

1000+ Kubernetes clusters across 14 regions on EKS · GKE · AKS. GitOps-first with ArgoCD ApplicationSets; zero-drift policy enforced in CI.

Observability

Prometheus + Thanos + VictoriaMetrics + Grafana + Loki/ELK at PB scale. Per-tenant SLOs, burn-rate alerting, runbook-linked dashboards.
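
The burn-rate alerting mentioned above can be sketched roughly as follows. This is a minimal illustration, not the production rule set: the 99.99% target comes from this page, while the long/short window pairing, the 14.4 threshold (a commonly cited fast-burn value for a 30-day budget), and all function names are assumptions made for the example.

```python
from dataclasses import dataclass

SLO_TARGET = 0.9999             # 99.99% availability target quoted on this page
ERROR_BUDGET = 1 - SLO_TARGET   # fraction of requests that may fail


@dataclass
class Window:
    """Request counts observed over one look-back window (hypothetical shape)."""
    errors: int
    total: int

    def error_ratio(self) -> float:
        return self.errors / self.total if self.total else 0.0


def burn_rate(window: Window) -> float:
    """How many times faster than 'exactly on budget' errors are accruing."""
    return window.error_ratio() / ERROR_BUDGET


def should_page(long_window: Window, short_window: Window,
                threshold: float = 14.4) -> bool:
    """Page only when both windows burn above the threshold.

    The short window keeps the alert from firing long after recovery;
    the threshold value is an assumption, not taken from this page.
    """
    return (burn_rate(long_window) >= threshold
            and burn_rate(short_window) >= threshold)


def budget_minutes(days: int = 30) -> float:
    """Allowed full-outage minutes in the period for the SLO above."""
    return days * 24 * 60 * ERROR_BUDGET
```

At 99.99% the 30-day budget is roughly 4.3 minutes of full outage, which is why a sustained error ratio of 0.2% (a burn rate of about 20) would page in this sketch while a brief blip that has already cleared in the short window would not.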

AI for Ops

LLM-backed SRE copilot (Langfuse-traced, RAG over runbooks) that triages alerts, proposes remediations, and auto-closes low-risk incidents — 1200+/week.

Security & compliance

IAM, KMS, Secret Manager, GuardDuty, CloudTrail, WAF, Shield, SonarQube, Qualys, Cortex. Shift-left policies in CI/CD; PCI-DSS-aware payment workloads.

Leadership & mentoring

Lead cross-functional pods of 6–12 engineers, run incident reviews and on-call rotations, mentor SRE-2/SRE-3 ICs, partner with product on platform roadmaps.

Cost & FinOps

30% cloud spend reductions through right-sizing, spot/savings plans, multi-tenant cluster packing, and a Release Engineering UI that cut deploy toil 70%.

Core pillars
self-assessed

Depth across platform disciplines — not a scorecard.

Kubernetes & Operators · 96%
Observability & SRE · 94%
GitOps / CI/CD · 92%
Multi-cloud (AWS · GCP · Azure) · 90%
IaC / Terraform / Ansible · 88%
Security & Compliance · 86%
Leadership & Mentoring · 88%
AI-assisted DevOps · 82%
Operating principles
6 principles
  • 01 · Reliability is a product, not a ticket queue.
  • 02 · Every alert should be actionable, or deleted.
  • 03 · If it isn't in Git, it doesn't exist.
  • 04 · Automate the 2 a.m. page away.
  • 05 · Hire for slope, not intercept, and unblock relentlessly.
  • 06 · Cost is a reliability signal: cheap is calm.
CAPABILITIES

A full-stack DevOps toolbox.

From the kernel to the dashboard, IC to tech-lead — battle-tested across payments (Paytm, PayU), telecom (Airtel) and SaaS (Sprinklr).

  • 96

    Kubernetes

    Expert
    Orchestration·96/100
    Field notes

    1000+ clusters, custom operators, CRDs

    • Unified GitOps for EKS, GKE, and AKS
    • CRDs and admission policies as platform contracts
  • 95

    Docker

    Expert
    Orchestration·95/100
    Field notes

    multi-stage, distroless, BuildKit

    • Reproducible image builds; SBOM in CI
    • Hardening base images and supply chain
  • 92

    Helm

    Expert
    Orchestration·92/100
    Field notes

    chart libraries, umbrella charts

  • 90

    ECS / EKS / GKE / AKS

    Expert
    Orchestration·90/100
    Field notes

    prod across AWS, GCP, Azure

  • 95

    Prometheus

    Expert
    Observability·95/100
    Field notes

    federation, recording rules at PB scale

    • Federation and HA pairs for global views
    • Recording rules to protect query paths
  • 92

    Thanos

    Expert
    Observability·92/100
    Field notes

    long-term storage, global querier

  • 90

    VictoriaMetrics

    Expert
    Observability·90/100
    Field notes

    vmagent, vmalert fleet

  • 94

    Grafana

    Expert
    Observability·94/100
    Field notes

    SLO dashboards, alerting, provisioning as code

  • 88

    Loki / ELK

    Advanced
    Observability·88/100
    Field notes

    structured logs, cardinality discipline

  • 82

    Dynatrace / Pinpoint

    Advanced
    Observability·82/100
    Field notes

    APM, trace-based alerting

  • 94

    ArgoCD

    Expert
    GitOps & CI/CD·94/100
    Field notes

    ApplicationSets, multi-cluster, SSO

  • 90

    Jenkins

    Expert
    GitOps & CI/CD·90/100
    Field notes

    shared libraries, dynamic agents

  • 88

    GitLab CI / Bitbucket

    Advanced
    GitOps & CI/CD·88/100
    Field notes

    pipelines as code, cache optimizations

  • 92

    Terraform

    Expert
    IaC & Config·92/100
    Field notes

    modules, workspaces, policy-as-code

  • 88

    Ansible

    Advanced
    IaC & Config·88/100
    Field notes

    playbooks, roles, idempotent rollouts

  • 78

    Crossplane

    Working
    IaC & Config·78/100
    Field notes

    cluster-as-API

  • 94

    AWS

    Expert
    Cloud·94/100
    Field notes

    EKS, VPC, IAM, WAF, KMS, GuardDuty, Lambda, Route53

  • 85

    GCP

    Advanced
    Cloud·85/100
    Field notes

    GKE, Apigee, Cloud Monitoring, IAM

  • 82

    Azure

    Advanced
    Cloud·82/100
    Field notes

    AKS, Front Door, Arc, Key Vault

  • 92

    IAM / KMS / Secret Manager

    Expert
    Security & Compliance·92/100
    Field notes

    least-privilege, rotation

  • 86

    SonarQube / Qualys / Cortex

    Advanced
    Security & Compliance·86/100
    Field notes

    shift-left scanning

  • 88

    CloudTrail / GuardDuty / WAF

    Advanced
    Security & Compliance·88/100
    Field notes

    continuous threat monitoring

  • 88

    Python

    Advanced
    Programming·88/100
    Field notes

    Flask, Django, operators, tooling

  • 82

    JavaScript / React

    Advanced
    Programming·82/100
    Field notes

    internal DevOps UIs

  • 84

    Bash / Go (ops)

    Advanced
    Programming·84/100
    Field notes

    reliability tooling

  • 84

    LLM-assisted SRE

    Advanced
    AI for DevOps·84/100
    Field notes

    alert triage, runbook synthesis

  • 80

    Langfuse

    Advanced
    AI for DevOps·80/100
    Field notes

    trace + evaluate prompts in prod

  • 78

    RAG on runbooks

    Working
    AI for DevOps·78/100
    Field notes

    searchable institutional memory

  • 90

    Redis (SME)

    Expert
    Databases & Data·90/100
    Field notes

    replication, persistence, eviction at payments scale

    • HA + sentinel/cluster topology hardening
    • Client-side patterns and connection pooling discipline
  • 85

    Amazon RDS · Cloud SQL

    Advanced
    Databases & Data·85/100
    Field notes

    managed MySQL/Postgres, backups, parameter groups

  • 84

    PostgreSQL · MySQL

    Advanced
    Databases & Data·84/100
    Field notes

    schema design, slow-query triage, replication

  • 78

    MongoDB

    Working
    Databases & Data·78/100
    Field notes

    replica sets, indexing, ops

  • 84

    AWS API Gateway

    Advanced
    API Management·84/100
    Field notes

    REST/HTTP APIs, throttling, WAF integration

  • 80

    Kong

    Advanced
    API Management·80/100
    Field notes

    plugins, auth, rate limiting

  • 75

    Apigee (GCP)

    Working
    API Management·75/100
    Field notes

    policy chains, dev portal

  • 90

    Tech Leadership

    Expert
    Leadership & Delivery·90/100
    Field notes

    leading 6–12 engineer pods across DevOps/SRE

    • Roadmaps, design reviews, and platform OKRs
    • Hiring loops, ramp-up plans, growth ladders
  • 88

    Mentoring & Coaching

    Advanced
    Leadership & Delivery·88/100
    Field notes

    SRE-2 / SRE-3 growth, 1:1 cadence

  • 90

    Incident Management

    Expert
    Leadership & Delivery·90/100
    Field notes

    on-call rotations, IM/comms, blameless postmortems

  • 88

    Cross-functional Partnership

    Advanced
    Leadership & Delivery·88/100
    Field notes

    product, security, risk, compliance

  • 86

    FinOps / Cost Optimization

    Advanced
    Leadership & Delivery·86/100
    Field notes

    30% cloud-spend reductions; right-sizing & savings plans

  • 90

    Slack · Teams · Confluence · JIRA

    Expert
    Leadership & Delivery·90/100
    Field notes

    documentation-first delivery

JOURNEY

Eight years shipping production systems — IC and tech-lead.

Payments, telecom and SaaS. Ali Cloud → AWS migrations, multi-cluster Kubernetes, observability from scratch, and a DevOps UI that cut toil 70%.

SELECTED WORK

Systems I've designed & shipped.

A sample of the platforms, operators, migrations and dashboards that moved real numbers in production — across payments, telecom and SaaS.

Platform

DevOps Custom UI

Internal DevOps control plane used across multiple verticals — RBAC, compliance, vendor VPN, AWS monitoring, asset management.

  • 70% reduction in toil
  • Adopted by 100+ engineers
  • RBAC with 12 scoped roles
  • PayU production-grade
highlights
  • Role-based access control with fine-grained scopes.
  • Live AWS monitoring UI — EC2, RDS, ELB, and cost.
  • Approval workflow for infrastructure creation.
  • Full audit history for security groups and user actions.
React · Flask · Ansible · AWS · MySQL
Migration

PG2 Cloud Migration

Lifted and re-platformed 50+ payment microservices from Ali Cloud to AWS on EKS with end-to-end CI/CD and monitoring.

  • 50+ microservices migrated
  • Zero customer-visible downtime
  • ~30% cost optimization
  • 2× Rockstar Award (Paytm)
highlights
  • Designed the target architecture on EKS with multi-AZ resiliency.
  • Built the end-to-end CI/CD pipeline for all PG2 services.
  • Stood up comprehensive monitoring and alerting before the first customer was cut over.
  • Automated infrastructure provisioning via Terraform modules.
AWS · EKS · Terraform · ArgoCD · Prometheus
Kubernetes

Scheduled HPA Operator

Kubernetes Operator with a Custom Resource Definition that patches Horizontal Pod Autoscalers on business-day schedules.

  • Zero manual HPA edits
  • Weekday/weekend schedule awareness
  • Audit logs per patch
highlights
  • Day-of-week schedule customization for fine-grained scaling.
  • Resource-name sorting for predictable ordering.
  • Structured logging for debugging and observability.
  • Zero-downtime rollouts via leader election.
Go · Kubernetes · CRD · controller-runtime
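
The scheduling idea behind this operator can be sketched in a few lines. The operator itself is listed as Go with controller-runtime; the Python below only illustrates day-of-week resolution and name-sorted patch planning, and every type and function name in it (`ReplicaBounds`, `ScheduledHPA`, `plan_patches`) is hypothetical rather than taken from the real CRD.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class ReplicaBounds:
    min_replicas: int
    max_replicas: int


@dataclass(frozen=True)
class ScheduledHPA:
    """Hypothetical stand-in for one custom resource targeting one HPA."""
    hpa_name: str
    default: ReplicaBounds
    # weekday index (Monday=0 .. Sunday=6) -> bounds overriding the default
    by_weekday: dict = field(default_factory=dict)


def bounds_for(schedule: ScheduledHPA, day: date) -> ReplicaBounds:
    """Pick the override for this weekday, falling back to the default."""
    return schedule.by_weekday.get(day.weekday(), schedule.default)


def plan_patches(schedules: list, day: date) -> list:
    """Return (hpa_name, patch) pairs, sorted by name for predictable ordering."""
    patches = []
    for sched in sorted(schedules, key=lambda s: s.hpa_name):
        bounds = bounds_for(sched, day)
        patch = {"spec": {"minReplicas": bounds.min_replicas,
                          "maxReplicas": bounds.max_replicas}}
        patches.append((sched.hpa_name, patch))
    return patches
```

A reconcile loop would apply each planned patch to the named HorizontalPodAutoscaler and log the result; that part is left out here because it depends on cluster access.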
AI + Platform

AI SRE Copilot

LLM-powered assistant that ingests alerts, correlates runbooks, and auto-executes safe remediations — with full Langfuse tracing.

  • 1200+ alerts auto-remediated / week
  • MTTR down 42%
  • 100% traced prompts
  • Serves 5000+ engineers
highlights
  • Alert deduplication and root-cause hypothesis generation.
  • Retrieval over institutional runbooks and postmortems.
  • Safe-remediation playbooks gated by blast-radius scoring.
  • Eval harness for prompt regressions on historical incidents.
Python · LLMs · Langfuse · RAG · Kubernetes
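
One way to picture the blast-radius gate on safe remediations is the sketch below. It is an assumption-heavy illustration, not the copilot's actual scoring: the fields, the weights (0.3 for irreversible actions, 0.2 for production), and the 0.4 auto-execute threshold are all invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Remediation:
    """Hypothetical description of a proposed remediation."""
    action: str
    affected_pods: int
    total_pods: int
    reversible: bool
    production: bool


def blast_radius(r: Remediation) -> float:
    """Score in [0, 1]; higher means riskier. Weights are illustrative only."""
    # Unknown fleet size is treated as worst case.
    share = r.affected_pods / r.total_pods if r.total_pods else 1.0
    score = share
    if not r.reversible:
        score += 0.3
    if r.production:
        score += 0.2
    return min(score, 1.0)


def decide(r: Remediation, auto_threshold: float = 0.4) -> str:
    """Auto-execute only low-risk actions; everything else goes to a human."""
    if blast_radius(r) <= auto_threshold:
        return "auto-execute"
    return "escalate-to-human"
```

Under these made-up weights, restarting one pod out of twenty in production is executed automatically, while an irreversible action touching every pod is escalated.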
Observability

Business Metrics Dashboard

Custom Grafana + Prometheus dashboard for business-critical metrics with automated alerting and incident workflows.

  • Faster incident response
  • Tailored to business KPIs
  • Alerting on leading indicators
highlights
  • Business-facing KPIs alongside infra SLOs.
  • Automated alerts with context-rich payloads.
  • On-call playbooks linked directly from panels.
Prometheus · Grafana · Alertmanager
Security

2FA Firewall Integration

Two-factor authentication bolted onto the vendor-access firewall — streamlined onboarding with strict role-based controls.

  • Fewer unauthorized entries
  • Faster vendor onboarding
  • Role-scoped access
highlights
  • Role-based access for vendors with time-boxed permissions.
  • Streamlined onboarding flow with automated provisioning.
  • Full auditability of vendor sessions.
Firewall · LDAP · 2FA · Ansible
RECOGNITION

Awards & certifications.

Recognitions earned across Paytm, Airtel, and PayU — plus continued investment in the craft.

Award · Paytm · 2022
Rockstar Award — PG2 Migration & Monitoring

Led the PG2 migration from Ali Cloud to AWS alongside a full CI/CD and monitoring buildout.

Award · Paytm · 2023
Paytm Payment Infinity Award

Recognized for outstanding contributions in building Paytm Payment Infinity.

Award · Paytm · 2023
Rockstar Award — Infrastructure Automation

Automated PG2 infrastructure end-to-end with a best-in-class monitoring system.

Award · Airtel · 2021
Outstanding JVM Hygiene Award

Automated JVM hygiene across multiple microservices for measurable reliability and performance gains.

Award · PayU · 2019
Thank You Award — Big Billion Day

Delivered scalable infrastructure and dashboards that carried Big Billion Day traffic.

Cert · in progress · CNCF / Linux Foundation
Certified Kubernetes Administrator (CKA)
Cert · earned · Amazon Web Services
AWS Solutions Architect — Associate
Cert · earned · HashiCorp
Terraform Associate
Cert · in progress · Amazon Web Services
AWS Certified DevOps Engineer — Professional
Cert · in progress · CNCF / Linux Foundation
Prometheus Certified Associate (PCA)
Cert · earned · Codefresh
ArgoCD / GitOps Fundamentals
Education
B.Tech — Computer Science & Engineering
Greater Noida Institute of Technology (AKTU)
2013 — 2017
let's build something reliable

Got a platform to scale or an SLO to save?

I'm open to staff / principal / platform leadership roles, and to collaborations on observability, Kubernetes, and AI-for-Ops.

Email me
All systems operational · © 2026 Vivek Raj. All rights reserved.