
Open to Lead · Principal · Staff · DevOps/SRE Manager roles

LEAD DEVOPS · SRE · PLATFORM · OPEN TO PRINCIPAL & MANAGER ROLES

Building reliable platforms for thousands of engineers.

8+ years across Paytm, Airtel, PayU and Sprinklr, spanning payments, telecom and SaaS. I design, scale and operate Kubernetes fleets (1000+ clusters, 5000+ engineers), petabyte-scale observability, multi-cloud platforms (AWS · GCP · Azure), and AI-assisted SRE against 99.99% SLOs.


Kubernetes @ scale
SRE · 99.99% SLO
Prometheus · Thanos · VM
ArgoCD · GitOps
Terraform · Ansible
AWS · GCP · Azure
AI-assisted SRE
Team Lead · Mentor

Lead DevOps Engineer

@ Sprinklr

1000+ clusters operated

5000+ platform users

99.99% uptime SLO

8+ years in trenches


ABOUT

I turn fragile infrastructure into boring, predictable platforms — and grow the teams that run them.

I'm Vivek — a Lead DevOps / SRE engineer with 8+ years shipping production systems across payments (Paytm, PayU), telecom (Airtel) and SaaS (Sprinklr). I've owned cloud migrations (Ali Cloud → AWS, on-prem DC → AWS), shipped Kubernetes Operators that automate toil, and built observability stacks that page humans only when humans are actually needed.

I lead from the keyboard: I write the design doc, ship the first version, then mentor the team that owns it. I've operated platforms at 1000+ cluster / 5000+ engineer scale, run on-call rotations, hardened security and compliance for regulated payment workloads, and driven 30%+ cloud cost reductions through right-sizing and FinOps.

Looking ahead, I'm open to Lead, Principal, Staff, or DevOps/SRE Manager roles where reliability, platform engineering and AI-assisted operations are first-class problems.

Platform at scale

1000+ Kubernetes clusters across 14 regions on EKS · GKE · AKS. GitOps-first with ArgoCD ApplicationSets; zero-drift policy enforced in CI.

Observability

Prometheus + Thanos + VictoriaMetrics + Grafana + Loki/ELK at PB scale. Per-tenant SLOs, burn-rate alerting, runbook-linked dashboards.
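
Burn-rate alerting compares how fast errors are consuming the error budget with the rate the SLO allows. A minimal Python sketch, assuming a 99.99% SLO and the commonly cited 14.4x paging threshold checked over a long and a short window; the function names and sample error ratios are illustrative and not taken from the production alert rules:

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than allowed the error budget is being spent."""
    budget = 1.0 - slo  # e.g. 0.0001 for a 99.99% SLO
    return error_ratio / budget


def should_page(long_ratio: float, short_ratio: float, slo: float,
                threshold: float = 14.4) -> bool:
    """Page only when both windows burn fast.

    The long window shows the problem is significant; the short window
    shows it is still happening, so resolved incidents stop paging.
    """
    return (burn_rate(long_ratio, slo) >= threshold
            and burn_rate(short_ratio, slo) >= threshold)


SLO = 0.9999  # hypothetical per-tenant target

# 0.2% errors over the long window and 0.3% over the short window: page.
print(should_page(0.002, 0.003, SLO))
# Same long window, but the short window has recovered: no page.
print(should_page(0.002, 0.0001, SLO))
```

Requiring both windows to exceed the threshold is what keeps a page from firing long after the error rate has returned to normal.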

AI for Ops

LLM-backed SRE copilot (Langfuse-traced, RAG over runbooks) that triages alerts, proposes remediations, and auto-closes low-risk incidents — 1200+/week.

Security & compliance

IAM, KMS, Secret Manager, GuardDuty, CloudTrail, WAF, Shield, SonarQube, Qualys, Cortex. Shift-left policies in CI/CD; PCI-DSS-aware payment workloads.

Leadership & mentoring

Lead cross-functional pods of 6–12 engineers, run incident reviews and on-call rotations, mentor SRE-2/SRE-3 ICs, partner with product on platform roadmaps.

Cost & FinOps

30% cloud-spend reduction through right-sizing, spot instances and savings plans, and multi-tenant cluster packing, alongside a Release Engineering UI that cut deploy toil by 70%.

Core pillars

self-assessed

Depth across platform disciplines — not a scorecard.

Kubernetes & Operators: 96%

Observability & SRE: 94%

GitOps / CI/CD: 92%

Multi-cloud (AWS · GCP · Azure): 90%

IaC / Terraform / Ansible: 88%

Security & Compliance: 86%

Leadership & Mentoring: 88%

AI-assisted DevOps: 82%

Operating principles

6 principles

01. Reliability is a product, not a ticket queue.
02. Every alert should be actionable, or deleted.
03. If it isn't in Git, it doesn't exist.
04. Automate the 2 a.m. page away.
05. Hire for slope, not intercept, and unblock relentlessly.
06. Cost is a reliability signal: cheap is calm.

CAPABILITIES

A full-stack DevOps toolbox.

From the kernel to the dashboard, IC to tech-lead — battle-tested across payments (Paytm, PayU), telecom (Airtel) and SaaS (Sprinklr).

Kubernetes
Expert · Orchestration · 96/100
Field notes: 1000+ clusters, custom operators, CRDs
- Unified GitOps for EKS, GKE, and AKS
- CRDs and admission policies as platform contracts

Docker
Expert · Orchestration · 95/100
Field notes: multi-stage, distroless, BuildKit
- Reproducible image builds; SBOM in CI
- Hardening base images and supply chain

Helm
Expert · Orchestration · 92/100
Field notes: chart libraries, umbrella charts

ECS / EKS / GKE / AKS
Expert · Orchestration · 90/100
Field notes: prod across AWS, GCP, Azure

Prometheus
Expert · Observability · 95/100
Operations notes: federation, recording rules at PB scale
- Federation and HA pairs for global views
- Recording rules to protect query paths

Thanos
Expert · Observability · 92/100
Field notes: long-term storage, global querier

VictoriaMetrics
Expert · Observability · 90/100
Field notes: vmagent, vmalert fleet

Grafana
Expert · Observability · 94/100
Field notes: SLO dashboards, alerting, provisioning as code

Loki / ELK
Advanced · Observability · 88/100
Field notes: structured logs, cardinality discipline

Dynatrace / Pinpoint
Advanced · Observability · 82/100
Field notes: APM, trace-based alerting

ArgoCD
Expert · GitOps & CI/CD · 94/100
Field notes: ApplicationSets, multi-cluster, SSO

Jenkins
Expert · GitOps & CI/CD · 90/100
Field notes: shared libraries, dynamic agents

GitLab CI / Bitbucket
Advanced · GitOps & CI/CD · 88/100
Field notes: pipelines as code, cache optimizations

Terraform
Expert · IaC & Config · 92/100
Field notes: modules, workspaces, policy-as-code

Ansible
Advanced · IaC & Config · 88/100
Field notes: playbooks, roles, idempotent rollouts

Crossplane
Working · IaC & Config · 78/100
Field notes: cluster-as-API

AWS
Expert · Cloud · 94/100
Field notes: EKS, VPC, IAM, WAF, KMS, GuardDuty, Lambda, Route53

GCP
Advanced · Cloud · 85/100
Field notes: GKE, Apigee, Cloud Monitoring, IAM

Azure
Advanced · Cloud · 82/100
Field notes: AKS, Front Door, Arc, Key Vault

IAM / KMS / Secret Manager
Expert · Security & Compliance · 92/100
Field notes: least-privilege, rotation

SonarQube / Qualys / Cortex
Advanced · Security & Compliance · 86/100
Field notes: shift-left scanning

CloudTrail / GuardDuty / WAF
Advanced · Security & Compliance · 88/100
Field notes: continuous threat monitoring

Python
Advanced · Programming · 88/100
Field notes: Flask, Django, operators, tooling

JavaScript / React
Advanced · Programming · 82/100
Field notes: internal DevOps UIs

Bash / Go (ops)
Advanced · Programming · 84/100
Field notes: reliability tooling

LLM-assisted SRE
Advanced · AI for DevOps · 84/100
Field notes: alert triage, runbook synthesis

Langfuse
Advanced · AI for DevOps · 80/100
Field notes: trace + evaluate prompts in prod

RAG on runbooks
Working · AI for DevOps · 78/100
Field notes: searchable institutional memory

Redis (SME)
Expert · Databases & Data · 90/100
Field notes: replication, persistence, eviction at payments scale
- HA + sentinel/cluster topology hardening
- Client-side patterns and connection pooling discipline

Amazon RDS · Cloud SQL
Advanced · Databases & Data · 85/100
Field notes: managed MySQL/Postgres, backups, parameter groups

PostgreSQL · MySQL
Advanced · Databases & Data · 84/100
Field notes: schema design, slow-query triage, replication

MongoDB
Working · Databases & Data · 78/100
Field notes: replica sets, indexing, ops

AWS API Gateway
Advanced · API Management · 84/100
Field notes: REST/HTTP APIs, throttling, WAF integration

Kong
Advanced · API Management · 80/100
Field notes: plugins, auth, rate limiting

Apigee (GCP)
Working · API Management · 75/100
Field notes: policy chains, dev portal

Tech Leadership
Expert · Leadership & Delivery · 90/100
Field notes: leading 6–12 engineer pods across DevOps/SRE
- Roadmaps, design reviews, and platform OKRs
- Hiring loops, ramp-up plans, growth ladders

Mentoring & Coaching
Advanced · Leadership & Delivery · 88/100
Field notes: SRE-2 / SRE-3 growth, 1:1 cadence

Incident Management
Expert · Leadership & Delivery · 90/100
Field notes: on-call rotations, IM/comms, blameless postmortems

Cross-functional Partnership
Advanced · Leadership & Delivery · 88/100
Field notes: product, security, risk, compliance

FinOps / Cost Optimization
Advanced · Leadership & Delivery · 86/100
Field notes: 30% cloud-spend reductions; right-sizing & savings plans

Slack · Teams · Confluence · JIRA
Expert · Leadership & Delivery · 90/100
Field notes: documentation-first delivery

$ kubectl get pods -A | grep -v Running
$ argocd app sync platform --prune
promql> rate(http_requests_total[5m])
$ terraform plan -out=tfplan && terraform apply tfplan
$ helm upgrade --install obs charts/observability
$ kubectl rollout status deploy/api -n prod
$ aws eks update-kubeconfig --name lead-prod
$ k9s --context multi-region

JOURNEY

Eight years shipping production systems — IC and tech-lead.

Payments, telecom and SaaS. Ali Cloud → AWS migrations, multi-cluster Kubernetes, observability from scratch, and a DevOps UI that cut toil 70%.

SELECTED WORK

Systems I've designed & shipped.

A sample of the platforms, operators, migrations and dashboards that moved real numbers in production — across payments, telecom and SaaS.

Platform

DevOps Custom UI

Internal DevOps control plane used across multiple verticals — RBAC, compliance, vendor VPN, AWS monitoring, asset management.

70% reduction in toil
Adopted by 100+ engineers
RBAC with 12 scoped roles
PayU production-grade

highlights

Role-based access control with fine-grained scopes.
Live AWS monitoring UI — EC2, RDS, ELB, and cost.
Approval workflow for infrastructure creation.
Full audit history for security groups and user actions.

React · Flask · Ansible · AWS · MySQL

Migration

PG2 Cloud Migration

Lifted and re-platformed 50+ payment microservices from Ali Cloud to AWS on EKS with end-to-end CI/CD and monitoring.

50+ microservices migrated
Zero customer-visible downtime
~30% cost optimization
2× Rockstar Award (Paytm)

highlights

Designed the target architecture on EKS with multi-AZ resiliency.
Built the end-to-end CI/CD pipeline for all PG2 services.
Comprehensive monitoring + alerting before the first customer was cut over.
Automated infrastructure provisioning via Terraform modules.

AWS · EKS · Terraform · ArgoCD · Prometheus

Kubernetes

Scheduled HPA Operator

Kubernetes Operator with a Custom Resource Definition that patches Horizontal Pod Autoscalers on business-day schedules.

Zero manual HPA edits
Weekday/weekend schedule awareness
Audit logs per patch

highlights

Day-of-week schedule customization for fine-grained scaling.
Resource-name sorting for predictable ordering.
Structured logging for debugging and observability.
Zero-downtime rollouts via leader election.

Go · Kubernetes · CRD · controller-runtime
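
The day-of-week scheduling idea can be sketched as a pure function that maps the current time to the replica bounds an HPA should carry. This is a hedged Python illustration only (the operator itself is written in Go); the `Window` type, the `desired_bounds` helper, and every number below are hypothetical and not the operator's actual CRD schema:

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Window:
    """One schedule entry; field names are illustrative, not the real CRD."""
    days: frozenset      # weekday numbers, Monday=0 ... Sunday=6
    start_hour: int      # inclusive
    end_hour: int        # exclusive
    min_replicas: int
    max_replicas: int


def desired_bounds(windows, now, default):
    """Return (min, max) replicas from the first window that matches `now`."""
    for w in windows:
        if now.weekday() in w.days and w.start_hour <= now.hour < w.end_hour:
            return (w.min_replicas, w.max_replicas)
    return default


WEEKDAYS = frozenset(range(5))
WEEKEND = frozenset({5, 6})

# Hypothetical schedule: scale up for weekday business hours, down on weekends.
schedule = [
    Window(WEEKDAYS, 9, 21, min_replicas=10, max_replicas=40),
    Window(WEEKEND, 0, 24, min_replicas=3, max_replicas=12),
]
default = (5, 20)

# A reconcile loop would compare these bounds with the HPA's
# spec.minReplicas / spec.maxReplicas and patch only when they differ.
print(desired_bounds(schedule, datetime(2024, 1, 1, 10), default))  # a Monday
```

Keeping the schedule lookup free of side effects makes it easy to unit-test separately from the code that patches the cluster.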

AI + Platform

AI SRE Copilot

LLM-powered assistant that ingests alerts, correlates runbooks, and auto-executes safe remediations — with full Langfuse tracing.

1200+ alerts auto-remediated / week
MTTR down 42%
100% traced prompts
Serves 5000+ engineers

highlights

Alert deduplication and root-cause hypothesis generation.
Retrieval over institutional runbooks and postmortems.
Safe-remediation playbooks gated by blast-radius scoring.
Eval harness for prompt regressions on historical incidents.

Python · LLMs · Langfuse · RAG · Kubernetes

Observability

Business Metrics Dashboard

Custom Grafana + Prometheus dashboard for business-critical metrics with automated alerting and incident workflows.

Faster incident response
Tailored to business KPIs
Alerting on leading indicators

highlights

Business-facing KPIs alongside infra SLOs.
Automated alerts with context-rich payloads.
On-call playbooks linked directly from panels.

Prometheus · Grafana · Alertmanager

Security

2FA Firewall Integration

Two-factor authentication bolted onto the vendor-access firewall — streamlined onboarding with strict role-based controls.

Fewer unauthorized entries
Faster vendor onboarding
Role-scoped access

highlights

Role-based access for vendors with time-boxed permissions.
Streamlined onboarding flow with automated provisioning.
Full auditability of vendor sessions.

Firewall · LDAP · 2FA · Ansible

RECOGNITION

Awards & certifications.

Recognitions earned across Paytm, Airtel, and PayU — plus continued investment in the craft.

Award · Paytm · 2022

Rockstar Award — PG2 Migration & Monitoring

Led the PG2 migration from Ali Cloud to AWS alongside a full CI/CD and monitoring buildout.

Award · Paytm · 2023

Paytm Payment Infinity Award

Recognized for outstanding contributions in building Paytm Payment Infinity.

Award · Paytm · 2023

Rockstar Award — Infrastructure Automation

Automated PG2 infrastructure end-to-end with a best-in-class monitoring system.

Award · Airtel · 2021

Outstanding JVM Hygiene Award

Automated JVM hygiene across multiple microservices for measurable reliability and performance gains.

Award · PayU · 2019

Thank You Award — Big Billion Day

Delivered scalable infrastructure and dashboards that carried Big Billion Day traffic.

Cert · in progress · CNCF / Linux Foundation

Certified Kubernetes Administrator (CKA)

Cert · earned · Amazon Web Services

AWS Solutions Architect — Associate

Cert · earned · HashiCorp

Terraform Associate

Cert · in progress · Amazon Web Services

AWS Certified DevOps Engineer — Professional

Cert · in progress · CNCF / Linux Foundation

Prometheus Certified Associate (PCA)

Cert · earned · Codefresh

ArgoCD / GitOps Fundamentals

Education

B.Tech — Computer Science & Engineering

Greater Noida Institute of Technology (AKTU)

2013 — 2017

Sprinklr

Paytm (One97 Communications)

Bharti Airtel

PayU Payments Private Limited
