You Found It!

You clearly know your way around a terminal. Or the Konami Code. Either way, I like you already.

Let's build something reliable together.

Let's Talk
14+ Years of Experience
Open Source Repos
Companies
AWS Certifications

Dashboard

SRE Metrics

Real-time reliability metrics from the trenches. Because what gets measured, gets improved.

System Uptime
Last 365 days
MTTR
Mean Time to Recovery
Deployments
Zero-downtime releases
Error Budget
Remaining this quarter
Incidents Resolved
P1-P3 across all services
SLO Compliance
Across all service tiers
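The Error Budget metric above is usually derived from an availability SLO: whatever share of requests the SLO allows to fail is the budget, and every failure spends part of it. A minimal sketch, assuming a request-based SLI (good requests over total requests) measured over a single window; `SloWindow`, `error_budget_remaining`, and `allowed_downtime_minutes` are hypothetical names, not part of any particular tool.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts for one SLO window, e.g. a quarter (hypothetical structure)."""
    target: float         # SLO target as a fraction, e.g. 0.999 for 99.9%
    total_requests: int   # all requests eligible for the SLI
    failed_requests: int  # requests that violated the SLI (errors, too slow, etc.)


def error_budget_remaining(w: SloWindow) -> float:
    """Fraction of the error budget still unspent; negative once the budget is overspent."""
    if not 0.0 < w.target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    if w.total_requests == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_failures = (1.0 - w.target) * w.total_requests
    return 1.0 - w.failed_requests / allowed_failures


def allowed_downtime_minutes(target: float, days: int) -> float:
    """Time-based view of the same budget: minutes of full outage the SLO tolerates."""
    return (1.0 - target) * days * 24 * 60


if __name__ == "__main__":
    q = SloWindow(target=0.999, total_requests=1_000_000, failed_requests=250)
    print(f"{error_budget_remaining(q):.0%} of the error budget left")
    print(f"{allowed_downtime_minutes(0.999, 30):.1f} min of downtime per 30 days at 99.9%")
```

With a 99.9% target, one million requests leave room for about 1,000 failures, so 250 failures spend roughly a quarter of the budget; the same target over 30 days tolerates about 43.2 minutes of full outage.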

About

About Me

I don't just keep systems running — I make sure they never go down in the first place.

Gurpreet Singh

When millions of banking transactions flow through systems every day, there's zero room for "it works on my machine." That's where I come in. As Senior Staff SRE at NatWest Group, I'm the person who ensures that mission-critical financial platforms don't just survive — they thrive under pressure.

Over 14 years, I've gone from writing my first lines of code at IBM to architecting self-healing platforms that handle enterprise-scale workloads across regulated environments. The journey took me through MakeMyTrip's traffic surges, Myntra's peak sales, and Vertisystem's cloud transformations, each chapter sharpening a different edge of the blade.

I turn chaos into calm. Complex, distributed systems into predictable, autonomous platforms. My toolkit spans SRE, AIOps, FinOps, Kubernetes, and DevSecOps — but the real skill is knowing which lever to pull and when. I build SLO frameworks that make reliability measurable, observability pipelines that surface problems before users notice, and automation that fixes issues faster than humans can type.

Beyond the day job, I write about what I learn. Engineering Blueprints is my platform where I break down complex systems thinking into actionable guidance for engineers and architects who build at scale.

At NatWest, I'm scaling reliability as a product — partnering with engineering, security, and executive leadership to institutionalize SRE practices across the bank's digital platforms. Think SLOs, error budgets, incident governance, and cloud-native operating models — all in a heavily regulated environment where "move fast and break things" is not an option.

On the side, I'm deep into AIOps — exploring how LLMs and intelligent automation can reduce MTTR and make on-call less painful for engineering teams everywhere.

Skills

Expertise

Core competencies honed across 14+ years of building and operating large-scale distributed systems.

SRE & Reliability Engineering 95%
Cloud Architecture (AWS/GCP/Azure) 95%
Kubernetes & Container Orchestration 90%
Observability & Monitoring 95%
AIOps & Intelligent Automation 85%
System Design & Architecture 95%
DevSecOps & GitOps 90%
FinOps & Cloud Cost Optimization 90%
Terraform & Infrastructure-as-Code 90%
Technical Leadership & Mentoring 95%

Resume

Experience

From trainee to Senior Staff — every role taught me something that the last one couldn't.

Summary

Gurpreet Singh

14+ years. IBM to NatWest. From writing code to ensuring millions of transactions never fail. I've built, broken, and rebuilt systems across startups, travel giants, and global banks — each time making them more resilient than before.

Today, I lead SRE & Cloud Engineering at one of the UK's largest banking groups, where reliability isn't a feature — it's the product. My work sits at the intersection of engineering, automation, and business impact: SLO frameworks, AIOps-driven remediation, FinOps optimization, and zero-downtime delivery pipelines across regulated cloud environments.

I don't just build platforms. I build the culture, practices, and engineering muscle that keep them standing at scale.

  • South Delhi, Delhi, India
  • senghgurpreett[@]gmail.com

Bachelor's degree, Computer Science

2007 - 2010

Gayatri Vidya Parishad College of Engineering (Autonomous)

High School, Computer Science, Chemistry, Physics

2005 - 2007

Naval Children School

Junior College, Science

1995 - 2005

Kendriya Vidyalaya

AWS Certified Solutions Architect - Professional

AWS Cloud Quest: Cloud Practitioner

Data Science Fundamentals

Energy To Deliver

STAR (Super Talented Achievers)

Spotlight Recognition Award

Best Team

GoTripper of the Month

Professional Experience

Senior Staff SRE

Oct 2025 - Present

NatWest Group

  • Driving availability, resilience, and operational excellence across mission-critical banking platforms.
  • Institutionalizing SRE practices: SLOs, error budgets, observability, automation, and incident governance.
  • Shaping cloud-native operating models and strengthening regulatory resilience.
  • Scaling reliability as a product, reducing operational risk, and enabling safe delivery at speed.
Aug 2023 - Present
  • Technical leadership platform covering platform engineering, cloud-native architecture, DevSecOps, and SRE maturity models.
  • In-depth architectural writing and applied engineering frameworks for senior engineers and technology leaders.
Jan 2025 - Sep 2025
  • Led enterprise-scale cloud and platform transformation across multi-cloud and hybrid environments.
  • Introduced platform engineering and GitOps operating models, and standardized CI/CD and infrastructure patterns.
  • Embedded DevSecOps controls into delivery pipelines, improving compliance posture and release reliability.
Dec 2022 - Jan 2025
  • Evolved DevOps practices into mature SRE-led operating models using Terraform, Argo CD, and Helm.
  • Led FinOps-driven cost optimization and automated observability pipelines.
  • Deployed highly available Kubernetes platforms supporting global SaaS workloads.
Dec 2018 - Jan 2023
  • Led hybrid cloud architecture and modernization programs spanning AWS, Azure, and on-premises platforms.
  • Established reusable architecture patterns and mentored engineering teams in distributed systems design.
Aug 2016 - Nov 2018
  • Modernized high-traffic consumer platforms, migrating monolithic services to microservices on AWS with Docker.
  • Designed hybrid infrastructure solutions enabling seamless scaling during peak travel periods.
Nov 2014 - Aug 2016
  • Engineered backend systems for large-scale customer communication and contact center platforms.
  • Led a customer telephony integration program, connecting enterprise telephony with CRM and backend services.
May 2014 - Nov 2014
  • Led solution design and cloud CRM implementations for enterprise clients.
  • Delivered proofs of concept and custom integrations, and guided client migrations to cloud-based CRM platforms.
Nov 2011 - May 2014
  • Multi-role technical consultant supporting Oracle Service Cloud platforms across North America and international markets.
  • Designed global customer interaction workflows, self-service portals, and operational dashboards.
May 2011 - Nov 2011
  • Contributed to automation scripting, ITSM integrations, and internal platform enhancements.

Testimonials

What People Say

The most important skill in tech isn't software or hardware. It's building and maintaining relationships.

Gurpreet comes from a non-engineering background, yet he demonstrated more skill and expertise than an engineering graduate. He is a pleasure to work with: a quick learner, high in aspiration, enthusiastic, willing to question the status quo, and someone who delivers what he promises. Gurpreet showed a lot of maturity at trainee level, so I would strongly recommend him.

Ravi Motamarri

Delivery Manager / Jaguar

A fantastic colleague and friend. Gurpreet is an exceptionally quick learner with a growth mindset, constantly willing to find solutions to problems, however trivial or complex. Gurpreet has the ability to translate complex technical challenges into everyday business language that makes sense to all parties involved. A real pleasure to be around and to work with.

Lage Antony

CRM Consultant / Speridian

Working with Gurpreet was never disappointing! A committed person, he was always able to deliver. He always made room to discuss all options before making a decision. His programming skills reflected his open mind and creativity. You will find him a very easy and understanding person to work with.

Rahul Sharma

Customer Experience Solution Architect / Ex-Wipro

Gurpreet's creativity, attention to detail, and focus on delivery are wonderful. He is also a good leader with the ability to put together high-performing teams and retain talent. Gurpreet is a resourceful, creative, and solution-oriented person who frequently came up with new and innovative approaches to his assigned projects.

Dheerja Sharma

Sr. Program Manager / MakeMyTrip

I had the privilege of working with Gurpreet in the CRM team at Myntra for more than a year. He is a proactive, result-oriented, responsible, independent, and customer-focused strategist. An innovative perfectionist and technically sound, he is always ready to put all his energy and time into getting the job done. He has exceptional troubleshooting and analytical skills. He is a great asset to any company.

Vivek Brohma

Software Engineer / Spotify

From my understanding of and interactions with Gurpreet, he is a very enthusiastic and intellectual person. He is wise enough to use his sharp mind for the benefit of both his team and himself. On the personal front, Gurpreet is rational in his approach, straightforward, and generous.

Shivi Mittal

HR Business Partner / EZOPS Inc.

Gurpreet: highly diligent software architect and developer, superhuman being. Gurpreet is absorbed by technology, yet is warm and welcoming, a truly rare quality for someone so focused. Gurpreet is approachable, a true joy to work with, even when under the extreme pressure of a tough deadline. A solid technologist and an even better human being!

Gene Bond

Executive Director / iiSM.ORG

Projects

Open Source

Selected projects from my GitHub reflecting work in SRE, Platform Engineering, Cloud, AIOps, and FinOps.

Explorations

AI/ML & MLOps

Applied machine learning, MLOps pipelines, and AI-driven SRE — from fraud detection to self-healing platforms.

AI/ML & MLOps Projects

  • Architected a real-time fraud detection system leveraging SageMaker Pipelines for feature engineering, model training, and deployment.
  • Integrated DataZone for governed datasets and Kinesis Firehose for streaming ingestion, enabling sub-second detection of anomalous transactions.
  • Reduced false positives by 20%+ while maintaining 99.95% inference availability through multi-AZ SageMaker endpoints.
  • Designed a recommendation engine using SageMaker + Amazon Personalize, optimized with Kinesis Data Streams for real-time user behavior ingestion.
  • Built CI/CD pipelines for ML models with CodePipeline + CodeBuild + SageMaker — cutting model release cycles from weeks to days.
  • Delivered a 35% increase in user engagement metrics, directly driving customer conversion uplift.
  • Implemented GitOps-style workflows for ML: data versioning, model versioning (SageMaker Model Registry), and deployment automation (Argo CD + SageMaker endpoints).
  • Automated retraining triggers via EventBridge + Lambda, ensuring continuous model freshness from new production data.
  • Integrated model drift monitoring with SageMaker Model Monitor & custom Prometheus/Grafana dashboards.
  • Proactively triggered model retraining jobs upon detecting data distribution shifts or degraded accuracy, reducing business risk from outdated models.
  • Optimized ML training with spot-enabled SageMaker training jobs, reducing costs by 40% while meeting SLA deadlines.
  • Leveraged Step Functions + SageMaker for orchestrating multi-stage workflows (data preprocessing → training → evaluation → deployment).
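The drift-triggered retraining described above can be illustrated with a Population Stability Index (PSI) check that compares a feature's recent distribution against its training baseline. This is a stand-in for illustration, not what SageMaker Model Monitor computes internally; the bin edges, the 0.2 threshold (a commonly cited rule-of-thumb cut-off), and all function names are assumptions. In a pipeline like the one above, a gate such as `should_retrain` would sit behind the EventBridge + Lambda trigger.

```python
import math
from typing import List, Sequence


def bin_fractions(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Share of values in each bin [edges[i], edges[i+1]); the last bin also includes edges[-1].
    Values outside [edges[0], edges[-1]] are ignored in this simplified version."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            in_last_bin = i == len(counts) - 1 and v == edges[-1]
            if edges[i] <= v < edges[i + 1] or in_last_bin:
                counts[i] += 1
                break
    total = sum(counts) or 1  # avoid dividing by zero on an empty sample
    return [c / total for c in counts]


def psi(baseline: Sequence[float], current: Sequence[float],
        edges: Sequence[float], eps: float = 1e-6) -> float:
    """Population Stability Index: sum over bins of (cur - base) * ln(cur / base)."""
    score = 0.0
    for b, c in zip(bin_fractions(baseline, edges), bin_fractions(current, edges)):
        b, c = max(b, eps), max(c, eps)  # floor empty bins so the log stays finite
        score += (c - b) * math.log(c / b)
    return score


def should_retrain(baseline: Sequence[float], current: Sequence[float],
                   edges: Sequence[float], threshold: float = 0.2) -> bool:
    """Hypothetical retraining gate: flag drift when PSI exceeds the threshold."""
    return psi(baseline, current, edges) > threshold
```

PSI is symmetric in the two samples and cheap enough to run per feature on every scoring batch; production monitors typically also track model-quality metrics, since input drift alone does not prove accuracy has dropped.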

Business Impact

  • Fraud detection platform safeguarded millions of transactions daily with error-budget aligned reliability.
  • Recommendation engine improved user retention and revenue through hyper-personalized experiences.

Key Engagements

Cloud & Platform Transformation

  • Created and implemented the company's cloud and DevOps transformation strategy, enhancing operational agility and scalability.
  • Trained and mentored 80+ engineers in cloud, DevOps, and platform engineering best practices, boosting team productivity by 30%.
  • Automated 100+ CI/CD pipelines, Kubernetes clusters, and cloud infrastructure deployments, achieving 92% operational efficiency.
  • Reduced enterprise cloud spend by 30% through FinOps-driven cost governance frameworks.

Architecture & Reliability

  • Re-architected legacy monolithic infrastructures into microservices-based distributed systems, improving system update agility by 50%.
  • Improved platform performance by 70% post-transformation, driving better customer experience and system resilience.
  • Drove initiatives to enhance SLO achievement, reducing MTTR by 35% across critical services.
  • Designed and deployed a COVID-19 Vaccination Registration Platform for Indian and international governments on Azure under aggressive timelines.

Strategic & Executive Leadership

  • Orchestrated multi-quarter cloud strategy and platform engineering roadmaps, aligning with organizational OKRs, security mandates, and business KPIs.
  • Built and scaled global engineering capability by mentoring 80+ engineers across the US, India, and EMEA; instilled SRE culture through hands-on coaching and blameless postmortems.
  • Partnered with cross-functional leadership (product, security, compliance, finance) to shape investment decisions, drive FinOps governance, and model infrastructure capacity.
  • Owned reliability and incident response programs for mission-critical workloads; codified DR runbooks, led Chaos SRE programs, and reduced MTTR by 35%.
  • Operated as a strategic advisor bridging product, security, and business goals, while shaping architectural governance and platform modernization efforts.

Currently Exploring: LLMs, RAG, and ML for SRE

Predictive & Anomaly Detection

  • Anomaly detection on metrics/logs with Isolation Forest or auto-encoders to catch issues before threshold alerts fire.
  • Capacity forecasting with ARIMA/LSTM to drive autoscaler targets and FinOps rightsizing.
  • SLO burn-rate prediction models that forecast error-budget exhaustion and trigger early rollbacks.
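Burn-rate forecasting, as listed above, rests on a simple ratio: the observed error rate divided by the error rate the SLO allows. A minimal sketch, using a plain linear projection rather than a learned model; the 14.4 threshold is borrowed from the multiwindow, multi-burn-rate alerting example in Google's SRE Workbook (a 1-hour window consuming 2% of a 30-day budget), and the function names are hypothetical.

```python
import math


def burn_rate(error_rate: float, slo_target: float) -> float:
    """Budget consumption speed: 1.0 means the budget lasts exactly one SLO window."""
    return error_rate / (1.0 - slo_target)


def hours_to_exhaustion(remaining_fraction: float, rate: float,
                        window_hours: float = 30 * 24) -> float:
    """Linear projection: at burn rate r, a full budget lasts window_hours / r."""
    if rate <= 0:
        return math.inf  # budget is not being consumed
    return remaining_fraction * window_hours / rate


def should_roll_back(short_window_rate: float, long_window_rate: float,
                     threshold: float = 14.4) -> bool:
    """Multi-window check: both windows must burn fast, which filters out brief spikes."""
    return short_window_rate >= threshold and long_window_rate >= threshold
```

For example, a 1% error rate against a 99.9% SLO burns the budget about ten times faster than sustainable; with half the 30-day budget left, that projects to roughly 36 hours before exhaustion. A predictive model would replace the constant-rate assumption in `hours_to_exhaustion` with a forecast of the error rate itself.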

LLM & RAG Applications

  • RAG-powered ChatOps bot (LangChain + Qdrant) that answers incident questions from runbooks/RCA docs in Slack.
  • LLM-generated RCA summaries that condense logs, timeline, and PagerDuty notes into post-mortems.
  • LLM-generated chaos experiments (LitmusChaos) to broaden failure-mode coverage without manual scripting.
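The retrieval step behind a RAG ChatOps bot like the one above can be sketched as ranking runbook excerpts by similarity to the question, then handing the top matches to the LLM as grounding context. This sketch swaps the vector store (Qdrant) and embedding model for plain bag-of-words cosine similarity so it runs on the standard library alone; the runbook names, prompt wording, and function names are hypothetical, and the LLM call itself is left out.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokens(text: str) -> Counter:
    """Lower-cased bag of words; stands in for an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bags of words (0.0 when either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, runbooks: Dict[str, str], k: int = 2) -> List[str]:
    """Return the names of the k runbooks most similar to the question."""
    q = tokens(question)
    ranked = sorted(runbooks, key=lambda name: cosine(q, tokens(runbooks[name])), reverse=True)
    return ranked[:k]


def build_prompt(question: str, runbooks: Dict[str, str], k: int = 2) -> str:
    """Assemble the grounded prompt that would be sent to the LLM (call not shown)."""
    context = "\n\n".join(f"[{name}]\n{runbooks[name]}" for name in retrieve(question, runbooks, k))
    return ("Answer using only the runbook excerpts below. If they do not cover it, say so.\n\n"
            f"{context}\n\nQuestion: {question}")
```

Keeping the retrieved excerpt names in the prompt lets the bot cite which runbook an answer came from, which matters when on-call engineers need to verify advice quickly.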

Automation & Self-Healing

  • Self-healing workflows: ML predicts failure → Argo Workflows / AWS SSM runs automated remediation.
  • Root-cause clustering of trace IDs & log fingerprints (DBSCAN, t-SNE) to group correlated failures fast.
  • Reinforcement-learning alert tuning that adapts thresholds based on noise vs. signal feedback.
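Root-cause clustering usually starts with log fingerprints: masking the variable parts of a message (IDs, numbers, addresses) so that lines emitted by the same code path collapse into one template. The sketch below groups by exact fingerprint as a deliberately simpler stand-in for density-based clustering such as DBSCAN; the masking patterns and function names are assumptions for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Patterns for the variable parts of a log line. Order matters: UUIDs and
# hex addresses are masked before bare numbers so their digits are not split up.
_MASKS = [
    (re.compile(r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b", re.I), "<uuid>"),
    (re.compile(r"\b0x[0-9a-f]+\b", re.I), "<hex>"),
    (re.compile(r"\b\d+(?:\.\d+)?\b"), "<num>"),
]


def fingerprint(line: str) -> str:
    """Collapse a log line to its template by masking IDs, addresses, and numbers."""
    for pattern, token in _MASKS:
        line = pattern.sub(token, line)
    return line.strip()


def group_by_fingerprint(lines: Iterable[str]) -> List[Tuple[str, List[str]]]:
    """Group raw lines by template, largest group first."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        groups[fingerprint(line)].append(line)
    return sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True)
```

Exact-match grouping misses templates that differ by a word or two; that gap is where vectorizing the fingerprints and running a density-based clusterer earns its keep.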

Articles

Writings

Thoughts on SRE, Platform Engineering, Cloud Architecture, and building reliable systems at scale.

Download Resume