Kratik Jain

DevOps | Cloud | Infrastructure | SRE

✨ Passionate DevOps Engineer with a sound understanding of the cloud-native ecosystem. Loves to tinker with containers. I make life easier for Devs so I can work peacefully on the actual tasks and not on "it works on my machine!". Also, when someone asks me for coffee ☕️ I respond with Error 418 🤷‍♂️. Full-time engineer, part-time memer.


Work Experience

Infrastructure Engineer @ Kutumb App

Bengaluru | May 2022 - Present

As part of the Infra Team at one of the fastest growing organizations, serving approximately 4M Daily Active Users, I honed my skills while actively contributing to reliability improvements. Working with open-source tools, I identified issues and improvements and contributed them back to the community. I took ownership of projects and facilitated knowledge exchange among team members, fostering a collaborative environment for continuous learning. I also ensured that our solutions and approaches were cost-effective.

  • Configured existing infra to use IaC (Terraform) with module composition and centralized remote state management on AWS S3.
  • Used ArgoCD with PR Generator to spin up preview environments on demand with all Secrets + Observability enabled.
  • Extended Marbot using Terraform to create numerous CloudWatch Slack Alerts covering almost every AWS resource that needed monitoring.
  • For clear visibility, created cost reports and dashboards that pinpoint the exact services/resources driving up our $$$.
  • Automated the setup of full-fledged EKS Prod/Stage clusters with all EKS Plugins + Node Groups + Logging/Monitoring/APM Stack using Terraform/Helmfile. Built a dynamic module so the setup can be reused as many times as needed without repeating the TF code.
  • Gathered the requirements and migrated three kOps k8s clusters (dev, stage and prod) into two EKS clusters (stage and prod) with separate clusterDomains and an OpenVPN (Pritunl) setup.
  • Attained K8s Node Autoscaling nirvana using Karpenter.
  • Set up precise, actionable alerts covering the important aspects of the infra.
  • Deployed Pritunl VPN to give devs, QAs and PMs secure connectivity to the systems. It made all k8s FQDNs, Cluster IPs and AWS VPC IPs reachable from local machines; prior to this we were using plain OpenVPN. (Blog)
  • Self-hosted GitHub Actions Runners with dynamic provisioning of runners based on the queued CI jobs.
  • Set up HashiCorp Vault + Vault Secrets Operator for easier management of Kubernetes Secrets. (Blog)
  • Implemented Argo Workflows with a default handler so that any workflow retries itself 2 times before failing completely, after which an alert is sent to a dedicated Slack channel.
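
The retry-then-alert behaviour from the last bullet can be illustrated outside Argo with a minimal Python sketch. This is an analogy for the control flow only, not the Argo Workflows configuration itself; `run_with_retries` and `send_alert` are hypothetical names, and the alert callback stands in for the Slack notification.

```python
from typing import Callable, Optional


def run_with_retries(
    task: Callable[[], str],
    send_alert: Callable[[str], None],
    max_retries: int = 2,
) -> Optional[str]:
    """Run task once, retry up to max_retries times, alert if every attempt fails."""
    last_error: Optional[Exception] = None
    # One initial attempt plus max_retries retries.
    for _ in range(1 + max_retries):
        try:
            return task()
        except Exception as exc:  # any failure counts as a failed attempt
            last_error = exc
    # Only reached when every attempt failed: notify once, then give up.
    send_alert(f"workflow failed after {max_retries} retries: {last_error}")
    return None
```

A task that always fails is attempted three times in total (one run plus two retries) and triggers exactly one alert; a task that succeeds on any attempt triggers none.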

Software Engineer @ Cuelogic Technologies - An LTI Company

Pune | Dec 2019 - May 2022

Worked on multiple projects of varying scale across different domains and geographies.
Along the way, I grew from apprentice to architect, weaving together the threads of development and operations to achieve utmost efficiency.

  • Automated creation of infra from scratch (ECS Cluster, Nodes and Services) using AWS CloudFormation and Ansible.
  • Used a single Load Balancer for multiple Ingresses via "Ingress Group" to cut costs.
  • Optimized CI/CD pipelines - Build Once, Deploy Everywhere.
  • Built a custom GitHub Action that sends Slack alerts on CI/CD pipeline success/failure.
  • Migrated a three-tier application from bare metal servers to a Docker Swarm cluster.
  • Created a self-exploding testing infra stack using AWS CloudFormation. Also wrote a blog on this.
  • Set up NGINX as a reverse proxy for a Django app served by Gunicorn.
  • Ensured best practices were followed in my projects.
  • Contributed to some Python projects.

Intern @ Cuelogic Technologies

Pune | July 2019 - Dec 2019

During my internship, I was trained on Linux, Networking, Git, Docker and Kubernetes. I worked on a project where we ingested TBs of logs daily to power a SOC tool. My job was to keep our self-hosted Kubernetes cluster running reliably. I was appreciated for a tool I created that sends an hourly Slack alert, collecting data from various sources to report precise system health status (Kafka lag, query performance, NotReady nodes and pods, etc.).

  • Replicated the exact prod cluster on smaller machines using Vagrant & Ansible.
  • Debugged various K8s-related issues, e.g. IPAM exhaustion in Weave CNI.
  • Wrote a CronJob to clean up some dangling resources.

Projects

MySQL Awesome Stats Collector (MASC)

Dec 2025

Engineered a self-hosted, agentless diagnostics tool for MySQL. It executes parallel collection of InnoDB status, global metrics, and processlists across multiple hosts. Features side-by-side job comparison with delta highlighting, scheduled collections (crons), and detailed connection analysis by IP/User. Built using FastAPI, Alpine.js, and TailwindCSS. Available on GitHub and PyPI (pip).

ElastiCache Hot Shard Debugger

Dec 2025

Developed a sophisticated web-based tool to identify and debug hot shard issues in AWS ElastiCache clusters. Leverages the Redis MONITOR command with a threaded background runner to capture real-time traffic across all shards simultaneously. Features a modern UI with interactive Chart.js visualizations, time-series timeline analysis, and side-by-side job comparison. Built with FastAPI and a hybrid SQLite architecture for isolated data storage. Available on GitHub and PyPI (pip).
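
The core idea can be sketched roughly as follows, assuming MONITOR output has already been captured per shard as lines of text. This is not the tool's actual code: `parse_monitor_line`, `commands_per_shard`, `hot_shards` and the `factor` threshold are hypothetical names chosen for illustration.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List

# MONITOR prints lines like:
#   1700000000.000001 [0 10.0.0.5:50000] "GET" "user:42"
# Each double-quoted token is the command name or one of its arguments.
_ARG = re.compile(r'"((?:[^"\\]|\\.)*)"')


def parse_monitor_line(line: str) -> List[str]:
    """Return the quoted tokens of one MONITOR line (command first)."""
    return _ARG.findall(line)


def commands_per_shard(captured: Dict[str, Iterable[str]]) -> Counter:
    """Count parsed commands for each shard's captured lines."""
    counts: Counter = Counter()
    for shard, lines in captured.items():
        counts[shard] = sum(1 for line in lines if parse_monitor_line(line))
    return counts


def hot_shards(counts: Counter, factor: float = 2.0) -> List[str]:
    """Flag shards whose command count exceeds factor times the mean."""
    if not counts:
        return []
    mean = sum(counts.values()) / len(counts)
    return sorted(s for s, n in counts.items() if n > factor * mean)
```

Comparing each shard against the cluster mean is only one possible heuristic; per-key counts from the same parsed tokens would show which keys drive the imbalance.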

AWS Multi-Level Cost Analyzer

Jan 2026

Developed a Streamlit-based internal cost analysis tool allowing granular drill-downs into AWS spend. Architected with a persistent SQLite cache to reduce API calls and an NGINX sidecar for secure basic authentication. Features dynamic filter selection, "Top 5 Cost Movers" insights, and automated markdown report generation. Deployed on Kubernetes with IRSA for secure AWS API access.

Smart RDS Viewer

Aug 2025

Built a professional full-screen terminal CLI for real-time Amazon RDS monitoring. It features live pricing integration, interactive column sorting, and Reserved Instance (RI) utilization analysis. Implemented a smart 24-hour pricing cache and responsive terminal design using Python and Rich. Available on GitHub and PyPI (pip).
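
A 24-hour pricing cache of this kind can be sketched as a small time-to-live wrapper around a price lookup. This is an in-memory illustration under assumptions, not the tool's implementation: `PriceCache`, the `fetch` callback and the injectable `clock` are hypothetical, and the real tool may persist its cache differently.

```python
import time
from typing import Callable, Dict, Tuple


class PriceCache:
    """Cache prices per instance class and refetch after ttl_seconds."""

    def __init__(
        self,
        fetch: Callable[[str], float],
        ttl_seconds: float = 24 * 60 * 60,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._fetch = fetch          # hypothetical pricing lookup
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so expiry can be tested
        self._entries: Dict[str, Tuple[float, float]] = {}

    def get(self, instance_class: str) -> float:
        now = self._clock()
        hit = self._entries.get(instance_class)
        # Serve from cache while the entry is younger than the TTL.
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        price = self._fetch(instance_class)
        self._entries[instance_class] = (now, price)
        return price
```

Repeated lookups for the same instance class within 24 hours hit the cache, so the pricing source is queried at most once per class per day.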

Kratik SSH Resume (Portfolio on SSH)

Dec 2024

Architected a unique, interactive terminal-based portfolio served over SSH. Developed in Go using the Charmbracelet ecosystem (Bubble Tea, Wish, Lip Gloss). Features a tabbed UI for cross-section navigation, lipgloss styling for terminal aesthetics, and a custom SSH server implementation. Connect via ssh.kratik.dev

Certifications

Amazon Web Services