Kratik Jain

DevOps | Cloud | Infrastructure | SRE


✨ Passionate DevOps Engineer with a sound understanding of the cloud-native ecosystem. Loves to tinker with containers. I make Devs' lives easier so they can work peacefully on the actual tasks and not on "it works on my machine!". Also, when someone asks me for coffee ☕️ I respond with Error 418 🤷‍♂️. Full-time engineer, part-time memer.

Work Experience

Infrastructure Engineer @ Kutumb App

Bengaluru | May 2022 - Present

As one of the two members of the Infra Team serving approximately 4M Daily Active Users, I honed my skills while actively contributing to reliability enhancements. Working with open-source tools, I identified issues and improvements and contributed them back to the community. I took ownership of projects and facilitated knowledge exchange among team members, fostering a collaborative environment for continuous learning. I also ensured that our solutions and approaches were cost-effective.

  • Configured existing infra to use IaC (Terraform) with module composition and centralized remote state management on AWS S3.
  • Used ArgoCD with PR Generator to spin up preview environments on demand with all Secrets + Observability enabled.
  • Extended Marbot using Terraform to create numerous CloudWatch Slack Alerts covering almost every AWS resource that needed to be monitored.
  • Created effective cost reports and dashboards to give clear visibility into the exact services/resources that were increasing our $$$.
  • Automated the setup of full-fledged EKS Prod/Stage clusters with all EKS Plugins + Node Groups + Logging/Monitoring/APM Stack using Terraform/Helmfile. Built a dynamic module for this so it can be reused as many times as needed without repeating the whole TF code.
  • Migrated three kOps k8s clusters (dev, stage and prod) into two EKS clusters (stage and prod) with separate clusterDomains and OpenVPN (Pritunl) setup.
  • Attained K8s Node Autoscaling nirvana using Karpenter.
  • Set up precise, actionable alerts covering important aspects of the infra.
  • Deployed Pritunl VPN to give devs, QAs and PMs secure connectivity to the systems. (Prior to this we were using plain OpenVPN. Read blog. It made all k8s FQDNs, Cluster IPs and AWS VPC IPs reachable from local machines.)
  • Self-hosted GitHub Actions Runners, with runners provisioned dynamically according to the queued CI jobs.
  • Set up HashiCorp Vault + Vault Secrets Operator for easier management of Kubernetes Secrets. (Blog)
  • Implemented Argo Workflows with a default handler, so that any workflow retries itself 2 times before failing completely, after which an alert is sent to a dedicated Slack channel.

Software Engineer @ Cuelogic Technologies - An LTI Company

Pune | Dec 2019 - May 2022

Worked on multiple projects of varying scale across different domains and geographies.
Along the way, I progressed from apprentice to architect, weaving together the threads of development and operations to improve efficiency.

  • Automated creation of infra from scratch (ECS Cluster, Nodes and Services) using AWS CloudFormation and Ansible.
  • Used a single Load Balancer for multiple Ingresses via "Ingress Group" to save costs.
  • Optimized CI/CD pipelines - Build Once, Deploy Everywhere.
  • Built a custom GitHub Action to send Slack alerts on CI/CD pipeline success/failure.
  • Migrated a three-tier application from bare-metal servers to a Docker Swarm cluster.
  • Created a self-exploding testing infra stack using AWS CloudFormation. Also wrote a blog on this here.
  • Set up NGINX as a reverse proxy for a Django app served by Gunicorn.
  • Ensured best practices were followed in my projects.
  • Contributed to some Python projects.

Intern @ Cuelogic Technologies

Pune | July 2019 - Dec 2019

During my internship, I was trained on Linux, Networking, Git, Docker and Kubernetes. I worked on a project where we ingested TBs of logs daily to power a SOC tool. My job was to keep our self-hosted Kubernetes cluster running reliably. I was recognized for a tool I created that sent an hourly Slack alert, collecting data from various data sources to give a precise system health status (Kafka Lag, query performance, NotReady nodes and pods, etc.).

  • Replicated the exact prod cluster on smaller machines using Vagrant & Ansible.
  • Debugged various K8s related issues, e.g. - IPAM exhaustion of Weave CNI.
  • Wrote a CronJob to clean up dangling resources.


Covid Vaccine Availability Notifier

Sep 2020

During Covid, booking vaccine slots was chaotic. I tackled it with a Python script using the CoWin API, running on AWS Lambda. Scheduled to run every minute with AWS EventBridge, it alerted me about available slots via Slack and a PagerDuty call.

Automated Check-in/Check-out for Zoho People

Oct 2020

My organization used Zoho HRMS, and we had to check in and check out on the portal every day, which was a bit annoying. What started as a fun project to test the power of automation became a Python program that uses Selenium WebDriver to automate the login process on the portal. Driving a headless Chrome instance, it entered my credentials, navigated to the check-in page and completed the check-in procedure on my behalf.

Fully Automated "Kubernetes the hard way"

July 2021

Following Kelsey Hightower's Kubernetes The Hard Way, I built a self-hosted Kubernetes infrastructure entirely from scratch on AWS, using Terraform for infrastructure provisioning and Ansible to automate all the necessary setup steps.



Amazon Web Services