Streaming Without Compromise: Head of Reliability Engineering on SRE, Microservices, and Scalable Architecture

You're reading Entrepreneur India, an international franchise of Entrepreneur Media.

Mayflower is a global FunTech company taking the entertainment industry to the next level. Its flagship product is a live-streaming platform. Mayflower's CDN processes over 10,000 parallel input streams and distributes approximately 100,000 output streams. Downtime is unacceptable—every delay means losing users.

Alexandr Hacicheant, Head of Reliability Engineering, ensures system stability and fault tolerance. He implements practices that minimize risks, let developers sleep peacefully at night, and simultaneously optimize business resources. "My job is to ensure the system doesn't just work—it must withstand peak loads and recover quickly from failures," he explains.

Alexandr shares his career journey, key projects, and best practices—SRE, microservices, and minimizing latency in live streaming.

From Developer to CTO

Before joining Mayflower, Alexandr worked remotely for several years with Russian and international companies as a backend developer. He specialized in solving critical issues—whether implementing urgent features or fixing system failures. "For example, when promo campaigns caused a surge in users, I had to ensure services could handle the traffic spike," he recalls.

In 2016, he moved to Cyprus and joined Mayflower. Starting as an engineer on a 15-person team developing and testing new features, he shifted focus to architecture optimization and bottleneck elimination as scaling challenges emerged. "I looked for ways to scale not by buying more servers but by improving our tech stack's efficiency," he says.

One of his initiatives was dedicating 30% of team time to technical debt. This improved system stability, reduced incidents, and enhanced engineers' work-life balance. "Before, employees often woke up at night to fix issues. We started addressing root causes—not just symptoms."

After several years, Hacicheant became CTO, overseeing the company's technical growth: managing tech leads and coordinating backend and client development, ML teams, and DevOps. Under his guidance, security improved, including the formation of a dedicated infosec team to take over work previously handled by infrastructure teams.

Alexandr and his team implemented automated vulnerability detection pipelines (SAST and SCA solutions) that scan project source code before production deployment, and established streamlined remediation processes for the findings. They also deployed a centralized access management system for company resources, and Hacicheant led a company-wide security awareness initiative built around interactive training sessions and meetups.

Under Alexandr's leadership, the development and operations teams also invested heavily in building a cloud platform and migrating applications to it. Compared with physical servers or virtual machines, the move brought better management and allocation of compute resources, automated scaling and failure recovery, and faster application and service deployment.

Resilience at the Architectural Level

After approximately three years as CTO, Alexandr took over leadership of Reliability Engineering. His primary goal now is to ensure service fault tolerance while establishing robust processes for failure recovery and analysis. "Ideally, outages shouldn't occur. But when they do, we need to identify the issue and recover quickly," he explains. Failures can stem from many causes, most often suboptimal code or hastily chosen architectures.

Alexandr's team identifies failure root causes during profiling and analysis, documents best practices, shares them company-wide, and automates detection of similar future issues. For example, they've implemented load-testing pipelines to evaluate code performance under multi-user loads and assess service readiness for peak traffic.
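As a rough illustration of what such a load-testing step might measure, the sketch below runs a callable from several simulated users in parallel and reports latency percentiles. It is a minimal example under stated assumptions, not Mayflower's pipeline: the `run_load_test` function, its parameters, and the choice of percentiles are all hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import quantiles
from typing import Callable


def run_load_test(call: Callable[[], None], users: int, requests_per_user: int) -> dict:
    """Invoke `call` from `users` parallel threads and summarize latency.

    Hypothetical helper: `call` stands in for one request to the service
    under test. Needs at least two samples in total.
    """

    def one_user() -> list:
        latencies = []
        for _ in range(requests_per_user):
            start = time.perf_counter()
            call()
            latencies.append(time.perf_counter() - start)
        return latencies

    # One thread per simulated user, so requests overlap in time.
    with ThreadPoolExecutor(max_workers=users) as pool:
        futures = [pool.submit(one_user) for _ in range(users)]
        samples = [lat for f in futures for lat in f.result()]

    cuts = quantiles(samples, n=100)  # 99 cut points: cuts[k - 1] is the k-th percentile
    return {"count": len(samples), "p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


if __name__ == "__main__":
    # Stand-in for a real request: sleep for about one millisecond.
    print(run_load_test(lambda: time.sleep(0.001), users=5, requests_per_user=4))
```

A pipeline could compare the reported percentiles against a latency target and fail the build when they regress; the thresholds would depend on the service.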

Under Alexandr's guidance, the team established a three-tier technical support system:

- First line: 24/7 monitoring team

- Second line: SRE team comprising developers and DevOps engineers for specific services

- Third line: Team leads and technical leads with broad expertise

"Initially, incidents frequently escalated to the third line. But as the first and second lines gained experience—writing postmortems (documents detailing timelines, conclusions, and preventive measures) and action items—escalations dropped dramatically," Hacicheant emphasizes. Collectively, these innovations reduced major incidents from weekly occurrences to no more than monthly.

A current priority for Alexandr is decomposing the monolith into microservices. Monolithic architecture—a single, tightly integrated system—simplifies development but severely hinders scaling and partial updates. Microservices, conversely, break applications into independent modules, each handling specific functions and deployable separately, communicating via APIs.

"Monoliths work initially when validating hypotheses quickly. But they eventually impede scaling, and a single failure can disrupt half your business processes," Alexandr explains. For this transition, he oversees technology selection and architectural decisions.

Hacicheant emphasizes that such decisions must always balance business requirements, technical considerations, and available resources, meaning both people and time. Chasing new technologies and ideas isn't always the best approach. "When testing a business hypothesis, it's better to leverage existing solutions and quickly assemble a makeshift system that meets requirements using the tech stack your team already knows. There will be time to refine architectural solutions once the service proves its business value. Otherwise, all resources that could have been invested in validating other hypotheses will be wasted," he stresses.

According to Hacicheant, a key step before adopting a microservice architecture is to thoroughly understand the system's business processes. This makes it possible to define service boundaries more precisely and to decompose the system along them. Services should also remain loosely coupled, so that a problem in one service doesn't degrade the entire system. Approaches such as asynchronous communication and event-driven architecture help achieve this.
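The loose-coupling idea can be sketched with a tiny in-process event bus. This is an illustrative assumption only: the `EventBus` class and the topic and handler names are hypothetical, and a production system would more likely use a message broker so that publishers never wait on consumers at all. The point shown here is narrower: one failing subscriber does not stop the others from handling the event.

```python
import asyncio
from collections import defaultdict
from typing import Awaitable, Callable

Handler = Callable[[dict], Awaitable[None]]


class EventBus:
    """Hypothetical in-process stand-in for a message broker."""

    def __init__(self) -> None:
        self._handlers = defaultdict(list)  # topic -> list of handlers

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._handlers[topic].append(handler)

    async def publish(self, topic: str, event: dict) -> list:
        """Deliver `event` to every subscriber of `topic`.

        Returns the names of handlers that raised, so the publisher can
        log them without being broken by a failing consumer.
        """
        handlers = self._handlers.get(topic, [])
        results = await asyncio.gather(
            *(handler(event) for handler in handlers),
            return_exceptions=True,  # isolate failures per handler
        )
        return [h.__name__ for h, r in zip(handlers, results) if isinstance(r, Exception)]


async def update_viewer_count(event: dict) -> None:
    print("viewer service saw stream", event["stream_id"])


async def send_notification(event: dict) -> None:
    raise RuntimeError("notification service is down")


if __name__ == "__main__":
    bus = EventBus()
    bus.subscribe("stream.started", send_notification)
    bus.subscribe("stream.started", update_viewer_count)
    failed = asyncio.run(bus.publish("stream.started", {"stream_id": 42}))
    print("failed handlers:", failed)
```

The publisher knows only the topic name, not which services consume it, which is what keeps the service boundaries independent.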

Simultaneously, Hacicheant is enhancing monitoring and improving service observability so the system can automatically identify where failures occur and alert the appropriate personnel. "The ultimate goal isn't manual log collection, but building an intelligent alert system that can independently diagnose what went wrong and where, then precisely notify the responsible team."
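The "notify the responsible team" part of that goal can be pictured as a lookup from the failing service to its owner. The sketch below is a deliberately simple assumption: the service names, team names, and the `route` function are invented for illustration, and the automated diagnosis itself is out of scope.

```python
from dataclasses import dataclass

# Hypothetical ownership map: service name -> on-call team.
OWNERS = {
    "ingest": "streaming-sre",
    "transcoder": "media-sre",
    "billing": "payments-sre",
}

# Assumption: anything without a known owner goes to the 24/7 first line.
FALLBACK_TEAM = "monitoring-24x7"


@dataclass(frozen=True)
class Alert:
    service: str
    summary: str


def route(alert: Alert) -> str:
    """Return the team that should be notified about this alert."""
    return OWNERS.get(alert.service, FALLBACK_TEAM)


if __name__ == "__main__":
    print(route(Alert("transcoder", "p99 latency above target")))
    print(route(Alert("unknown-service", "health check failing")))
```

Keeping the ownership map in one place means an alert never goes unrouted: a missing entry falls back to the monitoring team instead of being dropped.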

SRE: A Systemic Approach to Failures

Under Alexandr's leadership, Mayflower adopted Site Reliability Engineering (SRE) practices—an engineering methodology balancing feature velocity with service stability. Core principles acknowledge that failures are inevitable; teams must minimize impact, automate responses, and prevent recurrences.

At the core of SRE are three key concepts: the SLO (Service Level Objective), the target level of service availability or performance; the SLI (Service Level Indicator), the metric used to measure the service against that target; and the error budget, the amount of failure the SLO permits over a given period. If the budget is exhausted, the system automatically triggers an alert, and the CI/CD pipeline may suspend further deployments.
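The relationship between these three concepts is simple arithmetic, shown in the sketch below for a request-based availability SLO. The `SloWindow` class and its field names are hypothetical, and real setups often measure over rolling time windows rather than a single request count.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts for one measurement window (hypothetical structure)."""

    slo_target: float      # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int
    failed_requests: int

    def sli(self) -> float:
        """Measured success ratio for the window."""
        if self.total_requests == 0:
            return 1.0
        return 1 - self.failed_requests / self.total_requests

    def error_budget(self) -> float:
        """Number of failed requests the SLO tolerates in this window."""
        return (1 - self.slo_target) * self.total_requests

    def budget_remaining(self) -> float:
        return self.error_budget() - self.failed_requests

    def deployments_allowed(self) -> bool:
        """A pipeline gate might block releases once the budget is spent."""
        return self.budget_remaining() > 0


if __name__ == "__main__":
    window = SloWindow(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
    print(f"SLI: {window.sli():.4%}")
    print(f"Budget remaining: {window.budget_remaining():.0f} requests")
    print("Deployments allowed:", window.deployments_allowed())
```

With a 99.9% target over one million requests, the budget is about 1,000 failed requests, so 400 failures leave room for further releases while 1,500 would not.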

SRE practice includes the preparation of runbooks — step-by-step guides for resolving common issues. After an incident, a postmortem is created. In addition, SRE promotes gradual rollouts: updates are initially delivered to 5–10% of users, and only if the system remains stable are they rolled out more broadly. If issues are detected, tools like Spinnaker can automatically roll back the changes.
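A gradual rollout needs two decisions: which users see the new version, and whether to continue or roll back. The sketch below shows one common way to make both, assuming hash-based user bucketing and a simple error-rate comparison. The function names and the tolerance value are hypothetical, and this is not how Spinnaker or any particular tool implements it.

```python
import hashlib


def in_canary(user_id: str, percent: int) -> bool:
    """Deterministically place a user in the first `percent` of 100 buckets.

    Hashing keeps the assignment stable: the same user always gets the
    same answer for a given percentage.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


def next_step(canary_error_rate: float, baseline_error_rate: float,
              tolerance: float = 0.005) -> str:
    """Return "rollback" if the canary is clearly worse than the baseline.

    `tolerance` is an assumed allowance for normal noise between groups.
    """
    if canary_error_rate > baseline_error_rate + tolerance:
        return "rollback"
    return "promote"


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    share = sum(in_canary(u, 10) for u in users) / len(users)
    print(f"share of users in canary: {share:.1%}")
    print(next_step(canary_error_rate=0.02, baseline_error_rate=0.01))
```

Because bucketing is deterministic, raising the percentage from 5 to 10 keeps the original canary users in the group and only adds new ones.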

This approach helps companies reduce the number of outages, accelerate recovery times, and improve user satisfaction. Instead of chaotic late-night firefighting and stress, SRE brings structure, transparency, and predictability. As Alexandr emphasizes, implementing SRE provides not only technical but also cultural benefits: it enhances collaboration across teams and reduces burnout.
