Company Snapshot
Monday.com provides a customizable Work Operating System designed to help organizations plan, run, and track work across the business. Through no-code and low-code tools, Monday enables teams to build workflows, apps, and automations tailored to their needs.
At the heart of Monday.com’s platform sits the Automations Service – a high-throughput, distributed system that processes millions of customer-defined workflows across SQS consumers. Engineered for flexibility, responsiveness, and high performance, it evolved into an environment where real-time, function-level execution data could meaningfully accelerate engineering and operations.
The Challenge
The Automations team manages one of Monday.com’s most powerful and widely used capabilities: a distributed automation engine that executes millions of customer-defined workflows. The system’s flexibility is its strength – but it also means performance can vary significantly across different automations and customer accounts. Over time, performance outliers emerged: specific automations taking much longer than others, intermittent CPU spikes, behaviors visible only under certain traffic patterns.
Investigating these issues was complex – requiring deep system understanding, cross-tool drilling (Datadog, Redis, database queries), and often missing key context due to sampling or incomplete traces. These investigations could take days, and there was no repeatable playbook for getting to root cause quickly.
“An alert tells us that something is wrong, but not why it happens. In a complex system like ours, there are so many influencing parameters and different paths to investigate – the infrastructure, specific flow parameters, the database, and more – each living in a different place. Understanding where the issue came from usually takes most of the time while tackling a bug or incident. And systematically gathering and serving this context to our coding agents is complicated too – there’s a lot of noise in the data, and it’s hard to tell upfront which context will be relevant to solve the issue.”
The team already had robust monitoring but wanted to complement it with a way to:
- Automatically detect and understand the root cause of performance spikes
- Save debugging time by bringing the engineer to the function level
- Easily compare gradual-rollout versions with the baseline, both running in the real production environment
In short, they sought visibility that matched the sophistication of the system itself.
The Solution – Function-Level Runtime Intelligence
Monday.com’s engineering team evaluated approaches that could provide deeper production visibility without adding operational overhead or requiring complex setup. They needed a product that could surface the deep data they needed during specific behaviors – performance degradations and CPU spikes.
Monday.com chose Hud’s lightweight Runtime Code Sensor, enabling the team to move quickly while maintaining clarity in highly dynamic systems.
“We get functional observability without paying for ingested spans on everything. The fact that we get it for every function – not only for high-level endpoints like traditional APMs – helps us get it out of the box without manually opting in. We just get all of this information for all functions, regardless of how granular they are, even for underlying packages.”
Moshik Eilon
Group Tech Lead at monday.com
Hud’s technology captures minimal data during normal operation, but when an error or a performance degradation occurs, it gathers the deep forensic data needed to understand exactly what happened – data that can be used by engineers or by AI coding agents to resolve the issue. It integrates directly with the IDEs Monday.com engineers use, presenting the data where engineers and agents work; with Slack; and through MCP for natural-language inquiry. It also includes a complete web product for viewing all issues and their root causes.
“Hud eliminated our voodoo incidents – mysterious CPU spikes that required custom profiling tools and days of investigation. Now engineers ask Cursor ‘why is this endpoint slow?’ and get immediate, deployment-correlated answers. AI-powered root cause analysis turned days of drilling through tools into minutes.”
How Monday Engineers Use Runtime Intelligence
Automations Group
Earlier visibility into performance degradations in noisy environments
During a routine gradual rollout, the team saw the value of production runtime context immediately. One of the high-volume SQS consumers was slower in the new version. With function-level visibility in place, the team received a clear signal just 14 minutes after deployment, without needing to configure anything. The alert identified the precise function contributing to the slowdown, coupled with the forensic details needed to understand and resolve the issue with a coding agent.
What previously could have taken days to discover and would have required a broad investigation became an early, actionable alert containing what was needed to resolve the issue.
“We saw the regression within 14 minutes and had the exact function and execution context we needed. Before Hud, this would’ve meant hours of hypothesis-driven debugging across multiple tools.”
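The baseline-versus-canary comparison described above can be sketched as a simple per-function latency check. The function names, latency samples, and 1.5x threshold below are hypothetical illustrations, not Hud’s actual detection logic:

```python
from statistics import mean

def find_regressions(baseline, canary, threshold=1.5):
    """Flag functions whose mean latency in the canary version exceeds
    the baseline mean by more than `threshold`x. Latencies are in ms."""
    regressions = {}
    for fn, base_samples in baseline.items():
        canary_samples = canary.get(fn)
        if not canary_samples:
            continue  # function not yet exercised by the canary version
        base_mean = mean(base_samples)
        canary_mean = mean(canary_samples)
        if canary_mean > base_mean * threshold:
            regressions[fn] = (base_mean, canary_mean)
    return regressions

# Hypothetical per-function latency samples from both versions
baseline = {"consume_message": [12.0, 11.5, 12.3], "parse_payload": [1.1, 0.9]}
canary   = {"consume_message": [31.0, 29.4, 30.2], "parse_payload": [1.0, 1.2]}
print(find_regressions(baseline, canary))
```

In this sketch, only the consumer function crosses the threshold and is flagged; in practice, a runtime sensor performs this comparison continuously across every function rather than on hand-collected samples.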
Understanding bottlenecks with precise execution timing
In another instance, the team identified unexpectedly long lock durations tied to redlock usage. The extended timing was not intuitive from metrics alone, but looking at function-level data for this time frame made the behavior clear within moments. The solution was quick to follow.
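The lock-duration finding above can be illustrated with a small timing wrapper. This is a hypothetical sketch that uses a local `threading.Lock` as a stand-in for a distributed redlock client; the names and thresholds are illustrative, not Hud’s or Monday.com’s actual instrumentation:

```python
import threading
import time
from contextlib import contextmanager

@contextmanager
def timed_lock(lock, name, warn_ms=100.0):
    """Acquire `lock` and, on release, report how long it was held.
    Unexpectedly long hold times surface contention that aggregate
    metrics alone can hide."""
    lock.acquire()
    start = time.perf_counter()
    try:
        yield
    finally:
        held_ms = (time.perf_counter() - start) * 1000
        lock.release()
        if held_ms > warn_ms:
            print(f"lock '{name}' held for {held_ms:.1f} ms")

# Local stand-in: a threading.Lock rather than a distributed redlock
lock = threading.Lock()
with timed_lock(lock, "automation-run", warn_ms=0.0):
    time.sleep(0.01)  # stand-in for work done while holding the lock
```

Function-level timing of this kind makes it obvious when the work inside the critical section, rather than lock acquisition itself, is what stretches the hold duration.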
Clarity during endpoint behavior analysis
When certain endpoints displayed intermittent failures, runtime intelligence provided an exact view of which functions were the root cause of the flows that resulted in an error. Having this information readily available helped the team quickly understand and resolve those errors, using AI coding agents that could draw on the runtime context directly.
Using runtime data to guide refactoring decisions
Beyond incident response, the team leveraged runtime context to identify code that was no longer being executed in production. For instance, when considering whether to modernize a legacy module, engineers could verify which parts of it were actually used in production – and whether it could safely be simplified or removed. This execution-based perspective gave the team confidence to prioritize refactoring efforts based on actual usage patterns rather than assumptions, supporting cleaner, more maintainable code over time.
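The execution-based cleanup described above can be approximated with simple invocation counting. A minimal sketch, assuming hypothetical legacy functions and a process-local counter (a real runtime sensor aggregates this across all production traffic automatically):

```python
from collections import Counter
from functools import wraps

call_counts = Counter()

def track_calls(fn):
    """Record every invocation so unused code paths show up as
    zero-count entries when reviewing a module for cleanup."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        call_counts[fn.__qualname__] += 1
        return fn(*args, **kwargs)
    return wrapper

# Hypothetical legacy module: two code paths, only one still exercised
@track_calls
def legacy_formatter(text):
    return text.strip()

@track_calls
def legacy_escaper(text):
    return text.replace("&", "&amp;")

legacy_formatter(" hello ")
legacy_formatter("world")
print(dict(call_counts))  # legacy_escaper never appears: removal candidate
```

Any decorated function absent from the counts after a representative observation window becomes a candidate for simplification or removal.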
The Results
The Automations group at Monday.com strengthened its ability to understand and act on key production issues such as errors and performance degradations. While the platform was already highly observable, the addition of Hud’s runtime code sensor and function-level intelligence provided a whole new level of actionable insight – with minimal operational overhead and no need to write logs or configure monitors:
Operational benefits
- Earlier detection of performance degradations
- Clear root causes for behaviors visible only under real traffic
- Safer, better-controlled gradual rollout of new versions
Engineering workflow benefits
- Faster, more focused production issue investigations grounded in concrete execution data
- Reduced reliance on broad hypothesis-driven exploration
- Deeper understanding of service execution behavior across pods and flows
- Better-informed refactoring, modernization, and cleanup efforts guided by actual runtime context
Organizational benefits
- A stronger foundation for scaling platform reliability as the system evolves
- Insights that support ongoing innovation in key services
- New opportunities for AI-assisted engineering workflows built on structured runtime context
This evolution reflects Monday.com’s continued commitment to engineering excellence – delivering meaningful impact on visibility and operational intelligence consistent with the sophistication, scale, and ambition of its systems.
Conclusion
Monday.com’s Automations team operates one of the most dynamic and high-scale components of the Work OS platform. As customer adoption and system complexity increased, the team recognized the need to complement their existing observability practices with precise, function-level runtime context.
By incorporating this level of runtime data, the team transformed complex, time-intensive investigations into a repeatable playbook – reducing debugging cycles from days to minutes and enabling engineers to resolve performance issues with confidence.
This evolution reflects how modern engineering organizations scale: not just by building sophisticated systems, but by building the visibility infrastructure that makes those systems debuggable, maintainable, and continuously improvable.

