How Guardz Turned Silent Job Bottlenecks into Same-Day Fixes

As Guardz scaled, background jobs executed continuously under real production workloads, with performance characteristics that only emerged at runtime. When some of those jobs began slowing down, the challenge wasn’t detecting the regression; it was understanding the root cause clearly enough to fix it, fast.
Company Snapshot

Guardz is a cybersecurity platform built for MSPs, helping protect SMB environments through continuous monitoring, integrations, and automation. As the platform scaled rapidly, Guardz’s backend evolved into a system with multiple background jobs, third-party PSA integrations, and high-frequency workflows executing continuously in production.

In this environment, correctness wasn’t the challenge – sustaining predictable, explainable behavior across complex production jobs at scale was.

The Challenge

As Guardz scaled, background jobs became a critical part of the platform’s operational backbone, syncing data with PSA systems, processing integrations, and powering customer-facing workflows.

The problem wasn’t that jobs failed. It was that some jobs quietly became expensive.

Certain job runs would occasionally stretch from seconds to minutes – and sometimes hours – silently consuming resources, delaying downstream flows, and creating operational risk. When this happened, the team faced a familiar challenge: the issue only existed in production, under real customer data and execution paths.

“Our existing stack could show that a job was slow – but not what inside it was responsible. Investigations meant adding logs, redeploying, and chasing theories across tools – often taking days and still ending with uncertainty.”

Peri Fishgold
Software Engineer

Guardz needed a way to turn job performance regressions into decisive engineering work – without slowing delivery or increasing operational overhead.

One critical example was a ticket sync job worker. The team knew something was wrong, but every existing signal stopped at the same place: the job was slow. Causality was difficult to pinpoint, and adding logs only led to adding more logs, often across multiple iterations, turning each investigation into a multi-day guessing exercise. Investigations became hypothesis-driven and time-consuming, especially when the issue only appeared under real production traffic.

“What changed wasn’t just how fast we debugged – it was who could debug. Problems that used to require deep system knowledge became explainable to the whole team, directly from production.”

Tal Doron
Software Engineering Manager

The Solution: Function-Level Runtime Intelligence

Guardz introduced Hud’s Runtime Code Sensor to pinpoint where time and resources were actually spent inside production jobs, without adding logs, breakpoints, or configuration overhead.

Hud captures lightweight execution data continuously and automatically escalates to gathering deep forensic context when performance degradations or errors occur. This lets the team look directly at the specific situations and execution paths that lead to slowdowns, instead of spending time finding them.
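To make the idea of "lightweight by default, detailed on degradation" concrete, here is a minimal sketch in Python. It is not Hud's implementation or API: the `observed` decorator, the `SLOW_CALL_SECONDS` threshold, and the `fetch_ticket` function are all hypothetical names chosen for illustration. Every call updates a cheap per-function counter and timer, and only calls that cross the threshold record extra context.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical threshold for illustration; not a Hud setting.
SLOW_CALL_SECONDS = 0.05

# Cheap, always-on data: call count and cumulative time per function.
stats = defaultdict(lambda: {"calls": 0, "total_seconds": 0.0})

# Richer context, captured only for calls that cross the threshold.
slow_call_details = []


def observed(func):
    """Record lightweight stats on every call; keep details for slow calls."""

    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            entry = stats[func.__name__]
            entry["calls"] += 1
            entry["total_seconds"] += elapsed
            if elapsed >= SLOW_CALL_SECONDS:
                slow_call_details.append(
                    {
                        "function": func.__name__,
                        "seconds": elapsed,
                        "args": repr(args),
                        "kwargs": repr(kwargs),
                    }
                )

    return wrapper


@observed
def fetch_ticket(ticket_id, delay=0.0):
    # Hypothetical stand-in for real work; `delay` simulates a slow call.
    if delay:
        time.sleep(delay)
    return {"id": ticket_id}


for ticket_id in range(5):
    fetch_ticket(ticket_id)  # fast calls: counters only
fetch_ticket(99, delay=0.1)  # slow call: counters plus captured details
```

After this runs, `stats` holds the call count and total time for `fetch_ticket`, while `slow_call_details` contains a single entry describing the one slow call and its arguments. A production-grade sensor would need to handle concurrency, sampling, and overhead far more carefully than this sketch does.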

How Guardz Engineers Used Hud in Practice

Isolating the Job Bottleneck

When the same ticket sync job worker spiked again, Hud immediately surfaced which function dominated execution time. One function – updateTicket – was being invoked hundreds of thousands of times within a single job run.

Instead of guessing:

  • Engineers immediately saw which function was invoked an excessive number of times
  • They identified that the function was executing inside an unintended loop
  • They correlated the behavior with specific production executions
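The case study does not describe the exact shape of the unintended loop, so the following Python sketch is only one hypothetical way such a pattern can arise. The data model (`accounts`, `tickets`, `account_id`) and both sync functions are assumptions, and `update_ticket` merely stands in for the `updateTicket` function named above. The point is how a per-function invocation count makes the problem obvious where job duration alone does not.

```python
from collections import Counter

# Per-function invocation counts, the kind of signal that exposed the issue.
call_counts = Counter()


def update_ticket(ticket):
    # Stand-in for the real update; here it only records that it was called.
    call_counts["update_ticket"] += 1


def sync_with_unintended_loop(accounts, tickets):
    # Hypothetical bug: tickets are not filtered by account, so every
    # ticket is updated once per account instead of once in total.
    for account in accounts:
        for ticket in tickets:
            update_ticket(ticket)


def sync_fixed(accounts, tickets):
    # Each ticket is updated at most once, if its account is being synced.
    account_ids = {account["id"] for account in accounts}
    for ticket in tickets:
        if ticket["account_id"] in account_ids:
            update_ticket(ticket)


accounts = [{"id": n} for n in range(200)]
tickets = [{"id": n, "account_id": n % 200} for n in range(1000)]

sync_with_unintended_loop(accounts, tickets)
buggy_calls = call_counts["update_ticket"]  # 200 accounts * 1000 tickets

call_counts.clear()
sync_fixed(accounts, tickets)
fixed_calls = call_counts["update_ticket"]  # one call per ticket

print(f"unintended loop: {buggy_calls} calls, fixed: {fixed_calls} calls")
```

With 200 accounts and 1,000 tickets, the nested version makes 200,000 calls while the fixed version makes 1,000. Both versions finish without errors, which is why a regression like this can stay silent until someone looks at invocation counts.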

Fixing with Confidence

Once the root cause was clear, the fix itself was straightforward. The team adjusted the function’s usage pattern, deployed the change, and immediately observed the impact in production: job durations dropped back to expected levels.

The full cycle, from detection to resolution, took less than a day.

“Once we saw the specific execution details, there was nothing left to investigate. We fixed it and moved on.”

Peri Fishgold
Software Engineer

How Guardz Uses Hud with Agentic Workflows

Hud’s Runtime Code Sensor gave Guardz function-level execution data directly from production – but the real shift came from how that data was used.

Instead of assembling context manually, engineers used Hud’s MCP to ask a precise, targeted question: recursively inspect functions called by the sync ticket job worker to identify bottlenecks and performance issues.

That query surfaced the specific function dominating execution time, along with its invocation pattern and runtime context. With the root cause clear, the workflow shifted from exploration to execution: engineers moved directly from identifying the problem to fixing it, confident that the change addressed the real issue.

“What mattered to me was making this repeatable. We don’t want production issues to depend on who’s available or who remembers the system best, they need to be explainable every time.”

Tal Doron
Software Engineering Manager

The Results

Operational impact

  • Bottlenecks identified directly in production
  • Clear attribution of performance issues to specific functions
  • Faster validation of fixes without additional instrumentation

Engineering workflow impact

  • Investigation time reduced from days to hours
  • No need to redeploy just to add logs
  • Less reliance on intuition and tribal knowledge

Organizational impact

  • Easier onboarding to complex job flows
  • Reduced dependence on a small number of “system experts”
  • A shared, explainable understanding of how critical jobs actually run

Conclusion

For Guardz, Hud didn’t just add context – it introduced a new way of approaching production problems, from the first signal to the final fix. As the team rolls this workflow out across the engineering organization, they’re seeing how runtime context can turn opaque performance issues into explainable, actionable work.

Performance regressions in critical jobs are no longer opaque, slow, or risky to investigate. Engineers can move directly from detection to explanation to fix, grounded in real production execution – without adding logs, redeploying, or guessing.

This shift compressed the full debugging cycle from days to less than a day, reduced reliance on a small set of experts, and gave the team confidence to move faster without accumulating hidden operational debt.

“AI helps us move faster only when it’s grounded in reality. Having accurate runtime context means our engineers, and our AI workflows, can reason about real production behavior and act on it safely.”

Victor Trakhtenberg
VP Engineering

As Guardz continues to scale its platform and integrations, runtime context is now part of its everyday engineering workflow, guiding what gets investigated, what gets fixed, and how confidently changes ship to production. It turns complex production behavior into something engineers can reason about, act on, and trust.
