Yesterday evening, our Talos ATS monitoring systems triggered several alerts, which were immediately escalated as a Priority 1 incident. Our on-call development team began investigating straight away.
The root cause was a Microsoft service outage, which triggered a restart of our APIs. Because of a downstream dependency, several of the APIs did not restart as expected.
The team quickly identified the specific failure and began implementing a fix. All services were fully operational again within 40 minutes and remained under close monitoring until the fix was signed off and everything was confirmed stable.
We are reviewing this incident internally to strengthen resilience around this dependency.