Issue Summary
From 10:26 PM to 11:58 PM (GMT+1), most requests to the Review API resulted in 500 error responses. FashionFusionHub applications that rely on this API also returned errors or had reduced functionality. At its peak, the issue affected 100% of traffic to this API infrastructure. Other APIs remained accessible throughout the outage; only the Review API was affected.
Timeline (all times GMT+1)
10:19 PM: Updated Reviews page push begins
10:26 PM: Outage begins
10:26 PM: DataDog alerts teams
10:54 PM: First configuration rollback attempt fails
11:15 PM: Successful configuration change rollback
11:19 PM: Server restarts begin
11:58 PM: 100% of traffic back online
Root Cause
At 10:19 PM (GMT+1), the updated Reviews page configuration was pushed, and requests to the Review API began returning 500 Internal Server Error responses shortly afterwards. The root cause was a typographical error in that configuration: the mistyped value directed requests to the wrong address for the backing physical services, and those requests blocked permanently. The internal monitoring systems ran into the same error on their own calls to the Review API, so their checks also blocked permanently.
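For illustration only, the sketch below (in Python, with hypothetical host names and URLs) shows how a one-character typo in a backend address, combined with a call that sets no timeout, can leave workers blocked instead of failing fast:

    import urllib.request

    # Intended backend URL for the Review API (hypothetical name).
    REVIEW_API = "http://reviews-backend.internal/api/reviews"

    # What a mistyped configuration value might point at instead.
    MISTYPED_API = "http://reviws-backend.internal/api/reviews"

    def fetch_reviews(url):
        # With no timeout, urlopen() can block for as long as the operating
        # system allows if the target never answers, tying up the worker
        # that is handling the request.
        with urllib.request.urlopen(url) as resp:
            return resp.read()

    def fetch_reviews_bounded(url):
        # An explicit timeout bounds how long a bad address can hold a
        # worker; the caller gets a clear error instead of an apparent hang.
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.read()

In our case the typo lived in a configuration file rather than application code, but the failure mode is the same: the request never completes, and the API surfaces a 500 to its callers.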
Resolution and recovery
At 10:26 PM (GMT+1), the monitoring systems alerted the engineers, who investigated and quickly escalated the issue. By 10:40 PM, the incident response team had identified that the monitoring system itself was exacerbating the problem caused by the bug.
To address the bug, we corrected the typographical error in the PHP configuration file in the www folder; that typo had disrupted the flow of the review process and caused the server to return the 500 error. For a comprehensive fix, we also need to investigate and remove the permanent blockage that the internal monitoring systems run into during calls to the Review API.
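As a sketch of that second part of the fix (hypothetical endpoint and function names, not our actual monitoring code), bounding the monitoring call keeps it from blocking permanently when the Review API misbehaves:

    import urllib.error
    import urllib.request

    # Hypothetical health endpoint for the Review API.
    REVIEW_API_HEALTH = "http://reviews-backend.internal/health"

    def check_review_api(timeout_seconds=5):
        # Returns True only if the Review API answers within the timeout.
        try:
            with urllib.request.urlopen(REVIEW_API_HEALTH,
                                        timeout=timeout_seconds) as resp:
                return resp.status == 200
        except (urllib.error.URLError, TimeoutError):
            # Timeouts and connection errors become a failed check that the
            # monitor can alert on, rather than a call that never returns.
            return False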
To help with the recovery, we turned off some of our monitoring systems that were triggering the bug, and we restarted the servers gradually (starting at 11:19 PM) to avoid possible cascading failures from a wide-scale restart. By 11:49 PM, 25% of traffic had been restored, and 100% of traffic was routed to the API infrastructure at 11:58 PM.
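For illustration, a staged ramp-up of this kind could be driven by a small script like the sketch below (hypothetical load-balancer and health-check helpers; the real mechanism depends on the infrastructure):

    import time

    RAMP_STAGES = [5, 25, 50, 100]  # percentage of traffic per stage

    def set_traffic_percentage(percent):
        # Placeholder for the load-balancer call that actually shifts traffic.
        print(f"Routing {percent}% of traffic to the restarted servers")

    def backends_healthy():
        # Placeholder health check; in practice this would query monitoring.
        return True

    def ramp_up():
        for percent in RAMP_STAGES:
            set_traffic_percentage(percent)
            time.sleep(60)  # let each stage settle before increasing again
            if not backends_healthy():
                # Back off instead of pushing a struggling fleet to 100%.
                set_traffic_percentage(0)
                raise RuntimeError(f"Ramp-up halted at {percent}% traffic")

    if __name__ == "__main__":
        ramp_up()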
Corrective and Preventative Measures
In the last two days, we’ve conducted an internal review and analysis of the outage. The following are actions we are taking to address the underlying causes of the issue and to help prevent recurrence and improve response times:
Disable the current configuration release mechanism until safer measures are implemented.
Change the rollback process to be quicker and more robust.
Programmatically enforce staged rollouts of all configuration changes (see the sketch after this list).
Add a faster rollback mechanism and improve the traffic ramp-up process, so any future problems of this type can be corrected quickly.
Develop a better mechanism for quickly delivering status notifications during incidents.
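As a sketch of how the staged-rollout and rollback items above could be enforced programmatically (hypothetical host names, file path, and deploy helpers, shown in Python), a configuration change would have to pass validation and a canary stage before reaching the full fleet:

    import subprocess
    import sys

    CANARY_HOSTS = ["api-canary-01"]              # hypothetical host names
    FLEET_HOSTS = ["api-01", "api-02", "api-03"]

    def config_is_valid(path):
        # Syntax-check the change before it reaches any server; for a PHP
        # file, `php -l` performs a lint check.
        return subprocess.run(["php", "-l", path]).returncode == 0

    def deploy(hosts, path):
        # Placeholder for the real deployment step (config management, rsync, ...).
        print(f"Deploying {path} to {hosts}")

    def rollback(hosts):
        # Placeholder for restoring the previous known-good configuration.
        print(f"Rolling back configuration on {hosts}")

    def canary_healthy():
        # Placeholder for monitoring checks against the canary hosts.
        return True

    def staged_rollout(path):
        if not config_is_valid(path):
            sys.exit("Configuration failed validation; rollout blocked.")
        deploy(CANARY_HOSTS, path)
        if not canary_healthy():
            rollback(CANARY_HOSTS)
            sys.exit("Canary checks failed; rollout stopped and rolled back.")
        deploy(FLEET_HOSTS, path)

    if __name__ == "__main__":
        staged_rollout("www/config.php")  # hypothetical configuration path

Gating the fleet deploy on both validation and canary health means a typo like the one behind this outage would be caught before it could take the whole API down.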