Service degradation for api.veryfi.com
Started 19 Apr at 02:13pm UTC, resolved 19 Apr at 02:53pm UTC.
As of 14:53:00 UTC, the underlying issues have been resolved and all services are back to normal.
Start Time: 14:13:00 UTC
End Time: 14:53:00 UTC
Impact: Degraded performance, and documents processed by a non-current version of the ML model
At 14:13 UTC, a component of our primary processing cluster was restarted as part of a routine automated maintenance process. The process failed because of a version mismatch introduced during recent component upgrades.
From that point, all traffic was routed to a fallback cluster, which was running a non-current version of the machine learning model. During the impact window, some results may have differed from what was expected because a different machine learning model was in use.
The fallback cluster began scaling out and reached full production levels within approximately 20 minutes, at which point request processing time returned to normal.
At 14:53 UTC, the underlying issue in the primary cluster was resolved and traffic was redirected back to it.
The version mismatch in our primary cluster was resolved and all services returned to full operation.
To ensure that such an incident does not happen again, we have taken the following steps:
We have revised our processes for version control and patching to ensure this mismatch does not happen in the future.
We have revised our processes around fallback clusters to ensure they are also running production versions of code and models.
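As an illustration of the second step, a version parity check between the primary and fallback clusters could look roughly like the sketch below. This is not our actual tooling: the `ClusterVersions` structure, the `find_version_drift` function, and the version strings are hypothetical names chosen for the example, and it assumes each cluster can report the code and model versions it is running.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterVersions:
    """Versions reported by one cluster (hypothetical structure)."""
    name: str
    code_version: str
    model_version: str


def find_version_drift(primary: ClusterVersions, fallback: ClusterVersions) -> list[str]:
    """Return one message per field where the fallback differs from the primary."""
    drift = []
    for field in ("code_version", "model_version"):
        expected = getattr(primary, field)
        actual = getattr(fallback, field)
        if expected != actual:
            drift.append(f"{fallback.name}: {field} is {actual}, expected {expected}")
    return drift


if __name__ == "__main__":
    # Illustrative values only; real versions would come from each cluster.
    primary = ClusterVersions("primary", "2.4.1", "m-17")
    fallback = ClusterVersions("fallback", "2.4.1", "m-16")
    for message in find_version_drift(primary, fallback):
        print(message)
```

A check of this kind could run on a schedule or before maintenance, so that drift is reported before traffic is ever routed to the fallback cluster.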
At 14:13 UTC, we discovered performance degradation and documents being processed by a non-current version of the ML model.