K8s cluster migration - problems
Planning
Lack of knowledge of the dependencies between the payment areas and cluster components
No specific/detailed redeployment plan prepared by the development team
No comprehensive verification by the development team that systems and applications work after the cutover - e.g., a week after the migration we learn that data from some NFS shares was supposed to be moved, even though it had been agreed that such information belongs in the basic action plan
Network traffic
Tangled traffic routing ("spaghetti") - a huge number of HAProxy entries scattered across various load balancers, many of which haven't been updated for a long time and don't have renewed Puppet certificates (an illustrative entry is sketched at the end of this section)
Last-minute additions to HAProxy - for example, the front end was left on the old cluster while the entire domain context was moved to the new one; this too came down to a lack of planning
The need to perform rollbacks - many things go wrong during the cutover itself, when something stops working
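
To make the "spaghetti" concrete, a minimal sketch of one such HAProxy entry (all hostnames, ports, and names are made up) - every load balancer holds many fragments like this, and each one has to be found and repointed during the cutover:

    # Illustrative haproxy.cfg fragment; in reality entries like this are
    # scattered across several load balancers and drift out of date.
    frontend fe_payments
        bind *:443 ssl crt /etc/haproxy/certs/payments.pem
        use_backend be_payments_api if { path_beg /api/payments }

    backend be_payments_api
        balance roundrobin
        # old cluster ingress - to be replaced during the switchover
        server old-k8s-1 10.0.1.10:31443 check ssl verify none
        server old-k8s-2 10.0.1.11:31443 check ssl verify none
        # new cluster ingress, added as backup for a staged cutover
        server new-k8s-1 10.1.1.10:31443 check ssl verify none backup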
Blockers
Deployment roadblocks - constant rescheduling driven by the business / customers
Internal blockers from other teams, e.g. OPS, SRE
Technological aspect
Additional work creating Services (SVCs) that act as proxies and let components talk across clusters - it often turns out only at the last moment that system X needs to communicate with system Y (a sketch of such a proxy Service follows this list)
Additional work reconciling changes on the new cluster's branch - commits keep landing only on the old cluster's branch despite the agreed process (see the git sketch below)
Requirement to preserve all data and messages on the AMQ queues throughout the migration
FluxCD often stops reconciling on the old cluster, which prolongs the work (see the flux CLI sketch below)
Rerouting traffic is time-consuming due to the large number of load balancers
PSD2 payment channels shut down - the kills had not been issued beforehand
Variables and configuration embedded in strange/non-standard places, or hardcoded outright (see the ConfigMap sketch below)
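
A minimal sketch of the cross-cluster proxy Service mentioned above, assuming system Y in the old cluster stays reachable under a stable DNS name (all names are hypothetical); an ExternalName Service is one way to implement such an alias:

    # Lets pods in the new cluster keep calling "system-y" while the
    # real workload still runs in the old cluster.
    apiVersion: v1
    kind: Service
    metadata:
      name: system-y
      namespace: payments
    spec:
      type: ExternalName
      # DNS name under which system Y is still served by the old cluster
      externalName: system-y.old-cluster.example.internal

Every X-to-Y dependency discovered late means another Service like this, which is why finding them at the last moment is so costly.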
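
For the branch-alignment problem, a sketch of the manual reconciliation (branch names are illustrative):

    # List the commits that exist only on the old cluster's branch...
    git fetch origin
    git log --oneline origin/new-cluster..origin/old-cluster
    # ...and port them onto the new cluster's branch one by one
    git checkout new-cluster
    git cherry-pick <commit>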
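
When FluxCD stalls on the old cluster, these are the flux CLI commands we typically reach for (the Kustomization name is hypothetical):

    # Force a reconciliation, refreshing the source first
    flux reconcile kustomization payments --with-source

    # Inspect why reconciliation is failing
    flux get kustomizations
    flux logs --level=error

    # Once a workload has moved, suspend it on the old cluster so Flux
    # stops reverting cutover changes
    flux suspend kustomization payments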
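
For the hardcoded configuration, a minimal sketch of the direction of the fix, assuming the values can be lifted into a ConfigMap consumed as environment variables (all names and values are hypothetical):

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: payments-config
      namespace: payments
    data:
      # Previously hardcoded in the image / hidden in non-standard places
      AMQ_BROKER_URL: "tcp://amq.example.internal:61616"
      NFS_EXPORT_PATH: "/exports/payments"
    ---
    # Fragment of the consuming Deployment's pod spec:
    #   containers:
    #   - name: payments
    #     envFrom:
    #     - configMapRef:
    #         name: payments-config

With configuration centralized like this, the cutover only has to change one object per environment instead of hunting for values embedded in the code.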