Recently I’ve noticed increased level of 502 errors on Application Load Balancer sitting in front of ECS service. Analysis of all load balancer parameters like connection draining I have not found a direct reason of this behaviour. I noticed that application served the response with 200 OK status code, however ALB still sent 502 to the client. Thankfuly instance was still around. After investigating logs I found cryptic message about Elastic Network Adapter being reset:
ena 0000:00:05.0: Device reset completed successfully, Driver info: Elastic Network Adapte
r (ENA) v2.2.10g
Quick analysis shown that it happens quite often:
grep 'Device reset completed successfully' messages | wc -l
45
Ok, so definitely restart of network adapter driver can cause dropped connections from ALB to the instance. But what causes ENA to reset? By digging deeper I found this issue on Github that from the first sight looks related.
And it was that! I found out that single Java application OOM-ing, dropped stack trace to serial, and that caused ENA driver to restart due to serial being busy.
And the solution was to disable dumping kernel log during OOMs:
vm.oom_dump_tasks = 0 # to disable the dump to kernel log & spamming serial out.