Introduction
I’ve been using Amazon MSK for some time, and today I ran into an interesting problem on a cluster where we had enabled storage autoscaling.
What was it?
The cluster was processing an unusual amount of data, generating several gigabytes per hour. We had planned for this and enabled automatic storage expansion (storage autoscaling).
As expected, when disk usage crossed the configured target (>50% of disk used), the autoscaling job was triggered and the disk size was increased.
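Under the hood, MSK storage autoscaling is wired through Application Auto Scaling. For context, here is a minimal sketch of the equivalent setup with boto3; the cluster ARN is a placeholder and the capacity values are illustrative, not our real configuration:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

CLUSTER_ARN = "arn:aws:kafka:eu-west-1:123456789012:cluster/my-cluster/..."  # placeholder

# Register the cluster's broker storage as a scalable target.
autoscaling.register_scalable_target(
    ServiceNamespace="kafka",
    ResourceId=CLUSTER_ARN,
    ScalableDimension="kafka:broker-storage:VolumeSize",
    MinCapacity=1000,   # current volume size in GiB (illustrative)
    MaxCapacity=4000,   # ceiling the autoscaler may grow to (illustrative)
)

# Target-tracking policy: expand storage once utilization crosses 50%.
autoscaling.put_scaling_policy(
    PolicyName="msk-storage-utilization",
    ServiceNamespace="kafka",
    ResourceId=CLUSTER_ARN,
    ScalableDimension="kafka:broker-storage:VolumeSize",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 50.0,  # the >50% disk-used target mentioned above
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "KafkaBrokerStorageUtilization"
        },
        # MSK storage can only grow, so scale-in must stay disabled.
        "DisableScaleIn": True,
    },
)
```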
But a couple of hours later the disk was filling up again, and I noticed the autoscaling wasn’t reacting. Worried it would fill completely, I tried increasing the size manually instead of waiting for the automatic task, and got the following error:
The provided cluster's storage was modified in the last 6 hours. Please wait for 6 hours between subsequent storage changes.
The minimum cool-down period is 6 hours and, as far as the documentation indicates, it cannot be lowered.
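For reference, the manual resize is a single UpdateBrokerStorage call. A minimal sketch with boto3, again with a placeholder ARN and an illustrative volume size; within the cool-down window the call is rejected (in my case with the message above):

```python
import boto3

kafka = boto3.client("kafka")

CLUSTER_ARN = "arn:aws:kafka:eu-west-1:123456789012:cluster/my-cluster/..."  # placeholder

# UpdateBrokerStorage requires the cluster's current version to guard
# against concurrent modifications.
current_version = kafka.describe_cluster(ClusterArn=CLUSTER_ARN)[
    "ClusterInfo"
]["CurrentVersion"]

try:
    kafka.update_broker_storage(
        ClusterArn=CLUSTER_ARN,
        CurrentVersion=current_version,
        TargetBrokerEBSVolumeInfo=[
            # "All" applies the new size to every broker in the cluster.
            {"KafkaBrokerNodeId": "All", "VolumeSizeGB": 2000}
        ],
    )
except kafka.exceptions.BadRequestException as err:
    # Within 6 hours of the last storage change the request is refused,
    # e.g. "Please wait for 6 hours between subsequent storage changes."
    print(f"Storage update rejected: {err}")
```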
What happened next?
Unfortunately, the disk did fill up and the Kafka brokers stopped responding to consumers and producers. There was nothing I could do other than wait. When the 6-hour limit passed, autoscaling took effect and the problem was resolved.
Conclusion
On paper, autoscaling looked like a great feature, but it let us down. We work with very large clusters and have seen them generate many gigabytes of data in a short period of time. We had always relied on planning and monitoring to keep disks from filling up, and as a safety net this feature did not live up to our expectations.
This is quite a limitation, and it does not apply only to autoscaling: had we increased the storage manually, we would have hit the same cool-down.
The only thing you can do here is plan storage well in advance and budget for the possibility of the cluster being unavailable for up to 6 hours.
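In practice, that means alerting well before the autoscaling target is even reached. A minimal sketch, assuming boto3 and hypothetical cluster and broker names, that reads the per-broker KafkaDataLogsDiskUsed metric MSK publishes to CloudWatch:

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")

CLUSTER_NAME = "my-cluster"  # placeholder
ALERT_THRESHOLD = 40.0       # alert well below the 50% autoscaling target

# MSK publishes per-broker disk usage to the AWS/Kafka namespace.
for broker_id in ("1", "2", "3"):  # broker IDs of the cluster (illustrative)
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/Kafka",
        MetricName="KafkaDataLogsDiskUsed",
        Dimensions=[
            {"Name": "Cluster Name", "Value": CLUSTER_NAME},
            {"Name": "Broker ID", "Value": broker_id},
        ],
        StartTime=datetime.now(timezone.utc) - timedelta(minutes=15),
        EndTime=datetime.now(timezone.utc),
        Period=300,
        Statistics=["Maximum"],
    )
    for point in stats["Datapoints"]:
        if point["Maximum"] > ALERT_THRESHOLD:
            print(f"Broker {broker_id} disk at {point['Maximum']:.1f}% - investigate now")
```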