RocketMQ broker busy problem

My company uses RocketMQ as its message middleware. As the business has grown, the broker has recently started to report this error occasionally: [TIMEOUT_CLEAN_QUEUE] broker busy, start flow control for a while, period in queue: 206ms, size of queue: 5

I have upgraded to version 4.2 and searched extensively online, but the answers were all in general terms. The usual advice is to change the configuration:
waitTimeMillsInSendQueue=300 (or larger)
sendMessageThreadPoolNums=64
useReentrantLockWhenPutMessage=true

But this does not solve the underlying problem: if I set it to 300, I get errors reporting more than 300ms instead. I cannot keep raising waitTimeMillsInSendQueue indefinitely.
According to our ops monitoring, CPU usage does rise when the error is reported, but it stays below 50%, memory is not full, and IO fluctuates slightly.
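Why the reported time simply follows the threshold can be sketched with a small model. This is not the RocketMQ source. It assumes a cleaner that runs every 10ms and rejects requests at the head of the send queue once they have waited at least waitTimeMillsInSendQueue. The class name, the arrival times, and the 450ms stall are made up for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Simplified model of a head-of-queue timeout check; NOT the RocketMQ implementation.
class SendQueueCleanSketch {

    // Removes requests from the head of the queue that have waited at least
    // maxWaitMs and returns the waiting time that would be reported for each.
    static List<Long> cleanExpired(Deque<Long> enqueueTimesMs, long nowMs, long maxWaitMs) {
        List<Long> reported = new ArrayList<>();
        while (!enqueueTimesMs.isEmpty()) {
            long waited = nowMs - enqueueTimesMs.peekFirst();
            if (waited < maxWaitMs) {
                break; // head is still fresh, and everything behind it is younger
            }
            enqueueTimesMs.pollFirst();
            reported.add(waited);
        }
        return reported;
    }

    // Simulates a worker that is stalled for stallMs while five requests
    // (enqueued at 0, 3, 6, 9 and 12 ms) wait, with the cleaner ticking every 10 ms.
    static List<Long> simulate(long maxWaitMs, long stallMs) {
        Deque<Long> queue = new ArrayDeque<>();
        for (long t = 0; t <= 12; t += 3) {
            queue.addLast(t);
        }
        List<Long> reported = new ArrayList<>();
        for (long now = 0; now <= stallMs; now += 10) {
            reported.addAll(cleanExpired(queue, now, maxWaitMs));
        }
        return reported;
    }

    public static void main(String[] args) {
        // Hypothetical 450 ms stall, tried against several thresholds.
        for (long threshold : new long[] {200, 300, 400, 500}) {
            System.out.println("threshold=" + threshold + "ms -> rejected waits: "
                    + simulate(threshold, 450));
        }
    }
}
```

In this model every rejected request reports a wait just above whatever threshold is configured, and the errors only disappear once the threshold exceeds the length of the stall itself. If the real broker behaves similarly, raising the parameter hides short stalls but does not remove their cause.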

None of these seems able to cause the problem directly. I checked the RocketMQ source code and saw that it uses MappedByteBuffer, so I wonder whether the performance problem is there. Does MappedByteBuffer have any known performance problems? (I found no substantive answer online.)
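One way to look into the MappedByteBuffer question is to measure it on the broker machine. Writes to a mapped buffer normally land in the page cache and are fast, but an individual write can stall on a page fault or while the kernel is flushing dirty pages. The probe below is a minimal sketch using only the JDK. The class name, the temp file, and the record sizes are assumptions for illustration, and it does not reproduce RocketMQ's commit log logic.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Standalone latency probe; unrelated to RocketMQ's own classes.
class MappedWriteLatencyProbe {

    // Writes `writes` records of `recordSize` bytes into a memory-mapped temp file.
    // Returns {total nanoseconds, slowest single write in nanoseconds}.
    static long[] measure(int writes, int recordSize) {
        try {
            Path file = Files.createTempFile("mmap-probe", ".bin");
            file.toFile().deleteOnExit(); // a mapped file may not be deletable right away
            try (FileChannel ch = FileChannel.open(file,
                    StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE,
                        0, (long) writes * recordSize);
                byte[] record = new byte[recordSize];
                long total = 0;
                long slowest = 0;
                for (int i = 0; i < writes; i++) {
                    long start = System.nanoTime();
                    buf.put(record); // may page-fault the first time a page is touched
                    long elapsed = System.nanoTime() - start;
                    total += elapsed;
                    slowest = Math.max(slowest, elapsed);
                }
                buf.force(); // flush dirty pages to disk; not included in the timings
                return new long[] {total, slowest};
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long[] r = measure(4096, 1024); // 4 MiB in 1 KiB records
        System.out.println("average write: " + (r[0] / 4096) + " ns, slowest write: " + r[1] + " ns");
    }
}
```

If the slowest write is orders of magnitude above the average while the disk is busy, that would point toward IO stalls rather than toward MappedByteBuffer being slow in general.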

If you have had the same problem, or have studied RocketMQ in depth, please help. Thank you very much!

May.02,2022

I looked at the source code. This error corresponds to four queues:

  1. queue of sendMessageExecutor
  2. queue of pullMessageExecutor
  3. queue of heartbeatExecutor
  4. queue of endTransactionExecutor

Your CPU and memory are not fully loaded, but the IO is jittery. My guess is that the IO is overloaded, so the write thread blocks and processing becomes too slow, which results in the timeout.
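The guess above can be illustrated with a small sketch: one worker thread is blocked (simulated here with a sleep standing in for a slow write), and the requests queued behind it accumulate waiting time even though the CPU is almost idle. The class name, the 300ms stall, and the single-thread executor are assumptions for illustration, not the broker's actual thread model.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Illustrative only: shows queue waiting time caused by a blocked worker.
class StalledWriterSketch {

    // Blocks the single worker for stallMs, queues `requests` trivial tasks behind it,
    // and returns the longest time (in ms) any of them spent waiting in the queue.
    static long maxQueueWaitMs(long stallMs, int requests) {
        ExecutorService writer = Executors.newSingleThreadExecutor();
        try {
            // Stand-in for a write that is blocked on IO; sleeping uses almost no CPU.
            Runnable stalledWrite = () -> {
                try {
                    Thread.sleep(stallMs);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            };
            writer.submit(stalledWrite);

            List<Future<Long>> waits = new ArrayList<>();
            for (int i = 0; i < requests; i++) {
                final long enqueuedAt = System.nanoTime();
                // Each task only reports how long it sat in the queue before running.
                Callable<Long> probe =
                        () -> TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - enqueuedAt);
                waits.add(writer.submit(probe));
            }

            long max = 0;
            for (Future<Long> f : waits) {
                max = Math.max(max, f.get());
            }
            return max;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            writer.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("longest wait in queue: " + maxQueueWaitMs(300, 5) + " ms");
    }
}
```

With a 300ms stall, the queued tasks wait well past a 200ms limit even though they themselves do almost no work, which matches the pattern of low CPU usage combined with timeouts.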

reference: https://www.e-learn.cn/conten...


This is a very strange problem. Our stress test pushed throughput past 20,000 and the problem never reproduced, but in actual production it is reported occasionally. waitTimeMillsInSendQueue started at the default of 200. When I changed it to 300, errors of more than 300ms were reported, and now that I have changed it to 400ms, errors of 400ms are reported as well. I have no clue.
The log is shown below:

[image: clipboard.png]
