The main issue with dual counters is that most of the time, in low-latency use cases, your consumer is about one message behind the producer.
This means your consumer doesn't get much benefit from caching the producer's position. The queue appears empty the majority of the time, so the consumer has to reload the counter (pulling that cache line into its own cache).
Meanwhile the producer goes to write message N+1 and update the counter again, and has to claim the line back (S to M in MESI), when it could have just set a completion flag in the message header, which the consumer hasn't touched in ages (since the ring buffer last lapped). And the producer has just written the message data to that same line anyway, so it already holds it exclusively.
So when your queue is almost always empty, this counter is just another cache line being ping-ponged between cores.
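To make the completion-flag idea concrete, here is a rough sketch of a single-producer, single-consumer ring where each slot carries its own sequence word and there is no shared head counter. This is not Aeron's or Tachyon's actual layout; `FlagRing`, `try_publish`, `try_consume`, the fixed `uint64_t` payload and the 64-byte line size are all made up for illustration.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical SPSC ring: the "completion flag" is a per-slot sequence word
// living next to the payload, instead of a shared head counter.
template <std::size_t Capacity>
class FlagRing {
    static_assert(Capacity >= 2 && (Capacity & (Capacity - 1)) == 0,
                  "Capacity must be a power of two, at least 2");

    struct alignas(64) Slot {            // 64 bytes: assumed cache line size
        std::atomic<std::uint64_t> seq{0};
        std::uint64_t payload{0};
    };

    std::array<Slot, Capacity> slots_;
    alignas(64) std::uint64_t write_pos_ = 0;  // touched only by the producer
    alignas(64) std::uint64_t read_pos_ = 0;   // touched only by the consumer

public:
    FlagRing() {
        // seq == position means "free for the producer at that position".
        for (std::size_t i = 0; i < Capacity; ++i)
            slots_[i].seq.store(i, std::memory_order_relaxed);
    }

    // Producer side: returns false if the ring is full.
    bool try_publish(std::uint64_t value) {
        Slot& s = slots_[write_pos_ & (Capacity - 1)];
        if (s.seq.load(std::memory_order_acquire) != write_pos_)
            return false;                // consumer has not freed this slot yet
        s.payload = value;               // payload and flag share one line
        s.seq.store(write_pos_ + 1, std::memory_order_release);
        ++write_pos_;
        return true;
    }

    // Consumer side: returns false if the next message is not complete.
    bool try_consume(std::uint64_t& out) {
        Slot& s = slots_[read_pos_ & (Capacity - 1)];
        if (s.seq.load(std::memory_order_acquire) != read_pos_ + 1)
            return false;                // queue looks empty
        out = s.payload;
        // Hand the slot back for the producer's next lap.
        s.seq.store(read_pos_ + Capacity, std::memory_order_release);
        ++read_pos_;
        return true;
    }
};
```

In the almost-always-empty case the consumer spins on the header of the slot the producer is about to write, so the only line moving between cores is the one carrying the message itself. The cost is that the consumer writes the sequence word back to free the slot, but the producer does not look at that line again until a full lap later.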
This gets back to Aeron. In Aeron's design the reader can get ahead of the writer and it's safe.
Fair point on the head cache line. Tachyon's target is cross-language zero-copy IPC, not squeezing the last nanosecond out of a pure C++ ring. Different tradeoff.