#1 2022-01-22 15:39:00

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 13,277
Website

Three Locks To Rule Them All

flyinglock.png
To ensure thread-safety, especially on server side, we usually protect code with critical sections, or locks. In recent Delphi revisions, with have the TMonitor feature, but I would rather trust the OS for locks, which are implemented using Windows Critical Sections, or POSIX futex/mutex.

But all locks are not born equal. Most of the time, the overhead of a Critical Section WinAPI or the pthread library is not needed.
So, in mORMot 2, we introduced several native locks in addition to those OS locks, with multi-read/single-write abilities, or re-entrancy.

This is the forum thread of the https://blog.synopse.info/?post/2022/01 … e-Them-All blog article.

Offline

#2 2022-01-23 04:55:44

edwinsn
Member
Registered: 2010-07-02
Posts: 1,177

Re: Three Locks To Rule Them All

Another post with some linked posts that worth spending hours to read, thanks!


Delphi XE4 Pro on Windows 7 64bit.
Lazarus trunk built with fpcupdelux on Windows with cross-compile for Linux 64bit.

Offline

#3 2022-01-24 08:52:25

okoba
Member
Registered: 2019-09-29
Posts: 86

Re: Three Locks To Rule Them All

Very interesting article.
I like to know more about how you debug these problems, like the TAsyncServer case.

Offline

#4 2022-01-24 11:19:51

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 13,277
Website

Re: Three Locks To Rule Them All

First is to identify the issue, with regular tools - we like https://github.com/wg/wrk a lot.
See https://synopse.info/forum/viewtopic.ph … 517#p36517 for some history about the feedback and discussion with mpv (Pavel).

Then, the server code can generate a lot of verbose logs, to allow forensic search after the issue.
The real-time debugger is useless in such cases. You need to trigger the issue, then search about the cause in the logs.
Our TSynLog class which has very high performance, and our LogViewer with thread filtering support helps a lot.
I fought more than two days against an issue which was a one-liner resolution... sad

And of course, the first final step is to have aggressive multi-thread tests in our regression suite, with tuning parameters to simulate thousands of connections if needed.
Just for this testing/validating/benchmarking purpose, we have TNetworkProtocols, TTestMultiThreadProcess,  TTestBidirectionalRemoteConnection and TTestClientServerAccess tests in our mormot2tests project.
The tests were a lot enhanced during last weeks, to refine issues tracking.

Offline

#5 2022-01-24 18:35:05

wsherman
Member
Registered: 2022-01-24
Posts: 3

Re: Three Locks To Rule Them All

ab wrote:

in mORMot 2, we introduced several native locks in addition to those OS locks

Be aware that there are many warnings about using spin-locks in user space on Linux.
References:

https://www.realworldtech.com/forum/?th … tid=189723

Linus wrote:

Because you should never ever think that you're clever enough to write your own locking routines.. Because the likelihood is that you aren't (and by that "you" I very much include myself - we've tweaked all the in-kernel locking over decades, and gone through the simple test-and-set to ticket locks to cacheline-efficient queuing locks, and even people who know what they are doing tend to get it wrong several times).

https://man7.org/linux/man-pages/man3/p … html#NOTES

User-space spin locks are not applicable as a general locking
solution.  They are, by definition, prone to priority inversion
and unbounded spin times.  A programmer using spin locks must be
exceptionally careful not only in the code, but also in terms of
system configuration, thread placement, and priority assignment.

https://news.ycombinator.com/item?id=21970050

You can implement spinlocks in userspace under specific circumstances. You will need to mark your threads as realtime threads[0] and have a fallback to futex if the fast path (spinning) doesn't work out. And even then you need to benchmark on multiple machines and with realistic workloads (not microbenchmarks) to see whether it's actually worth it for your application. It's complex...

https://mjtsai.com/blog/2020/01/06/bewa … ser-space/

How do you even measure whether a spinlock is better than a mutex, and what makes a good spinlock?

https://www.realworldtech.com/forum/?th … tid=189747

Before my blog post I had a very hard time finding guidance that actually backed up the claim "you shouldn't use spinlocks" with benchmarks. Part of my goal was to provide that.

Much of this discussion was started from this post (also see comments at the bottom):
https://probablydance.com/2019/12/30/me … really-is/

from:  https://github.com/synopse/mORMot2/blob … 3048-L3049

/// could be set to TRUE to force SleepHiRes(0) to call the sched_yield API
// - in practice, it has been reported as buggy under POSIX systems

Maybe "buggy" is not the right term.  The API behavior is not well suited for this type of usage (it is too naive since the scheduler doesn't know which thread needs to wake up to release the lock).

Offline

#6 2022-01-24 21:01:31

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 13,277
Website

Re: Three Locks To Rule Them All

You are right, I perfectly know those citations.
We don't want to replace futexes or critical sections. As TMonitor did - with more or less success.
This was the whole purpose of the article.

Our locks only spin with a small number of iterations, with a call to the OS after only 1000 pause opcode.
They are meant to be almost always without contention, and have a content of a few cycles to protect, as we show (some counters + list update, with no blocking OS call).

In practice, we don't see our locks going to call the OS on the most pressing tests, e.g. with thousands of concurrent threads consuming our web server.
If I run the multi-thread tests with high numbers, consuming 100% of CPU, in fact SwitchToThread is never called by our DoSpin() function, with background web browsing, and even changing the application priority.

Spining in user space is not bad in itself.
The libpthread does it. Everyone does it. All locks do initial spinning before switching to kernel land.
Even Linus presents our own algorithm in https://www.realworldtech.com/forum/?th … tid=189755
Spinning with no pause, and forever, is bad for sure. We don't do this.

Offline

#7 2022-01-24 21:33:08

wsherman
Member
Registered: 2022-01-24
Posts: 3

Re: Three Locks To Rule Them All

ab wrote:

Our locks only spin with a small number of iterations...They are meant to be almost always without contention, and have a content of a few cycles to protect

I am not saying you have a problem since I don't know anything about your implementation, and I am not trying to be argumentative.  I am just putting this out there, and it may or may not apply:

https://matklad.github.io//2020/01/02/s … rmful.html

...
Spinning Just For a Little Bit, What Can Go Wrong?
Because spin locks are so simple and fast, it seems to be a good idea to use them for short-lived critical sections. For example, if you only need to increment a couple of integers, should you really bother with complicated syscalls? In the worst case, the other thread will spin just for a couple of iterations…

Unfortunately, this logic is flawed! A thread can be preempted at any time, including during a short critical section. If it is preempted, that means that all other threads will need to spin until the original thread gets its share of CPU again. And, because a spinning thread looks like a good, busy thread to the OS, the other threads will spin until they exhaust their quants, preventing the unlucky thread from getting back on the processor!

Offline

#8 2022-01-24 21:40:59

wsherman
Member
Registered: 2022-01-24
Posts: 3

Re: Three Locks To Rule Them All

ab wrote:

You are right, I perfectly know those citations.
In practice, we don't see our locks going to call the OS...
Spining in user space is not bad in itself...libpthread does it.

Yes, one of the posts I referenced mentions that mutex is implemented in a similar way:

If Not a Spinlock, Then What?
First, if you only use a spin lock because "it’s faster for small critical sections", just replace it with a mutex from std or parking_lot. They already do a small amount of spinning iterations before calling into the kernel, so they are as fast as a spinlock in the best case, and infinitely faster in the worst case.

Last edited by wsherman (2022-01-24 21:46:25)

Offline

#9 2022-01-24 21:44:33

ab
Administrator
From: France
Registered: 2010-06-21
Posts: 13,277
Website

Re: Three Locks To Rule Them All

The problem with futex/mutex is that they need to be initialized and released.
Our locks are much simpler: they don't need any initialization (but being set to 0), and no finalization whatsoever.

About time slice interruption within a lock acquisition, it happens. If the spinning is done wrong, as in your reference article.
But since we use only 1000 x pause opcode, it will very quickly stop via nanosleep and release all the other waiting threads.
In practice - with realistic load benchmarks, and production servers, I have seen no unstability problems.

A good example of spinning use is our https://github.com/synopse/mORMot2/blob … cx64mm.pas
It does spining with pause, then call nanosleep/SwitchToThread. And we have counters about spinning sleep calls.
There is a lock per size of small memory block slot. It is just a simple boolean and some atomic asm operations. Using a futex/mutex there would have been an overkill for sure.
In practice, spinning reduces the number of contentions to a very low number. A few dozen at most - and at startup/warmup of the aggressive benchmark - never in the real life.
So if you know that there is small chance of contention, but you need to ensure pure thread safety in this rare but dangerous case, spinning is just good enough.

Our implementation exactly fits what Linus ended by writing, after arguing against what he saw:

Linus wrote:

Maybe that's not a problem for you, simply because you know you'll never have seriously overcommitted resources, and you'll want to keep tings simple. Or maybe you have some other "I know my load" that means that you want very particular fast-case locking sequences.

Thanks a lot for your reference material.

Offline

Board footer

Powered by FluxBB