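My questions about the Redlock algorithm

The article Distributed locks with Redis – Redis first describes how to implement a distributed lock correctly with a single instance, and then introduces the distributed version of the algorithm. I have several questions about the distributed version.

First, the article says: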

In the distributed version of the algorithm we assume we have N Redis masters. Those nodes are totally independent, so we don’t use replication or any other implicit coordination system. We already described how to acquire and release the lock safely in a single instance. We take for granted that the algorithm will use this method to acquire and release the lock in a single instance. In our examples we set N=5, which is a reasonable value, so we need to run 5 Redis masters on different computers or virtual machines in order to ensure that they’ll fail in a mostly independent way.

In order to acquire the lock, the client performs the following operations:

1. It gets the current time in milliseconds.

2. It tries to acquire the lock in all the N instances sequentially, using the same key name and random value in all the instances. During step 2, when setting the lock in each instance, the client uses a timeout which is small compared to the total lock auto-release time in order to acquire it. For example if the auto-release time is 10 seconds, the timeout could be in the ~ 5-50 milliseconds range. This prevents the client from remaining blocked for a long time trying to talk with a Redis node which is down: if an instance is not available, we should try to talk with the next instance ASAP.

3. The client computes how much time elapsed in order to acquire the lock, by subtracting from the current time the timestamp obtained in step 1. If and only if the client was able to acquire the lock in the majority of the instances (at least 3), and the total time elapsed to acquire the lock is less than lock validity time, the lock is considered to be acquired.

4. If the lock was acquired, its validity time is considered to be the initial validity time minus the time elapsed, as computed in step 3.

5. If the client failed to acquire the lock for some reason (either it was not able to lock N/2+1 instances or the validity time is negative), it will try to unlock all the instances (even the instances it believed it was not able to lock).
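
To make the five quoted steps easier to follow, here is a minimal Python sketch of them. It is only an illustration under stated assumptions: `FakeInstance` and `FakeClock` are hypothetical in-memory stand-ins (no real Redis and no networking), `per_call_cost_ms` is an invented figure for how long each call takes, and clock drift is ignored.

```python
import secrets


class FakeInstance:
    """In-memory stand-in for one independent Redis master (hypothetical)."""

    def __init__(self, up=True):
        self.up = up
        self.store = {}  # key -> (value, expires_at_ms)

    def set_nx_px(self, key, value, ttl_ms, now_ms):
        """Mimic SET key value NX PX ttl_ms; return True if the key was set."""
        if not self.up:
            raise ConnectionError("instance unreachable")
        current = self.store.get(key)
        if current is not None and current[1] > now_ms:
            return False  # key is still held and has not expired
        self.store[key] = (value, now_ms + ttl_ms)
        return True

    def delete_if_value(self, key, value):
        """Mimic the unlock that deletes the key only if it holds our value."""
        if self.up and key in self.store and self.store[key][0] == value:
            del self.store[key]


class FakeClock:
    """Manually advanced millisecond clock, so the example is deterministic."""

    def __init__(self):
        self.now_ms = 0

    def advance(self, ms):
        self.now_ms += ms


def acquire(instances, key, ttl_ms, clock, per_call_cost_ms=2):
    """Return (random_value, validity_ms) on success, or None on failure."""
    value = secrets.token_hex(16)           # same random value on every instance
    start_ms = clock.now_ms                 # step 1
    locked = 0
    for inst in instances:                  # step 2: sequential, same key
        try:
            if inst.set_nx_px(key, value, ttl_ms, clock.now_ms):
                locked += 1
        except ConnectionError:
            pass                            # instance down: go to the next one
        clock.advance(per_call_cost_ms)     # pretend the call took some time
    elapsed_ms = clock.now_ms - start_ms    # step 3
    validity_ms = ttl_ms - elapsed_ms       # step 4
    quorum = len(instances) // 2 + 1
    if locked >= quorum and validity_ms > 0:
        return value, validity_ms
    for inst in instances:                  # step 5: unlock everywhere
        inst.delete_if_value(key, value)
    return None


clock = FakeClock()
instances = [FakeInstance() for _ in range(5)]
instances[1].up = False                     # one master is down
first = acquire(instances, "resource", 10_000, clock)
second = acquire(instances, "resource", 10_000, clock)
```

With one of the five fake masters down, the first call still reaches the majority of 3, while the second call finds every reachable key taken and gives up.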

My questions

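Why does the client try to acquire the lock on all the instances sequentially? What would go wrong if it tried them concurrently?
Is the auto-release time that Redlock mentions the same thing as the ttl in SET resource_name my_random_value NX PX {ttl} from Distributed locks with Redis – Redis - Correct implementation with a single instance? In other words, is it what I call TTL below?
In step 2 the client tries to acquire the lock on every instance, and what it does on each one is the same as in the single-instance case, namely SET resource_name my_random_value NX PX {ttl}. How is that ttl computed? I think the ttl differs from instance to instance, because the lock is requested on different instances at different times. To ensure that "the same key is deleted on all instances at the same moment", I think the ttl set on each instance is "TTL - (time at which the lock is requested on that instance - time obtained in step 1)". Is that right? (Here TTL means the logical TTL, not the ttl actually set on any particular instance; that is, the key would be deleted on all instances at "time obtained in step 1 + TTL".)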

jybox

2020-07-28 14:31:03 +08:00

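> Why does the client try to acquire the lock on all the instances sequentially? What would go wrong if it tried them concurrently?

Acquiring concurrently can deadlock (each client succeeds in locking half of the instances). If every client visits the instances sequentially, that deadlock does not occur.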

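> The keys are set at different times, so they also expire at different times. Is the mutual exclusion property still guaranteed in that case?

As I understand it, the TTL only exists to cover the case where a lock is taken and never released (because the process crashed), so that the lock is released automatically after some time. It is a measure for an exceptional situation to begin with. Once the TTL is actually reached, the validity of the lock can no longer be guaranteed (you cannot tell whether the process really crashed or is just temporarily stalled), so a difference of a few milliseconds in the TTL does not matter much.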
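
To put numbers on the few-millisecond spread, here is a small arithmetic sketch. It assumes every instance is given the same ttl (that is my reading of the Redis article, not something the quote above states outright) and it ignores clock drift. The function names and the sample numbers are made up for illustration.

```python
def expiry_times(start_ms, set_offsets_ms, ttl_ms):
    """Absolute expiry of each key when every SET uses the same ttl.

    set_offsets_ms[i] is how long after step 1 the key was set on instance i.
    """
    return [start_ms + offset + ttl_ms for offset in set_offsets_ms]


def validity_window(start_ms, set_offsets_ms, ttl_ms, total_elapsed_ms):
    """Return (earliest key expiry, moment the client stops trusting the lock)."""
    earliest_expiry = min(expiry_times(start_ms, set_offsets_ms, ttl_ms))
    validity_ms = ttl_ms - total_elapsed_ms            # step 4 of the algorithm
    acquired_at = start_ms + total_elapsed_ms
    claimed_until = acquired_at + validity_ms          # equals start_ms + ttl_ms
    return earliest_expiry, claimed_until


# Step 1 at t=1000 ms, five SETs landing 1..22 ms later, 30 ms spent in total.
earliest, claimed_until = validity_window(1000, [1, 4, 9, 15, 22], 10_000, 30)
assert claimed_until <= earliest
```

Because the validity time is counted from step 1, the client stops relying on the lock no later than the first key expires, so under these assumptions the keys expiring a few milliseconds apart does not by itself open a gap.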

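A normal application should not rely on "being the sole holder of the lock for the whole ttl after acquiring it". It should keep renewing the lock (for example, once half of the time has been used) and release it explicitly when it is done.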
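
A rough sketch of that renew-at-half-time pattern, under loud assumptions: `FakeInstance` is a hypothetical in-memory stand-in, `extend_if_value` imitates what on real Redis would have to be an atomic compare-and-extend (for example a Lua script), time is simulated in fixed ticks, and a renewal counts only if a majority of instances accept it.

```python
class FakeInstance:
    """In-memory stand-in for one Redis master (hypothetical, no networking)."""

    def __init__(self):
        self.store = {}  # key -> (value, expires_at_ms)

    def set_nx_px(self, key, value, ttl_ms, now_ms):
        current = self.store.get(key)
        if current is not None and current[1] > now_ms:
            return False
        self.store[key] = (value, now_ms + ttl_ms)
        return True

    def extend_if_value(self, key, value, ttl_ms, now_ms):
        """Reset the ttl only if the key is unexpired and still holds our value."""
        current = self.store.get(key)
        if current is None or current[1] <= now_ms or current[0] != value:
            return False
        self.store[key] = (value, now_ms + ttl_ms)
        return True

    def delete_if_value(self, key, value):
        if key in self.store and self.store[key][0] == value:
            del self.store[key]


def release_all(instances, key, value):
    for inst in instances:
        inst.delete_if_value(key, value)


def renew(instances, key, value, ttl_ms, now_ms):
    """A renewal counts only if a majority of instances extended the key."""
    extended = sum(
        inst.extend_if_value(key, value, ttl_ms, now_ms) for inst in instances
    )
    return extended >= len(instances) // 2 + 1


def work_with_renewal(instances, key, value, ttl_ms, work_ms, step_ms=100):
    """Simulate work_ms of work in step_ms ticks.

    Returns the number of renewals, or None if the lock was not obtained
    or a renewal failed.
    """
    now_ms = 0
    locked = sum(inst.set_nx_px(key, value, ttl_ms, now_ms) for inst in instances)
    if locked < len(instances) // 2 + 1:
        release_all(instances, key, value)
        return None
    last_renewal_ms, renewals = now_ms, 0
    while now_ms < work_ms:
        now_ms += step_ms                        # one slice of "work"
        if now_ms - last_renewal_ms >= ttl_ms // 2:
            if not renew(instances, key, value, ttl_ms, now_ms):
                release_all(instances, key, value)
                return None                      # lock lost: stop working
            last_renewal_ms, renewals = now_ms, renewals + 1
    release_all(instances, key, value)           # done: release explicitly
    return renewals


instances = [FakeInstance() for _ in range(5)]
renewals = work_with_renewal(instances, "resource", "my_random_value", 1000, 2500)
```

The point of the design is that the ttl stays short (so a crashed holder is cleaned up quickly) while a live holder keeps the lock for as long as the work actually takes.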

JasonLaw

2020-07-29 09:38:56 +08:00

@AmmeLid #7

I don't think so. Wikipedia says "A livelock is similar to a deadlock, except that the states of the processes involved in the livelock constantly change with regard to one another, none progressing."

But take our example: say one instance goes offline and only 4 remain, client1 acquires two of them and client2 acquires the other two. Neither of them obtains the lock, and both then release what they hold. However, https://redis.io/topics/distlock#retry-on-failure says "When a client is unable to acquire the lock, it should try again after a random delay in order to try to desynchronize multiple clients trying to acquire the lock for the same resource at the same time (this may result in a split brain condition where nobody wins).", so because of the random delay, the "none progressing" situation does not end up happening.
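
Here is a toy simulation of that scenario, just to illustrate the argument. The assumptions are mine: the instances are plain list slots rather than Redis, the delay range of 0 to 200 ms is invented, and the two random delays are taken to differ enough that the retries do not overlap (if they did overlap, another split and another retry would be possible).

```python
import random

QUORUM = 3  # N // 2 + 1 for N = 5, even though only 4 instances are reachable


def try_lock(instances, client):
    """Claim every free reachable instance in order; return how many were claimed."""
    claimed = 0
    for i, owner in enumerate(instances):
        if owner is None:
            instances[i] = client
            claimed += 1
    return claimed


def release(instances, client):
    """Free every instance currently held by this client."""
    for i, owner in enumerate(instances):
        if owner == client:
            instances[i] = None


def retry_after_split(rng):
    # Round 1: the split from the example, two instances each, nobody has 3.
    instances = ["client1", "client1", "client2", "client2"]
    clients = ("client1", "client2")
    for client in clients:
        if instances.count(client) < QUORUM:
            release(instances, client)
    # Round 2: each client retries after its own random delay (in ms).
    delays = {client: rng.uniform(0, 200) for client in clients}
    winner = None
    for client in sorted(delays, key=delays.get):   # earlier retry goes first
        if try_lock(instances, client) >= QUORUM:
            winner = client
        else:
            release(instances, client)
    return winner, delays


winner, delays = retry_after_split(random.Random(0))
```

Whichever client draws the shorter delay retries first, finds all four reachable instances free, and reaches the quorum of 3. The other client then fails and has to retry later, so some client does make progress.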