'Twas the day before genesis, when all was prepared, geth was in sync, my beacon node paired. Firewalls configured, VLANs galore, hours of preparation meant nothing ignored.
Then all at once everything went awry, the SSD in my system decided to die. My configs were gone, chain data was history, nothing to do but trust in next day delivery.
I found myself designing backups and redundancies. Complicated systems consumed my fantasies. Thinking further I came to realise: worrying about these kinds of failures was quite unwise.
Incidents
The beacon chain has several mechanisms for incentivising validator behaviour, all of which depend on the current state of the network. It is therefore vital to consider failures in the greater context of how other validators might be failing at the same time when deciding which failures are worth securing your node against, and which are not.
As an active validator, your balance either increases or decreases, it never moves sideways*. A very reasonable way to maximise your profits is therefore to minimise your downside. There are three ways the beacon chain can decrease your balance:
- Penalties are issued when your validator misses one of its duties (e.g. because it is offline)
- Inactivity leaks are issued to validators that miss their duties while the network is failing to finalise (i.e. when your validator being offline is highly correlated with other validators being offline)
- Slashings are issued to validators that produce contradictory blocks or attestations which could therefore be used in an attack
*On average a validator's balance may stay the same, but for any given duty they are either rewarded or penalised.
Correlation
A single validator being offline or performing a slashable action has a minimal effect on the overall health of the beacon chain, so it is not penalised severely. In contrast, if many validators are offline, the balance of each offline validator decreases much more quickly.
Likewise, if many validators perform slashable actions at around the same time, then from the beacon chain's point of view this is indistinguishable from an attack. It is therefore treated as such, and 100% of the offending validators' stake is burned.
Because of these "anti-correlation" incentives, validators should worry more about failures that could affect many others at the same time than about problems that affect them in isolation.
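To make that scaling concrete, here is a rough sketch of how the correlated part of the slashing penalty grows with the fraction of total stake slashed in the same window. The multiplier below is an assumption for illustration only; the real constant is defined by the spec and has changed between upgrades.

```python
# Illustrative sketch only (not consensus code): the correlated slashing
# penalty scales with the fraction of all staked ETH slashed in the same
# window. The multiplier is assumed here purely for illustration.

def correlated_slashing_penalty(effective_balance_eth: float,
                                slashed_fraction: float,
                                multiplier: float = 3.0) -> float:
    """Penalty (ETH) for a slashed validator when `slashed_fraction` of all
    staked ETH was slashed within the same window."""
    return effective_balance_eth * min(multiplier * slashed_fraction, 1.0)

# A validator slashed in isolation loses very little from this term...
print(correlated_slashing_penalty(32, 0.0001))  # ~0.01 ETH
# ...but if a third of the network is slashed alongside it, everything is burned.
print(correlated_slashing_penalty(32, 1 / 3))   # 32.0 ETH
```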
Failure cases and their probabilities
So let's consider some failure cases and examine them through the lens of how many other validators are likely to be affected at the same time, and how severely your validator would be penalised as a result.
I disagree with @econoar here that these are worst-case problems. These are more middle-of-the-road issues. A home UPS failure or a dual-WAN failure is uncorrelated with other users and should therefore be much further down your list of concerns.
🌍 Internet/power failure
If you are validating from home, you will more than likely experience one of these failures at some point. Residential internet and power connections do not come with guaranteed uptime. However, when your internet does go down, or your power goes out, the outage is usually limited to your local area, and even then it typically lasts only a few hours.
Unless you have a very involved internet/power setup, it is probably not worth paying for redundant connections. You will incur a few hours of penalties, but because the rest of the network is running normally, your penalties will be roughly equal in magnitude to the rewards you would have earned over the same period. In other words, a k-hour outage sets your validator's balance back to roughly where it was k hours before the failure, and after a further k hours online, your validator's balance is back to its pre-failure amount.
[Validator #12661 regaining ETH as quickly as it was lost – Beaconcha.in]
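If you prefer numbers to words, here is a tiny back-of-the-envelope sketch of that recovery. The hourly figures are invented for illustration only; the point is just that the offline penalty roughly mirrors the reward while the rest of the network is healthy.

```python
# Back-of-the-envelope sketch of the recovery described above.

hourly_reward = 0.0005    # ETH earned per hour while online (assumed figure)
hourly_penalty = 0.0005   # ETH lost per hour while offline (~ the reward,
                          # when the network is finalising normally)
k = 6                     # hours of downtime

balance = 32.0
balance -= k * hourly_penalty   # the outage wipes out ~k hours of earnings
print(balance)                  # 31.997
balance += k * hourly_reward    # k more hours online earns it back
print(balance)                  # 32.0 – back to the pre-failure balance
```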
🛠 Hardware failure
Like internet failure, hardware failure strikes randomly, and when it does, your node might be down for a few days. It is valuable to consider the expected rewards over the lifetime of the validator versus the cost of redundant hardware. Is the expected value of the failure (the offline penalties times the chance of it happening) greater than the cost of the redundant hardware?
Personally, I judge the chance of failure to be low enough, and the cost of fully redundant hardware high enough, that it almost certainly isn't worth it. But then again, I am not a whale 🐳; as with any failure scenario, you need to evaluate how this applies to your particular situation.
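As a sketch of how such a comparison might look, here is the expected-value check with invented numbers (the probability, downtime, penalty, and hardware cost below are assumptions, not measurements or advice):

```python
# Rough expected-value comparison: is the expected penalty from a hardware
# failure bigger than the cost of keeping fully redundant hardware on standby?
# All numbers are invented for illustration.

p_failure_per_year = 0.05     # assumed chance the hardware dies this year
days_offline = 3              # assumed downtime waiting on delivery + resync
penalty_per_day_eth = 0.01    # assumed offline penalty per day (uncorrelated case)
redundant_hw_cost_eth = 0.5   # assumed cost of a spare machine, in ETH

expected_loss = p_failure_per_year * days_offline * penalty_per_day_eth
print(expected_loss)                          # 0.0015 ETH per year
print(expected_loss > redundant_hw_cost_eth)  # False – the spare isn't worth it here
```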
☁️ Cloud services failure
Maybe, to avoid the risks of hardware or internet failure altogether, you decide to go with a cloud provider. With a cloud provider, you have introduced the risk of correlated failures. The question that matters is, how many other validators are using the same cloud provider as you?
A week before genesis, Amazon AWS had a prolonged outage which affected a large portion of the web. If something similar were to happen now, enough validators would go offline at the same time that the inactivity penalties would kick in.
Even worse, if a cloud provider were to duplicate the VM running your node and accidentally leave the old and the new node running at the same time, you could be slashed (the penalties incurred would be especially bad if this accidental duplication affected many other nodes too).
If you are insistent on relying on a cloud provider, consider switching to a smaller provider. It may end up saving you a lot of ETH.
🥩 Staking Services
There are several staking services on mainnet today with varying degrees of decentralisation, but they all carry an increased risk of correlated failures if you trust them with your ETH. These services are necessary components of the eth2 ecosystem, especially for those with less than 32 ETH or without the technical know-how to stake, but they are architected by humans and therefore imperfect.
If staking pools eventually grow to be as large as eth1 mining pools, then it is conceivable that a bug could cause mass slashings or inactivity penalties for their members.
🔗 Infura Failure
Last month Infura went down for 6 hours, causing outages across the Ethereum ecosystem; it is easy to see how this is likely to result in correlated failures for eth2 validators.
In addition, 3rd-party eth1 API providers necessarily rate-limit calls to their service: in the past this has caused validators to be unable to produce valid blocks (on the Medalla testnet).
The best solution is to run your own eth1 node: you won’t encounter rate-limiting, it will reduce the likelihood of your failures being correlated, and it will improve the decentralisation of the network as a whole.
Eth2 clients have also started adding the ability to specify multiple eth1 nodes. This makes it easy to switch to a backup endpoint in the event your primary endpoint fails (Lighthouse: --eth1-endpoints, Prysm: PR#8062; Nimbus & Teku will likely add support at some point in the future).
I highly recommend adding backup API options as cheap/free insurance (EthereumNodes.com shows the free and paid API endpoints and their current status). This is useful whether you are running your own eth1 node or not.
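If you want to check that your primary and backup endpoints are actually alive, a few lines of Python against the standard eth_blockNumber JSON-RPC method will do. The URLs below are placeholders for your own node and whichever hosted provider you choose; this is a quick sketch, not monitoring infrastructure.

```python
# Quick sketch: query each eth1 endpoint's latest block number via JSON-RPC
# to confirm the primary and its backups are reachable.

import json
import urllib.request
from typing import Optional

ENDPOINTS = [
    "http://localhost:8545",                     # your own eth1 node
    "https://your-backup-provider.example/KEY",  # hypothetical hosted backup
]

def block_number(url: str, timeout: float = 5.0) -> Optional[int]:
    payload = json.dumps({"jsonrpc": "2.0", "method": "eth_blockNumber",
                          "params": [], "id": 1}).encode()
    req = urllib.request.Request(url, data=payload,
                                 headers={"Content-Type": "application/json"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return int(json.load(resp)["result"], 16)
    except Exception:
        return None  # down, rate-limited, or unreachable

for url in ENDPOINTS:
    print(url, block_number(url))
```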
🦏 Failure of a particular eth2 client
Despite all the code review, audits, and rockstar work, all of the eth2 clients have bugs hiding somewhere. Most of them are minor and will be caught before they present a major problem in production, but there is always the chance that the client you choose will go offline or cause you to be slashed. If this were to happen, you would not want to be running a client with > 1/3 of the nodes on the network.
You must make a tradeoff between what you deem to be the best client and how popular that client is. Consider reading through the documentation of another client, so that if something happens to your node, you know what to expect in terms of installing and configuring a different client.
If you have lots of ETH at stake, it is probably worth running multiple clients each with some of your ETH to avoid putting all your eggs in one basket. Otherwise, Vouch is an interesting offering for multi-node staking infrastructure, and Secret Shared Validators are seeing rapid development.
🦢 Black swans
There are of course many unlikely, unpredictable, yet dangerous scenarios that will always present a risk: scenarios that lie outside the obvious decisions about your staking set-up. Examples such as Spectre and Meltdown at the hardware level, or kernel bugs such as BleedingTooth, hint at some of the hazards that exist across the entire hardware stack. By definition, it is not possible to entirely predict and avoid these problems; instead, you generally must react after the fact.
What to worry about
Ultimately this comes down to calculating the expected value E(X) of a given failure: how likely an event is to happen, and what the penalties would be if it did. It is vital to consider these failures in the context of the rest of the eth2 network since the correlation greatly affects the penalties at hand. Comparing the expected cost of a failure to the cost of mitigating it will give you the rational answer as to whether it is worth getting in front of.
No one knows all the ways a node can fail, nor how likely each failure is, but by making individual estimates of the chances of each failure type and mitigating the biggest risks, the “wisdom of the crowd” will prevail and on average the network as a whole will make a good estimate. Furthermore, because of the different risks each validator faces, and the differing estimates of those risks, the failures you did not account for will be caught by others and therefore the degree of correlation will be reduced. Yay decentralisation!
📕 DON’T PANIC
Finally, if something does happen to your node, don’t panic! Even during inactivity leaks, penalties are small on short time scales. Take a few moments to think through what happened and why. Then make a plan of action to fix the problem. Then take a deep breath before you proceed. An extra 5 minutes of penalties is preferable to being slashed because you did something ill-advised in a rush.
Most of all: 🚨 Do not run 2 nodes with the same validator keys! 🚨
Thanks to Danny Ryan, Joseph Schweitzer, and Sacha Yves Saint-Leger for review
[Slashings because validators ran >1 node – Beaconcha.in]