July 6, 2021
SKALE differentiators, consensus, block proposals and survivability: an interview with Stan Kladko, CTO/Co-Founder, SKALE Labs
I had the opportunity to sit down with Stan Kladko, CTO/Co-Founder, SKALE Labs, the other day and get his thoughts on the design of the SKALE Network and why it’s different from other Ethereum scaling and Layer 2 solutions. We talked about production systems vs research projects, mathematical algorithms, binary consensus, block proposals, modularity, self-healing networks, and more.
In the course of our talk, he unpacked a number of details on the inner workings of blockchains and explained how consensus algorithms can be made faster and more energy and resource efficient without sacrificing security.
Can you explain some of the design decisions around SKALE Chains?
Stan: We’ve designed the SKALE Network to be Ethereum compatible. Everything you can run on Ethereum, you can run on SKALE and do so much faster. In addition, we’ve designed and built SKALE Chains to be mathematically provable so that every single point inside SKALE Chains has an algorithm which is well described and which is mathematically provable under certain assumptions. These assumptions are homogeneous across all features.
To our knowledge, these chains are the only mathematically provable Proof of Stake blockchains currently on the market. There are a lot of academic papers from universities but as far as production-quality systems, we think that our system is the first. This rigor is really important because you want to store your funds on something that has a mathematical foundation.
How are SKALE Chains different from other blockchains?
Stan: SKALE is a next generation blockchain. Previous generations of blockchains – Bitcoin, Ethereum, and pretty much any other blockchain – have a fixed block time. This means blocks are released in a certain time sequence, usually like every 10 or 15 seconds. With SKALE, blocks can be released on an as-needed basis. This saves CPU time because if there are no transactions there are no blocks. If someone comes along and issues a transaction then a new block will appear seconds after.
Note that in this most current release, we are still creating empty blocks every 10 seconds because people want to see some kind of beacon or some kind of proof that the chain is working. Theoretically, you could run a SKALE chain and not produce any blocks or produce one only once a month. This as-needed approach allows us to potentially have many chains on the same node and save money and resources for users.
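The as-needed rule described above can be sketched in a few lines of Python. This is only an illustration of the decision, not SKALE's actual code: the function name, its parameters, and the constant are hypothetical, and the 10-second beacon interval is the figure quoted in the interview.

```python
# Hypothetical sketch of "blocks on demand, plus an empty beacon block".
BEACON_INTERVAL_SECONDS = 10.0  # interval quoted in the interview


def should_produce_block(pending_tx_count: int,
                         seconds_since_last_block: float,
                         beacon_interval: float = BEACON_INTERVAL_SECONDS) -> bool:
    """Return True when a new block is warranted.

    A block is produced as soon as any transaction is pending. With no
    transactions, an empty beacon block is produced only once the beacon
    interval has elapsed.
    """
    if pending_tx_count > 0:
        return True
    return seconds_since_last_block >= beacon_interval
```

With a very large `beacon_interval`, an idle chain would produce almost no blocks, which is the theoretical case mentioned above.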
Usually people run a single blockchain and use full nodes. Having many chains per node is unique. The advances in our consensus algorithms let us split nodes across many, many blockchains, which is really convenient for our validators. Also, our nodes are not running on super expensive hardware. We're able to support inexpensive AWS and DigitalOcean machines. A typical SKALE node will run on a mid-size virtual machine, which might cost a few hundred dollars per month.
Can you describe how SKALE’s version of consensus is different and why this is important?
Stan: In the blockchain world, there has always been a split between production-oriented systems and those in academic research. On the practical level, you have Bitcoin, Ethereum, and Proof of Stake systems like Cosmos and Tendermint. In the academic world, you have ABBA and a number of other mathematically provable algorithms.
Practical systems like Tendermint do not use ABBA consensus or other binary consensus algorithms. Instead they use plausible proofs built by engineers. Tendermint is the best so far, with Polkadot being second best, but it’s less secure and less exact. And then you go to less advanced algorithms used by EOS which have even less mathematical foundation. If you read their papers, they are not at a research level in the sense that there are no mathematical proofs in them.
It's engineers saying, “This thing is running a certain way but let's consider where things can go wrong.” They then provide some rationale on how they can defend against the bad things that can happen. They will fix one thing and then another issue comes along and they’ll fix that. But to date, there hasn’t been an injection of binary consensus or some other research algorithm into a production system.
If you want more than just an ad hoc collection of engineering fixes, then you need an exact proof that there will be no security vulnerability. What SKALE is doing here is bridging this gap between mathematical research and the real world. We strongly believe we are able to argue that the entire thing is provable from the beginning to the end. That’s our claim. We’re not saying the other production algorithms are bad, as they may work for many situations. It’s just that we are interested in creating something that is mathematically provable.
How does SKALE get to two second block times or less?
Stan: Most Proof of Stake algorithms work in a simple manner. You have a blockchain, the blockchain needs to add blocks, blocks appear, and then they get glued to the previous blocks. If you look at most Proof of Stake algorithms such as ETH2, Tendermint, EOS, and Polkadot, they pretty much all work the same way in that they use single block proposers.
At any point in time, there is a designated node that is the only one that can propose a block. If their block is deemed by the others to be OK, it gets added to the chain and then another block proposer is selected. This selection could be round-robin, by random selection, or by some other mechanism.
This sequence is time-sliced in that there is a fixed period of time within which the proposer can propose a block. There is also a timeout period whereby if the designated proposer does not propose a block, the option goes to the next proposer. For example, Eth2 has a 10 to 15 second timeout period which means that each designated proposer has that much time with which to propose a block. It also means that others cannot propose blocks until that time period is up. It also means that a proposer may have to propose an empty block – in the event of no transactions – so as not to be deemed non-responsive or, worse, malicious.
SKALE’s consensus algorithm is different in that it lets every node building the chain propose a block candidate. Which means all 16 nodes in a chain can propose a candidate at the same time. Out of these 16 candidates, one is selected. Based on the mathematics behind the algorithm, a high percentage of the candidates are presumed to be valid candidates which means there should never be a timeout. Which means no fixed time period is needed. It also makes for fast chain creation in that after a block is added, the nodes immediately submit new candidates.
If every node submits a proposal at the same time, does that create unnecessary network traffic?
Stan: Not in this case. Normally, having 16 block proposals would be expensive network wise because every node needs to send their proposal to every other node. It turns out this inefficiency can actually be solved relatively easily. It was solved in Honeybadger using a very advanced mathematical algorithm. We solve it in a more practical and simpler way.
When you have 16 nodes and they’re doing block proposals, every node has a pending queue of transactions and gossips transactions to each other. These queues and this gossiping happen in Ethereum, they happen in Bitcoin, and have existed essentially since the beginning of blockchain.
Every node has a pending queue but the queues are not exactly the same. There are network delays, propagation delays, slow network traffic, etc. But even with this, pending queues are approximately equal. If one node has a transaction in their queue, another node might get it maybe a hundred milliseconds later, but for the most part 99% of the queues are the same and the 1% difference comes from network delays.
If I issue a proposal, I don’t need to send you the actual transactions because you also have these transactions. The only thing I need to send are hashes or fingerprints of the transactions, so when I send a proposal, it’s a very lightweight proposal. Instead of the actual transactions, they’re basically just single numbers.
When you get the fingerprints or hashes, you look into your pending queue, find the transactions, and then reconstruct the proposals. This is a huge compression because you're not sending the transactions, you're sending the fingerprints. There may be several transactions which have not made it into your pending queue but then you’ll just tell me, “I got 99% of these transactions but I don't know these three transactions.” In which case, I would then send you just those three transactions. Because of this approach, the block proposal is very lightweight. It’s not a big deal for each node to send their proposal to every other node.
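The exchange described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not SKALE's implementation: all function names are hypothetical, SHA-256 stands in for whatever fingerprint the real protocol uses, and networking is replaced by plain function calls.

```python
import hashlib
from typing import Dict, List


def tx_hash(tx: bytes) -> str:
    """Fingerprint of a transaction (SHA-256 is an assumption of this sketch)."""
    return hashlib.sha256(tx).hexdigest()


def make_proposal(transactions: List[bytes]) -> List[str]:
    """Proposer side: the proposal is just the ordered list of fingerprints."""
    return [tx_hash(tx) for tx in transactions]


def find_missing(proposal: List[str], pending: Dict[str, bytes]) -> List[str]:
    """Receiver side: fingerprints not present in the local pending queue."""
    return [h for h in proposal if h not in pending]


def serve_missing(missing: List[str], pending: Dict[str, bytes]) -> List[bytes]:
    """Proposer side: send back only the full transactions that were requested."""
    return [pending[h] for h in missing]


def rebuild_block(proposal: List[str], pending: Dict[str, bytes],
                  fetched: List[bytes]) -> List[bytes]:
    """Receiver side: reconstruct the full proposal from local and fetched data."""
    known = dict(pending)
    for tx in fetched:
        known[tx_hash(tx)] = tx
    return [known[h] for h in proposal]


# Usage: the receiver already has two of the three transactions.
proposer_txs = [b"pay alice 1", b"pay bob 2", b"pay carol 3"]
proposer_queue = {tx_hash(tx): tx for tx in proposer_txs}
receiver_queue = {tx_hash(tx): tx for tx in proposer_txs[:2]}

proposal = make_proposal(proposer_txs)
missing = find_missing(proposal, receiver_queue)
fetched = serve_missing(missing, proposer_queue)
block = rebuild_block(proposal, receiver_queue, fetched)
```

Only the one transaction the receiver lacks travels in full. Everything else is identified by its fingerprint.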
There were proposals to use the same mechanisms in Bitcoin, Eth1, and Eth2 but so far they haven’t materialized. Other Proof of Stake systems are thinking about this, but to my knowledge, we are the first ones to actually realize this idea. Because of this, our proposals are tiny, literally a hundred times smaller than proposals in say Bitcoin or Eth2.
What's the difference between the fast block times in SKALE vs other chains that also make similar claims?
Stan: That’s an important question. A Proof of Stake blockchain consists of a set of 10 to 20 nodes which produces a chain – next block after next block. The term “finality” refers to when you can say that the block is going to be this one exactly. In some blockchains, finality takes a long time.
It is well known in mathematical literature that binary consensus cannot be done mathematically faster than N messages per node, where N is the number of nodes. If you want to have a binary consensus of 100 nodes, the best mathematically provable algorithm takes 100 messages per node. Each node basically sends at least one message to every other node. If you want to have a binary consensus of 1000 nodes, then you need to send 1000 messages.
It is impossible to do this binary consensus in fewer messages. The problem that exists in the Proof of Stake world is that every next project needs to claim better performance than other projects. As a result, they claim to create systems which literally go beyond the speed of light.
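To make the counting concrete, here is a small sketch of the arithmetic only. It models the statement that each node sends at least one message to every other node, and it does not model any particular consensus protocol. The function names are hypothetical. The interview rounds the per-node figure up to the number of nodes.

```python
def messages_sent_per_node(node_count: int) -> int:
    """Minimum messages one node sends if it contacts every other node once."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return node_count - 1


def total_messages(node_count: int) -> int:
    """Total messages across the network for one all-to-all exchange."""
    return node_count * messages_sent_per_node(node_count)


for n in (16, 100, 1000):
    print(n, messages_sent_per_node(n), total_messages(n))
```

The per-node cost grows linearly with the number of nodes and the network-wide cost grows quadratically, which is why the size of the validator set directly limits how fast such an algorithm can run.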
Everything that claims to be faster than Tendermint means they’re introducing some beyond-the-speed-of-light type of an argument. When you read the white paper of this super great approach, you always see that it has some pseudo-mathematical language that tries to hide things.
Do these algorithms work? The answer is yes. But the question is why do these algorithms work? Many of these algorithms suppose internally that all of the validators, all the nodes, are good. If you can make an assumption that all of the nodes are honorable then you can use much faster algorithms. If a security assumption breaks in a blockchain, though, people can lose money.
At SKALE, we clearly state that we run as fast as we can under the assumption that at some point one third of validators can become bad and it will still work in a secure manner. This is the trade-off between performance and security. There may be cases where other networks say all of their validators are good, and then they can run a faster system, but that’s not the approach we take.
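As a rough illustration of the one-third assumption, the sketch below uses the classical Byzantine fault tolerance bound n ≥ 3f + 1, where n is the number of nodes and f is the number of faulty ones. Applying this exact bound to SKALE is an assumption of the sketch, since the interview does not state the precise threshold, and a deployment may quote a more conservative figure. The function names are hypothetical.

```python
def max_faulty_nodes(node_count: int) -> int:
    """Largest f such that node_count >= 3 * f + 1 (classical BFT bound)."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return (node_count - 1) // 3


def tolerates(node_count: int, faulty: int) -> bool:
    """True if this many faulty nodes stays within the classical bound."""
    return 0 <= faulty <= max_faulty_nodes(node_count)


# For the 16-node chains discussed in this interview:
print(max_faulty_nodes(16))
```

Under this bound a 16-node chain keeps working with up to 5 faulty nodes, which is strictly fewer than one third of 16.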
Let’s talk about the modularity of the SKALE Network and what's been done in SKALE to allow for more extensible chain services.
Stan: At SKALE, we believe that we only need to do the minimum core possible. If we can exclude something thereby saving time, we would do so. But our chains are not such that if you throw away a feature, then users wouldn’t be able to run their systems. We do believe, though, that not everything should be done by us. If someone wants to develop something on SKALE and expand SKALE we do not want to stand in their way.
The perfect scenario for SKALE is to keep our system extremely modular, almost similar to how Linux is done. If you look at Linux, it is perfectly modular in that the only thing the Linux team cares about is the Linux kernel. The kernel is evolving super slowly – it's been in development for 30 years. All the other things are done by other companies and other teams. SKALE can be like the Linux kernel where we just perfect the core and provide ways for people to expose new things via things like packages.
How will packages and customization work – will it be on a chain by chain basis?
Stan: There are two different scenarios. One scenario is the case where a feature is fully implementable in Solidity. With an Eth package manager – which we think will be offered up to the Ethereum community shortly – they would install it into a SKALE chain similar to how Linux installers work. This would be an easy path for us to take.
If you want to modify your chains deeper, then you could use pre-compiled contracts. Say you want to run something really fast on your chain, then you would want to write it in Python or C++ because Solidity can be rather slow. For this, we would use a regular package manager that allows it to get installed as a pre-compile into contracts.
One of the pain points with Ethereum Mainnet is that it is a subjective decision as to which algorithms to include. I believe they have eight or ten precompiled packages for various cryptography approaches, but in cryptography there is an explosion of algorithms. People are doing all kinds of things. The problem with the Ethereum Mainnet is that you have to somehow filter which ones to include. In our case, it’s your chain. If you want a particular algorithm, if you want to run the best solution for you, you will be able to put that into a package and run it.
Can you explain how network self-healing and survivability work?
Stan: SKALE needs to deal with the situation where a chain crashes. An older blockchain like Bitcoin has very simple code and runs relatively slowly, so it doesn’t crash. That's because a simple system can be made secure and bug free. With Ethereum, there may have been a time or two where there was a denial of service attack and it did crash the chain.
With Proof of Stake systems, they need to run fast and so more complicated algorithms need to be used. With complex systems, you cannot avoid having to plan for crashes and catastrophes. You need to address scenarios for when chains die. If you don't, then you are in denial.
It could be bugs in the software, it could be AWS crashes, or events within the network. You have to have a solution to address this. In SKALE, if fewer than five nodes die out of the 16 nodes, the Proof of Stake algorithm doesn't even notice and so the chain will continue operating as usual.
If a node does die, there are two mechanisms to sync it back to the chain – one short-term and one long-term. If the node has been down for a couple of hours, there is a cache of blocks which is kept on other nodes. The restarted node can just download those blocks and catch up. If a node has been down for a longer period of time, it can get a snapshot from one of the other nodes, catch up from there, and then join back with the chain.
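The choice between the two recovery paths can be sketched as a simple decision. This is an illustrative model only. The interview describes the choice in terms of how long the node was down, while this sketch expresses it as the number of missed blocks compared with how many recent blocks peers keep cached. The function name, parameters, and return labels are all hypothetical.

```python
def choose_catch_up_method(blocks_missed: int, peer_cache_size: int) -> str:
    """Pick a recovery path for a restarted node.

    blocks_missed:   how many blocks the node is behind the chain.
    peer_cache_size: how many recent blocks other nodes keep cached.
    """
    if blocks_missed < 0 or peer_cache_size < 0:
        raise ValueError("arguments must be non-negative")
    if blocks_missed == 0:
        return "in_sync"
    if blocks_missed <= peer_cache_size:
        # Short outage: download the missing blocks from the peers' cache.
        return "block_cache"
    # Long outage: start from a peer's snapshot, then catch up from there.
    return "snapshot"
```

For example, with a cache of 1,000 blocks, a node that missed 200 blocks would use the cache, while one that missed 50,000 would fall back to a snapshot.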
If there’s a terrible problem and everything dies, then the system will halt. You will either bring the nodes back online and restart the chain or, in the very worst case, restart from a snapshot. Every node makes a snapshot of itself every 24 hours. (We'll make it faster in the future, but in the first release it is 24 hours.) Each node stores its own copy, but anyone can also download copies. That way, if a node or nodes disappear, the snapshot is stored elsewhere and the developer or community can take care of that.
What are you most proud of over the course of your work on SKALE?
Stan: In parallel to releasing software, our team was really growing in terms of understanding how things work. Blockchain is a very new subject. When we started, there was no magic button. People didn't know then as much as we know now. We studied papers and went on a long learning curve. If we were to start from scratch now, we would develop it much faster. But back then, there was no way to hire someone who had the experience from the future.
Blockchains are at the edge between something that you can do and something that you can’t do – meaning that there is much more complexity involved than in prior systems development. Blockchains exist on multiple computers with some of them presumed to be malicious with the potential to do something bad. So that is a next level of problem. Blockchain systems are on the edge, where when you start, you don’t know if something is possible. It’s great that even at this point, we have it running as well as it is.
It has also been great to see people become experts in a complex field. Things which previously were hard for them suddenly become easier. Everyone has become an expert in their particular kind of subsystem. The people on the team are very friendly and have become friends after many years. I think it is a very bright sign for SKALE because we think we'll be able to develop things quickly in the future. Everyone has high expectations for the future. That's an important accomplishment.