The Internet Computer blockchain makes it possible for the world’s systems and services to be created using advanced canister smart contracts that run entirely on-chain. Canisters enable autonomous service composability that can drive extraordinary network effects, allowing practically any enterprise to be fundamentally reimagined. Since network Genesis in May, the DFINITY Canister SDK has been used to create thousands of canister smart contracts on the Internet Computer, many of which are complete Web 3.0 dapps.
The rapid growth of canisters and users on the Internet Computer blockchain presents interesting engineering challenges. A recent increase in memory-intensive canisters demonstrated that the memory system had a performance bottleneck under heavy load. This blog post describes performance optimizations from NNS proposal 20461, providing details about scaling and WebAssembly (Wasm) memory.
The optimizations were incrementally deployed to all Internet Computer subnets after the adoption of NNS proposal 20461 on Sept. 14. Figures 1–3 show the impact of the optimizations on one subnet under heavy load at the time of the upgrade. You can see two main improvements:
- Increased and more stable finalization: The choppy finalization rate recovered from 0.5 blocks per second to the expected level of 1 block per second.
- Improved message execution time: The average message execution time improved by ~3x and the maximum improved by ~10x.
There are two types of messages that a canister can receive and execute: queries and updates. Queries are read-only in the sense that all modifications that a query performs in the Wasm memory are discarded after execution. In contrast, a successful execution of an update message automatically persists all memory changes and makes them available to subsequent update messages and queries. This concept is known as orthogonal persistence.
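The update/query distinction can be sketched in a few lines. This is an illustrative model only (the class and method names are invented for this sketch, not the replica's API): an update commits its memory changes on success, while a query runs on the same persisted memory but its writes are thrown away.

```python
# Minimal sketch of orthogonal persistence semantics. Names are illustrative,
# not the actual replica API. Memory is modeled as a page_index -> bytes map.

class Canister:
    def __init__(self):
        self.memory = {}  # persisted Wasm memory

    def execute_update(self, fn):
        scratch = dict(self.memory)  # work on a copy of the persisted memory
        fn(scratch)
        self.memory = scratch        # success: commit all modifications

    def execute_query(self, fn):
        scratch = dict(self.memory)  # queries see the same persisted memory,
        return fn(scratch)           # but their writes are simply discarded

canister = Canister()
canister.execute_update(lambda mem: mem.update({0: b"hello"}))
canister.execute_query(lambda mem: mem.update({1: b"temp"}))  # discarded
assert canister.memory == {0: b"hello"}
```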
Any implementation of orthogonal persistence has to solve two problems:
- How to map the persisted memory into the Wasm memory.
- How to keep track of all modifications in the Wasm memory so that they can be persisted later on.
The current implementation uses page protection to solve both problems. When a message starts executing, we divide the entire address range of the Wasm memory into 4KiB chunks called pages. Initially, all pages are marked as inaccessible using the page protection flags of the operating system. This means that the first memory access triggers a page fault, pauses the execution, and invokes our signal handler. The signal handler then fetches the corresponding page from persisted memory and marks the page as read-only. Subsequent read accesses to that page will succeed without any help from the signal handler. The first write access will trigger another page fault, however, and allow the signal handler to remember the page as modified and mark the page as readable and writable. All subsequent accesses to that page (both read and write) will succeed without invoking the signal handler.
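The state transitions above can be modeled directly. The following sketch simulates the page-fault protocol (a real implementation would use `mprotect(2)` and a `SIGSEGV` handler; here we only model the states, the fault counts, and the dirty-page tracking):

```python
# Simulation of the page-protection state machine: pages start inaccessible,
# the first access maps them read-only, and the first write marks them dirty
# and read-write. Each simulated fault stands for one signal handler call.

INACCESSIBLE, READ_ONLY, READ_WRITE = range(3)

class TrackedMemory:
    def __init__(self, num_pages):
        self.state = [INACCESSIBLE] * num_pages
        self.dirty = set()  # pages to persist after a successful update
        self.faults = 0     # signal handler invocations

    def read(self, page):
        if self.state[page] == INACCESSIBLE:
            self.faults += 1              # fault: fetch page, map read-only
            self.state[page] = READ_ONLY

    def write(self, page):
        self.read(page)                   # an unmapped page faults first
        if self.state[page] == READ_ONLY:
            self.faults += 1              # fault: remember page as modified
            self.dirty.add(page)
            self.state[page] = READ_WRITE

mem = TrackedMemory(num_pages=4)
mem.read(0)   # 1 fault: map read-only
mem.read(0)   # no fault
mem.write(0)  # 1 fault: upgrade to read-write, record as dirty
mem.write(1)  # 2 faults: read fault, then write fault
assert mem.faults == 4 and mem.dirty == {0, 1}
```

Note how a fresh page costs two faults when written: one to map it read-only and one to record the modification. This double cost is what the query fast path described later avoids.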
Invoking a signal handler and changing page protection flags are expensive operations. Messages that read or write large chunks of memory cause a storm of such operations, degrading the performance of the whole system. This is the performance bottleneck observed under heavy load. Note that the signal handler was written before the launch of the Internet Computer, and its main priority was correctness and not performance.
A canister executes update messages sequentially, one by one. Queries, in contrast, can run concurrently with each other and with update messages. The support for concurrent execution makes the memory implementation much more challenging. Imagine that a canister is executing an update message at block height H. At the same time, there could still be a long-running query that started earlier at block height H-K. This means the same canister can have multiple versions of its memory active at the same time.
A naive solution to this problem would be to copy the entire memory after each update message. That would be slow and would use a lot of storage. Thus, the current implementation takes a different route. It keeps the modified memory pages in a persistent tree data structure called PageDelta that is based on Fast Mergeable Integer Maps. At a regular interval (every N rounds), there is a checkpoint event that commits the modified pages into the checkpoint file after cloning the file to preserve its previous version. Figure 4 shows how the Wasm memory is constructed from PageDelta and the checkpoint file.
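The overlay construction in Figure 4 reduces to a simple lookup rule. In this sketch a plain dict stands in for PageDelta (the real one is a persistent integer map, which makes cloning a version cheap because versions share unmodified subtrees):

```python
# Sketch of reconstructing a canister's Wasm memory from a page delta
# overlaid on a checkpoint file. Data values are illustrative.

PAGE_SIZE = 4096

checkpoint = {i: bytes([i]) * PAGE_SIZE for i in range(3)}  # committed pages
page_delta = {1: b"\xff" * PAGE_SIZE}  # pages modified since the checkpoint

def get_page(index):
    # Modified pages shadow the checkpoint; unmodified ones fall through.
    if index in page_delta:
        return page_delta[index]
    return checkpoint[index]

assert get_page(0) == checkpoint[0]        # unmodified: from the checkpoint
assert get_page(1) == b"\xff" * PAGE_SIZE  # modified: from the delta
```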
Optimization 1: memory-mapping the checkpoint file
The first optimization is to memory-map the checkpoint file pages. This reduces the memory usage by sharing the pages between multiple messages running concurrently. This optimization also improves performance by avoiding page copying on read accesses. The number of signal handler invocations remains the same as before, so the issue of signal storms is still open after this optimization.
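The sharing effect comes from the operating system's page cache: every process (or message execution) that maps the same file read-only reuses the same physical pages. A minimal sketch, with an invented stand-in checkpoint file:

```python
# Sketch of memory-mapping a checkpoint file read-only. The OS page cache
# shares the underlying pages among all concurrent readers, and read
# accesses need no copying. The file created here is a stand-in.

import mmap
import os
import tempfile

PAGE_SIZE = 4096

# Create a stand-in "checkpoint file" containing two pages of data.
fd, path = tempfile.mkstemp()
os.write(fd, b"A" * PAGE_SIZE + b"B" * PAGE_SIZE)
os.close(fd)

with open(path, "rb") as f:
    # ACCESS_READ maps the file shared and read-only: concurrent messages
    # mapping the same checkpoint reuse the same physical memory pages.
    view = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    first_page = view[0:PAGE_SIZE]
    view.close()

os.remove(path)
assert first_page == b"A" * PAGE_SIZE
```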
Optimization 2: page tracking in queries
All memory pages that a query modifies are discarded after execution. This means that the signal handler doesn’t have to keep track of modified pages for queries. But the old implementation of the signal handler did not distinguish between update messages and queries. We introduced a fast path for queries that marks pages as readable and writable on the first access. This low-hanging fruit optimization made queries 1.5x-2x faster on average.
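The saving is easy to quantify: under the original protocol a fresh page costs two faults when written (map read-only, then record as dirty), while the query fast path collapses this to one. A sketch of the fault arithmetic (the helper function is invented for illustration):

```python
# Sketch of the query fast path: since a query's memory changes are
# discarded, dirty pages need no tracking, so the first fault on a page can
# map it read-write immediately -- one signal handler call instead of two.

def faults_for_writes(pages_written, is_query):
    faults = 0
    mapped = set()
    for page in pages_written:
        if page not in mapped:
            faults += 1 if is_query else 2  # update: read fault + write fault
            mapped.add(page)
    return faults

assert faults_for_writes(range(100), is_query=False) == 200
assert faults_for_writes(range(100), is_query=True) == 100  # half the faults
```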
Optimization 3: amortized prefetching of pages
The idea behind the most impactful optimization is simple: If we want to reduce the number of page faults, then we need to do more per signal handler invocation. Instead of fetching a single page at a time, the new signal handler tries to speculatively prefetch more pages. The right balance is required here because prefetching too many pages may degrade the performance of small messages that access only a few pages. The optimization computes the largest contiguous range of accessed pages immediately preceding the current page. It uses the size of the range as a hint for prefetching more pages. This way the cost of prefetching is amortized by previously accessed pages. As a result, the optimization reduces the number of page faults in memory-intensive messages by an order of magnitude.
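The amortization can be simulated. In the sketch below (a simplified model, not the replica's actual heuristic), a fault at page p maps p plus as many extra pages as the length of the contiguous accessed run just before p. A message touching one page still faults exactly once, while a sequential scan over n pages faults only about log2(n) times because each fault doubles the mapped run:

```python
# Sketch of amortized prefetching: on a fault, measure the contiguous run of
# already-accessed pages immediately preceding the faulting page, and
# speculatively map that many extra pages in the same signal handler call.

def count_faults(pages, num_pages):
    mapped = set()
    faults = 0
    for p in pages:
        if p in mapped:
            continue
        faults += 1
        # Length of the contiguous mapped run ending right before page p.
        run = 0
        while p - run - 1 >= 0 and (p - run - 1) in mapped:
            run += 1
        # Map the faulting page plus `run` prefetched pages; the prefetch
        # cost is amortized by the pages that were already accessed.
        for q in range(p, min(p + run + 1, num_pages)):
            mapped.add(q)
    return faults

assert count_faults([5], 4096) == 1           # small message: no penalty
assert count_faults(range(4096), 4096) == 13  # sequential scan: ~log2(n)
```

Without prefetching, the sequential scan would fault once per page (4096 times); the run-length hint brings that down to 13 in this model, while leaving sparse access patterns unpenalized.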
The original signal handler was written before the launch of the Internet Computer, with a focus on correctness over performance. It was not surprising that this area would need to be optimized for performance. The rapid growth of the Internet Computer required optimizations much earlier than expected, however. These optimizations removed one bottleneck, but more bottlenecks may be discovered as the Internet Computer continues to grow. If you are interested in working on performance, DFINITY is hiring!
Thanks to Akhi Singhania, Alin Sinpalean, Dimitris Sarlis, Dominic Williams, Johan Granström, Kiran Joshi, Roman Kashitsyn, Saša Tomić, Stefan Dietiker, Stefan Kaestle for discussing ideas and reviewing code changes. Also thanks to Diego Prats for reviewing this blog post.