Q4 2022 Stability and Performance Improvements
Stability
Some solutions already implemented
Sync stalling
Rules for what counts as our best peer post-merge are not implemented
We keep trying the same peer over and over; shuffling the peers would help (see the sketch below)
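A minimal sketch of the peer-shuffling idea: randomize the candidate order on each retry so sync stops hammering the same peer. Illustrative only; the selector and the use of plain peer IDs are assumptions, not Besu's actual peer-selection API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative sketch, not Besu code: shuffle candidates on every attempt so
// we stop retrying the same peer over and over during sync.
public class ShufflingPeerSelector {
  private String lastTried; // peer we attempted most recently

  /** Pick the next peer to try, avoiding an immediate retry of the last one. */
  public String next(List<String> candidates) {
    List<String> pool = new ArrayList<>(candidates);
    Collections.shuffle(pool); // break the "same peer over and over" loop
    for (String peer : pool) {
      if (!peer.equals(lastTried) || pool.size() == 1) {
        lastTried = peer;
        return peer;
      }
    }
    throw new IllegalStateException("no peers available");
  }
}
```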
Invalid block errors
The consensus layer ends up on a fork with bad data
A storage exception in Besu causes it to report the block as invalid to the consensus layer, which sends us off to the wrong fork (potential fix by Justin; GH issue)
Other Besu internal errors we don't yet know about could potentially cause invalid blocks as well
Potential solution identified
Worldstate root mismatch
Bonsai and snapshots
Solution: confirmed working in many cases; needs more testing and handling of corner cases
Issues around peering
Sometimes a restart is needed to find new peers during sync
Potentially because we do not re-evaluate peers during and after sync
Solution: ??
Losing many peers
We lost many peers because threads were blocked
Vert.x, for example, takes a different approach to threading (see the sketch below)
Solution: ??
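For context on the Vert.x point: Vert.x pushes slow work onto a worker pool via executeBlocking so the event loop stays responsive. A minimal sketch using the two-argument executeBlocking form from Vert.x 3.x/4.x; the expensive lookup is a stand-in:

```java
import io.vertx.core.Vertx;

public class NonBlockingExample {
  public static void main(String[] args) {
    Vertx vertx = Vertx.vertx();
    // Slow work runs on a worker thread; the event loop that services peer
    // connections is never blocked while it executes.
    vertx.<String>executeBlocking(
        promise -> promise.complete(expensiveLookup()),
        result -> {
          // Back on the event loop; peers stayed responsive meanwhile.
          System.out.println("result: " + result.result());
          vertx.close();
        });
  }

  // Stand-in for slow disk or state I/O.
  private static String expensiveLookup() {
    try {
      Thread.sleep(500);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return "done";
  }
}
```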
Issues with user experience
Difficulty in reliably communicating with new/inexperienced users
Docs & lack of education
More complicated setup post-merge
Solution: Write up ‘What to expect from staking at home’ and FAQ for Besu Docs
Out of Memory errors
Documentation on what kind of memory config is needed
No mechanism to detect memory leaks
Potential mitigation: make deploying Besu easier by providing default configs
Solution: ??
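One possible shape for the missing detection mechanism: a watchdog that samples heap usage via the standard MemoryMXBean and warns before an OOM. A sketch only; the 90% threshold and 30-second interval are illustrative assumptions.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class HeapWatchdog {
  public static void main(String[] args) {
    MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(
        () -> {
          MemoryUsage heap = memory.getHeapMemoryUsage();
          long max = heap.getMax();
          if (max <= 0) {
            return; // max heap not defined on this JVM
          }
          double used = (double) heap.getUsed() / max;
          if (used > 0.90) { // illustrative threshold
            System.err.printf("heap at %.0f%% of max; consider raising -Xmx%n", used * 100);
          }
        },
        30, 30, TimeUnit.SECONDS);
  }
}
```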
Users don't know how far along the sync is
Insufficient logging & bad log UX
Solution/plan: ??
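One low-effort improvement to the log UX would be a periodic progress line derived from the current and highest known block. A sketch under the assumption those two numbers are available; the names are hypothetical:

```java
public class SyncProgressLogger {
  /** Render a human-readable progress line from block heights. */
  public static String progressLine(long currentBlock, long highestBlock) {
    if (highestBlock <= 0) {
      return "Sync starting; highest block not yet known";
    }
    double pct = 100.0 * currentBlock / highestBlock;
    return String.format("Synced block %,d of %,d (%.1f%%)", currentBlock, highestBlock, pct);
  }

  public static void main(String[] args) {
    System.out.println(progressLine(8_500_000L, 17_000_000L)); // prints "... (50.0%)"
  }
}
```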
Users are hesitant to update or restart Besu with the latest version due to the impression that it is unstable
Issues with RPC calls
Incompatibilities with the RPC spec / not-same-as-Geth behavior causing crashes
Does not meet Chainlink's and other orgs' needs for RPC calls (accuracy, speed)
Do we implement all of the RPC interfaces that Geth does? E.g. Logger, Trace (all the methods)
Solution: ??
Some specific RPC calls (trace/debug) take a long time or OOM
Lack of testing of large RPC calls
Might need to understand the root cause better
Solution: ?
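Pending a root-cause analysis, one mitigation is to bound long-running trace calls with a deadline so they fail fast instead of holding memory indefinitely. A sketch; traceBlock and the 60-second budget are hypothetical placeholders:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public class BoundedTrace {
  // Placeholder for the real (expensive) block replay.
  static String traceBlock(long blockNumber) {
    return "trace-of-" + blockNumber;
  }

  public static void main(String[] args) throws Exception {
    CompletableFuture<String> trace =
        CompletableFuture.supplyAsync(() -> BoundedTrace.traceBlock(15_000_000L))
            .orTimeout(60, TimeUnit.SECONDS); // fail fast past the deadline
    System.out.println(trace.get());
  }
}
```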
Performance
Staking Performance
Poor execution performance leading to missed attestations
More investigation is ongoing, and some user stories are being created
Poor block production
As we tweak the tx pool to build the best block for the user from valuable, DoS-resistant transactions, we need to ensure there is no performance hit to the client; this uses a lot of CPU because we repeat block production until the CL asks for the payload (see the sketch below)
Late blocks could also cause import challenges and would restart block building
Snapshots could help with concurrency in this case
EIP-4844 could alter this process and requires good performance as well
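To make the CPU concern concrete, a sketch of the repeated-build loop: the builder keeps improving a candidate payload until the CL requests it, and a fixed cadence is one illustrative way to cap how often the work repeats. All names here are hypothetical, not Besu's block-production code.

```java
import java.util.concurrent.atomic.AtomicReference;

public class PayloadBuilder {
  private final AtomicReference<String> bestPayload = new AtomicReference<>("empty-block");
  private volatile boolean stopped;

  /** Background loop: keep rebuilding a better block until the CL asks for it. */
  public void buildLoop() throws InterruptedException {
    while (!stopped) {
      bestPayload.set(buildCandidate()); // CPU-heavy: re-runs tx selection
      Thread.sleep(250); // cadence caps how often the work is repeated
    }
  }

  /** Called when the consensus layer requests the payload. */
  public String getPayload() {
    stopped = true;
    return bestPayload.get();
  }

  private String buildCandidate() {
    return "block-with-best-txs@" + System.nanoTime();
  }
}
```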
I/O and Disk Performance
Besu has problems on slow I/O/disks → Besu generates a lot of I/O
We are not using the flat DB during block processing, so we have to gather a lot of data from disk
Need caching in more areas: read/write caching (see the sketch below)
Doing less work, persisting less to the disk, persisting trie logs but not the worldstate (Amez / Karim)
The first hotspot in Besu is reading data from RocksDB via the RocksDB.get method. This is mainly because we have to fetch most of the worldstate nodes from the Merkle Patricia Trie
Need to identify more areas where I/O contention is commonplace
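A minimal sketch of the read-caching idea: an LRU layer in front of RocksDB.get so hot trie nodes are served from memory instead of disk. The capacity and loader are illustrative; a real cache would need eviction tuned to node access patterns.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

public class NodeCache {
  private final Function<String, byte[]> loadFromDisk; // e.g. wraps RocksDB.get
  private final Map<String, byte[]> lru;

  public NodeCache(int capacity, Function<String, byte[]> loadFromDisk) {
    this.loadFromDisk = loadFromDisk;
    // An access-ordered LinkedHashMap gives simple LRU eviction.
    this.lru = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
        return size() > capacity;
      }
    };
  }

  /** Serve a node from memory, falling back to disk (and caching) on a miss. */
  public synchronized byte[] get(String nodeHash) {
    return lru.computeIfAbsent(nodeHash, loadFromDisk);
  }
}
```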
Trace Performance
Poor performance of tracing of blocks / transactions
Not sure why we are slow
Besu often crashes when tracing a full block
OOM errors
A short timeout can cause issues
Is the DB tuned for tracing?
Will need good performance for any rollup use cases
Solution?: Instead of replaying the traces for each user request, why not save the trace result in a separate database or a separate module, rather than saving only the block and the worldstate for each block (see the sketch below)
Solution?: Split tracing into a separate microservice
Solution?: Separate the query part of Besu into a completely separate process; queries asking Besu questions about the chain do not need to slow down the main flow
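A sketch of the first solution above: compute a block's trace once, keep it in a side store keyed by block hash, and serve later requests from that store. The in-memory map stands in for a real separate database; all names are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class TraceStore {
  private final Map<String, String> store = new ConcurrentHashMap<>();
  private final Function<String, String> replayBlock; // the expensive trace replay

  public TraceStore(Function<String, String> replayBlock) {
    this.replayBlock = replayBlock;
  }

  /** Replay at most once per block; later calls for the same hash are lookups. */
  public String traceFor(String blockHash) {
    return store.computeIfAbsent(blockHash, replayBlock);
  }
}
```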
Sync Performance
Poor syncing performance (still the case with 22.10.0?)
Do we need to verify the proof-of-work blocks on Mainnet?
Useless conversion from bytes to RLP and back during sync
Sometimes we are stuck for some time during a snap sync (needs investigation)
Full sync / Forest performance (also snap sync, for the block-downloading part)
Persisted worldstate changes may be able to help with full sync on Bonsai
For Forest we need to determine which areas offer performance improvements; some of the recent Bonsai improvements could be tweaked to suit the Forest use case (unknown, though)
EVM Performance - Pending Amez Availability and IMAPP Testing
We need analysis that tells us whether the gas cost of each operation corresponds to the algorithmic complexity of Besu's implementation. Bonsai might affect the algorithmic complexity of some operations.
IMAPP testing - we need an overall analysis (Matt has connected with this team for a profile)
We do not have a profile of Besu's EVM performance (work with Danno?)
SLOAD, SSTORE: slowest relative to gas cost?
EVM performance improvements often appear without context and the broader team is unsure of how the optimizations are created. Is there a standard playbook of optimizations that we are running through, or are there EVM specific performance observations that we are reacting to?
Do we know how we can solve these problems?
Automated tests on nightly/CI to detect regressions ASAP (see the benchmark sketch after this list)
More modularity
A tracing solution at the Java level (improved observability)
Using torrents to download blocks during the initial sync (archive)?
Separate the query part of Besu into a completely separate process (as noted under Trace Performance); queries asking Besu questions about the chain do not need to slow down the main flow
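For the nightly/CI regression idea, a sketch using JMH (a standard Java microbenchmark harness): benchmark a hot path on each nightly run and compare the score against the previous run. The benchmarked body is a placeholder, not a real Besu hotspot.

```java
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class HotPathBenchmark {
  @Benchmark
  public long hashLoop() {
    long h = 1125899906842597L;
    for (int i = 0; i < 1_000; i++) {
      h = 31 * h + i; // placeholder for real EVM / storage work
    }
    return h;
  }
}
```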
Process Improvements (Q1?)
Performance Testing
Lack of performance testing, especially on RPC methods
How do we get alerted when there is an actual performance regression?
Automated performance testing for each release; nothing exists at the moment (see the latency probe sketch below)
Hive tests report how long calls take, but only under a small load
Can we separate some RPC methods into separate microservices?
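A sketch of a minimal per-release latency probe: time one JSON-RPC call against a local node and fail the build if it exceeds a budget. The endpoint, the eth_blockNumber method, and the 500 ms budget are illustrative assumptions.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class RpcLatencyProbe {
  public static void main(String[] args) throws Exception {
    HttpClient client = HttpClient.newHttpClient();
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://localhost:8545")) // assumed local RPC endpoint
        .timeout(Duration.ofSeconds(10))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(
            "{\"jsonrpc\":\"2.0\",\"method\":\"eth_blockNumber\",\"params\":[],\"id\":1}"))
        .build();

    long start = System.nanoTime();
    HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
    long elapsedMs = (System.nanoTime() - start) / 1_000_000;

    System.out.printf("eth_blockNumber -> HTTP %d in %d ms%n", response.statusCode(), elapsedMs);
    if (elapsedMs > 500) {
      System.exit(1); // treat as a performance regression in CI
    }
  }
}
```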
Slow Release / Testing Process (CPU, test bounding, process)
Manual release process with a lot of time wasted waiting for builds that could be avoided
Waiting for multiple full builds to complete just because you are merging a PR or changing the version number; these don't need a full build
A full build should only be required when there are code changes
Make it easier/faster to run all the tests locally, or avoid the need to create a draft PR to run tests remotely
Support for many features causes tests to be slow (ETC tests, Quorum tests, etc.) when they are not necessarily needed for certain modifications
Issues with the process
Late-discovered regressions
Need a more comprehensive testing strategy across contributors
Solution: ??
Establish a better process for how to respond to the problems we discover
Good case study: the Sepolia issue over the weekend of Oct 29/30