Testing Taskforce Brainstorming
Context
Its dawned on us that Besu now represents upward of ~$5B on Mainnet in stake !! Great job for everyone https://supermajority.info/
This is a big responsibility, not just on Mainnet, but other chains that use Besu too. We are representing more and more value and reputation for a ton of organizations. The Consensys team is thinking on how we may be more careful in the release process (specifically regressions). As we approach the Cancun fork, is it OK to delay less-tested PRs on Main to avoid releasing any issues (especially in 24.1.2)? The problem is, users may not be able to safely downgrade after the fork without a manual intervention / new release. Is anything going into this next release that may be less tested? or a default that should be hidden behind a feature flag? Additionally - I wanted to crowd source ideation about regressions. We have had a handful of interventions on releases recently. Instead of just entirely slowing down releases, any ideas from maintainers? The Consensys team is scheduling a sprint to look at testing gaps and improvements that can be made in the codebase. If anyone would like to help in this, let me know! Thanks all!
Ideation
integrate Hive testing into CI/CD pipeline.
We should measure code coverage and make it a condition for merging a PR
benchmark metrics - sync time, block import time etc and test PRs for regression
some of these metrics (sync time, peering) may require long-running tests so may not be practical to run on individual PRs - could be on a nightly cadence
Ensure test coverage for key features eg bonsai
tests that run on mainnet config aren't testing upcoming hard fork features - to avoid finding bugs at the last minute, change this to holesky/sepolia (or make it configurable?)
burn-in on existing (synced) nodes vs syncing from scratch
backward sync is triggered when you upgrade
it would be awesome to make the (Consensys internal) burn-in process less manual.
don't want to make the release process more painful
What will be the outcomes of this epic?
Phase 1 (Analysis)
Aiming to complete this by end of March.
List of recommendations / Prioritized list of gaps to be addressed
An Epic/workstream is created with actionable item based on the findings from the brainstorm
Would be nice to have an idea of the relative amount of work involved so we can compare cost to benefit.
Phase 2 (Solutions)
Start solving the problems identified. May need different resources at this point.
What is the problem we’re trying to solve?
Why has this come up now?
More money is at stake
Besu profile is rising
Besu gaining market share
Greater reputational risk
What would be catastrophic?
Consensus issue
Sync issue
Peering issue (eg Besu nodes lose 100% of peers and drop off the network)
Chain fork
Node database corruption or significant downtime (quite impactful at scale)
Missing or empty block production
Anything with customer funds or keys (includes Web3signer)
Need to verify that our bug prioritization matrix aligns with this.
Bad (but not catastrophic)
Anything here could escalate to catastrophic if the scale was big enough.
Missed attestations
Regression in block processing time
Regression in sync times or performance
Bad UX (nodes are a pain to manage)
Substantial drop in performance
Block export/import broken
Goals
No P1 bugs released
No P2 bugs released
One P2 bug per quarter is tolerable (but our goal should be zero)
We have P2 bugs identified, in the backlog - how do we make sure we don’t release them
Ditto bugs that haven’t been prioritized (unknown - maybe they are P2 or even P1)
Robust processes
we know that PRs have sufficient automated tests and/or require evidence of manual tests where this makes sense
Don't want to make the release process more painful
But could make the release process more robust/comprehensive if doing so doesn’t add hassle/pain to the process?
Ensure we’re spending effort where it makes sense
Opportunities in our (current) process
Bug prioritization
Triage matrix - review
PRs
Feature toggles - new code paths are disabled by default?
Note this requires some extra effort to leverage DI (Dagger)
Merge to main
Nightly
Nightly build could publish to a "nightly" or "develop" version so we can point nightly canaries to it instead of relying on remembering to update them. https://github.com/hyperledger/besu/issues/6426
Confirm that automation around deployment actually deploys a new build each night, should be confirmed by self-reporting the git sha at startup. https://github.com/hyperledger/besu/issues/6637
Canaries updated with nightly build
Pre-release
Enterprise-oriented testing - Incorporating Matt Whitehead's suggestions
https://lf-hyperledger.atlassian.net/wiki/display/BESU/Testing
And ensuring enterprise PRs don't leak bugs into main
Maybe use PEEPS https://github.com/Consensys/PEEPS as a reference but probably makes sense to start with a new repo - might be easier to start from Besu AT framework than the PEEPS one. The PEEPS framework runs everything in Docker so could be slow and the DSL is quite cumbersome to use as everything is very generic so it works with both Quorum, Besu and EthSigner.
Bring this on board later? Part of burn-in or a separate ad-hoc process?
A big part of the release process is regression testing - but it’s very manual
burn-in on existing (synced) nodes vs syncing from scratch
it would be awesome to make the (Consensys internal) burn-in process less manual.
Post-release
Retrospective to fold new learnings back into our processes and automation.
Gaps
Tickets in the epic
See Testing Gaps Epic for the stack ranked issues. (For those without zenhub licences, the epic is here in Github but it won't show all the detail.)
Hive testing - can we integrate Hive testing into CI/CD pipeline? Or at least an intermediate step (less effort) that will enable us to easily view new failures
Smoke test: test besu starts up (with default options?)
Context: our tests didn’t pick up that the data directory wasn’t being created automatically
Simon: Downgrade test - would be good to have a regression test for each release to check we can downgrade to previous version(s).
Automate burn-in
Not much in the way of “initial sync” tests at the IT/AT level. Should review what hive provides for this. https://github.com/hyperledger/besu/issues/6652
Decouple ATs from mainnet.json
most acceptance tests use mainnet.json, so bugs on adding a new fork (eg cancun) may not get covered until much later in the SDLC.
to avoid finding bugs at the last minute, change this to holesky/sepolia (or make it configurable?)
Fuzz testing https://github.com/hyperledger/besu/issues/6655
Fuzz testing on critical areas e.g. Bonsai trielogs - something like tx-fuzz
Code coverage Aggregation https://github.com/hyperledger/besu/issues/6650
We should measure code coverage
and consider make it a condition for merging a PR?
Snap gaps!
Besu serving besu https://github.com/hyperledger/besu/issues/6656
Current ATs test (previous default) forest storage format - need to update to bonsai
reference tests have been updated already
Can we fail faster with local and CI builds - improve devx and reduce context switching
Flaky tests are a major barrier here. People will run more tests locally when they are more reliable, and don’t require multiple runs for confidence.
Does GHA have metrics for flaky tests / see it manually
Integration tests - are very out of date
Update or remove - https://github.com/hyperledger/besu/issues/6739
Covered elsewhere
Is there any coverage of different solidity versions?
No there isn’t.
This would be a very narrow gap - besu EVM is thoroughly tested by reference tests. Anything falling into this gap would likely be a bug in solidity, and caught elsewhere.
Something like Pact contract tests for engine API for various CLs?? Again, might be covered by Hive.
Think this is covered by Hive tests
Devwatch / operational oversight / dogfooding
See DWIP separate effort for improving devwatch itself
Is observability a quality issue?
QA / devops / holesky/sepolia validators