Can we push results somewhere so we can graph them? That is tricky to do with CI, and unstable unless we have our own CI builder, but it would be useful for tracking progress or flaky tests.
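Rough idea of what storing results as artifacts could look like (the job name and the report-conversion script are placeholders; only the artifacts/reports mechanism is stock GitLab CI):

```yaml
# Hypothetical job: converts meson's test log into JUnit XML and keeps it
# as an artifact, so results are stored per pipeline and can be fetched
# later (e.g. via the job artifacts API) for graphing.
test-x11:
  script:
    - meson test -C _build --print-errorlogs
    # assumed helper script, name is a placeholder
    - ./testlog-to-junit.py _build/meson-logs/testlog.json > report.xml
  artifacts:
    when: always
    paths:
      - _build/meson-logs/
    reports:
      junit: report.xml
```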
Tests are still hard to write; missing infrastructure: input behaviour; windowing systems (e.g. clipboard); rendering backends (vulkan/gl/cairo)
It’s interesting to ask what data you are after here. Given that we want an always-green master, the biggest data point, which would be test failures over time, is out of the picture, since there aren’t supposed to be any. This of course assumes that flaky tests are either ignored or fixed, and that the CI run is always consistent and provides the same results for the same commit and environment.
Once you take the above into account, historic results become much less interesting. Then you can measure the performance of benchmarks/tests. This has a couple of gotchas, though: the environment might change, say an optimization in graphene or a regression in another part of the stack. You can’t really keep a static environment in the long term. This might be manageable for the part of the stack that gtk “controls” directly, like glib, graphene, pango etc., but everything will need to be reset the moment a change in something like pixman occurs.

This will also need a dedicated runner for the benchmarks in order to have consistent results. It will likely need to be a real hardware machine, not a cloud VM, and only be able to run one job at a time. It will also most likely need to be frozen in time, no kernel updates for example, which would impact the results and mess with the history.
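To make that concrete, such a dedicated benchmark runner would look roughly like this in gitlab-runner’s config.toml; the values are illustrative, not a tested setup, and url/token are omitted:

```toml
# One bare-metal machine, one job at a time, no container/VM indirection.
concurrent = 1                    # global: never run more than one job

[[runners]]
  name = "gtk-benchmark-runner"   # hypothetical name
  executor = "shell"              # run directly on the host hardware
  limit = 1                       # this runner also takes only one job
  # benchmark jobs would opt in via a tag assigned when registering the runner
```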
The Flatpak runtime already ships valgrind and strace in the .Sdk extension. The CI images currently don’t contain .Sdk.Debug though, because the CI runners were timing out frequently, but this can be solved if we throw a machine with a better connection at it. We could certainly include more tools in the .Sdk as well; I’ve had my eye on asan for a while now.
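For asan, an extra CI job along these lines would probably be enough, assuming the image has libasan; the build directory name is arbitrary:

```sh
# Build and run the test suite with AddressSanitizer enabled via meson's
# built-in options (b_lundef=false is usually needed when sanitizing).
meson setup _build-asan -Db_sanitize=address -Db_lundef=false
meson compile -C _build-asan
meson test -C _build-asan --print-errorlogs
```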
For the fedora/debian/w/e images it should be straightforward to add the tools as well.
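For example, for the fedora image it would mostly be a matter of adding the packages to the image definition; the exact package list here is just a guess:

```dockerfile
# Illustrative addition to a Fedora-based CI image.
FROM fedora:latest
RUN dnf install -y valgrind strace gdb libasan libubsan && \
    dnf clean all
```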
Not because I think those results are the ultimate goal that we should look for, but because those results are vastly better than what we have today: nothing.
And once we have those numbers stored somewhere, we can look at them later and do stuff with them. For example, somebody wanting to build a web page to look at those numbers would have actual numbers to look at, instead of just coding an empty website. And somebody wanting to do benchmarks might actually write a few, just because those numbers get recorded.
And I’m also not sure that those numbers don’t provide any useful information, even with all the problems you listed. You can for example take the mean/average of the last 10 or so runs - that gets fluctuations under control. Or you can compare the runtime of a single test against the other tests instead of just looking at wall-clock time; that gives you an indication of whether a test got slower or faster, independent of the hardware or load of the test runner.
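As a toy sketch of those two normalizations (not existing tooling, just to illustrate the idea):

```python
# Each run is assumed to be a dict of test name -> runtime in seconds,
# e.g. {"gtk/label": 0.8, "gtk/grid": 2.4, ...}; the names are made up.

def rolling_mean(runs, last_n=10):
    """Average each test's runtime over the last N runs to damp fluctuations."""
    recent = runs[-last_n:]
    tests = recent[-1].keys()
    return {t: sum(r[t] for r in recent) / len(recent) for t in tests}

def relative_runtime(run):
    """Express each test as a fraction of the run's total time, so a test
    that got slower stands out even on a slow or loaded runner."""
    total = sum(run.values())
    return {t: seconds / total for t, seconds in run.items()}
```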
And then you can still go and bisect-rerun the test locally to figure out what was up with the questionable commits and which one is the problem.
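That would just be the usual git-bisect dance; the good/bad revisions and the test name here are placeholders:

```sh
git bisect start
git bisect bad HEAD
git bisect good 4.10.0   # placeholder for the last known-good tag/commit
git bisect run sh -c 'meson compile -C _build && meson test -C _build suspicious-test'
git bisect reset
```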
But without any numbers, you will neither notice problems nor have any way to pin them down.