I’d like to temporarily revert the compiler to gcc.
After seeing crashes under qemu on one of our OTC CI systems, which recently broke ACTS CI, I’d like to revert meta-ohos-core: toolchain: Switch defaults to use clang, compiler-rt and libc++ (!138) · Merge requests · OSTC / OHOS / meta-ohos · GitLab.
I’ve prepared a local revert in Debug qemu boot failure (!57) · Merge requests · OSTC / OHOS / components / staging / xts_acts · GitLab to see if the issue we’ve been experiencing goes away, and the results are clear.
What is unclear is exactly which hardware is affected. We know our CI runs on a variety of systems, possibly even migrating from one CPU to another in the larger installation that is beyond our control. What we know is that on contemporary laptops it does not seem to be failing.
My current plan is:
- revert compiler to gcc in meta-ohos
- move spread-based CI from acts to manifest and apply it to manifest, meta-ohos and lastly, acts itself.
- open a PR against meta-ohos that re-introduces clang, allowing us to tweak things to see what makes it pass or fail
When you say “migrating from one CPU to another”, do you mean starting on one class of emulated CPU and ending up on another?
No, I mean that clouds typically migrate VMs amongst a group of CPUs/hosts, hopefully identical, but we don’t really know that.
First of all, I agree with the revert, but in parallel, let’s start using testimage for the qemu images. This will guard us against basic functionality breakage and we can keep evolving that as we continue developing. Start small with the TEST_SUITES (so that it is appropriate for MR CI checks). Once we have that, we can be less conservative with a CI job that runs daily/weekly.
Can you help by showing me how to do that? I could search through the docs but if you can just show me, we’ll get that up in a moment.
While we don’t know exactly what is wrong, it is likely this is related to the instructions emitted by clang. In the boot log of the failed clang-based CI run we can see:
[ 14.669252] traps: systemd-udevd trap invalid opcode ip:7fe963ce9f67 sp:7ffc7457b418 error:0 in libc.so[7fe963ca3000+6a000]
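To make a line like this easier to act on, here is a minimal Python sketch that pulls the fields out of the trap message and computes where the faulting instruction sits relative to the start of the reported libc.so mapping. The helper name `parse_trap` and the regular expression are my own, written against the single log line above, so treat the pattern as an assumption rather than a general parser for kernel trap messages. Also note that the offset is relative to the mapping the kernel reports, which is not necessarily the same as an offset into the file on disk.

```python
import re

# Pattern written against the one log line quoted in this thread (an assumption,
# not a complete grammar for kernel "traps:" messages).
TRAP_RE = re.compile(
    r"traps: (?P<comm>\S+) trap (?P<trap>.+?) "
    r"ip:(?P<ip>[0-9a-f]+) sp:(?P<sp>[0-9a-f]+) error:(?P<error>[0-9a-f]+)"
    r" in (?P<module>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_trap(line):
    """Return the trap details as a dict, or None if the line does not match."""
    m = TRAP_RE.search(line)
    if m is None:
        return None
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    size = int(m["size"], 16)
    offset = ip - base  # distance from the start of the reported mapping
    return {
        "process": m["comm"],
        "trap": m["trap"],
        "module": m["module"],
        "offset": offset,
        "in_mapping": 0 <= offset < size,
    }


line = (
    "[ 14.669252] traps: systemd-udevd trap invalid opcode "
    "ip:7fe963ce9f67 sp:7ffc7457b418 error:0 in libc.so[7fe963ca3000+6a000]"
)
info = parse_trap(line)
print(f"{info['process']}: {info['trap']} in {info['module']} at +{info['offset']:#x}")
# prints "systemd-udevd: invalid opcode in libc.so at +0x46f67"
```

With that offset in hand, disassembling the clang-built libc.so around the corresponding address should show which instruction the CI host's CPU rejected.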
The revert has since landed and we’ve made progress on debugging infrastructure to help us understand what the problem was.
I’ll follow up with changes to xts-acts to allow us to explore flipping the container back and getting an interactive shell to debug the problem.