IP Compliance Pipeline design (2021-08-20 meeting)
a. The compliance pipeline needs to build all available flavours/distros, machines and images (and maybe also variants, in the near future), to extract metadata and source files (through the Tinfoil bb library) and aggregate them before uploading them to Fossology, in order to reduce complexity and ease the audit team’s work (more details in this thread). This, plus the fact that automated scanners (both ScanCode and Fossology’s integrated scanners) require a lot of machine time to complete, entails that the compliance pipeline regularly consumes a lot of machine time (~30h for a complete scan from scratch, ~4-5h for an average scan when only some new components have been added to the project).
b. The compliance process is asynchronous: automated scanner results need to be validated in Fossology by the audit team, and the final results are collected by the compliance pipeline at a later stage.
a+b above entail that the compliance pipeline cannot be run too frequently: not only to avoid machine overloading (which may be solved by process optimization or by adding more horsepower), but also to avoid audit team overloading (which is not solvable by simply adding more resources; in any case, adding human resources would take some months for a number of reasons, so it could happen only after the Jasmine release).
So assessing the correct timing for scheduling the compliance pipeline is key.
Also, relying only on a periodic schedule (e.g. every 2 weeks) and/or on developers manually triggering the compliance pipeline whenever they find it appropriate is not a viable solution: we would risk both having the pipeline not triggered when it is really needed (i.e. when new software components that do require an audit are added) and having it triggered when it is not needed (i.e. when only trivial modifications have been made).
So we need to define an optimal strategy to automatically trigger the compliance pipeline.
Proposed Solution: Split the Pipeline in Two Parts
After some discussion, we came up with the following possible solution.
There should be a first part of the pipeline that just builds all images (leveraging the existing private SSTATE cache, if possible) and runs only aliens4friends’ debian matcher tool on the upstream source of each new software package/recipe (we do not care about yocto patches at this stage). If there is a good debian match (e.g. > 80%), or if the package is included in a whitelist, there is no need to proceed further: the new package will be scanned at a later stage, e.g. via a periodically scheduled pipeline every 2 weeks. If there is a bad (or no) debian match, or if the package is included in a blacklist, the second part of the pipeline (the “real” compliance pipeline) will be triggered. This first pipeline should be triggered at every commit on the meta-ohos develop branch, as well as on every MR into the develop branch, and should provide artifacts (a json file?) that developers can read to assess which component changes have been introduced by their commit(s) and to check whether the second (the “real”) compliance pipeline has been triggered and why.
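The trigger decision described above could be sketched roughly as follows. This is only an illustration: the function name, the exact threshold, and the whitelist/blacklist representation are assumptions, not the real aliens4friends API.

```python
# Hypothetical sketch of the first pipeline's "do we need a real scan?" logic.
# Names and threshold are assumptions for illustration only.

MATCH_THRESHOLD = 80.0  # a debian match above this is considered "good"

def needs_full_compliance_scan(package, match_score, whitelist, blacklist):
    """Decide whether a new package must trigger the second ("real") pipeline.

    package     -- recipe/package name
    match_score -- debian matcher score in percent, or None if no match found
    """
    if package in blacklist:
        return True                        # blacklisted: always audit
    if package in whitelist:
        return False                       # whitelisted: periodic scan is enough
    if match_score is None:
        return True                        # no debian match at all
    return match_score <= MATCH_THRESHOLD  # bad match: run the real pipeline
```

The same predicate could be evaluated per package and the results collected into the json artifact mentioned above, so developers can see why the second pipeline was (or was not) triggered.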
internal note: this is a good reason to keep both the “old” debian matcher and the “new” debian snapshot matcher in a4f: the first one (less accurate, not reproducible, but faster) could be used in the first part of the pipeline, while the second one (more accurate, always reproducible, but significantly slower) could be used in the second part of the pipeline.
The second pipeline (the “real” one) will perform all the steps of aliens4friends’ workflow, and will be triggered:
- by the first pipeline, but also
- by a periodic scheduler and
- manually by developers when needed (great power, great responsibility: use with care!)
There will also be another pipeline, running periodically (e.g. nightly), that will perform just the final 2 steps of aliens4friends’ workflow (harvest), in order to update the json stats for the dashboard and therefore to monitor audit work progress on Fossology.
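To make the dashboard idea concrete, the nightly stats could look something like the sketch below. The field names and the input shape are purely hypothetical; the real harvest output format is defined by aliens4friends, not here.

```python
# Hypothetical aggregation of audit progress for the dashboard json stats.
# Input/output field names are assumptions, not the real a4f harvest format.

def audit_progress_stats(packages):
    """packages: list of dicts, each with an 'audit_done' boolean."""
    total = len(packages)
    audited = sum(1 for p in packages if p.get("audit_done"))
    return {
        "total_packages": total,
        "audited_packages": audited,
        "progress_percent": round(100.0 * audited / total, 1) if total else 0.0,
    }
```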
As Andrei correctly pointed out, all parts of the pipeline would require some sort of “environment hashing” (modeled on SSTATE) in order to have unique identifiers of what we are checking, scanning and/or auditing. This topic will be further explored in a meeting with Andrei.
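As a starting point for that meeting, “environment hashing” could be as simple as a deterministic digest over the inputs that define a scan/audit unit, loosely modeled on Yocto’s SSTATE signatures. The fields chosen here (recipe name, version, source and patch checksums) are assumptions to be refined with Andrei.

```python
# Minimal sketch of an "environment hash": a stable identifier for what we
# are checking/scanning/auditing. Field choices are assumptions, loosely
# modeled on Yocto SSTATE signatures.
import hashlib
import json

def environment_hash(recipe_name, recipe_version, src_uri_checksums, patches):
    """Return a short, deterministic id for a scan/audit unit."""
    payload = {
        "recipe": recipe_name,
        "version": recipe_version,
        "sources": sorted(src_uri_checksums),  # upstream tarball checksums
        "patches": sorted(patches),            # yocto patch checksums
    }
    # json.dumps with sort_keys gives a canonical byte representation
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:16]
```

The point of the hash is that two runs over the same inputs get the same identifier, so audit results can be safely reused instead of re-scanning.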
We will define a development roadmap next week, together with NOI Techpark.