Data model for source package inventory

We (@pmoser and I, working on the IP compliance toolchain) need a way to uniquely tag the different source files used by the same version/revision of a bitbake recipe across different machines and project releases (e.g. recipe foo/1.0-r0 may use different versions of file://foo.conf depending on the target machine and/or the project release, because different directories are prepended to the FILESPATH in which foo.conf is looked up).

The goal is to generate a single source package (say foo_1.0-r0+sometag) out of a given bb recipe (say foo/1.0-r0) by including all the patch, init script and conf file variants that depend on the target machine and/or the project release, provided that the upstream source tarball is the same (e.g. foo-1.0.tgz downloaded from https://foo.org/downloads/foo-1.0.tgz).

We came up with the following tag tree for tagging such variants in our metadata model, and we would like feedback on it:

<project>/<release>/<distro>/<machine>/<package_arch>/<image>

We are assuming:

  • that a specific yocto project (identified by a certain git repo of a manifest that identifies a set of bb layers) can have multiple releases (each identified by a certain manifest revision);
  • that each release may have multiple distros (each identified by a specific DISTRO configuration);
  • that each distro can be built for multiple MACHINE targets (eg. qemuarm64, raspberrypi4, etc.);
  • that, given a specific <project>/<release>/<distro>/<machine> combination, binary packages for recipe foo/1.0-r0 are built only for a specific package_arch (e.g. aarch64-poky-linux-musl);
  • that, given a specific <project>/<release>/<distro>/<machine>/<package_arch> combination, FILESPATH and/or SRC_URI for recipe foo/1.0-r0 could in theory be modified in different ways by different target image recipes, so that, for example, different foo.conf files could be used in different images (we are not sure about this last assumption).
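
To make the tag tree concrete, here is a minimal sketch of how one tag could be represented in a metadata model. The class name `VariantTag`, its methods, and the example values in the comments are hypothetical and only illustrate the six levels listed above; they are not part of any existing tool.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class VariantTag:
    """One node of the <project>/<release>/<distro>/<machine>/<package_arch>/<image> tree."""
    project: str       # git repo of the manifest
    release: str       # manifest revision
    distro: str        # DISTRO configuration
    machine: str       # MACHINE target, e.g. qemuarm64
    package_arch: str  # e.g. aarch64-poky-linux-musl
    image: str         # target image recipe

    def path(self) -> str:
        """Serialize the tag as a slash-separated path."""
        return "/".join(getattr(self, f.name) for f in fields(self))

    @classmethod
    def parse(cls, text: str) -> "VariantTag":
        """Parse a slash-separated path; all six levels must be present."""
        parts = text.strip("/").split("/")
        if len(parts) != len(fields(cls)):
            raise ValueError(f"expected {len(fields(cls))} levels, got {len(parts)}: {text!r}")
        return cls(*parts)
```

Because the dataclass is frozen, tags are hashable and can be stored in sets or used as dictionary keys when attaching them to files.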

Are our assumptions correct (including the last one)?

Can you explain what is the point of the source package and how that differs from any upstream git repository or upstream source release?

The only thing I can think of as sensible here is to ask bitbake to produce a source archive for a given recipe, in a given project, therefore applying all the overrides and other magic, so that the rest of that IP compliance stack can look at that archive and be oblivious to anything that is bitbake internal logic.

I would like @agherzan to comment on this but I understand his availability in the next few days can be spotty.

The assumptions here look good to me. The last assumption is the most complex one, which is why I would be most comfortable just asking bitbake for a source archive.

Can you explain what is the point of the source package and how that differs from any upstream git repository or upstream source release?

To do Software Composition Analysis (for IP compliance + CVE management), we need source packages to inspect, uniquely identified by name/version, and we need to keep track of changes between different versions of the same source package, in order to audit only things that have changed, and automatically reuse our clearing work for things that did not change.

The problem is that Bitbake does not provide source packages uniquely identified by name/version, but uses layers of recipes (identified by name/version/revision) whose variables can always be overridden by other layers. This may lead to recipes that are apparently identified by the same name/version/revision but that include different sets of source files (typically, different sets of patches applied to the same upstream source archive), or even different versions of the same source file (typically, different versions of a configuration file or of an init script).

Moreover, since new layers may be added to a project over time, we could even expect to have (apparently) the same recipe/version/revision, built for the same target machine, that actually “uses” different sets of source files across different releases of the same project (because, for example, a new layer has been added in a newer release, and that new layer prepends a different directory to FILESPATH for that recipe, causing the same file://foo.conf SRC_URI entry to point to a different file).

While “converting” bitbake recipes into source packages, we would like to avoid generating a different source package archive for each possible output variant of the same recipe, because it would cause a lot of overhead both in machine processing and, more importantly, in the human audit process.

So, as long as the upstream source tarball is the same, we would like to create a single source package out of a given bb recipe that includes all possible variants of patches, init scripts, configuration files etc., but we would like to tag them appropriately, in order to understand when and where they are used. This will be important not only for IP checks but also for CVE checks (e.g. a patch file may fix a CVE, so it’s important to know when and where it’s actually applied).

Right, I understand how this makes analysis a nightmare. I didn’t check that myself but do you know if bitbake has any low-level task to generate the work tree for a package compilation step, that we could use as input to our source-package-generation helper?

that’s one of the reasons why we are not using bitbake’s archiver class to generate source packages (the other one is that it does not work :slight_smile: because it silently fails on a number of packages/recipes, at least in OHOS builds)

I think it is mostly correct and it seems that with the discussion in Release flavours - #8 by zyga we’re basically confirming the last assumption


Just to avoid misunderstandings: all this is due to the peculiarity of this project. We don’t have a single yocto project, or a bunch of independent yocto projects that may occasionally share some components: we have a single project with a significant number of variants and subvariants, and we want to audit everything with fossology (with a little help from our friend Debian :slight_smile: ) in order to be able to populate an SPDX oracle that can be used by downstream users/implementers to generate the SPDX SBOM of their own sub-sub-variant of the project (by taking data from our SPDX oracle, and filtering it to include only the components actually included in their implementation project).


If I understand correctly, the aim here is to find a specific version string that describes enough for that package to be reproduced in a build. Is that the case?


I think so, with the extra goal of capturing differences in behavior or content of the same “upstream release” as modified by layers and build configuration

@agherzan @zyga actually, reproducibility is a related issue but here the focus is on designing a data model to correctly describe all source “variants” generated from the same source recipe, avoiding the generation of multiple source packages to be uploaded to fossology for the same recipe. Let me explain it with some examples.

Variants within the same project release, depending on the machine target

Recipe busybox-1.31.1-r0 contains busybox-1.31.1.tar.bz2 as “common” upstream source tarball, plus a series of patches, config files and init scripts (let’s call all of them “Downstream Patches”). In OHOS v0.1.0, there are 48 “common” Downstream Patches, that are always applied, and 5 Downstream Patches that are used only when building for stm32mp1-av96 target machine. We want to avoid generating two different source packages from the same recipe busybox-1.31.1-r0 (thus avoiding having 2 different source uploads to fossology for the same recipe), so we need to appropriately tag those 5 Downstream Patches in our “alien” source package.
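
As a rough illustration of the single-package-with-tags idea, the sketch below keeps one package object per recipe and records, for every file, the set of tags under which it is used. The class `SourcePackage`, its methods, and any file names passed to it are hypothetical; tags are plain strings here (for instance the slash-separated tag tree path).

```python
from collections import defaultdict


class SourcePackage:
    """One source package per recipe; files carry the tags under which they are used."""

    def __init__(self, recipe: str, upstream_tarball: str):
        self.recipe = recipe
        self.upstream_tarball = upstream_tarball
        self._tags_by_file = defaultdict(set)  # file name -> set of tag strings

    def add_file(self, filename: str, tag: str) -> None:
        """Record that `filename` is used by the build variant identified by `tag`."""
        self._tags_by_file[filename].add(tag)

    def common_files(self, all_tags) -> list:
        """Files used by every known variant (the 'common' Downstream Patches)."""
        wanted = set(all_tags)
        return sorted(f for f, tags in self._tags_by_file.items() if tags >= wanted)

    def specific_files(self, all_tags) -> dict:
        """Files used only by some variants, mapped to the tags that use them."""
        wanted = set(all_tags)
        return {
            f: sorted(tags)
            for f, tags in sorted(self._tags_by_file.items())
            if not tags >= wanted
        }
```

With this layout the busybox case would yield one package in which 48 files are reported by `common_files` and the 5 machine-specific ones show up in `specific_files` with the stm32mp1-av96 tag, so only one upload is needed.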

(An extreme example is xserver-xf86-config, where you have a different xorg.conf source file for every different group of target machines, so you would need to create 6 different source packages from the same recipe; OK, it’s not a recipe that contains actual source code, but it’s an example of what could happen given how bitbake recipes work)

Variants across different releases

This is more annoying. Recipe curl-7.69.1-r0 in OHOS v0.1.0 includes curl-7.69.1.tar.bz2 as upstream source tarball, and 7 Downstream Patches, applied to all target machine builds.

The very same recipe curl-7.69.1-r0 (same version, same revision) in AllScenariOS v0.1.0-41-g934762e (it’s a tag generated with git describe, corresponding to the current development version) contains 9 Downstream Patches, and the 2 new ones (CVE-2021-22876.patch and CVE-2021-22890.patch) are clearly new vulnerability fixes. But recipe version and revision are the same!
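
Detecting this kind of drift is straightforward once each release has a manifest of file names and checksums for the recipe. The following is only a sketch: the function name `diff_manifests` and the manifest layout (a dict mapping file name to checksum) are assumptions, not something our tooling already provides.

```python
def diff_manifests(old: dict, new: dict) -> dict:
    """Compare two {file name: checksum} manifests of the same recipe.

    Returns the files that were added, removed, or whose content changed,
    so that only those need to be audited again.
    """
    old_names, new_names = set(old), set(new)
    return {
        "added": sorted(new_names - old_names),
        "removed": sorted(old_names - new_names),
        "changed": sorted(f for f in old_names & new_names if old[f] != new[f]),
    }
```

For the curl case, comparing the manifest of the older release with the one of the development version would list the two CVE patches under `"added"`, and everything else could reuse the existing clearing work.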

This would never happen with a package manager, and generally it should never happen even without one. But here it does happen, and it’s clearly an upstream problem in the yocto project recipe versioning policy (incidentally, they also changed the recipe description field while keeping the same recipe version and revision; it does no harm, but it clearly indicates that there is no policy preventing modification of a recipe file without updating the recipe version or revision).

So in this case we should find a way to add some tag to recipe version/revision (my idea is to add a short sha1 of the sha1 checksums of included source files, so we have a unique identifier), like for instance curl-7.69.1-r0-srcf330c29a. What do you think?

P.S. I gave just a few examples, but there are many other similar cases that we found in AllScenariOS builds

I think this is what we’ll need to do. We should take into account how hashes work: the order in which you feed the files to the hash matters. So, depending on how you collect the files to hash (and maybe depending on the system), you might get different hashes for the same set of files. To avoid that, you need to define a reproducible way to create the hash input (an archive? your source package?).
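
One simple way to get an order-independent identifier is to sort the per-file sha1 checksums before hashing them together. This is a minimal sketch under that assumption; the function names `file_sha1`, `source_set_id` and `tagged_version`, the newline separator, and the 8-character length are illustrative choices, not an agreed format.

```python
import hashlib


def file_sha1(data: bytes) -> str:
    """sha1 hex digest of one source file's content."""
    return hashlib.sha1(data).hexdigest()


def source_set_id(checksums, length: int = 8) -> str:
    """Short sha1 over the sorted per-file sha1 checksums.

    Sorting makes the result independent of the order in which files
    were discovered (directory listing order, SRC_URI order, etc.).
    """
    h = hashlib.sha1()
    for checksum in sorted(checksums):
        h.update(checksum.encode("ascii"))
        h.update(b"\n")  # separator, so checksum boundaries are unambiguous
    return h.hexdigest()[:length]


def tagged_version(recipe: str, checksums) -> str:
    """Build an identifier such as <recipe>-src<short hash>."""
    return f"{recipe}-src{source_set_id(checksums)}"
```

Note that this hashes file contents only: renaming a file without changing it leaves the identifier unchanged. If renames should count as a change, the file's relative path would have to be fed into the hash as well.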