Files from host end up in output tarball #26

Closed
opened 2022-05-10 14:16:50 +00:00 by zmanji · 8 comments

Using the latest mmdebstrap in unshare mode as a non-root user, with a tarball as the output, I observed two files in the final tarball:

  • /etc/resolv.conf
  • /etc/hostname

These files are copies of files from the host and therefore make the mmdebstrap command not reproducible when run on different hosts.

To observe this, run the following command on two different hosts:

SOURCE_DATE_EPOCH=0 mmdebstrap --variant=essential --mode=unshare bullseye bullseye.tar 'deb https://debian.notset.fr/snapshot/archive/debian/20220506T205402Z/ bullseye main'

The checksums of the two tarballs will be different. Diffing the tarballs with diffoscope shows that the only differences between them are the two files listed above.
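A quick way to check whether a given tarball carries the two host-derived files is to list its members and filter for them. The following is only a sketch: `list_host_files` is a hypothetical helper name, the demo tarball is a small stand-in built locally rather than real mmdebstrap output, and the pattern allows for member names with or without a leading `./`.

```shell
# Hypothetical helper: print the host-derived files found in a tarball.
# The pattern accepts member names with or without a leading "./".
list_host_files() {
    tar -tf "$1" | grep -E '^(\./)?etc/(resolv\.conf|hostname)$'
}

# Stand-in tarball for demonstration (not produced by mmdebstrap).
demo=$(mktemp -d)
mkdir -p "$demo/root/etc"
echo "build-host" > "$demo/root/etc/hostname"
echo "nameserver 192.0.2.1" > "$demo/root/etc/resolv.conf"
echo "ID=debian" > "$demo/root/etc/os-release"
tar -C "$demo/root" -cf "$demo/demo.tar" .

# Collect and print the matching member names.
found=$(list_host_files "$demo/demo.tar" | sort)
printf '%s\n' "$found"
```

Running the same helper on a real `bullseye.tar` from each host would show which of the two files are present before comparing checksums.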

Can mmdebstrap remove the files copied in?

Owner

Your observation is absolutely correct. The reason /etc/resolv.conf and /etc/hostname are copied in is that debootstrap does the same, so I think that's what users expect. You are correct that this results in different tarballs when mmdebstrap is run on different systems. The scenarios I see are these:

  1. either you run mmdebstrap casually as a normal user on your own computer to create throwaway chroots or build chroots or the like -- in that case it's okay that the output is not the same as on another computer because you never compare your own chroot with those created on another host

  2. or you run mmdebstrap as part of a bigger project like a script that builds a bootable system image and those are supposed to be bit-by-bit reproducible no matter who runs the script. In that case, the script that runs mmdebstrap can easily add --customize-hook 'rm "$1"/etc/resolv.conf' to the mmdebstrap invocation.
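For scenario 2, a calling script might remove both host-derived files in one customize hook. This is a minimal sketch, not an official recipe: `build_without_host_files` is a hypothetical wrapper name, `rm -f` is used so that a missing file is not an error, and the dry run at the end assumes the hook is executed through `sh -c` with the chroot directory as its first argument.

```shell
# Hook body; the chroot directory is expected in "$1".
hook='rm -f "$1/etc/resolv.conf" "$1/etc/hostname"'

# Hypothetical wrapper: $1 = output tarball, $2 = apt sources line.
# Defined here but not called, since it needs mmdebstrap and network access.
build_without_host_files() {
    SOURCE_DATE_EPOCH=0 mmdebstrap --variant=essential --mode=unshare \
        --customize-hook="$hook" \
        bullseye "$1" "$2"
}

# Dry run of the hook body against a scratch directory standing in for a chroot.
scratch=$(mktemp -d)
mkdir -p "$scratch/etc"
touch "$scratch/etc/resolv.conf" "$scratch/etc/hostname" "$scratch/etc/passwd"
sh -c "$hook" exec "$scratch"
ls "$scratch/etc"
```

Keeping the hook body in a variable makes it possible to exercise it on a scratch directory before using it in a real build.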

Do you see another scenario that I haven't thought of?

Author

I do not see another case; I filed this bug about scenario 2.

I don't think there needs to be a code fix. Updating the manpage to cover this scenario would be perfect.

Currently, I see two confusing statements on the manpage:

Remove all files that make the result unreproducible, like apt and dpkg logs and caches or /etc/machine-id and /var/lib/dbus/machine-id. This can be disabled using --skip=cleanup/reproducible

"SOURCE_DATE_EPOCH"
By setting "SOURCE_DATE_EPOCH" the result will be reproducible over multiple runs with the same options and mirror content

Perhaps an adequate resolution would be to update the manpage to state that reproducibility applies to running the command on the same host, not across hosts.

Also documenting that these files are copied in and would affect reproducibility would be great.

Owner

Ah okay, I understand. Yes, I'll add some more docs to explain this. Thanks for pointing this out!

josch closed this issue 2022-05-11 08:48:28 +00:00

@josch could you rethink that, maybe implementing it under a flag (like --clean-host-files)? My use case for mmdebstrap is installing packages in "Distroless" Docker images, which are based on Debian (ATM Bullseye).

Currently, I need to do something like this...

ARG RELEASE=sid
FROM debian:$RELEASE-slim AS base
ARG RELEASE=sid
ARG PACKAGES=apache2,libapache2-mod-php

COPY dpkg-excludes /dpkg-excludes
RUN apt-get update; \
        apt-get -y install --no-install-recommends mmdebstrap; \
        mmdebstrap \
        --variant=extract \
        --dpkgopt=/dpkg-excludes \
+       --setup-hook='find dpkg -not -type d >dpkg-files.txt' \
        --include=vim $RELEASE dpkg; \
+        while read file; do \
+                if [ -f "$file" ]; then \
+                        rm -f "$file"; \
+                fi; \
+        done <dpkg-files.txt; \
+        find dpkg -type d -empty -delete

FROM gcr.io/distroless/static
COPY --from=0 /dpkg/ /
Owner

Hi @markkrj, thanks for your input! Just to confirm, you are not calling mmdebstrap manually in a real terminal but you are calling it as part of a Dockerfile, correct?

Owner

Hi @markkrj, my argument is: if you need the functionality of a hypothetical --clean-host-files option in a situation where you call mmdebstrap from a script, then you don't need that option because you can just use hooks to do whatever you need to accomplish. Agreed?


Hi @markkrj, thanks for your input! Just to confirm, you are not calling mmdebstrap manually in a real terminal but you are calling it as part of a Dockerfile, correct?

Yes.

you can just use hooks to do whatever you need to accomplish. Agreed?

@josch Yes, we can work around it in the script, but then we end up cluttering the script/Dockerfile with otherwise unneeded things. I'll understand if you keep this as won't-fix; I just wanted to show another use case for removing host files from the chroot. Maybe you could remove host files just for the "extract" variant, as I don't see a reason for extract to have them either.

Owner

Hi @markkrj, thank you for your update. Yes, adding hooks to your Dockerfile is an additional cost on your end. The problem is that by adding a --clean-host-files option you are shifting the cost from your end to every mmdebstrap user. Every additional option means that we must document it, which grows the already very large manual page and makes it more confusing. It also means that we have to define what "host files" actually means. What about people who want to delete some of them but not others? Now we have an option that takes multiple arguments, making it even more complex. And what if somebody wants to delete these files only in certain circumstances? It probably makes sense to delete them only if SOURCE_DATE_EPOCH is set, because otherwise the output wouldn't be reproducible anyway. But what if somebody comes along with a use case that would make it useful even without SOURCE_DATE_EPOCH? One needs to be very careful when adding new command line options, because those options are an interface: you essentially can never remove or change them after introducing them without breaking other people's code. Options with a use case as small as --clean-host-files remind me of this:

open office toolbars (image: https://upload.wikimedia.org/wikipedia/commons/5/52/OpenOfficeToolbars.png)

Or of man pages like this one: https://manpages.debian.org/unstable/parallel/parallel.1.en.html

Creating a chroot has a lot of moving parts, and everybody has some very specific requirements for their chroot creation tool. The advantage for mmdebstrap is that most people with very specific requirements run it from their own set of scripts. This means that instead of herding a large collection of very specific command line options, we can just provide a hook mechanism that allows them to do whatever their specific use case requires.

So in essence, adding new command line options is not free because every new option adds a cost for every user of mmdebstrap including the long-term maintenance of the option. I'll only add new options as a short-hand for something that can be done with hooks if it is clear that the option will be used by people running mmdebstrap from the CLI and thus there is a requirement for something short to avoid a lot of typing.

This is not the case for you. What I can offer you though is for mmdebstrap to ship a hook script that specifically deletes all files from the chroot that were copied in from the host, essentially containing a single line:

rm "$1/etc/resolv.conf" "$1/etc/hostname"

Then you could run mmdebstrap with:

--hook-dir=/usr/share/mmdebstrap/hooks/delete-host-files

But oh, this line is even longer than just the manual rm, and from the code you posted above I see that this is not quite what you need, which precisely proves my point. It is very hard to come up with a mini-option like --clean-host-files that does exactly what the user wants. From the code you posted, it looks to me that even if there were a --clean-host-files option, it wouldn't do exactly what you want and you wouldn't use it in the end anyway.
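For illustration, a hook directory along the lines described above could look roughly like this. It is only a sketch under assumptions: the directory is created in a temporary location rather than under /usr/share/mmdebstrap/hooks, the script name `customize00.sh` is chosen here on the understanding that scripts in a hook directory are selected by filename prefix, and `rm -f` is used instead of plain `rm` so that a missing file is tolerated.

```shell
# Build a hypothetical hook directory containing one customize script.
hookdir=$(mktemp -d)/delete-host-files
mkdir -p "$hookdir"
cat > "$hookdir/customize00.sh" <<'EOF'
#!/bin/sh
# Remove the files copied in from the host; "$1" is the chroot directory.
set -eu
rm -f "$1/etc/resolv.conf" "$1/etc/hostname"
EOF
chmod +x "$hookdir/customize00.sh"

# Intended use (not executed here): mmdebstrap --hook-dir="$hookdir" ...

# Exercise the script against a scratch directory standing in for a chroot.
root3=$(mktemp -d)
mkdir -p "$root3/etc"
touch "$root3/etc/resolv.conf" "$root3/etc/hostname" "$root3/etc/fstab"
sh "$hookdir/customize00.sh" "$root3"
ls "$root3/etc"
```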

Reference: josch/mmdebstrap#26