Ports and Compiled Python Packages

Preface

CPython's various installation mechanisms, as well as the CPython interpreter itself, create byte-compiled copies (.pyc and, prior to Python 3.5, sometimes .pyo) of a Python file at installation or run time. The interpreter uses these byte-compiled files primarily to speed up loading and importing modules when a program executes Python code.

Trouble in Userland

Right now, both package installers, the former pkg_install and the current pkg, as well as the ports tree's make install and make deinstall mechanisms install Python packages including the byte-compiled files.

At packaging and installation time, the files are picked up by the fake-pkg targets, a checksum for them is created and written into ${PKG_DBDIR}/${PKGNAME}/+CONTENTS (pkg_install) or the package checksum file (pkg).

When a byte-compiled version of the original .py file is created, CPython stores the file's st_mtime and a magic number within the byte-compiled file and recompiles those files on mismatches of the .py file's st_mtime or the interpreter's magic number.
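As a minimal sketch of what that stored metadata looks like, the following reads the 16-byte header that CPython 3.7 and later write at the start of a timestamp-based .pyc file: the magic number, a flags field, the source st_mtime and the source size. The helper name read_pyc_header and the file name example.py are illustrative only.

```python
import importlib.util
import os
import py_compile
import struct
import tempfile


def read_pyc_header(pyc_path):
    """Return (magic, flags, mtime, size) from a CPython 3.7+ .pyc header."""
    with open(pyc_path, "rb") as fh:
        header = fh.read(16)
    magic = header[:4]
    # Three little-endian 32-bit fields follow the 4-byte magic number.
    flags, mtime, size = struct.unpack("<III", header[4:16])
    return magic, flags, mtime, size


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "example.py")  # hypothetical module
    with open(src, "w") as fh:
        fh.write("answer = 42\n")
    pyc = py_compile.compile(
        src,
        cfile=os.path.join(tmp, "example.pyc"),
        # Force the classic timestamp mode regardless of the environment.
        invalidation_mode=py_compile.PycInvalidationMode.TIMESTAMP,
    )
    magic, flags, mtime, size = read_pyc_header(pyc)
    source_stat = os.stat(src)

print("magic matches interpreter:", magic == importlib.util.MAGIC_NUMBER)
print("stored mtime matches source:", mtime == int(source_stat.st_mtime) & 0xFFFFFFFF)
```

If either comparison fails at import time, the interpreter treats the .pyc as stale and attempts to rewrite it.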

Recompiling those files leads to checksum mismatches for the package management tools, which can cause problems when auditing secured environments or for package build systems that check package integrity. If post-install hooks cause a recompile (because a file's st_mtime or the installed Python interpreter's magic number on the target system may differ from the system on which the packages were generated), checksum verification will also fail for every package that installs Python modules.
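The effect can be reproduced outside the package tools. The sketch below simulates a recompile by calling py_compile twice on the same unchanged source with two different st_mtime values and comparing SHA-256 digests. The digest algorithm and the two timestamps are arbitrary choices for illustration, not what pkg_install or pkg necessarily use.

```python
import hashlib
import os
import py_compile
import tempfile


def sha256_of(path):
    """Return the hex SHA-256 digest of a file."""
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "example.py")  # hypothetical module
    pyc = os.path.join(tmp, "example.pyc")
    with open(src, "w") as fh:
        fh.write("answer = 42\n")
    mode = py_compile.PycInvalidationMode.TIMESTAMP

    # "Build host": compile with one source timestamp and record the checksum.
    os.utime(src, (1_000_000_000, 1_000_000_000))
    py_compile.compile(src, cfile=pyc, invalidation_mode=mode)
    packaged_checksum = sha256_of(pyc)

    # "Target host": same source content, different st_mtime, recompiled.
    os.utime(src, (1_500_000_000, 1_500_000_000))
    py_compile.compile(src, cfile=pyc, invalidation_mode=mode)
    recompiled_checksum = sha256_of(pyc)

checksums_differ = packaged_checksum != recompiled_checksum
print("checksum changed after recompile:", checksums_differ)
```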

Mitigation considerations

At first glance, the simplest solution would be to avoid creating a checksum for those files, since the CPython interpreter will not stop (re)compiling the files unless

  1. PYTHONDONTWRITEBYTECODE is set in the environment or -B passed to ${PYTHON_CMD}

  2. PEP-517 or distutils install commands pass the --no-compile parameter (to avoid the initial creation of .pyc/.pyo files)

  3. the CPython package directories (${PYTHON_SITELIBDIR} and friends) are read-only and hence inaccessible for write operations

2) and 3) work as long as ports are built as a non-root user. When root, the CPython package directories are accessible for write operations, so bytecode is written, causing filesystem violations. 1) is not feasible unless login.conf is changed or another way to set the environment variable for the root user is realised.
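Condition 1) can be observed from Python itself: both the -B option and the PYTHONDONTWRITEBYTECODE variable end up in sys.dont_write_bytecode. The following sketch starts child interpreters to show this. The helper name probe is made up for the example.

```python
import os
import subprocess
import sys

PROBE = "import sys; print(sys.dont_write_bytecode)"


def probe(extra_args=(), extra_env=None):
    """Run a child interpreter and report its sys.dont_write_bytecode value."""
    env = dict(os.environ)
    # Start from a clean slate so the caller's environment does not leak in.
    env.pop("PYTHONDONTWRITEBYTECODE", None)
    env.update(extra_env or {})
    result = subprocess.run(
        [sys.executable, *extra_args, "-c", PROBE],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


default_setting = probe()
with_flag = probe(extra_args=["-B"])
with_env = probe(extra_env={"PYTHONDONTWRITEBYTECODE": "1"})

print("default:", default_setting)
print("-B:", with_flag)
print("PYTHONDONTWRITEBYTECODE=1:", with_env)
```

Neither setting removes bytecode that already exists; they only prevent the interpreter from writing new files.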

The FreeBSD port maintainers and packaging tools are not in a position to decide for users whether byte-compiling files should be performed or not. Hence the three conditions above should be supported but not enforced on users.

Current approach

This is a hybrid of not compiling bytecode during the port build process and avoiding package checksum mismatches (both described under Historical approaches). Using pkg triggers, bytecode files are compiled or removed after pkg transactions complete, mimicking what Python's own packaging tools such as pip do by default. The triggers operate only on ${PYTHON_SITELIBDIR} and follow PEP-3147. This was implemented as D34739.
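The following is a rough sketch of the two operations such a trigger has to perform, written against the standard library only. It is not the code from D34739, and the function names compile_tree and remove_stale_bytecode are invented for this illustration. It assumes the PEP-3147 layout, in which bytecode lives in __pycache__ directories next to the sources.

```python
import compileall
import importlib.util
import os
import tempfile


def compile_tree(site_dir):
    """Byte-compile every .py file under site_dir into __pycache__ dirs."""
    return compileall.compile_dir(site_dir, quiet=1)


def remove_stale_bytecode(site_dir):
    """Delete __pycache__/*.pyc files whose source file no longer exists."""
    removed = []
    for root, _dirs, files in os.walk(site_dir):
        if os.path.basename(root) != "__pycache__":
            continue
        for name in files:
            if not name.endswith(".pyc"):
                continue
            pyc = os.path.join(root, name)
            try:
                source = importlib.util.source_from_cache(pyc)
            except ValueError:
                continue  # file name does not follow PEP-3147
            if not os.path.exists(source):
                os.remove(pyc)
                removed.append(pyc)
    return removed


# Demonstration in a throwaway directory standing in for the site directory.
with tempfile.TemporaryDirectory() as site_dir:
    source = os.path.join(site_dir, "mod.py")  # hypothetical module
    with open(source, "w") as fh:
        fh.write("value = 1\n")

    compile_tree(site_dir)  # after "install"
    expected_pyc = importlib.util.cache_from_source(source)
    compiled_ok = os.path.exists(expected_pyc)

    os.remove(source)  # the package is "deinstalled"
    removed = remove_stale_bytecode(site_dir)
    removed_ok = removed == [expected_pyc] and not os.path.exists(expected_pyc)

print("compiled:", compiled_ok, "cleaned up:", removed_ok)
```

Because the bytecode is created and removed outside the package's file list, no checksum is ever recorded for it.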

This approach was chosen primarily because the RECORD file defined by the Python Packaging Authority's "Recording installed projects" specification is not required to include bytecode, and thus devel/py-installer, which is used during the port build's install/stage phase, does not include it. (By contrast, the distutils install command's --record parameter does include bytecode.) While PEP-3147 specifies a uniform file and directory format for bytecode, it is not prudent to infer plist entries from that format when such files are not included in RECORD. Additionally, this approach gives users the most flexibility to decide whether they want bytecode or not, especially when such an option is available in the lang/python ports. Finally, bytecode is a CPython implementation detail that implementations other than CPython do not compile or use, so packaging the same Python ports with those additional flavours would break if bytecode were included in plists.
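To make the RECORD argument concrete: RECORD is a CSV file with one row per installed file (path, hash, size). The sketch below splits the paths of such a file into source and bytecode entries. The sample content is entirely hypothetical, with placeholder hashes, and only mirrors the shape of a RECORD written without bytecode.

```python
import csv
import io

# Hypothetical RECORD content; the hashes and sizes are placeholders.
SAMPLE_RECORD = """\
example/__init__.py,sha256=PLACEHOLDER,0
example/core.py,sha256=PLACEHOLDER,0
example-1.0.dist-info/METADATA,sha256=PLACEHOLDER,0
example-1.0.dist-info/RECORD,,
"""


def split_record(record_text):
    """Return (non_bytecode_paths, bytecode_paths) listed in a RECORD file."""
    others, bytecode = [], []
    for row in csv.reader(io.StringIO(record_text)):
        if not row:
            continue
        path = row[0]
        if path.endswith((".pyc", ".pyo")):
            bytecode.append(path)
        else:
            others.append(path)
    return others, bytecode


others, bytecode = split_record(SAMPLE_RECORD)
print("entries:", len(others), "bytecode entries:", len(bytecode))
```

A plist derived from such a RECORD has no bytecode entries to checksum in the first place.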

Deterministic bytecode

By default, bytecode is validated by comparing the st_mtime recorded in it against that of the original source, which is not deterministic and thus results in non-reproducible builds. PEP-552 addresses this by providing the option of verifying a deterministic hash of the source instead. Passing the appropriate flags for this option to any compile commands during the port build process would make it more likely that the packaged bytecode is used rather than immediately invalidated.
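A small sketch of the difference, using py_compile's invalidation_mode parameter (compileall exposes the same choice through its --invalidation-mode option). It compares only the 16-byte .pyc header for one unchanged source compiled with two different st_mtime values; whether the remainder of the file is byte-identical across hosts depends on other factors and is not checked here. The helper name pyc_header is illustrative.

```python
import importlib.util
import os
import py_compile
import tempfile

SOURCE = b"answer = 42\n"


def pyc_header(src, pyc, mode, mtime):
    """Set the source mtime, compile with the given mode, return the header."""
    os.utime(src, (mtime, mtime))
    py_compile.compile(src, cfile=pyc, invalidation_mode=mode)
    with open(pyc, "rb") as fh:
        return fh.read(16)


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "example.py")  # hypothetical module
    pyc = os.path.join(tmp, "example.pyc")
    with open(src, "wb") as fh:
        fh.write(SOURCE)

    Mode = py_compile.PycInvalidationMode
    ts_a = pyc_header(src, pyc, Mode.TIMESTAMP, 1_000_000_000)
    ts_b = pyc_header(src, pyc, Mode.TIMESTAMP, 1_500_000_000)
    hash_a = pyc_header(src, pyc, Mode.CHECKED_HASH, 1_000_000_000)
    hash_b = pyc_header(src, pyc, Mode.CHECKED_HASH, 1_500_000_000)

print("timestamp headers equal:", ts_a == ts_b)
print("hash-based headers equal:", hash_a == hash_b)
print("header embeds source hash:",
      hash_a[8:16] == importlib.util.source_hash(SOURCE))
```

With a checked hash, the interpreter revalidates by hashing the source on import, so a changed st_mtime alone no longer invalidates the bytecode.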

This approach was not chosen because the aforementioned pitfalls were not addressed. Further, no installation tool that PyPA supports supplies a parameter to pass invalidation_mode to compileall.

Historical approaches

Retained for historical purposes and commentary.

Avoiding package checksum mismatches

Solution for pkg

In contrast to pkg_install, pkgng picks up the ${TMPPLIST} and creates the checksums internally, out of reach of any post-checksum operation of the kind used for pkg_install. Hence pkg register should be enhanced to accept a set of excludes (e.g. as regular expressions) so that no checksum is created for files matching them.

TODO: Implement/describe the pkgng extension.

Solution for pkg_install

To avoid checksum mismatches for pkg_install (pkg_delete and make deinstall), ${PKG_DBDIR}/${PKGNAME}/+CONTENTS must not contain checksums for the byte-compiled files. Since changing pkg_install just for this case is not a viable option, the necessary code might go into the ports tree's Makefile infrastructure.

By enhancing the fake-pkg target in bsd.port.mk, we can strip the offending checksums quite easily, while avoiding changes to ${TMPPLIST} that would cause side effects for other ports.

                ${ECHO_MSG} "===>   Registering installation for ${PKGNAME}"; \
                ${MKDIR} ${PKG_DBDIR}/${PKGNAME}; \
                ${PKG_CMD} ${PKG_ARGS} -O ${PKGFILE} > ${PKG_DBDIR}/${PKGNAME}/+CONTENTS; \
+               ${SED} -i -e '/\.py[co]$/{n;d;}' ${PKG_DBDIR}/${PKGNAME}/+CONTENTS; \
                ${CP} ${DESCR} ${PKG_DBDIR}/${PKGNAME}/+DESC; \
                ${ECHO_CMD} ${COMMENT:Q} > ${PKG_DBDIR}/${PKGNAME}/+COMMENT; \
                if [ -f ${PKGINSTALL} ]; then \

Note: http://people.freebsd.org/~mva/python_checksum.patch will always contain the most recent version of the patch.
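For readers less familiar with sed, the expression '/\.py[co]$/{n;d;}' keeps every line ending in .pyc or .pyo and deletes the line that immediately follows it, which in +CONTENTS is the "@comment MD5:" checksum line. A Python sketch of the same logic follows; the sample listing is hypothetical and uses placeholder digests.

```python
import re

BYTECODE_LINE = re.compile(r"\.py[co]$")

# Hypothetical +CONTENTS excerpt; digests are placeholders.
SAMPLE_CONTENTS = [
    "@cwd /usr/local",
    "lib/python2.7/site-packages/foo.py",
    "@comment MD5:<digest-of-foo.py>",
    "lib/python2.7/site-packages/foo.pyc",
    "@comment MD5:<digest-of-foo.pyc>",
    "lib/python2.7/site-packages/foo.pyo",
    "@comment MD5:<digest-of-foo.pyo>",
]


def strip_bytecode_checksums(lines):
    """Mimic sed '/\\.py[co]$/{n;d;}': drop the line after each .pyc/.pyo line."""
    kept = []
    iterator = iter(lines)
    for line in iterator:
        kept.append(line)
        if BYTECODE_LINE.search(line):
            next(iterator, None)  # discard the following checksum line
    return kept


stripped = strip_bytecode_checksums(SAMPLE_CONTENTS)
for line in stripped:
    print(line)
```

Like the sed expression, this drops the following line unconditionally, so it relies on every bytecode entry being followed by its checksum line.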

Avoid byte-compiling

If checksums are no longer created, audits based on the packaging tools have a blind spot for the byte-compiled files: unless their st_mtime is tracked elsewhere, the auditor cannot tell whether they were modified, which creates a possible security issue. Additionally, avoiding byte-compiled files can save disk space by reducing the installed package size to the minimum of necessary files.

pkg

TODO: Is it necessary for pkgng to do something here?

pkg_install

To create a sort of frozen package without any byte-compiled files, byte-compiling should be made optional, and it should be up to the user whether byte-compiling is wanted or not. For the ports tree, this means making compilation at installation time optional, so that a --no-compile flag can be passed to distutils and easy_install.

At the moment PYDISTUTILS_INSTALLARGS enforces the -c -O1 options, causing .pyc and .pyo files to be created without giving a user the chance to influence the behaviour. This however is a necessity, so that users do not have to deal with potential security issues.

To make byte-compiling optional, several prerequisites have to be met.

distutils (and easy_install, since it just utilizes distutils) differentiate between C extensions and pure Python extensions and use different intermediate build directories. This needs to be aligned so that the last requirement (auto-populating) can be met for mixed package installations.

This change enables us to track which Python files are installed as Python packages and modules (and hence would be byte-compiled by default).

-PYDISTUTILS_BUILDARGS?=
-PYDISTUTILS_INSTALLARGS?=      -c -O1 --prefix=${PREFIX}
+PYDISTUTILS_BUILDDIR?=         ${WRKSRC}/build/lib
+PYDISTUTILS_BUILDARGS?=                --build-platlib ${PYDISTUTILS_BUILDDIR} --build-purelib ${PYDISTUTILS_BUILDDIR}
+.if !defined(WITHOUT_PYTHON_BYTECOMPILE)
+PYDISTUTILS_COMPILEARGS?=      -c -O1
+.else
+PYDISTUTILS_COMPILEARGS?=      --no-compile
+.endif
+PYDISTUTILS_INSTALLARGS?=      --prefix=${PREFIX}

To populate the ${TMPPLIST} automatically, ${PYDISTUTILS_BUILDDIR} now can be scanned for any *.py file. For each file, a .pyc and .pyo entry can be added to ${TMPPLIST}.

.if defined(USE_PYDISTUTILS) && !defined(WITHOUT_PYTHON_BYTECOMPILE)
_RELDIR=        ${PYTHONPREFIX_SITELIBDIR:S/^${PREFIX}\///}
add-plist-post: add-plist-pyc
add-plist-pyc:
        @${TOUCH} ${TMPPLIST}.pyc_tmp
        @for i in `find ${PYDISTUTILS_BUILDDIR} -type f -name '*.py'`; do \
                PYC=`${ECHO_CMD} $$i | ${SED} "s|.py$$|.pyc|"`; \
                NEWC=`${ECHO_CMD} $${PYC} | ${SED} "s|${PYDISTUTILS_BUILDDIR}||"`; \
                NEWO=`${ECHO_CMD} $${NEWC} | ${SED} "s|.pyc$$|.pyo|"`; \
                ${ECHO_CMD} "${_RELDIR}$${NEWC}" >> ${TMPPLIST}.pyc_tmp; \
                ${ECHO_CMD} "${_RELDIR}$${NEWO}" >> ${TMPPLIST}.pyc_tmp; \
        done; \
        ${CAT} ${TMPPLIST} >> ${TMPPLIST}.pyc_tmp; \
        ${CAT} ${TMPPLIST}.pyc_tmp > ${TMPPLIST};

.endif
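The same plist generation can be sketched in Python for illustration. Like the make target above, it assumes the legacy layout in which .pyc and .pyo files sit next to their .py sources (not the PEP-3147 __pycache__ layout). The function name bytecode_plist_entries and the directory names in the demonstration are made up for the example.

```python
import os
import tempfile


def bytecode_plist_entries(build_dir, rel_site_dir):
    """Return .pyc/.pyo plist entries for every .py file under build_dir."""
    entries = []
    for root, dirs, files in os.walk(build_dir):
        dirs.sort()  # make the traversal order deterministic
        for name in sorted(files):
            if not name.endswith(".py"):
                continue
            rel = os.path.relpath(os.path.join(root, name), build_dir)
            base = rel_site_dir + "/" + rel.replace(os.sep, "/")
            entries.append(base + "c")  # foo.py -> foo.pyc
            entries.append(base + "o")  # foo.py -> foo.pyo
    return entries


# Demonstration with a throwaway build directory.
with tempfile.TemporaryDirectory() as build_dir:
    os.makedirs(os.path.join(build_dir, "pkg"))
    for name in ("__init__.py", "mod.py", "data.txt"):
        with open(os.path.join(build_dir, "pkg", name), "w") as fh:
            fh.write("")
    entries = bytecode_plist_entries(build_dir, "lib/python2.7/site-packages")

for entry in entries:
    print(entry)
```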

The last task is to remove the .pyc and .pyo entries from all pkg-plist files in the ports tree.

Note: http://people.freebsd.org/~mva/pyc_compile.bsd.python.mk.patch will always contain the most recent version of the patch.

Python/CompiledPackages (last edited 2023-02-15T21:56:10+0000 by CharlieLi)