This is still a work in progress. More details to be added for the builder section and REST API.
Port CI
- Used for automated exp-runs and package build queue management.
Overview
The build application will be the central place for queuing, monitoring, and viewing exp-run results. It will have an HTML and JSON REST interface. It will also display production package build status. This application will be the "master" for exp-runs.
New builds will be queued into the master. They may or may not have patches. Patches can be for the ports tree or src tree. Builds can be configured to do a comparison once completed against a known good build to find new failures.
Client builders will check in to the master and ask for work. They will occasionally report their status as well.
Lost/crashed builds will be retried a few times before being marked failed.
Result analysis
- Results will list the following:
- New failures
- New skipped
- New ignored
- New built
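As an illustration only, the comparison against a known good build could be sketched as below. It assumes each build's results are available as a mapping from port origin to a status string. The status names, the `compare_builds` helper, and the `cat/a` style origins are hypothetical and not part of the design.

```python
from typing import Dict, List

# Hypothetical status names; the design only names the four result lists.
STATUSES = ("failed", "skipped", "ignored", "built")


def compare_builds(reference: Dict[str, str], current: Dict[str, str]) -> Dict[str, List[str]]:
    """Return, per status, the ports that have that status now but not in the reference."""
    new: Dict[str, List[str]] = {status: [] for status in STATUSES}
    for origin in sorted(current):
        status = current[origin]
        if status in new and reference.get(origin) != status:
            new[status].append(origin)
    return new


# Made-up data: cat/b regressed, cat/c was fixed, cat/d is new and skipped.
reference = {"cat/a": "built", "cat/b": "built", "cat/c": "failed"}
current = {"cat/a": "built", "cat/b": "failed", "cat/c": "built", "cat/d": "skipped"}
result = compare_builds(reference, current)
```

Here `result["failed"]` holds the new failures, `result["built"]` the newly built ports, and so on for the other two lists.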
Benefits
- Simplify management/sharing of servers in cluster
- Remove the workload of portmgr having to manually run, track, and analyse exp-runs
- Increase transparency for the entire project by letting everyone see what is in the "exp-run queue"
- Encourage committers to upload patches for testing
- Facilitate daily QAT plist builds to increase enforcement of orphans/leftovers, standards
- Facilitate daily QAT non-plist builds to find ports that no longer compile/package
- Allow all portmgr to be involved with exp-runs in a more secure manner than giving root
- Automatically retry failed package builds. Currently they require manual intervention.
Problems with exp-run/build automation not solved here
- Poudriere stability. See Stability section at https://fossil.etoilebsd.net/poudriere/doc/trunk/doc/todo.wiki
- Patches against src/ports tree that the nightly reference is not using require running a specific reference for that patch. It could try to find the "closest" reference, but would deliver many false-positives. We don't want people looking at the list of failures and misjudging the results. We want confidence in the results so we do not introduce new failures into head.
- Similar to the previous item, a reference build would need to be run automatically when one is needed. This is left out of the initial design as it complicates things too much. In general the nightly reference build has worked well. Asking the submitter to rebase the patch against a known reference revision seems reasonable for version 1.
Build types
Exp-run
- Patch against the src or ports tree.
- All ports will be built.
- All existing packages will be deleted.
- Completion will compare against a Reference build.
svn patch; bulk -ca
Reference
- No Patch.
- All ports will be built.
- All existing packages will be deleted.
- Completion will compare against previous Reference build.
bulk -ca
QAT
- No patch.
- All ports will be built.
- All existing packages will be deleted.
- Plist testing will be done.
- Completion will compare against previous QAT build.
bulk -cat
Port
- Patch against the src or ports tree.
- Only the 1 port will be built.
- Existing packages for that 1 port will be deleted.
- Plist testing will be done.
svn patch; testport -n -o
Package
- No patch.
- All ports will be built.
- Existing packages kept.
- NO_PACKAGE packages not built, RESTRICTED packages cleaned up after bulk.
- Completion will compare against the previous build for new failures/skipped (new built is not compared, as it will differ because the build is incremental).
bulk -a
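The five build types above could be captured in a small lookup table, sketched here under the assumption that the master stores them as static configuration. The `BuildType` class and its field names are hypothetical; the command strings are copied verbatim from the descriptions above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class BuildType:
    patched: bool                   # takes a patch against the src or ports tree
    all_ports: bool                 # False means only the one port is built
    packages_deleted: str           # "all", "one" (only that port), or "none"
    plist_testing: bool
    compare_against: Optional[str]  # build type compared against on completion
    command: str                    # command line as given in the design


BUILD_TYPES: Dict[str, BuildType] = {
    "exp-run": BuildType(True, True, "all", False, "reference", "svn patch; bulk -ca"),
    "reference": BuildType(False, True, "all", False, "reference", "bulk -ca"),
    "qat": BuildType(False, True, "all", True, "qat", "bulk -cat"),
    "port": BuildType(True, False, "one", True, None, "svn patch; testport -n -o"),
    "package": BuildType(False, True, "none", False, "package", "bulk -a"),
}


def unpatched_types() -> List[str]:
    """Build types that never carry a patch."""
    return sorted(name for name, bt in BUILD_TYPES.items() if not bt.patched)
```

A table like this makes it easy to answer questions such as which types never carry a patch, which matters for the builder restrictions described later.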
Authentication
- All connections will require SSL.
- The master's REST port SSL cert will be signed by our own CA, which the builders will use exclusively to validate the cert.
- All requests and responses between builder/master will use a shared secret to ensure the commands are coming from known trusted clients.
Queue system
- The system will be fully readable without authentication.
- Via LDAP/Kerberos, all committers will have access to upload a patch.
- Portmgr users can approve patches, cancel builds, retry builds, and modify configuration.
Builder/Master
- Builders will be authenticated with known tokens that can be revoked.
- Each builder will be configured (on the builder machine) with the job types and archs it can handle. This will allow the package build machines to claim package build jobs, but never any kind that has a patch. This ensures the package builder remains secure in this system, as it only takes an order to start a build and never executes any arbitrary data.
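A minimal sketch of the builder-side filter, assuming the builder keeps its allowed job types and archs as sets. The `BuilderConfig` class and its `accepts` method are hypothetical names.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class BuilderConfig:
    job_types: FrozenSet[str]  # build types this machine will claim
    archs: FrozenSet[str]      # architectures this machine can build

    def accepts(self, build_type: str, arch: str) -> bool:
        """True only if both the build type and the arch are allowed locally."""
        return build_type in self.job_types and arch in self.archs


# A package build machine: claims package builds only, never a patched type.
package_builder = BuilderConfig(
    job_types=frozenset({"package"}),
    archs=frozenset({"amd64", "i386"}),
)
```

Because the check runs on the builder, a compromised master could not talk a package builder into running an exp-run or port job.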
Builder registration
- Portmgr will log in to the master and generate a token. The token will be displayed once and never again. This is the shared secret for authenticating requests and must remain secure.
- Portmgr will set up the new builder and specify the token and the master's SSL CA public key in its configuration.
Builder starts up and generates an id from uuidgen.
- Builder connects to Master to register its id and hostname with the token.
- Possibly uses DH1080 here to insert another shared secret that is never exchanged, for each to use in future communications.
- Master registers the builder and associates its id and hostname with the token.
- If token is already registered, deny registration.
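The registration rules above could be sketched on the master side as follows. This is an illustration under assumptions: the `Master` class is hypothetical, `secrets.token_hex` is one possible way to generate a token, `uuid.uuid4()` stands in for `uuidgen`, and the hostnames are placeholders. The optional DH1080 step is not shown.

```python
import secrets
import uuid
from typing import Dict, Optional


class Master:
    def __init__(self) -> None:
        self._tokens: Dict[str, Optional[str]] = {}  # token -> builder uuid, or None if unused
        self._builders: Dict[str, str] = {}          # builder uuid -> hostname

    def generate_token(self) -> str:
        """Create a token; the caller shows it to portmgr exactly once."""
        token = secrets.token_hex(32)
        self._tokens[token] = None
        return token

    def register(self, token: str, builder_uuid: str, hostname: str) -> bool:
        if token not in self._tokens:
            return False  # unknown or revoked token
        if self._tokens[token] is not None:
            return False  # token already registered: deny
        self._tokens[token] = builder_uuid
        self._builders[builder_uuid] = hostname
        return True


master = Master()
token = master.generate_token()
first = master.register(token, str(uuid.uuid4()), "builder1.example.org")
second = master.register(token, str(uuid.uuid4()), "builder2.example.org")
```

The second call is denied because the token is already tied to the first builder.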
Builder requests
- All builder requests to master will include its authentication key. An invalid key will be rejected. This is only so a rogue/unapproved builder cannot steal work.
id: uuid
request_key: sha256("request", builder hostname, uuid, token)
Master responses
- Master responses to the builder will include a validation key so that the builder can verify that the master is genuine and avoid accepting a rogue job. Note that this key is different from the request key, so it cannot simply be replayed from the request. It does contain the same shared secret token though. Any response not containing this valid key will be ignored.
response_key: sha256("response", builder hostname, uuid, token)
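The request and response keys could be computed as sketched below. The design does not say how the four fields are combined before hashing, so joining them with a NUL byte is an assumption, as are the `make_key` and `verify_key` names and the sample values.

```python
import hashlib
import hmac


def make_key(purpose: str, hostname: str, builder_uuid: str, token: str) -> str:
    """sha256 over (purpose, hostname, uuid, token); purpose is "request" or "response"."""
    # Assumption: fields are joined with a NUL byte; the design leaves the encoding open.
    material = "\0".join((purpose, hostname, builder_uuid, token)).encode("utf-8")
    return hashlib.sha256(material).hexdigest()


def verify_key(received: str, purpose: str, hostname: str, builder_uuid: str, token: str) -> bool:
    expected = make_key(purpose, hostname, builder_uuid, token)
    # Constant-time comparison so the check does not leak how many characters matched.
    return hmac.compare_digest(received.encode("utf-8"), expected.encode("utf-8"))


# Placeholder values for illustration only.
hostname = "builder1.example.org"
builder_uuid = "00000000-0000-0000-0000-000000000001"
token = "example-shared-secret"

request_key = make_key("request", hostname, builder_uuid, token)
response_key = make_key("response", hostname, builder_uuid, token)
```

Because the purpose string is part of the hashed material, a captured request key does not validate as a response key, which is the replay protection described above.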
Notifications
- Status updates for unapproved/start/stop/crashed/cancelled/finished/analysed jobs will go to IRC.
IRC
- The IRC bot will be read-only. It will not queue or approve jobs, as there is no way to authenticate securely; even with SSL, the IRCD is a weak point.
Gnats
- Updates to builds that have an associated PR will send an update to gnats when the build crashes, finishes, and once analyzed. There may be a long delay between finished and analyzed which is why there is an extra notice.
- Final results for builds will be emailed to appropriate parties according to the build type.
Queue/Build process
- Patch is uploaded by committer and configured for a specified build.
Only exp-run and port builds are available to non-portmgr.
- Portmgr can queue any build type
Build is marked unapproved until a portmgr approves it.
- Builders check in frequently for work.
- Once builder takes a build, a new job is created and assigned to the builder.
The build will be marked running. The builder will provide a Build URL back to the master so that the master can update the job object.
- A patch id will be given in the response, along with checksum.
- Builder will download associated patch from master if provided and compare checksum.
- Build starts.
- The builder will check-in every 5 minutes and on boot.
Missing 4 check-ins will cause the job to be considered lost: the build's fail_cnt is incremented and its status is moved to retry so that it is retried.
A crashed or stale build seen on startup will cause the builder to notify the master of the failure; the build's fail_cnt is incremented and its status is moved to approved so that it is retried.
If a build crashes 3 times it is marked as failed and not retried again. It may be causing panics and should not continue to bring down builders.
A build can be cancelled at any time by a user. When the builder checks in, if the build that its job is running for has been cancelled, it will receive that notice and cancel its work. Once the work has stopped, it will report back to the master, which will mark the job and build as cancelled.
- When a job is completed the builder will notify the master and provide a URL to a tarball of its log files, along with checksum.
- If the reported job is successful and has been re-queued/reassigned then the new job should be aborted.
- If the reported job is not finished, but is already considered lost, then the job will be aborted.
- Master will download the log files, compare checksum, and then extract locally for later display and analysis.
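Both the patch download on the builder and the log tarball download on the master end with a checksum comparison. A minimal sketch of that check follows. The design does not name the checksum algorithm, so SHA-256 is an assumption, and the function names are hypothetical.

```python
import hashlib
import io
from typing import BinaryIO


def stream_checksum(stream: BinaryIO, chunk_size: int = 65536) -> str:
    """Hash a file-like object in chunks so large tarballs need not fit in memory."""
    digest = hashlib.sha256()  # assumption: the design does not specify the algorithm
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(stream: BinaryIO, expected: str) -> bool:
    return stream_checksum(stream) == expected.strip().lower()


# Stand-in for a downloaded patch; a real caller would open the file in "rb" mode.
payload = b"example patch contents\n"
advertised = hashlib.sha256(payload).hexdigest()
ok = checksum_matches(io.BytesIO(payload), advertised)
tampered = checksum_matches(io.BytesIO(payload + b"extra"), advertised)
```

A download whose checksum does not match should be discarded rather than applied or extracted.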
Exp-run/Port process
When the build is completed, it will be marked as pending-analysis.
When the master has an adequate reference build available it will compare against pending-analysis builds, update their results, and then mark them as analysed.
- Results will be mailed to any associated PR, portmgr, and the person who queued it.
QAT process
When the build is completed, it will be marked as pending-analysis.
When the master has an adequate reference build available it will compare against pending-analysis builds, update their results, and then mark them as analysed.
- New failures will be mailed to ports@, portmgr and potentially CC all committers on the hook for the commit range.
Master
Periodic checks
Check queue to find timed out jobs. Increment timeout_cnt for the job. Once the timeout count reaches 4, the job will be aborted, build's fail_cnt incremented and the build re-queued by changing its status to retry.
Check Exp-runs/Port jobs in pending-analysis state and try to compare against a Reference build. If a reference is not ready, try later. If a reference is done, update status to analyzed and send job notifications.
- Queue up a Reference build nightly
- Queue up a QAT build nightly
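The timed-out job check could look roughly like the sketch below. The thresholds of 4 timeouts and 3 failures come from the text; the `Build` and `Job` classes are simplified, hypothetical stand-ins for the database rows, and treating every fail_cnt increment as counting toward the limit of 3 is one reading of the design.

```python
from dataclasses import dataclass

MAX_TIMEOUTS = 4  # missed check-ins before a job is considered lost
MAX_FAILURES = 3  # failures before a build is marked failed and not retried


@dataclass
class Build:
    status: str = "running"
    fail_cnt: int = 0


@dataclass
class Job:
    build: Build
    timeout_cnt: int = 0
    lost: bool = False
    cancelled: bool = False


def record_timeout(job: Job) -> None:
    """Called by the periodic check for each job that missed a check-in."""
    if job.lost:
        return  # already handled
    job.timeout_cnt += 1
    if job.timeout_cnt < MAX_TIMEOUTS:
        return
    # The job is lost: abort it and decide whether the build gets another try.
    job.lost = True
    job.cancelled = True
    job.build.fail_cnt += 1
    job.build.status = "failed" if job.build.fail_cnt >= MAX_FAILURES else "retry"
```

With a 5 minute check-in interval, a job would be declared lost after roughly 20 minutes of silence.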
Database Tables
Token
token: String of token
builder_id: (optional) ID of assigned builder
uid_generated: UID of user who generated the token
Builder
id: ID of builder
uuid: UUID of builder
hostname: Hostname of builder
CPU: Number of CPUs
RAM: Amount of RAM
status: idle|running|lost
Build
id: ID of build
name: Name of job (security risk, requires sanitization)
type: exp-run|port|qat|reference|package
pr: (optional) Associated PR
patch_id: ID of patch
status: unapproved|approved|assigned|failed-patch|running|crashed|cancelling|cancelled|retry|pending-analysis|analyzed|finished
status_txt: Last status update text
cancelled: Boolean on whether or not build is aborted
uid_queued: UID of who queued the job
uid_approved: UID of who approved the job
notify_email: Comma-separated string of emails to CC on results
arch: i386|amd64
release: 9.1|10.0|head
branch: Which ports branch to use
fail_cnt: Number of times the build has failed
priority: Secondary number to sort the queue by; lower runs first
Job
id: ID of job
build_id: ID of build
lost: Boolean to determine if this is currently active or lost. Only 1 active should exist at once.
status: starting|running|crashed|cancelling|cancelled|finished
status_txt: Last status update text
builder_id: Which builder is assigned this job
buildurl: Build URL
timeout_cnt: Number of times the job has timed out
cancelled: Boolean on whether or not job is aborted
Patch
id: ID of patch
type: src|ports
path: Path to patch on local system
checksum: Checksum for patch
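As a rough illustration, the tables above could be expressed as an SQLite schema. The design names no database engine, so SQLite, the column types, the defaults, and the nullable `patch_id` (for builds without a patch) are all assumptions, and only a subset of the columns is shown. The partial unique index is one possible way to enforce that only 1 active job exists per build.

```python
import sqlite3

SCHEMA = """
CREATE TABLE builder (
    id       INTEGER PRIMARY KEY,
    uuid     TEXT NOT NULL UNIQUE,
    hostname TEXT NOT NULL,
    cpu      INTEGER,
    ram      INTEGER,
    status   TEXT NOT NULL DEFAULT 'idle'
             CHECK (status IN ('idle', 'running', 'lost'))
);
CREATE TABLE token (
    token         TEXT PRIMARY KEY,
    builder_id    INTEGER REFERENCES builder(id),
    uid_generated TEXT NOT NULL
);
CREATE TABLE patch (
    id       INTEGER PRIMARY KEY,
    type     TEXT NOT NULL CHECK (type IN ('src', 'ports')),
    path     TEXT NOT NULL,
    checksum TEXT NOT NULL
);
CREATE TABLE build (
    id        INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    type      TEXT NOT NULL
              CHECK (type IN ('exp-run', 'port', 'qat', 'reference', 'package')),
    pr        TEXT,
    patch_id  INTEGER REFERENCES patch(id),
    status    TEXT NOT NULL DEFAULT 'unapproved'
              CHECK (status IN ('unapproved', 'approved', 'assigned', 'failed-patch',
                                'running', 'crashed', 'cancelling', 'cancelled',
                                'retry', 'pending-analysis', 'analyzed', 'finished')),
    arch      TEXT NOT NULL,
    "release" TEXT NOT NULL,
    branch    TEXT NOT NULL,
    fail_cnt  INTEGER NOT NULL DEFAULT 0,
    priority  INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE job (
    id          INTEGER PRIMARY KEY,
    build_id    INTEGER NOT NULL REFERENCES build(id),
    builder_id  INTEGER REFERENCES builder(id),
    lost        INTEGER NOT NULL DEFAULT 0,
    status      TEXT NOT NULL DEFAULT 'starting'
                CHECK (status IN ('starting', 'running', 'crashed',
                                  'cancelling', 'cancelled', 'finished')),
    buildurl    TEXT,
    timeout_cnt INTEGER NOT NULL DEFAULT 0
);
CREATE UNIQUE INDEX one_active_job_per_build ON job (build_id) WHERE lost = 0;
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open a database and create the tables; the default is a throwaway in-memory DB."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

With this index, inserting a second job for a build fails until the first job has been marked lost.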
REST API
User
Unauthenticated
List all builds
List all jobs
List all patches
List all builders
Retrieve patch
View build
View job
Authenticated: Committer
Create build
Authenticated: Portmgr,Clusteradm
Revoke builder
Halt new jobs
Resume new jobs
Authenticated: Portmgr
Generate builder token
Approve build
Abort build
Abort job
Builder
Authenticated
Register
Request work
Report status
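The user-facing access levels listed above could be modelled as a role-to-action table. This is a sketch under assumptions: the action strings mirror the list above, while the role names, the `is_allowed` helper, and the idea that authenticated roles also inherit the unauthenticated actions (and that portmgr can also create builds) are interpretations rather than stated design.

```python
from typing import Dict, FrozenSet

PUBLIC: FrozenSet[str] = frozenset({
    "list builds", "list jobs", "list patches", "list builders",
    "retrieve patch", "view build", "view job",
})
COMMITTER: FrozenSet[str] = PUBLIC | {"create build"}
CLUSTER_ADMIN: FrozenSet[str] = frozenset({
    "revoke builder", "halt new jobs", "resume new jobs",
})
PORTMGR_ONLY: FrozenSet[str] = frozenset({
    "generate builder token", "approve build", "abort build", "abort job",
})

# Assumption: roles are cumulative on top of the unauthenticated actions.
ROLE_ACTIONS: Dict[str, FrozenSet[str]] = {
    "anonymous": PUBLIC,
    "committer": COMMITTER,
    "clusteradm": PUBLIC | CLUSTER_ADMIN,
    "portmgr": COMMITTER | CLUSTER_ADMIN | PORTMGR_ONLY,
}


def is_allowed(role: str, action: str) -> bool:
    """Unknown roles are treated like unauthenticated users."""
    return action in ROLE_ACTIONS.get(role, PUBLIC)
```

Keeping the table in one place makes it straightforward to apply the same check to both the HTML and JSON interfaces.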
Builder
- All status reports will be queued persistently on the builder so that outages or connectivity issues with the master do not cause them to be lost.
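One way to make status reports survive outages is a small on-disk outbox that is drained in order once the master is reachable again. The sketch below assumes SQLite for storage; the `StatusOutbox` class and the `send` callback are hypothetical, and a real builder would pass a file path rather than the in-memory default.

```python
import json
import sqlite3
from typing import Callable


class StatusOutbox:
    def __init__(self, path: str = ":memory:") -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS outbox ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT NOT NULL)"
        )
        self._db.commit()

    def enqueue(self, status: dict) -> None:
        """Store a status report before any attempt is made to send it."""
        self._db.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(status),))
        self._db.commit()

    def flush(self, send: Callable[[dict], bool]) -> int:
        """Send queued reports oldest first; stop at the first failure and keep the rest."""
        sent = 0
        rows = self._db.execute("SELECT id, payload FROM outbox ORDER BY id").fetchall()
        for row_id, payload in rows:
            if not send(json.loads(payload)):
                break
            self._db.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
            self._db.commit()
            sent += 1
        return sent

    def pending(self) -> int:
        return self._db.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]
```

A report is deleted only after `send` confirms delivery, so a crash between sending and deleting can produce a duplicate report but never a lost one.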