The FreeBSD Ports distfile survey


This page is obsolete as of 201008.


The distfile survey results

Current cron run status

Interpreting the various types of results: DistFileSurveyResults


The distfile survey is broken up into several parts. First, the ports tree is updated, and the 'makefetchlist' script traverses the updated tree to extract information from each port, such as MAINTAINER, any pkg-descr/WWW URL, and of course the list of URLs that would be fetched when downloading the ports. Second, the 'portftpcheck' script checks each URL (with limits on how often a given URL is checked). Third, the 'portftpout' script combines fetchlist and results to create the HTML result files.


After using 'cvsup' to update a private copy of the ports tree (located in /a/fenner/ports/ on pointyhat), and 'cvs update' to update the custom portsurvey .mk files (located in /home/fenner/Mk on the cluster), 'makefetchlist' does a recursive make of bill-fetch and lets make do the heavy lifting. The main work here is in the diffs to the custom .mk files, in particular the bill-fetch target.

Since 'makefetchlist' traverses the whole ports tree, it can sometimes catch odd breakage. Any output to stderr other than make progress messages ends up in


'portftpcheck' reads the whole fetchlist into memory, as well as the whole results history (although it only remembers the latest result). If the latest result for a given URL was OK, or was a hard error (note: hard error checking here needs to be improved), then it waits 14 days before rechecking that URL; otherwise it waits one day (note: comment says 2, code says 1).
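The recheck-interval logic can be sketched as follows (the real 'portftpcheck' script is written in perl; the names and status values here are hypothetical):

```python
import time

# Intervals described above: 14 days after an OK or hard-error result,
# one day otherwise (the script's comment says 2, but the code says 1).
RECHECK_LONG = 14 * 24 * 3600
RECHECK_SHORT = 1 * 24 * 3600

def needs_recheck(last_status, last_timestamp, now=None):
    """Return True if a URL is due for rechecking.

    last_status is 'ok', 'hard-error', or 'soft-error' (None if the URL
    has never been checked); last_timestamp is seconds since 1970, the
    same format the results file uses.
    """
    if last_status is None:
        return True  # never checked before
    if now is None:
        now = time.time()
    wait = RECHECK_LONG if last_status in ('ok', 'hard-error') else RECHECK_SHORT
    return now - last_timestamp >= wait
```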

It parses the URL to determine whether it is http, https or ftp (the schemes it can handle); file: URLs are assumed to be OK, and anything else is flagged as an unknown URL scheme. If it is http or https, it uses the http checker; if it is ftp, it uses the ftp checker.
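The scheme dispatch just described amounts to something like this (a hypothetical helper; again, the real script is perl):

```python
from urllib.parse import urlsplit

def pick_checker(url):
    """Dispatch a URL by scheme, mirroring the description above."""
    scheme = urlsplit(url).scheme.lower()
    if scheme in ('http', 'https'):
        return 'http'           # handled by the http checker
    if scheme == 'ftp':
        return 'ftp'            # handled by the ftp checker
    if scheme == 'file':
        return 'assumed-ok'     # file: URLs are assumed to be OK
    return 'unknown-scheme'     # flagged as an unknown URL scheme
```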

ftp checker

The FTP checker uses the Net::FTP perl module to communicate with FTP sites. It interprets URLs strictly according to RFC 1738, i.e., it performs the following steps:

  1. Log in to the system, using the username and password if present.
  2. Split the remainder of the URL on '/'.
  3. If there are any directory components, issue a 'CWD' command for each.

  4. Issue a 'LIST *' command, where * represents the file component of the URL.

  5. If the file is not found, and the original filename ends in .gz, .Z or .tar, we try again without the suffix to catch auto-compressing or auto-tar'ing servers. (This heuristic may be out of date.)
  6. If the file is still not found, also try adding .Z or .gz, to catch auto-decompressing servers. (This heuristic may be out of date.)
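The steps above can be sketched in Python using the standard ftplib module (the real checker uses the Net::FTP perl module and a 'LIST' command; this sketch substitutes NLST, and all names are illustrative):

```python
from ftplib import FTP, error_perm
from urllib.parse import unquote, urlsplit

def candidate_names(filename):
    """Filenames to try, per the suffix heuristics in steps 5 and 6."""
    names = [filename]
    for suffix in ('.gz', '.Z', '.tar'):
        if filename.endswith(suffix):
            names.append(filename[:-len(suffix)])  # auto-compressing/tar'ing server
    names.extend([filename + '.Z', filename + '.gz'])  # auto-decompressing server
    return names

def check_ftp_url(url, timeout=60):
    """Check one ftp URL per the strict RFC 1738 interpretation above."""
    parts = urlsplit(url)
    ftp = FTP(parts.hostname, timeout=timeout)
    # Step 1: log in, with username and password if present.
    ftp.login(parts.username or 'anonymous', parts.password or 'anonymous@')
    # Steps 2-3: split the remainder on '/' and CWD per directory component.
    components = [unquote(c) for c in parts.path.split('/') if c]
    for directory in components[:-1]:
        ftp.cwd(directory)
    # Steps 4-6: list the file, then try the heuristic alternate names.
    found = False
    for name in candidate_names(components[-1]):
        try:
            if ftp.nlst(name):
                found = True
                break
        except error_perm:
            pass  # e.g. 550 no such file; try the next candidate
    ftp.quit()
    return found
```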

It reuses the control connection to check all URLs for a given FTP site; it does this by:

  1. Saving the result of a 'PWD' command immediately after logging in.

  2. If already logged in to the site, issuing a 'CWD' command to that saved directory instead of performing step 1 above.

This has been relatively reliable, although I haven't checked the FTP protocol spec to see if it's required to work.
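The reuse scheme could be structured like this (a hypothetical sketch; the survey's checker is written in perl):

```python
from ftplib import FTP

class FTPConnectionCache:
    """Reuse one control connection per FTP site.

    After logging in we save the result of a PWD command; for later URLs
    on the same site we CWD back to that directory instead of logging in
    again.
    """

    def __init__(self, timeout=60):
        self.timeout = timeout
        self.sites = {}  # hostname -> (connection, post-login directory)

    def connection_for(self, hostname):
        if hostname in self.sites:
            ftp, login_dir = self.sites[hostname]
            ftp.cwd(login_dir)  # reset to where login left us
            return ftp
        ftp = FTP(hostname, timeout=self.timeout)
        ftp.login()  # anonymous login
        self.sites[hostname] = (ftp, ftp.pwd())  # save the PWD result
        return ftp
```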

Current known bugs:

  1. IO::Socket's asynchronous connect algorithm doesn't work on FreeBSD. Earlier versions caused 'connection refused' to be reported as 'Timeout'; after upgrading IO::Socket, they are now reported as 'connect: Invalid argument'.


http checker

The HTTP checker uses the LWP perl library to perform a HEAD request on the given URL. It follows redirects, verbosely, except for a 302 redirect when fetching a distfile (since the default fetch args include "-A"). It will even follow relative redirects, since these are so common, even though they are not legal according to the HTTP/1.1 spec.
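The HEAD-and-redirect loop can be sketched with Python's standard library (the real checker uses LWP in perl; all names here are illustrative):

```python
import http.client
from urllib.parse import urljoin, urlsplit

def resolve_redirect(base_url, location):
    """Resolve a Location header, including the relative ones noted above."""
    return urljoin(base_url, location)

def head_check(url, max_redirects=5):
    """HEAD the URL and follow redirects, roughly as described above."""
    for _ in range(max_redirects):
        parts = urlsplit(url)
        conn_class = (http.client.HTTPSConnection if parts.scheme == 'https'
                      else http.client.HTTPConnection)
        conn = conn_class(parts.netloc, timeout=60)
        conn.request('HEAD', parts.path or '/')
        response = conn.getresponse()
        conn.close()
        if response.status in (301, 302, 303, 307, 308):
            location = response.getheader('Location')
            if location is None:
                return False
            url = resolve_redirect(url, location)
            continue
        return 200 <= response.status < 300
    return False  # redirect loop or too many hops
```

Note that urljoin preserves the original scheme when resolving a relative Location, so a relative redirect from an "https" URL stays "https" in this sketch.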

Current known bugs:

  1. 'HEAD' is known to give false negatives with non-standards-compliant web servers.

  2. A relative redirect from an "https" URL will 'redirect' to the "http" version.
  3. IO::Socket's asynchronous connect algorithm doesn't work on FreeBSD. Earlier versions caused 'connection refused' to be reported as 'Timeout'; after upgrading IO::Socket, they are now reported as 'connect: Invalid argument'.


Such a mess. So unhappy.

Current data files


Contains 3 whitespace-separated columns per line: the first column says what type of data the line represents, the second column identifies the port, and the third column is the data itself, whose meaning is given by the first column.

Data Types

  * One of the URLs that would be fetched
  * A patchfile URL
  * The URL from the pkg-descr file
  * The package name
  * The port is BROKEN
  * The port is FORBIDDEN
  * The port is DEPRECATED
  * The port has an EXPIRATION_DATE
  * The port has a special fetch method



Contains 4 whitespace-separated columns per line, although the 4th may itself contain whitespace.

  1. portftpcheck's idea of success/failure/dunno, as ">", "<", "?", respectively.

  2. Timestamp in seconds since 1970.
  3. URL
  4. Free-text description of result.

Column 1 can also contain "R", indicating a repeat count for the following result. There are 5 fields in an "R" line:

  1. "R"
  2. The first time that this set of repeats occurred.
  3. URL
  4. The number of times that this set of repeats occurred.
  5. The last time that this set of repeats occurred. This should match the timestamp on the very next line, which is the result that was repeated.

One repeat (#4 == 1) means that the result occurred twice.
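The format above can be parsed with a small helper like this (hypothetical; the survey's own tools are perl):

```python
def parse_results_line(line):
    """Parse one line of the results file described above."""
    if line.split(None, 1)[0] == 'R':
        # "R" <first-time> <URL> <repeat-count> <last-time>
        # A count of 1 means the result occurred twice.
        _, first, url, count, last = line.split()
        return {'type': 'repeat', 'first': int(first), 'url': url,
                'count': int(count), 'last': int(last)}
    # <status> <timestamp> <URL> <free text, which may contain whitespace>
    status, timestamp, url, detail = line.split(None, 3)
    return {'type': 'result', 'status': status, 'timestamp': int(timestamp),
            'url': url, 'detail': detail.rstrip('\n')}
```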


DistfileSurvey (last edited 2015-10-24T07:26:44+0000 by MarkLinimon)