The FreeBSD Ports distfile survey

Note

This page is obsolete as of 201008.

URLs

The distfile survey results

Current cron run status

Interpreting the various types of results: DistFileSurveyResults

Design

The distfile survey is broken up into several parts. First, the ports tree is updated, and the 'makefetchlist' script traverses the updated tree to extract information from each port, such as MAINTAINER, any pkg-descr/WWW URL, and of course the list of URLs that would be fetched when downloading the ports. Second, the 'portftpcheck' script checks each URL (with limits on how often a given URL is checked). Third, the 'portftpout' script combines the fetchlist and the results to create the HTML result files.

makefetchlist

After using 'cvsup' to update a private copy of the ports tree (located in /a/fenner/ports/ on pointyhat), and 'cvs update' to update the custom portsurvey .mk files (located in /home/fenner/Mk on the cluster), 'makefetchlist' does a recursive make of bill-fetch and lets make do the heavy lifting. The main work here is in the diffs to bsd.port.mk, in particular, the target bill-fetch.

Since 'makefetchlist' traverses the whole ports tree, it can sometimes catch odd breakage. Any output to stderr other than make progress messages ends up in http://people.freebsd.org/~fenner/portsurvey/makefetchlisterrs.txt.

portftpcheck

'portftpcheck' reads the whole fetchlist into memory, as well as the whole results history (although it only remembers the latest result). If the latest result for a given URL was OK, or was a hard error (note: hard error checking here needs to be improved), then it waits 14 days before rechecking that URL; otherwise it waits one day (note: comment says 2, code says 1).
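The recheck rule above can be sketched in a few lines. This is an illustrative Python sketch, not the script's actual code; the function names and the boolean `hard_error` flag are assumptions, since this page does not say how hard errors are detected. It uses the one-day retry interval that the code is said to use.

```python
DAY = 24 * 60 * 60  # seconds


def recheck_delay(ok: bool, hard_error: bool) -> int:
    """Seconds to wait before checking a URL again.

    `ok` and `hard_error` describe the latest recorded result for the URL.
    How a hard error is recognised is left to the caller (assumption).
    """
    if ok or hard_error:
        return 14 * DAY
    return 1 * DAY  # soft failures are retried after one day


def is_due(last_checked: int, now: int, ok: bool, hard_error: bool) -> bool:
    """True if enough time has passed since `last_checked` (epoch seconds)."""
    return now - last_checked >= recheck_delay(ok, hard_error)
```

Keeping the delay in one small function would make it easy to reconcile the comment (2 days) with the code (1 day) in a single place.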

It parses the URL to determine whether it's http, https or ftp (the schemes it can handle); file: URLs it assumes are OK and anything else it flags as an unknown URL scheme. If it's http or https, it uses the http checker; if it's ftp, it uses the ftp checker.
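As a rough illustration of this dispatch, the following Python sketch maps a URL to the action described above. The function name and the returned labels are hypothetical and are not taken from 'portftpcheck'.

```python
from urllib.parse import urlsplit


def classify_url(url: str) -> str:
    """Pick the handling for a URL based on its scheme.

    The returned labels are illustrative names only.
    """
    scheme = urlsplit(url).scheme.lower()
    if scheme in ("http", "https"):
        return "http-checker"
    if scheme == "ftp":
        return "ftp-checker"
    if scheme == "file":
        return "assume-ok"  # file: URLs are assumed to be OK
    return "unknown-scheme"  # anything else is flagged
```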

ftp checker

The FTP checker uses the Net::FTP perl module to communicate with FTP sites. It interprets URLs strictly according to RFC 1738, i.e., it performs the following steps:

  1. Log in to the system, using the username and password if present.
  2. Split the remainder of the URL on '/'.
  3. If there are any directory components, issue a 'CWD' command for each.
  4. Issue a 'LIST *' command, where * represents the file component of the URL.
  5. If the file is not found, and the original name ends in .gz, .Z or .tar, try again without the suffix to catch auto-compressing or auto-tar'ing servers. (This heuristic may be out of date.)
  6. If the file is not found, also try adding .Z or .gz, to catch auto-decompressing servers. (This heuristic might be out of date.)
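The path handling and the retry heuristics in the steps above can be sketched without any network access. This Python sketch is illustrative only: the function names are hypothetical, and it assumes that the suffixes in step 6 are appended to the original file name rather than to the stripped one, which this page does not state.

```python
from urllib.parse import unquote, urlsplit


def split_ftp_path(url: str):
    """Split an FTP URL path into CWD components and a file name (steps 2-4)."""
    segments = urlsplit(url).path.split("/")[1:]  # drop the part before the leading '/'
    segments = [unquote(s) for s in segments]     # decode %xx escapes per component
    if not segments:
        return [], ""
    return segments[:-1], segments[-1]


def candidate_names(filename: str):
    """Names to try with LIST, in order (steps 4-6)."""
    names = [filename]
    for suffix in (".gz", ".Z", ".tar"):
        if filename.endswith(suffix):
            names.append(filename[: -len(suffix)])  # step 5: retry without the suffix
            break
    names.extend(filename + s for s in (".Z", ".gz"))  # step 6: retry with a suffix added
    return names
```

A checker would issue one 'CWD' per returned directory component and then try each candidate name in turn until one is listed.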

It reuses the control connection to check all URLs for a given FTP site; it does this by:

  1. Saving the result of a 'PWD' command immediately after logging in.
  2. Issuing a 'CWD' command with the above saved result instead of step 1 above if already logged in.

This has been relatively reliable, although I haven't checked the FTP protocol spec to see if it's required to work.
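The connection reuse described above might look roughly like this. The sketch is in Python rather than Perl and does not use Net::FTP; the class name and the `connect` factory are hypothetical. The connection objects only need `pwd()` and `cwd()` methods, as provided by Python's `ftplib.FTP`.

```python
class FtpSessions:
    """Keeps one logged-in control connection per FTP site."""

    def __init__(self, connect):
        # `connect` is a hypothetical factory: host -> logged-in connection
        self.connect = connect
        self.sessions = {}  # host -> (connection, directory reported after login)

    def at_login_dir(self, host):
        """Return a connection positioned where it was right after login."""
        if host in self.sessions:
            conn, home = self.sessions[host]
            conn.cwd(home)             # replaces the login step on reuse
        else:
            conn = self.connect(host)  # log in once per site
            home = conn.pwd()          # remember the post-login directory
            self.sessions[host] = (conn, home)
        return conn
```

Whether a server must accept the saved 'PWD' result as a 'CWD' argument is exactly the open question noted above, so a real checker may want to fall back to a fresh login if the 'CWD' fails.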

Current known bugs:

  1. IO::Socket's asynchronous connect algorithm doesn't work on FreeBSD. Earlier versions caused 'connection refused' to be reported as 'Timeout'; after upgrading IO::Socket, they are now reported as 'connect: Invalid argument'.

http

The HTTP checker uses the LWP Perl library to perform a HEAD request on the given URL. It follows redirects, verbosely, except for a 302 redirect when fetching a distfile (since the default fetch args include "-A"). It will even follow relative redirects, since these are so common, even though they are not legal according to the HTTP/1.1 spec.
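The redirect policy can be illustrated with a small Python sketch. It is not the LWP-based code; the function names and the set of redirect status codes are assumptions. Resolving a relative Location header against the full request URL keeps the scheme of the original request.

```python
from urllib.parse import urljoin

REDIRECT_CODES = {301, 302, 303, 307}  # assumed set of statuses treated as redirects


def should_follow(status: int, is_distfile: bool) -> bool:
    """Follow redirects, except a 302 when the URL is a distfile."""
    if status not in REDIRECT_CODES:
        return False
    return not (status == 302 and is_distfile)


def redirect_target(request_url: str, location: str) -> str:
    """Resolve a possibly relative Location header against the request URL."""
    return urljoin(request_url, location)
```

For example, `redirect_target("https://www.example.org/dist/foo.tar.gz", "/mirror/foo.tar.gz")` stays on "https", because the scheme comes from the request URL.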

Current known bugs:

  1. 'HEAD' is known to give false negatives with non-standards-compliant web servers.
  2. A relative redirect from an "https" URL will 'redirect' to the "http" version.
  3. IO::Socket's asynchronous connect algorithm doesn't work on FreeBSD. Earlier versions caused 'connection refused' to be reported as 'Timeout'; after upgrading IO::Socket, they are now reported as 'connect: Invalid argument'.

portftpout

Such a mess. So unhappy.

Current data files

fetchlist

Contains 3 whitespace-separated columns per line: the first column says what type of data the line represents, the second column identifies the port, and the third column holds the data of the type named in the first column.

Data Types

  M    MAINTAINER
  U    One of the URLs that would be fetched
  UP   A patchfile URL
  UD   The URL from the pkg-descr file
  P    The package name
  B    The port is BROKEN
  F    The port is FORBIDDEN
  D    The port is DEPRECATED
  E    The port has an EXPIRATION_DATE
  N    The port has a special fetch method
  S    The DIST_SUBDIR
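A reader for this format could be sketched as follows. This Python sketch is illustrative; the function name and the grouping into nested dictionaries are assumptions, and lines that do not have three columns or a known type are simply skipped here.

```python
RECORD_TYPES = {
    "M": "MAINTAINER",
    "U": "distfile URL",
    "UP": "patchfile URL",
    "UD": "pkg-descr URL",
    "P": "package name",
    "B": "BROKEN",
    "F": "FORBIDDEN",
    "D": "DEPRECATED",
    "E": "EXPIRATION_DATE",
    "N": "special fetch method",
    "S": "DIST_SUBDIR",
}


def parse_fetchlist(lines):
    """Group fetchlist records as {port: {type: [data, ...]}}."""
    ports = {}
    for line in lines:
        fields = line.split(None, 2)  # at most 3 columns; the last keeps any spaces
        if len(fields) != 3:
            continue
        kind, port, data = fields
        if kind not in RECORD_TYPES:
            continue
        ports.setdefault(port, {}).setdefault(kind, []).append(data.strip())
    return ports
```

Splitting with a maximum of two splits keeps any whitespace inside the third column intact, which seems the safer choice for values such as a BROKEN reason.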

results

Contains 4 whitespace-separated columns per line, although the 4th may contain whitespace.

  1. portftpcheck's idea of success/failure/dunno, as ">", "<", "?", respectively.
  2. Timestamp in seconds since 1970.
  3. URL
  4. Free-text description of result.

Column 1 can also contain "R", indicating a repeat count for the following result. There are 5 fields in an "R" line:

  1. "R"
  2. The first time that this set of repeats occurred.
  3. URL
  4. The number of times that this set of repeats occurred.
  5. The last time that this set of repeats occurred. This should match the timestamp on the very next line, which is the result that was repeated.

One repeat (#4 == 1) means that the result occurred twice.
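Both line shapes can be parsed with one small function. This Python sketch is illustrative only; the function name and the dictionary keys are assumptions, and it does no validation beyond converting the numeric fields.

```python
STATUS = {">": "success", "<": "failure", "?": "unknown"}


def parse_results_line(line: str) -> dict:
    """Parse one line of the results file into a dictionary."""
    fields = line.split(None, 3)  # the 4th column may contain whitespace
    if fields[0] == "R":
        _, first, url, rest = fields
        count, last = rest.split()
        return {
            "kind": "repeat",
            "url": url,
            "first": int(first),
            "last": int(last),
            "count": int(count),
            "occurrences": int(count) + 1,  # one repeat means two occurrences
        }
    status, timestamp, url = fields[:3]
    text = fields[3].strip() if len(fields) > 3 else ""
    return {
        "kind": "result",
        "status": STATUS.get(status, status),
        "timestamp": int(timestamp),
        "url": url,
        "text": text,
    }
```

A consumer would pair each "R" record with the result line that follows it, whose timestamp should equal the repeat record's last time.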

New Results file format

DistfileSurvey (last edited 2010-12-23 14:04:35 by MarkLinimon)