Git drawbacks

(editor's note: this is an edited version of my private conversations with PeterWemm. The opinions are his, all editing mistakes are mine.)

Workflow issues -- 1

The key problem with git and other dvcs models is that their optimum workflow is the direct inverse of the way we've liked to do things. If you aren't willing to make the workflow changes, those tools will fight you every step of the way. The real question is whether we're willing to make the workflow changes and what the implications are.

In particular, git is optimized for pulling patches, not for pushing to a shared repo.

The costs of the workflow change are offset by the benefits of making the change. But getting the full benefit has a huge downstream impact.

e.g.: suppose a pink unicorn came along with a magic wand that solved all the consensus problems and all we had to do was wave the wand and everybody would be in agreement. And then:

Suppose we converted svn to git as a bare repository for FreeBSD.org src tree.

We immediately lose the concept of version numbers. A downstream consumer has to use the git tools to query whether a particular change is present in their stream. That has implications for security alerts, since you have to be a git consumer. (Unless they're on a release CDROM image cycle of course.)

But even then there are technical issues with a naked repo and doing commits.

Commits are "tree as a unit" changes. There are no subtree commits. You upload your deltas, insert them into the naked repo, and if you don't lose a race, you effectively change the link for 'head' to the new hash head. If you lose the commit race, your push is rejected and you have to start over.
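
(editor's note: the commit race described above can be pictured as a compare-and-swap on the 'head' link. What follows is a toy Python sketch under that assumption, not how git is implemented; SharedRepo, push, and the commit ids are made-up names for illustration.)

```python
class SharedRepo:
    """Toy model of a bare repo: a single ref, 'head', naming a commit id."""

    def __init__(self, head):
        self.head = head

    def push(self, based_on, new_head):
        # Accept the push only if nobody moved head since the pusher fetched.
        if self.head != based_on:
            return False  # lost the race: fetch, rebase, try again
        self.head = new_head
        return True


repo = SharedRepo(head="c1")

# Two committers fetch at the same point, then both try to push.
alice_base = repo.head
bob_base = repo.head

alice_ok = repo.push(alice_base, "c2-alice")
bob_ok = repo.push(bob_base, "c2-bob")
```

Whoever pushes second finds that head no longer matches what their work was based on, so their update is refused even though their change may touch a completely different subtree.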

Or the tree is locked while somebody is uploading from a 28.8k modem in the back of Nigeria.

It's not so much an issue for Linus because there are so few committers to the root tree. It's a pull model.

Workflow issues -- 2

Suppose you clone a repo with branches and that the tree looks vaguely like a src style tree, for the sake of example.

Suppose somebody commits to their private version of RELENG_9 and then pushes. Git connects to the shared repo, uploads code blobs and updates all the tags. All the tags.

Suppose somebody else did something else in parallel and worked on a change set to head. They've got a string of local commits that they want to publish. They do a final sync, do a rebase to make their changes relative to top-of-tree as fast as they can (before somebody else commits to head) and do a git push.

Git then fails because the second person's version of RELENG_9 is stale. So the second person has to stop trying to commit to head, and instead check out 'releng_9', merge origin/releng_9 so their local copy of releng_9 is up to date, and commit to their local repo.

Then, they check out head again and discover somebody else has committed to head while they were messing with releng_9. So, they fetch new head commits, merge. They rebase again to refactor their commit chain against head again, and attempt to push.

If they don't lose a race this time, it'll probably work because releng_9 and head were both up to date and were able to be 'fast forwarded'.
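
(editor's note: "fast forward" here means that the shared branch's current tip is an ancestor of the tip being pushed. A minimal Python sketch of that check follows; the commit ids and the parents table are hypothetical, and a real repo walks its own commit graph.)

```python
def ancestors(parents, commit):
    """Return the commit plus everything reachable through parent links."""
    seen, stack = set(), [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def can_fast_forward(parents, remote_tip, local_tip):
    # The remote ref may move to local_tip only if nothing on it is discarded.
    return remote_tip in ancestors(parents, local_tip)


# Hypothetical history: the shared releng_9 has moved from r1 to r2, while the
# second committer's local releng_9 still sits at r1. Their local head is h2,
# built on top of the shared head h1.
parents = {"r1": [], "r2": ["r1"], "h1": [], "h2": ["h1"]}

stale_releng_ok = can_fast_forward(parents, remote_tip="r2", local_tip="r1")
fresh_head_ok = can_fast_forward(parents, remote_tip="h1", local_tip="h2")
```

In this toy history the stale releng_9 fails the check while head passes it, which is the situation the second committer is in above.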

Basically, committing to a shared repo with git is really, really hostile and is not what it was designed for. It assumes there is somebody like Linus to receive patches in email, merge them into their repo locally, and publish the results, rather than 300 people racing each other to try to get a commit in before their rebased work is invalidated or their branches go stale.

Granted, this is a worst case scenario. But git is not designed for hundreds of people pushing to a shared repo with branches. You could probably lessen the pain by breaking the repo up into fragments, but that has its own problems. How the hell does an end user manage building from source with custom config or make.conf settings? When something has 200+ ports to build as dependencies, you have to have a whole tree checked out. I'm sure there are ways around it, e.g.: portsnap could synthesize a unified tree from multiple repos, but that then requires the use of portsnap. (I use it, but I don't want to force others to.)

git really is designed with a single committer in mind, one who publishes their own work. I'm sure there are patch commit queue tools to lessen the pain, e.g.: they serialize writes, attempt to auto-rebase etc. etc. If those tools are good then I'm sure the experience will be a heck of a lot better than letting hundreds of committers loose trying to write to a shared tree. I don't know what those tools are or where they might be though.
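
(editor's note: no specific commit queue tool is named above, so the following is only a toy Python sketch of the idea of serializing writes and auto-rebasing. run_commit_queue and its arguments are made up for illustration, and a "rebase" here is simply replaying commits on top of the current head; conflicts are not modeled at all.)

```python
from collections import deque


def run_commit_queue(head, submissions):
    """Apply queued submissions one at a time, replaying each onto head.

    head: list of commit ids already on the shared branch.
    submissions: (author, seen, commits) tuples, where seen is how many
    shared commits the author had fetched when the work was written.
    """
    queue = deque(submissions)
    log = []
    while queue:
        author, seen, commits = queue.popleft()
        rebased = seen != len(head)  # head moved since the author synced
        head = head + commits        # one writer at a time, so no race
        log.append((author, rebased))
    return head, log


shared = ["c1"]
final_head, log = run_commit_queue(
    shared,
    [("alice", 1, ["a1"]), ("bob", 1, ["b1", "b2"])],
)
```

Because only the queue writes to the shared branch, nobody loses a race; the cost is that the queue, not the committer, decides what each change ends up being based on.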

Scaling issues

The next question is scaling. You can check out smaller chunks now, I believe, but there's still the 'top of tree' and 'tree as a unit' concept.

It's also switching back from the p4/svn style namespace branches back to cvs-style streamed stuff.

For what it's worth, the svn thing did what I set out to do ... and that's stop losing critical metadata like when files went off and back on -stable branches and so on. You can check out a coherent tree as of a particular rev on any stable branch and it works ... modulo imported brain damage from cvs.

Rant about mergeinfo

We're doing merges wrong in FreeBSD.org. It's not supposed to be done that way. That's why it's not working.

mergeinfo is not intended to be used for release engineering or stable branches. Its purpose is for short lived feature branches where you do a few merges and they collapse out of existence when you merge into the "stable" (i.e.: head) branch.

The correct way to do MFCs would be svn merge -c change, but with the flag that tells it to ignore mergeinfo and not generate any.

mergeinfo is designed for keeping track that you missed merging one particular change out of all of them, and so you can pick up the missing change in the next merge. mergeinfo is designed with the workflow of branches being completely merged, not occasional merges. We record that metadata in the commit log. "MFC change #23546: do stuff". That's what people look at.

We don't have a tool which could tell us if a revision in head is merged to a branch based on that metadata. mergeinfo doesn't tell you that because ... lots of people do diff | patch to avoid the mergeinfo mess. Except people were told right from the beginning to not add mergeinfo to the stable/* branches. It gradually came into more frequent usage, but the last word was "don't do it."
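
(editor's note: as a rough idea of what such a tool could look like, here is a minimal Python sketch that scans branch log messages for the "MFC change #NNN" form quoted above. It assumes every MFC commit message follows exactly that form, which real logs do not; merged_revisions and the sample log are made up for illustration.)

```python
import re

# Matches the message form used in the example above: "MFC change #23546: ..."
MFC_PATTERN = re.compile(r"MFC change #(\d+)")


def merged_revisions(branch_log_messages):
    """Collect the head revision numbers that a branch's log claims to merge."""
    revisions = set()
    for message in branch_log_messages:
        revisions.update(int(n) for n in MFC_PATTERN.findall(message))
    return revisions


# Hypothetical log messages from a stable branch.
stable_log = [
    "MFC change #23546: do stuff",
    "Fix a typo in a comment",
]

merged = merged_revisions(stable_log)
is_merged = 23546 in merged
is_missing = 23547 in merged
```

This only reports what the commit messages say; a change merged by diff | patch with no such message would be invisible to it.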

There was almost a pre-commit rule to reject commits that added it to branches.

mergeinfo simply wasn't designed with the '-stable' branch model in mind. It was designed more with 'branch head, work on something, do a couple of merges to branch ... test, commit. branch is over.' in mind.

FreeBSD.org's stable/release model is kind of an aberration compared to what most people do.

There is no perfect match for our workflow. The problem is that FreeBSD.org has the same people doing both development and release engineering. (editor's note: he later expanded on these ideas more in FreeBSD-ng-detail.)

In most of the rest of the universe, you have developers who work in their sandboxes ... doing feature development and "head" is supposed to be the stable -dev tree at any time. Then, an entirely separate entity grabs stuff when it's time for a release and bashes it into release state ... assigns version numbers, packages, etc. etc.

The fact that we *have* -stable trees in our dev repo is kind of an aberration compared to most of the rest of the universe.

Small companies take snapshots of their -dev tree (which is supposed to be "stable" at any point), package it as 1.0 (or 0.1-beta) and ship it. Big companies/entities have entirely separate universes partitioning their developers from release engineering. The fact that we have head freezes at all in FreeBSD-universe is an aberration compared to what most people do.

When people are being paid a salary to develop stuff, their bosses want them to keep doing what they're supposed to be doing. Not taking 2-3 months to do release engineering support.

And that's why SCM tools never quite fit us.

You have stuff from the Linux universe where the developers never do release engineering ... that's somebody's job like Redhat or Ubuntu or whatever. And git is well suited to that universe. But Redhat use rpm for release management, not git.

Meanwhile in FreeBSD-universe we expose the SCM to the end users.

While we don't put svn on the default install, it's everywhere ... security patches, the works.

That's why the git folks say revision numbers are a distraction and since they're inconvenient to implement in their model, they're not doing it. Your Ubuntu or Redhat consumers aren't looking at SCM id's to see if they need a patch, they're looking at deb or rpm or whatever.

Final thoughts

git can implement our shared cvs/svn repo push-commit model for a few users. That is the key ... proponents can easily manufacture a "demo" that it works for a few users, or one person using multiple commits/pushes from multiple machines/accounts. But those demos only show how it works when effectively serialized. It does not scale to Many(tm) people committing, especially with branches being involved.

I'm sorry if that sounds grim, but it is what it is. git is a dvcs to coordinate flow between multiple single-user, single-writer repos. It is not a shared-commit repo manager. It can sort-of do it, but you have to fight the tool every step of the way.

GitDrawbacks (last edited 2012-01-04 18:32:34 by MarkLinimon)