Article summary
Learning Git can be overwhelming, especially if you have not had any previous experience with a version control system. Many Git tutorials begin with a few basic commands, and you can probably get by on those for most day-to-day tasks. But eventually, you’ll run into a situation that the tutorial didn’t cover (like, “Oops, I just committed to the wrong branch”).
Fortunately, a quick Google search will reveal a Stack Overflow post with a cryptic command to get the job done. And given the number of upvotes, it will probably work. But unless you understand why it worked, you’ll likely be right back there again the next time you run into a similar problem.
Beneath the often-confusing command-line syntax, Git is really not as complicated as it seems. And many issues are easier to resolve when you first understand the underlying structure.
Commits
Conceptually, Git tracks changes, and the basic unit of a Git repository is the commit. Deep down (and not even that deep), Git is just a collection of commit objects. A commit includes information about a change, such as the author, the ID of its parent commit, a log message, and a diff, which describes the actual differences between changed files (internally, Git stores a snapshot of the files and derives the diff by comparing it with the parent, but the storage details are only slightly more complicated).
Commits are immutable by design, because a commit’s ID is simply the SHA-1 hash of its contents. If any property of a commit is modified, the hash (i.e. the commit’s ID) changes too, so the result is a different commit.
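To make the hashing idea concrete, here is a minimal Python sketch of how an object ID can be derived. It assumes Git's usual object framing (a `<kind> <size>` header, a NUL byte, then the body) and uses a made-up author line and an empty tree; the helper names `git_object_id` and `commit_body` are mine, not Git's.

```python
import hashlib
from typing import Optional


def git_object_id(kind: str, body: bytes) -> str:
    # Git hashes a small header ("<kind> <size>" plus a NUL byte) followed by the body.
    header = f"{kind} {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()


def commit_body(tree: str, parent: Optional[str], author: str, message: str) -> bytes:
    # Simplified commit layout: tree, optional parent, author, committer, blank line, message.
    lines = [f"tree {tree}"]
    if parent is not None:
        lines.append(f"parent {parent}")
    lines += [f"author {author}", f"committer {author}"]
    return ("\n".join(lines) + "\n\n" + message + "\n").encode()


EMPTY_TREE = git_object_id("tree", b"")           # ID of a tree with no files
AUTHOR = "A U Thor <author@example.com> 0 +0000"  # hypothetical author line

original = git_object_id("commit", commit_body(EMPTY_TREE, None, AUTHOR, "Fix tpyo"))
reworded = git_object_id("commit", commit_body(EMPTY_TREE, None, AUTHOR, "Fix typo"))

# Changing only the log message yields a different ID, i.e. a different commit.
print(original != reworded)  # True
```

Because the parent's ID is part of the hashed body, a change to a parent would ripple into the IDs of every commit that comes after it.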
Perhaps counterintuitively, this immutability makes it easy and safe to rewrite history. With Git, you can modify a past log message, squash multiple commits into one, or even change the order of commits. Instead of modifying any commits in place, all of these operations create new commits and then graft them into the tree.
The old commits still exist, but they are essentially orphaned because no other commits, branches, or tags point to them (Git’s garbage collection automatically clears out orphans after a grace period: by default, reflog entries for unreachable commits expire after 30 days, versus 90 days for reachable ones). Since a commit only points at its parent (or parents, which can happen when branches are merged), replacing a commit somewhere in the middle of a tree also requires replacing any that come after it.
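A toy model may help here. The sketch below is not how Git is implemented; it assumes a linear history, derives each ID from just the parent ID and the message, and uses hypothetical helpers (`commit`, `history`, `reword`). It shows that rewording a commit in the middle means building new copies of that commit and of everything after it, while the originals stay in the store untouched.

```python
import hashlib

store = {}  # commit ID -> (parent ID or None, message)


def commit(parent, message):
    # Toy content addressing: the ID depends on the parent ID and the message.
    cid = hashlib.sha1(f"{parent}\n{message}".encode()).hexdigest()[:8]
    store[cid] = (parent, message)
    return cid


def history(tip):
    # Follow parent pointers from the tip back to the root (newest first).
    out = []
    while tip is not None:
        out.append(tip)
        tip = store[tip][0]
    return out


def reword(tip, target, new_message):
    # Create a replacement for `target`, then replay its descendants on top of it.
    chain = history(tip)
    descendants = chain[:chain.index(target)]       # newest first
    new_tip = commit(store[target][0], new_message)
    for old_id in reversed(descendants):            # oldest first
        new_tip = commit(new_tip, store[old_id][1])
    return new_tip


a = commit(None, "Add parser")
b = commit(a, "Fix tpyo")
c = commit(b, "Add tests")

new_tip = reword(c, b, "Fix typo")

print([store[cid][1] for cid in history(new_tip)])  # ['Add tests', 'Fix typo', 'Add parser']
print(new_tip != c, b in store and c in store)      # True True
```

The root commit `a` is shared by both histories; only the reworded commit and its descendant had to be recreated.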
Branches and Tags
The system of commits pointing at their parents is what forms the underlying tree structure of a Git repository. A branch is little more than a named pointer to a specific commit. Since commits maintain their own intrinsic order by pointing to their parents, you can derive the history by simply following the pointers from one commit to the previous one. A branch points to its most recent commit, and this pointer is automatically updated each time a commit is added to the branch.
A tag is similar to a branch, except that it is basically a bookmark that is not automatically updated. Tags are typically used to pin the contents of the repository at a particular point in time (e.g. a software release).
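The difference between the two kinds of pointer can be sketched in a few lines of Python. This is a deliberately simplified model with a single parent per commit and sequential IDs; the `Repo` class and its methods are illustrative names, not anything Git provides.

```python
class Repo:
    def __init__(self):
        self.parents = {}    # commit ID -> parent ID (None for the first commit)
        self.branches = {}   # branch name -> commit ID
        self.tags = {}       # tag name -> commit ID
        self.current = "main"
        self._count = 0

    def commit(self):
        self._count += 1
        cid = f"c{self._count}"
        self.parents[cid] = self.branches.get(self.current)
        self.branches[self.current] = cid  # the branch pointer advances automatically
        return cid

    def tag(self, name):
        # A tag records where the branch is right now and never moves afterwards.
        self.tags[name] = self.branches[self.current]

    def log(self, start):
        # Derive history by following parent pointers.
        out = []
        while start is not None:
            out.append(start)
            start = self.parents[start]
        return out


repo = Repo()
repo.commit()
repo.commit()
repo.tag("v1.0")
repo.commit()

print(repo.branches["main"])             # c3
print(repo.tags["v1.0"])                 # c2
print(repo.log(repo.branches["main"]))   # ['c3', 'c2', 'c1']
```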
Merging and Rebasing
These are often regarded as “advanced” topics, but conceptually, they are still fairly straightforward (though admittedly, they can get complicated in practice).
It is quite common to begin a new branch in order to develop a new feature. But at some point, that branch needs to be merged back into the branch it came from. Provided there are no conflicts, a merge consists of creating a new commit that points back to all of its parents (typically two, but it’s possible to merge more than two simultaneously!).
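As a rough illustration, a merge commit can be modeled as nothing more than a commit with two parent pointers. The sketch below uses made-up commit names and a hypothetical `ancestors` helper, and it ignores file contents and conflicts entirely.

```python
parents = {}  # commit ID -> tuple of parent IDs


def commit(cid, *parent_ids):
    parents[cid] = parent_ids
    return cid


def ancestors(cid):
    # Collect every commit reachable from `cid`, following all parent pointers.
    seen, stack = set(), [cid]
    while stack:
        cur = stack.pop()
        if cur not in seen:
            seen.add(cur)
            stack.extend(parents[cur])
    return seen


base = commit("base")
main_1 = commit("main-1", base)           # work on the original branch
feature_1 = commit("feature-1", base)     # work on the feature branch
feature_2 = commit("feature-2", feature_1)

# The merge commit points back at the tips of both branches.
merge = commit("merge", main_1, feature_2)

print(parents[merge])            # ('main-1', 'feature-2')
print(sorted(ancestors(merge)))  # ['base', 'feature-1', 'feature-2', 'main-1', 'merge']
```

None of the existing commits were changed; the merge only added one new commit on top of them.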
Having a repository history with a lot of merge commits can make it difficult to locate the commit actually responsible for a particular change. So instead of merging, it’s sometimes more desirable to rebase. Whereas merging leaves existing commits alone and creates a new merge commit, rebasing always creates new commits that are nearly identical to the originals. Think of this as plucking the base of a tree of commits from one place and reconnecting it somewhere else. Basically, you’re changing the parent pointer of the commit at the base of this tree. But remember that whenever any property of a commit needs to change, that commit and all those that follow it must be replaced.
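The "pluck and reconnect" idea can be sketched as follows. This is a toy model, not Git's rebase algorithm: each commit carries a one-line description of its change instead of real file contents, IDs are derived from the parent ID and that description, and `rebase` assumes `old_base` really is an ancestor of `tip`.

```python
import hashlib

store = {}  # commit ID -> (parent ID or None, change description)


def commit(parent, change):
    # Toy content addressing: a new parent always produces a new ID.
    cid = hashlib.sha1(f"{parent}|{change}".encode()).hexdigest()[:8]
    store[cid] = (parent, change)
    return cid


def rebase(tip, old_base, new_base):
    # Gather the changes made since old_base (newest first)...
    changes = []
    cur = tip
    while cur != old_base:
        parent, change = store[cur]
        changes.append(change)
        cur = parent
    # ...then replay them, oldest first, as brand-new commits on top of new_base.
    new_tip = new_base
    for change in reversed(changes):
        new_tip = commit(new_tip, change)
    return new_tip


base = commit(None, "initial")
f1 = commit(base, "feature: step 1")
f2 = commit(f1, "feature: step 2")
m1 = commit(base, "main: hotfix")

new_tip = rebase(f2, old_base=base, new_base=m1)

new_f1 = store[new_tip][0]
print(store[new_tip][1], "/", store[new_f1][1])  # feature: step 2 / feature: step 1
print(store[new_f1][0] == m1)                    # True
print(new_tip != f2 and f2 in store)             # True
```

The replayed commits describe the same changes as the originals, but because their parents differ, they are new commits with new IDs.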
Remotes
Much of the power of Git comes from its ability to sync multiple copies of a repository. By default, when you clone from a remote Git repository, Git will designate the remote upstream repository as “origin” (this is just a convention, and you can actually add additional remotes in order to sync with multiple repositories).
It also sets up some remote-tracking branches in your local repository. These are basically read-only pointers that match the branches in the remote repository. You don’t manipulate these branches directly. Instead, you push commits from your own repository. When you push, Git sends any commits that are found in your local repository, but not in the remote repository. Then the remote updates its branch to the tip of those commits.
A similar process occurs when you pull. Your local repository copies down any commits that exist in the remote repository and then updates its remote-tracking branches. Strictly speaking, a pull is actually composed of two separate operations: a fetch followed by a merge. Depending on your workflow, you may prefer to fetch (which only copies commits from the remote) and then rebase instead of merge. This has the benefit of keeping your history a little cleaner as one continuous stream of changes, rather than branches and merges all over the place.
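Push and fetch can both be pictured as "copy whatever commits the other side is missing, then move a pointer". The sketch below is a simplified model under several assumptions: repositories are plain dictionaries, history is linear, the remote is always called `origin`, and the safety checks a real push performs (such as refusing updates that would discard remote commits) are left out.

```python
def new_repo():
    return {"commits": {}, "branches": {}, "remote_tracking": {}}


def reachable(commits, tip):
    # All commits reachable from `tip` by following single parent pointers.
    seen = set()
    while tip is not None:
        seen.add(tip)
        tip = commits[tip]
    return seen


def push(local, remote, branch):
    tip = local["branches"][branch]
    missing = reachable(local["commits"], tip) - set(remote["commits"])
    for cid in missing:
        remote["commits"][cid] = local["commits"][cid]
    remote["branches"][branch] = tip                    # remote moves its branch
    local["remote_tracking"][f"origin/{branch}"] = tip  # local records the new remote state
    return missing


def fetch(local, remote, branch):
    tip = remote["branches"][branch]
    missing = reachable(remote["commits"], tip) - set(local["commits"])
    for cid in missing:
        local["commits"][cid] = remote["commits"][cid]
    local["remote_tracking"][f"origin/{branch}"] = tip  # only the tracking pointer moves
    return missing


remote, local = new_repo(), new_repo()
local["commits"] = {"c1": None, "c2": "c1"}
local["branches"]["main"] = "c2"

print(push(local, remote, "main") == {"c1", "c2"})  # True

# Someone else adds a commit on the remote.
remote["commits"]["c3"] = "c2"
remote["branches"]["main"] = "c3"

print(fetch(local, remote, "main"))          # {'c3'}
print(local["remote_tracking"]["origin/main"])  # c3
print(local["branches"]["main"])             # c2
```

After the fetch, the local `main` still points at `c2`; bringing it up to `origin/main` is the separate merge (or rebase) step that a pull performs for you.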
Practical Applications
I’ve deliberately avoided giving any example Git commands here because, honestly, the Git command line utilities are kind of a mess (they have improved over time, but still contain inconsistent syntax, strange use of terminology, etc.). But my hope is that if you understand the concepts behind Git, it will be relatively easy to find the command to do what you want (easier than trying to infer concepts from a command that doesn’t make sense).
Quote: “A commit includes information about a change, such as the author, the ID of its parent commit, a log message, and a diff (describing the actual differences between files).”
Unless I’ve misunderstood, the commit actually points to a tree of blobs (representing file contents), not a diff. Most of the blobs (and subtrees) in the tree will probably be the same as the ones pointed to by the parent commit, with new blobs (not diffs) created for changed files.
This is notably different from many other VCSes (e.g. svn), and is one of the pieces of information that helps with understanding what’s happening under the hood.
Hi Kerry, thanks for the comment! It’s true that the actual object that git stores doesn’t itself include a diff. I wanted to try to strike a balance between low-level implementation and high-level concepts, so hopefully it’s not too confusing.