This section tells you how to set up to build a project that is hosted on GitHub,
using Visual Studio 2017. If you don't already have Visual Studio 2017, you
can get the community edition for free from this link.
This section also goes through important routine tasks like getting the latest
changes from GitHub and submitting a pull request to the master branch.
It will show you how to do this using just Visual Studio 2017. Older versions
of Visual Studio, as well as other IDEs, can also be used, but they are not covered here.
You can also use 'raw' git commands, but I don't cover that here.
We use the PerfView project (https://github.com/Microsoft/perfview) as an example, but the vast majority of this section applies to any open source project hosted on GitHub.
However, the step-by-step instructions will not make much sense without some background: a basic understanding of what GIT and GitHub do, and of the basic concepts associated with the GIT source code control system. This is where we start.
The set of all files versioned as a unit by the GIT source code control system is called a repository, or repo for short. Logically a GIT repository contains a collection of snapshots called commits, each of which represents a set of files at one point in time. The data in any such snapshot is hashed (including all its metadata, like when the commit was made), and this hash is the ID for the commit. The full ID is long (40 hex digits for the SHA-1 hash that GIT uses by default), but in practice GIT commands will accept any unambiguous prefix. Typically 8 hex digits is more than sufficient, so you will see instructions talking about commit 534aab38 or commit 3abcd333.
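To make the idea of hash-based IDs and unambiguous prefixes concrete, here is a minimal sketch in Python. It is only a toy model: the `commit_id` and `resolve_prefix` helpers and the sample data are made up for illustration, and real GIT hashes a specially formatted commit object rather than raw bytes like these.

```python
import hashlib

def commit_id(snapshot: bytes, metadata: bytes) -> str:
    """Toy commit ID: a SHA-1 hash over metadata plus contents, as 40 hex digits."""
    return hashlib.sha1(metadata + b"\n" + snapshot).hexdigest()

def resolve_prefix(prefix: str, known_ids: list[str]) -> str:
    """Return the one ID starting with `prefix`; fail if none or several match."""
    matches = [cid for cid in known_ids if cid.startswith(prefix)]
    if len(matches) != 1:
        raise ValueError(f"prefix {prefix!r} matches {len(matches)} commits")
    return matches[0]

# Two hypothetical snapshots; changing the contents or the metadata changes the ID.
ids = [
    commit_id(b"file contents v1", b"author=someone date=2017-01-01"),
    commit_id(b"file contents v2", b"author=someone date=2017-01-02"),
]

short = ids[0][:8]  # an 8-hex-digit prefix, like the ones quoted in instructions
print(short, "->", resolve_prefix(short, ids))
```

A prefix that matches more than one commit (for example the empty string here) is rejected, which is the sense in which a prefix has to be unambiguous.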
One of the very nice properties of naming commits using a hash of the data is that combining two repos into a single one is trivial. If we assume that the hashing GIT does is 'perfect' (and it really is close to perfect), then two commits with the same ID (hash) must contain the same data, which means they really are the same commit. Thus combining two repos is simply a matter of taking the union of all the commits in the two component repos.
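The 'union of commits' idea can be sketched with a dictionary keyed by hash. This is an illustrative model only, not how GIT stores data on disk; `make_repo` and `combine` are hypothetical helpers, and each commit is reduced to a single byte string.

```python
import hashlib

def commit_id(data: bytes) -> str:
    """Toy commit ID: the SHA-1 hash of the commit's data."""
    return hashlib.sha1(data).hexdigest()

def make_repo(*snapshots: bytes) -> dict[str, bytes]:
    """A toy repo: a mapping from commit ID to commit data."""
    return {commit_id(s): s for s in snapshots}

def combine(repo_a: dict[str, bytes], repo_b: dict[str, bytes]) -> dict[str, bytes]:
    """Combine two repos by taking the union of their commits.

    Equal IDs are assumed to mean equal data, so nothing can conflict.
    """
    return {**repo_a, **repo_b}

github = make_repo(b"initial", b"fix bug")
local = make_repo(b"initial", b"add feature")

merged = combine(github, local)
print(len(merged))  # 3: the shared "initial" commit appears only once
```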
This ability to easily combine two repositories makes GIT well suited for 'distributed' source code control, where there is no 'master' repository that all 'clients' work from. Instead each repository can be thought of as the 'master', and a repository hosted on GitHub is only special in the sense that it is a well-known publishing point.
Indeed when working with GitHub you NEVER have just one repository. At the minimum there are two repositories you need to know about. The first is the repository hosted on GitHub, and the other is a local repository on the machine where you make your changes. This local repository is a COMPLETE CLONE of the GitHub one (which includes all commits and thus all history). Thus after cloning, you can create many local changes (commits) without ever communicating with GitHub. Only when you want to publish (called pushing) or get updates (called pulling) do you need connectivity to GitHub.
A repository is a collection of MANY snapshots (commits), and most commits have a predecessor (the snapshot before the change). Some commits have multiple natural predecessors (as when changes from multiple sources are merged). Thus there is the concept of the 'history' of a commit, where you can see the set of commits that over time led to a particular place. A commit can therefore be thought of not just as one snapshot, but as the complete history that is 'reachable' by following predecessor links. GIT has a concept of a pointer to a commit (history) called a branch. Most GIT commands operate on a 'current branch', and when a repo is created, it typically gets a branch called 'master', which is used to represent the 'master' or maximally shared history of the code (you may have other branches that represent releases or other specialized versions of the code base).
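The relationship between commits, predecessor links, and branches can be sketched as a small graph walk. This is a toy model under simplifying assumptions: the short IDs (`a1`, `b2`, ...), the `Commit` class, and the `experiment` branch are invented for the example, and a branch is modeled as nothing more than a name that maps to a commit ID.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Commit:
    id: str
    parents: tuple[str, ...] = ()  # predecessor commit IDs (two or more for a merge)

def history(repo: dict[str, Commit], tip: str) -> set[str]:
    """Return every commit ID reachable from `tip` by following predecessor links."""
    seen: set[str] = set()
    stack = [tip]
    while stack:
        cid = stack.pop()
        if cid in seen:
            continue
        seen.add(cid)
        stack.extend(repo[cid].parents)
    return seen

commits = [
    Commit("a1"),                 # first commit, no predecessor
    Commit("b2", ("a1",)),
    Commit("c3", ("a1",)),
    Commit("d4", ("b2", "c3")),   # a merge commit with two predecessors
]
repo = {c.id: c for c in commits}

# A branch is just a named pointer to a commit (and therefore to its history).
branches = {"master": "d4", "experiment": "c3"}

print(sorted(history(repo, branches["master"])))      # ['a1', 'b2', 'c3', 'd4']
print(sorted(history(repo, branches["experiment"])))  # ['a1', 'c3']
```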
While it is possible to commit your changes directly to the 'master' branch, branches in GIT are very cheap, and GIT encourages you to make a new branch (derived from 'master') whenever independent work is being done. These independent branches can then be merged into 'master' (or, as we will see, submitted as pull requests) independently of each other.
This section tells you about two possible ways to use a GitHub repository, and why you should choose one or the other.
If either of the following two conditions holds
- You have Read/Write permission to the GitHub repository of interest.
- You only have read access, but you never want to 'push' changes back to the repository.
then you can use a simple GIT setup with only two repositories (the one on GitHub and the local
repository on your development machine). In this setup you can pull new commits
from the GitHub repository to your local one to keep your local copy up to date.
You can even make local changes for your own personal experimentation in your local repository.
However, you can only 'push' those changes back to GitHub if you have write permission to
the repo. This is fine for personal projects, but not good enough for open source projects,
because there is no way for people without write access to contribute updates.
In an open source project, we want the ability for ANYONE to PROPOSE a change
to the main repository, but we need an APPROVAL process so that only those with
special permissions (the maintainers) can actually update the main repository.
GitHub accomplishes this with a process known as the pull request. A pull request
is a GitHub procedure where any user can indicate that they would like a particular change
(commit) integrated into the GitHub repository. People who do have write permission
to the repository can look at the change and determine if it meets
the standards of the repository and if so merge (pull) the change into it. In this way
arbitrary users can contribute fixes to a repository in a controlled way.
There is a logistical problem with pull requests: where to put the tentative (proposed) changes. We need a commit (set of file changes) that anyone can create, but that can easily be integrated into the main GIT repository. This commit can't be created in the main repository (because it was created by a user without write permission to the main repository). It also can't live only in the user's local repository, because that is on the user's machine and not available to the GitHub maintainers. Thus we need a repository that is in a public place (that is, on GitHub), but is writable by that particular user. GitHub solves this problem with a 3rd repository, which is typically called a fork.
Thus the typical flow for an open source project is
- Create a fork (a copy of the repo) of the open source repository (e.g. https://github.com/Microsoft/perfview) that is writable by you. There is a button called 'Fork' in the upper right corner of the project's GitHub page that does this. Every GitHub user has an 'area' associated with their GitHub user name that can host these writable forks. There may be other locations (shared by a group) where you can put the fork, so clicking the 'Fork' button may prompt you to choose where to put it. Typically you want to use the area associated with just your user. For example, my GitHub user name is vancem, so clicking the 'Fork' button on https://github.com/Microsoft/perfview creates a fork at https://github.com/vancem/perfview (notice it is in the 'vancem' area of GitHub, which is writable by me). If the fork already exists, the button simply takes you there (so you can find your existing forks easily).
- Clone your GitHub fork locally on your machine. This works just as if you had cloned https://github.com/Microsoft/perfview directly, but with the significant difference that you can write to the repository you cloned from, so you can submit changes.
- Prepare a change by making a new branch in your GitHub fork with the changes you want. This involves several sub-steps:
  A. Create a new branch representing your proposed change in your local repository (which is a clone of your GitHub fork).
  B. Create one or more commits in this branch that embody the change you want.
  C. Push your changes to your GitHub fork.
- Once you have a branch you are happy with in your own fork, there is a button on YOUR FORK's GitHub page that allows you to submit that branch as a pull request. When you do this you write a description / rationale for the change, in which you persuade the maintainers to accept your request. There is an area for discussion, and typically the maintainers ask for changes. You update your branch as needed, and the pull request is automatically updated to the latest revision of the branch. Hopefully, after enough discussion and updates, the maintainers accept the pull request and merge your branch from your GitHub fork into the main repository.
- The next time you update your fork from the main repository, your master branch will include the changes from your pull request.
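The flow above can be sketched as a toy model of three repositories (main, fork, local) and who may write to each. This is only an illustration of the permission structure, not of how GitHub is implemented: the `Repo` class, the `maintainer` user, the `my-fix` branch, and the commit labels `c1`..`c3` are all invented, and a branch is simplified to the set of commits in its history.

```python
class Repo:
    """Toy repository: a name, the users allowed to write, and branch -> commits."""
    def __init__(self, name, writers, branches):
        self.name = name
        self.writers = set(writers)
        self.branches = {b: set(cs) for b, cs in branches.items()}

def copy_of(repo, name, writers):
    """Model both 'fork' and 'clone' as a full copy with different writers."""
    return Repo(name, writers, repo.branches)

def push(user, source, dest, branch):
    """Publish a branch from `source` to `dest`; only writers of `dest` may do so."""
    if user not in dest.writers:
        raise PermissionError(f"{user} may not write to {dest.name}")
    dest.branches.setdefault(branch, set()).update(source.branches[branch])

main = Repo("Microsoft/perfview", {"maintainer"}, {"master": {"c1", "c2"}})
fork = copy_of(main, "vancem/perfview", {"vancem"})   # create the fork
local = copy_of(fork, "local clone", {"vancem"})      # clone the fork locally

# Make a new branch locally with one new commit, then push it to the fork.
local.branches["my-fix"] = local.branches["master"] | {"c3"}
push("vancem", local, fork, "my-fix")

# Pushing straight to the main repository is refused: no write permission.
try:
    push("vancem", local, main, "my-fix")
    direct_push_allowed = True
except PermissionError:
    direct_push_allowed = False

# A maintainer accepts the pull request: the fork's branch is merged into main.
main.branches["master"] |= fork.branches["my-fix"]

# Later, updating the fork from main brings the merged change into its master.
fork.branches["master"] |= main.branches["master"]

print(direct_push_allowed)               # False
print(sorted(main.branches["master"]))   # ['c1', 'c2', 'c3']
print(sorted(fork.branches["master"]))   # ['c1', 'c2', 'c3']
```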
So there are two different ways you can set up your clone of the repository:
1. You can simply clone the repository locally and work from it.
2. You can create a fork, and then clone the fork locally and work with that.
The basic trade-off is that option (2) lets you make pull requests (and generally forces all updates to be pull requests), but it is also more cumbersome (syncing to the latest code is harder, and updates have the overhead of the pull request procedure). So the simple answer is
- Use (1) when you have read-write access and want a low-overhead check-in process (private projects typically fall into this bucket).
- Use (1) if you have read-only access and don't need to submit fixes/changes (thus you don't need pull requests).
- Use (2) otherwise, that is, when you have read-only access but want to submit changes.
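The rule of thumb can be written down as a tiny decision function. This is a hedged sketch: `choose_setup` and its return strings are hypothetical, and the remaining case (read-only access, but you want to submit changes) is assumed to call for the fork, since only the fork makes pull requests possible.

```python
def choose_setup(has_write_access: bool, wants_to_submit_changes: bool) -> str:
    """Pick between a direct clone (option 1) and fork-then-clone (option 2)."""
    if has_write_access:
        return "direct clone"   # option (1): low-overhead check-ins
    if not wants_to_submit_changes:
        return "direct clone"   # option (1): read-only use, no pull requests needed
    return "fork + clone"       # option (2): needed in order to make pull requests

print(choose_setup(has_write_access=True, wants_to_submit_changes=True))    # direct clone
print(choose_setup(has_write_access=False, wants_to_submit_changes=False))  # direct clone
print(choose_setup(has_write_access=False, wants_to_submit_changes=True))   # fork + clone
```

Note that a maintainer with write access may still prefer the fork if the project wants every change to go through a pull request; the function only encodes the simple answer.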
A typical scenario is that you start with the direct (unforked) option until you want to start modifying the code base, at which point you switch to the forked option.
Once you have decided on an option, see one of the following.