Going Further with Git

It’s got three keyboards and a hundred extra knobs, including twelve with ‘?’ on them.

— Terry Pratchett

Two of Git’s advanced features let us do much more than just track our work. Branches let us work on multiple things simultaneously in a single repository; pull requests let us submit our work for review, get feedback, and make updates. Used together, they allow us to go through the write-review-revise cycle, familiar to anyone who has ever written a journal paper, in hours rather than weeks.

Your zipf project directory should now include:

zipf/
├── .gitignore
├── bin
│   ├── book_summary.sh
│   ├── collate.py
│   ├── countwords.py
│   ├── plotcounts.py
│   ├── script_template.py
│   └── utilities.py
├── data
│   ├── README.md
│   ├── dracula.txt
│   ├── frankenstein.txt
│   └── ...
└── results
    ├── dracula.csv
    ├── dracula.png
    ├── jane_eyre.csv
    ├── jane_eyre.png
    └── moby_dick.csv

All of these files should also be tracked in your version history. We’ll use them and some additional analyses to explore Zipf’s Law using Git’s advanced features.

What’s a Branch?

So far we have only used a sequential timeline with Git: each change builds on the one before, and only on the one before. However, there are times when we want to try things out without disrupting our main work. To do this, we can use branches to work on separate tasks in parallel. Each branch is a parallel timeline; changes made on the branch only affect that branch unless and until we explicitly combine them with work done in another branch.

We can see what branches exist in a repository using this command:

$ git branch

* master

When we initialize a repository, Git automatically creates a branch called master. It is often considered the “official” version of the repository. The asterisk * indicates that it is currently active, i.e., that all changes we make will take place in this branch by default. (The active branch is like the current working directory in the shell.)
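To make the role of the asterisk concrete, here is a minimal Python sketch that reads the text printed by git branch and reports which branch is active. The function name parse_branches and the branch name experiment are our own inventions for illustration, and the sketch assumes the simple layout shown above (it does not handle special cases such as a detached HEAD).

```python
def parse_branches(output):
    """Return (active_branch, all_branches) from the text printed by `git branch`."""
    active = None
    branches = []
    for line in output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # The first two characters are either "* " (active) or two spaces.
        name = line[2:].strip()
        branches.append(name)
        if line.startswith("*"):
            active = name
    return active, branches


# Hypothetical output for a repository with two branches.
sample_output = "  experiment\n* master\n"
active, branches = parse_branches(sample_output)
print("active branch:", active)
print("all branches:", branches)
```

Only one line can carry the asterisk, just as the shell has only one current working directory at a time.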

Default Branches

In mid-2020, GitHub changed the name of the default branch (the first branch created when a repository is initialized) from “master” to “main.” Owners of repositories may also change the name of the default branch. This means that the name of the default branch may differ among repositories depending on when and where they were created, as well as who manages them.

In the previous chapter, we foreshadowed some experimental changes that we could try and make to plotcounts.py.

Making sure our project directory is our working directory, we can inspect our current plotcounts.py:

$ cd ~/zipf
$ cat bin/plotcounts.py

"""Plot word counts."""

import argparse

import pandas as pd


def main(args):
    """Run the command line program."""
    df = pd.read_csv(args.infile, header=None,
                     names=('word', 'word_frequency'))
    df['rank'] = df['word_frequency'].rank(ascending=False,
                                           method='max')
    ax = df.plot.scatter(x='word_frequency',
                         y='rank', loglog=True,
                         figsize=[12, 6],
                         grid=True,
                         xlim=args.xlim)
    ax.figure.savefig(args.outfile)


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument('infile', type=argparse.FileType('r'),
                        nargs='?', default='-',
                        help='Word count csv file name')
    parser.add_argument('--outfile', type=str,
                        default='plotcounts.png',
                        help='Output image file name')
    parser.add_argument('--xlim', type=float, nargs=2,
                        metavar=('XMIN', 'XMAX'),
                        default=None, help='X-axis limits')
    args = parser.parse_args()
    main(args)

We used this version of plotcounts.py to display the word counts for Dracula on a log-log plot (Figure @ref(fig:git-cmdline-loglog-plot)). The relationship between word count and rank looked linear, but since the eye is easily fooled, we should fit a curve to the data. Doing this will require more than just a trivial change to the script, so to ensure that this version of plotcounts.py keeps working while we try to build a new one, we will do our work in a separate branch. Once we have successfully added curve fitting to plotcounts.py, we can decide if we want to merge our changes back into the master branch.

Creating a Branch

To create a new branch called fit, we run:

$ git branch fit

We can check that the branch exists by running git branch again:

$ git branch

* master
  fit

Our branch is there, but the asterisk * shows that we are still in the master branch. (By analogy, creating a new directory doesn’t automatically move us into that directory.) As a further check, let’s see what our repository’s status is:

$ git status

On branch master
nothing to commit, working directory clean

To switch to our new branch we can use the checkout command that we first saw in the previous chapter:

$ git checkout fit
$ git branch

  master
* fit

In this case, we’re using git checkout to check out a whole repository, i.e., switch it from one saved state to another.

We should choose the name to signal the purpose of the branch, just as we choose the names of files and variables to indicate what they are for. We haven’t made any changes since switching to the fit branch, so at this point master and fit are at the same point in the repository’s history. Commands like ls and git log therefore show that the files and history haven’t changed.

Where Are Branches Saved?

Git saves every version of every file in the .git directory that it creates in the project’s root directory. When we switch from one branch to another, it replaces the files we see with their counterparts from the branch we’re switching to. It also rearranges directories as needed so that those files are in the right places.

What Curve Should We Fit?

Before we make any changes to our new branch, we need to figure out how to fit a line to the word count data. Zipf’s Law says:

The second most common word in a body of text appears half as often as the most common, the third most common appears a third as often, and so on.

In other words, the frequency of a word \(f\) is proportional to an inverse power of its rank \(r\): \( f \propto \frac{1}{r^\alpha} \)

with a value of \(\alpha\) close to one. The reason \(\alpha\) must be close to one for Zipf’s Law to hold becomes clear if we include it in a modified version of the earlier definition:

The most frequent word will occur approximately \(2^\alpha\) times as often as the second most frequent word, \(3^\alpha\) times as often as the third most frequent word, and so on.

This mathematical expression for Zipf’s Law is an example of a power law.

In general, when two variables \(x\) and \(y\) are related through a power law, so that \(y = ax^b\)

taking logarithms of both sides yields a linear relationship: \(\log(y) = \log(a) + b\log(x)\)

Hence, plotting the variables on a log-log scale reveals this linear relationship. If Zipf’s Law holds, we should have

\( r = cf^{\frac{-1}{\alpha}} \)

where \(c\) is a constant of proportionality. The linear relationship between the log word frequency and log rank is then

\( \log(r) = \log(c) - \frac{1}{\alpha}\log(f) \)

This suggests that the points on our log-log plot should fall on a straight line with a slope of \(- \tfrac{1}{\alpha}\) and intercept \(\log(c)\). To fit a line to our word count data we therefore need to estimate the value of \(\alpha\); we’ll see later that \(c\) is completely defined.
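The claim about the slope and intercept can be checked numerically. The sketch below generates ranks from \(r = cf^{-1/\alpha}\) and fits a straight line to the logged values using ordinary least squares. The values of alpha and c are hypothetical, chosen only for illustration, and fit_line is a small helper of our own rather than part of plotcounts.py.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical parameter values, chosen only for illustration.
alpha = 1.1
c = 5000.0

# Ranks that follow the power law exactly: r = c * f**(-1/alpha).
frequencies = [float(f) for f in range(1, 201)]
ranks = [c * f ** (-1 / alpha) for f in frequencies]

slope, intercept = fit_line([math.log(f) for f in frequencies],
                            [math.log(r) for r in ranks])
print("fitted slope:", slope, "expected:", -1 / alpha)
print("fitted intercept:", intercept, "expected:", math.log(c))
```

Because the synthetic data follow the power law exactly, the fitted slope and intercept should agree with \(-\tfrac{1}{\alpha}\) and \(\log(c)\) up to floating-point error. Real word counts are noisier, which is why the chapter uses maximum likelihood rather than a straight-line fit.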

To determine the best method for estimating \(\alpha\), we turn to Moreno-Sanchez et al. (2016), who suggest using a method called maximum likelihood estimation. The likelihood function is the probability of our observed data as a function of the parameters in the statistical model that we assume generated it. We estimate the parameters in the model by choosing them to maximize this likelihood; computationally, it is often easier to minimize the negative log likelihood function. Moreno-Sanchez et al. (2016) define the likelihood using a parameter \(\beta\), which is related to the \(\alpha\) parameter in our definition of Zipf’s Law through \(\alpha = \tfrac{1}{\beta-1}\). Under their model, the value of \(c\) is the total number of unique words, or equivalently the largest value of the rank.

Expressed as a Python function, the negative log likelihood function is:

import numpy as np


def nlog_likelihood(beta, counts):
    """Log-likelihood function."""
    likelihood = - np.sum(np.log((1/counts)**(beta - 1)
                          - (1/(counts + 1))**(beta - 1)))
    return likelihood

Obtaining an estimate of \(\beta\) (and thus \(\alpha\)) then becomes a numerical optimization problem, for which we can use the scipy.optimize library. Again following Moreno-Sanchez et al. (2016), we use Brent’s Method with \(1 < \beta \leq 4\).

from scipy.optimize import minimize_scalar


def get_power_law_params(word_counts):
    """Get the power law parameters."""
    mle = minimize_scalar(nlog_likelihood,
                          bracket=(1 + 1e-10, 4),
                          args=word_counts,
                          method='brent')
    beta = mle.x
    alpha = 1 / (beta - 1)
    return alpha
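As a rough cross-check on what the optimizer is doing, the sketch below evaluates the same negative log likelihood on a coarse grid of \(\beta\) values and keeps the smallest. It uses only the standard library so that it can be run without SciPy. The helper grid_search_beta and the synthetic counts (the word of rank \(r\) appears \(\lfloor 1000/r \rfloor\) times) are assumptions made for illustration; this is not how plotcounts.py estimates \(\beta\).

```python
import math


def nlog_likelihood(beta, counts):
    """Negative log likelihood, mirroring the formula used in the text."""
    return -sum(
        math.log((1 / n) ** (beta - 1) - (1 / (n + 1)) ** (beta - 1))
        for n in counts
    )


def grid_search_beta(counts, lower=1.01, upper=4.0, step=0.01):
    """Return the grid value of beta with the smallest negative log likelihood."""
    n_steps = int(round((upper - lower) / step))
    candidates = [lower + i * step for i in range(n_steps + 1)]
    return min(candidates, key=lambda beta: nlog_likelihood(beta, counts))


# Synthetic, idealized Zipf data (an assumption for illustration):
# the word of rank r appears floor(1000 / r) times.
synthetic_counts = [1000 // rank for rank in range(1, 1001)]

beta_hat = grid_search_beta(synthetic_counts)
alpha_hat = 1 / (beta_hat - 1)
print(f"beta is about {beta_hat:.2f}, alpha is about {alpha_hat:.2f}")
```

For data built to follow Zipf’s Law this closely, the estimate of \(\beta\) should land near 2, which corresponds to \(\alpha\) near 1. A grid search is slower and coarser than Brent’s Method, but it makes the idea of "pick the parameter that minimizes the negative log likelihood" easy to see.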

We can then plot the fitted curve on the plot axes (ax) defined in the plotcounts.py script:

def plot_fit(curve_xmin, curve_xmax, max_rank, alpha, ax):
    """
    Plot the power law curve that was fitted to the data.

    Parameters
    ----------
    curve_xmin : float
        Minimum x-bound for fitted curve
    curve_xmax : float
        Maximum x-bound for fitted curve
    max_rank : int
        Maximum word frequency rank.
    alpha : float
        Estimated alpha parameter for the power law.
    ax : matplotlib axes
        Scatter plot to which the power curve will be added.
    """
    xvals = np.arange(curve_xmin, curve_xmax)
    yvals = max_rank * (xvals**(-1 / alpha))
    ax.loglog(xvals, yvals, color='grey')

where the maximum word frequency rank corresponds to \(c\), and \(-1 / \alpha\) is the exponent in the power law. We have followed the numpydoc format for the detailed docstring in plot_fit; see Appendix @ref(documentation) for more information about docstring formats.

Verifying Zipf’s Law

Now that we can fit a curve to our word count plots, we can update plotcounts.py so the entire script reads as follows:

"""Plot word counts."""

import argparse

import numpy as np
import pandas as pd
from scipy.optimize import minimize_scalar


def nlog_likelihood(beta, counts):
    """Log-likelihood function."""
    likelihood = - np.sum(np.log((1/counts)**(beta - 1)
                          - (1/(counts + 1))**(beta - 1)))
    return likelihood


def get_power_law_params(word_counts):
    """Get the power law parameters."""
    mle = minimize_scalar(nlog_likelihood,
                          bracket=(1 + 1e-10, 4),
                          args=word_counts,
                          method='brent')
    beta = mle.x
    alpha = 1 / (beta - 1)
    return alpha


def plot_fit(curve_xmin, curve_xmax, max_rank, alpha, ax):
    """
    Plot the power law curve that was fitted to the data.

    Parameters
    ----------
    curve_xmin : float
        Minimum x-bound for fitted curve
    curve_xmax : float
        Maximum x-bound for fitted curve
    max_rank : int
        Maximum word frequency rank.
    alpha : float
        Estimated alpha parameter for the power law.
    ax : matplotlib axes
        Scatter plot to which the power curve will be added.
    """
    xvals = np.arange(curve_xmin, curve_xmax)
    yvals = max_rank * (xvals**(-1 / alpha))
    ax.loglog(xvals, yvals, color='grey')


def main(args):
    """Run the command line program."""
    df = pd.read_csv(args.infile, header=None,
                     names=('word', 'word_frequency'))
    df['rank'] = df['word_frequency'].rank(ascending=False,
                                           method='max')
    ax = df.plot.scatter(x='word_frequency',
                         y='rank', loglog=True,
                         figsize=[12, 6],
                         grid=True,
                         xlim=args.xlim)

    word_counts = df['word_frequency'].to_numpy()
    alpha = get_power_law_params(word_counts)
    print('alpha:', alpha)

    # Since the ranks are already sorted, we can take the last
    # one instead of computing which row has the highest rank
    max_rank = df['rank'].to_numpy()[-1]

    # Use the range of the data as the boundaries
    # when drawing the power law curve
    curve_xmin = df['word_frequency'].min()
    curve_xmax = df['word_frequency'].max()

    plot_fit(curve_xmin, curve_xmax, max_rank, alpha, ax)
    ax.figure.savefig(args.outfile)


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument('infile', type=argparse.FileType('r'),
                        nargs='?', default='-',
                        help='Word count csv file name')
    parser.add_argument('--outfile', type=str,
                        default='plotcounts.png',
                        help='Output image file name')
    parser.add_argument('--xlim', type=float, nargs=2,
                        metavar=('XMIN', 'XMAX'),
                        default=None, help='X-axis limits')
    args = parser.parse_args()
    main(args)

We can then run the script to obtain the \(\alpha\) value for Dracula and a new plot with a line fitted.

$ python bin/plotcounts.py results/dracula.csv --outfile \
  results/dracula.png

alpha: 1.0866646252515038

So according to our fit, the most frequent word will occur approximately \(2^{1.1}=2.1\) times as often as the second most frequent word, \(3^{1.1}=3.3\) times as often as the third most frequent word, and so on. Figure @ref(fig:git-advanced-dracula-fit) shows the plot.

[Figure: word count plot for Dracula with the fitted power law curve (figures/git-advanced/dracula-fit.png)]
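The arithmetic behind these ratios is easy to reproduce. In the short sketch below, frequency_ratio is a helper name of our own; the value of alpha is the one printed by the script above.

```python
def frequency_ratio(rank, alpha):
    """How many times as often the top word appears as the word at `rank`."""
    return rank ** alpha


alpha = 1.0866646252515038  # value printed by plotcounts.py for Dracula

for rank in (2, 3):
    ratio = frequency_ratio(rank, alpha)
    print(f"rank {rank}: about {ratio:.1f} times less frequent than rank 1")
```

The text rounds \(\alpha\) to 1.1 before exponentiating, so its figures may differ slightly in the second decimal place from those computed with the unrounded value.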

The script appears to be working as we’d like, so we can go ahead and commit our changes to the fit development branch:

$ git add bin/plotcounts.py results/dracula.png
$ git commit -m "Added fit to word count data"

[fit 38c209b] Added fit to word count data
 2 files changed, 57 insertions(+)
 rewrite results/dracula.png (99%)

If we look at the last couple of commits using git log, we see our most recent change:

$ git log --oneline -n 2

38c209b (HEAD -> fit) Added fit to word count data
ddb00fb (origin/master, master) removing inverse rank
        calculation

(We use --oneline and -n 2 to shorten the log display.) But if we switch back to the master branch:

$ git checkout master
$ git branch

  fit
* master

and look at the log, our change is not there:

$ git log --oneline -n 2

ddb00fb (HEAD -> master, origin/master) removing inverse rank
        calculation
7de9877 ignoring __pycache__

We have not lost our work: it just isn’t included in this branch. We can prove this by switching back to the fit branch and checking the log again:

$ git checkout fit
$ git log --oneline -n 2

38c209b (HEAD -> fit) Added fit to word count data
ddb00fb (origin/master, master) removing inverse rank
        calculation

We can also look inside plotcounts.py and see our changes. If we make another change and commit it, that change will also go into the fit branch. For instance, we could add some additional information to one of our docstrings to make it clear what equations were used in estimating \(\alpha\).

def get_power_law_params(word_counts):
    """
    Get the power law parameters.

    References
    ----------
    Moreno-Sanchez et al (2016) define alpha (Eq. 1),
      beta (Eq. 2) and the maximum likelihood estimation (mle)
      of beta (Eq. 6).

    Moreno-Sanchez I, Font-Clos F, Corral A (2016)
      Large-Scale Analysis of Zipf's Law in English Texts.
      PLoS ONE 11(1): e0147073.
      https://doi.org/10.1371/journal.pone.0147073
    """
    mle = minimize_scalar(nlog_likelihood,
                          bracket=(1 + 1e-10, 4),
                          args=word_counts,
                          method='brent')
    beta = mle.x
    alpha = 1 / (beta - 1)
    return alpha

$ git add bin/plotcounts.py
$ git commit -m "Adding Moreno-Sanchez et al (2016) reference"

[fit 1577404] Adding Moreno-Sanchez et al (2016) reference
 1 file changed, 14 insertions(+), 1 deletion(-)

Finally, if we want to see the differences between two branches, we can use git diff with the same double-dot .. syntax used to view differences between two revisions:

$ git diff master..fit

diff --git a/bin/plotcounts.py b/bin/plotcounts.py
index c511da1..6905b6e 100644
--- a/bin/plotcounts.py
+++ b/bin/plotcounts.py
@@ -2,7 +2,62 @@
 
 import argparse
 
+import numpy as np
 import pandas as pd
+from scipy.optimize import minimize_scalar
+
+
+def nlog_likelihood(beta, counts):
+    """Log-likelihood function."""
+    likelihood = - np.sum(np.log((1/counts)**(beta - 1)
+                          - (1/(counts + 1))**(beta - 1)))
+    return likelihood
+
+
+def get_power_law_params(word_counts):
+    """
+    Get the power law parameters.
+
+    References
+    ----------
+    Moreno-Sanchez et al (2016) define alpha (Eq. 1),
+      beta (Eq. 2) and the maximum likelihood estimation (mle)
+      of beta (Eq. 6).
+
+    Moreno-Sanchez I, Font-Clos F, Corral A (2016)
+      Large-Scale Analysis of Zipf's Law in English Texts.
+      PLoS ONE 11(1): e0147073.
+      https://doi.org/10.1371/journal.pone.0147073
+    """
+    mle = minimize_scalar(nlog_likelihood,
+                          bracket=(1 + 1e-10, 4),
+                          args=word_counts,
+                          method='brent')
+    beta = mle.x
+    alpha = 1 / (beta - 1)
+    return alpha
+
+
+def plot_fit(curve_xmin, curve_xmax, max_rank, alpha, ax):
+    """
+    Plot the power law curve that was fitted to the data.
+
+    Parameters
+    ----------
+    curve_xmin : float
+        Minimum x-bound for fitted curve
+    curve_xmax : float
+        Maximum x-bound for fitted curve
+    max_rank : int
+        Maximum word frequency rank.
+    alpha : float
+        Estimated alpha parameter for the power law.
+    ax : matplotlib axes
+        Scatter plot to which the power curve will be added.
+    """
+    xvals = np.arange(curve_xmin, curve_xmax)
+    yvals = max_rank * (xvals**(-1 / alpha))
+    ax.loglog(xvals, yvals, color='grey')
 
 
 def main(args):
@@ -16,6 +71,21 @@ def main(args):
                          figsize=[12, 6],
                          grid=True,
                          xlim=args.xlim)
+
+    word_counts = df['word_frequency'].to_numpy()
+    alpha = get_power_law_params(word_counts)
+    print('alpha:', alpha)
+
+    # Since the ranks are already sorted, we can take the last
+    # one instead of computing which row has the highest rank
+    max_rank = df['rank'].to_numpy()[-1]
+
+    # Use the range of the data as the boundaries
+    # when drawing the power law curve
+    curve_xmin = df['word_frequency'].min()
+    curve_xmax = df['word_frequency'].max()
+
+    plot_fit(curve_xmin, curve_xmax, max_rank, alpha, ax)
     ax.figure.savefig(args.outfile)
 
 
diff --git a/results/dracula.png b/results/dracula.png
index 57a7b70..5f10271 100644
Binary files a/results/dracula.png and b/results/dracula.png
differ

Why Branch?

Why go to all this trouble? Imagine we are in the middle of debugging a change like this when we are asked to make final revisions to a paper that was created using the old code. If we revert plotcounts.py to its previous state we might lose our changes. If instead we have been doing the work on a branch, we can switch branches, create the plot, and switch back in complete safety.

Merging

We could proceed in three ways at this point:

Add our changes to plotcounts.py once again in the master branch.
Stop working in master and start using the fit branch for future development.
Merge the fit and master branches.

The first option is tedious and error-prone; the second will lead to a bewildering proliferation of branches, but the third option is simple, fast, and reliable. To start, let’s make sure we’re in the master branch:

$ git checkout master
$ git branch

  fit
* master

We can now merge the changes in the fit branch into our current branch with a single command:

$ git merge fit

Updating ddb00fb..1577404
Fast-forward
 bin/plotcounts.py   |  70 ++++++++++++++++++++++++++++++++++++++
                           +++++++++++++++++++++++++++++++
 results/dracula.png | Bin 23291 -> 38757 bytes
 2 files changed, 70 insertions(+)

Merging doesn’t change the source branch fit, but once the merge is done, all of the changes made in fit are also in the history of master:

$ git log --oneline -n 4

1577404 (HEAD -> master, fit) Adding Moreno-Sanchez et al
        (2016) reference
38c209b Added fit to word count data
ddb00fb (origin/master) removing inverse rank calculation
7de9877 ignoring __pycache__

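Note that Git did not need to create a new commit for this merge. Because master had not changed since we created fit, Git performed a fast-forward merge (as reported in the output above): it simply moved master forward to the most recent commit in fit, 1577404. If we now run git diff master..fit, Git doesn’t print anything because there aren’t any differences to show.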

Now that we have merged all of the changes from fit into master there is no need to keep the fit branch, so we can delete it:

$ git branch -d fit

Deleted branch fit (was 1577404).

Not Just the Command Line

We have been creating, merging, and deleting branches on the command line, but we can do all of these things using GitKraken, the RStudio IDE, and other GUIs. The operations stay the same; all that changes is how we tell the computer what we want to do.

Handling Conflicts

A conflict occurs when a line has been changed in different ways in two separate branches or when a file has been deleted in one branch but edited in the other. Merging fit into master went smoothly because there were no conflicts between the two branches, but if we are going to use branches, we must learn how to resolve merge conflicts.

To start, use nano to add the project’s title to a new file called README.md in the master branch, which we can then view:

$ cat README.md

# Zipf's Law

$ git add README.md
$ git commit -m "Initial commit of README file"

[master 232b564] Initial commit of README file
 1 file changed, 1 insertion(+)
 create mode 100644 README.md

Now let’s create a new development branch called docs to work on improving the documentation for our code. We will use git checkout -b to create a new branch and switch to it in a single step:

$ git checkout -b docs

Switched to a new branch 'docs'

$ git branch

* docs
  master

On this new branch, let’s add some information to the README file:

# Zipf's Law

These Zipf's Law scripts tally the occurrences of words in text
files and plot each word's rank versus its frequency.

$ git add README.md
$ git commit -m "Added repository overview"

[docs a0b88e5] Added repository overview
 1 file changed, 3 insertions(+)

In order to create a conflict, let’s switch back to the master branch. The changes we made in the docs branch are not present:

$ git checkout master

Switched to branch 'master'

$ cat README.md

# Zipf's Law

Let’s add some information about the contributors to our work:

# Zipf's Law

## Contributors

- Amira Khan <amira@zipf.org>

$ git add README.md
$ git commit -m "Added contributor list"

[master 45a576b] Added contributor list
 1 file changed, 4 insertions(+)

We now have two branches, master and docs, in which we have changed README.md in different ways:

$ git diff docs..master

diff --git a/README.md b/README.md
index f40e895..71f67db 100644
--- a/README.md
+++ b/README.md
@@ -1,4 +1,5 @@
 # Zipf's Law
 
-These Zipf's Law scripts tally the occurrences of words in text
-files and plot each word's rank versus its frequency.
+## Contributors
+
+- Amira Khan <amira@zipf.org>

When we try to merge docs into master, Git doesn’t know which of these changes to keep:

$ git merge docs

Auto-merging README.md
CONFLICT (content): Merge conflict in README.md
Automatic merge failed; fix conflicts and then commit the result.

If we look in README.md, we see that Git has kept both sets of changes, but has marked which came from where:

$ cat README.md

# Zipf's Law

<<<<<<< HEAD
## Contributors

- Amira Khan <amira@zipf.org>
=======
These Zipf's Law scripts tally the occurrences of words in text
files and plot each word's rank versus its frequency.
>>>>>>> docs

The lines from <<<<<<< HEAD to ======= are what was in master, while the lines from there to >>>>>>> docs show what was in docs. If there were several conflicting regions in the same file, Git would mark each one this way.
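The marker layout is regular enough that a few lines of Python can pull the two sides apart. The sketch below is only an illustration of how the markers are structured: split_conflicts is a hypothetical helper of our own, and it assumes Git’s default two-way marker style rather than the optional style that also shows the common ancestor.

```python
def split_conflicts(text):
    """Return one (ours_lines, theirs_lines) pair per conflict region in text."""
    conflicts = []
    ours, theirs = [], []
    state = None  # None (outside a conflict), "ours", or "theirs"
    for line in text.splitlines():
        if line.startswith("<<<<<<< "):
            state, ours, theirs = "ours", [], []
        elif line.startswith("=======") and state == "ours":
            state = "theirs"
        elif line.startswith(">>>>>>> ") and state == "theirs":
            conflicts.append((ours, theirs))
            state = None
        elif state == "ours":
            ours.append(line)
        elif state == "theirs":
            theirs.append(line)
    return conflicts


readme = """# Zipf's Law

<<<<<<< HEAD
## Contributors

- Amira Khan <amira@zipf.org>
=======
These Zipf's Law scripts tally the occurrences of words in text
files and plot each word's rank versus its frequency.
>>>>>>> docs
"""

conflicts = split_conflicts(readme)
for ours, theirs in conflicts:
    print("current branch:", ours)
    print("incoming branch:", theirs)
```

A script like this can show us what is in conflict, but deciding what the file should finally say remains a human judgment.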

We have to decide what to do next:

keep the master changes,
keep those from docs,
edit this part of the file to combine them,
or write something new.

Whatever we do, we must remove the >>>, ===, and <<< markers. Let’s combine the two sets of changes so the resulting file reads:

# Zipf's Law

These Zipf's Law scripts tally the occurrences of words in text
files and plot each word's rank versus its frequency.

## Contributors

- Amira Khan <amira@zipf.org>

We can now add the file and commit the change, just as we would after any other edit:

$ git add README.md
$ git commit -m "Merging README additions"

[master 55c63d0] Merging README additions

Our branch’s history now shows a single sequence of commits, with the master changes on top of the earlier docs changes:

$ git log --oneline -n 4

55c63d0 (HEAD -> master) Merging README additions
45a576b Added contributor list
a0b88e5 (docs) Added repository overview
232b564 Initial commit of README file

If we want to see what really happened, we can add the --graph option to git log:

$ git log --oneline --graph -n 4

*   55c63d0 (HEAD -> master) Merging README additions
|\  
| * a0b88e5 (docs) Added repository overview
* | 45a576b Added contributor list
|/  
* 232b564 Initial commit of README file
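The two merges in this chapter behaved differently: merging fit was reported as a fast-forward, while merging docs produced a commit with two parents. The toy model below is a sketch of why. It treats a branch as a label on a commit and ignores file contents and conflicts entirely, so it is a simplification of our own rather than a description of Git’s real internals. The class ToyRepo and its methods are hypothetical; the commit IDs are the ones shown in the logs above.

```python
class ToyRepo:
    """A toy model of commits and branches; it ignores file contents entirely."""

    def __init__(self, first_branch="master"):
        self.parents = {}                     # commit ID -> list of parent IDs
        self.branches = {first_branch: None}  # branch name -> newest commit ID
        self.active = first_branch

    def commit(self, commit_id):
        """Add a commit on the active branch and move that branch forward."""
        tip = self.branches[self.active]
        self.parents[commit_id] = [] if tip is None else [tip]
        self.branches[self.active] = commit_id

    def branch(self, name):
        """Create a branch pointing at the active branch's newest commit."""
        self.branches[name] = self.branches[self.active]

    def checkout(self, name):
        """Make another branch the active one."""
        self.active = name

    def is_ancestor(self, older, newer):
        """Return True if `older` is reachable from `newer` via parent links."""
        to_visit = [newer]
        while to_visit:
            current = to_visit.pop()
            if current == older:
                return True
            to_visit.extend(self.parents[current])
        return False

    def merge(self, other, merge_id):
        """Merge branch `other` into the active branch; report what happened."""
        ours = self.branches[self.active]
        theirs = self.branches[other]
        if self.is_ancestor(theirs, ours):
            return "already up to date"
        if self.is_ancestor(ours, theirs):
            self.branches[self.active] = theirs  # just move the label forward
            return "fast-forward"
        self.parents[merge_id] = [ours, theirs]  # a commit with two parents
        self.branches[self.active] = merge_id
        return "merge commit"


repo = ToyRepo()
repo.commit("ddb00fb")

# The fit branch: master did not change while we worked on fit.
repo.branch("fit")
repo.checkout("fit")
repo.commit("38c209b")
repo.commit("1577404")
repo.checkout("master")
fit_result = repo.merge("fit", merge_id="unused")

# The docs branch: master and docs each gained a commit after diverging.
repo.commit("232b564")
repo.branch("docs")
repo.checkout("docs")
repo.commit("a0b88e5")
repo.checkout("master")
repo.commit("45a576b")
docs_result = repo.merge("docs", merge_id="55c63d0")

print(fit_result, "/", docs_result)
print("parents of 55c63d0:", repo.parents["55c63d0"])
```

In the model, a fast-forward is possible only when the current branch’s newest commit is an ancestor of the branch being merged. Once both branches have commits the other lacks, the merge needs a new commit that joins the two lines of history, which is what the graph above shows.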

At this point we can delete the docs branch:

$ git branch -d docs

Deleted branch docs (was a0b88e5).

Alternatively, we can keep using docs for documentation updates. Each time we switch to it, we merge changes from master into docs, do our editing (while switching back to master or other branches as needed to work on the code), and then merge from docs to master once the documentation is updated.

Remember to Push

If you are using a remote repository, don’t forget to use git push to keep your version on GitHub up to date with your local version.

A Branch-Based Workflow

What is the best way to incorporate branching into our regular coding practice? If we are working on our own computer, this workflow will help us keep track of what we are doing:

git checkout master to make sure we are in the master branch.
git checkout -b name-of-feature to create a new branch. We always create a branch when making changes, since we never know what else might come up. The branch name should be as descriptive as a variable name or filename would be.
Make our changes. If something occurs to us along the way—for example, if we are writing a new function and realize that the documentation for some other function should be updated—we do not do that work in this branch just because we happen to be there. Instead, we commit our changes, switch back to master, and create a new branch for the other work.
When the new feature is complete, we run git merge master while still in name-of-feature to get any changes we merged into master after creating name-of-feature, and resolve any conflicts. This is an important step: we want to do the merge and test that everything still works in our feature branch, not in master.
Finally, we switch back to master and run git merge name-of-feature to merge our changes into master. We should not have any conflicts, and all of our tests should pass.
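The steps above can be summarized as a fixed sequence of commands. The sketch below simply builds that sequence as a list of strings so that the order is easy to see; feature_workflow is a hypothetical helper of our own, and it does not run Git or make any changes.

```python
def feature_workflow(feature, main="master"):
    """Return the Git commands for one pass through the workflow, in order."""
    return [
        f"git checkout {main}",        # start from the main branch
        f"git checkout -b {feature}",  # create and switch to the feature branch
        # ...edit files, then `git add` and `git commit` as usual...
        f"git merge {main}",           # still on the feature branch: catch up
        f"git checkout {main}",        # switch back once tests pass
        f"git merge {feature}",        # bring the finished feature into main
    ]


commands = feature_workflow("fit")
for command in commands:
    print(command)
```

The first merge happens while the feature branch is active and the second while master is active, which is why the branch named in each git merge command is the one being merged in, not the one we are on.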

Most experienced developers use this branch-per-feature workflow, but what exactly is a “feature”? These rules make sense for small projects:

Anything cosmetic that is only one or two lines long can be done in master and committed right away. Here, “cosmetic” means changes to comments or documentation: nothing that affects how code runs, not even a simple variable renaming.
A pure addition that doesn’t change anything else is a feature and goes into a branch. For example, if we run a new analysis and save the results, that should be done on its own branch because it might take several tries to get the analysis to run, and we might interrupt ourselves to fix things that we discover aren’t working.
Every change to code that someone might want to undo later in one step is a feature. For example, if a new parameter is added to a function, then every call to the function has to be updated. Since neither alteration makes sense without the other, those changes are considered a single feature and should be done in one branch.

The hardest thing about using a branch-per-feature workflow is sticking to it for small changes. As the first point in the list above suggests, most people are pragmatic about this on small projects; on large ones, where dozens of people might be committing, even the smallest and most innocuous change needs to be in its own branch so that it can be reviewed (which we discuss below).

Using Other People’s Work

So far we have used Git to manage individual work, but it really comes into its own when we are working with other people. We can do this in two ways:

Everyone has read and write access to a single shared repository.
Everyone can read from the project’s main repository, but only a few people can commit changes to it. The project’s other contributors fork the main repository to create one that they own, do their work in that, and then submit their changes to the main repository.

The first approach works well for teams of up to half a dozen people who are all comfortable using Git, but if the project is larger, or if contributors are worried that they might make a mess in the master branch, the second approach is safer.

Git itself doesn’t have any notion of a “main repository”, but forges like GitHub, GitLab, and Bitbucket all encourage people to use Git in ways that effectively create one. Suppose, for example, that Sami wants to contribute to the Zipf’s Law code that Amira is hosting on GitHub at https://github.com/amira-khan/zipf. Sami can go to that URL and click on the “Fork” button in the upper right corner (Figure @ref(fig:git-advanced-fork-button)). GitHub immediately creates a copy of Amira’s repository within Sami’s account on GitHub’s own servers.

[Figure: the “Fork” button on GitHub (figures/git-advanced/fork-button.png)]

Once the fork has been created, the setup on GitHub looks like Figure @ref(fig:git-advanced-after-fork). Nothing has happened yet on Sami’s own machine: the new repository exists only on GitHub. When Sami explores its history, they see that it contains all of the changes Amira made.

[Figure: the repositories on GitHub after forking (figures/git-advanced/after-fork.png)]

A copy of a repository is called a clone. In order to start working on the project, Sami needs a clone of their repository (not Amira’s) on their own computer. We will modify Sami’s prompt to include their desktop user ID (sami) and working directory (initially ~) to make it easier to follow what’s happening:

sami:~ $ git clone https://github.com/sami-virtanen/zipf.git

Cloning into 'zipf'...
remote: Enumerating objects: 64, done.
remote: Counting objects: 100% (64/64), done.
remote: Compressing objects: 100% (43/43), done.
remote: Total 64 (delta 20), reused 63 (delta 19), pack-reused 0
Receiving objects: 100% (64/64), 2.20 MiB | 2.66 MiB/s, done.
Resolving deltas: 100% (20/20), done.

This command creates a new directory with the same name as the project, i.e., zipf. When Sami goes into this directory and runs ls and git log, they see that all of the project’s files and history are there:

sami:~ $ cd zipf
sami:~/zipf $ ls

README.md       bin             data             results

sami:~/zipf $ git log --oneline -n 4

55c63d0 (HEAD -> master, origin/master, origin/HEAD) 
        Merging README additions
45a576b Added contributor list
a0b88e5 Added repository overview
232b564 Initial commit of README file

Sami also sees that Git has automatically created a remote called origin in their local repository that points back at their fork on GitHub:

sami:~/zipf $ git remote -v

origin  https://github.com/sami-virtanen/zipf.git (fetch)
origin  https://github.com/sami-virtanen/zipf.git (push)

Sami can pull changes from their fork and push work back there, but needs to do one more thing before getting the changes from Amira’s repository:

sami:~/zipf $ git remote add upstream https://github.com/amira-khan/zipf.git
sami:~/zipf $ git remote -v

origin      https://github.com/sami-virtanen/zipf.git (fetch)
origin      https://github.com/sami-virtanen/zipf.git (push)
upstream    https://github.com/amira-khan/zipf.git (fetch)
upstream    https://github.com/amira-khan/zipf.git (push)

Sami has called their new remote upstream because it points at the repository from which theirs is derived. They could use any name, but upstream is a nearly universal convention.
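The fork, clone, and upstream steps above can be tried without touching GitHub. The sketch below is only an illustration under stated assumptions: two local bare repositories in a temporary directory (amira.git and sami.git, hypothetical stand-ins for the two GitHub repositories) and a placeholder commit identity.

```shell
set -e
# Placeholder identity so commits work on a machine with no Git configuration.
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.org"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

work=$(mktemp -d)

# Build a one-commit project to stand in for Amira's existing work.
git init -q "$work/seed"
cd "$work/seed"
git symbolic-ref HEAD refs/heads/master   # use the branch name from this chapter
echo "# Zipf's Law" > README.md
git add README.md
git commit -q -m "Initial commit of README file"

# Bare repositories play the role of the two GitHub repositories.
git clone -q --bare "$work/seed" "$work/amira.git"       # the main repository
git clone -q --bare "$work/amira.git" "$work/sami.git"   # the "fork"

# Sami clones the fork, then adds a second remote pointing at the original.
git clone -q "$work/sami.git" "$work/zipf"
cd "$work/zipf"
git remote add upstream "$work/amira.git"
git remote -v
```

The final command should list origin (the fork) and upstream (the original), each with a fetch and a push entry, mirroring the output shown above.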

With this remote in place, Sami is finally set up. Suppose, for example, that Amira has modified the project’s README.md file to add Sami as a contributor. (Again, we show Amira’s user ID and working directory in her prompt to make it clear who’s doing what.) The file now reads:

# Zipf's Law

These Zipf's Law scripts tally the occurrences of words in text
files and plot each word's rank versus its frequency.

## Contributors

- Amira Khan <amira@zipf.org>
- Sami Virtanen

Amira commits her changes and pushes them to her repository on GitHub:

amira:~/zipf $ git commit -a -m "Adding Sami as a contributor"

[master 35fca86] Adding Sami as a contributor
 1 file changed, 1 insertion(+)

amira:~/zipf $ git push origin master

Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Delta compression using up to 4 threads
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 315 bytes | 315.00 KiB/s, done.
Total 3 (delta 2), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (2/2), completed with 2 local
        objects.
To https://github.com/amira-khan/zipf.git
   55c63d0..35fca86  master -> master

Amira’s changes are now on her desktop and in her GitHub repository but not in either of Sami’s repositories (local or remote). Since Sami has created a remote that points at Amira’s GitHub repository, though, they can easily pull those changes to their desktop:

sami:~/zipf $ git pull upstream master

From https://github.com/amira-khan/zipf
 * branch            master     -> FETCH_HEAD
 * [new branch]      master     -> upstream/master
Updating 55c63d0..35fca86
Fast-forward
 README.md | 1 +
 1 file changed, 1 insertion(+)

Pulling from a repository owned by someone else is no different from pulling from a repository we own. In either case, Git merges the changes and asks us to resolve any conflicts that arise. As with any git push or git pull that names its target explicitly, we have to specify both a remote and a branch: in this case, upstream and master.
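It can help to see the two steps that a pull combines. With Git's default settings, git pull upstream master is roughly a fetch followed by a merge. The sketch below separates them, again using local bare repositories as hypothetical stand-ins for GitHub, a directory called amira-desktop for Amira's clone, and a placeholder identity.

```shell
set -e
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.org"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

work=$(mktemp -d)
git init -q "$work/seed"
cd "$work/seed"
git symbolic-ref HEAD refs/heads/master
echo "- Amira Khan <amira@zipf.org>" > README.md
git add README.md
git commit -q -m "Added contributor list"

git clone -q --bare "$work/seed" "$work/amira.git"       # main repository
git clone -q --bare "$work/amira.git" "$work/sami.git"   # Sami's fork
git clone -q "$work/sami.git" "$work/zipf"               # Sami's desktop clone
git -C "$work/zipf" remote add upstream "$work/amira.git"

# Amira adds Sami to the README and pushes to her own repository.
git clone -q "$work/amira.git" "$work/amira-desktop"
cd "$work/amira-desktop"
echo "- Sami Virtanen" >> README.md
git commit -q -a -m "Adding Sami as a contributor"
git push -q origin master

# Sami performs the two halves of the pull separately.
cd "$work/zipf"
git fetch -q upstream                    # download; only upstream/master moves
git merge -q --ff-only upstream/master   # now bring local master up to date
git log --oneline -n 2
```

Fetching first gives Sami a chance to inspect upstream/master before anything changes in their own branch.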

Pull Requests#

Sami can now get Amira’s work, but how can Amira get Sami’s? She could create a remote that pointed at Sami’s repository on GitHub and periodically pull in Sami’s changes, but that would lead to chaos, since we could never be sure that everyone’s work was in any one place at the same time. Instead, almost everyone uses pull requests. They aren’t part of Git itself, but are supported by all major online forges.

A pull request is essentially a note saying, “Someone would like to merge branch A of repository B into branch X of repository Y.” The pull request does not contain the changes, but instead points at two particular branches. That way, the difference displayed is always up to date if either branch changes.

But a pull request can store more than just the source and destination branches: it can also store comments people have made about the proposed merge. Users can comment on the pull request as a whole, or on particular lines, and mark comments as out of date if the author of the pull request updates the code that the comment is attached to. Complex changes can go through several rounds of review and revision before being merged, which makes pull requests the review system we all wish journals actually had.
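One way to see why the displayed difference stays current is to compute it from the two branch names each time it is needed. The sketch below assumes a throwaway local repository and a hypothetical NOTES.md file. It uses git diff with three dots, which compares the tip of a branch with the point where it diverged from master. We assume a forge's pull request view is built on a comparison of this kind; this is an illustration, not a description of how any forge is implemented.

```shell
set -e
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.org"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
printf '%s\n' "- Amira Khan <amira@zipf.org>" "- Sami Virtanen" > README.md
git add README.md
git commit -q -m "Added contributor list"

# The proposed change lives on its own branch.
git checkout -q -b adding-email
printf '%s\n' "- Amira Khan <amira@zipf.org>" "- Sami Virtanen <sami@zipf.org>" > README.md
git commit -q -a -m "Adding my email address"

# Meanwhile master moves on with an unrelated change.
git checkout -q master
echo "notes" > NOTES.md
git add NOTES.md
git commit -q -m "Unrelated change on master"

# Three dots: only what adding-email changed since it left master.
pr_files=$(git diff --name-only master...adding-email)
echo "$pr_files"
```

Because the comparison is recomputed from the branch names, a new commit on either branch changes the result without anyone editing the request itself.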

To see this in action, suppose Sami wants to add their email address to README.md. They create a new branch and switch to it:

sami:~/zipf $ git checkout -b adding-email

Switched to a new branch 'adding-email'

then make a change and commit it:

sami:~/zipf $ git commit -a -m "Adding my email address"

[adding-email 3e73dc0] Adding my email address
 1 file changed, 1 insertion(+), 1 deletion(-)

sami:~/zipf $ git diff HEAD~1

diff --git a/README.md b/README.md
index e8281ee..e1bf630 100644
--- a/README.md
+++ b/README.md
@@ -6,4 +6,4 @@ and plot each word's rank versus its frequency.
 ## Contributors
 
 - Amira Khan <amira@zipf.org>
-- Sami Virtanen
+- Sami Virtanen <sami@zipf.org>

Sami’s changes are only in their local repository. They cannot create a pull request until those changes are on GitHub, so they push their new branch to their repository on GitHub:

sami:~/zipf $ git push origin adding-email

Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Delta compression using up to 4 threads
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 315 bytes | 315.00 KiB/s, done.
Total 3 (delta 2), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (2/2), completed with 2 local
  objects.
remote: 
remote: Create a pull request for 'adding-email' on GitHub by
  visiting:
  https://github.com/sami-virtanen/zipf/pull/new/adding-email
remote: 
To https://github.com/sami-virtanen/zipf.git
 * [new branch]      adding-email -> adding-email

When Sami goes to their GitHub repository in the browser, GitHub notices that they have just pushed a new branch and asks them if they want to create a pull request (Figure @ref(fig:git-advanced-after-sami-pushes)).

knitr::include_graphics("figures/git-advanced/after-sami-pushes.png")

When Sami clicks on the button, GitHub displays a page showing the default source and destination of the pull request and a pair of editable boxes for the pull request’s title and a longer comment (Figure @ref(fig:git-advanced-pull-request-start)).

knitr::include_graphics("figures/git-advanced/open-pull-request.png")

If they scroll down, Sami can see a summary of the changes that will be in the pull request (Figure @ref(fig:git-advanced-pull-request-summary)).

knitr::include_graphics("figures/git-advanced/open-pull-request-detail.png")

The top (title) box is autofilled with the previous commit message, so Sami adds an extended explanation to provide additional context before clicking on “Create Pull Request” (Figure @ref(fig:git-advanced-pull-request-fill-in)). When they do, GitHub displays a page showing the new pull request, which has a unique serial number (Figure @ref(fig:git-advanced-pull-request-new)). Note that this pull request is displayed in Amira’s repository rather than Sami’s, since it is Amira’s repository that will be affected if the pull request is merged.

knitr::include_graphics("figures/git-advanced/fill-in-pull-request.png")

knitr::include_graphics("figures/git-advanced/new-pull-request.png")

Amira’s repository now shows a new pull request (Figure @ref(fig:git-advanced-pull-request-viewing)). Clicking on the “Pull requests” tab brings up a list of PRs (Figure @ref(fig:git-advanced-pull-request-list)) and clicking on the pull request link itself displays its details (Figure @ref(fig:git-advanced-pull-request-details)). Sami and Amira can both see and interact with these pages, though only Amira has permission to merge.

knitr::include_graphics("figures/git-advanced/viewing-new-pull-request.png")

knitr::include_graphics("figures/git-advanced/pr-list.png")

knitr::include_graphics("figures/git-advanced/pr-details.png")

Since there are no conflicts, GitHub will let Amira merge the PR immediately using the “Merge pull request” button. She could also discard or reject it without merging using the “Close pull request” button. Instead, she clicks on the “Files changed” tab to see what Sami has changed (Figure @ref(fig:git-advanced-pull-request-changes)).

knitr::include_graphics("figures/git-advanced/pr-changes.png")

If she moves her mouse over particular lines,\index{Git!pull request!reviewing}\index{reviewing (Git pull request)}\index{code review!pull request} a white-on-blue cross appears near the numbers to indicate that she can add comments (Figure @ref(fig:git-advanced-pull-request-comment-marker)). She clicks on the marker beside her own name and writes a comment. She only wants to make one comment rather than write a lengthier multi-comment review, so she chooses “Add single comment” (Figure @ref(fig:git-advanced-pull-request-write-comment)). GitHub redisplays the page with her remarks inserted (Figure @ref(fig:git-advanced-pull-request-pr-with-comment)).

knitr::include_graphics("figures/git-advanced/pr-comment-marker.png")

knitr::include_graphics("figures/git-advanced/pr-writing-comment.png")

knitr::include_graphics("figures/git-advanced/pr-with-comment.png")

While Amira is working, GitHub has been emailing notifications to both Sami and Amira. When Sami clicks on the link in their email notification, it takes them to the PR and shows Amira’s comment. Sami changes README.md, commits, and pushes, but does not create a new pull request or do anything to the existing one.\index{Git!pull request!updating} As explained above, a PR is a note asking that two branches be merged, so if either end of the merge changes, the PR updates automatically.
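Updating a pull request therefore comes down to pushing more commits to the same branch. The sketch below shows that round trip against a local bare repository standing in for Sami's fork. The second edit is a placeholder, since the text does not say what the fix was, and the -u option on the first push is an optional convenience not used in the transcript above.

```shell
set -e
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.org"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

work=$(mktemp -d)
git init -q "$work/seed"
cd "$work/seed"
git symbolic-ref HEAD refs/heads/master
echo "- Sami Virtanen" > README.md
git add README.md
git commit -q -m "Added contributor list"
git clone -q --bare "$work/seed" "$work/sami.git"   # stand-in for Sami's fork
git clone -q "$work/sami.git" "$work/zipf"
cd "$work/zipf"

# First push: -u records origin/adding-email as the branch's upstream.
git checkout -q -b adding-email
echo "- Sami Virtanen <sami@zipf.org>" > README.md
git commit -q -a -m "Adding my email address"
git push -q -u origin adding-email

# After review: a placeholder edit standing in for the requested fix.
echo "(placeholder edit made in response to review)" >> README.md
git commit -q -a -m "Responding to review"
git push -q   # no arguments needed once the upstream is recorded

# The branch on the fork now points at the newest commit.
git ls-remote origin refs/heads/adding-email
```

Since the pull request refers to the branch rather than to a particular commit, moving the branch is all that is needed for reviewers to see the update.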

Sure enough, when Amira looks at the PR again a few moments later she sees Sami’s changes (Figure @ref(fig:git-advanced-pull-request-pr-with-fix)). Satisfied, she goes back to the “Conversation” tab and clicks on “Merge”. The icon at the top of the PR’s page changes text and color to show that the merge was successful (Figure @ref(fig:git-advanced-pull-request-successful-merge)).

knitr::include_graphics("figures/git-advanced/pr-with-fix.png")

knitr::include_graphics("figures/git-advanced/pr-successful-merge.png")

To get those changes from GitHub to her desktop repository, Amira uses git pull:

amira:~/zipf $ git pull origin master

From https://github.com/amira-khan/zipf
 * branch            master     -> FETCH_HEAD
Updating 35fca86..a04e3b9
Fast-forward
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

To get the change they just made from their adding-email branch into their master branch, Sami could use git merge on the command line. It’s a little clearer, though, if they instead use git pull from their upstream repository (i.e., Amira’s repository) so that they’re sure to get any other changes that Amira may have merged:

sami:~/zipf $ git checkout master

Switched to branch 'master'
Your branch is up to date with 'origin/master'.

sami:~/zipf $ git pull upstream master

From https://github.com/amira-khan/zipf
 * branch            master     -> FETCH_HEAD
Updating 35fca86..a04e3b9
Fast-forward
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Finally, Sami can push their changes back to the master branch in their own remote repository:

sami:~/zipf $ git push origin master

Total 0 (delta 0), reused 0 (delta 0), pack-reused 0
To https://github.com/sami-virtanen/zipf.git
   35fca86..a04e3b9  master -> master

All four repositories are now synchronized.
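A quick way to check a claim like this is to compare the commit that master points to in each place. The sketch below does so with local stand-ins for the two GitHub repositories and a hypothetical amira-desktop directory for Amira's clone. The --ff-only option is an extra safeguard that is not used in the transcript above.

```shell
set -e
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.org"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

work=$(mktemp -d)
git init -q "$work/seed"
cd "$work/seed"
git symbolic-ref HEAD refs/heads/master
echo "- Sami Virtanen" > README.md
git add README.md
git commit -q -m "Added contributor list"
git clone -q --bare "$work/seed" "$work/amira.git"
git clone -q --bare "$work/amira.git" "$work/sami.git"
git clone -q "$work/sami.git" "$work/zipf"
git -C "$work/zipf" remote add upstream "$work/amira.git"

# A change lands in the main repository (standing in for the merged PR).
git clone -q "$work/amira.git" "$work/amira-desktop"
cd "$work/amira-desktop"
echo "- Sami Virtanen <sami@zipf.org>" > README.md
git commit -q -a -m "Adding my email address"
git push -q origin master

# Sami brings the desktop clone and then the fork up to date.
cd "$work/zipf"
git pull -q --ff-only upstream master
git push -q origin master

# Compare the tip of master in all four repositories.
desktop_sami=$(git rev-parse master)
fork_sami=$(git ls-remote origin refs/heads/master | cut -f1)
main_amira=$(git ls-remote upstream refs/heads/master | cut -f1)
desktop_amira=$(git -C "$work/amira-desktop" rev-parse master)
printf '%s\n' "$desktop_sami" "$fork_sami" "$main_amira" "$desktop_amira"
```

If the four printed identifiers match, the repositories agree on master. A mismatch shows which one still needs a pull or a push.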

Handling Conflicts in Pull Requests#

Finally, suppose that Amira and Sami have decided to collaborate more extensively on this project. Amira has added Sami as a collaborator to the GitHub repository. Now Sami can make contributions directly to the repository, rather than via a pull request from a forked repository.

Sami makes a change to README.md in the master branch on GitHub. Meanwhile, Amira is making a conflicting change to the same file in a different branch. When Amira creates her pull request, GitHub will detect the conflict and report that the PR cannot be merged automatically (Figure @ref(fig:git-advanced-pr-conflict)).

knitr::include_graphics("figures/git-advanced/pr-conflict.png")

Amira can solve this problem with the tools she already has. If she has made her changes in a branch called editing-readme, the steps are:

1. Pull Sami’s changes from the master branch of the GitHub repository into the master branch of her desktop repository.
2. Merge from the master branch of her desktop repository to the editing-readme branch in the same repository.
3. Push her updated editing-readme branch to her repository on GitHub. The pull request from there back to the master branch of the main repository will update automatically.
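The steps above can be sketched as commands. This is a minimal illustration, not a transcript: it assumes a local bare repository called main.git standing in for the shared GitHub repository, made-up README edits chosen so that they collide, and a resolution that simply writes the combined line.

```shell
set -e
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.org"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

work=$(mktemp -d)
git init -q "$work/seed"
cd "$work/seed"
git symbolic-ref HEAD refs/heads/master
echo "- Sami Virtanen" > README.md
git add README.md
git commit -q -m "Added contributor list"
git clone -q --bare "$work/seed" "$work/main.git"
git clone -q "$work/main.git" "$work/amira"
git clone -q "$work/main.git" "$work/sami"

# Sami changes the line on master and pushes.
cd "$work/sami"
echo "- Sami Virtanen <sami@zipf.org>" > README.md
git commit -q -a -m "Adding my email address"
git push -q origin master

# Amira changes the same line on her own branch.
cd "$work/amira"
git checkout -q -b editing-readme
echo "- Sami Virtanen (contributor)" > README.md
git commit -q -a -m "Noting Sami's role"
git push -q origin editing-readme

# Step 1: bring the desktop master up to date.
git checkout -q master
git pull -q --ff-only origin master

# Step 2: merge master into the branch; the conflict makes merge exit non-zero.
git checkout -q editing-readme
git merge master || true
echo "- Sami Virtanen <sami@zipf.org> (contributor)" > README.md   # resolve by hand
git add README.md
git commit -q -m "Merging master into editing-readme"

# Step 3: push the branch; an open pull request from it would now merge cleanly.
git push -q origin editing-readme
```

In real use the resolution would be made in an editor after inspecting the conflict markers, and the project's code would be run before pushing.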

GitHub and other forges do allow people to resolve conflicts through their browser-based interfaces, but doing it on our desktop means we can use our favorite editor to resolve the conflict. It also means that if the change affects the project’s code, we can run everything to make sure it still works.

But what if Sami or someone else merges another change while Amira is resolving this one, so that by the time she pushes to her repository there is another, different, conflict? In theory this cycle could go on forever; in practice, it reveals a communication problem that Amira (or someone) needs to address. If two or more people are constantly making incompatible changes to the same files, they should discuss who’s supposed to be doing what, or rearrange the project’s contents so that they aren’t stepping on each other’s toes.

Summary#

Branches and pull requests seem complicated at first, but they quickly become second nature. Everyone involved in the project can work at their own pace on what they want to, picking up others’ changes and submitting their own whenever they want. More importantly, this workflow gives everyone a chance to review each other’s work. As we discuss in Section @ref(style-review), doing reviews doesn’t just prevent errors from creeping in: it is also an effective way to spread understanding and skills.

Exercises#

Explaining options#

What do the --oneline and -n options for git log do?
What other options does git log have that you would find useful?

Modifying prompt#

Modify your shell prompt so that it shows the branch you are on when you are in a repository.

Ignoring files#

GitHub maintains a collection of .gitignore files for projects of various kinds. Look at the sample .gitignore file for Python: how many of the ignored files do you recognize? Where could you look for more information about them?

Creating the same file twice#

Create a branch called same. In it, create a file called same.txt that contains your name and the date.

Switch back to master. Check that same.txt does not exist, then create the same file with exactly the same contents.

What will git diff master..same show? (Try to answer the question before running the command.)
What will git merge same master do? (Try to answer the question before running the command.)

Deleting a branch without merging#

Create a branch called experiment. In it, create a file called experiment.txt that contains your name and the date, then switch back to master.

What happens when you try to delete the experiment branch using git branch -d experiment? Why?
What option can you give Git to delete the experiment branch? Why should you be very careful using it?
What do you think will happen if you try to delete the branch you are currently on using this flag?

Tracing changes#

Chartreuse and Fuchsia are collaborating on a project. Describe what is in each of the four repositories involved after each of the steps below.

1. Chartreuse creates a repository containing a README.md file on GitHub and clones it to their desktop.
2. Fuchsia forks that repository on GitHub and clones their copy to their desktop.
3. Fuchsia adds a file fuchsia.txt to the master branch of their desktop repository and pushes that change to their repository on GitHub.
4. Fuchsia creates a pull request from the master branch of their repository on GitHub to the master branch of Chartreuse’s repository on GitHub.
5. Chartreuse does not merge Fuchsia’s PR. Instead, they add a file chartreuse.txt to the master branch of their desktop repository and push that change to their repository on GitHub.
6. Fuchsia adds a remote to their desktop repository called upstream that points at Chartreuse’s repository on GitHub and runs git pull upstream master, then merges any changes or conflicts.
7. Fuchsia pushes from the master branch of their desktop repository to the master branch of their GitHub repository.
8. Chartreuse merges Fuchsia’s pull request.
9. Chartreuse runs git pull origin master on the desktop.

Acknowledgments and License#

This section has largely been taken from Research Software Engineering with Python: Building Software that Makes Research Possible by Damien Irving, Kate Hertweck, Luke Johnston, Joel Ostblom, Charlotte Wickham, and Greg Wilson, under a Creative Commons Attribution 4.0 International License (CC-BY 4.0).