Debugging

27. Debugging#

Four people are riding in the car when the car breaks down and coasts to the side of the road. The first person - a chemical engineer - states, “hmmm. We may have gotten in some bad gasoline when stopped. Let me test it to be sure it’s correct. The second individual - a mechanical engineer - states, “I thought I heard some knocking from the engine. Let me take a look at that, and we should be back on the road soon.” The third person - an electrical engineer, states, “That was odd the car stopped suddenly. That could be a sign that the electrical system has an issue. Let me check that, and I’ll get us going again in a minute.” The final person - a computer scientist - says, “Why don’t we just get out of the car, change seats, get back in, and see if the car starts.”

While humorous, an underlying lesson exists in this joke - the need to gather evidence, make educated guesses about what has occurred (hypothesize), and then test (experiment) to see if that guess was correct or not when diagnosing a problem. The first three individuals followed that process, while the last just attempted to do something randomly. While “restarting the system” often solves many computer-related system issues (what the restart process does is force the system to re-initialize and place itself into a consistent state), it is not a process for fixing issues with programs. We cannot just re-run the program and fiercely hope that the program works - that’s the definition of insanity - doing the same thing repeatedly and expecting different results. As we debug(fix) our programs, we need to do so systematically - not just randomly make coding changes until things just work.

(Additional side note: A couple of additional lessons exist in the joke. First, the four individuals bring a diversity of thought to the problem at hand - looking at different causes for a particular problem. Unfortunately, they also demonstrated an egocentric bias where they relied heavily upon their backgrounds in making guesses as to what occurred with the car.)

We troubleshoot problems regularly in our daily lives - phone not working, car not starting, lights not on, toaster broken. For instance, if a toaster does not work, we implicitly follow a process to find the underlying cause and then apply a correction. Our minds gather information from the environment as well as recall past facts that we know about the toaster:

Is the toaster on?
Does the toaster have electricity?
- Is it plugged into the wall outlet?
- Is the outlet working?
- Is there power to the circuit?
- Is there power to the house?
- Is there power to the neighborhood
Are there other circumstances in which the toaster has behaved similarly?
Was it dropped recently? Does the toaster have any physical damage?

We then hypothesize (make educated guesses) from the information gathered. For example, the toaster does not work because the circuit breaker flipped as too many electrical devices were drawing power simultaneously in the kitchen. We then test that hypothesis by performing some action or experiment. In the case of the toaster, we test if resetting the circuit breaker fixes the issue. If it does, great, we solved the problem. If not, we need to gather more information and look at different possibilities through alternate hypotheses. Yes, we have applied the steps of the scientific method in this process, demonstrating the ubiquity of the method.

In many ways, debugging is very similar to troubleshooting - we want to make the debugging process as systematic as possible to most effectively find and fix issues within our programs.

Sometimes you have no choice but to debug - especially when the bug only appears when you integrate the whole system, or a user reports the issue. In this case, localizing the bug to a particular module may be challenging. This notebook presents a systematic strategy for more effective debugging.

We took many of the key points presented in this notebook from the book Why Programs Fail by Andreas Zeller. The book is an excellent resource for systematic debugging.

Also related is “How to Debug” by John Regehr, a lecture from an embedded systems course, which is more low-level than this course but with the same general principles for systematic debugging.

Finally, Debugging: The Nine Indispensable Rules for Finding Even the Most Elusive Software and Hardware Problems by David Agans is a readable, eminently practical guide to debugging in a various technical situations, from software to hardware to cars to plumbing. We refer to several of the Nine Rules here.

27.1. Reproduce the bug#

Start by finding a small, repeatable test case that produces the failure. If the bug was found by regression testing, then you are in luck; you already have a failing test case in your test suite. If the bug was reported by a user, it may take some effort to reproduce the bug. For graphical user interfaces and multithreaded programs, a bug may be hard to reproduce consistently if it depends on timing of events or thread execution.

Nevertheless, any effort you put into making the test case small and repeatable will pay off, because you will have to run it over and over while you search for the bug and develop a fix for it. Furthermore, after you successfully fixed the bug, you will want to keep the test case in your regression test suite, so that the bug never crops up again. Once you have a test case for the bug, making this test work becomes your goal.

For example, suppose you have written this function: (We have not provided the body, which is represented by ....)

def mostCommonWord(text): 
    """
    Find the most common word in a string.
    
    Args:
       text (str): string containing zero or more words, where a word 
                   is a string of alphanumeric characters bounded by nonalphanumerics.

    Returns:
    a word (str) that occurs maximally often in text, ignoring alphabetic case.
    """
    ...

A user passes the whole text of Shakespeare’s plays into your method, something like mostCommonWord(allShakespearesPlaysConcatenated), and discovers that instead of returning a predictably common English word like “the” or “a”, the method returns something unexpected, perhaps “e”.

Shakespeare’s plays have 100,000 lines containing over 800,000 words, so this input would be very painful to debug by normal methods, like print-debugging and breakpoint-debugging. Debugging will be easier if you first work on reducing the size of the buggy input to something manageable that still exhibits the same (or very similar) bug:

does the first half of Shakespeare show the same bug? (Binary search! Always a good technique. More about this below.)
does a single play have the same bug?
does a single speech have the same bug?

Once you discovered a small test case, find and fix the bug using that smaller test case, and then go back to the original buggy input and confirm that you fixed the same bug.

27.1.1. Exercise: Reducing a bug to a test case#

Note: The answers for all of the exercises are at the end of this notebook. You should write down your answers prior to checking if they are correct.

Suppose a user reports that mostCommonWord("chicken chicken chicken beef") returns “beef” instead of “chicken”.

To shorten and simplify this input before you start debugging, which of the following inputs are worth trying?
1. mostCommonWord("chicken chicken beef")
2. mostCommonWord("Chicken Chicken Chicken beef")
3. mostCommonWord("chicken beef")
4. mostCommonWord("a b c")
5. mostCommonWord("c c c b")

Give all answers that make sense, not just the simplest one (because the simplest one sometimes no longer exhibits the bug!)

Suppose you reduce the “chicken chicken chicken beef” input down to “c c b”, which also has a problem. You find a bug, fix it, and observe that both “c c b” and “chicken chicken chicken beef” are now returning the right answer. Which test cases should you now add to your test suite?
1. self.assertEqual(mostCommonWord("chicken chicken chicken beef"), "chicken")
2. self.assertEqual(mostCommonWord("c c b"), "c")
3. self.assertEqual(mostCommonWord("c b"), "c")
4. you shouldn’t change the test suite, because you have not changed the spec

27.2. Find the bug using the scientific method#

To localize the bug and its cause, you can use the scientific method:

Study the data. Look at the test input that causes the bug, and examine the incorrect results, failed assertions, and stack traces that result from it.
Hypothesize. Propose a hypothesis, consistent with all the data, about where the bug might be, or where it cannot be. It is good to make this hypothesis general at first.
Experiment. Devise and run an experiment that tests your hypothesis. It is good to make the experiment an observation at first - a probe that collects information but disturbs the system as little as possible.
Repeat. Add the data you collected from your experiment to what you knew before, and make a fresh hypothesis. Hopefully you have ruled out some possibilities and narrowed the set of possible locations and reasons for the bug.

This kind of deliberate process is not needed for every bug. With good fail-fast design, you will hopefully get an exception very close to the source of the bug, the stack trace will lead you right to it, and the mistake will be obvious from inspecting the code. So when do you need to apply the scientific method? A good rule of thumb is the 10-minute rule. If you have spent 10 minutes hunting for a bug using ad hoc, unsystematic inspection, then it is time to take a step back and start applying the scientific method instead.

As part of this transition, you should also move your debugging process out of your head - which has a very limited working memory of what you have tried and what you have learned from it - and start taking notes, either on paper or on your laptop. In each iteration of the process, you will be writing down:

Hypothesis. Based on what you have learned so far, what is your next hypothesis about the location or cause of the bug?
Experiment. What are you about to try that will shed light on the hypothesis, by verifying or falsifying it?
Predictions. What do you expect, based on your hypothesis, to be the result of the experiment?
Observations. What actually happened when you did the experiment?

These kinds of questions should be very familiar to you from your past experience in science classes. In the next few sections, we will see what kind of form they take when you debug code. In debugging, some hypotheses and some experiments are better than others.

27.2.1. Exercise: Scientific Method#

The process that we already discussed for simplifying a bug report down to a simple test case is an example of applying the scientific method. For each of these moments in a particular test-simplification process, which step of the scientific method above is being applied?

A user reports that mostCommonWord(“chicken chicken chicken beef”) returns “beef” instead of “chicken”:
1. study the data
2. hypothesize
3. experiment
Maybe the particular words “chicken” and “beef” do not cause the failure - what matters is the number of times they occur.
1. study the data
2. hypothesize
3. experiment
Run a test case mostCommonWord(“a a a b”).
1. study the data
2. hypothesize
3. experiment

27.3. Step 1. Study the data#

With “Study the data”, we look at the test input that causes the bug, and examine the incorrect results, failed assertions, and stack traces that result from it. Also, we track what occurred as we reproduced the issue. As part of this reproduction, we can gather information about the state of the programming by placing print statements or stepping through the code with a debugger.

One important form of data is the stack trace from an exception. Practice reading the stack traces that you get, because they will give you enormous amounts of information about where and what the bug might be.

In a Python stack trace (traceback), the oldest call is on top, and the latest call is on the bottom. (Other languages may have a different ordering.) But the calls at the top or bottom of the stack may be library code that you did not write. Your own code - where the bug is most likely to be - is often somewhere in the middle. Do not let that dissuade you. Scan through the stack trace until you see something familiar, and then find it in your code.

27.4. Exercise: Reading a stack trace#

Consider a program that attempts to read the contents of a file and print them to the console. The following stack trace is produced if the file is not found:

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Input In [2], in <cell line: 18>()
     15 def main():
     16     process_file()
---> 18 main()

Input In [2], in main()
     15 def main():
---> 16     process_file()

Input In [2], in process_file()
     10 def process_file():
     11     name = get_filename()
---> 12     contents = read_contents(name)
     13     print(contents)

Input In [2], in read_contents(filename)
      4 def read_contents(filename):
----> 5     f = open(filename, "rt")
      6     contents = f.read()
      7     f.close()

FileNotFoundError: [Errno 2] No such file or directory: 'test.txt'

Which line of code raise the exception? In what method was that located?
Which line of code started the process?
Where was the file name introduced into the program?
What was the name of the file that was not found?

As you can see, we can gather quite a bit of information from stack traces.

27.5. Step 2. Hypothesize#

The point where the program actually failed, by throwing an exception or producing a wrong answer, is not necessarily where the bug is. The buggy code may have propagated some bad values through good parts of the program before it eventually failed. So your hypotheses should be about where the bug actually is (or is not), and what might have caused it.

It can help to think about your program as a flow of data, or steps in an algorithm, and try to rule out whole sections of the program at once. Let’s think about this in the context of the mostCommonWord() example, fleshed out a little more with three helper methods:

def mostCommonWord(text): 
    """Find the most common word in a string.
       ...
    """
    words = splitIntoWords(text)            # array of strings
    frequencies = countOccurrences(words)   # dictionary keyed by words of occurances
    winner= findMostFrequent(frequencies) 
    return winner

The flow of data in mostCommonWord() is shown below:

Suppose that we’re getting an unexpected exception in countOccurrences(). That’s the failure that we’re investigating. Then we can rule out everything downstream of that point (that is, everything later in the data flow) as a possible location for the bug. We won’t find the bug by looking in findMostFrequent(), for example, because it hasn’t even been executed yet when the failure occurs.

So here are some hypotheses consistent with the information we have so far. They’re listed in reverse order of time from the failure:

the bug is in countOccurrences: its input is valid but then it throws an exception
the bug is in the connection between splitIntoWords and countOccurrences: both methods meet their contracts, but the postcondition guaranteed by the former does not satisfy the precondition expected by the latter
the bug is in splitIntoWords: its input is valid but it produces bad output
the bug is in the original input to mostCommonWord: text does not satisfy the precondition of the whole method

Which hypothesis to try first? Debugging is a search process, and you can use binary search to speed up the process. To do a binary search, you could divide this dataflow in half, perhaps guessing that the bug is in the connection between the first helper method and the second, and use one of the experiments below (like print statements, breakpoints, or assertions) to check the value there. From the answer to that experiment, you would know whether to pursue the bug in the earlier half of the dataflow or the later half.

27.5.1. Slicing#

The mostCommonWord() data flow we just looked at is an example of slicing, which means finding the parts of a program that contributed to computing a particular value. When you have a failure - a bad value at a particular point in the program - the slice for that value consists of the lines of the program that helped compute that bad value. The bug that caused the bad value lies somewhere in that slice, so that’s your search space.

Automated tools for program slicing do exist, though they are not very practical yet. But programmers also do slicing naturally, in their heads, when looking at code to generate a hypothesis about where a bug might or might not be. It’s a useful skill for reasoning about programs, so let’s understand it better.

Here’s an example. Suppose x is a local number variable which is not supposed to exceed 100. At some point a debugging print statement reports that it has gone bad:

x = 0            # must be <= 100
...
print("x =", x)  # prints a value much greater than 100

What is the slice for this bad value? What lines of code helped compute the large value that x has when it reaches that print statement? Let’s dig into the ... in the code above to find what code might be responsible.

Lines that directly change the value of x are part of the slice:

x = 0             # must be <= 100
...
x += bonus
...
print("x =", x)  # prints a value much greater than 100

The value of bonus at that point also contributes to the value of x, so that means that its slice does too. So we have to look at the lines that helped compute bonus at that point:

x = 0               # must be <= 100
bonus = getBonus()
...
x += bonus 
...
print("x =", x)     # prints a value much greater than 100

So the function getBonus() is now included in the slice, because it was responsible for computing bonus. (Technically we only need to include the slice of its return value, but we can simplify and just say that getBonus() itself is suspect.)

The slice also includes control statements that affected the execution of statements already in the slice:

x = 0                # must be <= 100
bonus = getBonus()  
...
if isWorthABonus(s):
    x += bonus
...
print("x =", x)      # prints a value much greater than 100

The if statement around x += bonus is part of the slice because it controls whether or not x was actually changed at that point. This also pulls in the function isWorthABonus(), and the value of s and its slice:

x = 0                # must be <= 100
bonus = getBonus()  
...
for s in salesList:
  ...
  if isWorthABonus(s):
    x += bonus
  ...

...
print("x =", x)      # prints a value much greater than 100

The enclosing for statement is included in the slice for two reasons: because it’s part of the slice of s, so it affects the if statement which is already in our slice, and because it affects the number of times that statements in our slice are executed (the if statement and x += bonus).

And now, because the for uses the variable salesList, we have to include its slice as well. It happens to be a function parameter:

def calculateTotalBonus(salesList): 
  ...
  x = 0;
  bonus = getBonus();
  ...
  for s in salesList:
    ...
    if isWorthABonus(s):
      x += bonus
    ...
  ...
  console.log("x=" + x)  // prints a value much greater than 100
  ...
  return x
}

We could dig further and see where salesList came from in the rest of the program, but let’s stop there for now. We’ve found the lines of code in this method that might be responsible for the bad value of x we discovered. Studying these lines can generate some useful hypotheses:

from the x+=bonus statement: maybe bonus is very large, so x+=bonus immediately goes out of range. This hypothesis would further imply that getBonus() returned a large value for bonus.
from the if statement: maybe isWorthABonus() returns true for too many sales, so the sum accumulated in x grows beyond 100.
from the for loop: maybe the entire loop is executing over too many sales, again making x too large from all the bonuses.
from the method signature: maybe a bad value is being passed in for salesList, with far too many sales on it.

The upshot is that careful slicing - which you can do just by reading the code - can help you systematically identify places where the bug could be, and also where it could not be. We can rule out the ... code in this example, because it doesn’t contribute to the bad value of x.

It’s worth noting that our design choices can help or hinder the efficiency of reasoning by slicing. One such design choice is immutability. When we see an immutable collection (or a constant value in another programming language), we can immediately knew that we do not have to look any further - no other lines can be in the slice for that collection as it cannot be altered. Immutability saved us a lot of time in reasoning and searching. When we saw s in salesList though, we didn’t have quite the same confidence. Certainly s could never be reassigned within the body of the loop, but if the element type of salesList is a mutable type, we would have to check the subsequent ... code to make sure no mutators were called on s or other aliases of the Sale object.

Another design choice that helps slicing is scope minimization. All the variables in this example were local variables, with minimal scope, so we only had to look close by for lines of code that might affect them. With instance variables, the slicing search might have to expand to include the entire class. For a global variable (gasp), the search expands to include the entire program.

27.5.1.1. Exercise Slicing - 1#

In the following code, which lines are part of the slice for the value of the variable tax at the end of the code?

total = 0.0
tax = 0.0
taxRate = 0.06
for item in items:
    price = item.getPrice()
    total += price
    
    if isOutOfStateCustomer:
        taxRate /= 2;
        
    if item.isTaxable():
        tax += price * taxRate;

total += tax
return total

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[3], line 4
      2 tax = 0.0
      3 taxRate = 0.06
----> 4 for item in items:
      5     price = item.getPrice()
      6     total += price

NameError: name 'items' is not defined

total = 0.0
tax = 0.0
taxRate = 0.06
for item in items:
price = item.getPrice()
total += price
if isOutOfStateCustomer:
taxRate /= 2;
if item.isTaxable():
tax += price * taxRate
total += tax
return total

27.5.1.2. Exercise Slicing - 2#

In the code below, which lines are part of the slice for the value of a at the end of the code?

a is a list of integers

def incrementAll(a):
    b = a
    for i in range(0,len(a)):
        b[i] += 1

    return len(b)                        

def incrementAll(a):
b = a
for i in range(0,len(a)):
b[i] += 1
return len(b)

27.5.2. Delta Debugging#

The process of isolating a small test case may also give you data that helps form a hypothesis, if it uncovers two closely-related test cases that bracket the bug, in the sense that one succeeds and one fails. For example, maybe mostCommonWords("c c, b") is broken, but mostCommonWords("c c b") is fine. Now you can examine the difference between the execution of these two test cases to help form your hypothesis. Which code is executed for the passing test case, but skipped for the failing test case, and vice versa? One hypothesis is that the bug lies in those lines of code, the delta between the passing run and the failing run.

This is a specific example of a general idea in bug finding called delta debugging, in which you compare successful runs with failing runs to try to localize the bug. Another kind of delta debugging is useful when a regression test starts failing. Using your version control system, you retrieve the most recent version that still passed the test, and then systematically explore the code-changes between the older working version and the newer failing version, until you find the change that introduced the bug. Delta debugging tools can automate this kind of search process, though like slicing tools they are not widely used yet.

27.5.3. Prioritize Hypotheses#

When making your hypothesis, you may want to keep in mind that different parts of the system have different likelihoods of failure. For example, old, well-tested code is probably more trustworthy than recently-added code. Python library code is probably more trustworthy than yours. The Python compiler and runtime, operating system platform, and hardware are increasingly more trustworthy, because they are more tried and tested. You should trust these lower levels until you have found good reason not to.

27.5.3.1. Priority Exercise#

Suppose you are debugging the quadraticRoots function, which appears to be producing wrong answers sometimes.

def quadraticRoots(a: number, b: number, c: number):
    """
    Solves quadratic equation ax^2 + bx + c = 0.
    
    Args:
       a: quadratic coefficient, requires a != 0
       b: linear coefficient
       c: constant term

    Returns:
    a list of the real roots of the equation
    """
    roots = []
    ...
    return roots

Put the follow items in the order that you should try them:

Try your code in different version of Python
Put print() statements throughout your method to display the intermediate values of the calculation.
Reproduce the bug.
Run a code coverage tool to see if there are lines that your test suite wasn’t reaching.

27.6. Step 3. Experiment#

Your hypothesis should lead to a prediction, such as “I think variable x has a bad value at this point” or even “I think this code is never executed.” Your experiment should be chosen to test the prediction. The best experiment is a probe, a gentle observation of the system that disturbs it as little as possible.

One familiar probe is a print statement. Print debugging has the advantage that it works for virtually every programming language. It has the disadvantage that it makes a change to the program, which you have to remember to revert. It’s too easy to end up with a program littered with print statements after a long debugging session. It’s also wise to be a little thoughtful when you write a debugging print statement. Rather than printing the same hi! in fifteen different places to trace how the program is executing, and losing track of which is which, print something clear and descriptive like start of calculateTotalBonus.

A more elaborate kind of print debugging is logging, which keeps informative print statements permanently in the code and turns them on or off with a global setting or a method call. Python contains a built-in logging library in its standard library. This library provides methods such as logging.info(), logging.warning(), logging.error(), and logging.critical - representing increasingly-higher levels of importance for the message being displayed. The logging framework allows the user to filter the log output to hide less-important messages, and messages sent to the most critical log levels, like logging.error(), are often displayed in red so that they stand out regardless of the filter. Additionally, this logging framework can also direct logging to a file or to a server across the network, can log structured data as well as human-readable strings, and can be used in deployment, not just development. Large, complex systems would be very hard to operate without logging.

Another kind of probe is an assertion that tests variable values or other internal state. In the example above where x is never allowed to exceed 100, we could insert assert(x <= 100) as a probe at any point. Assertions have the advantage that they don’t require you to inspect output by hand, and can even be left in the code after debugging if the test is universally true (some debugging assertions are only true for the specific test case you’re debugging, and those would need to be removed). Assertions have the disadvantage that in many languages (such as Java), they’re not turned on by default, so you can be fooled by an assertion that seems to be passing but is actually not even running. Assertions can be explicitly turned off as well.

A third kind of probe is setting a breakpoint with a debugger, which stops the running program at that point, and allows you to single-step through the program and inspect values of variables. With Python debuggers, we can also set a watch, which allows us to monitor a conditional value we define. Debuggers also allow us to inspect and change variables regardless of the scope in which they are defined. In other languages we can set watchpoints, which stops the programming when a change is made to a variable or memory location.

A debugger is a powerful tool that rewards the effort put into learning how to use it.

27.6.1. Probe Exercise#

Here is some code that has been successfully debugged, but still has some debugging probes left in it:

def convertCurrency(fromCurrency, fromValue, toCurrency):
    """
    Convert from one currency to another.
    
    Args:
       fromCurrency (str): currency that customer has (e.g. DOLLAR)
       fromValue (decimal): value of fromCurrency that customer has (e.g. $145.23)
       toCurrency (str): currency that customer wants (e.g. EURO).  
                         Must be different from fromCurrency.
                         
    Returns:
    value of toCurrency that customer will get, after applying the conversion rate and bank fee
    """
    
    assert(fromCurrency != toCurrency)
    
    rate = getConversionRate(fromCurrency, toCurrency)
    print("conversion rate:", rate)
    
    fee = getFee()
    assert(fee == 0.01)   # right now the bank charges 1%
    
    return fromValue * rate * (1-fee)

Which lines should be removed before committing and pushing into the code repository?
15) assert(fromCurrency !== toCurrency)
17) rate = getConversionRate(fromCurrency, toCurrency)
18) print("conversion rate:", rate)
20) fee = getFee()
21) assert(fee == 0.01) // right now the bank charges 1%
23) return fromValue * rate * (1-fee)

27.6.2. Debugger Exercise#

For this exercise, we will look at a hailstone sequence to see if the list of numbers converge to 1.

You will need to use Visual Studio Code or another Python debugger.

import sys

def hailstone_sequence(n): 
    """
    Compute a hailstone sequence.
    
    Args:
       n (int):  starting number for sequence.  Assumes n > 0.

    Returns:
    list containing a hailstone sequence starting with n and ending with 1.
    """
    result = [];
    
    while n != 1:
        result.append(n)
        if n % 2 == 0:
            n = n / 2
        else:
            n = 3 * n + 1
        
    result.append(n)
    print(max_of_list(result))
    return result


def max_of_list(list): 
    """
    Finds the maximum value of an array.  Note: Implemented function for
    demonstration purposes. In Python, we can simply use max(list)
    
    Args: 
      list: non-empty list containing only numerical items
    """
    max_value = sys.float_info.min
    for n in list:
        max_value = max(max_value, n)
    return max_value

print("hailstoneSequence(5)=", hailstone_sequence(5));

16
hailstoneSequence(5)= [5, 16, 8.0, 4.0, 2.0, 1.0]

Set a breakpoint on line 16 (result.append(n)), either by right-clicking to the left of the line number and choosing Add Breakpoint, or just clicking there to toggle the breakpoint. You should see a red dot appear on that line, and remain even after you move the mouse away.

Then select “Run” from the menu and then select “Start Debugging”. If asked, select debug the current Python file. This runs the program in the debugger. It should stop at the breakpoint.

How many elements are in result at this point? (Use the Variables pane to find out.)

Now look for the Step Over command, found as Run → Step Over in the menubar, but also as a toolbar button in the top of the Visual Studio Code window, which is more convenient to use repeatedly. Be careful to distinguish between Step Over and Step Into:

Step Over runs all the code on the current line and then moves to the next line.
Step Into enters the next method call on the current line, so it may jump to a method definition in very different part of the program. Step Into takes a smaller, more fine-grained step than Step Over.

Use Step Over to step over 6 statements.

How many elements are in array now?

Move forward until you reach line 23 print(maxOfList(result))), either by using Step Over repeatedly, or by setting a breakpoint there and using Continue (Run → Continue) to reach the breakpoint.

Now use Step Into to step into the method call on line 23.

What is the name of the method you are now in?

27.6.3. Swap Components#

If you hypothesize that the bug is in a module, and you happen to have a different implementation of it that satisfies the same interface, then one experiment you can do is to try swapping in the alternative. For example:

If you suspect your binarySearch() implementation, then substitute a simpler linearSearch() instead.
If you suspect the Python interpreter or a library, run with a different Python version or use a drop-in replacement for the library (for example many libraries exist as replacements for Python’s datetime library).
If you suspect the operating system, run your program on a different OS.
If you suspect the hardware, run on a different machine.

You can waste a lot of time swapping unfailing components, however, so don’t do this unless you have good reason to suspect a component. As we discussed under prioritizing hypotheses, the programming language, operating system, or hardware should be very low on your list of suspects.

27.6.4. One bug at a time#

It’s not unusual, while you’re trying to debug one problem, to discover other problems. Maybe you notice that some other module is returning wrong answers. Maybe while you’re reading your own code (effectively a self code review), you notice obvious mistakes that need to be fixed.

Keep a bug list. This helps you deal with fresh problems that arise during debugging. Write them down on your bug list so that you can come back to them later. A bug list can be as simple as a paper notebook or text file. For software development in a group, an online bug tracker like GitHub’s Issues tab is the best approach.

Don’t get distracted from the bug you’re working on. Reflect on whether the new problem is informative data for the bug you’re working on — does it lead to new hypotheses? But don’t immediately start a recursive debugging process for the new bug, because you may have a hard time popping your mental stack to return to the original bug. And don’t edit your code arbitrarily while you are debugging, because you don’t know whether those changes might affect your debugging experiments. Keep your code changes focused on careful, controlled probes of one bug at a time.

If the new problem is interfering with your ability to debug - for example, by crashing the program intermittently, so that you can’t reliably run your experiments - then you may need to reprioritize your bug fixing. Put the current bug down (unwinding or commenting out your experimental probes, and making sure the bug is on your bug list to fix later) and tackle the new bug instead.

27.6.5. Don’t fix yet#

It’s tempting to try to do an experiment that seems to fix the hypothesized bug, instead of a mere probe. This is almost always the wrong thing to do. First, it leads to a kind of ad hoc guess-and-test programming, which produces awful, complex, hard-to-understand code. Second, your fix may just mask the true bug without actually removing it - treating the symptom rather than the disease.

For example, if you’re getting a ValueError, don’t just add code that catches the exception and ignores it. Make sure you’ve understood why that exception was being thrown, and fix the real problem that was causing it.

27.6.5.1. Exercise: Premature fixes#

The following code has been debugged for a while:

def isAnagram(word1, word2):
    """
    Tests if word2 is an anagram of word1
    
    Args:
       word1 (str): string 
       word2 (str): another string

    Returns:
    true if and only if word1 is an anagram of word2 (i.e. a permutation of its characters)
    """
    if word1 == "" or word2 == "":
        return word1 == "" and word2 == ""
    
    try:
        word1 = sortCharacters(word1)
        word2 = sortCharacters(word2)
        return word1 == word2
    except: 
        return false
    

Which of its three return statements were probably added to patch over bugs, rather than fixing the real problem?

return word1 == "" and word2 == ""
return word1 == word2
except: return false

27.7. Step 4. Repeat#

After the experiment, reflect on the results and modify your hypothesis. If what you observed disagreed with what the hypothesis predicted, then reject that hypothesis and make a new one. If the observation agreed with the prediction, then refine the hypothesis to narrow down the location and cause of the bug. Then repeat the process with this fresh hypothesis.

27.7.1. Keep an audit trail#

This is one of Agans’ nine rules of debugging: Keep an Audit Trail.

If a bug takes more than a few minutes to find, or more than a couple iterations of the study-hypothesis-experiment loop, then you need to start writing things down, because the short-term memory in your brain will very quickly lose track of what’s working and what isn’t.

Keep a log in a text file of what you did, in what order, and what happened as a result. Include:

the hypothesis you are exploring now
the experiment you are trying now to test that hypothesis
what you observe as a result of the experiment:
- whether the test passed or failed this time
- the program output, especially your own debugging messages
- any stack traces

Systematic debugging techniques can rapidly exceed the ability of your brain to keep track of them, so start writing things down sooner rather than later. Minimizing a test case and delta debugging often require several iterations to zero in on an answer. If you do not write down which test case you are trying, and whether it passes or fails, then it will be hard to keep track of what you are doing.

27.7.2. Check the plug#

If you have been iterating on a bug and nothing makes sense, consider another of Agans’ rules: Check the Plug. What this means is you should question your assumptions. For example, if you turn on the power switch of a machine, and the machine does not start, maybe you should not be debugging the switch or the machine – but asking whether the machine is even plugged in? Or whether the electrical outlet itself has power?

An example of “checking the plug” in programming is to make sure your source code and object code are up to date. If none of your observations seem to make sense, one possible hypothesis is that the code you’re running (the object code) does not match the code you are reading (the source). To test this, pull the latest version from the repository and recompile all your program. This can be also a simple as making sure you are using the correct location if multiple copies of the source code exist. Within web applications, it may be necessary to restart the server to pick up the latest changes

27.7.3. If YOU didn’t fix it, it isn’t really fixed#

This is another one of Agans’ nine rules. In a complex system, sometimes a bug will suddenly disappear. Maybe it was something you did, but maybe not. If the system seems to be working now, and you don’t know how the bug got fixed… then most likely it isn’t fixed. The bug is just hiding again, masked by some change in environment, but ready to rear its ugly head again someday. We will see some examples of nasty bugs like this when we talk about concurrency, later in the course.

Systematic debugging helps build confidence that a bug is really fixed, and not just temporarily hidden. This is why you first reproduce the bug, so that you have seen the system in a failing state. Then you understand the cause of the bug without fixing it yet. Then finally apply a change to fix the bug using your knowledge of the cause. To gain confidence that you have really fixed it, you want to see that your change caused the system to transition from failing to working, and understand why.

27.8. Fix the Bug#

Once you find the bug and understand its cause, the third step is to devise a fix for it. Avoid the temptation to slap a patch on it and move on. Ask yourself whether the bug was a coding error, like a misspelled variable or interchanged method parameters, or a design error, like an underspecified or insufficient interface. Design errors may suggest that you step back and revisit your design, or at the very least consider all the other clients of the failing interface to see if they suffer from the bug too.

Look for related bugs, and newly-created ones. Think about whether any similar bugs might appear elsewhere in your code. If I just found a divide-by-zero error here, did I do that anywhere else in the code? Try to make the code safe from future bugs like this. Also consider what effects your fix will have. Will it break any other code?

Undo debugging probes. In the course of finding the bug, you may have commented out code, added print statements, or made other changes in order to probe the program’s behavior or make it faster to debug. Undo all those changes now, before you commit your fix to version control.

Make a regression test. After you have applied your fix, add the bug’s test case to your regression test suite, and run all the tests to assure yourself that (a) the bug is fixed, and (b) no new bugs have been introduced.

27.9. Other tips#

Get a fresh view. This is another of Agans’ nine rules of debugging. It often helps to explain your problem to someone else, even if the person you’re talking to has no idea what you’re talking about. This is sometimes called rubber-duck debugging or teddy-bear debugging. One computer lab had a big teddy bear sitting next to their help desk, with the rule that you had to “tell it to the bear” before you tried the human. The bear dealt with a surprising number of problems all by itself. Talking aloud about why you expect your code to work, and what it’s doing instead, can be good at helping you realize the problem yourself.

When the teddy bear doesn’t help, course staff and fellow students usually know what you’re talking about, so they’re even better to get help from. Outside this class, you may seek a fresh view from other team members, coworkers, or StackOverflow. The effort you put into minimizing your bug will help you make a minimal, reproducible example that you can use to ask a question online.

Another resource for how to ask questions: http://www.catb.org/~esr/faqs/smart-questions.html

Sleep on it. If you’re too tired, you won’t be an effective debugger. Trade latency for efficiency.

27.10. Suggested LLM Prompts#

Explain the concept of rubber duck debugging, where you explain your code line-by-line to an inanimate object or person. Describe how this strategy can help identify logical errors and improve code understanding.
Discuss the importance of reproducing the bug consistently. Provide examples of techniques to isolate and reproduce bugs, such as creating a minimal test case or writing a script to simulate the problematic scenario.
Describe the process of binary search debugging, where you systematically divide the code into smaller parts and test each part to identify the cause of the bug. Provide an example scenario and walk through the steps of this strategy.
Explain the use of printf/logging statements as a debugging strategy. Discuss how to effectively place these statements in the code to track variable values and program flow.
Discuss the importance of getting a stack trace and reading error messages carefully. Provide examples of how to interpret and understand different types of error messages and stack traces.
Describe the use of a debugger tool and its features, such as setting breakpoints, stepping through code, and inspecting variables. Explain how to effectively use a debugger to effectively understand a program’s state including variables and the code being executed.
Explain the strategy of time travel debugging, where you record the program’s execution and can step backwards or forwards through the execution to identify the root cause of the bug. Discuss the tools and techniques involved in this strategy.
Describe the strategy of code review or pair programming for debugging. Explain how having a second set of eyes on the code can help identify bugs and improve code quality. (From Julia Evans’ manifesto)
Discuss the importance of taking breaks and stepping away from the problem when stuck on a bug. Explain how this can help refresh your perspective and potentially lead to new insights or solutions.
What are David Agan’s nine rules for debugging? Provide an overview of the 9 nine rules including a short title and description. Then provide details with examples for each of the 9 nine rules. Use Python code when possible.

27.11. Review Questions#

What is the goal of debugging?
Describe a systematic debugging process. Why should such a process be used over just guessing?
Why is it important to find a small, repeatable test case that causes the failure?
What technique involves comparing successful runs to failing runs to isolate bug causes?
Which parts of a system are generally more trustworthy and less likely to contain bugs?
Describe three examples of probes.
What does “slicing” refer to when debugging?
How can design choices like immutability and minimal scope help slicing?
What is the main downside of using print statements for debugging?
What does “swapping components” refer to when debugging? Why is this generally not preferred?
What is “rubber duck debugging”?
How does a “debugger” fit into the debugging process?
Explain the difference between a breakpoint and a watchpoint for a debugger.
What does an assert statement do?
What is required for a hypothesis to be testable?

answers

27.12. Summary#

In this notebook, we looked at how to debug systematically:

reproduce the bug as a test case, and put it in your regression suite
find the bug using the scientific method:
- generate hypotheses using slicing, binary search, and delta debugging
- use minimially-invasive probes, like print statements, assertions, or a debugger, to observe program behavior and test the prediction of your hypotheses
fix the bug thoughtfully

Thinking about our three main measures of code quality, systematic debugging is essentially about safety from bugs: we’re trying to get rid of a bug, while using regression testing to keep it from coming back.

27.13. Exercise Answers#

27.13.1. Exercise: Reducing a bug to a test case - 1#

Answer: A,E
A. “chicken chicken beef”: One fewer chicken is shorter.
B. “Chicken Chicken Chicken beef”: This adds a new wrinkle, capitalization, so it’s not simplifying the current bad input. It’s making it more complex and trying to reveal a different bug. Good idea to have capitalization in your test cases for this spec, but focus on one bug at a time.
C. “chicken beef”: This is too short, because now chicken and beef are tied for the most common word.
D. “a b c”: Too different; introduces a three-way tie.
E. “c c c b”: Shortens the words themselves.

27.13.2. Exercise: Reducing a bug to a test case - 2#

Answer: B
You want to add a test case as a regression test against the risk of this bug appearing again. But since you have fixed both inputs with the same change, it is better to add the smaller, simpler of the two. Adding the “c b” test case would be a mistake, because it tests a stronger postcondition than the spec says.

27.13.3. Exercise: Scientific method#

“Study the data”. This is data that you received with the bug report.
“Hypothesize”. This is a hypothesis that you can test by running a simpler test case.
“Experiment”. This test case is an experiment that tests the hypothesis by replacing “chicken” and “beef” with simpler words.

27.13.4. Exercise: Stack trace#

line 5 in read_contents(filename)
line 18 in which main() was called.
On line 11 as the return value from get_filename()
text.txt. This is available as part of the exception message at the very end.

For reference, here is the code that generated the stack trace:

def get_filename():
    return input("Enter the data file name:")

def read_contents(filename):
    f = open(filename, "rt")
    contents = f.read()
    f.close()
    return contents

def process_file():
    name = get_filename()
    contents = read_contents(name)
    print(contents)
    
def main():
    process_file()
    
main()

---------------------------------------------------------------------------
StdinNotImplementedError                  Traceback (most recent call last)
Cell In[8], line 18
     15 def main():
     16     process_file()
---> 18 main()

Cell In[8], line 16, in main()
     15 def main():
---> 16     process_file()

Cell In[8], line 11, in process_file()
     10 def process_file():
---> 11     name = get_filename()
     12     contents = read_contents(name)
     13     print(contents)

Cell In[8], line 2, in get_filename()
      1 def get_filename():
----> 2     return input("Enter the data file name:")

File ~/Documents/GitHub/jupyternotebooks/venv/lib/python3.12/site-packages/ipykernel/kernelbase.py:1281, in Kernel.raw_input(self, prompt)
   1279 if not self._allow_stdin:
   1280     msg = "raw_input was called, but this frontend does not support input requests."
-> 1281     raise StdinNotImplementedError(msg)
   1282 return self._input_request(
   1283     str(prompt),
   1284     self._parent_ident["shell"],
   1285     self.get_parent("shell"),
   1286     password=False,
   1287 )

StdinNotImplementedError: raw_input was called, but this frontend does not support input requests.

27.13.5. Exercise: Slicing 1#

The expressions that reassign tax are part of the slice. So are (recursively) the slices of the values used in those expressions, i.e. taxRate and price. So are the if and for statements that control the execution of those lines.

answers: 2,3,4,5,7,8,9,10

27.13.6. Exercise: Slicing 2#

This example is subtle because of aliasing. The statement b=a makes b into an alias of a, which means that subsequent mutations of the array through b are also part of the slice of the value of a (in addition to the initial value passed in as an argument).

answers: 1,2,3,4

27.13.7. Exercise: Priority#

Ordering:

Reproducing the bug - seeing it fail yourself - makes it much easier to debug, and is always the best thing to start with.
Running the code coverage tool is not a bad next step, because it’s easy to run and may help localize the bug to untested code.
Inserting print statements to collect information takes more effort but the effort pays off.
Change Python interpreters is a good example of the “swap components” technique, but these are trusted components that should behave alike (at least for simple python like this), so they would be the last things to try.

27.13.8. Exercise: Probes#

The first assertion asserts the precondition of the method, so it can be kept.
The calls to getConversionRate() and getFee() are essential to the behavior of the method, so they must be kept.
The call to print is a debugging probe that should be removed.
The assertion that fee is 0.01 is a debugging probe too; the bank may change its fee in the future, so this assertion won’t be future-proof.
The return statement is essential to the method, so it must be kept

answers: 18,21

27.13.9. Exercise: Debugging#

0: The breakpoint stops before the statement executes, so the array is still empty.
2: At this point, the debugger should be at line 18, n=n/2.
max_of_list(): Line 23 actually has two method calls on it, but max_of_list() is called first, so Step Into enters it first.

27.13.10. Exercise: Premature Fixes#

Answers: 1,3

The special case for when one of the inputs is the empty string is likely to be a premature fix, because a well-designed sort should certainly be able to handle this case.
The return statement that handles exceptions is definitely papering over a mistake in the sort algorithm.
The other return statement is a necessary part of this approach to anagram testing.

27.14. License#

Sourced from https://web.mit.edu/6.031/www/sp22/classes/13-debugging/ Original license comments: Collaboratively authored by Rob Miller and Max Goldman, with contributions from Saman Amarasinghe, Adam Chlipala, Srini Devadas, Michael Ernst, John Guttag, Daniel Jackson, Martin Rinard, and Armando Solar-Lezama, and from Robert Durfee, Jenna Himawan, Stacia Johanna, Jessica Shi, Daniel Whatley, and Elizabeth Zhou. This work is licensed under CC BY-SA 4.0.

Updated to examples to Python, changed intro, added troubleshooting comments.