8. Strings#
Programmers constantly deal with text - reading it, processing it, and writing it. Think of the last few times that you used your computer or smartphone. How much did you deal with text versus numbers? User names, messages, stock symbols, etc.
To be an effective programmer, you must understand how to use, manipulate, and format strings.
In Python 3, strings are sequences of Unicode characters. Unicode is an international standard that provides a consistent encoding and representation of text. Python 2 used ASCII as its default character set instead. The ASCII standard provides a representation for 128 characters (only 95 of these are printable; the other 33 are control codes). In most situations, this difference is insignificant, as the first 128 characters of Unicode are identical to ASCII. However, if you ever deal with non-English characters, Unicode and the UTF-8 encoding used to store and transmit those characters become critical. By default, UTF-8 is the encoding for Python source files.
The Unicode 14.0 standard, published in September 2021, defines 144,697 characters.
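To make the link between characters and UTF-8 a little more concrete, here is a minimal sketch. It uses the built-in str.encode() and bytes.decode() methods, which this page does not otherwise cover, and the sample characters are chosen only for illustration.

```python
# Each character has one Unicode code point, but UTF-8 may need
# between one and four bytes to store it.
for ch in ["A", "é", "€", "☃"]:
    encoded = ch.encode("utf-8")
    print(ch, ord(ch), encoded, len(encoded))

# Decoding the bytes with the same encoding gives the text back.
euro_bytes = "€".encode("utf-8")
print(euro_bytes.decode("utf-8"))
```

ASCII characters such as A take a single byte in UTF-8, which is why plain English text looks the same in both encodings.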
Sample Unicode characters: (Notice in the subsequent two code blocks that Unicode characters can be represented by their actual symbol or by an escape sequence such as \u20ac.)
print('\u20ac') # euro
print('\u2603') # snowman
€
☃
We can also convert Unicode characters to and from numbers with the ord() and chr() built-in functions. Fundamentally, everything for a computer is a number. Therefore, strings can also be thought of as a series of numbers.
print('A to #:',ord('A'))
print('65 to Unicode:',chr(65))
print('Unicode to #:',ord('☃'))
print('9731 to Unicode',chr(9731))
A to #: 65
65 to Unicode: A
Unicode to #: 9731
9731 to Unicode ☃
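The idea that a string is a series of numbers can be sketched by walking through a short string one character at a time. The variable names below are illustrative only.

```python
word = "Hi!"

# Convert each character to its Unicode code point.
codes = []
for ch in word:
    codes.append(ord(ch))
print(codes)  # [72, 105, 33]

# Convert the numbers back into characters to rebuild the string.
rebuilt = ""
for number in codes:
    rebuilt = rebuilt + chr(number)
print(rebuilt)  # Hi!
```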
Try calling ord() with more than one character, such as the string 'hello', to see what occurs.
# make the call and see the resulting error.
Like Java, strings in Python are immutable ➞ you cannot change their value once defined; any alteration creates a new string object. However, in C and C++, string values can be altered.
8.1. Creating Strings#
You have already seen three different methods for creating strings through these notebooks:
Literals
Converting from other types with str()
input()
For literals, you can use either single quotes or double quotes. Base the choice on any project team standards and on convenience: if the string itself contains one kind of quote, delimiting it with the other kind avoids escaping.
print('Charles')
print("Dickens")
print("Charles Dickens's book, the Tale of Two Cities, is ... ")
print('Charles Dickens\'s book, the Tale of Two Cities, is ... ')
Charles
Dickens
Charles Dickens's book, the Tale of Two Cities, is ...
Charles Dickens's book, the Tale of Two Cities, is ...
In the last example, notice that we escaped the ' in the middle of the string by using a backslash (\). We can also escape double quotes with \".
print("How do you pronounce the word \"tomato\"?")
How do you pronounce the word "tomato"?
Common escape sequences: (View a more extensive table)
Escape | Result |
---|---|
\n | newline character |
\t | tab (used to align text) |
\\ | \ |
\' | ' |
\" | " |
print("It was the best of times,\nit was the worst of times,\nit was the age of wisdom,\nit was the age of foolishness...")
It was the best of times,
it was the worst of times,
it was the age of wisdom,
it was the age of foolishness...
If you need to include a \ in a string, rather than using an escape sequence, you can use a raw string. Specify a raw string in Python by prefixing a string value with an r or R.
print(r"\\Here is a\nraw string")
print(R"and\tanother\\") # r"and\tanother\" is a syntax error: a raw string cannot end in a single backslash because the \" does not close the string.
\\Here is a\nraw string
and\tanother\\
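One place raw strings help is with text that naturally contains many backslashes. The sketch below uses a made-up Windows-style path purely for illustration.

```python
# Hypothetical path, used only to show the two spellings.
regular = "C:\\new\\table.txt"   # every backslash must be escaped
raw = r"C:\new\table.txt"        # backslashes are kept as written

print(regular)         # C:\new\table.txt
print(raw)             # C:\new\table.txt
print(regular == raw)  # True

# In a raw string, \n is two characters rather than one newline.
print(len(r"\n"), len("\n"))  # 2 1
```

Without the r prefix or the doubled backslashes, the \n and \t in this path would be read as a newline and a tab.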
We can also create string literals with three single quotes (''') or three double quotes (""") rather than just ' or ". While triple quotes can be used for short strings, they are most commonly used to create multiline strings. We’ve already seen this with some of the docstrings in earlier notebooks.
# line continuation so we can start the string on the next line.
# Note: no characters can occur after the "\" in a line
opening = \
"""It was the best of times,
it was the worst of times,
it was the age of wisdom,
it was the age of foolishness..."""
print(opening)
It was the best of times,
it was the worst of times,
it was the age of wisdom,
it was the age of foolishness...
Triple quote strings are also convenient if you need to include single quotes or double quotes.
print("""John's book stated "Programming is fun!" """) # without the space before the closing quotes, 4 quotes in a row cause a syntax error. try removing the space
John's book stated "Programming is fun!"
Use str() to convert another type into a string.
In the last line of the following block, we evaluate b rather than printing it to show how the output differs from the previous line. If the last line of a code cell is an expression, Jupyter treats its value as the cell's result and displays it automatically.
print(type(1842))
x = str(1842)
print(type(x))
print(x)
print(type(None))
y = str(None)
print(type(y))
print(y)
b = str(True)
print(type(b))
print(b)
b
<class 'int'>
<class 'str'>
1842
<class 'NoneType'>
<class 'str'>
None
<class 'str'>
True
'True'
Using input() to get a string value from the user:
x = input("Enter something: ")
x
8.2. Concatenation#
We can concatenate (combine) two strings together with the + operator:
"Duck" + "Duck" + "Go"
'DuckDuckGo'
We can also concatenate string literals by placing one immediately after another:
"test"'test'"""test"""
'testtesttest'
Python does not automatically insert spaces or other characters when concatenating strings.
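Because no separator is added for us, any space has to be part of the expression. A minimal sketch, with illustrative variable names:

```python
first = "Charles"
last = "Dickens"

no_space = first + last
with_space = first + " " + last
print(no_space)    # CharlesDickens
print(with_space)  # Charles Dickens

# The + operator only joins strings to strings, so other types
# need to be converted with str() first.
label = "Count: " + str(3)
print(label)       # Count: 3
```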
8.3. Length#
Use len(str) to get the length of a string. Often, we will use the length for validation. We can also use it to help access specific parts of the string (whether a single character or a substring).
len("test"*5) # the * operator repeats the string, here 5 times.
20
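As one possible illustration of using the length for validation, the sketch below checks a made-up rule. The function name and the 3-to-12 character limit are assumptions for the example, not a standard.

```python
def is_valid_username(name):
    """Hypothetical rule: a username must be 3 to 12 characters long."""
    return 3 <= len(name) <= 12

print(is_valid_username("al"))     # False
print(is_valid_username("alice"))  # True
print(len(""))                     # 0
```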
8.4. Accessing a Single Character#
To get a single character, we can use square brackets with the character offset inside the bracket. The offset starts at 0 (the leftmost character) and goes to the length of the string minus one.
digits = "0123456789"
print(digits[0])
print(digits[9])
0
9
You can also index characters with negative integers. For example, -1 specifies the rightmost character (the same as the index len(digits) - 1), and -2 specifies the one before it.
print(digits[-1])
print(digits[-2])
print(digits[-10])
print(digits[-len(digits)])
9
8
0
0
If you attempt to access a character outside the bounds of the string, you will receive an IndexError:
print(digits[10])
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
Cell In[17], line 1
----> 1 print(digits[10])
IndexError: string index out of range
Notice that we cannot use this access method to change characters in a string - the string is immutable!
digits[5] = 'A'
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[18], line 1
----> 1 digits[5] = 'A'
TypeError: 'str' object does not support item assignment
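Since item assignment is not allowed, the usual workaround is to build a new string from pieces of the old one. The sketch below uses slicing, which the next section explains, and is only one of several ways to do this.

```python
digits = "0123456789"

# Take everything before offset 5, add the new character,
# then add everything after offset 5.
changed = digits[:5] + "A" + digits[6:]
print(changed)  # 01234A6789
print(digits)   # 0123456789 (the original string is unchanged)
```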
8.5. Slicing for Substrings#
In addition to representing text, strings are the first example of sequences in these notebooks. (Lists are sequences as well.)
Extracting parts of a string occurs quite frequently. Often, we search for a particular part of a string by looking for a specified delimiter (a value that separates fields or parts of a string), or we simply use a position based on the string’s length.
As mentioned above, strings are sequences of Unicode characters. The techniques used below to manipulate strings also apply to lists and tuples.
We can extract parts of a string (a substring) by using a slice. A slice is a range of index numbers separated by a colon within square brackets.
Slice | Description |
---|---|
[:] | extracts the entire sequence from start to end |
[start:] | specifies the sequence from the start offset to the end |
[:end] | specifies the sequence from the beginning to the end offset - 1 |
[start:end] | specifies the sequence from the start offset to the end offset - 1 |
[start:end:step] | specifies the sequence from the start offset to the end offset - 1, skipping characters by step |
The following image summarizes how we use indexes and slices:
Source: https://www.faceprep.in/python/string-slicing-in-python/
opening_line = "It was the best of times"
print(len(opening_line))
print(opening_line[:])
print(opening_line[19:])
print(opening_line[:2])
print(opening_line[3:6])
print(opening_line[1:10:2])
24
It was the best of times
times
It
was
twste
Try repeating the above code by using 'abcdefghijklmnopqrstuvwxyz' as the value for opening_line.
As with accessing characters, we can also use negative offsets with slices. We can also combine positive and negative values.
digits = "0123456789"
print(digits[-3:])
print(digits[:-6])
print(digits[-6:-3])
print(digits[-6:9])
print(digits[-10:-1:2])
789
0123
456
45678
02468
The Python interpreter returns an empty string if you specify a slice where the start offset occurs after the end offset.
digits[5:1]
''
Unlike accessing characters, if we specify an index out of the valid range, the minimum or maximum value is used in its place.
digits[0:15]
'0123456789'
digits[-12:]
'0123456789'
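Slices are handy when the parts of a string sit at known positions. The following sketch pulls the pieces out of a date written as YYYY-MM-DD. The date value is made up for illustration.

```python
date = "2024-03-15"   # hypothetical value in YYYY-MM-DD form

year = date[:4]       # offsets 0 through 3
month = date[5:7]     # offsets 5 and 6
day = date[-2:]       # the last two characters
print(year, month, day)  # 2024 03 15
```

Mixing positive offsets for the front of the string with negative offsets for the end often avoids having to compute the length.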
8.6. Splitting Strings#
To split a string into multiple strings based upon some separator within that string, you can call the method split(). (Notice that we call split() a method rather than a function, as it belongs to an object of type string.) As shown in the example below, split(' ') returns a list of the character sequences separated by a space.
This guide discusses lists on a later page. For now, you can get the length of the resulting list with len() and access a given element using the [] operator.
opening_line = "It was the best of times."
word_list = opening_line.split(' ')
print(word_list)
print(type(word_list))
print(len(word_list))
print(word_list[3])
['It', 'was', 'the', 'best', 'of', 'times.']
<class 'list'>
6
best
opening_line.split('e') #splitting on the character e
['It was th', ' b', 'st of tim', 's.']
opening_line.split('the') #splitting on the string 'the'
['It was ', ' best of times.']
Not specifying a separator will split the string based on whitespace, including spaces, tabs, and newlines.
s = "This is a\ntest\tof splitting whitespace."
print(s)
s.split()
This is a
test of splitting whitespace.
['This', 'is', 'a', 'test', 'of', 'splitting', 'whitespace.']
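The difference between split(' ') and split() shows up when separators repeat. A small sketch, using a sample string with two spaces between the first two letters:

```python
spaced = "a  b c"   # note the two spaces between a and b

by_space = spaced.split(' ')
by_whitespace = spaced.split()
print(by_space)       # ['a', '', 'b', 'c']
print(by_whitespace)  # ['a', 'b', 'c']
```

With an explicit separator, two separators in a row produce an empty string between them. With no argument, runs of whitespace are treated as a single separator.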
8.7. Joining Strings#
Use the method join() to combine a list of strings into a single string. Rather than belonging to the list class, join() belongs to the str (string) class. From a design point of view, it makes sense for join() to belong to str, as we use a string as the value between the combined strings.
split_list = ['This', 'is', 'a', 'test', 'of', 'joining', 'strings']
print("".join(split_list))
print(":".join(split_list))
print(", ".join(split_list))
Thisisatestofjoiningstrings
This:is:a:test:of:joining:strings
This, is, a, test, of, joining, strings
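split() and join() are often used as a pair: split a string apart, then join the pieces with a different separator. A minimal sketch with illustrative variable names:

```python
title = "It was the best of times"

# Break the title into words, then rejoin them with dashes.
words = title.split()
dashed = "-".join(words)
print(dashed)  # It-was-the-best-of-times

# Splitting on the dash and joining with a space restores the original.
restored = " ".join(dashed.split("-"))
print(restored == title)  # True
```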
8.8. Search and Replace#
Python contains multiple ways to find if one string contains another.
First, you can use the in operator, which returns True or False.
line = "can we search for the specified value in the string?"
"value" in line
True
We can also use find() or index() to get the position where the string starts within another string. These two methods behave the same, except that find() returns -1 if the substring is not found, while index() raises an exception (ValueError) if the substring is not present. Because Python accepts negative numbers as indexes into a string, a -1 returned by find() would silently refer to the last character, so we need to check the return value before using it. By using index(), we skip checking the return value and rely upon exception processing to handle the case where the substring does not exist.
Both methods also allow you to limit the search based on the starting and, possibly, ending positions.
help(line.index)
print(line.index("the"))
Help on built-in function index:
index(...) method of builtins.str instance
S.index(sub[, start[, end]]) -> int
Return the lowest index in S where substring sub is found,
such that sub is contained within S[start:end]. Optional
arguments start and end are interpreted as in slice notation.
Raises ValueError when the substring is not found.
18
To find the last occurrence of a substring within a string, you can use rfind() and rindex().
print(line.rfind("the"))
41
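The two styles of handling a missing substring can be sketched side by side. The search term "missing" is just a value known not to be in the string, and the try/except form relies on exception handling, which this page only mentions in passing.

```python
line = "can we search for the specified value in the string?"

# find(): check for -1 before using the result as an index.
pos = line.find("missing")
if pos != -1:
    print(line[pos:])
else:
    print("not found")

# index(): let the ValueError signal that nothing was found.
try:
    line.index("missing")
except ValueError:
    print("index() raised ValueError")

# The optional start argument skips earlier matches.
second = line.find("the", 19)
print(second)  # 41
```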
The string class also offers a simple mechanism to search for values in the string and replace them with a different string, returning a new string.
string.replace('searchValue', 'replaceValue')
s = "hello, how are you?"
s.replace('o',"xxx")
'hellxxx, hxxxw are yxxxu?'
s.replace('how','test')
'hello, test are you?'
# demonstrating that s.replace() returns a new object reference/id
print(id(s))
t = s.replace(",", " -")
print(t)
print(id(t))
4606314480
hello - how are you?
4606321648
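replace() also accepts an optional third argument that limits how many replacements are made. The sketch below shows this and confirms that the original string is left alone.

```python
s = "hello, how are you?"

# Replace only the first "o"; the remaining ones are left as-is.
first_only = s.replace("o", "0", 1)
print(first_only)  # hell0, how are you?

# s itself has not changed, because strings are immutable.
print(s)           # hello, how are you?
```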
8.9. Stripping Characters#
We can also remove any whitespace (' ', '\t', '\n') from the start and/or end of a string. Quite often, programmers will perform this task to “clean up” any text a user may enter (or even text received from other systems and files).
string.strip() removes whitespace from the start and end
string.lstrip() removes whitespace from the start
string.rstrip() removes whitespace from the end
s = " Hitchhiker's Guide to the Galaxy "
s
" Hitchhiker's Guide to the Galaxy "
s.strip()
"Hitchhiker's Guide to the Galaxy"
s.lstrip()
"Hitchhiker's Guide to the Galaxy "
s.rstrip()
" Hitchhiker's Guide to the Galaxy"
Python doesn’t limit the stripping capabilities to whitespace. You can also specify a string argument, in which case any character in that string is stripped from the ends.
s.strip(' Hi')
"tchhiker's Guide to the Galaxy"
Of course, you can use a single character as the argument.
s = '!!Are you crazy?!!'
s.strip('!')
'Are you crazy?'
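Two points are worth seeing in code: cleaning up typed input, and the fact that the argument to strip() is treated as a set of characters rather than as a prefix or suffix. The sample values below are made up for illustration.

```python
# Hypothetical user input with stray spaces and a newline.
raw_entry = "  alice@example.com \n"
email = raw_entry.strip()
print(repr(email))  # 'alice@example.com'

# "ab" means "strip any a or b from either end", not "strip the text ab".
word = "banana"
trimmed = word.strip("ab")
print(trimmed)      # nan
```

Stripping stops at the first character on each end that is not in the set, so characters in the middle of the string are never removed.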
8.10. String comparisons#
Strings use the same comparison operators as numbers. Python compares strings in lexicographic (dictionary-like) order, character by character, using each character's Unicode value. As a result, uppercase letters have smaller values than lowercase letters.
a = "alpha"
b = "alp" +"ha"
c = "test"
print("a < b",a < b)
print("a <= b", a <= b)
print("a >= b", a >= b)
print("a > b", a > b)
print("a == b", a == b)
print("a < c", a < c)
print("a == c", a == c)
print("a > c", a > c)
a < b False
a <= b True
a >= b True
a > b False
a == b True
a < c True
a == c False
a > c False
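The effect of uppercase letters sorting before lowercase ones can be surprising, so here is a short sketch. It uses the lower() string method, which this page has not introduced, as one possible way to compare without regard to case.

```python
# Uppercase letters have smaller code points than lowercase letters.
print(ord("Z"), ord("a"))  # 90 97

raw_result = "Zebra" < "apple"
print(raw_result)          # True, because "Z" (90) is less than "a" (97)

# Converting both sides to lowercase gives the dictionary-style answer.
folded_result = "Zebra".lower() < "apple".lower()
print(folded_result)       # False

# A string that is a prefix of another compares as smaller.
prefix_result = "app" < "apple"
print(prefix_result)       # True
```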
8.11. Notes#
Do not focus on trying to memorize all of the different functions available for strings - or, for that matter, many of the other objects that will be presented. Rather, you should be familiar with the general functionality available for strings and other object types. Then, use web searches, help, or LLMs to find the exact function and the associated arguments.
Become comfortable, though, with using [] to access a particular character and using slicing to access parts of a string. Both of these concepts will appear as we look at other sequence types (e.g., lists and tuples) in upcoming pages.
8.12. Suggested LLM Prompts#
Compare and contrast ASCII versus Unicode representations.
Explain slicing in Python using string examples.
Explain raw strings in Python.
Explain the immutable nature of strings in Python and its implications. Discuss how to create new strings based on existing ones and how to concatenate or modify strings efficiently.
Provide a set of Python-based exercises or coding challenges that involve various string manipulation tasks, such as reversing a string, checking for palindromes, removing duplicate characters, and counting the occurrences of specific substrings.
8.13. Review Questions#
How are strings represented in Python?
Compare and contrast ASCII versus Unicode representations. Does Python 3 offer a choice of which representation can be used?
Can strings be changed once they are created? What advantages and disadvantages does this have?
How are strings indexed in Python? Provide examples for a single character, starting from the start, middle, and end of a string.
What advantages does negative indexing provide? Provide an example.
What function returns the length of the string?
How would you remove leading and trailing whitespace from a string representing a company’s address?
Given a string representing a company’s financial report, how would you replace all occurrences of the word “profit” with “revenue”?
8.14. Drill#
Create a variable named u with the value "Duke University".
Get the length of u.
Get the string "Duke" using string slicing.
Find the position of "U" in the string.
Get the string "University" using string slicing and positive index(es).
Get the string "University" using string slicing and negative index(es).
Get the string "e U" using string slicing.
Get the string "v" from the string.
Test if the string "fintech" is in u.
Create a variable named x with the value " finance ".
Strip the whitespace characters from x.
Strip the whitespace characters from the left side of x.
Replace all of the spaces in x with a '-'.
Convert u to all uppercase letters.
Find the last occurrence of "e" in u.
Combine (concatenate) u and x.
Create a string of every other character from u.
8.15. Exercises#
How would you get more information about the ord() built-in function?
How would you get more information about the center() method that belongs to the str class?
Create a string literal for the value after the colon and assign it to the variable test: Let’s go run now!
Create a multiline string literal for one of the verses of Row, Row, Row Your Boat.
For these questions, create a variable called excited that has the following string repeated 10 times: "I'm so excited!\n". For this, we can use string duplication:
excited = "I'm so excited!\n" * 10
How long is the string in the variable excited? Use a function to get the answer.
Using code, what is the first letter of excited? The last letter?
Get a string (a substring) from excited with the last 9 characters.
Get a substring from excited starting at the first position (0), but skip every 17th character. What is unusual about this result?
Split the following string by dashes:
a = "20220502-https://python.org-200-127.0.0.1"
You have just arrived at a dystopian society that does not believe in vowels. Write a function to remove any vowels from a string and return a result. For some reason, they don’t mind y’s.
Given the following string, extract the portion after the colon and convert it to an integer variable.
line = 'Conference Attendance:14000'
Use find() and string slicing to extract the string.
Repeat the previous question, but use split().
Write a function, firstXCharacters, that returns the first x characters of the passed-in string.
Write a function, moveFirstToLast, that takes a string as the argument and returns another string, but with the first character moved to the end of the string.
The web is built with HTML strings like "<i>Hello</i>", which shows Hello as italicized text. In this example, the "i" tag makes <i> and </i>, which surround the word "Hello". Complete the following function:
def make_tags(tag, line):
    """
    Given tag and line strings, create the HTML string with tags around the word.
    Example: make_tags('i', 'Hello') returns '<i>Hello</i>'
    """
If you’d like more practice, look at some of the functions not covered in the exercises. Try comparing different strings.