Title | “Mastering the Basics of Python” |
Topic | developing mastery of basic Python syntax and functionality |
Date | Fri May 6 |
Time | 10am~11am PST |
Keywords | Python, the built-in data types, the built-in functions, the standard library, advanced syntax |
These sessions are designed for a broad audience of non-software engineers and software programmers of all backgrounds and skill-levels.
Our expected audience should comprise attendees with a…
During this session, we will endeavor to guide our audience to developing mastery of Python builtins (types and functions).
In previous seminars, we have used all manner of Python syntax and core functionality to demonstrate points about data analysis and software development. We have not called attention to some of the precise choices made in our code samples, preferring instead to discuss the use-case or theoretical topic at hand.
In this seminar, we will dive into some of the exacting, precise choices that we regularly make when writing even very simple pieces of Python code. While many of the topics we will discuss could be classified as “introductory” Python, we will approach them from the perspective of someone who has already written a good deal of code in Python, someone who is looking to revisit and solidify decisions they may subconsciously make every day in their code.
Sample Agenda:

- builtin types:
    - What is the difference between a list and a tuple (and why is mutability/immutability the least interesting aspect)?
    - What is a dict, what is a collections.defaultdict, what is a collections.Counter, and what is a pandas.Series; how are they similar, how do they differ, and how do they solve subtly different problems?
    - What is a numpy.ndarray, and how does it conceptually differ from a list?
    - What are the set and frozenset types?
- builtin functions:
    - What is the key= argument, how do we use it, and why is it preferable to the Decorate-Sort-Undecorate/“Schwartzian Transform” formulation? (and what is the model for doing sorts, mins, and maxes in pandas/NumPy with a custom predicate?)
    - What is the id function, and how can we use it to better understand whether we are working with a view or a copy? (what is the difference between early/late-binding? what is the difference between live/snapshot views? how do we understand these questions in the context of numpy or pandas?)
    - What are the iter and next functions? what is the difference between an iterator and an iterable? why does this matter, and how can we use this knowledge effectively?
    - What are the map and filter functions? what is the itertools module? when might we use these instead of comprehension syntax?
    - What is lambda syntax, why is it useful, and what does it tell people who are reading our code?
    - What is the difference between from module import f, import module, import module as mod, and from module import f as func?

Did you enjoy this seminar? Did you learn something new that will help you as you write larger Python scripts and analyses and write libraries to empower your colleagues’ work?
In a future seminar, we can go deeper into new syntax added to Python ≥3.6, and new approaches to writing Python that have evolved in the past five years.
We can discuss…
Why does dict raise KeyError?
print("Let's take a look!")
A dict represents a one-way mapping; it’s a way to relate two data-sets, where each unique key maps to exactly one value (though many keys may share a value).
hosts = {
'abc.corp.net': ...,
'def.corp.net': ...,
'xyz.corp.net': ...,
}
The dict
facilitates fast lookup, fast membership checking.
from IPython import get_ipython; run_line_magic = get_ipython().run_line_magic
from random import randrange
print(' list '.center(80, '\N{box drawings light horizontal}'))
for sz in [100, 1_000, 10_000, 100_000, 1_000_000]:
data = [randrange(100) for _ in range(sz)]
run_line_magic('time', '-1 in data')
print(' dict '.center(80, '\N{box drawings light horizontal}'))
for sz in [100, 1_000, 10_000, 100_000, 1_000_000]:
    data = {randrange(100): None for _ in range(sz)}  # a dict this time, to match the label above
run_line_magic('time', '-1 in data')
You use d[k]
syntax to retrieve (“getitem
”) and set (“setitem
”) entries.
en_fr = {
'one': 'un',
'two': 'deux',
'three': 'trois',
}
en_fr['four'] = 'quatre'
en_word = 'five'
fr_word = en_fr[en_word]  # KeyError!
print(f'To say {en_word!r} in French, you say {fr_word}')
print(f'To say {en_word!r} loudly in French, you say {fr_word.upper()}!')
But! if you look up an entry that doesn’t exist, you get a KeyError
!
The dict
has a .get
method which allows you to supply a default if the
entry is not found.
devices = {
'storage': 4,
'compute': 3,
}
# print(f"{devices['ai/ml'] = }")
print(f"{devices.get('ai/ml', 0) = }")
There are subtypes of the dict
class in the collections
module—e.g., collections.defaultdict
from collections import defaultdict
devices = defaultdict(int, {
'storage': 4,
'compute': 3,
})
print(f"{devices['ai/ml'] = }")
print(f"{devices['networking'] = }")
A collections.Counter
is a subtype of dict
with special behaviour and
methods for counting things (i.e., mapping entities to non-zero integer
values—“counts.”)
from collections import Counter
devices = Counter({
'storage': 4,
'compute': 3,
})
print(f"{devices['ai/ml'] = }")
print(f"{devices.most_common(1) = }")
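A Counter can also tally occurrences directly from an iterable, and counters support arithmetic with one another. A small sketch (the vendor names here are illustrative):

```python
from collections import Counter

# tally occurrences directly from an iterable
vendors = Counter(['cisco', 'cisco', 'infinera'])
print(f'{vendors["cisco"] = }')

# counters support arithmetic, e.g. folding in newly delivered devices
delivered = Counter({'cisco': 1, 'juniper': 2})
total = vendors + delivered
print(f'{total["cisco"] = }')
print(f'{total["juniper"] = }')
```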
When do we actually want the KeyError, and when might we prefer dict-based datatypes like collections.defaultdict or collections.Counter?
en_fr = {
'one': 'un',
'two': 'deux',
'three': 'trois',
}
en_fr['four'] = 'quatre'
en_word = 'two'
fr_word = en_fr[en_word]
print(f'To say {en_word!r} in French, you say {fr_word}')
print(f'To say {en_word!r} loudly in French, you say {fr_word.upper()}!')
devices = {
'storage': 4,
'compute': 3,
}
print(f"{devices['ai/ml'] = }")  # KeyError!
class mydict(dict):
def __missing__(self, key):
return ...
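To see the __missing__ hook in action, here is a sketch where the fallback value is purely illustrative:

```python
class defaulting_dict(dict):
    def __missing__(self, key):
        # dict.__getitem__ calls __missing__ (if defined) when the key is absent,
        # instead of raising KeyError
        return f'<no entry for {key!r}>'

devices = defaulting_dict({'storage': 4, 'compute': 3})
print(f"{devices['storage'] = }")
print(f"{devices['ai/ml'] = }")
```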
What is tuple good for?
print("Let's take a look!")
The list
type represents a collection of items. We often loop over these
items with a for
loop.
hosts = [
'abc.corp.net',
'xyz.corp.net',
'def.corp.net',
]
for h in hosts:
print(f'{h = }')
It has a “human ordering.”
hosts = [
'abc.corp.net',
'xyz.corp.net',
'def.corp.net',
]
for h in sorted(hosts, reverse=True):
print(f'{h = }')
It can be mutated—i.e., changed in place—using xs[idx]
syntax or methods like
xs.append
:
hosts = [
'abc.corp.net',
'xyz.corp.net',
'def.corp.net',
]
hosts.append('ghi.corp.net')
hosts.insert(0, 'jkl.corp.net')
hosts[0] = hosts[0].replace('jkl.', 'klm.')
for h in sorted(hosts, reverse=True):
print(f'{h = }')
But the tuple
type also exists, and seems to operate very similarly to the
list
, except it’s immutable. Why is this even useful?
hosts = (
'abc.corp.net',
'xyz.corp.net',
'def.corp.net',
)
hosts[0] = hosts[0].replace('abc.', 'bca.') # TypeError!
In addition to looping syntax, we have unpacking syntax in Python, which works
with both tuple
and list
, except it requires an exact match for unpacking
to work.
t = 1, 2
a, b, c = t  # ValueError! (expected 3 values, got 2)
print(f'{a = }')
print(f'{b = }')
print(f'{c = }')
In addition to list
and tuple
, we also have set
, which represents a
mathematical set
(unique elements with typical set operations.) The set
type is mutable but does not have a “human ordering.”
all_hosts = {
'abc.corp.net',
'xyz.corp.net',
'def.corp.net',
}
active_hosts = {
'xyz.corp.net',
}
new_hosts = {
'ghi.corp.net'
}
print(f'{all_hosts - active_hosts = }')
print(f'{all_hosts | new_hosts = }')
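Relatedly, the frozenset type is the immutable, hashable counterpart of set, so (unlike a plain set) it can serve as a dict key. The grouping below is illustrative:

```python
# frozenset is hashable, so it can key a dict; a plain set cannot
maintenance_windows = {
    frozenset({'abc.corp.net', 'def.corp.net'}): 'window A',
    frozenset({'xyz.corp.net'}): 'window B',
}
for group, window in maintenance_windows.items():
    print(f'{sorted(group)} -> {window}')
```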
The contents of a dict
are typically “homogeneous.”
d = {
'a': 1,
'b': 2,
'c': 3,
}
...
...
...
...
k = 'c'
v = d[k]
v + 1 # XXX: how do I know this will work?
The contents of a list
are typically “homogeneous” as well…
xs = [1, 2, 3, 4.0, 5+3j]
...
...
...
xs.append(6)
xs.clear()
...
...
for x in xs:
print(f'{x + 1 = }')
But the contents of a tuple
are typically “heterogeneous”… and I typically
use it with unpacking syntax.
from pandas import to_datetime
t = 'abc.corp.net', 16, to_datetime('2020-01-01')
...
...
...
...
...
...
host, ports, installed = t
A tuple
is a record (e.g., a row in a database) and a list
is a collection
(e.g., a table in a database.)
from pandas import to_datetime
devices = [
('abc.corp.net', 16, to_datetime('2020-01-01')),
('def.corp.net', 32, to_datetime('2020-02-06')),
('xyz.corp.net', 16, to_datetime('2020-01-08')),
]
for host, ports, installed in devices:
print(f'{host} was installed {to_datetime("2020-12-31") - installed} ago')
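If we want the record’s fields to carry names, collections.namedtuple builds a tuple subtype. A sketch (the field names are illustrative, and a stdlib date stands in for the pandas timestamp):

```python
from collections import namedtuple
from datetime import date

Device = namedtuple('Device', ['host', 'ports', 'installed'])
d = Device('abc.corp.net', 16, date(2020, 1, 1))

# fields are accessible by name…
print(f'{d.host = }')

# …but a Device still unpacks like any other tuple
host, ports, installed = d
print(f'{ports = }')
```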
When should I use a tuple?
- list: collection of similar entities
- dict: mapping from (similar) unique entities to similar entities
- tuple: one entity with multiple fields
- set: grouping of unique entities
(What about ordering: dict vs OrderedDict?)
from pandas import to_datetime
devices = [
('abc.corp.net', 16, to_datetime('2020-01-01'), 'cisco'),
('def.corp.net', 32, to_datetime('2020-02-06'), 'cisco'),
('xyz.corp.net', 16, to_datetime('2020-01-08'), 'infinera'),
]
vendors_by_ports = {}
for host, ports, installed, vendor in devices:
if ports not in vendors_by_ports:
vendors_by_ports[ports] = set()
vendors_by_ports[ports].add(vendor)
for ports, vendors in vendors_by_ports.items():
print(f'Ports: {ports} Vendors: {", ".join(vendors)}')
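The membership-test-then-insert dance above is exactly what collections.defaultdict automates. An equivalent rewrite (dates dropped for brevity):

```python
from collections import defaultdict

devices = [
    ('abc.corp.net', 16, 'cisco'),
    ('def.corp.net', 32, 'cisco'),
    ('xyz.corp.net', 16, 'infinera'),
]
vendors_by_ports = defaultdict(set)  # a missing key springs into existence as an empty set
for host, ports, vendor in devices:
    vendors_by_ports[ports].add(vendor)
for ports, vendors in sorted(vendors_by_ports.items()):
    print(f'Ports: {ports} Vendors: {", ".join(sorted(vendors))}')
```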
What are extended unpacking syntax and the additional unpacking generalisations?
Extended unpacking syntax was added in Python 3.0 with PEP 3132 (Extended Iterable Unpacking).
Additional unpacking generalisations were added in Python 3.5 with PEP 448 (Additional Unpacking Generalizations).
print("Let's take a look!")
Python has unpacking syntax for “destructuring” elements in an Iterable
—i.e.,
decomposing a collection into individual elements, binding each element to a
separate variable name.
t = 1, 2, 3
a, b, c = t
There are some idioms associated with this syntax, like swapping two values without a temporary…
x = 1
y = 20
print(' Before '.center(20, '\N{box drawings light horizontal}'))
print(f'{x = }')
print(f'{y = }')
x, y = y, x
print(' After '.center(20, '\N{box drawings light horizontal}'))
print(f'{x = }')
print(f'{y = }')
… or performing multiple variable assignments on one line.
x, y = 123, 456
print(f'{x = }')
print(f'{y = }')
Unpacking requires that you have exactly the same number of elements in the Iterable as variables you specify.
t = 1, 2, 3
a, b, c = t
A *
in unpacking syntax packs any additional items into a list
.
t = 1, 2, 3
a, b, c, *rest = t
print(f'{a = }')
print(f'{b = }')
print(f'{c = }')
print(f'{rest = }')
A *
in a list
literal unpacks elements from an Iterable
into a list
:
xs = [1, 2, 3]
ys = [4, 5, 6, 7]
ws = [xs, ys]
zs = [*xs, *ys]
print(f'{ws = }')
print(f'{zs = }')
A *
in a set
or tuple
literal does similar, but unpacks into a set
or
tuple
respectively:
xs = [1, 2, 3, 4]
ys = [4, 5, 6, 7]
ws = *xs, *ys
zs = {*xs, *ys}
print(f'{ws = }')
print(f'{zs = }')
A **
in a dict
does similar, but unpacks elements from a Mapping
into a
dict
, performing a merge.
d1 = {'a': 1, 'b': 2, 'c': 3 }
d2 = { 'b': 20, 'c': 30, 'd': 40}
d3 = {**d1, **d2}
print(f'{d3 = }')
We have a number of different ways to do merges in Python.
We can use a collections.ChainMap
…
from collections import ChainMap
d1 = {'a': 1, 'b': 2, 'c': 3 }
d2 = { 'b': 20, 'c': 30, 'd': 40}
d3 = ChainMap(d2, d1)
for k, v in d3.items():
print(f'{k = }: {v = }')
del d2['b']
del d2['c']
for k, v in d3.items():
print(f'{k = }: {v = }')
print(f'{d3 = }')
We can use itertools.chain with .items(), or an explicit .copy() followed by .update()…
from itertools import chain
d1 = {'a': 1, 'b': 2, 'c': 3 }
d2 = { 'b': 20, 'c': 30, 'd': 40}
d3 = dict(chain(d1.items(), d2.items()))
d3 = d1.copy()
d3.update(d2)
print(f'{d3 = }')
We can use the **
unpacking syntax or (in Python ≥3.9) we can use the |
operator.
d1 = {'a': 1, 'b': 2, 'c': 3 }
d2 = { 'b': 20, 'c': 30, 'd': 40}
d3 = {**d1, **d2}
d4 = d1 | d2
print(f'{d3 = }')
print(f'{d4 = }')
entries = [123, ..., ..., ..., ..., ..., 456]
if len(entries) < 2:
raise ValueError('...')
head = entries[0]
tail = entries[-1]
diff = head - tail
entries = [..., ..., ..., ..., ...]
head, *_, tail = entries
def process(in_use, in_maintenance):
for dev in in_use:
...
for dev in in_maintenance:
...
def process(in_use, in_maintenance):
for dev in in_use + in_maintenance: # possible TypeError!
...
process([..., ...], [..., ...])
process([..., ...], {..., ...})
from itertools import chain
def process(in_use, in_maintenance):
for dev in chain(in_use, in_maintenance):
...
list(chain(in_use, in_maintenance))
def process(in_use, in_maintenance):
return [*in_use, *in_maintenance]
def process(in_use, in_maintenance):
return {*in_use, *in_maintenance}
Why does comprehension syntax exist?
Comprehension syntax was added in Python 2.0 with PEP 202 (List Comprehensions).
It was further extended to dict and set in Python 2.7 and 3.0 with PEP 274 (Dict Comprehensions).
It was even further extended with async syntax in Python 3.6 with PEP 530 (Asynchronous Comprehensions).
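As a taste of that Python ≥3.6 syntax, an async comprehension consumes an asynchronous iterable; the async generator below is a stand-in for e.g. values awaited from a network call:

```python
from asyncio import run

async def numbers():
    # an async generator: each value could, in real code, come from an await
    for i in range(5):
        yield i

async def main():
    squares = [x**2 async for x in numbers()]  # async comprehension (PEP 530)
    print(f'{squares = }')
    return squares

result = run(main())
```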
print("Let's take a look!")
In Python, we have a for
-loop which we typically use as a “for-each” loop.
hosts = [
'abc.corp.net',
'xyz.corp.net',
'def.corp.net',
]
for h in hosts:
print(f'{h = }')
We are discouraged from using it as a C-style for
-loop:
hosts = [
'abc.corp.net',
'xyz.corp.net',
'def.corp.net',
]
for idx in range(len(hosts)):
print(f'{hosts[idx] = }')
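When we genuinely need the index, the enumerate builtin preserves the “for-each” shape:

```python
hosts = [
    'abc.corp.net',
    'xyz.corp.net',
    'def.corp.net',
]
# enumerate yields (index, element) pairs, unpacked directly in the loop header
for idx, h in enumerate(hosts):
    print(f'{idx}: {h}')
```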
We have iteration helpers to allow us to use this as a “for-each” loop in many situations (and we are encouraged to write our own iteration helpers.)
hosts = [
'abc.corp.net',
'xyz.corp.net',
'def.corp.net',
]
datacenters = [
'ghi1',
'ghi2',
'klm1',
]
# for idx in range(len(hosts)):
# h, d = hosts[idx], datacenters[idx]
# print(f'{h} in {d}')
for h, d in zip(hosts, datacenters, strict=True):
print(f'{h} in {d}')
We also have comprehension syntax, but it is more limited than for-loop syntax:
hosts = [
'abc.corp.net',
'xyz.corp.net',
'def.corp.net',
]
names = []
domains = set()
for h in hosts:
n, d = h.split('.', 1)
names.append(n)
domains.add(d)
print(f'{names = }')
print(f'{domains = }')
… which can be rewritten as…
hosts = [
'abc.corp.net',
'xyz.corp.net',
'def.corp.net',
]
names = [h.split('.', 1)[0] for h in hosts]
domains = {h.split('.', 1)[-1] for h in hosts}
print(f'{names = }')
print(f'{domains = }')
We have a list
, set
, and dict
comprehension (but no tuple
comprehension.)
xs = [-3, -2, -1, 0, 1, 2, 3]
all_squares = [x**2 for x in xs]
uniq_squares = {x**2 for x in xs}
squared_xs = {x: x**2 for x in xs}
print(f'{all_squares = }')
print(f'{uniq_squares = }')
print(f'{squared_xs = }')
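There is a reason the tuple case is absent: parentheses around a comprehension produce a generator expression, not a “tuple comprehension”:

```python
xs = [-3, -2, -1, 0, 1, 2, 3]
gen = (x**2 for x in xs)   # a lazy generator expression, not a tuple
squares = tuple(gen)       # materialise explicitly when a tuple is wanted
print(f'{squares = }')
```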
However, we cannot do the following with comprehension syntax…
from collections import defaultdict
xs = [-3, -2, -1, 0, 1, 2, 3]
squared_xs = defaultdict(set)
for x in xs:
squared_xs[x**2].add(x)
print(f'{squared_xs = }')
Comprehensions can have filters and multiple levels, but…
xss = [[-3, -2, -1], [0], [1, 2, 3]]
ys = []
for xs in xss:
for x in xs:
ys.append(x**2)
zs = [x**2 for xs in xss for x in xs]
print(f'{ys = }')
print(f'{zs = }')
xss = [[-3, -2, -1], [0], [1, 2, 3]]
ys = []
for xs in xss:
if len(xs) > 1:
for x in xs:
if x % 2 == 0:
ys.append(x**2)
zs = [x**2 for xs in xss if len(xs) > 1 for x in xs if x % 2 == 0]
print(f'{ys = }')
print(f'{zs = }')
… comprehensions cannot:
xs = [1, 2, 3, 4]
# good!
for x in xs:
print(f'{x = }')
# misleading! — why did you create a list you didn't care about?
[print(f'{x = }') for x in xs]
# can you “skim” this?
for x in xs:
for y in x:
if ...:
continue
...
if ...:
break
...
...
...
...
...
xs = [... for ... in ... if ...]
...
...
...
ys = [f(x) for x in xs if cond(x)]
...
...
...
Why do I need context managers?
Context managers were added to Python 2.5 with PEP 343 (the “with” statement).
print("Let's take a look!")
We often want to work with some resource, like a file or a database connection. This resource requires some set-up and tear-down.
f = open(__file__)
...
...
...
0 / 0
...
...
f.close()
We cannot rely on ourselves to do the tear-down manually, because we could forget or because an error could occur which would cause our tear-down code not to run.
f = open(__file__)
try:
...
...
...
...
...
...
...
...
...
...
...
...
...
...
finally:
f.close()
We cannot rely on the garbage collector to do this work for us, because the lifetime of our resources isn’t guaranteed to be tightly scoped to a block.
Therefore, we need special syntax to ensure we can sequence two operations:
with open(__file__) as f:
print(f'{not f.closed = }')
...
print(f'{not f.closed = }')
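The tear-down is guaranteed even when the body raises; a quick check:

```python
f_ref = None
try:
    with open(__file__) as f:
        f_ref = f
        raise RuntimeError('simulated failure inside the block')
except RuntimeError:
    pass
# __exit__ ran despite the exception, so the file is closed
print(f'{f_ref.closed = }')
```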
This pattern is so general that we want to extend it to a variety of situations.
For example, database connections…
from sqlite3 import connect
with connect(':memory:') as conn:
pass
… temporary directories…
from tempfile import TemporaryDirectory
with TemporaryDirectory() as d:
pass
… or even configuration and settings.
from decimal import localcontext
with localcontext() as ctx:
ctx.prec = 20
...
As with any “extension” that implements the Python vocabulary, we use special
methods __enter__
and __exit__
.
class T:
def __enter__(self):
print('before')
def __exit__(self, exc_type, exc_value, traceback):
print('after')
with T():
print('inside')
Context managers are a sequencing mechanism, but so are generators.
def g():
print('before')
yield
print('after')
gi = g()
next(gi)
print('inside')
next(gi, None)
We can adapt one mechanism to the other using the decorator contextlib.contextmanager
.
from contextlib import contextmanager
@contextmanager
def g():
print('before')
yield
print('after')
with g():
print('inside')
Use the contextlib.contextmanager decorator: write them often!
with test_db() as db:
with test_data(db): # baseline
...
with test_data(db): # scenario #1
...
with test_data(db): # alternate baseline
...
with test_data(db): # alternate baseline
...
from contextlib import contextmanager
from sqlite3 import connect
from tempfile import TemporaryDirectory
from pathlib import Path
from random import choice, randrange
from string import ascii_lowercase
@contextmanager
def test_db():
create = '''
create table test (
name text
, value number
);
'''
drop = 'drop table test'
with TemporaryDirectory() as d:
d = Path(d)
with connect(d / 'test.db') as db:
try:
db.execute(create)
yield db
finally:
db.execute(drop)
@contextmanager
def test_data(db):
data = [
(''.join(choice(ascii_lowercase) for _ in range(2)), randrange(100))
for _ in range(10)
]
try:
db.executemany('insert into test values (?, ?)', data)
yield
finally:
db.executemany('delete from test where name=? and value=?', data)
with test_db() as db:
with test_data(db):
...
with test_data(db):
cur = db.execute('select name, sum(value) from test group by name limit 3')
for row in cur: print(f'{row = }')
with test_data(db):
...
with test_data(db):
...
Why do I need asyncio?
Special syntax for asyncio was added to Python 3.5 with PEP 492 (Coroutines with async and await syntax).
print("Let's take a look!")
If I want to work concurrently, I have the following choices:
- threading
- multiprocessing
- asyncio
In threading
, I have one process:
from threading import Thread
from queue import Queue
from dataclasses import dataclass
from string import ascii_lowercase
from random import choice
from time import sleep
@dataclass
class Job:
name : str
@classmethod
def from_random(cls):
name = ''.join(choice(ascii_lowercase) for _ in range(4))
return cls(name)
def producer(q):
while True:
for _ in range(choice([1, 2])):
j = Job.from_random()
print(f'Enqueueing job {j = }')
q.put(j)
sleep(1)
def consumer(name, q):
while True:
j = q.get()
print(f'Servicing job {j = } @ {name = }')
sleep(.1)
def main():
q = Queue()
pool = [
Thread(target=producer, kwargs={'q': q}),
Thread(target=consumer, kwargs={'q': q, 'name': 'consumer#1'}),
Thread(target=consumer, kwargs={'q': q, 'name': 'consumer#2'}),
]
for x in pool: x.start()
main()
In multiprocessing, I have multiple processes:
from multiprocessing import Process, Queue
from dataclasses import dataclass
from string import ascii_lowercase
from random import choice
from time import sleep
@dataclass
class Job:
name : str
@classmethod
def from_random(cls):
name = ''.join(choice(ascii_lowercase) for _ in range(4))
return cls(name)
def producer(q):
while True:
for _ in range(choice([1, 2])):
j = Job.from_random()
print(f'Enqueueing job {j = }')
q.put(j)
sleep(1)
def consumer(name, q):
while True:
j = q.get()
print(f'Servicing job {j = } @ {name = }')
sleep(.1)
def main():
q = Queue()
pool = [
Process(target=producer, kwargs={'q': q}),
Process(target=consumer, kwargs={'q': q, 'name': 'consumer#1'}),
Process(target=consumer, kwargs={'q': q, 'name': 'consumer#2'}),
]
for x in pool: x.start()
main()
Therefore:
- threading: one process
- multiprocessing: multiple processes
- multiprocessing.shared_memory can reduce the penalty for sharing data (but will not eliminate the runtime boundary)

So: threading or multiprocessing?
As a third option, we have asyncio
. On its face, it looks similar to threading:
In asyncio
, I have one process:
Note the special async
and await
syntax:
from asyncio import gather, run, sleep as aio_sleep
from asyncio.queues import Queue
from dataclasses import dataclass
from string import ascii_lowercase
from random import choice
@dataclass
class Job:
name : str
@classmethod
def from_random(cls):
name = ''.join(choice(ascii_lowercase) for _ in range(4))
return cls(name)
async def producer(q):
while True:
for _ in range(choice([1, 2])):
j = Job.from_random()
print(f'Enqueueing job {j = }')
await q.put(j)
await aio_sleep(1)
async def consumer(name, q):
while True:
j = await q.get()
print(f'Servicing job {j = } @ {name = }')
await aio_sleep(.1)
async def main():
q = Queue()
tasks = [
producer(q=q),
consumer(q=q, name='consumer#1'),
        consumer(q=q, name='consumer#2'),
]
await gather(*tasks)
run(main())
So when do I reach for multiprocessing, for threading, or for asyncio?
asyncio’s coöperative scheduling means everyone needs to coöperate. Consider asyncio on day one of your project.