seminars.fb

Seminar (Fri May 6): “Mastering the Basics of Python” (developing mastery of basic Python syntax and functionality)

   
Title “Mastering the Basics of Python”
Topic developing mastery of basic Python syntax and functionality
Date Fri May 6
Time 10am~11am PST
Keywords Python, the built-in data types, the built-in functions, the standard library, advanced syntax

Audience

These sessions are designed for a broad audience of software programmers and non-software engineers of all backgrounds and skill levels.

Our expected audience should comprise attendees with a…

During this session, we will endeavor to guide our audience to developing…

Abstract

In previous seminars, we have used all manner of Python syntax and core functionality to demonstrate points about data analysis and software development. We have not called attention to some of the precise choices made in our code samples, preferring instead to discuss the use-case or theoretical topic at hand.

In this seminar, we will dive into some of the exacting, precise choices that we regularly make when writing even very simple pieces of Python code. While many of the topics we will discuss could be classified as “introductory” Python, we will approach them from the perspective of someone who has already written a good deal of code in Python, someone who is looking to revisit and solidify decisions they may subconsciously make every day in their code.

Sample Agenda:

What’s Next?

Did you enjoy this seminar? Did you learn something new that will help you as you write larger Python scripts and analyses and write libraries to empower your colleagues’ work?

In a future seminar, we can go deeper into new syntax added to Python ≥3.6, and new approaches to writing Python that have evolved in the past five years.

We can discuss…

Notes

Question: Why does dict raise KeyError?

Why does dict raise KeyError?

print("Let's take a look!")

The Basics

A dict represents a one-way mapping; it’s a way to relate one data-set to another, with each unique key mapped to a value.

hosts = {
    'abc.corp.net': ...,
    'def.corp.net': ...,
    'xyz.corp.net': ...,
}

The dict facilitates fast lookup, fast membership checking.

from IPython import get_ipython; run_line_magic = get_ipython().run_line_magic
from random import randrange

print(' list '.center(80, '\N{box drawings light horizontal}'))
for sz in [100, 1_000, 10_000, 100_000, 1_000_000]:
    data = [randrange(100) for _ in range(sz)]
    run_line_magic('time', '-1 in data')   # list membership scans every element

print(' dict '.center(80, '\N{box drawings light horizontal}'))
for sz in [100, 1_000, 10_000, 100_000, 1_000_000]:
    data = {randrange(100): None for _ in range(sz)}   # a dict: keys are hashed
    run_line_magic('time', '-1 in data')   # dict membership checks keys by hash

You use d[k] syntax to retrieve (“getitem”) and set (“setitem”) entries.

en_fr = {
    'one':   'un',
    'two':   'deux',
    'three': 'trois',
}
en_fr['four'] = 'quatre'

en_word = 'five'
fr_word = en_fr.get(en_word, '')   # .get avoids a KeyError by falling back to '' (more on this below)

print(f'To say {en_word!r} in French, you say {fr_word}')
print(f'To say {en_word!r} loudly in French, you say {fr_word.upper()}!')

But! if you look up an entry that doesn’t exist, you get a KeyError!
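
For example (a minimal sketch, restating the en_fr mapping from above):

en_fr = {'one': 'un', 'two': 'deux', 'three': 'trois'}
en_fr['five']   # KeyError: 'five'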

Beyond the Basics

The dict has a .get method which allows you to supply a default if the entry is not found.

devices = {
    'storage':  4,
    'compute':  3,
}

# print(f"{devices['ai/ml']     = }")
print(f"{devices.get('ai/ml', 0) = }")

There are subtypes of the dict class in the collections module—e.g., collections.defaultdict.

from collections import defaultdict

devices = defaultdict(int, {
    'storage':  4,
    'compute':  3,
})

print(f"{devices['ai/ml'] = }")
print(f"{devices['networking'] = }")

A collections.Counter is a subtype of dict with special behaviour and methods for counting things (i.e., mapping entities to integer values—“counts.”)

from collections import Counter

devices = Counter({
    'storage':  4,
    'compute':  3,
})

print(f"{devices['ai/ml'] = }")
print(f"{devices.most_common(1) = }")

Conclusions & Consequences: Why KeyError?

  1. Don’t handle errors unless you can meaningfully do something about them (see the sketch after this list).
  2. Develop familiarity with dict-based datatypes like collections.defaultdict or collections.Counter.
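
For item 1, a minimal sketch (the translate helper is a made-up example): catch KeyError only when there is a sensible recovery, such as falling back to the untranslated word.

en_fr = {'one': 'un', 'two': 'deux', 'three': 'trois'}

def translate(word):
    try:
        return en_fr[word]
    except KeyError:
        return word   # a meaningful fallback: return the untranslated word

print(f'{translate("two")  = }')
print(f'{translate("five") = }')
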
en_fr = {
    'one':   'un',
    'two':   'deux',
    'three': 'trois',
}
en_fr['four'] = 'quatre'

en_word = 'two'
fr_word = en_fr[en_word]

print(f'To say {en_word!r} in French, you say {fr_word}')
print(f'To say {en_word!r} loudly in French, you say {fr_word.upper()}!')
devices = {
    'storage':  4,
    'compute':  3,
}

print(f"{devices['ai/ml']     = }")
class mydict(dict):
    def __missing__(self, key):
        # dict.__getitem__ calls __missing__ when `key` is absent;
        # whatever it returns becomes the result of d[key]
        return ...

Question: What is tuple good for?

What is tuple good for?

print("Let's take a look!")

The Basics

The list type represents a collection of items. We often loop over these items with a for loop.

hosts = [
    'abc.corp.net',
    'xyz.corp.net',
    'def.corp.net',
]

for h in hosts:
    print(f'{h = }')

It has a “human ordering”: the order of its elements is meaningful, and we can rearrange it (e.g., with sorted).

hosts = [
    'abc.corp.net',
    'xyz.corp.net',
    'def.corp.net',
]

for h in sorted(hosts, reverse=True):
    print(f'{h = }')

It can be mutated—i.e., changed in place—using xs[idx] syntax or methods like xs.append:

hosts = [
    'abc.corp.net',
    'xyz.corp.net',
    'def.corp.net',
]
hosts.append('ghi.corp.net')
hosts.insert(0, 'jkl.corp.net')
hosts[0] = hosts[0].replace('jkl.', 'klm.')

for h in sorted(hosts, reverse=True):
    print(f'{h = }')

But the tuple type also exists. It seems to operate very similarly to the list, except that it’s immutable. Why is this even useful?

hosts = (
    'abc.corp.net',
    'xyz.corp.net',
    'def.corp.net',
)
hosts[0] = hosts[0].replace('abc.', 'bca.') # TypeError!

Beyond the Basics

In addition to looping syntax, we have unpacking syntax in Python, which works with both tuple and list, except that it requires an exact match between the number of elements and the number of names for unpacking to work.

t = 1, 2
a, b, c = t   # ValueError: not enough values to unpack (expected 3, got 2)

print(f'{a = }')
print(f'{b = }')
print(f'{c = }')

In addition to list and tuple, we also have set, which represents a mathematical set (unique elements with typical set operations.) The set type is mutable but does not have a “human ordering.”

all_hosts = {
    'abc.corp.net',
    'xyz.corp.net',
    'def.corp.net',
}

active_hosts = {
    'xyz.corp.net',
}

new_hosts = {
    'ghi.corp.net'
}

print(f'{all_hosts - active_hosts = }')
print(f'{all_hosts | new_hosts    = }')

The contents of a dict are typically “homogeneous.”

d = {
    'a': 1,
    'b': 2,
    'c': 3,
}

...
...
...
...

k = 'c'
v = d[k]
v + 1 # XXX: how do I know this will work?

The contents of a list are typically “homogeneous” as well…

xs = [1, 2, 3, 4.0, 5+3j]

...
...
...
xs.append(6)
xs.clear()
...
...

for x in xs:
    print(f'{x + 1 = }')

But the contents of a tuple are typically “heterogeneous”… and I typically use it with unpacking syntax.

from pandas import to_datetime

t = 'abc.corp.net', 16, to_datetime('2020-01-01')
...
...
...
...
...
...
host, ports, installed = t

A tuple is a record (e.g., a row in a database) and a list is a collection (e.g., a table in a database.)

from pandas import to_datetime

devices = [
    ('abc.corp.net', 16, to_datetime('2020-01-01')),
    ('def.corp.net', 32, to_datetime('2020-02-06')),
    ('xyz.corp.net', 16, to_datetime('2020-01-08')),
]

for host, ports, installed in devices:
    print(f'{host} was installed {to_datetime("2020-12-31") - installed} ago')

Conclusions & Consequences: Why tuple?

  1. Choose the appropriate type for “modelling” your data—expressing what this data means.
    • list: collection of similar entities
    • dict: mapping from (similar) unique entities to similar entities
    • tuple: one entity with multiple fields
    • set: grouping of unique entities
  2. Consider the behaviour of these types as consequences of, and guidance for, their use (see the sketch after this list).
    • mutable vs immutable
    • human ordered vs machine ordered
      • insertion ordered (e.g., dict)
      • ordering considered for equality or not (e.g., dict vs OrderedDict)
    • hashable vs not hashable
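
A minimal sketch of the last two bullets (the locations mapping is just an illustrative name): a tuple can key a dict or live in a set because it is hashable, while a list cannot; and plain dict equality ignores ordering, while collections.OrderedDict equality does not.

from collections import OrderedDict

# hashable vs not hashable
locations = {('abc.corp.net', 443): 'ghi1'}       # OK: a tuple can be a dict key
try:
    locations[['abc.corp.net', 443]] = 'ghi1'     # a list cannot be hashed
except TypeError as e:
    print(f'{e = }')

# ordering considered for equality or not
d1 = {'a': 1, 'b': 2}
d2 = {'b': 2, 'a': 1}
od1, od2 = OrderedDict(d1), OrderedDict(d2)

print(f'{d1 == d2   = }')   # True: dict equality ignores insertion order
print(f'{od1 == od2 = }')   # False: OrderedDict equality considers it
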
from pandas import to_datetime

devices = [
    ('abc.corp.net', 16, to_datetime('2020-01-01'), 'cisco'),
    ('def.corp.net', 32, to_datetime('2020-02-06'), 'cisco'),
    ('xyz.corp.net', 16, to_datetime('2020-01-08'), 'infinera'),
]

vendors_by_ports = {}
for host, ports, installed, vendor in devices:
    if ports not in vendors_by_ports:
        vendors_by_ports[ports] = set()
    vendors_by_ports[ports].add(vendor)

for ports, vendors in vendors_by_ports.items():
    print(f'Ports: {ports} Vendors: {", ".join(vendors)}')

Question: What are extended unpacking syntax and the additional unpacking generalisations?

What are extended unpacking syntax and the additional unpacking generalisations?

Extended unpacking syntax was added in Python 3.0 with PEP 3132 (“Extended Iterable Unpacking”).

Additional unpacking generalisations were added in Python 3.5 with PEP 448 (“Additional Unpacking Generalizations”).

print("Let's take a look!")

The Basics

Python has unpacking syntax for “destructuring” elements in an Iterable—i.e., decomposing a collection into individual elements, binding each element to a separate variable name.

t = 1, 2, 3
a, b, c = t

There are some idioms associated with this syntax, like swapping two values without a temporary…

x = 1
y = 20

print(' Before '.center(20, '\N{box drawings light horizontal}'))
print(f'{x = }')
print(f'{y = }')

x, y = y, x

print(' After '.center(20, '\N{box drawings light horizontal}'))
print(f'{x = }')
print(f'{y = }')

… or performing multiple variable assignments on one line.

x, y = 123, 456
print(f'{x = }')
print(f'{y = }')

Unpacking requires that you have exactly the same number of elements in the Iterable as variables you specify.

t = 1, 2, 3
a, b, c = t

Beyond the Basics

A * in unpacking syntax packs any additional items into a list.

t = 1, 2, 3
a, b, c, *rest = t   # nothing is left over here, so rest == []
print(f'{a    = }')
print(f'{b    = }')
print(f'{c    = }')
print(f'{rest = }')

A * in a list literal unpacks elements from an Iterable into a list:

xs = [1, 2, 3]
ys = [4, 5, 6, 7]

ws = [xs, ys]
zs = [*xs, *ys]

print(f'{ws = }')
print(f'{zs = }')

A * in a set or tuple literal does similar, but unpacks into a set or tuple respectively:

xs = [1, 2, 3, 4]
ys = [4, 5, 6, 7]

ws = *xs, *ys
zs = {*xs, *ys}

print(f'{ws = }')
print(f'{zs = }')

A ** in a dict literal does similar, but unpacks entries from a Mapping into a dict, performing a merge.

d1 = {'a': 1, 'b':  2, 'c':  3         }
d2 = {        'b': 20, 'c': 30, 'd': 40}

d3 = {**d1, **d2}

print(f'{d3 = }')

We have a number of different ways to do merges in Python.

We can use a collections.ChainMap

from collections import ChainMap

d1 = {'a': 1, 'b':  2, 'c':  3         }
d2 = {        'b': 20, 'c': 30, 'd': 40}

d3 = ChainMap(d2, d1)   # a live view over d2 and d1 (d2 searched first), no copying
for k, v in d3.items():
    print(f'{k = }: {v = }')
del d2['b']
del d2['c']
for k, v in d3.items():   # the deletions in d2 are reflected in the view
    print(f'{k = }: {v = }')
print(f'{d3 = }')

We can use itertools.chain with .items()

from itertools import chain

d1 = {'a': 1, 'b':  2, 'c':  3         }
d2 = {        'b': 20, 'c': 30, 'd': 40}

d3 = dict(chain(d1.items(), d2.items()))   # later items win on duplicate keys

d3 = d1.copy()     # or, equivalently, copy one dict and .update() it with the other
d3.update(d2)

print(f'{d3 = }')

We can use the ** unpacking syntax or (in Python ≥3.9) we can use the | operator.

d1 = {'a': 1, 'b':  2, 'c':  3         }
d2 = {        'b': 20, 'c': 30, 'd': 40}

d3 = {**d1, **d2}
d4 = d1 | d2

print(f'{d3 = }')
print(f'{d4 = }')

Conclusions & Consequences: What are extended unpacking syntax and the additional unpacking generalisations?

  1. Carefully consider syntax in terms of human readability and expressivity.
  2. Carefully consider syntax in terms of precision and eliminating ambiguity.
entries = [123, ..., ..., ..., ..., ..., 456]

if len(entries) < 2:
    raise ValueError('...')

head = entries[0]
tail = entries[-1]

diff = head - tail
entries = [..., ..., ..., ..., ...]
head, *_, tail = entries
def process(in_use, in_maintenance):
    for dev in in_use:
        ...
    for dev in in_maintenance:
        ...
def process(in_use, in_maintenance):
    for dev in in_use + in_maintenance: # possible TypeError!
        ...

process([..., ...], [..., ...])
process([..., ...], {..., ...})
from itertools import chain

def process(in_use, in_maintenance):
    for dev in chain(in_use, in_maintenance):
        ...
    list(chain(in_use, in_maintenance))
def process(in_use, in_maintenance):
    return [*in_use, *in_maintenance]

def process(in_use, in_maintenance):
    return {*in_use, *in_maintenance}

Question: Why even comprehension syntax?

Why does comprehension syntax exist?

Comprehension syntax was added in Python 2.0 with PEP 202 (“List Comprehensions”).

It was further extended to dict and set comprehensions in Python 2.7 and 3.0 (dict comprehensions came via PEP 274).

It was even further extended with async syntax in Python 3.6 with PEP 530 (“Asynchronous Comprehensions”).

print("Let's take a look!")

The Basics

In Python, we have a for-loop which we typically use as a “for-each” loop.

hosts = [
    'abc.corp.net',
    'xyz.corp.net',
    'def.corp.net',
]

for h in hosts:
    print(f'{h = }')

We are discouraged from using it as a C-style for-loop:

hosts = [
    'abc.corp.net',
    'xyz.corp.net',
    'def.corp.net',
]

for idx in range(len(hosts)):
    print(f'{hosts[idx] = }')

We have iteration helpers to allow us to use this as a “for-each” loop in many situations (and we are encouraged to write our own iteration helpers.)

hosts = [
    'abc.corp.net',
    'xyz.corp.net',
    'def.corp.net',
]
datacenters = [
    'ghi1',
    'ghi2',
    'klm1',
]


# for idx in range(len(hosts)):
#     h, d = hosts[idx], datacenters[idx]
#     print(f'{h} in {d}')

for h, d in zip(hosts, datacenters, strict=True):   # strict=True (Python ≥3.10) raises if the lengths differ
    print(f'{h} in {d}')

We also have comprehension syntax, but it is more limited than for-loop syntax:

hosts = [
    'abc.corp.net',
    'xyz.corp.net',
    'def.corp.net',
]

names = []
domains = set()
for h in hosts:
    n, d = h.split('.', 1)
    names.append(n)
    domains.add(d)

print(f'{names = }')
print(f'{domains = }')

… which can be rewritten as…

hosts = [
    'abc.corp.net',
    'xyz.corp.net',
    'def.corp.net',
]

names = [h.split('.', 1)[0] for h in hosts]
domains = {h.split('.', 1)[-1] for h in hosts}

print(f'{names = }')
print(f'{domains = }')

Beyond the Basics

We have a list, set, and dict comprehension (but no tuple comprehension.)

xs = [-3, -2, -1, 0, 1, 2, 3]

all_squares  = [x**2 for x in xs]
uniq_squares = {x**2 for x in xs}
squared_xs   = {x: x**2 for x in xs}

print(f'{all_squares  = }')
print(f'{uniq_squares = }')
print(f'{squared_xs   = }')
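
(A comprehension in parentheses is not a “tuple comprehension” but a generator expression; if we want a tuple, we can pass a generator expression to tuple. A small sketch:)

xs = [-3, -2, -1, 0, 1, 2, 3]

gen_squares   = (x**2 for x in xs)        # a generator expression, not a tuple
tuple_squares = tuple(x**2 for x in xs)   # build a tuple explicitly

print(f'{gen_squares   = }')
print(f'{tuple_squares = }')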

However, we cannot do the following with comprehension syntax…

from collections import defaultdict

xs = [-3, -2, -1, 0, 1, 2, 3]

squared_xs = defaultdict(set)
for x in xs:
    squared_xs[x**2].add(x)

print(f'{squared_xs   = }')

Comprehensions can have filters and multiple levels, but…

xss = [[-3, -2, -1], [0], [1, 2, 3]]

ys = []
for xs in xss:
    for x in xs:
        ys.append(x**2)

zs = [x**2 for xs in xss for x in xs]

print(f'{ys = }')
print(f'{zs = }')
xss = [[-3, -2, -1], [0], [1, 2, 3]]

ys = []
for xs in xss:
    if len(xs) > 1:
        for x in xs:
            if x % 2 == 0:
                ys.append(x**2)

zs = [x**2 for xs in xss if len(xs) > 1 for x in xs if x % 2 == 0]

print(f'{ys = }')
print(f'{zs = }')

… comprehensions cannot:

Conclusions & Consequences: Why does comprehension syntax exist?

  1. Choose more restricted approaches in order to make your code more readable.
  2. Prefer syntax for what it means, not for terseness.
    • comprehension syntax means “create some new data via a simple mapping & filtering process”
xs = [1, 2, 3, 4]

# good!
for x in xs:
    print(f'{x = }')

# misleading! — why did you create a list you didn't care about?
[print(f'{x = }') for x in xs]
# can you “skim” this?
for x in xs:
    for y in x:
        if ...:
            continue
        ...
    if ...:
        break
        ...
...
...
...
...
xs = [... for ... in ... if ...]
...
...
...
ys = [f(x) for x in xs if cond(x)]
...
...
...

Question: Why do I need context managers?

Why do I need context managers?

Context managers were added in Python 2.5 with PEP 343 (“The ‘with’ Statement”).

print("Let's take a look!")

The Basics

We often want to work with some resource, like a file or a database connection. This resource requires some set-up and tear-down.

f = open(__file__)
...
...
...
0 / 0   # ZeroDivisionError: the f.close() below never runs
...
...
f.close()

We cannot rely on ourselves to do the tear-down manually, because we could forget or because an error could occur which would cause our tear-down code not to run.

f = open(__file__)
try:
    ...
    ...
    ...
    ...
    ...
    ...
    ...
    ...
    ...
    ...
    ...
    ...
    ...
    ...
finally:
    f.close()

We cannot rely on the garbage collector to do this work for us: when (or whether) a resource is finalised is not guaranteed, so its lifetime isn’t guaranteed to be tightly scoped to a block.
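
A minimal sketch of what this means (head is a made-up helper): nothing below closes the file explicitly, so it is closed only whenever the interpreter happens to finalise the file object, which is not guaranteed to be prompt and differs between implementations.

def head(path):
    f = open(path)
    return f.readline()   # f is never closed explicitly here

print(head(__file__))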

Therefore, we need special syntax to ensure we can sequence two operations:

with open(__file__) as f:
    print(f'{not f.closed = }')
    ...

print(f'{not f.closed = }')

Beyond the Basics

This pattern is so general that we want to extend it to a variety of situations.

For example, database connections…

from sqlite3 import connect

with connect(':memory:') as conn:
    pass

… temporary directories…

from tempfile import TemporaryDirectory

with TemporaryDirectory() as d:
    pass

… or even configuration and settings.

from decimal import localcontext

with localcontext() as ctx:
    ctx.prec = 20
    ...

As with any “extension” of the Python vocabulary, we implement this with the special methods __enter__ and __exit__.

class T:
    def __enter__(self):
        print('before')   # whatever __enter__ returns is what `as …` binds to
    def __exit__(self, exc_type, exc_value, traceback):
        print('after')    # receives exception details, or (None, None, None)

with T():
    print('inside')

Context managers are a sequencing mechanism, but so are generators.

def g():
    print('before')
    yield
    print('after')

gi = g()
next(gi)
print('inside')
next(gi, None)

We can adapt one mechanism to the other using the decorator contextlib.contextmanager.

from contextlib import contextmanager

@contextmanager
def g():
    print('before')
    yield
    print('after')

with g():
    print('inside')
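
One caveat worth noting: in the example above, if the body of the with-block raises, the line after the yield never runs. In practice the yield usually sits inside try/finally (as the test_db and test_data helpers below do); a minimal variation:

from contextlib import contextmanager

@contextmanager
def g():
    print('before')
    try:
        yield
    finally:
        print('after')   # runs even if the body of the with-block raises

with g():
    print('inside')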

Conclusions & Consequences: Why do I need context managers?

  1. Context managers are a way to manage any pairing of before/after actions.
    • This often occurs with resource management (activate/deactivate, open/close, &c.)
  2. Context managers are very easy to write using the contextlib.contextmanager decorator: write them often!
  3. Context managers show the nesting of operations very clearly.
with test_db() as db:
    with test_data(db): # baseline
        ...
        with test_data(db): # scenario #1
            ...
    with test_data(db): # alternate baseline
        ...
    with test_data(db): # alternate baseline
        ...
from contextlib import contextmanager
from sqlite3 import connect
from tempfile import TemporaryDirectory
from pathlib import Path
from random import choice, randrange
from string import ascii_lowercase

@contextmanager
def test_db():
    create = '''
        create table test (
            name text
          , value number
        );
    '''
    drop = 'drop table test'
    with TemporaryDirectory() as d:
        d = Path(d)
        with connect(d / 'test.db') as db:
            try:
                db.execute(create)
                yield db
            finally:
                db.execute(drop)

@contextmanager
def test_data(db):
    data = [
        (''.join(choice(ascii_lowercase) for _ in range(2)), randrange(100))
        for _ in range(10)
    ]
    try:
        db.executemany('insert into test values (?, ?)', data)
        yield
    finally:
        db.executemany('delete from test where name=? and value=?', data)

with test_db() as db:
    with test_data(db):
        ...
        with test_data(db):
            cur = db.execute('select name, sum(value) from test group by name limit 3')
            for row in cur: print(f'{row = }')
        with test_data(db):
            ...
    with test_data(db):
        ...

Question: Why do I need asyncio?

Why do I need asyncio?

Special syntax for asyncio was added in Python 3.5 with PEP 492 (“Coroutines with async and await syntax”).

print("Let's take a look!")

The Basics

If I want to work concurrently, I have the following choices:

In threading, I have one process:

from threading import Thread
from queue import Queue
from dataclasses import dataclass
from string import ascii_lowercase
from random import choice
from time import sleep

@dataclass
class Job:
    name : str
    @classmethod
    def from_random(cls):
        name = ''.join(choice(ascii_lowercase) for _ in range(4))
        return cls(name)

def producer(q):
    while True:
        for _ in range(choice([1, 2])):
            j = Job.from_random()
            print(f'Enqueueing job {j = }')
            q.put(j)
        sleep(1)

def consumer(name, q):
    while True:
        j = q.get()
        print(f'Servicing job {j = } @ {name = }')
        sleep(.1)

def main():
    q = Queue()
    pool = [
        Thread(target=producer, kwargs={'q': q}),
        Thread(target=consumer, kwargs={'q': q, 'name': 'consumer#1'}),
        Thread(target=consumer, kwargs={'q': q, 'name': 'consumer#2'}),
    ]
    for x in pool: x.start()
main()

In multiprocessing, I have multiple processes:

from multiprocessing import Process, Queue
from dataclasses import dataclass
from string import ascii_lowercase
from random import choice
from time import sleep

@dataclass
class Job:
    name : str
    @classmethod
    def from_random(cls):
        name = ''.join(choice(ascii_lowercase) for _ in range(4))
        return cls(name)

def producer(q):
    while True:
        for _ in range(choice([1, 2])):
            j = Job.from_random()
            print(f'Enqueueing job {j = }')
            q.put(j)
        sleep(1)

def consumer(name, q):
    while True:
        j = q.get()
        print(f'Servicing job {j = } @ {name = }')
        sleep(.1)

def main():
    q = Queue()
    pool = [
        Process(target=producer, kwargs={'q': q}),
        Process(target=consumer, kwargs={'q': q, 'name': 'consumer#1'}),
        Process(target=consumer, kwargs={'q': q, 'name': 'consumer#2'}),
    ]
    for x in pool: x.start()
main()   # note: under the “spawn” start method (e.g., on Windows), this needs an `if __name__ == '__main__':` guard

Therefore:

Beyond the Basics

As a third option, we have asyncio. On its face, it looks similar to threading:

In asyncio, I have one process:

Note the special async and await syntax:

from asyncio import gather, run, sleep as aio_sleep
from asyncio.queues import Queue
from dataclasses import dataclass
from string import ascii_lowercase
from random import choice

@dataclass
class Job:
    name : str
    @classmethod
    def from_random(cls):
        name = ''.join(choice(ascii_lowercase) for _ in range(4))
        return cls(name)

async def producer(q):
    while True:
        for _ in range(choice([1, 2])):
            j = Job.from_random()
            print(f'Enqueueing job {j = }')
            await q.put(j)
        await aio_sleep(1)

async def consumer(name, q):
    while True:
        j = await q.get()
        print(f'Servicing job {j = } @ {name = }')
        await aio_sleep(.1)

async def main():
    q = Queue()
    tasks = [
        producer(q=q),
        consumer(q=q, name='consumer#1'),
        consumer(q=q, name='consumer#2'),
    ]
    await gather(*tasks)

run(main())

Conclusions & Consequences: Why do I need asyncio?

  1. Consider the following axes when introducing concurrency:
    • Am I compute bound? → use multiprocessing
    • Am I I/O bound? → use threading or asyncio
    • Do I want coöperative scheduling? → use asyncio
    • Do I want preëmptive scheduling? → use threading
    • Do I need to pass complex data without crossing a runtime (process) boundary? → use threading or asyncio
  2. asyncio’s coöperative scheduling means everyone needs to coöperate. Consider asyncio on day one of your project (see the sketch below).
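
A minimal sketch of what “everyone needs to coöperate” means (the function names are made up): a blocking call such as time.sleep inside a coroutine stalls the whole event loop, whereas await asyncio.sleep yields control so other tasks can run.

from asyncio import gather, run, sleep as aio_sleep
from time import sleep

async def cooperative(name):
    for _ in range(3):
        print(f'{name}: working')
        await aio_sleep(.1)   # yields control back to the event loop

async def uncooperative(name):
    for _ in range(3):
        print(f'{name}: working')
        sleep(.1)             # blocks the event loop; no other task can run meanwhile

async def main():
    await gather(cooperative('a'), cooperative('b'))     # 'a' and 'b' interleave
    await gather(uncooperative('c'), cooperative('d'))   # 'c' runs to completion before 'd' starts

run(main())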