seminars.fb

Programming Fundamentals → “Complexity Analysis, Big-O, Data Structures, and Algorithms”

Seminar (Fri, Mar 12, 2021; 1 PM PST)

Theme: Programming Fundamentals

Topic: Complexity Analysis, Big-O, Data Structures, and Algorithms

Keywords: complexity analysis, big-o, data structures, and algorithms

Presenter: James Powell (james@dutc.io)
Date: Friday, March 12, 2021
Time: 1:00 PM PST

Okay, so your code is slow. That’s no good! Nobody likes slow code. But what does it really matter? And when does it really matter? And what does it all really mean?

This seminar will present an introductory view of a critical topic in software development: complexity analysis and Big-O notation as it applies to our choices of algorithms and data structures in Python programs. We will cover important questions like: is Python “fast” or “slow”? When do our choices of data structures and algorithms actually matter, and how do we reason about them?

About

Don’t Use This Code; Training & Consulting

Don’t Use This Code is a professional training, coaching, and consulting company. We are deeply invested in the open source scientific computing community, and are dedicated to bringing better processes, better tools, and better understanding to the world.

Don’t Use This Code is growing! We are currently seeking new partners, new clients, and new engagements within Facebook for our expert consulting and training services.

Teams looking to better employ these tools would benefit from our wide range of training courses, from an intensive introduction to Python fundamentals to advanced applications of Python for building large-scale production systems. Working with your team, we can craft targeted curricula to meet your training goals. We are also available for consulting services, such as building scientific computing and numerical analysis systems using technologies like Python and React.

We pride ourselves on delivering top-notch training. We are committed to providing quality training, and we do so by investing in three key areas: our content, our processes, and our contributors.

James Powell; Consultant, Instructor, & Presenter

James Powell is a professional Python programmer and enthusiast. He got his start with the language by building reporting and analysis systems for proprietary trading offices; now, he uses his experience as a consultant for those building data engineering and scientific computing platforms for a wide range of clients using cutting-edge open source tools like Python and React.

He also currently serves as a Board Director, Chair, and Vice President at NumFOCUS, the 501(c)(3) non-profit that supports all the major tools in the Python data analysis ecosystem (e.g., pandas, numpy, jupyter, matplotlib). At NumFOCUS, he helps build global open source communities for data scientists, data engineers, and business analysts. He helps NumFOCUS run the PyData conference series and has sat on speaker selection and organizing committees for 18 conferences. James is also a prolific speaker: since 2013, he has given over seventy (70) conference talks at over fifty (50) Python events worldwide.

Notes

Contents

Prologue

We will use the following simple context manager for assessing timings:

from contextlib import contextmanager
from time import perf_counter

@contextmanager
def timed(msg, prec=2):
    try:
        start = perf_counter()
        yield
    finally:
        stop = perf_counter()
        # `prec` controls how many decimal places of the elapsed time we report
        print(f'{msg:<32} elapsed \N{greek capital letter delta}t: {stop-start:.{prec}f}s')

Let’s quickly check that it works:

from time import sleep
with timed('test #1'):
    sleep(.1)
with timed('test #2'):
    sleep(.5)
with timed('test #3'):
    with timed('test #3-i'):
        sleep(.25)
    with timed('test #3-ii'):
        sleep(.25)
    with timed('test #3-iii'):
        sleep(.5)

Looks good!

Motivation

Measurements & Testing

Let’s take a quick quiz to see how well informed we are about what is fast and slow in Python.

Key Questions:

- is Python “fast” or “slow”?
- are lists and dicts “fast” or “slow”?
- are generators “fast” or “slow”?
- are comprehensions “fast” or “slow”?
- is numpy “fast” or “slow”?
- is pandas “fast” or “slow”?

Question: which is faster?

# # generate a big bunch of numbers

from random import randint
with timed('i.   for-loop'):
    xs = []
    for _ in range(1_000_000):
        xs.append(randint(-1_000, 1_000))

from random import randint
with timed('ii.  list comprehension'):
    xs = [randint(-1_000, 1_000) for _ in range(1_000_000)]

from random import randint
with timed('iii. generator expression'):
    xs = list(randint(-1_000, 1_000) for _ in range(1_000_000))

from itertools import starmap, repeat
from random import randint
with timed('iv.  list(itertools.starmap(…))'):
    xs = list(starmap(randint, repeat((-1_000, 1_000), 1_000_000)))

from itertools import starmap, repeat
from random import randint
with timed('v.   [*itertools.starmap(…)]'):
    xs = [*starmap(randint, repeat((-1_000, 1_000), 1_000_000))]

from numpy.random import randint
with timed('vi.  numpy.random.randint(…)'):
    xs = randint(-1_000, 1_000, size=1_000_000)

Question: which is faster?

from random import randint
xs = [randint(-1_000, 1_000) for _ in range(1_000_000)]
with timed('i.   Python → sum([ … ])'):
    sum(xs)

from numpy.random import randint
xs = randint(-1_000, 1_000, size=1_000_000)
with timed('ii.  `numpy` → array(…).sum()', prec=6):
    xs.sum()

Question: which is faster?

from random import randint
dot = lambda xs, ys: sum(x * y for x, y in zip(xs, ys))
xs = [randint(-1_000, 1_000) for _ in range(1_000_000)]
ys = [randint(-1_000, 1_000) for _ in range(1_000_000)]
with timed('i.   Python → dot([…], […])', prec=4):
    dot(xs, ys)

from numpy.random import randint
xs = randint(-1_000, 1_000, size=1_000_000)
ys = randint(-1_000, 1_000, size=1_000_000)
with timed('ii.  `numpy` → array(…).dot(…)', prec=4):
    xs.dot(ys)

from numpy.random import randint
dot = lambda xs, ys: sum(x * y for x, y in zip(xs, ys))
xs = randint(-1_000, 1_000, size=1_000_000)
ys = randint(-1_000, 1_000, size=1_000_000)
with timed('iii. `numpy` → dot(array(…), …)', prec=4):
    dot(xs, ys)

Question: which is faster?

from random import choice
from string import ascii_lowercase
xs = [''.join(choice(ascii_lowercase) for _ in range(4)) for _ in range(100_000)]
with timed('i.   py → … in […]', prec=4):
    'xxxxx' in xs


from pandas import array
from random import choice
from string import ascii_lowercase
xs = array([''.join(choice(ascii_lowercase) for _ in range(4)) for _ in range(100_000)])
with timed('ii.  pd → … in array(…)', prec=4):
    'xxxxx' in xs

from pandas import Series
from random import choice
from string import ascii_lowercase
xs = Series(dtype='float64', index=[''.join(choice(ascii_lowercase) for _ in range(4)) for _ in range(100_000)])
with timed('iii. pd → … in Series(…)', prec=4):
    'xxxxx' in xs

from numpy import array
from numpy.random import choice
from string import ascii_lowercase
xs = choice([*ascii_lowercase], size=(4, 100_000)).view('<U4').ravel()
with timed('iv.  np → … in array(<…>)', prec=4):
    'xxxxx' in xs

from pandas import Series
from numpy.random import choice
from string import ascii_lowercase
xs = Series(dtype='float64', index=choice([*ascii_lowercase], size=(4, 100_000)).view('<U4').ravel())
with timed('v.   pd → … in Series(<…>)', prec=4):
    'xxxxx' in xs

from pandas import Series, Categorical
from numpy.random import choice
from string import ascii_lowercase
xs = Series(dtype='float64', index=Categorical(choice([*ascii_lowercase], size=(4, 100_000)).view('<U4').ravel()))
with timed('vi.  pd → … in Categorical(…)', prec=4):
    'xxxxx' in xs

from random import choice
from string import ascii_lowercase
xs = {''.join(choice(ascii_lowercase) for _ in range(4)) for _ in range(100_000)}
with timed('vii. py → … in {…}', prec=6):
    'xxxxx' in xs

Question: which is faster?

from random import randint
f = lambda x: x ** 2
xs = [randint(-1_000, 1_000) for _ in range(100_000)]
with timed('i.   py → comprehension', prec=4):
    [f(x) for x in xs]

#  from random import randint
#  f = lambda x: x ** 2
#  xs = [randint(-1_000, 1_000) for _ in range(100_000)]
#  with timed('ii.  py → [*map(…)]', prec=4):
#      [*map(f, xs)]

#  from random import randint
#  f = lambda x: x ** 2
#  xs = [randint(-1_000, 1_000) for _ in range(100_000)]
#  with timed('iii. py → list(map(…))', prec=4):
#      list(map(f, xs))

#  from numpy.random import randint
#  f = lambda x: x ** 2
#  xs = randint(-1_000, 1_000, size=100_000)
#  with timed('iv.  numpy → vectorised', prec=4):
#      f(xs)

Question: which is faster?

from numpy import empty_like, where
from numpy.random import randint
from contextlib import contextmanager

results = []
@contextmanager
def _attempt(*args, **kwargs):
    # time the wrapped block, then record the module-level `res` it assigned,
    # so that the attempts can be checked for equivalence afterwards
    with timed(*args, **kwargs):
        yield
    results.append(res)

xs = randint(-1_000, 1_000, size=1_500_000)
# goal: x > 0 → x²; x ≤ 0 → x³

with _attempt('i.   `empty_like(…)`', prec=4):
    res = empty_like(xs)
    res[xs >  0] = xs[xs >  0] ** 2
    res[xs <= 0] = xs[xs <= 0] ** 3

res = xs.copy()
with _attempt('ii.  `.copy()` & in-place', prec=4):
    res[xs >  0] **= 2
    res[xs <= 0] **= 3

with _attempt('iii. `numpy.where(…)`', prec=4):
    res = where(xs > 0, xs ** 2, xs ** 3)

with _attempt('iv.   mask', prec=4):
    res = ((xs > 0) * (xs ** 2)) + ((xs <= 0) * (xs ** 3))

with _attempt('v.    mask (precompute)', prec=4):
    mask = xs > 0
    res = mask * (xs ** 2) +  ~mask * (xs ** 3)

#  assert all((x == y).all() for x, y in zip(results, results[1:]))
#  print('All equivalent!')

The results of the above may be perplexing, but they mean we need a better view of “performance” and “scale.”

Big-O Notation & Algorithmic Complexity

First, let’s introduce a notation used frequently in Computer Science: “big-O.”

from time import sleep

def f(dataset):
    sleep(.5) # set-up time
    for x in dataset:
        sleep(.01) # ongoing time

def g(dataset):
    sleep(.01) # set-up time
    for x in dataset:
        sleep(.02) # ongoing time

with timed('i.  f(<10 elements>)'):
    f(range(10))

with timed('ii.  g(<10 elements>)'):
    g(range(10))

with timed('iii. f(<100 elements>)'):
    f(range(100))

with timed('iv.  g(<100 elements>)'):
    g(range(100))

from numpy import linspace
from matplotlib.pyplot import plot, show

xs = linspace(1, 100, 101)
f_t = .5  + xs * 0.01
g_t = .01 + xs * 0.02

plot(xs, f_t)
plot(xs, g_t)
show()

Formally:

Big-O notation boils down to: f(x) = O(g(x)) as x → ∞ iff ∃ M > 0 and x₀ ∈ ℝ s.t. |f(x)| ≤ M·g(x), ∀ x ≥ x₀
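
For example, f(x) = 3x² + 5x + 2 is O(x²): take M = 4 and x₀ = 6; then |3x² + 5x + 2| ≤ 4x² for every x ≥ 6. The particular constants are not unique; any sufficiently large pair works.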

Other “Bachmann-Landau” notations include little-o (o), Big-Omega (Ω), little-omega (ω), and Big-Theta (Θ).

See Wikipedia: Big-O Notation for more background.

But, typically, we will only care about (and be able to assess) Big-O.

Within Big-O, we typically ignore constant (multiplicative) factors and lower-order terms. For example, both of the following squaring operations are O(n), even though their constant factors differ considerably:

from numpy.random import randint
xs = randint(-1_000, 1_000, size=1_000_000)
ys = xs ** 2 # O(n)

from random import randint
xs = [randint(-1_000, 1_000) for _ in range(1_000_000)]
ys = [x**2 for x in xs] # O(n)

from itertools import islice
from random import randint
xs = [randint(-1_000, 1_000) for _ in range(100_000)]

results = {}
with timed('i.   `list` vs `set` vs `dict`'):
    uniques = []
    for x in xs: # O(n)
        if x not in uniques: # O(n) → O(n²)
            uniques.append(x)
results['list'] = uniques

with timed('ii.  `list` vs `set` vs `dict`'):
    uniques = {*''}  # an empty set ({} would be an empty dict)
    for x in xs: # O(n)
        if x not in uniques: # O(1) → O(n)
            uniques.add(x)
results['set'] = uniques
#  print(f'{[*islice(results["list"], 10)] = }')
#  print(f'{[*islice(results["set" ], 10)] = }')

with timed('iii. `list` vs `set` vs `dict`'):
    uniques = {}
    for x in xs: # O(n)
        if x not in uniques: # O(1) → O(n)
            uniques[x] = None
    uniques = [*uniques]
results['dict'] = uniques
print(f'{[*islice(results["list"], 10)] = }')
print(f'{[*islice(results["dict"], 10)] = }')

Typically, we can identify and discuss algorithms and operations on data structures in terms of the following bounding functions: constant O(1), logarithmic O(log n), linear O(n), linearithmic O(n log n), and quadratic O(n²).
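
As a rough reference (typical CPython behaviour for the built-in types; a sketch, not an exhaustive table):

xs = list(range(1_000))
s  = set(xs)
d  = dict.fromkeys(xs)

xs.append(1_000)   # O(1) amortised: append to the end of a list
xs.insert(0, -1)   # O(n): every element must shift over
999 in xs          # O(n): linear scan of the list
999 in s           # O(1) on average: hash lookup in a set
999 in d           # O(1) on average: hash lookup in a dict
xs.sort()          # O(n log n): comparison sort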

Let’s look at a comparison of the following: searching for an element in a Python list versus searching for it in a Python set.

from random import randint
for size in [1, 10, 100, 1_000, 10_000, 100_000, 1_000_000]:
    xs = [randint(-1_000, 1_000) for _ in range(size)]
    with timed(f'… in […] {size = :<9,}', prec=6):
        99_999 in xs

This makes sense: the Python list is solely “human-ordered.” There is insufficient machine-available structure to perform this search without checking every single element, so we should see a direct relationship between the number of elements and the time it takes to perform the search.
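
As a rough sketch (assuming the timed helper from the Prologue is in scope), this is essentially the scan that the `in` operator must perform on a list:

from random import randint

def contains(xs, target):
    for x in xs:          # check every element, one by one
        if x == target:
            return True
    return False

xs = [randint(-1_000, 1_000) for _ in range(1_000_000)]
with timed('manual linear scan', prec=6):
    contains(xs, 99_999)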

from random import randint
for size in [1, 10, 100, 1_000, 10_000, 100_000, 1_000_000]:
    xs = {randint(-1_000, 1_000) for _ in range(size)}
    with timed(f'… in  {size = :<9,}', prec=6):
        99_999 in xs

This makes sense: the Python set is solely “machine-ordered.” There is sufficient machine-available structure to perform this search without checking every single element. Instead, the Python interpreter computes a hash of the search key, uses that hash to identify the likely position where the data would be stored, and checks that position directly. In the case of a collision (since the hash value must be reduced to an array index, and the domain of array indices is smaller than the domain of possible input values, the pigeonhole principle guarantees overlaps), the search through the structure should still be relatively short: the Python set keeps its “load factor” below a certain threshold, which mitigates the possibility of collisions turning this search into a linear scan.
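
A toy illustration of the idea (not CPython’s actual implementation, which uses open addressing and resizes its table as needed):

key = 'xxxxx'
table_size = 8                   # illustrative size only
bucket = hash(key) % table_size  # the hash narrows the search to one slot
# note: string hashes are salted per process, so the values vary between runs
print(f'{hash(key) = }')
print(f'{bucket = }')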

We typically see constant time operations corresponding to operations where we have sufficient machine structuring to directly identify the answer, e.g.,

for val in (10**x for x in range(20)):
    x = float(val) - 1
    with timed(f'float(…) + 1 = {val:<9,e}', prec=6):
        x + 1

for val in (10**x for x in range(20)):
    x = int(val) - 1
    with timed(f'int(…) + 1 = {val:<9,e}', prec=6):
        x + 1

for val in [1, 10, 10**10, 10**20, 10**30, 10**40]:
    x = int(val) - 1
    with timed(f'int(…) + 1 = {val:<9,e}', prec=6):
        x + 1

for p, val in ((p, 10**p) for p in [0, 1, 10, 100, 1_000, 10_000, 100_000, 1_000_000, 2_000_000, 4_000_000]):
    x = int(val) - 1
    with timed(f'int(…) + 1 = 10**{p:<1,.0f}', prec=6):
        x + 1

Let’s take a look at common operations of the following complexity: logarithmic, O(log n), and linearithmic, O(n log n).

Recall:

from matplotlib.pyplot import plot, show
from numpy import linspace, log
xs = linspace(1, 10, 100)
plot(xs, log(xs),      color='red')   # O(ln x)
plot(xs, xs,           color='green') # O(x)
plot(xs, xs * log(xs), color='blue')  # O(x ln x)
show()

Logarithmic time complexity is typically encountered when we can “divide and conquer.”

For example, searching through a structure that is already sorted.

from numpy.random import randint
from numpy import searchsorted

xs = randint(-1000, 1000, size=1_000_000)
with timed('i. unsorted', prec=6):
    99_999 in xs

xs.sort()
with timed('ii. sorted', prec=6):
    searchsorted(xs, 99_999)

from numpy.random import randint

for size in [10**x for x in range(8)]:
    xs = randint(-1000, 1000, size=size)
    with timed(f'unsorted {size = :,}', prec=6):
        99_999 in xs

from numpy.random import randint
from numpy import searchsorted

for size in [10**x for x in range(8)]:
    xs = randint(-1000, 1000, size=size)
    xs.sort()
    with timed(f'sorted {size = :,}', prec=6):
        searchsorted(xs, 99_999)

This makes sense; if the structure is sorted, we can look at the median element, determine if the element we want is larger or smaller, and immediately eliminate half of the items we want to compare. At each successive step, we can eliminate half of the items under consideration!
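
A pure-Python counterpart (a sketch using the standard-library bisect module, and assuming the timed helper from the Prologue is in scope):

from bisect import bisect_left
from random import randint

xs = sorted(randint(-1_000, 1_000) for _ in range(1_000_000))
with timed('bisect_left(…)', prec=6):
    i = bisect_left(xs, 99_999)              # binary search: O(log n) comparisons
    found = i < len(xs) and xs[i] == 99_999  # confirm an exact match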

Recall: the pandas index can be sorted & knows that it is sorted!

from pandas import Series
from numpy.random import randint
s = Series(0, index=randint(-1000, 1000, size=10))
print(f'{s.index = }')
print(f'{s.index.is_monotonic = }')
#  s = s.sort_index()
#  print(f'{s.index = }')
#  print(f'{s.index.is_monotonic = }')

from pandas import Series
from numpy.random import randint, normal

for size in [10**x for x in range(8)]:
    s = Series(normal(size=size), index=randint(-1_000, 1_000, dtype='int16', size=size))
    s = s.sort_index()
    with timed(f'.loc with {size = :,}', prec=6):
        try:
            #  s.iloc[0]
            s.loc[s.index[0]]
        except KeyError:
            pass

(Well, the world is not always fair.)

Linearithmic operations often appear in fast sorts themselves! (The fastest possible comparison sort is linearithmic in time.)
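
For instance (a sketch assuming the timed helper from the Prologue is in scope), timing Python’s built-in sorted on growing inputs:

from random import randint

for size in [10**x for x in range(7)]:
    xs = [randint(-1_000, 1_000) for _ in range(size)]
    with timed(f'sorted(…) {size = :,}', prec=6):
        sorted(xs)  # Timsort: O(n log n) comparisons in the worst case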

from numpy.random import randint
from numpy import searchsorted

for size in [10**x for x in range(8)]:
    xs = randint(-1000, 1000, size=size)
    with timed(f'sort {size = :,}', prec=6):
        xs.sort()

Let’s take a look at common operations of the following complexity: quadratic, O(n²).

Quadratic operations are very dangerous in practice, because they grow so quickly that code that works fine on a test dataset can prove completely intractable on a production dataset.
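
A sketch of the classic pattern (assuming the timed helper from the Prologue is in scope): a nested loop that compares every element against every other element.

from random import randint

for size in [100, 1_000, 2_000, 4_000]:
    xs = [randint(-1_000, 1_000) for _ in range(size)]
    with timed(f'pairwise {size = :,}', prec=6):
        # O(n²): doubling the input roughly quadruples the work
        matches = sum(1 for x in xs for y in xs if x == y)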

from numpy import convolve
from numpy.random import normal

for size in [*(10**x for x in range(1, 7)), 2_500_000, 5_000_000, 10_000_000]:
    xs = normal(size=size//2)
    #  k  = normal(size=3)
    k  = normal(size=size//2)
    with timed(f'convolve {size = :,}', prec=6):
        convolve(xs, k)

Putting It to Use

So how do we operationalise this information?

  1. Analyse your problem for clear markers for Big-O complexity.
  2. Develop better intuition.

Simple example:

from pandas import to_datetime
from datetime import timedelta

us_holidays = to_datetime(['2021-01-01', '2021-07-04', '2021-05-31', '2021-12-25'])
def nth_business_date(dt, n, holidays=us_holidays):
    while n > 0:
        dt += timedelta(days=1)
        while dt.weekday() in {5, 6} or dt in holidays:
            dt += timedelta(days=1)
        n -= 1
    return dt

for _ in range(1):
    today = to_datetime('today')
    print(f'{today = }')
    nbd = nth_business_date(today, n=2)
    print(f'{nbd   = }')
    dts = [
        nth_business_date(today, n=30),
        nth_business_date(today, n=60),
        nth_business_date(today, n=90),
    ]
    print(f'{dts   = }')

from pandas import to_datetime
from datetime import timedelta

us_holidays = {*to_datetime(['2021-01-01', '2021-07-04', '2021-05-31', '2021-12-25'])}
def nth_business_date(dt, n, holidays=us_holidays):
    while n > 0:
        dt += timedelta(days=1)
        while dt.weekday() in {5, 6} or dt in holidays:
            dt += timedelta(days=1)
        n -= 1
    return dt

today = to_datetime('today')
nbd = nth_business_date(today, n=2)
dts = [
    nth_business_date(today, n=30),
    nth_business_date(today, n=60),
    nth_business_date(today, n=90),
]
for _ in range(1):
    print(f'{today = }')
    print(f'{nbd   = }')
    print(f'{dts   = }')

from pandas import to_datetime
from datetime import timedelta
from itertools import islice, count

us_holidays = {*to_datetime(['2021-01-01', '2021-07-04', '2021-05-31', '2021-12-25'])}
def business_dates(dt, holidays=us_holidays):
    all_dates = (dt + timedelta(days=n) for n in count())
    return (dt for dt in all_dates if dt.weekday() not in {5, 6} and dt not in holidays)

today = to_datetime('today')
nbd = [*islice(business_dates(today), 2)]
dts = [*islice(business_dates(today), 91)]
dts = dts[30], dts[60], dts[90]

for _ in range(1):
    print(f'{today = }')
    print(f'{nbd   = }')
    print(f'{dts   = }')
