Theme: Programming Fundamentals
Topic: Complexity Analysis, Big-O, Data Structures, and Algorithms
Keywords: complexity analysis, big-o, data structures, and algorithms
Presenter: James Powell (james@dutc.io)
Date: Friday, March 12, 2021
Time: 1:00 PM PST
Okay, so your code is slow. That’s no good! Nobody likes slow code. But what does it really matter? And when does it really matter? And what does it all really mean?
This seminar will present an introductory view of a critical topic in software development: complexity analysis and Big-O notation as it applies to our choices of algorithms and data structures in Python programs. We will cover important questions like: is Python “fast” or “slow”? Are its core data structures “fast” or “slow”? And where do tools like numpy and pandas fit in?
Don’t Use This Code is a professional training, coaching, and consulting company. We are deeply invested in the open source scientific computing community, and are dedicated to bringing better processes, better tools, and better understanding to the world.
Don’t Use This Code is growing! We are currently seeking new partners, new clients, and new engagements within Facebook for our expert consulting and training services.
Teams looking to better employ these tools would benefit from our wide range of training courses on offer, ranging from an intensive introduction to Python fundamentals to advanced applications of Python for building large-scale, production systems. Working with your team, we can craft targeted curricula to meet your training goals. We are also available for consulting services such as building scientific computing and numerical analysis systems using technologies like Python and React.
We pride ourselves on delivering top-notch training. We are committed to providing quality training, and we do so by investing in three key areas: our content, our processes, and our contributors.
James Powell is a professional Python programmer and enthusiast. He got his start with the language by building reporting and analysis systems for proprietary trading offices; now, he uses his experience as a consultant for those building data engineering and scientific computing platforms for a wide range of clients using cutting-edge open source tools like Python and React.
He also currently serves as a Board Director, Chair, and Vice President at NumFOCUS, the 501(c)(3) non-profit that supports all the major tools in the Python data analysis ecosystem (e.g., pandas, numpy, jupyter, matplotlib). At NumFOCUS, he helps build global open source communities for data scientists, data engineers, and business analysts. He helps NumFOCUS run the PyData conference series and has sat on speaker selection and organizing committees for 18 conferences. James is also a prolific speaker: since 2013, he has given over seventy (70) conference talks at over fifty (50) Python events worldwide.
We will use the following simple context manager for assessing timings:
from contextlib import contextmanager
from time import perf_counter

@contextmanager
def timed(msg, prec=2):
    # `prec` controls how many decimal places of the elapsed time are reported
    start = perf_counter()
    try:
        yield
    finally:
        stop = perf_counter()
        print(f'{msg:<24} elapsed \N{greek capital letter delta}t: {stop-start:.{prec}f}s')
Let’s quickly check that it works:
from time import sleep
with timed('test #1'):
    sleep(.1)

with timed('test #2'):
    sleep(.5)

with timed('test #3'):
    with timed('test #3-i'):
        sleep(.25)
    with timed('test #3-ii'):
        sleep(.25)
    with timed('test #3-iii'):
        sleep(.5)
Looks good!
We’ll be comparing pure Python against numpy, pandas, and other tools. Let’s take a quick quiz to see how well informed we are about what is fast and slow in Python.
Key Questions:
- is Python “fast” or “slow”?
- are lists and dicts “fast” or “slow”?
- are generators “fast” or “slow”?
- are comprehensions “fast” or “slow”?
- is numpy “fast” or “slow”?
- is pandas “fast” or “slow”?
Question: Which is faster?
# generate a big bunch of numbers
from random import randint
with timed('i. for-loop'):
    xs = []
    for _ in range(1_000_000):
        xs.append(randint(-1_000, 1_000))
from random import randint
with timed('ii. list comprehension'):
    xs = [randint(-1_000, 1_000) for _ in range(1_000_000)]
from random import randint
with timed('iii. generator expression'):
    xs = list(randint(-1_000, 1_000) for _ in range(1_000_000))
from itertools import starmap, repeat
from random import randint
with timed('iv. list(itertools.starmap(…))'):
    xs = list(starmap(randint, repeat((-1_000, 1_000), 1_000_000)))
from itertools import starmap, repeat
from random import randint
with timed('v. [*itertools.starmap(…)]'):
    xs = [*starmap(randint, repeat((-1_000, 1_000), 1_000_000))]
from numpy.random import randint
with timed('vi. numpy.random.randint(…)'):
    xs = randint(-1_000, 1_000, size=1_000_000)
Question: which is faster?
from random import randint
xs = [randint(-1_000, 1_000) for _ in range(1_000_000)]
with timed('i. Python → sum([ … ])'):
    sum(xs)
from numpy.random import randint
xs = randint(-1_000, 1_000, size=1_000_000)
with timed('ii. `numpy` → array(…).sum()', prec=6):
    xs.sum()
Question: which is faster?
from random import randint
dot = lambda xs, ys: sum(x * y for x, y in zip(xs, ys))
xs = [randint(-1_000, 1_000) for _ in range(1_000_000)]
ys = [randint(-1_000, 1_000) for _ in range(1_000_000)]
with timed('i. Python → dot([…], […])', prec=4):
    dot(xs, ys)
from numpy.random import randint
xs = randint(-1_000, 1_000, size=1_000_000)
ys = randint(-1_000, 1_000, size=1_000_000)
with timed('ii. `numpy` → array(…).dot(…)', prec=4):
    xs.dot(ys)
from numpy.random import randint
dot = lambda xs, ys: sum(x * y for x, y in zip(xs, ys))
xs = randint(-1_000, 1_000, size=1_000_000)
ys = randint(-1_000, 1_000, size=1_000_000)
with timed('iii. `numpy` → dot(array(…), …)', prec=4):
    dot(xs, ys)
Question: which is faster?
from random import choice
from string import ascii_lowercase
xs = [''.join(choice(ascii_lowercase) for _ in range(4)) for _ in range(100_000)]
with timed('i. py → … in […]', prec=4):
    'xxxxx' in xs
from pandas import array
from random import choice
from string import ascii_lowercase
xs = array([''.join(choice(ascii_lowercase) for _ in range(4)) for _ in range(100_000)])
with timed('iii. pd → … in array(…)', prec=4):
    'xxxxx' in xs
from pandas import Series
from random import choice
from string import ascii_lowercase
xs = Series(dtype='float64', index=[''.join(choice(ascii_lowercase) for _ in range(4)) for _ in range(100_000)])
with timed('iv. pd → … in Series(…)', prec=4):
    'xxxxx' in xs
from numpy import array
from numpy.random import choice
from string import ascii_lowercase
xs = choice([*ascii_lowercase], size=(4, 100_000)).view('<U4').ravel()
with timed('v. np → … in array(<…>)', prec=4):
    'xxxxx' in xs
from pandas import Series
from numpy.random import choice
from string import ascii_lowercase
xs = Series(dtype='float64', index=choice([*ascii_lowercase], size=(4, 100_000)).view('<U4').ravel())
with timed('vi. pd → … in Series(<…>)', prec=4):
    'xxxxx' in xs
from pandas import Series, Categorical
from numpy.random import choice
from string import ascii_lowercase
xs = Series(dtype='float64', index=Categorical(choice([*ascii_lowercase], size=(4, 100_000)).view('<U4').ravel()))
with timed('vii. pd → … in Categorical(…))', prec=4):
    'xxxxx' in xs
from random import choice
from string import ascii_lowercase
xs = {''.join(choice(ascii_lowercase) for _ in range(4)) for _ in range(100_000)}
with timed('ii. py → … in {…}', prec=6):
    'xxxxx' in xs
Question: which is faster?
from random import randint
f = lambda x: x ** 2
xs = [randint(-1_000, 1_000) for _ in range(100_000)]
with timed('i. py → comprehension', prec=4):
    [f(x) for x in xs]
from random import randint
f = lambda x: x ** 2
xs = [randint(-1_000, 1_000) for _ in range(100_000)]
with timed('ii. py → [*map(…)]', prec=4):
    [*map(f, xs)]

from random import randint
f = lambda x: x ** 2
xs = [randint(-1_000, 1_000) for _ in range(100_000)]
with timed('iii. py → list(map(…))', prec=4):
    list(map(f, xs))

from numpy.random import randint
f = lambda x: x ** 2
xs = randint(-1_000, 1_000, size=100_000)
with timed('iv. numpy → vectorised', prec=4):
    f(xs)
Question: which is faster?
from numpy import empty_like, where
from numpy.random import randint
from contextlib import contextmanager
results = []
@contextmanager
def _attempt(*args, **kwargs):
    with timed(*args, **kwargs):
        yield
    results.append(res)  # `res` is the module-level result assigned inside each `with` body
xs = randint(-1_000, 1_000, size=1_500_000)
# x > 0 → x²
# x ≤ 0 → x³
with _attempt('i. `empty_like(…)`', prec=4):
    res = empty_like(xs)
    res[xs > 0] = xs[xs > 0] ** 2
    res[xs <= 0] = xs[xs <= 0] ** 3
res = xs.copy()
with _attempt('ii. `.copy()` & in-place', prec=4):
    res[xs > 0] **= 2
    res[xs <= 0] **= 3
with _attempt('iii. `numpy.where(…)`', prec=4):
    res = where(xs > 0, xs ** 2, xs ** 3)
with _attempt('iv. mask', prec=4):
    res = ((xs > 0) * (xs ** 2)) + ((xs <= 0) * (xs ** 3))
with _attempt('v. mask (precompute)', prec=4):
    mask = xs > 0
    res = mask * (xs ** 2) + ~mask * (xs ** 3)
# assert all((x == y).all() for x, y in zip(results, results[1:]))
# print('All equivalent!')
The results of the above may be perplexing, but they show that we need a better view of “performance” and “scale.”
First, let’s introduce a notation used frequently in Computer Science: “big-O.”
from time import sleep
def f(dataset):
    sleep(.5)        # set-up time
    for x in dataset:
        sleep(.01)   # ongoing time

def g(dataset):
    sleep(.01)       # set-up time
    for x in dataset:
        sleep(.02)   # ongoing time
with timed('i. f(<10 elements>)'):
    f(range(10))
with timed('ii. g(<10 elements>)'):
    g(range(10))
with timed('iii. f(<100 elements>)'):
    f(range(100))
with timed('iv. g(<100 elements>)'):
    g(range(100))
from numpy import linspace
from matplotlib.pyplot import plot, show
xs = linspace(1, 100, 101)
f_t = .5 + xs * 0.01
g_t = .01 + xs * 0.02
plot(xs, f_t)
plot(xs, g_t)
show()
Formally:
Big-O notation boils down to: f(x) = O(g(x)) as x → ∞ iff ∃ M > 0 and x₀ ∈ ℝ s.t. |f(x)| ≤ M·g(x), ∀ x ≥ x₀.
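As a minimal numeric illustration of this definition (the witnesses M and x₀ below are chosen by hand and are purely illustrative):
f = lambda x: 3 * x + 10
g = lambda x: x
M, x0 = 4, 10
# 3x + 10 ≤ 4x whenever x ≥ 10, so f(x) = O(g(x)) = O(x)
assert all(abs(f(x)) <= M * g(x) for x in range(x0, 1_000_000))
print('f(x) = 3x + 10 is O(x)')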
Other “Bachmann–Landau” notations:
See Wikipedia: Big-O Notation for more background.
But, typically, we will only care about (and be able to assess) Big-O.
Within Big-O, we typically ignore constant factors and lower-order terms; e.g., an operation that takes 3n + 10 steps is simply O(n).
from numpy.random import randint
xs = randint(-1_000, 1_000, size=1_000_000)
ys = xs ** 2 # O(n)
from random import randint
xs = [randint(-1_000, 1_000) for _ in range(1_000_000)]
ys = [x**2 for x in xs] # O(n)
from itertools import islice
from random import randint
xs = [randint(-1_000, 1_000) for _ in range(100_000)]
results = {}
with timed('i. `list` vs `set` vs `dict`'):
    uniques = []
    for x in xs:              # O(n)
        if x not in uniques:  # O(n) → O(n²)
            uniques.append(x)
results['list'] = uniques
with timed('ii. `list` vs `set` vs `dict`'):
    uniques = {*''}
    for x in xs:              # O(n)
        if x not in uniques:  # O(1) → O(n)
            uniques.add(x)
results['set'] = uniques
# print(f'{[*islice(results["list"], 10)] = }')
# print(f'{[*islice(results["set" ], 10)] = }')
with timed('iii. `list` vs `set` vs `dict`'):
    uniques = {}
    for x in xs:              # O(n)
        if x not in uniques:  # O(1) → O(n)
            uniques[x] = None
    uniques = [*uniques]
results['dict'] = uniques
print(f'{[*islice(results["list"], 10)] = }')
print(f'{[*islice(results["dict"], 10)] = }')
Typically, we can identify and discuss algorithms and operations on data structures in terms of the following bounding functions: O(1) (“constant”), O(log n) (“logarithmic”), O(n) (“linear”), O(n·log n) (“linearithmic”), and O(n²) (“quadratic”).
Let’s look at a comparison of membership testing (x in xs) against a Python list versus a Python set:
from random import randint
for size in [1, 10, 100, 1_000, 10_000, 100_000, 1_000_000]:
    xs = [randint(-1_000, 1_000) for _ in range(size)]
    with timed(f'… in […] {size = :<9,}', prec=6):
        99_999 in xs
This makes sense: the Python list is solely “human-ordered.” As a consequence, there is insufficient machine-available structure to perform this search without checking every single element, and so we should see a direct relationship between the number of elements and the time it takes to perform the search.
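To see what that looks like, here is a minimal sketch of what `value in some_list` effectively has to do (illustrative only; CPython implements this loop in C):
def list_contains(xs, target):
    for x in xs:           # check every element, in insertion order → O(n)
        if x == target:
            return True
    return False

print(list_contains(['abcd', 'wxyz'], 'wxyz'), list_contains(['abcd', 'wxyz'], 'xxxxx'))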
from random import randint
for size in [1, 10, 100, 1_000, 10_000, 100_000, 1_000_000]:
    xs = {randint(-1_000, 1_000) for _ in range(size)}
    with timed(f'… in {size = :<9,}', prec=6):
        99_999 in xs
This makes sense: the Python set is solely “machine-ordered.” As a consequence, there is sufficient machine-available structure to perform this search without checking every single element. Instead, the Python interpreter computes a hash of the search key, uses that hash to identify the likely position where this data would be stored, and directly checks that position. In the case of a possible collision (since the hash value must be compressed to an array index, and, as the domain of the array indices will be smaller than the domain of possible input values, by the pigeonhole principle there must be overlaps), the search through the structure should be relatively short. This is managed by the Python set ensuring that its “load factor” is always below a certain amount, thus mitigating the possibility of collisions turning this search into a linear scan.
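A toy sketch of this hash-then-probe idea (illustrative only; CPython’s real set uses open addressing with perturbation and resizes to keep the load factor low):
class ToySet:
    def __init__(self, capacity=8):
        self.buckets = [[] for _ in range(capacity)]
    def add(self, x):
        self.buckets[hash(x) % len(self.buckets)].append(x)
    def __contains__(self, x):
        # hash → bucket index → short scan of the (ideally few) colliding entries
        return x in self.buckets[hash(x) % len(self.buckets)]

s = ToySet()
s.add('abcd')
print('abcd' in s, 'xxxxx' in s)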
We typically see constant time operations corresponding to operations where we have sufficient machine structuring to directly identify the answer, e.g.,
- lookup by key (in a Python dict or Python set)
- lookup by position (in a Python list or numpy.ndarray)
- fixed-width arithmetic (e.g., float addition but not int addition)

for val in (10**x for x in range(20)):
    x = float(val) - 1
    with timed(f'float(…) + 1 = {val:<9,e}', prec=6):
        x + 1
for val in (10**x for x in range(20)):
    x = int(val) - 1
    with timed(f'int(…) + 1 = {val:<9,e}', prec=6):
        x + 1
for val in [1, 10, 10**10, 10**20, 10**30, 10**40]:
    x = int(val) - 1
    with timed(f'int(…) + 1 = {val:<9,e}', prec=6):
        x + 1
for p, val in ((p, 10**p) for p in [0, 1, 10, 100, 1_000, 10_000, 100_000, 1_000_000, 2_000_000, 4_000_000]):
    x = int(val) - 1
    with timed(f'int(…) + 1 = 10**{p:<1,.0f}', prec=6):
        x + 1
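This makes sense: a Python float is a fixed-width (64-bit) value, so adding 1 to it always costs roughly the same; a Python int is arbitrary-precision, so the cost of its arithmetic grows with the number of digits in its operands.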
Let’s take a look at common operations of the following complexity: O(log n) (“logarithmic”) and O(n·log n) (“linearithmic”).
Recall:
from matplotlib.pyplot import plot, show
from numpy import linspace, log
xs = linspace(1, 10, 100)
plot(xs, log(xs), color='red')        # O(ln x)
plot(xs, xs, color='green')           # O(x)
plot(xs, xs * log(xs), color='blue')  # O(x·ln x)
show()
Logarithmic time complexity is typically encountered when we can “divide and conquer.”
For example, searching through a structure that is already sorted.
from numpy.random import randint
from numpy import searchsorted
xs = randint(-1000, 1000, size=1_000_000)
with timed('i. unsorted', prec=6):
    99_999 in xs
xs.sort()
with timed('ii. sorted', prec=6):
    searchsorted(xs, 99_999)
from numpy.random import randint
for size in [10**x for x in range(8)]:
    xs = randint(-1000, 1000, size=size)
    with timed(f'unsorted {size = :,}', prec=6):
        99_999 in xs
from numpy.random import randint
from numpy import searchsorted
for size in [10**x for x in range(8)]:
    xs = randint(-1000, 1000, size=size)
    xs.sort()
    with timed(f'sorted {size = :,}', prec=6):
        searchsorted(xs, 99_999)
This makes sense; if the structure is sorted, we can look at the median element, determine if the element we want is larger or smaller, and immediately eliminate half of the items we want to compare. At each successive step, we can eliminate half of the items under consideration!
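Here is a minimal sketch of that halving process, a binary search over an already-sorted sequence (illustrative only; the standard library’s bisect module and numpy.searchsorted provide the real thing):
def binary_contains(xs, target):
    lo, hi = 0, len(xs)
    while lo < hi:                 # each pass halves the remaining search space → O(log n)
        mid = (lo + hi) // 2
        if xs[mid] < target:
            lo = mid + 1
        elif xs[mid] > target:
            hi = mid
        else:
            return True
    return False

print(binary_contains(sorted([3, 1, 4, 1, 5, 9, 2, 6]), 5))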
Recall: the pandas index can be sorted & knows that it is sorted!
from pandas import Series
from numpy.random import randint
s = Series(0, index=randint(-1000, 1000, size=10))
print(f'{s.index = }')
print(f'{s.index.is_monotonic = }')
# s = s.sort_index()
# print(f'{s.index = }')
# print(f'{s.index.is_monotonic = }')
from pandas import Series
from numpy.random import randint, normal
for size in [10**x for x in range(8)]:
    s = Series(normal(size=size), index=randint(-1_000, 1_000, dtype='int16', size=size))
    s = s.sort_index()
    with timed(f'.loc with {size = :,}', prec=6):
        try:
            # s.iloc[0]
            s.loc[s.index[0]]
        except KeyError:
            pass
(Well, the world is not always fair.)
Linearithmic operations often appear in fast sorts themselves! (The fastest possible comparison sort is linearithmic in time.)
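For intuition, here is a minimal merge-sort sketch (illustrative only; this is not how list.sort or ndarray.sort are actually implemented): splitting gives roughly log₂ n levels, and merging at each level is O(n) work, for O(n·log n) overall.
from heapq import merge

def merge_sort(xs):
    if len(xs) <= 1:
        return list(xs)
    mid = len(xs) // 2
    # recursively sort each half, then merge the two sorted halves in linear time
    return list(merge(merge_sort(xs[:mid]), merge_sort(xs[mid:])))

print(merge_sort([3, 1, 4, 1, 5, 9, 2, 6]))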
from numpy.random import randint
from numpy import searchsorted
for size in [10**x for x in range(8)]:
    xs = randint(-1000, 1000, size=size)
    with timed(f'sort {size = :,}', prec=6):
        xs.sort()
Let’s take a look at common operations of the following complexity: O(n²) (“quadratic”).
Quadratic operations are very dangerous in practice, because they grow so quickly that code that works fine on a test dataset can quickly prove completely intractable on a production dataset.
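For example, here is a classic accidentally-quadratic pattern (the has_duplicates helpers below are hypothetical and purely illustrative), alongside a linear-time alternative:
def has_duplicates_quadratic(xs):
    # compare every pair of elements: n·(n-1)/2 comparisons → O(n²)
    return any(xs[i] == xs[j] for i in range(len(xs)) for j in range(i + 1, len(xs)))

def has_duplicates_linear(xs):
    # hashing into a set makes this roughly O(n)
    return len(set(xs)) != len(xs)

xs = [*range(2_000)]  # no duplicates, so both checks must look at everything
with timed('O(n²) duplicate check', prec=4):
    has_duplicates_quadratic(xs)
with timed('O(n)  duplicate check', prec=4):
    has_duplicates_linear(xs)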
from numpy import convolve
from numpy.random import normal
for size in [*(10**x for x in range(1, 7)), 2_500_000, 5_000_000, 10_000_000]:
    xs = normal(size=size//2)
    # k = normal(size=3)
    k = normal(size=size//2)
    with timed(f'convolve {size = :,}', prec=6):
        convolve(xs, k)
So how do we operationalise this information?
- remember that sorting is itself (at best) O(n·log n) and try to avoid repeated sorting (or loss of structure leading to repeated sorting)
- remember that a dict is both “human-” (“insertion-”) and “machine-”ordered in Python ≥3.6
- use a set or a dict where possible; a sorted structure otherwise
- don’t chase micro-optimisations (e.g., [*map(f, xs)] being faster than list(map(f, xs)) or [f(x) for x in xs]); these may not persist from interpreter version to interpreter version

Simple example:
from pandas import to_datetime
from datetime import timedelta
us_holidays = to_datetime(['2021-01-01', '2021-07-04', '2021-05-31', '2021-12-25'])
def nth_business_date(dt, n, holidays=us_holidays):
    while n > 0:
        dt += timedelta(days=1)
        while dt.weekday() in {5, 6} or dt in holidays:
            dt += timedelta(days=1)
        n -= 1
    return dt
for _ in range(1):
    today = to_datetime('today')
    print(f'{today = }')
    nbd = nth_business_date(today, n=2)
    print(f'{nbd = }')
    dts = [
        nth_business_date(today, n=30),
        nth_business_date(today, n=60),
        nth_business_date(today, n=90),
    ]
    print(f'{dts = }')
from pandas import to_datetime
from datetime import timedelta
us_holidays = {*to_datetime(['2021-01-01', '2021-07-04', '2021-05-31', '2021-12-25'])}
def nth_business_date(dt, n, holidays=us_holidays):
    while n > 0:
        dt += timedelta(days=1)
        while dt.weekday() in {5, 6} or dt in holidays:
            dt += timedelta(days=1)
        n -= 1
    return dt
today = to_datetime('today')
nbd = nth_business_date(today, n=2)
dts = [
    nth_business_date(today, n=30),
    nth_business_date(today, n=60),
    nth_business_date(today, n=90),
]
for _ in range(1):
    print(f'{today = }')
    print(f'{nbd = }')
    print(f'{dts = }')
from pandas import to_datetime
from datetime import timedelta
from itertools import islice, count
us_holidays = {*to_datetime(['2021-01-01', '2021-07-04', '2021-05-31', '2021-12-25'])}
def business_dates(dt, holidays=us_holidays):
    all_dates = (dt + timedelta(days=n) for n in count())
    return (dt for dt in all_dates if dt.weekday() not in {5, 6} and dt not in holidays)
today = to_datetime('today')
nbd = [*islice(business_dates(today), 2)]
dts = [*islice(business_dates(today), 91)]
dts = dts[30], dts[60], dts[90]
for _ in range(1):
    print(f'{today = }')
    print(f'{nbd = }')
    print(f'{dts = }')