Making Python programs fast

Kovid Goyal

September 21, 2024 @ PyCon India

sw.kovidgoyal.net/talks/pycon2024

The need for speed

Single core performance flatlined
Performance == energy saving == battery life
Users love performant software
I feel the need, the need for speed

Is it even possible?

Python is slow (100x slowdown)
No heroes to save us
Nonetheless, it is possible

Terminal	Language	MB/s
Konsole	C++	27
Alacritty	Rust	54
GNOME Terminal	C++	62
kitty	Python + C	135

A little bit about kitty

A terminal emulator
Make terminals a powerful platform
Uses Python for its UI and control
Uses C and GPU for performance

Because I wanted vim with squiggly underlines

Case study: base64

Encode binary data as text
3 input bytes → 4 letters


        │Enc│  │ode│      │d  │
        └─┬─┘  └─┬─┘  …   └─┬─┘
          │      │          │
          ↓      ↓          ↓
         RW5j   b2Rl  …    ZA==

Used widely in communication, escape codes in kitty

base64: A baseline

Rate: 1.6 GB/s

Rate: 1.2 MB/s 1000x slower!!

Add some street smarts

Rate: 7.0 MB/s 5x faster

Intermezzo: memoryview

Rate: 6.1 MB/s almost same speed

Binary data ⇒ use memoryview

Parallelism to the rescue?

Rate: 27.8 MB/s 4x noslice

20x naive, but 50x slower than native

Cheating not faster, just wider

Python + C

Python’s performance superpower: C code

from base64 import standard_b64encode

Rate: 1.3 GB/s almost native base64

Phew! Are we done?

Python’s Achilles heel: stdlib bitrot

We can do much better

Intermezzo: SIMD

Single Instruction Multiple Data
Upto 64 bytes at a time
base64 encoding: 3 byte chunks
aklomp/base64


    void base64_encode(
        const char *src, size_t srclen, char *out, size_t *outlen
    )

Python C API

Where strong men fear to tread

Rate: 21.5 GB/s 20x native!!

Needed boilerplate

Streaming base64 encoding

HTML 5 Parsing

The WHATWG HTML 5 spec is some 1500 pages
html5lib is awesome!
It parses the HTML 5 spec at ~1.1 MB/s
How much better can we do?
Not like base64, need object representation

An efficient object representation

The lxml package
The Gumbo C HTML 5 parser
A simple translation: kovidgoyal/html5-parser
Under 1000 lines of C code
Rate: 28.1 MB/s ~28x

Premature optimization…

… is the root of all evil

Use the cProfile module

Code that is in the hot path
Code that has a nice data boundary

An example of profiling

Users reported opening certain e-books in calibre was very slow
Run function to preprocess book in profiler

CSS parsing is the culprit!

Do less work and do it in C++

Just need to transform a few properties
Basic parsing is easy
~1200 lines of C++

2x speedup here worst case 60x speedup

Convert those entities in C

HTML entities: & name | number ;
Use SIMD to find &
Use perfect hashing (gperf)
~200 lines of C

Further 50% speedup for a total 4x speedup

A short tour of SIMD

Vector of input values → Vector of output values
No control flow
- Masks
- Shifts and broadcasts
- Boolean operators
- Arithmetic operators
- Counting bits
Originally conceived for maths

Example: Let’s find Mr. X

│s│o│m│e│t│e│x│t│ … in SIMD register A
└─┴─┴─┴─┴─┴─┴─┴─┘
│x│x│x│x│x│x│x│x│ … in SIMD register B
└─┴─┴─┴─┴─┴─┴─┴─┘
C = A cmpeq B

│0│0│0│0│0│0│255│0│ … in SIMD register C
└─┴─┴─┴─┴─┴─┴───┴─┘
Number of leading zeroes is the answer

│X│X│X│X│X│X│X│X│ … in SIMD register D
└─┴─┴─┴─┴─┴─┴─┴─┘
C = A cmpeq B | A cmpeq D

4 CPU instructions to search up to 64 bytes with no branches!

A bird’s eye view of base64

Recall: 3 bytes become 4

  3 * 8 == 24 == 4 * 6 bits
  2 ^ 6 == 64 (aka base64)

SIMD: Multiple groups: 12 → 16

  [AAAB | BBCC | CDDD | 0000] → [0AAA | 0BBB | 0CCC | 0DDD ]
  Focus on a single word: 0AAA
  Spread out its 24 bits into four groups of 6
  0AAA = [ 00000000 | aaaaaa | bbbbbb | cccccc | dddddd ]
  0AAA = [ 00aaaaaa | 00bbbbbb | 00cccccc | 00dddddd ]

Done in parallel on 16, 32 or 64 bytes at a time

Branchless base64 lookup

No control flow!


        i range shift
        0 … 25 A … Z ord('A')
        26 … 51 a … z ord('a') - 26
        52 … 61 0 … 9 ord('0') - 52
        62 + ord('+') - 62
        63 / ord('/') - 63

i	range	shift
0 … 25	A … Z	ord('A')
26 … 51	a … z	ord('a') - 26
52 … 61	0 … 9	ord('0') - 52
62	+	ord('+') - 62
63	/	ord('/') - 63

Branchless lookup pseudo code

CPUs have SIMD instruction for one_when

Thanks to Wojciech Muła for this algorithm

Key takeaways

Python programs can be very fast

Profile and identify
Look for pure Python opportunities
Eliminate extra copies
Use memoryview for binary data

Don’t be afraid of C

Slides: sw.kovidgoyal.net/talks/pycon2024

Benchmark code: sw.kovidgoyal.net/talks/pycon2024/code.py