The need for speed
- Single core performance flatlined
- Performance == energy saving == battery life
- Users love performant software
- I feel the need, the need for speed
Is it even possible?
- Python is slow (100x slowdown)
- No heroes to save us
- Nonetheless, it is possible
Terminal | Language | MB/s |
Konsole | C++ | 27 |
Alacritty | Rust | 54 |
GNOME Terminal | C++ | 62 |
kitty | Python + C | 135 |
A little bit about kitty
- A terminal emulator
- Make terminals a powerful platform
-
Uses Python for its UI and control
- Uses C and GPU for performance
Because I wanted vim with squiggly underlines
Case study: base64
- Encode binary data as text
- 3 input bytes → 4 letters
│Enc│ │ode│ │d │
└─┬─┘ └─┬─┘ … └─┬─┘
│ │ │
↓ ↓ ↓
RW5j b2Rl … ZA==
Used widely in communication, escape codes in kitty
base64: A baseline
Rate: 1.6
GB/s
Rate: 1.2
MB/s 1000x slower!!
Add some street smarts
Rate: 7.0
MB/s 5x faster
Intermezzo: memoryview
Rate: 6.1
MB/s almost same speed
Binary data ⇒ use memoryview
Parallelism to the rescue?
Rate: 27.8
MB/s 4x noslice
20x naive, but 50x slower than native
Cheating not faster, just wider
Python + C
Python’s performance superpower: C code
from base64 import standard_b64encode
Rate: 1.3
GB/s almost native base64
Phew! Are we done?
Python’s Achilles heel: stdlib bitrot
We can do much better
Intermezzo: SIMD
- Single Instruction Multiple Data
- Upto 64 bytes at a time
- base64 encoding: 3 byte chunks
- aklomp/base64
void base64_encode(
const char *src, size_t srclen, char *out, size_t *outlen
)
Python C API
Where strong men fear to tread
Rate:
21.5
GB/s 20x native!!

Streaming base64 encoding
HTML 5 Parsing
- The WHATWG HTML 5 spec is some 1500 pages
- html5lib is awesome!
- It parses the HTML 5 spec at
~1.1
MB/s
- How much better can we do?
- Not like base64, need object representation
An efficient object representation
Premature optimization…
- … is the root of all evil
- Use the cProfile module
- Code that is in the hot path
- Code that has a nice data boundary
An example of profiling
- Users reported opening certain e-books in calibre was very slow
- Run function to preprocess book in profiler
CSS parsing is the culprit!
Do less work and do it in C++
- Just need to transform a few properties
- Basic parsing is easy
- ~1200 lines of C++
2x speedup here worst case 60x speedup
Convert those entities in C
- HTML entities:
& name | number ;
- Use SIMD to find
&
- Use perfect hashing (gperf)
- ~200 lines of C
Further 50% speedup for a total 4x speedup
A short tour of SIMD
- Vector of input values → Vector of output values
- No control flow
- Masks
- Shifts and broadcasts
- Boolean operators
- Arithmetic operators
- Counting bits
- Originally conceived for maths
Example: Let’s find Mr. X
│s│o│m│e│t│e│x│t│ … in SIMD register A
└─┴─┴─┴─┴─┴─┴─┴─┘
│x│x│x│x│x│x│x│x│ … in SIMD register B
└─┴─┴─┴─┴─┴─┴─┴─┘
C = A cmpeq B
│0│0│0│0│0│0│
255│0│ … in SIMD register C
└─┴─┴─┴─┴─┴─┴───┴─┘
Number of leading zeroes is the answer
│X│X│X│X│X│X│X│X│ … in SIMD register D
└─┴─┴─┴─┴─┴─┴─┴─┘
C = A cmpeq B | A cmpeq D
4 CPU instructions to search up to 64 bytes with no branches!
A bird’s eye view of base64
- Recall: 3 bytes become 4
3 * 8 == 24 == 4 * 6 bits
2 ^ 6 == 64 (aka base64)
-
SIMD: Multiple groups: 12 → 16
[AAAB | BBCC | CDDD | 0000] → [0AAA | 0BBB | 0CCC | 0DDD ]
Focus on a single word: 0AAA
Spread out its 24 bits into four groups of 6
0AAA = [ 00000000 | aaaaaa | bbbbbb | cccccc | dddddd ]
0AAA = [ 00aaaaaa | 00bbbbbb | 00cccccc | 00dddddd ]
- Done in parallel on 16, 32 or 64 bytes at a time
Branchless base64 lookup
No control flow!
i | range | shift |
0 … 25 | A … Z | ord('A') |
26 … 51 | a … z | ord('a') - 26 |
52 … 61 | 0 … 9 | ord('0') - 52 |
62 | + | ord('+') - 62 |
63 | / | ord('/') - 63 |
Branchless lookup pseudo code
CPUs have SIMD instruction for
one_when
Key takeaways
Python programs can be very fast
- Profile and identify
- Look for pure Python opportunities
- Eliminate extra copies
- Use memoryview for binary data
Don’t be afraid of C