Making Python programs fast

Kovid Goyal

September 21, 2024 @ PyCon India

sw.kovidgoyal.net/talks/pycon2024

The need for speed

  • Single core performance flatlined
  • Performance == energy saving == battery life
  • Users love performant software
  • I feel the need, the need for speed

Is it even possible?

  • Python is slow (100x slowdown)
  • No heroes to save us
  • Nonetheless, it is possible
  • TerminalLanguageMB/s
    KonsoleC++27
    AlacrittyRust54
    GNOME TerminalC++62
    kittyPython + C135

A little bit about kitty

  • A terminal emulator
  • Make terminals a powerful platform
  • Uses Python for its UI and control
  • Uses C and GPU for performance
Because I wanted vim with squiggly underlines

Case study: base64

  • Encode binary data as text
  • 3 input bytes → 4 letters
  • 
            │Enc│  │ode│      │d  │
            └─┬─┘  └─┬─┘  …   └─┬─┘
              │      │          │
              ↓      ↓          ↓
             RW5j   b2Rl  …    ZA==
            
  • Used widely in communication, escape codes in kitty

base64: A baseline

Rate: 1.6 GB/s
Rate: 1.2 MB/s 1000x slower!!

Add some street smarts

Rate: 7.0 MB/s 5x faster

Intermezzo: memoryview

Rate: 6.1 MB/s almost same speed
Binary data ⇒ use memoryview

Parallelism to the rescue?

Rate: 27.8 MB/s 4x noslice
20x naive, but 50x slower than native
Cheating not faster, just wider

Python + C

Python’s performance superpower: C code
from base64 import standard_b64encode
Rate: 1.3 GB/s almost native base64

Phew! Are we done?

Python’s Achilles heel: stdlib bitrot
We can do much better

Intermezzo: SIMD

  • Single Instruction Multiple Data
  • Upto 64 bytes at a time
  • base64 encoding: 3 byte chunks
  • aklomp/base64

    void base64_encode(
        const char *src, size_t srclen, char *out, size_t *outlen
    )

Python C API

Where strong men fear to tread
Rate: 21.5 GB/s 20x native!!

Needed boilerplate

Streaming base64 encoding

HTML 5 Parsing

  • The WHATWG HTML 5 spec is some 1500 pages
  • html5lib is awesome!
  • It parses the HTML 5 spec at ~1.1 MB/s
  • How much better can we do?
  • Not like base64, need object representation

An efficient object representation

Premature optimization…

  • … is the root of all evil
  • Use the cProfile module
  • Code that is in the hot path
  • Code that has a nice data boundary

An example of profiling

  • Users reported opening certain e-books in calibre was very slow
  • Run function to preprocess book in profiler
CSS parsing is the culprit!

Do less work and do it in C++

  • Just need to transform a few properties
  • Basic parsing is easy
  • ~1200 lines of C++
2x speedup here worst case 60x speedup

Convert those entities in C

  • HTML entities: & name | number ;
  • Use SIMD to find &
  • Use perfect hashing (gperf)
  • ~200 lines of C
Further 50% speedup for a total 4x speedup

A short tour of SIMD

  • Vector of input values → Vector of output values
  • No control flow
    • Masks
    • Shifts and broadcasts
    • Boolean operators
    • Arithmetic operators
    • Counting bits
  • Originally conceived for maths

Example: Let’s find Mr. X

│s│o│m│e│t│e│x│t│ … in SIMD register A
└─┴─┴─┴─┴─┴─┴─┴─┘
│x│x│x│x│x│x│x│x│ … in SIMD register B └─┴─┴─┴─┴─┴─┴─┴─┘
C = A cmpeq B
│0│0│0│0│0│0│255│0│ … in SIMD register C └─┴─┴─┴─┴─┴─┴───┴─┘
Number of leading zeroes is the answer
│X│X│X│X│X│X│X│X│ … in SIMD register D └─┴─┴─┴─┴─┴─┴─┴─┘
C = A cmpeq B | A cmpeq D
4 CPU instructions to search up to 64 bytes with no branches!

A bird’s eye view of base64

  • Recall: 3 bytes become 4
      3 * 8 == 24 == 4 * 6 bits
      2 ^ 6 == 64 (aka base64)
    
  • SIMD: Multiple groups: 12 → 16
      [AAAB | BBCC | CDDD | 0000] → [0AAA | 0BBB | 0CCC | 0DDD ]
      Focus on a single word: 0AAA
      Spread out its 24 bits into four groups of 6
      0AAA = [ 00000000 | aaaaaa | bbbbbb | cccccc | dddddd ]
      0AAA = [ 00aaaaaa | 00bbbbbb | 00cccccc | 00dddddd ]
    
  • Done in parallel on 16, 32 or 64 bytes at a time

Branchless base64 lookup

No control flow!
irangeshift
0 … 25A … Zord('A')
26 … 51a … zord('a') - 26
52 … 610 … 9ord('0') - 52
62+ord('+') - 62
63/ord('/') - 63

Branchless lookup pseudo code

CPUs have SIMD instruction for one_when
Thanks to Wojciech Muła for this algorithm

Key takeaways

Python programs can be very fast
  • Profile and identify
  • Look for pure Python opportunities
  • Eliminate extra copies
  • Use memoryview for binary data
Don’t be afraid of C