Overview
This lecture explains how techniques developed while optimizing CPU emulation in the RPCS3 emulator (code verification, checksums, hashing, and vectorization) carried over to GPU emulation, raising performance from 166 FPS to 200 FPS.
Background: RPCS3 SPU and Code Verification
- RPCS3 can statically recompile PowerPC code, but not SPU code, because SPU programs can modify themselves at runtime.
- SPU code verification checks whether code has been modified before it executes: the current code is XORed with the expected copy and the results are ORed together, so any non-zero result signals a change.
- SIMD instruction sets such as AVX and AVX-512 accelerate verification by comparing more bytes of code per instruction.
- AVX-512’s VPTERNLOG instruction can perform the XOR and the OR in a single operation, reducing the instruction count.
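The XOR/OR check described above can be sketched in a few lines. This is a minimal illustration in Python, not RPCS3's actual code: the function names are hypothetical, code is modelled as a list of 32-bit words, and `ternlog` only emulates the bitwise behaviour of a three-input logic instruction on plain integers. The immediate 0xF6 is the truth table for `a | (b ^ c)`, which is how one instruction can cover both the XOR and the OR.

```python
def code_is_unmodified(expected, current):
    """XOR each pair of words and OR the differences into one accumulator."""
    if len(expected) != len(current):
        return False
    acc = 0
    for e, c in zip(expected, current):
        acc |= e ^ c          # a non-zero XOR means the two words differ
    return acc == 0           # a single test after the loop


def ternlog(a, b, c, imm8, width=32):
    """Emulate a ternary logic op: bits of a, b, c select one bit of imm8."""
    out = 0
    for i in range(width):
        idx = (((a >> i) & 1) << 2) | (((b >> i) & 1) << 1) | ((c >> i) & 1)
        out |= ((imm8 >> idx) & 1) << i
    return out


def code_is_unmodified_ternlog(expected, current):
    """Same check, with the XOR and OR fused into one operation per word."""
    if len(expected) != len(current):
        return False
    acc = 0
    for e, c in zip(expected, current):
        acc = ternlog(acc, e, c, 0xF6)   # 0xF6 encodes acc | (e ^ c)
    return acc == 0
```

Both functions return the same answer for any input; the second one simply shows that the accumulator update needs one logical operation per word instead of two.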
Further Optimizations in Code Verification
- Optimizing for code size can improve instruction cache usage; compact string instructions such as REP MOVSB and REP CMPSB can stand in for longer copy and compare loops.
- Modern CPUs, especially AMD Zen 4/5, accelerate REP CMPSB for rapid memory comparisons.
- REP CMPSB was considered but not implemented; instead, a checksum approach provided better results.
Using Checksums for Verification
- Checksums, such as those used in network protocols, detect data modifications with minimal memory and computation.
- A large 512-bit checksum is memory- and cache-efficient compared with storing a full copy of the instructions.
- In SPU verification, the checksum method is twice as fast as the full-comparison code.
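A sketch of checksum-based verification, under stated assumptions: the notes do not say how the 512-bit checksum is computed, so this example treats it as sixteen independent 32-bit wrapping sums, which is the shape a single 512-bit vector add would produce. The function names and the lane layout are hypothetical.

```python
LANES = 16            # 16 lanes x 32 bits = 512 bits of checksum state
MASK32 = 0xFFFFFFFF


def checksum512(words):
    """Sum 32-bit words into 16 wrapping lanes; word i goes to lane i % 16."""
    lanes = [0] * LANES
    for i, w in enumerate(words):
        lanes[i % LANES] = (lanes[i % LANES] + w) & MASK32
    return tuple(lanes)


def verify(words, stored_checksum):
    """Recompute the checksum of the current code and compare it to the stored one."""
    return checksum512(words) == stored_checksum


# Usage: keep only the 64-byte checksum instead of a full copy of the code.
program = list(range(1, 257))
reference = checksum512(program)
patched = list(program)
patched[10] ^= 1                     # simulate a self-modifying write
print(verify(program, reference))    # unchanged code passes
print(verify(patched, reference))    # modified code fails
```

Unlike a full comparison, a plain sum can miss changes that cancel out, for example two words swapped within the same lane. That is the trade-off for keeping 64 bytes of reference data per block rather than a complete copy.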
Graphics Emulation and LTO (Link Time Optimization)
- Ninja Gaiden Sigma’s frame rate improved over time, and the main bottleneck became GPU emulation (RSX).
- Enabling LTO only for RPCS3 core (not third-party code) allowed inlining and dead code removal, improving FPS from 166 to 180.
Shader Hashing and Hash Optimization
- Shaders were originally hashed with FNV, which is simple and fast but poorly suited to large inputs.
- Shader constants must be excluded from the hash because games patch them, which complicates the hashing process.
- Switching to a rotated checksum made the hash vectorizable and raised FPS further to 185.
- Manual vectorization with AVX-512 made fuller use of the hardware, raising FPS from 185 (scalar) to 193 (vectorized).
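To show why FNV resists vectorization while a rotate-based checksum does not, here is a small comparison. `fnv1a_64` uses the standard 64-bit FNV-1a constants. `rotated_checksum` is only an assumed form of the rotated checksum, since the notes do not define it: each 64-bit word is rotated by an amount that depends on its position and added to one of several independent lanes.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_64(data: bytes) -> int:
    """Standard 64-bit FNV-1a: one byte per step, each step needs the previous hash."""
    h = 0xCBF29CE484222325                  # FNV offset basis
    for byte in data:
        h ^= byte
        h = (h * 0x100000001B3) & MASK64    # multiply by the FNV prime
    return h


def rotl64(x: int, r: int) -> int:
    """Rotate a 64-bit value left by r bits."""
    r %= 64
    return ((x << r) | (x >> (64 - r))) & MASK64


def rotated_checksum(words, lanes=8):
    """Assumed rotate-and-add checksum over 64-bit words.

    Word i is rotated by i // lanes and added to lane i % lanes, so the
    lanes never depend on each other and can map onto SIMD lanes.
    """
    acc = [0] * lanes
    for i, w in enumerate(words):
        acc[i % lanes] = (acc[i % lanes] + rotl64(w, i // lanes)) & MASK64
    # Fold the independent lanes into a single 64-bit value at the end.
    h = 0
    for lane, value in enumerate(acc):
        h = (h + rotl64(value, lane)) & MASK64
    return h
```

In `fnv1a_64` every multiply waits on the result of the previous one, so the loop is a serial chain. In `rotated_checksum` the eight lane updates in each group of eight words are independent, which is the property a vectorizing compiler or hand-written SIMD code needs.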
Bit Set Optimization in Hashing
- Bit sets track which shader instructions contain constants, allowing efficient exclusion from hashing.
- Bit sets are more cache-friendly than boolean arrays but require more complex indexing and traversal.
- AVX-512's wide registers and byte-permute instructions (given as vperm2b, most likely VPERMI2B or VPERMT2B) enabled efficient masked operations and further speedup.
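The bit-set idea can be sketched as follows. This is an illustration under assumptions, not RPCS3's data layout: a Python integer stands in for the bit set, bit i is set when instruction i holds a patched constant, and the hash is a deliberately simple stand-in (a wrapping sum of 64-bit words) so that the masking logic stays visible. All names are hypothetical.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def make_constant_mask(constant_indices):
    """Build a bit set with bit i set for every instruction holding a constant."""
    mask = 0
    for i in constant_indices:
        mask |= 1 << i
    return mask


def has_constant(mask, i):
    """Index the bit set: shift down to bit i and test it."""
    return (mask >> i) & 1 == 1


def hash_without_constants(instructions, mask):
    """Stand-in hash that skips every instruction flagged in the bit set."""
    h = 0
    for i, word in enumerate(instructions):
        if not has_constant(mask, i):
            h = (h + word) & MASK64
    return h


# Usage: two shaders that differ only in their constant slots hash the same.
mask = make_constant_mask([1, 3])
shader_a = [10, 111, 20, 222]
shader_b = [10, 999, 20, 888]
print(hash_without_constants(shader_a, mask) == hash_without_constants(shader_b, mask))
```

A boolean array typically spends a byte per entry where the bit set spends a bit, which is where the cache advantage comes from; the cost is the shift-and-mask indexing shown in `has_constant`.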
Final Optimization and Results
- Additional logic that skips redundant hashing further raised the non-AVX-512 path to 193 FPS.
- On AVX-512-capable CPUs, optimized code reached 200 FPS, removing the bottleneck altogether.
Key Terms & Definitions
- SPU (Synergistic Processing Unit) — A processor used in PS3, supports self-modifying code.
- SIMD (Single Instruction, Multiple Data) — Executes the same operation on multiple data points simultaneously.
- AVX/AVX-512 — Advanced vector extensions; modern CPU instruction sets for high-performance SIMD processing.
- Checksum — A value used to verify data integrity by summing or otherwise processing data bytes.
- LTO (Link Time Optimization) — Compiler feature that optimizes across multiple files during the linking stage.
- Bit set — A data structure using individual bits to track conditions (e.g., which shader slots have constants).
Action Items / Next Steps
- Review examples of SIMD and checksum implementations for verification tasks.
- Study LTO compiler options and their effects on program performance and size.
- Explore the trade-offs between different data structures (bit sets vs. arrays) in code optimization.