I've been trying to optimize this piece of code for over five years, but every time I tried, I couldn't get a satisfying result. Then one day I optimized an entirely unrelated piece of code, and bam, it led me to a solution so simple that it made me feel stupid. Optimizing one piece of our CPU emulation led to a series of optimizations to our graphics emulation that brought this scene from an average of 166 frames per second to an average of 200 frames per second. Let's go through everything in chronological order, starting with the CPU optimization.

Let me give you some background first. RPCS3 can statically recompile the PowerPC executable ahead of time since it can't self-modify. For the SPU, that's not possible. The SPU has 256 kilobytes of local storage, which has to hold both data and code, and by design the SPU supports self-modifying code. There's a portion of code in RPCS3 that detects modifications to SPU code so that we can handle it properly. We're going to start off by talking about the SPU verification code. The purpose of this code is to discover whether the code was modified, and it runs just before we actually execute a block of recompiled SPU code.

It works in a simple way. We load some data from the SPU and we load the expected data to compare it with. We XOR the two together, which should produce all zeros, and then OR the result into an accumulator that we check once we've finished comparing all of the data. That way, any discrepancy in the code will be found. In our example, we XOR 0xCAFEBABE together with 0xCAFEBEEF, which produces a non-zero result in our second element. Any non-zero element at the end of the operation means that some piece of data differed. Since SPU opcodes are 32 bits wide, we can compare four of them at the same time with SIMD instructions. That means that for every four SPU instructions we're checking, we need two load instructions, one XOR instruction, and one OR instruction.

How can we make that faster? Well, by using more modern SIMD instruction sets like AVX, we can compare eight SPU instructions at once. Sweet. And as you probably already guessed, you can compare 16 SPU instructions at once by using AVX-512. The AVX-512 code can be made even more efficient thanks to the VPTERNLOG instruction. VPTERNLOG is a three-input instruction that can compute any combination of bitwise operations on its three inputs. In our case, that means we can do both the XOR and the OR at the same time. So with AVX-512 and VPTERNLOG, we only need three x86 instructions per 16 SPU instructions.
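To make that concrete, here's roughly what such a verification loop could look like with AVX-512 intrinsics. This is a minimal sketch, not RPCS3's actual code; the function name is made up, and it assumes the block size is a multiple of 64 bytes.

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstddef>

// Sketch of the AVX-512 verification idea: accumulate acc |= (code ^ expected)
// across the block, then check that the accumulator is still all zeros.
// One VPTERNLOG handles 16 SPU instructions (64 bytes) per iteration.
bool spu_block_unmodified(const uint8_t* code, const uint8_t* expected, size_t size)
{
    __m512i acc = _mm512_setzero_si512();
    for (size_t i = 0; i < size; i += 64)
    {
        const __m512i a = _mm512_loadu_si512(code + i);
        const __m512i b = _mm512_loadu_si512(expected + i);
        // The immediate 0xF6 encodes the boolean function acc | (a ^ b),
        // so the XOR and the OR happen in a single instruction.
        acc = _mm512_ternarylogic_epi32(acc, a, b, 0xF6);
    }
    // The block is unmodified only if no lane ever differed.
    return _mm512_test_epi64_mask(acc, acc) == 0;
}
```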
Okay, so how can we possibly make this code any faster? We obviously have to load all the data, since we're comparing it, and we're already using three-operand instructions to minimize the load on the arithmetic and logic units. Well, we can actually try optimizing for code size. You see, in the world of x86, there are some special instructions that are implemented in microcode. REP MOVSB is a good example of this. It takes two pointers as arguments along with a count of bytes, and it copies memory from one pointer to the other. In the 80s, the REP-prefixed family of instructions was popular for its ability to save on code size. After all, memory was a scarce resource back then. But as time moved on, these instructions grew less popular. Why? Well, REP MOVSB only copies one byte per cycle. So when you call memcpy in C, you get a large piece of SIMD code that copies the data instead.

But some Intel engineers realized there was missed potential here. Yes, REP MOVSB only copies one byte at a time. But what if the microcode could detect a large copy and just emit the same SIMD micro-ops that memcpy would use? The speed would be about the same, but we'd save on instruction cache usage. Memory is no longer so scarce, but most CPUs today still have just 32 kilobytes of instruction cache, so any savings here are worthwhile. And so you'll find a certain flag in modern CPUs: ERMSB, Enhanced REP MOVSB. Intel's solution was to advertise this bit, which doesn't actually signify a difference in the instruction set, but indicates that REP MOVSB can handle large copies in a performant manner. Yes, even memcpy is aware of this flag and will use REP MOVSB if the data you're copying is large enough. The advantage is that we're only executing one instruction from the point of view of the CPU's front end: only one instruction is decoded, and the microcode sequencer handles the rest.

One of the other instructions in the REP-prefixed family is REP CMPSB. REP CMPSB compares bytes just like REP MOVSB copies bytes. I think you see where I'm going with this. Thing is, the newest Intel CPUs still aren't fast here. It's actually AMD who was first to ship an efficient REP CMPSB implementation. Specifically, it's fast on the Zen 4 and Zen 5 CPUs; anything older and it's still going to be slow. Oh yeah, and AMD didn't mention this fact in the Zen 4 optimization manual, but they did mention it in the tuning guide for Zen 4 server CPUs. I wrote some benchmarks before implementing REP CMPSB in RPCS3 and found that the performance was excellent, even matching memcmp at small sizes. And when the instruction cache was cold, REP CMPSB obliterated the performance of memcmp. Awesome. So how much faster does REP CMPSB make RPCS3? Actually, I've got to come clean: I never implemented REP CMPSB in RPCS3. You see, I came up with something faster than a direct comparison.

What is a checksum, anyway? It's used for error detection. To create a checksum, we just sum all of the data we want to check for errors, then store that checksum alongside the data. Later, if we want to check whether the data changed, we generate the checksum again. If the generated checksum doesn't match the stored checksum, then we have an error. For instance, in the IPv4 protocol, a 16-bit checksum is transmitted alongside the data to determine packet integrity. In theory, there's a 1 in 65,536 (2^16) chance that an error is undetectable, that is, that two different pieces of data end up producing the same checksum. If we instead use a 32-bit checksum, the chance decreases to 1 in 4,294,967,296, or 2^32. With a 512-bit checksum, we're looking at a collision rate of about 1 in 2^512. The real probability will be different, since we're talking about valid SPU opcodes and not truly random data. So then why use such a large checksum when the internet makes do with just a 16-bit one? Well, if you're transmitting data across the internet, you want to maximize bandwidth, and a larger checksum would decrease the amount of real data you could transmit. Since the 512-bit checksum is held exclusively in memory, it only has a minimal overhead. We could use a smaller checksum, which would save some memory, but it would actually be computationally more expensive, since we'd need to horizontally add our 512-bit checksum down to the smaller size. Essentially, all the memory savings of a smaller checksum would be eaten up by the additional instructions needed.
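Here's a minimal sketch of what that could look like with AVX-512 intrinsics. The function names and the choice of 32-bit lanes are mine, and RPCS3's real code is more involved; this is just to show the shape of the idea.

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstddef>

// Sketch: instead of storing a full copy of the expected SPU code and comparing
// against it, keep one 512-bit checksum per block, computed as a lane-wise sum
// of 32-bit words. Assumes `size` is a multiple of 64 bytes.
__m512i spu_block_checksum(const uint8_t* code, size_t size)
{
    __m512i sum = _mm512_setzero_si512();
    for (size_t i = 0; i < size; i += 64)
    {
        // One load and one add per 16 SPU instructions, and no second
        // stream of "expected" data being pulled through the cache.
        sum = _mm512_add_epi32(sum, _mm512_loadu_si512(code + i));
    }
    return sum;
}

bool spu_block_matches_checksum(const uint8_t* code, size_t size, __m512i stored_checksum)
{
    const __m512i sum = spu_block_checksum(code, size);
    // Any lane that differs from the stored checksum means the code changed.
    return _mm512_cmpneq_epi32_mask(sum, stored_checksum) == 0;
}
```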
Even though we're using a massive 512-bit checksum, the new code is actually much more memory- and cache-friendly. After all, the original code needed to store an entire copy of the expected SPU instructions to compare against. Additionally, modern CPUs have much more hardware for operating on values already in registers than for performing loads and stores. So even when all the data resides in cache, the checksum variant is still twice as fast as the original full-comparison code.

One of my favorite scenes to benchmark is this one in Ninja Gaiden Sigma. When I first started looking into RPCS3's code, I fired up a profiler on this spot and noticed some code was taking up way too much CPU time. I investigated the RPCS3 code but wasn't able to fix it properly myself, so I opened an issue on GitHub with my findings, and RPCS3's lead graphics developer, kd-11, was able to fix it the very next day. Since then, I've gotten a lot better at programming and started contributing meaningful improvements to RPCS3, but I focus on the PowerPC and SPU recompilers, not the graphics code. You see, Ninja Gaiden Sigma is a simple game which doesn't make heavy use of the SPUs, so the bottleneck is the host CPU thread that emulates the PlayStation 3's GPU, the RSX. Even so, I would return to this area every once in a while to see the improvements. When I first started on RPCS3, the game couldn't reach 60 frames per second, but now it's reaching nearly 166 frames per second. Nice.

One of the things I tried all the way back in 2019 was compiling the emulator with link-time optimization enabled. But when using the GCC compiler with LTO, the RPCS3 binary would crash in the LLVM recompiler. Oops. With the LLVM-based Clang compiler, the RPCS3 binary would crash in the Qt libraries. Oops. Through no fault of our own, we couldn't enable this compiler feature. Despite years passing, LTO still caused issues with third-party code. One day, I learned that LTO can be enabled for just a portion of your program. Enabling LTO for just the core of RPCS3, and not any of the third-party code, speeds up performance from around 166 frames per second to 180 frames per second.

So why is LTO so much faster? Well, when compiling programs without LTO, each C or C++ file in the program is compiled and optimized on its own. When code in one file calls into code from another file, the compiler can't see what that code is yet. Instead, it leaves a stub in the compiled code, and later, once all the individual files are compiled, the linker resolves all the stubs so that code from each file can correctly call into the others. This is where LTO comes in. When LTO is enabled, the optimizations are deferred until link time, so the optimizer can do things like inline functions that are only ever called once, delete code that's never called, and so on. It generally makes the whole program a little bit faster and considerably smaller. The other nice thing is that it makes the bottlenecks more obvious. If we spin up a profiler, we can see that some parts of the code are dominating execution time. The shader hashing part of the code has held my attention for some time, and with LTO now enabled, it appears to be an even larger bottleneck than before. There are a lot of different approaches to hashing; RPCS3 uses the Fowler-Noll-Vo, or FNV, hash here.
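For reference, the 64-bit FNV-1a variant is only a few lines. This is the textbook form, not necessarily the exact variant RPCS3 uses:

```cpp
#include <cstdint>
#include <cstddef>

// Textbook 64-bit FNV-1a: XOR in one byte, then multiply by the FNV prime.
uint64_t fnv1a_64(const void* data, size_t size)
{
    const uint8_t* bytes = static_cast<const uint8_t*>(data);
    uint64_t hash = 14695981039346656037ull; // FNV offset basis
    for (size_t i = 0; i < size; i++)
    {
        hash ^= bytes[i];
        hash *= 1099511628211ull; // FNV prime
    }
    return hash;
}
```

Because it works one byte at a time with a serial dependency on the multiply, it's simple and quick on small inputs but leaves a lot of throughput on the table for large ones.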
FNV isn't cryptographically secure, but that doesn't matter for our use case. FNV is actually decently fast at hashing small data sizes, since the code is so simple. However, for large data sizes, more modern non-cryptographic hashes such as xxHash can be faster. I wanted to integrate a more advanced form of hashing into this code, but there's a bit of a problem. See, the shaders have floating-point constants in between the RSX instructions. You could hash the constants alongside the shader instructions, but then some games would end up compiling a nearly infinite number of shaders, because games like to patch the constants in their shader programs between draws. That would mean that every time a new constant was patched into a shader, we would need to recompile it. Ouch. So it's necessary that we avoid including the constants in our hashes. Third-party hash implementations are not well suited for this: either I could copy the data to a new buffer without the constants and feed that into third-party hashing code, or I could adapt or reimplement existing hashing code to make it skip the constants. This is where I was stuck for some time.

But after replacing the comparisons in the SPU code with checksums, I wondered: does the quality of the hash actually need to be good at all? Is a checksum good enough as a hash? I was nervous about using a plain 64-bit checksum as a hash, so I came up with this: a checksum where the inputs are rotated. Now, I don't know whether we'd face hash collisions if we didn't rotate the inputs, but I'm nervous about them, and it's hard to test every PlayStation 3 game for shader hash collisions, so I included the rotation. The performance boost from this rotated checksum was pretty great, bringing us from around 180 FPS in the LTO build all the way to 185 FPS in our test scene. Pretty slick.

Now, here's where things get hard to explain. Part of the reason I chose this rotated checksum setup is that I knew it would be more amenable to vectorization than the FNV code. However, it's still too difficult for the compiler to auto-vectorize on its own. And look, if it's tough for the compiler to make the transformation, then it's going to be hard to explain to you guys, too. So buckle up, because you're about to be smarter than the compiler.

This loop uses a bit set to skip over hashing the elements that contain constants. The size of the bit set is 512 bits, since the PlayStation 3 uses a DirectX 9-era graphics chip and has a maximum of 512 instructions in vertex shaders. We use a bit set here because it takes up less space than an array of booleans, which would take 512 × 8 bits; in other words, the bit set only needs 64 bytes, while the boolean array would need 512 bytes. Okay, great. But there's actually a trade-off here. It's trivial to index into an array of booleans, but indexing into the bit set is a bit more complex. The implementation here chooses to read 64 bits each loop iteration, mask off an individual bit, and branchlessly move on to the next 64-bit element once it has masked off all 64 bits of the current one. This means that both an array of bools and a bit set require the same 512 load instructions to fully traverse. Performance-wise, the bit set requires more instructions to traverse, but it could still be faster since it's more cache-friendly than an array of bools.
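To give a feel for the scalar loop, here's a heavily simplified sketch. The names are hypothetical, RPCS3's real loop is branchless and differs in detail, and the exact rotation amount and the way the two 64-bit words of each 128-bit slot are combined are assumptions on my part; the point is just that the per-element work is a rotate and an add.

```cpp
#include <bit>
#include <cstdint>
#include <cstddef>

// Simplified sketch of a rotated checksum that skips constant slots.
// Each vertex-program slot is assumed to be 128 bits, i.e. two 64-bit words,
// and bit i of `constant_mask` marks slot i as holding a patched constant.
uint64_t hash_vertex_program(const uint64_t* words, const uint64_t* constant_mask, size_t slots)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < slots; i++)
    {
        // Read the 64-bit chunk of the bit set covering slot i and test its bit.
        if ((constant_mask[i / 64] >> (i % 64)) & 1)
            continue; // this slot holds a constant that games patch between draws

        // Rotate each word by a slot-dependent amount before summing, so that
        // identical words in different slots don't cancel out as easily.
        const int r = int(i % 64);
        sum += std::rotl(words[i * 2 + 0], r);
        sum += std::rotl(words[i * 2 + 1], r);
    }
    return sum;
}
```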
Because traversing the bit set is non-trivial, the compiler isn't able to auto-vectorize this code, so we need to do it manually. The nice thing about AVX-512 is that the vectors are 512 bits wide, so we can just fit the entire bit set into one register and work from there. Okay, I lied. See, the limit on vertex program instructions should be 512, but the Ratchet and Clank developers are insane and found a way to stuff more instructions into a single program than should be possible. So the bit set we're actually using has 544 elements, which means we have the first 512 bits in one register and the next 32 in another register. Yuck. Luckily, there's a shuffle instruction that takes two registers as input. We can use the _mm512_permutex2var_epi8 intrinsic, which corresponds to the VPERMT2B instruction, to replicate one byte of the bit set across an entire 512-bit vector. Using VPERMT2B for this purpose feels like overkill, but it's the right tool for the job. We only need to replicate one byte across a whole vector, but VPERMT2B can do so much more. It feels like when you get the quad damage in Quake and everything you touch just turns into giblets. It feels weird comparing coding in assembly to violence, but this is the only way I can reconcile it in my head: VPERMT2B is a violent instruction, and I'm using a nuke to avoid extra load operations in the loop.

Anyways, next we compare against the bit set twice to create two masks that will let us skip the instruction slots that contain constants: we compare against one constant for the lower four bits of our byte and against another constant for the upper four bits. Next, we use masked load instructions. The masking means they won't load any data for slots that are masked out by our bit set. Nice. Finally, we actually perform the rotating and checksumming of our data. Each loop iteration chews through eight elements, or 1,024 bits of input data. We don't need a remainder loop for when the data isn't a multiple of eight elements, because the load mask gracefully handles that case. Once all the data has been read, we use a reduce intrinsic, which emits all the code necessary to turn our 512-bit sum into a 64-bit result.

The resulting code runs four times as fast as the scalar version on my machine and brings the average FPS up from 185 to 193 frames per second. Awesome. Remember that the scalar version needs 544 load instructions just to traverse the bit set, along with 1,088 more loads in the process of hashing the shader data. In comparison, the AVX-512 version only needs 136 load instructions. Oh yeah, and all my benchmarking was done on my Zen 4 machine, which is half the speed of Zen 5 when it comes to these wide AVX-512 instructions. So on newer machines, you can expect this code to be eight times faster than the scalar version, which is honestly pretty disgusting.

Another RPCS3 developer, Elad, saw what I was doing and decided to take things even further. He realized that we didn't always need to hash the shader again and added logic to skip rehashing when we didn't need to. This brought the non-AVX-512 path up to 193 FPS, matching what the AVX-512 path was at previously, and the AVX-512 path increased even further, to nearly 200 frames per second. In the end, this code went from a potential bottleneck to being so fast it barely even shows up on the profiler.

Anyways, did you guys like the video? Like and subscribe if you want to see more. See you.