Welcome to what could (hopefully) become the first ongoing series on this blog. In “What the Bug?!?”, I want to share my experiences with bugs in games (and sometimes other software) that I consider particularly remarkable in some way or another. This can mean many different things: these bugs could be severe, sneaky, well hidden and difficult to find, dumb, or even just plain funny. (I know, I more or less promised a blog post about my experiences working on Darksiders Warmastered Edition a while ago, but there are so many things I want to say about it that I haven’t found the time and motivation yet. Please remain patient.) Today, I want to talk about not just one, but actually two bugs I’ve run into recently while facing the beast (C++), one of them being entirely my own fault. (Hence the “stupid”) ( ͡° ͜ʖ ͡°)
Both bugs occurred on our current (unannounced) project, a port of an older game. Both have to do with the evil nature of C++, and both were particularly hard to find.
So for roughly the past two weeks, I’ve been working on migrating the code of our project from an old Havok version to the most recent one. If you don’t know me, let me tell you that I’m not a huge fan of big middleware when it comes to porting games. Middleware certainly has its place, even in game development, but when it comes to porting an old game, it can very easily become a pain in the ass. In most cases, the API of middleware changes a lot over time, so the older the game you’re porting, the more you have to adjust its code to get the middleware back running. Most of the time, staying on older versions of the middleware in question isn’t possible, either, because those old versions usually don’t support the new systems you’re trying to port the game to, and more often than not, you don’t even get full access to the middleware’s source code, so you can’t even port everything yourself (which is actually a legit solution sometimes – e.g. for Darksiders, we actually ported an old version of Scaleform to different systems rather than trying to switch to a newer version of Scaleform). So long story short, for our current project, we need to migrate to a newer version of Havok and this has taken away quite some time already and caused a bunch of problems.
The first of these problems had to do with memory allocation. If you’re not into game development, memory allocation might be a problem you don’t often have to bother with. Even if you’re among the relatively few people (outside of game development) who are still programming software or libraries in C or C++, it’s likely that malloc() and free() or new and delete are all you ever really need to use. For games, on the other hand, this is quite different, especially in the case of console games. When developing games for consoles, there are a good number of reasons why standard memory allocation may not be enough for you.
- On some consoles, they may not even be (fully) supported (though this mostly affects older consoles, I think).
- RAM on consoles is very limited and you generally want to have some control over what is loaded into memory and where. This problem is certainly getting smaller and smaller the further into the future we get, but since 4K is also becoming a thing now, requiring higher texture resolutions, we’re still far from the point where the amount of RAM in your console is trivial.
- For certain things, there may be certain memory requirements. For example: when decoding videos, there may be a requirement for where in RAM your video frames should go and how the memory should be aligned.
- Even performance can be a huge factor. For example: certain memory areas could use different memory buses with different speeds, so when performing a task that needs to access a huge block of memory very often, it may be favorable to put that memory into an area with a faster bus.
Long story short, when developing games, chances are high that you’re going to use your own memory allocators, and naturally, middleware designed specifically for games is aware of this fact and also uses custom memory allocators (usually with some means of hooking your own allocators into them).
For the project we’re currently working on, we’ve set up Havok to use a free list memory allocator with a memory block provided by the game. If you don’t know how a free list allocator works, here is a short rundown: it maintains a linked list of elements representing free memory blocks (usually located at the start of said blocks) where each element has a pointer containing the address of the next memory block that is “free” (or null, if there are no more free blocks left). Now whenever you allocate some memory, the allocator looks for an element in the list pointing to a memory block that is large enough for your request (some allocators maintain multiple lists for different block sizes to make this easier and faster), then returns the address to that memory block and removes the element from the list. When you free that memory again, the allocator adds the element back to the list.
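To make the rundown above a bit more concrete, here is a minimal sketch of such an allocator. It assumes fixed-size blocks and a suitably aligned buffer handed in by the caller. The class `FreeListAllocator` and all of its member names are made up for illustration; this is not how Havok implements it.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Hypothetical fixed-block free list allocator over a caller-provided buffer.
// Names and layout are illustrative only; this is not Havok's implementation.
class FreeListAllocator {
public:
    // The buffer must be aligned for a pointer and outlive the allocator.
    FreeListAllocator(void* buffer, std::size_t bufferSize, std::size_t blockSize) {
        // Each free block stores its list node at its start, so it must fit one.
        assert(blockSize >= sizeof(Node) && blockSize % alignof(Node) == 0);
        unsigned char* begin = static_cast<unsigned char*>(buffer);
        const std::size_t count = bufferSize / blockSize;
        // Push back to front so the first allocation returns the first block.
        for (std::size_t i = count; i-- > 0;) {
            deallocate(begin + i * blockSize);
        }
    }

    // Pop the first free block, or return nullptr if none are left.
    void* allocate() {
        if (m_head == nullptr) return nullptr;
        Node* node = m_head;
        m_head = node->next;
        --m_freeCount;
        return node;
    }

    // Push a block back onto the list.
    // Note: nothing here checks that the block ever belonged to this allocator.
    void deallocate(void* block) {
        m_head = ::new (block) Node{m_head};
        ++m_freeCount;
    }

    std::size_t freeCount() const { return m_freeCount; }

private:
    struct Node { Node* next; };  // lives at the start of every free block

    Node* m_head = nullptr;
    std::size_t m_freeCount = 0;
};
```

The list nodes live inside the free blocks themselves, so the allocator needs no extra bookkeeping memory. That also means `deallocate()` will happily write a node into whatever address you hand it.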
Getting back to our first bug now. This bug was a hard crash that always occurred in the same location after a certain number of allocations via this free list allocator. Havok tried to request a certain number of memory blocks from this allocator and, upon iterating its free list, the allocator came across an element that was very clearly pointing to an invalid memory block, making the game crash while trying to read from it. Inspecting the different elements in the linked list, one fact quickly became apparent: the element pointing to the invalid memory block was located at a very suspicious address itself. Most elements in the list had addresses somewhere in the range of 0x6XXXXXXX. This particular element had an address very far away, in the range of 0x9XXXXXXX. This memory clearly didn’t seem to belong to the allocator, so it was time to find out how it got there. This actually turned out to be quite tricky. Even with address space layout randomization deactivated, that one suspicious list element always ended up at a different location, making it impossible to just use data breakpoints on it. My only idea here was to use conditional breakpoints with a very generic condition of [cpp]address >= 0x90000000 && address < 0xA0000000[/cpp]. If you’ve ever used conditional breakpoints, you know how slow these things are, so this certainly didn’t help in getting closer to a solution. On top of that, after going down this rabbit hole for quite some time, it all just led me back to the beginning: the same allocation function where the crash had occurred in the first place. Unfortunately, that wasn’t very helpful.
However, tracing the events in my head, I started thinking something along the lines of “well, if a memory allocation is where the game crashes, maybe freeing some memory is what actually causes the crash”, and believe it or not, this line of thought actually led me straight to the answer. Placing a conditional breakpoint in the allocator’s free function revealed the code location where this faulty element was actually added to the allocator for the first time.
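As a side note, a common way to dodge slow conditional breakpoints is to compile the condition into the code and put a plain breakpoint on the line that only runs when it is true. This is not what I did here. It is just a sketch of the idea, and the address range, `isSuspiciousAddress()`, `checkFreedPointer()` and the counter are all made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical range, taken from the symptom described above.
constexpr std::uintptr_t kSuspiciousBegin = 0x90000000u;
constexpr std::uintptr_t kSuspiciousEnd   = 0xA0000000u;

// Same condition as the conditional breakpoint, but evaluated by the program.
inline bool isSuspiciousAddress(const void* p) {
    const auto address = reinterpret_cast<std::uintptr_t>(p);
    return address >= kSuspiciousBegin && address < kSuspiciousEnd;
}

// Counts how often a pointer from the suspicious range was seen.
int g_suspiciousFrees = 0;

// Hypothetical hook, meant to be called at the top of the allocator's free function.
inline void checkFreedPointer(const void* p) {
    if (isSuspiciousAddress(p)) {
        ++g_suspiciousFrees;  // put an ordinary (unconditional) breakpoint on this line
    }
}
```

The check costs a couple of comparisons per call instead of a round trip through the debugger, so the game keeps running at close to normal speed until the interesting case shows up.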
So what happened? Something quite remarkable, and this is where some of the evil nature of C++ comes into play: operator overloading. Surely, operator overloading is one of the coolest and most useful features of C++. For example: when writing your own vector class, overloading its math operators to make common math operations more readable and more intuitive makes perfect sense. It doesn’t stop there. C++ is very flexible and even lets you overload things like cast operators and the new/delete operators. The latter is where it gets interesting, but also quite dangerous.
You see, as mentioned above, it’s quite common for middleware APIs to evolve over time, and in the case of Havok, one of the APIs that was greatly affected by this was the memory API. You could go so far as to claim that the memory API was basically completely rewritten at some point. While doing so, the developers of Havok probably figured that it would make sense to overload the new and delete operators of all their classes so that calling them would actually use their own memory allocators. That’s exactly what they did. Unfortunately, there probably was a minor oversight on their part, and in rare cases this minor oversight can lead to quite some trouble: they forgot (or decided against) overloading the placement new operator of those classes as well.
A placement new, if you’re not familiar with C++, is basically a new that only constructs an object in a certain memory location rather than also allocating memory for it. This is meant for storing objects in memory you’ve already allocated rather than leaving memory allocation up to the new call itself. Now to be fair, this alone might not be a major issue and I’m not even entirely sure if the blame here is on the Havok developers or on the original developers of the game, using Havok code in unintended ways. It’s probably both to some extent, and it also required a second change by the Havok developers, related to reference counting, to ultimately cause this bug.
Getting to the point: the original developers of the game decided to rely on placement news, coupled with their own memory blocks, for creating some of their Havok shapes, yet they also decided to rely on Havok’s reference counting to dispose of these shapes. Apparently, this worked just fine with the original version of Havok, but upon switching to a newer version, it made the game crash in the way described above. If you followed what I’ve been talking about just now, you can probably see where this is going, so I’ll try to be short: when creating Havok shapes, the game places them in its own memory buffers using placement new, but for destroying these shapes, it relies on Havok’s reference counting, which internally is set to just call delete on any object it destroys. This makes use of the overloaded delete operators on those classes, which are set to route all delete requests through Havok’s memory system. This means that when deleting objects this way, Havok’s memory allocators will try to reclaim the memory used for them. At least in the case of the free list allocator, there doesn’t seem to be any validity check for reclaiming memory, so the allocator just inserts it into its linked list and tries to reuse it later. Since in this particular case the memory doesn’t actually belong to the free list allocator (it belongs to the game), and the game reuses and overwrites this memory at some later point, it’s just a matter of time before this leads to a crash. That’s exactly what happened, and finding this bug took more than a full day. Certainly not the longest I’ve ever spent searching for a bug, but long enough to start feeling a bit of despair.
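The mismatch is easier to see in a few lines of code. The sketch below is a toy reconstruction under my own assumptions, not Havok’s code: `Pool`, `pool()` and `Shape` are hypothetical stand-ins for the middleware’s memory system and one of its classes. Unlike the allocator described above, this toy pool does have an ownership check, so the foreign pointer is counted and rejected instead of corrupting anything.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

// Hypothetical stand-in for the middleware's memory system: a tiny bump pool
// over its own storage that can tell whether a pointer belongs to it.
class Pool {
public:
    void* allocate(std::size_t size) {
        const std::size_t align = alignof(std::max_align_t);
        size = (size + align - 1) / align * align;  // keep every block aligned
        if (m_used + size > sizeof(m_storage)) return nullptr;
        void* p = m_storage + m_used;
        m_used += size;
        return p;
    }
    bool owns(const void* p) const {
        const auto address = reinterpret_cast<std::uintptr_t>(p);
        const auto begin = reinterpret_cast<std::uintptr_t>(m_storage);
        return address >= begin && address < begin + sizeof(m_storage);
    }
    // The validity check: refuse memory this pool never handed out.
    // (A bump pool cannot reuse blocks, so accepted frees are only counted.)
    bool deallocate(void* p) {
        if (!owns(p)) { ++m_rejected; return false; }
        ++m_released;
        return true;
    }
    int released() const { return m_released; }
    int rejected() const { return m_rejected; }

private:
    alignas(std::max_align_t) unsigned char m_storage[1024];
    std::size_t m_used = 0;
    int m_released = 0;
    int m_rejected = 0;
};

inline Pool& pool() { static Pool instance; return instance; }

// Hypothetical middleware class with class-specific new/delete.
struct Shape {
    float radius = 1.0f;

    // Regular new: memory comes from the pool.
    static void* operator new(std::size_t size) {
        void* p = pool().allocate(size);
        if (p == nullptr) throw std::bad_alloc();
        return p;
    }
    // Placement new: the caller supplies the memory, the pool is not involved.
    static void* operator new(std::size_t, void* where) noexcept { return where; }
    // Delete: always routed to the pool, no matter where the memory came from.
    static void operator delete(void* p) noexcept { pool().deallocate(p); }
    // Matching placement delete; only used if a constructor throws.
    static void operator delete(void*, void*) noexcept {}
};
```

With these definitions, `new Shape` followed by `delete` is a matched pair, but `new (gameBuffer) Shape` followed by `delete` hands `gameBuffer` to the pool. Whether that is caught depends entirely on a check like `owns()`. For objects constructed with placement new, the usual pattern is to call the destructor explicitly (`shape->~Shape()`) and leave the memory to whoever owns it.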
Whew, this was quite a lot of text to explain that one bug. I wish I could have made it more graphical and less dry. And we’re not even done yet. There is still a second bug left to talk about, so I’ll try to keep it short. This time, the blame is entirely on me, so there’s even something to laugh about! ;-)
One of the biggest new features in C++ compared to C surely was the introduction of classes and inheritance: a powerful tool that can help you achieve certain things faster, but one that also comes at a price and with its own set of dangers. For those reasons, our philosophy is to use certain C++ features only sparingly. Of course, you don’t always have full control over this. When porting an old game or when using middleware, you get what you get, and Havok in particular relies heavily on inheritance and polymorphism. In my experience, the latter is one of the most dangerous aspects of object-oriented programming and is exactly what we try to avoid when there are good alternatives, but sometimes you just roll with it.
Back to our port: while migrating the code to the new Havok version, there was a situation where a virtual function had been removed from a Havok base class, but was still present in its derived classes, and the game actually called this function via a pointer to a base class object. By now, we’ve found out that the original developers of the game had actually edited their Havok source code (which they had full access to, unlike us), so in hindsight, it’s entirely possible that this function had never been part of the base class to begin with and that the original developers just hacked it in there. Whatever the case, this function was now missing from the base class, yet we still had to get the game to compile with the new Havok version somehow.
This is the part where experienced C++ programmers can point their fingers at me and have a good laugh, but being the naive and inexperienced programmer I am, I decided to just hack this function into one of the Havok headers. Now, this isn’t something I usually do (nor recommend doing), but when you’re already working on a game with a mess of a code base (certainly the case here), and especially when working on a tight schedule, you sometimes dare to pull a “what could go wrong”. Well, in this particular case, we found out rather quickly what can go wrong, although again we had to invest more than a day into a needless bug hunt. To be fair, had we compiled Havok entirely from source, I might have gotten away with my impatient header hack, but we were linking against pre-compiled Havok libraries, so the libraries and the game no longer agreed on what that class looked like. So… yeah, I don’t recommend repeating this.
So what exactly was the problem? Well, it turned out that by changing that one header, I had completely fucked up the virtual function pointer table of a class, so upon trying to call one of its virtual functions, the application actually jumped into an entirely different function, making the game crash due to incompatible arguments. To be honest, I’m not sure I would ever have found the cause all by myself with my current expertise, despite the symptoms being rather suggestive (mostly because by the time the bug occurred, I had already forgotten about my header hack). Thankfully, my boss has been in the industry for way longer than me (a few decades by now), so as a team effort, finding the problem’s source didn’t take as long as it could have.
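Actually reproducing this with real classes would be undefined behavior, so here is a simulation instead. It models the virtual function table as an explicit array of function pointers and shows what happens when the caller’s idea of the slot order differs from what the library was built with. Everything here is hypothetical (the function names, the slot order, the enums). Real vtable layouts are a compiler implementation detail, and in the real case the functions also had different signatures, which is what turned “wrong function” into “crash”.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Every simulated "virtual function" has the same signature and returns its own
// name, so calling the wrong one stays well defined and easy to observe.
using Slot = const char* (*)();

const char* destroyImpl() { return "destroy"; }
const char* getTypeImpl() { return "getType"; }
const char* castRayImpl() { return "castRay"; }

// The table as the pre-compiled library built it (from its original header).
const Slot kLibraryTable[] = { destroyImpl, getTypeImpl, castRayImpl };

// Slot indices according to the library's original header.
enum LibrarySlot : std::size_t { kLibDestroy = 0, kLibGetType = 1, kLibCastRay = 2 };

// Slot indices according to a header with one extra virtual function hacked in.
// Everything declared after the new function moves down by one slot.
enum EditedSlot : std::size_t { kEditedDestroy = 0, kEditedHacked = 1, kEditedGetType = 2 };

// A virtual call boils down to "call whatever sits at this index in the table".
const char* callSlot(std::size_t index) {
    assert(index < sizeof(kLibraryTable) / sizeof(kLibraryTable[0]));
    return kLibraryTable[index]();
}
```

Calling `callSlot(kLibGetType)` reaches `getTypeImpl`, while `callSlot(kEditedGetType)` lands in `castRayImpl`: same table, same object, but the caller compiled against the edited header picks the wrong slot.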
This post has gotten really long and I apologize for that. I wish I could have spiced it up some more with imagery, but I can’t really think of anything else relevant to the topic, so this will have to do for now. If you have stuck around until the end, you certainly deserve a medal and my gratitude. I hope you at least got some enjoyment out of reading this and maybe even learned a thing or two (I know I have). If you did, I would certainly love to see you again on this blog for future posts. See ya!