jan xx-28, 2023
no screenshots today, because today was all about debugging the worldrep merge tool. all very technical stuff, so skip this if youre not here for programming blather.
so, the problem i had been wrestling with on and off for the past week or two was this: i merged the two worldreps into one. it seemed cromulent. but if i loaded the level into dromed, dromed crashed the moment i moved the camera into any "in the world" space, as it tried to render the view from inside a cell.
unfortunately, the olddark dromed with debug symbols wasnt at all helpful here, because it didnt crash! it just didnt render anything. so i was having to wade through the newdark dromed disassembly in visual studio, while correlating it with the disassembly/decompile in ghidra, and correlating that with the leaked source, just to try to begin to figure out what was happening. problem was, keeping track of things in visual studio was a mess, because of ASLR. Address Space Layout Randomisation is a security feature in windows that loads a program at a different base address in memory every time it launches, which means that the addresses of the functions and variables in the program change too. which means that i would launch dromed, put breakpoints in various functions and inspect various variables for one run, and then hit the crash, realise i needed more info from what was happening before the crash, and so have to relaunch dromed, and now all my breakpoints and memory references were wrong. just far too much overhead, i couldnt juggle all that and try to understand what dromed was actually doing.
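to make the ASLR headache concrete, here is a minimal python sketch of the bookkeeping it forces on you: every address noted in one run has to be turned into an offset from that run's base address, then re-added to the next run's base. the two base addresses below are made up for illustration (i am not claiming these are what dromed actually loaded at), and `rebase` is just a hypothetical helper name.

```python
def rebase(addr, old_base, new_base):
    """translate an absolute address from one run into the matching address in another run."""
    offset = addr - old_base      # position relative to where the image was loaded
    return new_base + offset      # same position, relative to the new load address


# made-up base addresses for two separate launches (illustration only)
run1_base = 0x00400000
run2_base = 0x00A30000

# an address noted during run 1
noted_in_run1 = 0x00552801

print(hex(rebase(noted_in_run1, run1_base, run2_base)))
```

doing that by hand for every breakpoint and every memory window, on every relaunch, is exactly the overhead described above.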
all i had managed to discover up to this point was that it was trying to access the render info for a cell that was offscreen, and had never been prepared for rendering. if i had messed up the bsp nodes when doing the merge, or the "destination cell" ids in the portal info, that would have been an obvious cause. but no, i double checked that. even wrote a function to dump the merged bsp tree to a graphviz file so i could generate a diagram, making it much easier to check for correctness. here are the two trees that were the input:

and merged, with a new root node inserted, and all the node and cell ids renumbered:

and all that looked just fine. i couldnt for the life of me figure out why this offscreen cell was ending up in the list of cells to be rendered! so i set the whole merging thing aside for a while to focus on building out the ruins (i.e. the stuff from the previous post).
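for anyone wanting to do the same kind of check, a bsp-to-graphviz dump can be quite small. this is a rough python sketch, not the actual tool: the `Node` class, its field names, and the "in"/"out" edge labels are all assumptions made for the example, and the real worldrep node layout is different.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    # hypothetical stand-in for a bsp node; not the real worldrep layout
    node_id: int
    cell_id: Optional[int] = None      # only set on leaf nodes
    inside: Optional["Node"] = None
    outside: Optional["Node"] = None


def to_dot(root):
    """return the tree as graphviz DOT text, one graph node per bsp node."""
    lines = ["digraph bsp {"]

    def walk(n):
        if n.cell_id is not None:
            # leaves are drawn as boxes and show their cell id
            lines.append(f'  n{n.node_id} [shape=box, label="node {n.node_id}\\ncell {n.cell_id}"];')
        else:
            lines.append(f'  n{n.node_id} [label="node {n.node_id}"];')
        for label, child in (("in", n.inside), ("out", n.outside)):
            if child is not None:
                lines.append(f'  n{n.node_id} -> n{child.node_id} [label="{label}"];')
                walk(child)

    walk(root)
    lines.append("}")
    return "\n".join(lines)


tree = Node(0, inside=Node(1, cell_id=0), outside=Node(2, cell_id=1))
print(to_dot(tree))
```

save the output to a file and run it through graphviz's `dot -Tpng` to get the picture.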
anyway, yesterday i learned of a way to disable ASLR by patching dromed.exe with this very helpful utility. with that, dromed.exe now launched at the same base address every time, so i could finally keep consistent breakpoints and memory dumps between runs. which meant i could finally trace what was happening effectively. i set up a few breakpoints at the entry points of various functions that would print out the relevant arguments, and similarly a few leading up to the crash point itself. here's what they printed (with some explanatory comments added afterwards):
Code:
// at the beginning of the render pass:
initialize_first_region( 2 ) // camera is in cell 2, so put it in the list to be rendered.
setup_cell( 0x0a6b0970 ) // prep cell 2's pointer for rendering.
examine_portals( 0x0a6b0970 ) // find other cells connected to 2 that also need rendering.
add_region( 1, ..., 0x0a6b0970 ) // found cell 1, add it to the list.
setup_cell( 0x0a6b0850 ) // prep cell 1's pointer for rendering.
examine_portals( 0x0a6b0850 ) // find other cells connected to 1 that also need rendering.
// (there are no more to find that would be onscreen).
// now later in the render pass, just before the crash:
> active_regions[ 0 ] // look up the first cell in the list...
> wr_cell[ 2 ] // its cell 2 (just as expected)
> cell ptr: 0x0a6b0970 // and here is its pointer again.
> active_regions[ 1 ] // look up the second cell in the list...
> wr_cell[ 0 ] // its cell 0?? how? it was cell 1 that got added to the list above!
> cell ptr: 0x0a6b07f0 // yeah this is cell 0's pointer. wtf is going on?
Exception thrown at 0x00552801 in DromEd.exe: 0xC0000005: Access violation writing location 0x00000028.
it still didnt make sense, but now that i could step through all this in repeated runs, exploring different pieces of the code along the way, i finally discovered where the list of cells to render was getting mangled. and it was in a function called "sort_via_bsp()", whose job was to sort the list of cells to render in front-to-back order. this didnt make sense to me: this function has a simple job, which obviously worked just fine ordinarily. and my bsp tree all looked correct! how could it be messing up when walking my merged bsp tree but just fine with either of the original two? and then i stopped looking at the logic and maths that sort_via_bsp() was doing, and noticed something small, something obvious and usually unremarkable: it was checking a "Marked" flag on each leaf node it encountered; and if the flag was set, writing that node's cell id into the (sorted) list of cells to render. to make that work, just before sort_via_bsp() is called, a function called setup_bsp() is responsible for setting those flags: it first calls unmark_bsp() with the root node, to clear all the Marked flags; and then for each cell in the (unsorted) list of cells to render, it sets the flag. so this Marked flag simply means "this cell is going to be rendered this turn". all very ordinary, and normally i would never have batted an eye at this code. but…
but i remembered the flags i had seen days earlier in the bsp trees i was using as input. two of the flags made sense for the data structure, but the third flag, "Marked", had been set on some nodes, even in the .mis on disk, and i didnt understand what it meant. i had looked up where this flag was used, seen it was only used for this common clear-and-mark pattern that meant it was transient and its value on disk didnt matter at all, and disregarded it. i wasnt even checking the flag when generating the visual graphs. but now that it was implicated, i regenerated the graphs with an 'M' for the marked nodes. the pictures above are output from this updated graph generation, and you can see that all of the non-leaf nodes in the "top" worldrep are marked; and all of the nodes in the "bottom" worldrep are unmarked. and when merged, my root node obviously was unmarked, because why would i set this flag on it? after all, every node gets the marked flag cleared at the start of the render pass, right? right?
nope! turns out the unmark_bsp() function responsible for clearing the flags is written recursively. and to avoid walking more of the bsp tree than it needs to, if it encounters a node that isnt marked, it concludes that its subtrees are also entirely unmarked, and takes a shortcut: it doesnt recurse any deeper from there. and this shortcut is perfectly reasonable and in line with how the marked flag is set later on. but with my new root node that was not marked, this shortcut meant that the entire tree was never getting unmarked.
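here is a toy python model of that failure, just to show the mechanism. it is not the dark engine code: the `Node` class, `mark_cell`, `collect_marked` (a stand-in for sort_via_bsp() with all the front-to-back geometry removed), and the little three-cell tree are all made up for the example. it also assumes that marking a cell flags its leaf and every ancestor up to the root, and that leaves carry the flag too, which is a simplification.

```python
class Node:
    # hypothetical stand-in for a bsp node
    def __init__(self, name, cell=None, kids=(), marked=False):
        self.name = name
        self.cell = cell            # cell id on leaves, None on interior nodes
        self.kids = list(kids)
        self.marked = marked        # the flag as it was loaded "from disk"
        self.parent = None
        for k in self.kids:
            k.parent = self


def unmark_bsp(node):
    # the shortcut: an unmarked node is assumed to have an entirely unmarked subtree
    if not node.marked:
        return
    node.marked = False
    for k in node.kids:
        unmark_bsp(k)


def mark_cell(leaf):
    # assumption: marking a cell marks its leaf and all of its ancestors
    n = leaf
    while n is not None:
        n.marked = True
        n = n.parent


def collect_marked(node, out):
    # stand-in for sort_via_bsp(): emit the cell of every marked leaf it reaches
    if not node.marked:
        return
    if node.cell is not None:
        out.append(node.cell)
    for k in node.kids:
        collect_marked(k, out)


def build(stale):
    # "top" worldrep: flags left set on disk when stale=True
    c0 = Node("leaf 0", cell=0, marked=stale)
    c1 = Node("leaf 1", cell=1, marked=stale)
    top = Node("top", kids=[c0, c1], marked=stale)
    # "bottom" worldrep: clean
    c2 = Node("leaf 2", cell=2)
    # freshly created merge root, so it starts out unmarked
    root = Node("new root", kids=[top, c2])
    return root, {0: c0, 1: c1, 2: c2}


def render_list(root, leaves, visible):
    unmark_bsp(root)                 # meant to clear every flag
    for cell in visible:
        mark_cell(leaves[cell])
    out = []
    collect_marked(root, out)
    return out


# camera is in cell 2 and can see cell 1; cell 0 should never show up
print(render_list(*build(stale=False), visible=[2, 1]))
print(render_list(*build(stale=True), visible=[2, 1]))
```

with clean flags only cells 1 and 2 come out. with the stale flags under an unmarked root, unmark_bsp() returns immediately, nothing gets cleared, and cell 0 sneaks into the list.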
so that finally explained how cell 0 had ended up in the sorted list: as you can see in the merged graph, its parent, node 11, had the marked flag set (as did the parents of cells 1 and 2, which were the cells that were actually supposed to be rendered; so they wouldve been marked regardless).
so this little unremarkable flag, that honestly, being transient, should never have been written to disk at all, turned out to be the cause of days of pain for me.
like most bugs, once the cause is understood, the fix is easy. when i merge the two worldreps, i go through every node and clear the marked flag. it really should never have been written to disk, so lets just clear it from the on-disk bsp tree. and with that, my crash went away! the merged worldrep rendered perfectly.
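in sketch form the fix is a single unconditional pass over the nodes, with none of the early-out that unmark_bsp() uses. the python below assumes the merged nodes sit in a flat list and each has an integer flags field; the `MARKED` bit value and the dict layout are placeholders for the example, not the real on-disk format.

```python
MARKED = 0x04   # hypothetical bit value; the real flag value may differ


def clear_marked(nodes):
    """clear the marked bit on every node, whether or not its parent has it set."""
    cleared = 0
    for node in nodes:
        if node["flags"] & MARKED:
            node["flags"] &= ~MARKED
            cleared += 1
    return cleared


merged = [{"flags": 0x07}, {"flags": 0x03}, {"flags": 0x04}]
print(clear_marked(merged), [hex(n["flags"]) for n in merged])
```

the other flag bits are left alone, since those are the ones that actually mean something on disk.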
of course, i immediately encountered a new crash! but that will be a job to debug another day…