Now, what I'm working on is a retro-ish top-down 2d game, kinda like a cross between Berzerk and Galaxian, but somewhat modernised, with the kind of enemy overload you see in Geometry Wars. I'm a big fan of swarming, so there's gonna be some of that.
One of the key things I'm aiming for here is a solid frame rate with no slowdowns. The key to doing something like this, especially on a "low performance" platform like the PSP, is keeping the underlying algorithms not only as fast as possible, but, above all, stable in terms of execution time. There's no point[1] having (for example) a swarming algorithm that's blazing fast most of the time, but that suddenly takes exponential time under a specific set of circumstances.
Effectively, these are underlying engine considerations. If your engine then allows "user scripting code" behaviours[2], that might get blown away (although, again, see [1]), but that's not something we particularly need to worry about here.
So. What algorithms do we need, and what can we do about them?
- Collision Detection
- Swarming
- "Physics"
- Sorting
Code: Select all
typedef void (*callback_fn_t)(void* object, void** near_objects, ...);
How to use this?
Let's assume we keep our objects in a big, pre-allocated array. That's not a bad way of doing things: it keeps our memory usage bounded (at the cost of capping the number of objects in the game), and means that, instead of passing around pointers to objects, we can pass around indexes into that array. Something like this:
Code: Select all
#define N_GAME_OBJECTS 1024
typedef struct { ... } game_object_t;
extern game_object_t * game_object_array;
Somewhere in game_object_t we have a morton code: an integer which we update for each object as part of the "physics" step. It's computed by interleaving the bits of the object's integer x and y (and z, in 3 dimensions) coordinates[5]; conveniently, this is an integer operation that can be done in "spare" cycles on the MIPS core while the VFPU is busy with the physics engine.
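The interleaving itself is the standard "binary magic numbers" bit spread. A minimal sketch, assuming 16-bit coordinates (helper names are mine, not the engine's):

```c
#include <stdint.h>

/* Spread the low 16 bits of x so there's a zero bit between each.
 * The masks are the usual "interleave by binary magic numbers" trick. */
static uint32_t spread_bits(uint32_t x)
{
    x &= 0x0000FFFF;
    x = (x | (x << 8)) & 0x00FF00FF;
    x = (x | (x << 4)) & 0x0F0F0F0F;
    x = (x | (x << 2)) & 0x33333333;
    x = (x | (x << 1)) & 0x55555555;
    return x;
}

/* 2D morton code: y bits in the odd positions, x bits in the even ones.
 * e.g. morton2d(3, 5) == 0x27 (x = 0b0011, y = 0b0101 -> 0b00100111). */
static uint32_t morton2d(uint32_t x, uint32_t y)
{
    return (spread_bits(y) << 1) | spread_bits(x);
}
```

Nearby (x, y) pairs mostly share their high-order morton bits, which is what makes the sorted-index range scan described next work at all.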
Then we need to build an index of the game object array, based on "liveness" and morton code. It might look (a bit) like this:
Code: Select all
typedef struct {
unsigned int live;
unsigned int morton_code;
unsigned int object_index;
} object_index_t;
object_index_t object_index[N_GAME_OBJECTS];
Finding near objects, then, involves finding the range of morton codes we're interested in (see [5] again), scanning the index until we hit the first of those codes, and then dealing with each object in turn until we hit the end of the range[6].
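In C, that scan might look like the sketch below. It returns matches into a caller-supplied buffer for simplicity; feeding them to the callback type from earlier instead is a one-line change. All names here are mine, not the engine's:

```c
typedef struct {
    unsigned int live;
    unsigned int morton_code;
    unsigned int object_index;
} object_index_t;

/* Sketch: given an index sorted by morton code, collect every live
 * object whose code falls in [lo, hi]. A binary search for the start
 * of the range would make this O(log n + k); the linear skip keeps
 * the sketch short. Returns the number of matches written to out. */
static unsigned collect_range(const object_index_t *idx, unsigned n,
                              unsigned lo, unsigned hi, unsigned *out)
{
    unsigned i = 0, count = 0;
    while (i < n && idx[i].morton_code < lo)
        i++;                                  /* skip codes below the range */
    for (; i < n && idx[i].morton_code <= hi; i++)
        if (idx[i].live)
            out[count++] = idx[i].object_index; /* record each in-range object */
    return count;
}
```

Per footnote [6], a "near" query is often several disjoint morton ranges, so in practice you'd call this once per consecutive run of codes.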
But we can do better than that. Just finding the near objects is O(log n), so processing one object is O((log n)^2) and processing all objects is O(n (log n)^2). The number of morton codes we are dealing with is fixed, so if that's a reasonable number we can "bucket" the index and make lookup O(1)[7]. In my case (for reasons I'll explain later), the morton code is only 8 bits, meaning I have a bucket index as follows:
Code: Select all
typedef union {
struct {
uint16_t low_index;
uint16_t high_index;
} i;
uint32_t v;
} object_index_index_t;
object_index_index_t object_index_index[256];
object_index_index_t dead_objects;
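Building that bucket table is one pass over the sorted index. A sketch, under the assumption that dead objects have already been sorted out of the live range (function name is mine):

```c
#include <stdint.h>

typedef struct {
    unsigned int live;
    unsigned int morton_code;
    unsigned int object_index;
} object_index_t;

typedef union {
    struct {
        uint16_t low_index;
        uint16_t high_index;
    } i;
    uint32_t v;
} object_index_index_t;

/* Sketch: one pass over an index already sorted by morton code records,
 * for each of the 256 possible codes, the half-open slice
 * [low_index, high_index) of the index that holds it. An empty bucket
 * ends up with low_index == high_index. Lookup then becomes a single
 * table read - the O(1) claimed above. */
static void build_buckets(const object_index_t *idx, unsigned n,
                          object_index_index_t buckets[256])
{
    unsigned i = 0;
    for (unsigned code = 0; code < 256; code++) {
        buckets[code].i.low_index = (uint16_t)i;
        while (i < n && idx[i].morton_code == code)
            i++;                               /* consume this code's run */
        buckets[code].i.high_index = (uint16_t)i;
    }
}
```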
Now, there's a lot of literature on sorting. An awful lot of literature. You could lose a lifetime just reading it all. So I'll cut to the chase as to what I decided.
I wanted a sort algorithm that was stable in the earlier sense - the same best and worst case complexity. I also wanted something I could parallelise using the VFPU if at all possible. Now, one of the most common parallelised sorts (at least, in the GPGPU world) is the Radix Sort. It's already stupidly fast, and a good fit for sorting integers (which, if you've been paying attention, our morton code is). Unfortunately, it's an integer algorithm, and the PSP is rather lacking in integer parallelisation. Whilst looking at ways to [ab]use the VFPU to do this, I came across the http://lucasz.dk mirror of the ps2dev forums, particularly the "VFPU diggins" thread, which, although incomplete and in some cases wrong, is one of the better references to the VFPU out there given that http://wiki.fx-world.org is dead.
In that thread, I noticed the vsrt*.q opcodes, which looked rather familiar. And, let's face it, 'vsrt' sounds a lot like 'vector sort', right? And it hit me. They're sorting network primitives (as it happens, vsrt2.q is one of the wrongly documented opcodes in that thread, which threw me off). In particular, they are opcodes for doing a bitonic sort of 4 values - vsrt1/2 sort ascending, vsrt3/4 sort descending, as follows:
Code: Select all
vsrt1.q R000, R000
vsrt2.q R000, R000
vsrt1.q R000, R000
Code: Select all
vsrt1.q R100, R000
vsrt1.q R101, R001
vsrt1.q R102, R002
vsrt1.q R103, R003
Code: Select all
vmin.q C100, C000, C010
vmax.q C110, C000, C010
vmin.q C120, C020, C030
vmax.q C130, C020, C030
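To make the comparator structure concrete, here's the same kind of 4-value network in scalar C: each cswap below is one compare-exchange, i.e. one vmin/vmax pair on the VFPU (names are mine):

```c
/* Compare-exchange: after the call, *a <= *b.
 * On the VFPU this is a vmin/vmax pair; here, a branch and a swap. */
static void cswap(float *a, float *b)
{
    if (*a > *b) { float t = *a; *a = *b; *b = t; }
}

/* 5-comparator sorting network for 4 inputs, ascending.
 * Five compare-exchanges is optimal for 4 inputs. */
static void sort4(float v[4])
{
    cswap(&v[0], &v[1]);   /* stage 1: sort the two pairs */
    cswap(&v[2], &v[3]);
    cswap(&v[0], &v[2]);   /* stage 2: merge across pairs */
    cswap(&v[1], &v[3]);
    cswap(&v[1], &v[2]);   /* stage 3: fix up the middle  */
}
```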
There's a downside. The vsrt* (and vmin / vmax) opcodes only sort one set of floating point values. They don't update the condition flags, and there's no obvious way to sort something like object_index_t above without using vi2f, vcmp, vmov and vcmov*. All of those have loads of stall cycles, and start hitting performance severely, even though it can all be done in "straight through" code, and the approach restricts us to sorting 64 elements at best.
But I've written (amongst other things) IEEE 754 float implementations, and I know how float sorting works. Which is where the aforementioned 8 bit morton code comes in. A floating point number consists of 3 things.
- A sign bit, s
- An exponent, e
- A mantissa, m
Code: Select all
(-1)^s * (1 + m) * (2^e)
Code: Select all
seeeeeee emmmmmmm mmmmmmmm mmmmmmmm
That gives us a new definition of object_index_t:
Code: Select all
typedef union {
struct {
unsigned int index:22;
unsigned int one:1;
unsigned int morton:8;
unsigned int dead:1;
} unpacked;
float packed;
} object_index_t;
Code: Select all
object_index_t __attribute__ ((aligned (16))) v[4];
v[0].unpacked.dead = 0;
v[0].unpacked.morton = 0;
v[0].unpacked.one = 1;
v[0].unpacked.index = 1;
v[1].unpacked.dead = 0;
v[1].unpacked.morton = 1;
v[1].unpacked.one = 1;
v[1].unpacked.index = 0;
__asm__ volatile (
"lv.q R000, 0(%0)\n"
"vmin.s S020, S000, S010\n"
"vmax.s S030, S000, S010\n"
"sv.q R000, 0(%0)\n"
:: "r" (v) : "memory");
printf("%e %e -> %e %e\n", v[0].packed, v[1].packed, v[2].packed, v[3].packed);
Which leaves the construction of a sorting network for a given size of object pool. This is a hard problem. A very hard problem. Green's 16-input network is, if not optimal, at least unbettered since 1969, and relatively easy to map onto VFPU opcodes.

Here's a vfpu version (note that, due to the use of vmin and vmax, we must use 2 4x4 matrices, meaning we can "only" deal with sorting 4 sets, or 64 values, before hitting main memory).
Code: Select all
// Step 1. 4 cycles
vmin.q C100, C000, C010
vmax.q C110, C000, C010
vmin.q C120, C020, C030
vmax.q C130, C020, C030
// Step 2. 4 cycles
vmin.q C000, C100, C120
vmin.q C010, C110, C130
vmax.q C020, C100, C120
vmax.q C030, C110, C130
// Step 3. 4 cycles
vmin.q R100, R000, R001
vmax.q R101, R000, R001
vmin.q R102, R002, R003
vmax.q R103, R002, R003
// Step 4. 4 cycles
vmin.q R000, R100, R102
vmin.q R001, R101, R103
vmax.q R002, R100, R102
vmax.q R003, R101, R103
// S000 and S033 are now stable, we're down to 14 cycles per stage
// Step 5. 14 cycles. Probably further optimisable
vmin.s S111, S011, S022
vmax.s S122, S011, S022
vmin.s S121, S021, S012
vmax.s S112, S021, S012
vmin.s S110, S010, S020
vmax.s S120, S010, S020
vmin.s S130, S030, S003
vmax.s S103, S030, S003
vmin.s S101, S001, S002
vmax.s S102, S001, S002
vmin.s S131, S031, S032
vmax.s S132, S031, S032
vmin.s S113, S013, S023
vmax.s S123, S013, S023
// Step 6. 8 cycles
vmin.s S020, S120, S102
vmax.s S002, S120, S102
vmin.s S032, S132, S123
vmax.s S023, S132, S123
vmin.s S010, S110, S101
vmax.s S001, S110, S101
vmin.s S031, S131, S113
vmax.s S013, S131, S113
// Copies. 6 cycles
vmin.s S030, S130, S130
vmin.s S021, S121, S121
vmin.s S031, S131, S131
vmin.s S022, S122, S122
vmin.s S032, S132, S132
vmin.s S003, S003, S003
// S010 and S023 are now stable, 12 cycles per stage
// Step 7. 12 cycles
vmin.s S120, S020, S001
vmax.s S101, S020, S001
vmin.s S111, S011, S021
vmax.s S121, S011, S021
vmin.s S112, S012, S022
vmax.s S122, S012, S022
vmin.s S132, S032, S013
vmax.s S113, S032, S013
vmin.s S131, S031, S003
vmax.s S103, S031, S003
vmin.s S130, S030, S002
vmax.s S102, S030, S002
// S120 and S113 are stable, 10 cycles per stage
// Step 8. 8 cycles
vmin.s S030, S130, S111
vmax.s S011, S130, S111
vmin.s S031, S131, S112
vmax.s S012, S131, S112
vmin.s S021, S121, S102
vmax.s S002, S121, S102
vmin.s S022, S122, S103
vmax.s S003, S122, S103
// Copies. 2 cycles
vmin.s S001, S101, S101
vmin.s S032, S132, S132
// Step 9. 10 cycles
vmin.s S130, S030, S001
vmax.s S101, S030, S001
vmin.s S111, S011, S021
vmax.s S121, S011, S021
vmin.s S131, S031, S002
vmax.s S102, S031, S002
vmin.s S112, S012, S022
vmax.s S122, S012, S022
vmin.s S132, S032, S003
vmax.s S103, S032, S003
// S130 -> S111 and S122 -> S103 are stable
// Step 10. 4 cycles
vmin.s S021, S121, S131
vmax.s S131, S121, S131 // Store in S1xx, as final
vmin.s S002, S102, S112
vmax.s S112, S102, S112 // Store in S1xx, as final
// copy final results. 4 cycles
vmin.s S121, S021, S021
vmin.s S102, S002, S002
vmin.p R100, R000, R000
vmin.p R123, R023, R023
You'll note that, past step 4, pretty much everything is performed on single registers, and there's a certain amount of copying required as well. Ideally, if we could coalesce all the single-register ops down to 3 or 4 register ops we would be seeing something like 40 cycles. The problem is the non-parallelisation of the operations themselves; Batcher's odd-even network may well be significantly faster, even if it has more comparisons on paper. Additional benefits are that 32 and 64 input networks are already defined, and the algorithm can be extended to any power of 2; see http://sparkydots.blogspot.fr/2015/05/b ... twork.html
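For reference, Batcher's odd-even merge sort can be generated recursively for any power-of-two input size. A scalar C sketch, with each cmpxchg standing in for one vmin/vmax pair (function names are mine, not from any particular implementation):

```c
/* Compare-exchange: one comparator of the network.
 * On the VFPU this would be a vmin/vmax pair. */
static void cmpxchg(float *v, int i, int j)
{
    if (v[i] > v[j]) { float t = v[i]; v[i] = v[j]; v[j] = t; }
}

/* Batcher's odd-even merge: merges two sorted halves of v[lo..lo+n),
 * comparing elements r apart. Called with r == 1 to start. */
static void odd_even_merge(float *v, int lo, int n, int r)
{
    int m = r * 2;
    if (m < n) {
        odd_even_merge(v, lo,     n, m);  /* even subsequence */
        odd_even_merge(v, lo + r, n, m);  /* odd subsequence  */
        for (int i = lo + r; i + r < lo + n; i += m)
            cmpxchg(v, i, i + r);
    } else {
        cmpxchg(v, lo, lo + r);
    }
}

/* Recursively sort both halves, then merge. n must be a power of 2. */
static void odd_even_merge_sort(float *v, int lo, int n)
{
    if (n > 1) {
        int m = n / 2;
        odd_even_merge_sort(v, lo,     m);
        odd_even_merge_sort(v, lo + m, m);
        odd_even_merge(v, lo, n, 1);
    }
}
```

Because the comparator positions depend only on n, not the data, you can unroll the recursion at build time into a fixed sequence of compare-exchanges, which is exactly the form that maps onto the VFPU listings above.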
Here's a picture of the 16 input network :

Converting the picture into code is a more or less mechanical task, left as an exercise for the reader. My implementation is 60 cycles total for a 16 element sort. How much do we like that? Quite a lot, in fact, especially as it scales trivially upwards to 64 and, with some memory shuffling, to 1024 elements and beyond. My 1024 element sort is around 8,000 cycles, which gives us plenty of room to play and is competitive with sorting 16 elements on the CPU.
I hope this is of help to some people in showing how you can get very respectable performance out of the PSP's relatively anaemic hardware.
Next post : VFPU accelerated time-adjusted verlet schemes, and interleaving MIPS with VFPU code for fun and profit.
See you next time
Simon
[1] Unless you can detect and artificially bound worst case behaviour, of course.
[2] And it probably should if you're counting on writing more than one game, unless you particularly like delving into the rabbit hole that is writing game engines.
[3] Octrees for 3 dimensions, but the behaviour is the same
[4] Here's a hint. You don't.
[5] See relevant literature
[6] The range is often disjoint, so multiple scan / process steps may be required
[7] Actually O(m) where m is the disjointness of the block we're looking for, but we can treat all ranges as purely consecutive and this becomes O(1) at the cost of potentially processing more non-hits.