
Performance, VFPU, Sorting example (long and technical)

tufty
Posts: 12
Joined: Sun Jul 28, 2013 5:50 pm

Performance, VFPU, Sorting example (long and technical)

Post by tufty » Thu Jul 09, 2015 3:05 pm

So, it's been a while since I did any PSP stuff (beyond getting that last dang weapon in Peace Walker, etc), but I recently found myself at work, with my laptop and nothing (paid) to do. So why not get back to programming the game I was working on before?

Now, what I'm working on is a retro-ish top-down 2d game, kinda like a cross between Berzerk and Galaxian, but somewhat modernised, with the kind of enemy overload you see in Geometry Wars. I'm a big fan of swarming, so there's gonna be some of that.

One of the key things I'm aiming for here is a solid frame rate with no slowdowns. The key to doing something like this, especially on a "low performance" platform like the PSP, is keeping the underlying algorithms not only as fast as possible, but, above all, stable in terms of execution time. There's no point[1] having (for example) a swarming algorithm that's blazing fast most of the time, but that suddenly takes exponential time under a specific set of circumstances.

Effectively, these are underlying engine considerations. If your engine then allows "user scripting code" behaviours[2], that might get blown away (although, again, see [1]), but that's not something we particularly need to worry about here.

So. What algorithms do we need, and what can we do about them?
  • Collision Detection
  • Swarming
  • "Physics"
  • Sorting
Now, Collision Detection and Swarming (and thus "Physics", which subsumes them) depend on finding the game objects near a given object; a naive all-pairs implementation is O(n^2), and even a good one is around O(n log^2 n). There isn't much we can do about that complexity apart from making the lookup as fast as possible and combining the two. Indeed, by moving the swarming out of the engine proper and making it a user-code scripting callback, we can start to generalise the "shape" of user code callbacks to something like this:

Code: Select all

typedef void (*callback_fn_t)(void *object, void **near_objects, ...);
Now, there are lots of ways of doing "near object" finding; quadtrees[3] are a common, high-performance option. There's a lot of literature on this, and how-tos with example code all over the place. What's often missed, though, is that actually building a tree is a massive waste of memory and cycles. In fact, most of the how-tos are utterly useless unless you have an actual need[4] for the tree itself. In 99.9% of cases, it's far more efficient to assign the objects a code for the order in which they appear in a tree traversal, and build an index based on that. For relevant literature, see Z Order Curves and, particularly, Morton Order or Morton Code.

How to use this?

Let's assume we keep our objects in a big, pre-allocated, array. That's not a bad way of doing things, it keeps our memory usage bounded (at the cost of limiting the number of objects in the game), and means that, instead of passing around pointers to objects, we can pass around indexes into that array. Something like this:

Code: Select all

#define N_GAME_OBJECTS 1024
typedef struct { ... } game_object_t;
extern game_object_t * game_object_array;
... and then either statically or dynamically allocate the array.

Somewhere in game_object_t we have a morton code, which is an integer, which we update for each object as part of the "physics" step. This is done by interleaving the bits of its integer x and y (and z, in 3 dimensions) coordinates[5]; for optimisation purposes this is an integer operation that can be done in "spare" cycles on the MIPS core if we're using the VFPU for the physics engine.
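The interleaving itself is cheap bit-twiddling. As an illustration (the function name is mine, and I'm assuming 4-bit x and y coordinates, which gives the 8-bit code I use later), a sketch:

```c
#include <stdint.h>

/* Interleave the low 4 bits of x and y into an 8-bit Morton code,
 * x in the even bit positions and y in the odd ones.  Pure integer
 * work, so it can run on the MIPS core while the VFPU is busy. */
static uint32_t morton8(uint32_t x, uint32_t y)
{
    /* Spread the 4 low bits of each coordinate to every other bit:
     * 0000abcd -> 0a0b0c0d */
    x = (x | (x << 2)) & 0x33;  /* 00ab00cd */
    x = (x | (x << 1)) & 0x55;  /* 0a0b0c0d */
    y = (y | (y << 2)) & 0x33;
    y = (y | (y << 1)) & 0x55;
    return x | (y << 1);
}
```

Wider coordinates work the same way, with wider masks and one more spread step per doubling.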

Then we need to build an index of the game object array, based on "liveness" and morton code. It might look (a bit) like this:

Code: Select all

typedef struct {
  unsigned int live;
  unsigned int morton_code;
  unsigned int object_index;
} object_index_t;

object_index_t object_index[N_GAME_OBJECTS];
The important thing is that we have an index for *each* object, regardless of whether it is live or not; assuming we sort on "liveness" as the major index, we can always find a slot for a new object by looking at the end of the array.

Finding near objects, then, involves finding the range of morton codes we're interested in (see [5] again), scanning the index until we hit the first of those codes, and then dealing with each object in turn until we hit the end of the range[6].
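As a sketch of that scan (names and the collect-into-an-array shape are my own; a real version would hand each hit to a callback, and would binary-search to the start of the range rather than walking to it):

```c
#include <stddef.h>

typedef struct {
  unsigned int live;
  unsigned int morton_code;
  unsigned int object_index;
} object_index_t;

/* Collect the object indices of every live entry whose morton code
 * lies in [lo, hi].  Assumes the index is sorted live-first, then by
 * morton code.  Returns the number of hits written to out[]. */
static size_t collect_in_range(const object_index_t *idx, size_t n,
                               unsigned int lo, unsigned int hi,
                               unsigned int *out)
{
    size_t i = 0, hits = 0;
    while (i < n && idx[i].live && idx[i].morton_code < lo)
        i++;                                   /* skip codes below the range */
    for (; i < n && idx[i].live && idx[i].morton_code <= hi; i++)
        out[hits++] = idx[i].object_index;     /* deal with each hit in turn */
    return hits;
}
```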

But we can do better than that. Just finding the near objects is O(log n), so processing one object is O(log^2 n) and processing all objects is O(n log^2 n). The number of morton codes we are dealing with is fixed, so if that's a reasonable number we can "bucket" the index and make lookup O(1)[7]. In my case (for reasons I'll explain later), the morton code is only 8 bits, meaning I have a bucket index as follows:

Code: Select all

typedef union {
  struct {
    uint16_t low_index;
    uint16_t high_index;
  } i;
  uint32_t v;
} object_index_index_t;

object_index_index_t object_index_index[256];
object_index_index_t dead_objects;
...all of which is relatively easy to understand, and more or less trivial to code. And, all of a sudden, everything starts to be bounded by the speed of creating the index, which involves a sort of the object array. That's the 4th algorithm in the list, and the one I have been getting to in the stuff above, in a round-about way.
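For illustration, a one-pass build of that bucket index over an already-sorted code array might look like this (the function name and the empty-bucket convention are my own):

```c
#include <stddef.h>
#include <stdint.h>

typedef union {
  struct {
    uint16_t low_index;
    uint16_t high_index;
  } i;
  uint32_t v;
} object_index_index_t;

/* One pass over the sorted (live-first, then morton) index, where
 * codes[i] is the 8-bit morton code of the i-th live index entry.
 * Empty buckets end up with low > high, which a lookup treats as
 * "no hits". */
static void build_buckets(const uint8_t *codes, size_t n_live,
                          object_index_index_t buckets[256])
{
    for (int c = 0; c < 256; c++) {
        buckets[c].i.low_index  = 1;   /* low > high marks an empty bucket */
        buckets[c].i.high_index = 0;
    }
    for (size_t i = 0; i < n_live; i++) {
        uint8_t c = codes[i];
        if (buckets[c].i.low_index > buckets[c].i.high_index)
            buckets[c].i.low_index = (uint16_t)i;  /* first entry for code c */
        buckets[c].i.high_index = (uint16_t)i;     /* last entry so far */
    }
}
```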

Now, there's a lot of literature on sorting. An awful lot of literature. You could lose a lifetime just reading it all. So I'll cut to the chase as to what I decided.

I wanted a sort algorithm that was stable - that had the same best and worst case complexity. I also wanted something I could parallelise using the VFPU if at all possible. Now, one of the most common parallelised sorts (at least in the GPGPU world) is the Radix Sort. It's already stupidly fast, and a good fit for sorting integers (which, if you've been paying attention, our morton code is). Unfortunately, it's an integer algorithm, and the PSP is rather lacking in integer parallelisation. Whilst looking at ways to [ab]use the VFPU to do this, I came across the http://lucasz.dk mirror of the ps2dev forums, particularly the "VFPU diggins" thread, which, although incomplete and in some cases wrong, is one of the better references to the VFPU out there given that http://wiki.fx-world.org is dead.

In that thread, I noticed the vsrt*.q opcodes, which looked rather familiar. And, let's face it, 'vsrt' sounds a lot like 'vector sort', right? And it hit me. They're sorting network primitives (as it happens, vsrt2.q is one of the wrongly documented opcodes in that thread, which threw me off). In particular, they are opcodes for doing a bitonic sort of 4 values - vsrt1/2 sort ascending, vsrt3/4 sort descending, as follows:

Code: Select all

  vsrt1.q R000, R000
  vsrt2.q R000, R000
  vsrt1.q R000, R000
... will sort R000 into ascending order, in 3 cycles. No wait states, no stalls, just a blindingly fast sort. For comparison, a naive version written in C and compiled with -O3 took around 300 cycles in the best case. Even allowing for the overhead of loading values onto and off the VFPU (6 more cycles, and another 3 of pipeline stall), that's still orders of magnitude faster. So I started thinking about how to make a larger version - after all, we've got 128 floating point values in the VFPU, if we can sort 128 at a time, that's a helluva lot of bang for your buck. As a hint for creating sorting networks, the following 2 blocks of code are equivalent in execution time and function, and the second construction can compensate for the "missing" vsrt* combinations (and we won't be using them again after this):

Code: Select all

  vsrt1.q R100, R000
  vsrt1.q R101, R001
  vsrt1.q R102, R002
  vsrt1.q R103, R003

Code: Select all

  vmin.q C100, C000, C010
  vmax.q C110, C000, C010
  vmin.q C120, C020, C030
  vmax.q C130, C020, C030
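For reference, each vmin/vmax pair above is a compare-exchange, the primitive that sorting networks are built from. Here's what the whole 4-value job looks like as a plain C sorting network (a sketch of the result, not of the exact per-opcode vsrt semantics; names are mine):

```c
/* A scalar compare-exchange: after the call, *a <= *b. */
static void cmpxchg(float *a, float *b)
{
    if (*a > *b) { float t = *a; *a = *b; *b = t; }
}

/* The classic 5-comparator sorting network for 4 values - the same
 * job the vsrt1/vsrt2/vsrt1 sequence does in 3 VFPU cycles. */
static void sort4(float v[4])
{
    cmpxchg(&v[0], &v[1]);
    cmpxchg(&v[2], &v[3]);
    cmpxchg(&v[0], &v[2]);
    cmpxchg(&v[1], &v[3]);
    cmpxchg(&v[1], &v[2]);
}
```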
But.

There's a downside. The vsrt* (and vmin / vmax) opcodes only sort one set of floating point values. They don't update the condition flags, and there's no obvious way to sort something like object_index_t above without using vi2f, vcmp, vmov and vcmov*. All of those have loads of stall cycles, and start hitting performance severely, even though it can all be done in "straight through" code, and the approach restricts us to sorting 64 elements at best.

But I've written (amongst other things) IEEE 754 float implementations, and I know how float sorting works. Which is where the aforementioned 8 bit morton code comes in. A floating point number consists of 3 things.
  • A sign bit, s
  • An exponent, e
  • A mantissa, m
It also has an implied base b, in the case of 32 bit floats that is 2. So a 32 bit float is, effectively,

Code: Select all

(-1 ^ s) * (1 + m) * (2 ^ e)
Now, those 3 elements could usefully be considered as the 3 elements of object_index_t; as long as we can fit our morton code into the exponent and keep the mantissa within a certain range, sorting should work. And rather than taking high tens of cycles per four elements, we're back to single digits again. So, what does a 32 bit float look like? The bits are encoded as follows:

Code: Select all

seeeeeee emmmmmmm mmmmmmmm mmmmmmmm
The mantissa m is stored in 23 bits (24 effective), with the highest bit being implied as 1 for normalised numbers. There is, however, the caveat that normalised numbers cannot have an exponent of zero. As we want to use the exponent to hold our morton code, and we want to use morton code 0, we have to use subnormal numbers; in order for these to compare properly we need to set the high bit of the stored mantissa to 1. So normalised numbers will have an effective mantissa of 0b11xxxxxxxxxxxxxxxxxxxxxx and subnormals 0b01xxxxxxxxxxxxxxxxxxxxxx, where the high bit is the implied one.

That gives us a new definition of object_index_t

Code: Select all

typedef union {
  struct {
    unsigned int index:22;
    unsigned int one:1;
    unsigned int morton:8;
    unsigned int dead:1;
  } unpacked;
  float packed;
} object_index_t;
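On a little-endian host with gcc-style bitfield layout (which matches the PSP toolchain), you can sanity-check that this packing really does order by (dead, morton, index) under plain float comparison. A sketch, with a hypothetical helper of my own:

```c
/* Same union as above, restated so this compiles standalone. */
typedef union {
  struct {
    unsigned int index:22;
    unsigned int one:1;
    unsigned int morton:8;
    unsigned int dead:1;
  } unpacked;
  float packed;
} object_index_t;

/* Hypothetical helper: build the packed float for an object.  With a
 * little-endian bitfield layout, `index` lands in the low mantissa
 * bits, `morton` in the exponent field and `dead` in the sign bit. */
static float pack_index(unsigned dead, unsigned morton, unsigned index)
{
    object_index_t o;
    o.unpacked.index  = index;
    o.unpacked.one    = 1;  /* keeps morton-0 subnormals ordered correctly */
    o.unpacked.morton = morton;
    o.unpacked.dead   = dead;
    return o.packed;
}
```

Comparing the packed floats then gives exactly the ordering we want from the sort.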
The question, of course, is "given that subnormals are IEEE 754-2008, does the VFPU work with that?". Or, what is the result of this code?

Code: Select all

  object_index_t __attribute__ ((aligned (16))) v[4];
  
  v[0].unpacked.dead = 0;
  v[0].unpacked.morton = 0;
  v[0].unpacked.one = 1;
  v[0].unpacked.index = 1;
  v[1].unpacked.dead = 0;
  v[1].unpacked.morton = 1;
  v[1].unpacked.one = 1;
  v[1].unpacked.index = 0;

  __asm__ volatile (
    "lv.q R000, 0(%0)\n"
    "vmin.s S020, S000, S010\n"
    "vmax.s S030, S000, S010\n"
    "sv.q R000, 0(%0)\n"
    :: "r" (v) : "memory");
  
  printf("%e %e -> %e %e\n", v[0].packed, v[1].packed, v[2].packed, v[3].packed);

The answer, happily enough, is that the VFPU sorts these values "correctly". I'm not certain the VFPU's behaviour with other operations on subnormals is as would be expected, but frankly it doesn't matter. The underlying "float-nature" of the values is irrelevant; all we want is to be able to sort.

Which leaves the construction of a sorting network for a given size of object pool. This is a hard problem. A very hard problem. Green's 16-input network is, if not optimal, not bettered since 1969, and relatively easy to map onto VFPU opcodes.

[Image: Green's 16-input sorting network]

Here's a VFPU version (note that, due to the use of vmin and vmax, we must use 2 4x4 matrices, meaning we can "only" deal with sorting 4 sets, or 64 values, before hitting main memory).

Code: Select all

  // Step 1.  4 cycles
  vmin.q C100, C000, C010
  vmax.q C110, C000, C010
  vmin.q C120, C020, C030
  vmax.q C130, C020, C030
  // Step 2. 4 cycles
  vmin.q C000, C100, C120
  vmin.q C010, C110, C130
  vmax.q C020, C100, C120
  vmax.q C030, C110, C130
  // Step 3.  4 cycles
  vmin.q R100, R000, R001
  vmax.q R101, R000, R001
  vmin.q R102, R002, R003
  vmax.q R103, R002, R003
  // Step 4.  4 cycles
  vmin.q R000, R100, R102
  vmin.q R001, R101, R103
  vmax.q R002, R100, R102
  vmax.q R003, R101, R103
  // S000 and S033 are now stable, we're down to 14 cycles per stage
  // Step 5.  14 cycles.  Probably further optimisable
  vmin.s S111, S011, S022
  vmax.s S122, S011, S022
  vmin.s S121, S021, S012
  vmax.s S112, S021, S012
  vmin.s S110, S010, S020
  vmax.s S120, S010, S020
  vmin.s S130, S030, S003
  vmax.s S103, S030, S003
  vmin.s S101, S001, S002
  vmax.s S102, S001, S002
  vmin.s S131, S031, S032
  vmax.s S132, S031, S032
  vmin.s S113, S013, S023
  vmax.s S123, S013, S023
  // Step 6. 8 cycles
  vmin.s S020, S120, S102
  vmax.s S002, S120, S102
  vmin.s S032, S132, S123
  vmax.s S023, S132, S123
  vmin.s S010, S110, S101
  vmax.s S001, S110, S101
  vmin.s S031, S131, S113
  vmax.s S013, S131, S113
  // Copies.  6 cycles
  vmin.s S030, S130, S130
  vmin.s S021, S121, S121
  vmin.s S031, S131, S131
  vmin.s S022, S122, S122
  vmin.s S032, S132, S132
  vmin.s S003, S003, S003
  // S010 and S023 are now stable, 12 cycles per stage
  // Step 7.  12 cycles
  vmin.s S120, S020, S001
  vmax.s S101, S020, S001
  vmin.s S111, S011, S021
  vmax.s S121, S011, S021
  vmin.s S112, S012, S022
  vmax.s S122, S012, S022
  vmin.s S132, S032, S013
  vmax.s S113, S032, S013
  vmin.s S131, S031, S003
  vmax.s S103, S031, S003
  vmin.s S130, S030, S002
  vmax.s S102, S030, S002
  // S120 and S113 are stable, 10 cycles per stage
  // Step 8. 8 cycles
  vmin.s S030, S130, S111
  vmax.s S011, S130, S111
  vmin.s S031, S131, S112
  vmax.s S012, S131, S112
  vmin.s S021, S121, S102
  vmax.s S002, S121, S102
  vmin.s S022, S122, S103
  vmax.s S003, S122, S103
  // Copies. 2 cycles
  vmin.s S001, S101, S101
  vmin.s S032, S132, S132
  // Step 9. 10 cycles
  vmin.s S130, S030, S001
  vmax.s S101, S030, S001
  vmin.s S111, S011, S021
  vmax.s S121, S011, S021
  vmin.s S131, S031, S002
  vmax.s S102, S031, S002
  vmin.s S112, S012, S022
  vmax.s S122, S012, S022
  vmin.s S132, S032, S003
  vmax.s S103, S032, S003
  // S130 -> S111 and S122 -> S103 are stable
  // Step 10.  4 cycles
  vmin.s S021, S121, S131
  vmax.s S131, S121, S131 // Store in S1xx, as final
  vmin.s S002, S102, S112
  vmax.s S112, S102, S112 // Store in S1xx, as final
  // copy final results.  4 cycles
  vmin.s S121, S021, S021
  vmin.s S102, S002, S002
  vmin.p R100, R000, R000
  vmin.p R123, R023, R023
Final cycle count : 4 + 4 + 4 + 4 + 14 + 14 + 12 + 10 + 10 + 4 + 4 = 84 cycles. I couldn't get a C version down to under 5000 cycles on fully sorted input.

You'll note that, past step 4, pretty much everything is performed on single registers, and there's a certain amount of copying required as well. Ideally, if we could coalesce all the single-register ops down to 3 or 4 register ops we would be seeing something like 40 cycles. The problem is the non-parallelisation of the operations themselves; Batcher's odd-even network may well be significantly faster even if it has more comparisons on paper. Additional benefits are that 32 and 64 input networks are already defined, and the algorithm can be extended to any power of 2; see http://sparkydots.blogspot.fr/2015/05/b ... twork.html
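For reference, Batcher's odd-even mergesort fits in a few lines of C. The recursive formulation below (a sketch for power-of-two sizes; names are mine) also enumerates exactly the compare-exchanges of the network, so it doubles as a generator when mapping the network onto VFPU opcodes:

```c
/* Scalar compare-exchange: after the call, *a <= *b. */
static void cmpxchg(float *a, float *b)
{
    if (*a > *b) { float t = *a; *a = *b; *b = t; }
}

/* Batcher's odd-even merge over v[lo .. lo+n), where r is the stride
 * between compared elements.  Merges two already-sorted halves. */
static void oe_merge(float *v, int lo, int n, int r)
{
    int m = r * 2;
    if (m < n) {
        oe_merge(v, lo, n, m);        /* even subsequence */
        oe_merge(v, lo + r, n, m);    /* odd subsequence */
        for (int i = lo + r; i + r < lo + n; i += m)
            cmpxchg(&v[i], &v[i + r]);
    } else {
        cmpxchg(&v[lo], &v[lo + r]);
    }
}

/* Sort v[lo .. lo+n), n a power of two. */
static void oe_mergesort(float *v, int lo, int n)
{
    if (n > 1) {
        int m = n / 2;
        oe_mergesort(v, lo, m);
        oe_mergesort(v, lo + m, m);
        oe_merge(v, lo, n, 1);
    }
}
```

For n = 4 this generates the same 5-comparator network as before; larger sizes follow mechanically.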

Here's a picture of the 16 input network :

[Image: Batcher's 16-input odd-even merge sorting network]

Converting the picture into code is a more or less mechanical task that's left as an exercise for the reader. My implementation is 60 cycles total for a 16 element sort. How much do we like that? Quite a lot, in fact, especially as it scales upwards to 64 and, with some memory shuffling, to 1024 elements and beyond. Trivially. My 1024 element sort is around 8,000 cycles, which gives us plenty of room to play and is competitive with sorting 16 elements on the CPU.

I hope this is of help to some people in showing how you can get very respectable performance out of the PSP's relatively anaemic hardware.

Next post : VFPU accelerated time-adjusted verlet schemes, and interleaving MIPS with VFPU code for fun and profit.

See you next time

Simon

[1] Unless you can detect and artificially bound worst case behaviour, of course.
[2] And it probably should if you're counting on writing more than one game, unless you particularly like delving into the rabbit hole that is writing game engines
[3] Octrees for 3 dimensions, but the behaviour is the same
[4] Here's a hint. You don't.
[5] See relevant literature
[6] The range is often disjoint, so multiple scan / process steps may be required
[7] Actually O(m) where m is the disjointness of the block we're looking for, but we can treat all ranges as purely consecutive and this becomes O(1) at the cost of potentially processing more non-hits.
Last edited by tufty on Thu Jul 09, 2015 5:46 pm, edited 1 time in total.

noname120
Developer
Posts: 776
Joined: Thu Oct 07, 2010 4:29 pm

Re: Performance, VFPU, Sorting example (long and technical)

Post by noname120 » Thu Jul 09, 2015 5:27 pm

I am genuinely stunned by the quality of this publication.
It was informative, accurate, and interesting all at the same time.

Thank you for this, I am definitely looking to read more posts from you.
Funny stuff
<yifanlu> I enjoy being loud and obnoxious
<yifanlu> rooting an android is like getting a hooker pregnant
<xerpi> I sometimes think I should leave all this stressing **** and be a farmer instead

haxxey
Big Beholder
Posts: 567
Joined: Sat Jul 21, 2012 10:52 am
Location: Lurking /talk

Re: Performance, VFPU, Sorting example (long and technical)

Post by haxxey » Thu Jul 09, 2015 5:46 pm

This is a great write-up. Thanks for this!
We are such stuff as dreams are made on, and our little life is rounded with a sleep.

Omega2058
Developer
Posts: 246
Joined: Tue Sep 28, 2010 4:27 am

Re: Performance, VFPU, Sorting example (long and technical)

Post by Omega2058 » Thu Jul 09, 2015 6:49 pm

This was fun to read and the formatting was well done. Thanks for putting time into this.

tufty
Posts: 12
Joined: Sun Jul 28, 2013 5:50 pm

Re: Performance, VFPU, Sorting example (long and technical)

Post by tufty » Mon Jul 27, 2015 6:54 am

So, I realised there was a bug in my code above, particularly with regards to stuffing morton codes and object indices into a floating point value. Although the PSP handles subnormal values nicely, NaN (not a number) values cause all sorts of heck to break loose. And there's a lot of NaNs in floating point representation. Remember, 32 bit floating point numbers look like this in memory:
[Image: IEEE 754 single-precision float bit layout (1 sign, 8 exponent, 23 mantissa bits)]

Now, any number with an exponent of 0xff is considered to be not a number. With a fractional (mantissa) part of zero, we have ±infinity, but all the other values are NaN. All 16777214 of them. The fact there are an absurd number of actual "not a number" values in IEEE754 representation is, frankly, ridiculous, but that's not our real problem. Our real problem is that we lose a possible morton code. And given that an 8 bit morton code is already pretty small, we can't really afford to take it down to 6 bits.

So the question becomes, can we avoid creating NaN values and expand the range of morton codes?

The first bit is easy enough to do. We take one or more bits of the exponent and make sure they are always zero. No exponent of 0xFF, no NaN. "Simples", as I am led to believe the youth of today might say.

The second bit requires some thought. A trivial fix would be to set the top bit of the exponent to zero, and shuffle the morton code "down one". We were previously setting the top bit of the mantissa to 1 in order to "renormalise" denormal numbers, but instead we shift the bottom bit of the morton code into that position. So we now have morton codes 0x00 and 0x01 being denormal, and 0x02 upwards being normal. Assuming an index of 0 for all numbers, we get:

Code: Select all

Morton Exponent Mantissa Value
  0x00     0x00 0x000000 ±0
  0x01     0x00 0x400000 ±(0.5)*(2^-126) = ±2^-127
  0x02     0x01 0x800000 ±(1.0)*(2^-126) = ±2^-126
The 0x800000 in the 0x02 case reflects the fact that it's a normalised float, and thus has an implied 1 in bit 24 of its mantissa. So our trivial fix does, in fact, work: encoded this way and then sorted as floating point values, 0x00 < 0x01 < 0x02 and so on. Following from this, we can use larger morton codes, overflowing the low order bits into the mantissa, which implies a greater depth to our "quadtree", and thus better discrimination for object-in-area queries.

Here's the data structure I'm currently using, which allows for 16 bit morton codes (a very healthy 256 by 256 grid) and up to 16,384 individual objects. I'm not using that many objects :)

Code: Select all

typedef union {
  float float_rep;
  struct {
    unsigned mantissa:23;
    unsigned exponent:8;
    unsigned sign:1;
  } underlying_float_rep;
  struct {
    unsigned object_index:14;
    unsigned morton_code:16;
    unsigned nan_killer:1;
    unsigned dead:1;
  } encoded;
} object_index_t;
You'll notice that, as I'm using the sign bit to mean "dead", the sorted in-memory array for n object slots with m objects allocated will have its first n-m entries "dead". Object allocation means going to index n-m in the array, clearing its top bit and setting its morton code, then setting up the actual object in memory by "dereferencing" the object index.
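As a sketch of that encoding (the packer name is mine, and as before I'm assuming a little-endian bitfield layout): keeping nan_killer at zero caps the exponent field at 0x7f, so no encoding can ever be a NaN or an infinity, whatever the morton code and index are.

```c
typedef union {
  float float_rep;
  struct {
    unsigned object_index:14;
    unsigned morton_code:16;
    unsigned nan_killer:1;
    unsigned dead:1;
  } encoded;
} obj_key_t;

/* Hypothetical packer for the 16-bit-morton layout.  nan_killer is
 * the top bit of the float's exponent field; forcing it to zero means
 * the exponent can never reach 0xff, killing NaNs and infinities. */
static float pack_key(unsigned dead, unsigned morton, unsigned index)
{
    obj_key_t k;
    k.encoded.object_index = index & 0x3FFF;
    k.encoded.morton_code  = morton & 0xFFFF;
    k.encoded.nan_killer   = 0;
    k.encoded.dead         = dead & 1;
    return k.float_rep;
}
```

Sorting the packed floats then gives dead entries first (negative), and live entries by morton code, then by index.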

