Baritone AI pathfinder

Possible optimization for dynamic-connectivity

btrekkie opened this issue · 8 comments

commented

Describe your suggestion

Apologies, I wasn't sure where to post this. According to btrekkie/dynamic-connectivity#4 (comment) , @leijurv is considering incorporating a fork of https://github.com/btrekkie/dynamic-connectivity into Baritone. Based on a profiler screenshot in his comment, I thought of a potential optimization for this library.

I realized that Augmentation.combine is called a lot, and in typical usage, it creates a new object each time it is called. I came up with this to reduce the number of object allocations: btrekkie/dynamic-connectivity@master...augmentation-pool . (Maybe start by looking at btrekkie/dynamic-connectivity@1586a7c#diff-06e896ad5b668fbe72375bf3a901e696429004e18c61104559fdde04d26360a7 and btrekkie/dynamic-connectivity@1586a7c#diff-1ff29087d4530c92b31fe61df3b855074ac0e88f4f54f6cd88bb441bc109e92fR163-R166 .) It would require switching from using Augmentation to using MutatingAugmentation. I wouldn't want to merge it into https://github.com/btrekkie/dynamic-connectivity without measuring its real-world performance, but I thought you might want to give it a try.
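To illustrate the idea, here is a hedged sketch of the difference between an allocating combine and a mutating one. The interface shapes below are my guesses at what `MutatingAugmentation` looks like, not the library's actual signatures; the point is the allocation pattern.

```java
// Sketch only: these interface shapes are assumptions, not the real
// dynamic-connectivity API. The point is the allocation pattern.

// Allocating style: every combine() call creates a fresh result object.
interface Augmentation {
    Object combine(Object a, Object b);
}

// Mutating style: combine() writes into a caller-supplied scratch object,
// so one instance can be reused across millions of calls.
interface MutatingAugmentation {
    void combine(Object a, Object b, Object result);
}

// A toy augmentation value: the number of walkable vertices in a subtree.
final class Counts {
    int walkable;
}

final class CountAugmentation implements MutatingAugmentation {
    @Override
    public void combine(Object a, Object b, Object result) {
        // no allocation: the result object is reused by the caller
        ((Counts) result).walkable = ((Counts) a).walkable + ((Counts) b).walkable;
    }

    public static void main(String[] args) {
        Counts a = new Counts(), b = new Counts(), scratch = new Counts();
        a.walkable = 2;
        b.walkable = 3;
        new CountAugmentation().combine(a, b, scratch);
        System.out.println(scratch.walkable); // prints 5
    }
}
```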

Also, you might try overriding Attachment.equals, as this could reduce the number of calls to augment(). (See https://github.com/btrekkie/dynamic-connectivity/blob/e7f54ab81da5cfd62908f870ccbfb68baec42b30/src/main/java/com/github/leijurv/NavigableSurface.java#L25 . )
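The idea, sketched with a hypothetical value class (the real `Attachment` in NavigableSurface.java differs): if the recomputed attachment equals the old one, the tree can stop re-augmenting ancestors, and a value-based `equals` makes that check meaningful instead of an identity comparison.

```java
// Hypothetical value class; the real Attachment in NavigableSurface.java
// differs. The point is the value-based equals/hashCode pair.
final class Attachment {
    final int walkableCount;

    Attachment(int walkableCount) {
        this.walkableCount = walkableCount;
    }

    @Override
    public boolean equals(Object o) {
        // lets the red-black tree detect "nothing changed" and stop early
        return o instanceof Attachment && ((Attachment) o).walkableCount == walkableCount;
    }

    @Override
    public int hashCode() {
        return walkableCount;
    }

    public static void main(String[] args) {
        // equal by value, even though they are distinct objects
        System.out.println(new Attachment(3).equals(new Attachment(3))); // prints true
    }
}
```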

Settings

N/A

Context

This could make things faster.

Final checklist

  • I know how to properly use check boxes
  • I have not used any OwO's or UwU's in this issue.
commented

Oops, already replied here btrekkie/dynamic-connectivity@f378578#commitcomment-109123881

I have overridden Attachment.equals

```java
public boolean equals(Object o) { // used as performance optimization in RedBlackNode to avoid augmenting unchanged attachments
```

I will try and merge this in when I get a chance. I wonder if git will allow me to merge this directly?

commented

I was considering a few things for my use case and perhaps I should get more specialized. For example, I could replace the references to an augmentation object with just a function call in EulerTourNode.

Like:

```java
@Override
public boolean augment() {
    int newSize = left.size + right.size + 1;
    // ... (store newSize and return whether the value changed)
}
```
I could just add my own fields here, and I would probably get blazing fast performance since it would be primitive fields rather than an entire augmentation object that needs to be allocated or pooled. I obviously understand that you probably wouldn't want that in your upstream repo because it's specialized and not a good abstraction. And it would make the code more icky and I'll probably only do it if I truly need the performance. But I think that would be the approach if I really needed to squeeze more out of this.
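A sketch of that specialization (field names are mine, not EulerTourNode's real ones): the extra values live directly on the node as primitives and get recomputed in the same `augment()` pass as `size`, with no augmentation object allocated at all.

```java
// Hypothetical node with the augmentation inlined as primitive fields.
// Field names are illustrative; EulerTourNode's real fields differ.
final class Node {
    Node left, right;
    int size = 1;        // subtree node count (like EulerTourNode.size)
    int walkableCount;   // how many vertices in this subtree are walkable
    boolean walkable;    // this node's own flag

    // recompute both values from the children in one pass;
    // return whether anything changed so the caller can stop early
    boolean augment() {
        int newSize = 1 + (left != null ? left.size : 0) + (right != null ? right.size : 0);
        int newWalkable = (walkable ? 1 : 0)
                + (left != null ? left.walkableCount : 0)
                + (right != null ? right.walkableCount : 0);
        if (newSize == size && newWalkable == walkableCount) {
            return false;
        }
        size = newSize;
        walkableCount = newWalkable;
        return true;
    }

    public static void main(String[] args) {
        Node root = new Node(), l = new Node(), r = new Node();
        root.left = l;
        root.right = r;
        root.walkable = true;
        l.walkable = true;
        l.augment();
        r.augment();
        root.augment();
        System.out.println(root.size + " " + root.walkableCount); // prints 3 2
    }
}
```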

Another thought I had was here 8c11840#diff-982795ee3ff2c18759ea5167152094504af1cca0b0fbd4994271411f0d1e685cR25

My use case has no more than 4 edges attached to any given vertex (at most one each for north, south, east, and west). I think I might be able to improve performance by replacing the HashMap in VertexInfo with simply a `ConnEdge[] edges = new ConnEdge[4]`, then indexing into it with something like `edges[Face.between(pos, neighborPos).horizontalIndex]`. (See

```java
public final int horizontalIndex = x & 1 | (x | z) & 2;
```

and

```java
public static final int NUM_X_BITS = 26;
public static final int NUM_Z_BITS = NUM_X_BITS;
public static final int NUM_Y_BITS = 9; // note: even though Y goes from 0 to 255, that doesn't mean 8 bits will "just work" because the deserializer assumes signed. i could change it for just Y to assume unsigned and leave X and Z as signed, however, we know that in 1.17 they plan to add negative Y. for that reason, the better approach is to give the extra bits to Y and leave it as signed.
// also, if 1.17 sticks with the current plan which is -64 to +320, we could have 9 bits for Y and a constant offset of -64 to change it to -128 to +256.
// that would result in the packed long representation of any valid coordinate still being a positive integer
// i like that property, so i will keep num_y_bits at 9 and plan for an offset in 1.17
// it also gives 1 bit of wiggle room in case anything else happens in the future, so we are only using 63 out of 64 bits at the moment
public static final int Z_SHIFT = 0;
public static final int Y_SHIFT = Z_SHIFT + NUM_Z_BITS + 1; // 1 padding bit to make twos complement not overflow
public static final int X_SHIFT = Y_SHIFT + NUM_Y_BITS + 1; // and also here too
public static final long X_MASK = (1L << NUM_X_BITS) - 1L; // X doesn't need padding as the overflow carry bit is just discarded, like a normal long (-1) + (1) = 0
public static final long Y_MASK = (1L << NUM_Y_BITS) - 1L;
public static final long Z_MASK = (1L << NUM_Z_BITS) - 1L;
public static final long POST_ADDITION_MASK = X_MASK << X_SHIFT | Y_MASK << Y_SHIFT | Z_MASK << Z_SHIFT; // required to "manually inline" toLong(-1, -1, -1) here so that javac inserts proper ldc2_w instructions at usage points instead of getstatic
// what's this ^ mask for?
// it allows for efficient offsetting and manipulation of a long packed coordinate
// if we had three ints, x y z, it would be easy to do "y += 1" or "x -= 1"
// but how do you do those things if you have a long with x y and z all stuffed into one primitive?
// adding together two long coordinates actually works perfectly if both sides have X, Y, and Z as all positive, no issues at all
// but when Y or Z is negative, we run into an issue. consider 8 bits: negative one is 11111111 and one is 00000001
// adding them together gives 00000000, zero, **but only because there isn't a 9th bit to carry into**
// if we had, instead, 00000000 11111111 + 00000000 00000001 we would rightly get 00000001 00000000 with the 1 being carried into the 9th position there
// this is exactly what happens. "toLong(0, 1, 0) + toLong(0, -1, 0)" ends up equaling toLong(1, 0, 0) while we'd rather it equal toLong(0, 0, 0)
// so, we simply mask out the unwanted result of the carry by inserting 1 bit of padding space (as added above) between each
// it used to be 000XXXXXXXXXXXXXXXXXXXXXXXXXXYYYYYYYYYZZZZZZZZZZZZZZZZZZZZZZZZZZ
// and now it is 0XXXXXXXXXXXXXXXXXXXXXXXXXX0YYYYYYYYY0ZZZZZZZZZZZZZZZZZZZZZZZZZZ
// we simply place the X Y and Z in slightly different sections of the long, putting a bit of space between each
// the mask ^ is 0111111111111111111111111110111111111011111111111111111111111111
// using that example of (0,1,0) + (0,-1,0), here's what happens
//   0000000000000000000000000000000000001000000000000000000000000000 (this is X=0 Y=1 Z=0)
// + 0000000000000000000000000000111111111000000000000000000000000000 (this is X=0 Y=-1 Z=0)
// = 0000000000000000000000000001000000000000000000000000000000000000
// the unwanted carry bit here ^ is no longer corrupting the least significant bit of X and making it 1!
// now it's just turning on the unused padding bit that we don't care about
// using the mask and bitwise and, we can easily and branchlessly turn off the padding bits just in case something overflow carried into them!
//   0000000000000000000000000001000000000000000000000000000000000000 (the result of the addition from earlier)
// & 0111111111111111111111111110111111111011111111111111111111111111 (this is POST_ADDITION_MASK)
// = 0000000000000000000000000000000000000000000000000000000000000000
// POST_ADDITION_MASK retains the bits that actually form X, Y, and Z, but intentionally turns off the padding bits
// so, we can simply do "(toLong(0, 1, 0) + toLong(0, -1, 0)) & POST_ADDITION_MASK" and correctly get toLong(0, 0, 0)
// which is incredibly fast and efficient, an add then a bitwise AND against a constant
// and it doesn't require us to pull out X, Y, and Z, modify one of them, and put them all back into the long
// that's what the point of the mask is

static {
    if (POST_ADDITION_MASK != toLong(-1, -1, -1)) {
        throw new IllegalStateException(POST_ADDITION_MASK + " " + toLong(-1, -1, -1)); // sanity check
    }
}

public long toLong() {
    return toLong(this.x, this.y, this.z);
}

public static BetterBlockPos fromLong(long serialized) {
    return new BetterBlockPos(XfromLong(serialized), YfromLong(serialized), ZfromLong(serialized));
}

public static int XfromLong(long serialized) {
    return (int) (serialized << (64 - X_SHIFT - NUM_X_BITS) >> (64 - NUM_X_BITS));
}

public static int YfromLong(long serialized) {
    return (int) (serialized << (64 - Y_SHIFT - NUM_Y_BITS) >> (64 - NUM_Y_BITS));
}

public static int ZfromLong(long serialized) {
    return (int) (serialized << (64 - Z_SHIFT - NUM_Z_BITS) >> (64 - NUM_Z_BITS));
}

public static long toLong(final int x, final int y, final int z) {
    return ((long) x & X_MASK) << X_SHIFT | ((long) y & Y_MASK) << Y_SHIFT | ((long) z & Z_MASK) << Z_SHIFT;
}

public static long offsetBy(long pos, int x, int y, int z) {
    return (pos + toLong(x, y, z)) & BetterBlockPos.POST_ADDITION_MASK;
}
```

)
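To make the arithmetic above concrete, here is a self-contained demo. The constants are copied from the snippet; the class name, helper method shape, and direction labels are mine.

```java
// Demo of the two ideas quoted above: the 4-way horizontalIndex trick
// and the packed-long coordinate arithmetic. Constants copied verbatim;
// the class name and static helper shape are illustrative.
final class PackedPosDemo {
    static final int NUM_X_BITS = 26, NUM_Z_BITS = 26, NUM_Y_BITS = 9;
    static final int Z_SHIFT = 0;
    static final int Y_SHIFT = Z_SHIFT + NUM_Z_BITS + 1; // 1 padding bit
    static final int X_SHIFT = Y_SHIFT + NUM_Y_BITS + 1; // 1 padding bit
    static final long X_MASK = (1L << NUM_X_BITS) - 1L;
    static final long Y_MASK = (1L << NUM_Y_BITS) - 1L;
    static final long Z_MASK = (1L << NUM_Z_BITS) - 1L;
    static final long POST_ADDITION_MASK = X_MASK << X_SHIFT | Y_MASK << Y_SHIFT | Z_MASK << Z_SHIFT;

    static long toLong(int x, int y, int z) {
        return ((long) x & X_MASK) << X_SHIFT | ((long) y & Y_MASK) << Y_SHIFT | ((long) z & Z_MASK) << Z_SHIFT;
    }

    static long offsetBy(long pos, int x, int y, int z) {
        // add, then branchlessly clear any carry into the padding bits
        return (pos + toLong(x, y, z)) & POST_ADDITION_MASK;
    }

    static int yFromLong(long s) {
        // shift left to drop higher fields, arithmetic shift right to sign-extend
        return (int) (s << (64 - Y_SHIFT - NUM_Y_BITS) >> (64 - NUM_Y_BITS));
    }

    // maps the four horizontal unit offsets (x, z) to distinct indices 0..3
    static int horizontalIndex(int x, int z) {
        return x & 1 | (x | z) & 2;
    }

    public static void main(String[] args) {
        // (0, 1, 0) + (0, -1, 0) cancels to exactly zero thanks to the mask
        System.out.println(offsetBy(toLong(0, 1, 0), 0, -1, 0)); // prints 0
        // y += 1 on a packed coordinate, then extract y again
        System.out.println(yFromLong(offsetBy(toLong(3, 64, -5), 0, 1, 0))); // prints 65
        // the four horizontal unit offsets land on 0, 1, 2, 3
        System.out.println(horizontalIndex(0, 1) + "," + horizontalIndex(1, 0) + ","
                + horizontalIndex(0, -1) + "," + horizontalIndex(-1, 0)); // prints 0,1,2,3
    }
}
```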

commented

Curiosity got the better of me, so I took some more painkillers and merged in your commit. Thankfully, git let me do it, and it was actually very smart about putting the files in the right places even after I'd moved them all around! Only took a few minutes of fixing package statements and simple merge conflicts.

Comparing 4bea9df (branch builder-2) to 54e7473 (branch builder-2-mutating), I sadly don't see any improvement in NavigableSurfaceTest. It's still around 9.6 to 10 seconds, give or take. If you have IntelliJ, you should be able to open the Baritone project, open NavigableSurfaceTest, then do "run current file" or "profile current file" to see this for yourself.
[Screenshot: 2023-04-15 at 2:00:58 PM]

Before vs after comparison zoomed into one unit test:
[Screenshot, before: 2023-04-15 at 1:56:05 PM]
[Screenshot, after: 2023-04-15 at 1:57:27 PM]

Before vs after comparison zoomed into just removeEdge of that unit test:
[Screenshot, before: 2023-04-15 at 1:56:26 PM]
[Screenshot, after: 2023-04-15 at 1:58:09 PM]

Zoomed way far in, in the mutable version:
[Screenshot: 2023-04-15 at 1:58:47 PM]

commented

Dang, good to know.

I was able to run NavigableSurfaceTest locally. Using MutatingAugmentation reduces the number of Attachment allocations from ~400 million to ~2 million, but has no significant effect on performance. I was really hoping for a simple ~50% reduction in running time.

Anyway, take it easy.

commented

well, lowering allocations may still have more of an effect on some systems or java versions so it is probably still a good thing...

commented

no yeah i think this is a good idea and it's moving in the right direction

i just want to go further and not have the attachments be separate at all. i want to inline them as primitive fields like EulerTourNode.size is

commented

Okay, after tinkering some more, I finally came up with something that significantly reduces the running time of NavigableSurfaceTest: btrekkie/RedBlackNode@4dacba9 . In my testing, it reduced the running time by about 30% (after applying the bugfix at btrekkie/dynamic-connectivity@7c973bc ). I think this is mostly because it decreases the number of calls to augment() by 40%.
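I don't know the exact mechanism in btrekkie/RedBlackNode@4dacba9, but the general shape of "fewer augment() calls" can be sketched as stopping the upward re-augmentation as soon as a node reports no change. The class and method names below are illustrative, not the library's real ones.

```java
// Illustrative sketch: walk upward after a structural change, re-running
// augment() only while it actually changes something. Names are mine,
// not RedBlackNode's real API.
final class SizeNode {
    SizeNode parent, left, right;
    int size = 1; // augmentation inlined as a primitive field

    // recompute from children; return whether the value changed
    boolean augment() {
        int newSize = 1 + (left != null ? left.size : 0) + (right != null ? right.size : 0);
        if (newSize == size) {
            return false;
        }
        size = newSize;
        return true;
    }

    // re-augment ancestors, stopping as soon as one is already correct;
    // returns the number of augment() calls made
    static int augmentAncestors(SizeNode changed) {
        int calls = 0;
        for (SizeNode n = changed; n != null; n = n.parent) {
            calls++;
            if (!n.augment()) {
                break; // everything above is already up to date
            }
        }
        return calls;
    }

    public static void main(String[] args) {
        // chain: root <- child <- grandchild <- leaf, all sizes stale at 1
        SizeNode leaf = new SizeNode(), g = new SizeNode(), c = new SizeNode(), root = new SizeNode();
        g.left = leaf;  leaf.parent = g;
        c.left = g;     g.parent = c;
        root.left = c;  c.parent = root;
        System.out.println(augmentAncestors(g)); // prints 3 (all stale)
        System.out.println(augmentAncestors(g)); // prints 1 (stops immediately)
    }
}
```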

Everything else I've tried so far has had an insignificant effect on the running time of NavigableSurfaceTest, so it may have to wait until I have a more realistic benchmark of performance. (I didn't try moving the Attachment field into EulerTourNode or the VertexInfo.edges optimization you mentioned, because they're not general changes that would be incorporated back into the dynamic-connectivity library.)

commented

Wow!!! Fantastic. I have not been putting as much time as I'd like into builder-2 in the last few weeks, sorry: surgery, then starting a new job next week. I hope to get back into it!

That's a great help and that makes a lot of sense given how much time is spent in augment! Thank you so much!