Facing the Camera: “Physics.Raycast” vs “Vector3.Dot” in Unity3D – PART 2

After further optimization testing, I felt I had enough content to write a new blog post to follow up on the earlier article. The main reason is that the results I came up with were wrong, due to a major flaw in my testing environment. This article will cover physics colliders parented to other objects, dot product vs raycasting, and a little bit of multithreading in Unity3D.

A lot of random-looking dancing objects. This is harder to compute than you think!

After much encouragement (again, special thanks to the IGDA Ann Arbor branch for pushing me to finally get around to this), I did some more tests to experiment with optimizing my method of “3D Cel Animation.” Up to now, I’ve always used “Physics.Raycast” or “Physics.RaycastAll” on many flat planes, carefully placed around the center of a character model, to know which perspective plane to display. It works, but with more than a few characters on screen, some slowdown is noticeable. Running on a mobile OS wasn’t viable either, because the process is CPU heavy. So even though I didn’t need to at the moment, I tried “Vector3.Dot” instead.
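To make the dot-product idea concrete, here is a minimal, engine-agnostic sketch in plain Python (not Unity code). It assumes each plane is described only by an outward-facing normal, and it picks the plane whose normal points most directly at the camera. The names `best_facing_plane` and `PLANES` are hypothetical, invented for this illustration.

```python
import math

def normalize(v):
    """Scale a 3D vector to length 1 (raises ZeroDivisionError for a zero vector)."""
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)

def dot(a, b):
    """Dot product: 1 when two unit vectors align, -1 when they oppose."""
    return sum(x * y for x, y in zip(a, b))

def best_facing_plane(plane_normals, part_position, camera_position):
    """Return the index of the plane whose normal points most directly at the camera."""
    to_camera = normalize(tuple(c - p for c, p in zip(camera_position, part_position)))
    scores = [dot(normalize(n), to_camera) for n in plane_normals]
    return max(range(len(scores)), key=scores.__getitem__)

# Six hypothetical planes around one body part: front, back, right, left, top, bottom.
PLANES = [(0, 0, 1), (0, 0, -1), (1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0)]

if __name__ == "__main__":
    # A camera mostly to the right of the body part selects the "right" plane.
    print(best_facing_plane(PLANES, (0, 0, 0), (4, 1, 1)))
```

No collider or physics query is involved here: the choice comes from one subtraction and one dot product per plane.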

My testing results showed “Vector3.Dot” was actually a bit worse than “Physics.Raycast,” and a bit better than “Physics.RaycastAll,” but also had more opportunity for optimization than the built-in raycasting did. But my test environment had one major flaw: I had hundreds of “Cel Objects” to test on, but none of them were moving!

It turns out this exposes a major point about optimization: parenting objects is bad for efficiency, and should be avoided where possible, especially if child objects have many components. This is because any change made at the top of the tree has to be applied to all children, sometimes updating components that don’t really need that update. In particular, updating physics colliders this way gives a big performance hit. Well… maybe not too big, if you only have one collider per character in your game scene, or even only one collider per body limb.

Me? To use “Physics.Raycast,” I had about 114 individual planes around every limb and body part. That’s a few thousand colliders per character, all of which were parented to an animated 3D joint system. No wonder my profiler would show “PhysicsFixedUpdate” taking a massive amount of CPU, even if nothing big seemed to be happening; even a subtle idle animation could kill performance!

So I fixed the benchmark test: about 650 ‘cel objects,’ making 74,100 planes (and therefore, colliders, if applicable), representing the equivalent of about 30 on-screen 2D characters. Plus, they are all parented to one of several 3D models, a Blender fbx of 3 bones, like a flowing blade of grass.
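As a quick sanity check on those figures, the arithmetic can be spelled out in a few lines of Python. The constants are the numbers quoted in this article; the variable names are only for illustration.

```python
CEL_OBJECTS = 650        # body parts ("cel objects") in the benchmark
PLANES_PER_OBJECT = 114  # planes, and so colliders when enabled, per body part
CHARACTERS = 30          # approximate number of equivalent on-screen characters

total_planes = CEL_OBJECTS * PLANES_PER_OBJECT      # total planes in the scene
parts_per_character = CEL_OBJECTS / CHARACTERS      # body parts per character
planes_per_character = total_planes / CHARACTERS    # colliders per character

if __name__ == "__main__":
    print(total_planes, round(parts_per_character, 1), planes_per_character)
```

That works out to roughly 22 body parts and about 2,470 planes per character, which matches the “few thousand colliders per character” estimate.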

I ran 5 tests: 1) Physics.RaycastAll, 2) Physics.Raycast, 3) Vector3.Dot on all quads, with colliders still active, 4) Vector3.Dot on all quads, with colliders disabled, and 5) Vector3.Dot with Unity3D’s new “Job System” for parallel processing. The results are a bit more like what my colleagues and I expected. Sort of.

Test results of FPS across different tests (Surface Pro 2017 – i5, 8 GB RAM)

So compared to my previous tests: using Physics.Raycast on static (non-moving) objects isn’t too bad next to using Vector3.Dot on literally everything, but moving colliders just destroy performance. Even without much optimization (Vector3.Dot doesn’t really need to be used on EVERY object for every frame of gameplay, which is what I did during testing), Dot Product reigns supreme.

These numbers come from running the test in the Unity3D editor: in a standalone build, Methods 4 and 5 (Dot with no colliders, and Dot with multithreading) each doubled their frame rates. For “Dot with no colliders,” that was almost hitting the 30 fps mark. And this is on a Surface Pro 2017, with a 2-core (4-thread) Intel i5 CPU. It’s not a slouch in the CPU space, but not exactly a gaming PC. The test doesn’t account for other game logic or rendering that would come with a full game, but it’s neat to think I can have as many as 30 detailed 2D characters actively running around, while still getting solid frame rates, on a modern machine that isn’t built for gaming.

Multi-threading is another issue entirely, and a point of confusion. Actually, multi-threading in Unity3D almost deserves its own blog post. But to summarize a bit of it here:

 

Multi-threading in Unity3D:

While Unity3D does employ some optimization underneath, it is mostly inaccessible to the game developer. Therefore, Unity3D has long been considered a “single-threaded” game engine by programmers. Yes, almost any C# library and function can be used in Unity3D, including multi-threading, but other threads are typically blocked from accessing Unity3D-specific logic. Things like object components, or even transform.position, aren’t available. Depending on how your game is programmed, that makes multi-threading difficult to utilize. A shame, since my game has many elements that do not need to speak to each other at any point; it’d be great, and seemingly trivial, to assign some functions to run on their own thread, IF access to game object components within a frame was available.

I found out that Unity3D recently deployed its own native form of “multi-threading” for programmers to utilize, typically described as 3 components: the “Job System,” the “Entity Component System (ECS),” and “Burst.” These have been encouraged with documentation in 2018, although some parts are still experimental and under continuous development through 2019. The “Job System” is the only part that actually sets up “multi-threading” as a programmer might recognize it. “ECS” is meant to define data in such a way that it can be stored in RAM more efficiently (the order of variables listed in a script CAN make a small difference) for read/write access, and “Burst” is meant to compile your code into optimized native machine code to run better on supported machines and architectures, instead of relying on the “Mono” runtime to ensure easy compatibility with many systems.

When Unity3D praises this new feature, they claim it’s a revolution to encourage developers to use a “data-oriented” instead of “object-oriented” approach to programming games. If you’ve used Unity for a long time, it’s a completely different way of thinking. Stop setting properties into sub-classes for organization. Stop parenting sub-objects and components in the scene hierarchy. It feels familiar from the days I made games BEFORE I started using Unity3D, which feels like a loss of Unity’s biggest strength: quick prototyping and easy viewing of 3D objects and components. Even Unity3D’s usually great documentation is lacking in detail on how to utilize this form of multi-threading: I found Intel’s website (they have a sub-division that encourages optimizing for their multi-threaded CPUs) to have the best tutorial on the subject.

 

 

After much trial and error, I finally got an example using the “Job System” working, ignoring the more experimental “ECS” and “Burst” features. That’s Test 5 in this article. I don’t think I did it right, however… I used the default “IJob” definition several times with 100+ handles, completing each handle at the end of a main “Update()” function, when other versions (“IJobParallelFor”, for example) are recommended for better parallel performance… I think? Or does “IJobParallelFor” just do exactly what I did by scheduling many “IJob”s, in easier-to-write code? Again, documentation either isn’t available or isn’t clear. But here is the reason the “Job System” is slower: instead of directly applying “Vector3.Dot” on a transform’s position or forward direction, I have to first copy those “Vector3” values to a special array, THEN send it to a job to compute “Vector3.Dot” for me, and THEN exit the job to apply the result, on the main thread, to the plane the camera sees. That’s a lot of extra copying of data, and with 100+ jobs each doing an individually quick task… the overhead of creating jobs, disposing of jobs, and copying and disposing data makes this a poor case for multi-threading.
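The copy-out, compute, apply-back flow described above can be sketched in plain Python. This is not Unity’s Job System and not the code used in Test 5. It is a hypothetical illustration of the data flow, with a thread pool standing in for jobs and dictionaries standing in for scene objects. Grouping the work into a few larger batches instead of 100+ tiny jobs is an assumption about what might cut scheduling overhead, not a measured result, and Python threads will not show a real speedup here.

```python
from concurrent.futures import ThreadPoolExecutor

def dot(a, b):
    """Dot product of two 3D vectors given as tuples."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def dot_batch(forwards, to_camera):
    """Worker body: it only sees plain copied data, never the scene objects."""
    return [dot(f, to_camera) for f in forwards]

def facing_scores(parts, to_camera, batch_size=64, workers=4):
    # Step 1: copy the needed values out of the scene objects into a flat list.
    forwards = [tuple(p["forward"]) for p in parts]
    # Step 2: schedule a few large batches rather than one tiny job per part.
    batches = [forwards[i:i + batch_size] for i in range(0, len(forwards), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(dot_batch, batches, [to_camera] * len(batches))
        scores = [s for batch in results for s in batch]  # map() keeps input order
    # Step 3: back on the calling thread, apply the results to the scene objects.
    for part, score in zip(parts, scores):
        part["facing_score"] = score
    return scores

if __name__ == "__main__":
    demo = [{"forward": (0, 0, 1)}, {"forward": (1, 0, 0)}, {"forward": (0, 0, -1)}]
    print(facing_scores(demo, (0, 0, 1), batch_size=2))
```

Steps 1 and 3 are pure overhead compared with calling the dot product directly, which is why such a cheap calculation is a weak candidate for this treatment.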

Would multi-threading be more powerful on a higher-core CPU? I also tried a compiled version of the test on a separate machine, with an Intel i7, 6-core (12-thread) CPU, with a much more powerful NVIDIA graphics card to boot. Both Tests 4 and 5 (Vector3.Dot, without colliders, with and without multi-threading) each had almost 2x the stand-alone-build performance; Test 4 was getting a consistent 60 fps! But Test 5 still lagged at about 40 fps. Huh… clearly, I must have done something wrong in how I implemented multi-threading. Or maybe “Vector3.Dot” really is the simpler part of the total calculation being done.

Here’s the curious thing: even though Test 4 seemed to have relatively consistent performance (not as many dips; all the others had some garbage collection to take care of, even Test 5), Test 4 appeared to be a little jerky to the naked eye. In comparison, Test 5, with multi-threading, always appeared smooth. It contradicts what the numbers say, but that’s what my eyes told me. Maybe a lesser load on the main thread worked for the best after all? Or maybe it’s just that I turned “vsync” off in the editor to observe raw performance, and this was some odd screen-tearing for cases that were almost-but-not-quite 30 fps. Anyway, if you’re worried about your game looking smooth… maybe the “Job System” is worth a try.

(Update: Crackers, I’m dumb. I revisited the profiler, and found that Test 4 (and Tests 1-3) were still using the “update sprite at 12 fps” logic. But Test 5 was not; it checks every single frame. No wonder Test 4 seemed jittery, going from 60+ fps to 15 fps constantly… Trying again, Test 4 was able to perform at a more constant rate that was still better than multi-threading in Test 5.)
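For readers unfamiliar with that kind of throttling, here is a rough, engine-agnostic sketch in Python. It assumes the logic means “recompute the facing plane about 12 times per second rather than every frame.” The class name `FacingThrottle` and its methods are hypothetical and are not taken from the project’s code.

```python
class FacingThrottle:
    """Allow an expensive check to run at a fixed rate instead of every frame."""

    def __init__(self, updates_per_second=12.0):
        self.interval = 1.0 / updates_per_second  # seconds between checks
        self.elapsed = 0.0                        # time since the last check

    def should_update(self, delta_time):
        """Call once per frame with the frame's duration in seconds."""
        self.elapsed += delta_time
        if self.elapsed >= self.interval:
            # Keep only the remainder so one long frame does not queue extra checks.
            self.elapsed %= self.interval
            return True
        return False

if __name__ == "__main__":
    throttle = FacingThrottle(updates_per_second=12.0)
    # Simulate one second of 60 fps frames and count how many checks ran.
    checks = sum(throttle.should_update(1.0 / 60.0) for _ in range(60))
    print(checks)
```

A test that throttles like this does far less work on most frames than one that checks every frame, so the two are not measuring the same thing.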

If I try multithreading again and get very different performance results, I might write yet ANOTHER blog post about it. But I don’t need to. As fun as it’s been to experiment with new ways of optimizing my code (and I haven’t had this much fun with game programming in a long while), it’s unlikely that I’ll ever need as many as 30 “3D Cel Animated” characters (with over 650 body parts) at once in my games. Even in “True King,” which will have strategy maps with a couple dozen units on the screen at once… I still wouldn’t reach this cap. And if I did want this many characters at once, consider the amount of unique sprite-textures I’d need… RAM would become a bottleneck before the CPU at this point. It’s a common excuse I use to stop myself from staying in the rabbit hole for too long: “if it’s good enough now, why search for a better way?” And maybe for more complex examples of A.I., or other complex tasks, my experience in multi-threading will come in handy.

And I’ve learned something important: DON’T PARENT MANY COLLIDERS TO A MOVING OBJECT unless you have to.