For my project, I decided to make my ray tracer distributed. As we've learned in class, there are all kinds of ways to distribute the rendering task, so I will go into detail about my methodology and the reasoning behind it.
First off, I am interested in high-quality renderings at large resolutions. If a ray tracer is going to render a really small scene with few rays bouncing around, there is absolutely no need to make the thing parallel (unless you are trying for real-time rendering, but again, I'm stressing quality). Prior to parallelizing this thing, I had some renders that were taking my Intel 2.85 GHz P4 over 20 hours. The images were usually at least 1024x768 (desktop backgrounds! hehe). So first let's address why the workload is so large...
Considering a worst case scenario, we have 1024x1024x36 primary rays (OK, that's roughly 36 million, counting 1024x1024 as a million pixels). Now if every ray hits a dielectric, say 4 times, we'll end up with 2^4 = 16 rays for each primary ray. Let's take that to the worst case and say the rays bounce 1000 times before the color goes to black. We are now looking at something like 16,000 rays for each of the 36,000,000 primary rays, without even considering the complexity of the geometry! That is well over a few hundred billion rays... in plain English... that's a $%&^load of rays!
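To put rough numbers on that estimate, here is a small Python sketch of the arithmetic. It assumes each dielectric hit splits a ray into a reflected and a refracted ray, and it simply reuses the ballpark of 16,000 rays per primary ray from above (not something measured from a real scene). The variable names are made up for illustration.

```python
# Back-of-the-envelope ray counts, using the figures from the text.
width, height, samples_per_pixel = 1024, 1024, 36

# Exact count of primary rays (the text rounds this to 36 million).
primary_rays = width * height * samples_per_pixel

# Assumption: a dielectric hit spawns a reflected and a refracted ray,
# so 4 such hits in a row fan one primary ray out into 2**4 rays.
rays_after_4_hits = 2 ** 4

# Ballpark worst case from the text (an assumed figure, not a measurement).
rays_per_primary_worst_case = 16_000
total_worst_case = 36_000_000 * rays_per_primary_worst_case

print(f"primary rays:        {primary_rays:,}")       # 37,748,736
print(f"rays after 4 hits:   {rays_after_4_hits}")    # 16
print(f"worst-case estimate: {total_worst_case:,}")   # 576,000,000,000
```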
Now regarding distributing the rendering task, I decided to split the actual pixel data across the different processors. There are a few basic reasons, but I'd say it is primarily because I only have access to 5 machines. Also, as I'll show, there is virtually no network overhead when doing this! What does that mean? Given the number of computers, this method (with my code...) is limited by the time it takes to render a pixel, not by network usage. I'll get more into this later...
Because this is a project, I started with a simple ASCII protocol. As it turns out, I only utilize about 0.02% of the network on most scenes. I saw it spike to 15% or so once, but that was a scene that rendered very quickly, so I did not modify the protocol. If needed, it can easily be changed to a binary protocol. Another thing to note is that I have experience with TCP acting ridiculous by not delivering the entire buffer when you expect it. So my protocol puts 4 bytes (an unsigned integer in network order) in front of each message, giving the size of the message that follows. The network library uses this information to return data only when ALL of the data is there!
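The length-prefix idea can be sketched in a few lines of Python. This is only an illustration, not the project's actual network library: it assumes the 4-byte size sits right before the message it describes, and the names `frame` and `Deframer` are hypothetical.

```python
import struct
from typing import List

HEADER = struct.Struct("!I")  # 4-byte unsigned int, network (big-endian) order


def frame(payload: bytes) -> bytes:
    """Prefix a payload with its length so the receiver knows how much to wait for."""
    return HEADER.pack(len(payload)) + payload


class Deframer:
    """Collects raw TCP chunks and hands back only complete messages."""

    def __init__(self) -> None:
        self._buffer = bytearray()

    def feed(self, chunk: bytes) -> List[bytes]:
        self._buffer.extend(chunk)
        messages = []
        while len(self._buffer) >= HEADER.size:
            (length,) = HEADER.unpack_from(self._buffer)
            end = HEADER.size + length
            if len(self._buffer) < end:
                break  # message is still incomplete; wait for more bytes
            messages.append(bytes(self._buffer[HEADER.size:end]))
            del self._buffer[:end]
        return messages


# TCP may split or merge writes arbitrarily; the deframer does not care.
deframer = Deframer()
wire = frame(b"hello") + frame(b"world")
first = deframer.feed(wire[:6])    # header plus 2 bytes -> []
second = deframer.feed(wire[6:])   # the rest -> [b"hello", b"world"]
```

Whatever the chunk boundaries are, the caller only ever sees whole messages, which is the property described above.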
Otherwise, the protocol is client/server based using TCP/IP Berkeley sockets (using winsock2, so I can *easily* port it to Linux if needed). The server sends commands to clients and the clients must reply to each command. The clients can only talk to the server (no intercommunication). While this could have the potential of backing up at the server, for the time being, I've only seen the server hit about 8% network usage on a simple scene. On more complex scenes, the server is usually at about 0.02% network usage.
There are three commands for the clients: load, trace, and quit.
At the risk of stating the obvious, these commands tell the client what to do. Load tells the client to load the specified camera and scene (relative to the working path). Trace tells the client to return the color of a pixel in the image. And quit instructs the client to quit execution.
The clients may send 2 things (responses ONLY): load and trace.
Load indicates that the client is done loading the specified files. Trace indicates that the client has finished coloring a pixel and returns that pixel's color (each component in the range [0,1]). Because many pixels may be queued on a particular client, the response must also include the pixel coordinates.
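As a rough illustration of what such ASCII messages could look like, here is a Python sketch. The wire syntax is entirely hypothetical (space-separated fields, lowercase command names, file names without spaces); the write-up does not spell out the real format, so treat every function name and field order here as an assumption.

```python
from typing import Tuple

Color = Tuple[float, float, float]


def load_command(camera_file: str, scene_file: str) -> str:
    # Hypothetical syntax: "load <camera> <scene>", paths relative to the working path.
    return f"load {camera_file} {scene_file}"


def trace_command(x: int, y: int) -> str:
    # Hypothetical syntax: "trace <x> <y>" asks for the color of one pixel.
    return f"trace {x} {y}"


def trace_reply(x: int, y: int, color: Color) -> str:
    # The reply repeats the pixel coordinates so the server can place the color.
    r, g, b = color
    return f"trace {x} {y} {r!r} {g!r} {b!r}"


def parse_trace_reply(message: str) -> Tuple[int, int, Color]:
    tag, x, y, r, g, b = message.split()
    if tag != "trace":
        raise ValueError(f"not a trace reply: {message!r}")
    color = (float(r), float(g), float(b))
    if not all(0.0 <= c <= 1.0 for c in color):
        raise ValueError("color components must lie in [0, 1]")
    return int(x), int(y), color


reply = trace_reply(3, 7, (0.25, 0.5, 1.0))   # "trace 3 7 0.25 0.5 1.0"
pixel = parse_trace_reply(reply)              # (3, 7, (0.25, 0.5, 1.0))
```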
All single renderings were performed on a 2.4 GHz P4 (w/ 533 MHz bus and 768 MB RAM).
The single machine listed above was both a client and the server in all parallel experiments. The server had a raised priority so that it would distribute work effectively, but it also had both blocking calls and CPU yielding in the code to ensure it was not a CPU hog (it generally used less than 1% CPU). Beyond that, another similar Intel machine was added, as well as 3 Athlon XP 2600+ machines with 1 GB RAM each.
To test the parallelization, I rendered some scenes with both the cluster and the single machine (ugh). I ensured that the pictures were the same. Here is more information about the scenes.
dist_texture_test: This is a scene with a BVH, instances, and textures. It has one metallic object, so the recursion tree is extremely shallow compared to other scenes I tested. Also note this uses only point lights.
cube_reflect: A BVH with 2700 triangles. Uses normal mapping, texture mapping, and 3 point lights. The mirrors in this scene lean more towards recursion depth than the number of leaves, which the next test stresses.
ply: Some high-poly instances with an area light. Low recursion depth per pixel.
cube_area: A BVH with 2700 triangles. Uses normal mapping, texture mapping, and 2 area lights.
bump_sphere: A normal mapped sphere with 2 area lights.
plexi_bump: This is a scene with lots of bump mapped spheres, a dielectric, and mirrors. Oh yeah, and an area light as well!
area_sphere: A huge dielectric sphere with a massive area light that has about 1000 samples to account for its size (the shadows don't look defined enough with low sample counts due to the light's position in the scene). To top it off, there are 2 mirrors reflecting light back and forth.
| Files | Time for Single Machine | Time for Cluster | Speed Up | Speed Up per CPU |
|---|---|---|---|---|
| dist_texture_test | 11:46 (706 s) | 5:11 (311 s) | 2.2701 | 0.4540 |
| cube_reflect | 9:18 (558 s) | 2:57 (177 s) | 3.1525 | 0.6305 |
| ply | 17:56 (1076 s) | 8:02 (482 s) | 2.2324 | 0.4465 |
| cube_area | 3:53:49 (14029 s) | 46:03 (2736 s) | 5.1276 | 1.0255* |
| bump_sphere | 2:38:41 (9521 s) | 33:58 (2038 s) | 4.6717 | 0.9343 |
| plexi_bump | 10:51:32 (39092 s) | 1:57:42 (7062 s) | 5.5355 | 1.1071* |
| area_sphere | (excessive) | 8:03:52 (29032 s) | ? | ? |
*Regarding the speedup per CPU coming out above 1, I think this is because I rendered using both Intel and AMD CPUs. AMD CPUs are known for having more floating point units, so I suppose the single-machine time (measured on the Intel P4) is large compared to what the Athlons can do. The distribution might also be improving cache performance, which would cut the time further.
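For reference, the last two columns of the table follow from the timings like this. The sketch below is just the arithmetic (single-machine time divided by cluster time, then divided by the 5 machines in the cluster); the function names are made up for illustration.

```python
NUM_MACHINES = 5  # the cluster used in the experiments


def speedup(single_s: float, cluster_s: float) -> float:
    """How many times faster the cluster finished than the single machine."""
    return single_s / cluster_s


def speedup_per_cpu(single_s: float, cluster_s: float,
                    machines: int = NUM_MACHINES) -> float:
    """Speedup divided evenly over the machines; 1.0 would be perfect scaling."""
    return speedup(single_s, cluster_s) / machines


# (single-machine seconds, cluster seconds) taken from two rows of the table
timings = {
    "dist_texture_test": (706, 311),
    "plexi_bump": (39092, 7062),
}

for name, (single_s, cluster_s) in timings.items():
    print(f"{name}: {speedup(single_s, cluster_s):.4f} "
          f"{speedup_per_cpu(single_s, cluster_s):.4f}")
# dist_texture_test: 2.2701 0.4540
# plexi_bump: 5.5355 1.1071
```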
At the risk of stating the obvious, this scheme works really well when an image takes a long time to render. Most of the time, the interreflections and area lights seem to cause the extra time. I noticed the server (not the network) was a bit inefficient and was not giving the data out fast enough on the small scenes. In general, the server would take about 45-60% CPU and the clients would only be at 30-50% CPU. The network usage was minimal in these cases. However, in these types of scenes, the render time was not too large at all. When we consider the scenes that take hours to render, the parallelization really helps out. The parallel rendering would finish considerably sooner than I calculated on some pictures.
Some problems with no solutions yet can be grouped together into "REQUIRES RANDOM DATA". An example of this is the turbulent material. The machines were getting different random seeds. Because of this, the turbulence was not "smooth" across all of the objects. I suspect a fix for this is to just enforce random seeding via the network. It wouldn't be much work, but I didn't do it.
Regarding the problems I had prior to my presentation, I had three main issues that were somewhat related:
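The suspected fix can be illustrated with a short Python sketch. It assumes the turbulence is driven by a pseudo-random lookup table (a common approach for Perlin-style noise, but an assumption about this ray tracer), and that the server would hand the same seed to every client, for example along with the load command. The names and the seed value are hypothetical.

```python
import random
from typing import List


def build_noise_table(seed: int, size: int = 256) -> List[int]:
    """Build a shuffled lookup table that depends only on the seed."""
    rng = random.Random(seed)  # private generator, independent of global state
    table = list(range(size))
    rng.shuffle(table)
    return table


shared_seed = 12345  # hypothetical value the server would send to every client

client_a = build_noise_table(shared_seed)
client_b = build_noise_table(shared_seed)
same_everywhere = client_a == client_b  # True: identical tables on both "clients"
```

With each machine picking its own seed instead, the tables would differ and the pattern would not line up from one client's pixels to the next, which matches the symptom described above.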
First off, I was polling for read/write on the client system because I shared code. This was problematic because the client didn't need to poll (however, the server did). The client could only process one job at a time, so it didn't matter if I blocked trying to read the data from the socket. As far as writing, the number of tasks for each client is so low that there's virtually no way it will block! Removing this helped some... but there's more.
Secondly, Nagle's algorithm (for network optimization) is enabled on sockets by default. This algorithm tries to pack data into larger chunks for transmission. So when I would write the messages from the server, they would sit waiting for more data (maybe 2-3 seconds) before getting transmitted. Disabling this on both the client and server helped dramatically!
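Turning Nagle's algorithm off is a single socket option, `TCP_NODELAY`, which exists in both Winsock and Berkeley sockets. The sketch below shows the equivalent call through Python's `socket` module rather than the project's own C/C++ code; the helper name is made up.

```python
import socket


def make_low_latency_socket() -> socket.socket:
    """Create a TCP socket that sends small writes immediately."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY = 1 disables Nagle's algorithm on this socket.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


sock = make_low_latency_socket()
# Read the option back to confirm it took effect (non-zero means enabled).
nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
sock.close()
```

The option has to be set on each end separately, which is why both the client and the server needed the change.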
Finally, I was calling the sleep function on the server so that it would not hog the CPU. Well, according to my tests, the process would not wake up for up to 200-300 milliseconds after the sleep call. This made the transmission speed drop to virtually nothing! To fix this, I had to yield the processor without sleeping. In Linux this function is sched_yield(). In Windows it is SwitchToThread() (WHY?????????). Thanks a LOT to Walt Mundt for helping me with that!
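Here is a minimal sketch of the "yield instead of sleep" idea, written in Python for illustration only. It calls `os.sched_yield()` where the platform provides it and `SwitchToThread()` through `ctypes` on Windows; the `poll_until` loop and its `max_spins` cap are hypothetical stand-ins for the server's real polling loop.

```python
import os
import sys
import time
from typing import Callable


def yield_cpu() -> None:
    """Give up the rest of the time slice without asking to sleep for a fixed time."""
    if hasattr(os, "sched_yield"):
        os.sched_yield()  # POSIX sched_yield()
    elif sys.platform == "win32":
        import ctypes
        ctypes.windll.kernel32.SwitchToThread()  # Win32 SwitchToThread()
    else:
        time.sleep(0)  # fallback where neither call is available


def poll_until(ready: Callable[[], bool], max_spins: int = 1_000_000) -> bool:
    """Check `ready` repeatedly, yielding between checks instead of sleeping."""
    for _ in range(max_spins):
        if ready():
            return True
        yield_cpu()
    return False
```

The loop stays responsive because a yield returns as soon as the scheduler has nothing else to run, whereas a sleep can leave the process parked for a while even after work arrives.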
That's it...