The 3D domain decomposition situation looks better than I expected. AMD GCN has 4 SIMD units per Compute Unit, each with 64 KiB of private registers. If the data is read into registers instead of Local Memory, that frees up Local Memory for ghost cell sharing within a CU. The only real drawbacks: registers cannot be indexed, so I may need to auto-generate a kernel with several hundred lines of spaghetti code, and occupancy considerations mean only a fraction of the registers is usable. :thinking_cirno:
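
To put rough numbers on the occupancy trade-off, here is a back-of-the-envelope sketch. It assumes the commonly documented GCN figures (64 lanes per wavefront, 4-byte VGPRs, 256 VGPRs per lane per SIMD, at most 10 resident wavefronts per SIMD, allocation in granules of 4) and ignores every other occupancy limit such as SGPRs and Local Memory. The function names are made up for illustration.

```python
# Assumed GCN figures, not taken from the post itself:
LANES = 64            # work-items per wavefront
VGPR_BYTES = 4        # one VGPR holds 32 bits per lane
VGPRS_PER_SIMD = 256  # 256 * 64 * 4 bytes = 64 KiB per SIMD
MAX_WAVES = 10        # wavefront slots per SIMD
VGPR_GRANULE = 4      # VGPRs are allocated in multiples of 4 (assumed)


def waves_per_simd(vgprs_per_item):
    """Resident wavefronts per SIMD when limited only by VGPR use."""
    if not 1 <= vgprs_per_item <= VGPRS_PER_SIMD:
        raise ValueError("a work-item can use between 1 and 256 VGPRs")
    # Round up to the allocation granule.
    allocated = -(-vgprs_per_item // VGPR_GRANULE) * VGPR_GRANULE
    return min(MAX_WAVES, VGPRS_PER_SIMD // allocated)


def register_bytes_per_wave(vgprs_per_item):
    """Bytes of register storage one wavefront holds in its VGPRs."""
    return vgprs_per_item * LANES * VGPR_BYTES


for v in (24, 64, 128, 256):
    kib = register_bytes_per_wave(v) // 1024
    print(f"{v:3d} VGPRs/item: {waves_per_simd(v):2d} waves/SIMD, {kib:2d} KiB per wave")
```

Under these assumptions, keeping all 10 wavefronts resident leaves at most 24 VGPRs (6 KiB) per wavefront, while a single wavefront can hold the full 64 KiB at the cost of having nothing else to hide latency.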