Hi Chrisapo,
Generally, parallel performance depends on both the computation and the environment, so you will just have to test what works best for your case.
Judging from your previously posted examples, I checked the following:
For a 2D example with 128x128 grid points, N=100 particles and M=4 orbitals, I got reasonable performance using just OpenMP threads with a single MPI process on my desktop computer. In that case, going beyond 4 threads didn't improve performance much.
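If you want to run the same kind of check on your own machine, here is a rough sketch of a thread scan in Python. It is not part of the program itself: the helper names time_run and scan_threads are made up for this example, and the command is a placeholder you would replace with however you normally launch your run. The only real knob it touches is the standard OpenMP environment variable OMP_NUM_THREADS.

```python
import os
import subprocess
import sys
import time


def time_run(cmd, threads):
    """Run cmd once with OMP_NUM_THREADS=threads; return wall time in seconds."""
    # Copy the current environment and override only the OpenMP thread count.
    env = dict(os.environ, OMP_NUM_THREADS=str(threads))
    start = time.perf_counter()
    subprocess.run(cmd, env=env, check=True, stdout=subprocess.DEVNULL)
    return time.perf_counter() - start


def scan_threads(cmd, thread_counts):
    """Return {threads: (seconds, speedup relative to the first entry)}."""
    results = {}
    baseline = None
    for n in thread_counts:
        seconds = time_run(cmd, n)
        if baseline is None:
            baseline = seconds  # first thread count is the reference
        results[n] = (seconds, baseline / seconds)
    return results


if __name__ == "__main__":
    # Placeholder command (assumption): pass your own launch command as
    # arguments, otherwise a trivial no-op Python process is timed.
    cmd = sys.argv[1:] or [sys.executable, "-c", "pass"]
    for n, (seconds, speedup) in scan_threads(cmd, [1, 2, 4, 8]).items():
        print(f"{n:2d} threads: {seconds:8.2f} s  speedup {speedup:5.2f}")
```

Use a short but representative run (same grid size, N and M as your production case), and repeat each measurement a few times, since a single timing can be noisy. Once the speedup column stops growing, adding more threads is probably not worth it.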
Hope that helps!