utilizing concurrency

With pipeline parallelism, and especially tensor parallelism, a lot of throughput is left on the table whenever a task that could be broken into multiple pieces and solved in parallel is instead solved one piece at a time.

We want a good way to use this extra performance, probably with some sort of toggle for max concurrency (default 1), and let the model divide up tasks accordingly.
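A rough sketch of what the toggle could look like, assuming the task has already been split into independent subtasks and that each one is handled by an async call. The names `run_subtasks`, `solve`, and `max_concurrency` are hypothetical placeholders, not an existing API, and how the model actually decides on the split is left open here.

```python
import asyncio
from typing import Awaitable, Callable, List


async def run_subtasks(
    subtasks: List[str],
    solve: Callable[[str], Awaitable[str]],  # hypothetical: one model call per subtask
    max_concurrency: int = 1,                # default 1 keeps today's sequential behavior
) -> List[str]:
    """Run `solve` on every subtask with at most `max_concurrency` in flight."""
    if max_concurrency < 1:
        raise ValueError("max_concurrency must be >= 1")

    # The semaphore is the toggle: it caps how many subtasks run at once.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(subtask: str) -> str:
        async with semaphore:
            return await solve(subtask)

    # gather returns results in the same order as the inputs.
    results = await asyncio.gather(*(bounded(s) for s in subtasks))
    return list(results)
```

With `max_concurrency=1` this degrades to solving the pieces one at a time, so turning the toggle up is purely opt-in. Calling it would look like `asyncio.run(run_subtasks(pieces, solve, max_concurrency=4))`, where `pieces` and `solve` come from whatever does the splitting and the model calls.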