Speed up the scheduling of stateless stages
Idea: A single instance of a stateless stage could be executed by multiple threads at the same time, since it holds no state that would need synchronization.
Current Problems:
- The user must still indicate whether the ordering of the elements should be preserved. However, we could make "preserve ordering" the default.
- Each thread must reserve or copy the input elements it needs before it executes the stage. This requires that such a stateless stage consumes a fixed number of elements per execution.
- What should happen with unprocessed input elements if an error occurs?
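
The points above can be made concrete with a minimal Python sketch. Everything in it is an assumption for illustration and not an existing API: the function name `run_stateless_stage`, its parameters, and the policy of handing unprocessed elements back to the caller after an error, which is only one possible answer to the last question. Each thread reserves a fixed-size chunk under a lock, runs the stage function without any locking, and tags its result with the chunk's start index so that the original order can be restored. The sketch shows the coordination only and makes no claim about the achievable speedup.

```python
import threading


def run_stateless_stage(stage_fn, inputs, chunk_size, num_threads,
                        preserve_order=True):
    """Run one stateless stage function on several threads (hypothetical helper).

    stage_fn takes a list of at most chunk_size elements and returns a list.
    Returns (outputs, unprocessed, errors).
    """
    lock = threading.Lock()
    next_start = 0              # index of the first element not yet reserved
    failed = threading.Event()  # set as soon as any execution raises
    results = []                # (chunk_start, outputs) in completion order
    leftover = []               # (chunk_start, inputs) of failed executions
    errors = []

    def worker():
        nonlocal next_start
        while not failed.is_set():
            # Reserve a chunk; this is the only step that needs the lock
            # besides recording results.
            with lock:
                start = next_start
                if start >= len(inputs):
                    return
                next_start = start + chunk_size
            # Simplification: the final chunk may be shorter than chunk_size.
            chunk = inputs[start:start + chunk_size]
            try:
                out = stage_fn(chunk)
            except Exception as exc:
                with lock:
                    errors.append(exc)
                    leftover.append((start, chunk))
                failed.set()
                return
            with lock:
                results.append((start, out))

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    if preserve_order:
        # Restore input order by sorting on the reserved start index.
        results.sort(key=lambda r: r[0])
    outputs = [x for _, out in results for x in out]

    # Unprocessed = elements of failed chunks plus everything never reserved.
    leftover.sort(key=lambda r: r[0])
    unprocessed = [x for _, chunk in leftover for x in chunk]
    unprocessed += list(inputs[next_start:])
    return outputs, unprocessed, errors
```

With `preserve_order=False` the outputs are emitted in completion order, which avoids the sort but makes the result order depend on thread timing. After an error, chunks that other threads had already reserved are still finished, so no reserved element is silently dropped.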