It's known that the Intel Core 2 Duo has 3 SSE units. These 3 units allow 3 SSE instructions to run in parallel (1), for example:
It's also known that each SSE unit consists of 2 modules: one for addition (subtraction) and one for multiplication (division). This allows mulps-addps instruction sequences to run in parallel (2), for example:
Which instruction ordering should I prefer, A or B?
Is it possible to distribute 3 mulps to the 3 SSE multiplication units (1) and, at the same time, distribute addps to their respective SSE addition units (2), resulting in a total of 6 instructions per scheduled cycle?
By 'scheduled' I mean the throughput rate.
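
For concreteness, the two patterns could be sketched in C with SSE intrinsics roughly as follows. This is only an illustration under assumptions: the function names `grouped` and `interleaved`, the 12-float arrays, and the computation `a * b + c` are hypothetical and chosen just to have something runnable. The compiler is also free to reschedule the instructions, so the source order here does not guarantee the order of the emitted mulps and addps.

```c
#include <xmmintrin.h>  /* SSE intrinsics: _mm_loadu_ps, _mm_mul_ps, _mm_add_ps, _mm_storeu_ps */

/* Pattern (1): three independent multiplications issued back to back,
 * followed by three additions. Each array holds 12 floats (3 SSE vectors). */
static void grouped(const float *a, const float *b, const float *c, float *out)
{
    __m128 m0 = _mm_mul_ps(_mm_loadu_ps(a + 0), _mm_loadu_ps(b + 0));
    __m128 m1 = _mm_mul_ps(_mm_loadu_ps(a + 4), _mm_loadu_ps(b + 4));
    __m128 m2 = _mm_mul_ps(_mm_loadu_ps(a + 8), _mm_loadu_ps(b + 8));

    _mm_storeu_ps(out + 0, _mm_add_ps(m0, _mm_loadu_ps(c + 0)));
    _mm_storeu_ps(out + 4, _mm_add_ps(m1, _mm_loadu_ps(c + 4)));
    _mm_storeu_ps(out + 8, _mm_add_ps(m2, _mm_loadu_ps(c + 8)));
}

/* Pattern (2): each multiplication is immediately followed by an addition.
 * In this sketch the addition consumes the product, which is an assumption
 * made only so that both functions compute the same result. */
static void interleaved(const float *a, const float *b, const float *c, float *out)
{
    for (int i = 0; i < 12; i += 4) {
        __m128 m = _mm_mul_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i));
        _mm_storeu_ps(out + i, _mm_add_ps(m, _mm_loadu_ps(c + i)));
    }
}
```

Both functions compute the same values, so any difference between them would show up only in timing, which would have to be measured on the target CPU after checking the generated assembly.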