There are a number of things we can do to speed up Python-to-Python calls without changing the stack layout.
Faster creation of interpreter frames
Our fastest Python-to-Python call is _INIT_CALL_PY_EXACT_ARGS, which is reasonably efficient but could definitely be made faster.
There are a few issues with it:
- It contains a variable length loop.
- The inlined call to _PyFrame_PushUnchecked also contains a variable length loop.
We can make the loops fixed length by:
- Unconditionally copying self_or_null.
- Only adjusting the pointer, not the count, if self is not NULL. If self_or_null is NULL it will then be overwritten.
- Breaking _INIT_CALL_PY_EXACT_ARGS into two parts, one to initialize the arguments and one to NULL out the remaining locals. Both can be marked replicate to avoid the loop.
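The list above can be sketched in plain C. This is a simplified model, not the actual uop definition: the frame is reduced to an array of void * slots, and the names init_args, clear_locals and MAX_LOCALS are hypothetical. It only illustrates that the copy loop's trip count depends on nargs alone (a constant once the uop is replicated), while the presence of self only shifts the destination pointer.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LOCALS 8  /* hypothetical frame size for this model */

/* Part 1: initialize the arguments. Returns the number of locals filled. */
int init_args(void **locals, void *self_or_null, void *const *args, int nargs)
{
    int has_self = (self_or_null != NULL);
    assert(nargs >= 0 && nargs + has_self <= MAX_LOCALS);

    locals[0] = self_or_null;        /* unconditional copy */
    void **dst = locals + has_self;  /* adjust only the pointer, not the count */
    for (int i = 0; i < nargs; i++) {
        /* If self_or_null was NULL, dst == locals and slot 0 is overwritten. */
        dst[i] = args[i];
    }
    return nargs + has_self;
}

/* Part 2: NULL out the remaining locals, from index first up to nlocals. */
void clear_locals(void **locals, int first, int nlocals)
{
    for (int i = first; i < nlocals; i++) {
        locals[i] = NULL;
    }
}
```

With nargs fixed per replica, a compiler can fully unroll the first loop; the second loop would need to be replicated on the number of remaining locals to get the same effect.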
Better optimization of other Py-to-Py calls in tier 2
We currently specialize the remaining Py-to-Py calls into "with defaults" and do not specialize
for "code complex parameters".
We should treat both the same in tier 1 as "CALL_PY", and expand the call sequence in tier 2 to
produce an optimal sequence of instructions.
This will probably make no difference to T1 performance: the "with defaults" case will get a tiny bit slower and the other cases might be a bit faster.
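As an illustration of what an expanded call sequence might do for the "with defaults" case, here is a minimal sketch of filling in the parameters the caller did not pass. The helper fill_defaults and its slot-array model are assumptions made for this example, not existing CPython code. In tier 2 the values nargs, nparams and ndefaults would typically be known when the trace is built, so the loop could be emitted as a fixed sequence of stores.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: `defaults` holds values for the last `ndefaults`
 * parameters. Returns the number of initialized parameters, or -1 if the
 * call cannot be handled by defaults alone. */
int fill_defaults(void **locals, int nargs, int nparams,
                  void *const *defaults, int ndefaults)
{
    assert(nargs >= 0 && ndefaults >= 0);

    int missing = nparams - nargs;
    if (missing < 0 || missing > ndefaults) {
        return -1;  /* too many or too few arguments: take the general path */
    }
    /* Only the trailing `missing` defaults are needed. */
    void *const *src = defaults + (ndefaults - missing);
    for (int i = 0; i < missing; i++) {
        locals[nargs + i] = src[i];
    }
    return nparams;
}
```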
Remove f_globals and f_builtins from the interpreter frame
In tier 2, we have largely eliminated access to f_globals and f_builtins.
We can speed up calls in tier 2, without slowing down access to globals, if we
were to remove these fields.
Doing this will slow down access to globals in tier 1, however.
In order to get an overall speedup, the ratio of tier 2 to tier 1 code will need to increase.
Once the ratio of T2 to T1 code is 3:1 or better, it should be profitable to remove these fields.
Faster checking of stack space
#620