I recently started getting my feet wet with the high-level optimizations in the Julia compiler, and one of my recent projects was the implementation of the call-site inlining annotation. In this blog post I'd like to share some of the things I learned while working on that PR.
I will first explain the general idea of inlining and showcase an example application. Then I will introduce the annotations that Julia programmers can use to control inlining decisions, and finally I'm also going to explain a bit of the Julia compiler's inlining cost model.
Some of the code snippets shown in this article are only runnable on Julia 1.8 or higher. The results shown here were generated with this Julia version:
versioninfo()
Julia Version 1.10.0-DEV.614
Commit 95e0da1421e (2023-02-15 19:01 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 2 × Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, broadwell)
Threads: 1 on 2 virtual cores
Environment:
LD_LIBRARY_PATH = /opt/hostedtoolcache/Python/3.8.16/x64/lib
Inlining is one of the most common program optimization techniques. Replacing a function call-site with the body of the function may improve performance because:
1. it eliminates the overhead of the function call
2. it may increase the chances of other optimizations
Generally speaking, on modern powerful CPUs the overhead of a function call is very low, so the performance benefit from 1. is often negligible compared to the other computation costs[1]. But when a function that is called within a heavy loop isn't inlined, the cost of the function calls adds up and may come to the surface. Also, if the function being called is very simple and is called frequently, the cost of the function call can be comparable to the cost of the function's computation itself, so inlining such functions may provide a relative performance gain too.
On the other hand, since inlining drastically changes the program representation, and thus also changes the opportunities for the optimizations that follow, the effect of 2. can impact program performance in a broader context. In general, intra-procedural analysis (i.e. analysis within a local scope) is much easier than inter-procedural analysis (i.e. analysis across arbitrary function calls). Inlining puts together routines that would otherwise need to be analyzed inter-procedurally, so from the viewpoint of program optimization it is like converting a difficult problem into a much easier one. For this reason, it is usually more effective to perform inlining at an earlier stage of the optimization pipeline.
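As a tiny illustration of 2. (square and f below are made-up helpers, not part of the example that follows): once square(3) is inlined into f, the compiler sees the whole expression 3*3 + 1 at once and, together with constant folding, should be able to reduce f to simply returning the constant 10.
square(x) = x * x
f() = square(3) + 1   # after inlining: `3*3 + 1`, which constant-folds to `10`

@code_typed f()       # expected to show little more than `return 10`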
Since modern compilers perform a variety of optimizations, it is hard to state unconditionally what kinds of optimizations can be enabled by inlining beyond eliminating the function call overhead. But here, let's take the following simple Julia program as an example of how inlining can also reduce memory allocations.
Consider this Julia program[2]:
struct Point
    x::Float64
    y::Float64
end
a::Point +ₚ b::Point = Point(a.x+b.x, a.y+b.y)
function compute(n)
    a = Point(1.5, 2.5)
    b = Point(2.25, 4.75)
    for _ in 1:n
        a = (a +ₚ b) +ₚ b
    end
    return a.x, a.y
end
compute (generic function with 1 method)
When we call compute(n::Int), it first allocates two Point structs called a and b. Then ::Point +ₚ ::Point, called within the for loop, takes Point objects as its arguments and computes and allocates a new Point struct. Since the for loop iterates n times, if it simply goes on like this, we may end up with a great many allocations when n is big.
Is there any way we can prevent these allocations? If +ₚ isn't inlined, then there is no chance, since the Point objects need to exist in memory in order to be passed as the arguments of +ₚ.
What if +ₚ is inlined into compute? After the Julia compiler performs the inlining, compute will be equivalent to:
function compute(n)
    a = Point(1.5, 2.5)
    b = Point(2.25, 4.75)
    for i in 0:(n-1)
        tmp = Point(a.x+b.x, a.y+b.y)   # the expanded body of `a +ₚ b`
        a = Point(tmp.x+b.x, tmp.y+b.y) # the expanded body of `tmp +ₚ b`
    end
    return a.x, a.y
end
Looking closely, you may see that the compute above is equivalent to the following version, where all the Points are replaced by scalar values:
function compute(n)
    ax, ay = 1.5, 2.5   # replaced `Point(1.5, 2.5)`
    bx, by = 2.25, 4.75 # replaced `Point(2.25, 4.75)`
    for i in 0:(n-1)
        tmpx, tmpy = ax+bx, ay+by # replaced `Point(a.x+b.x, a.y+b.y)`
        ax, ay = tmpx+bx, tmpy+by # replaced `Point(tmp.x+b.x, tmp.y+b.y)`
    end
    return ax, ay
end
Hooray! By inline expansion we can eliminate the Point allocations entirely while keeping the semantics of the original program[3].
Here, for the sake of explanation, we performed the inlining and allocation elimination manually, but in reality a compiler performs all of these optimizations automatically. So programmers usually don't need to do these optimizations by hand, and even when we do, such a need should eventually be removed by the evolution of the compiler. The Julia compiler also implements inlining and allocation elimination, and it can eliminate all the allocations involved in e.g. compute(100_000_000) completely. We can confirm that optimization by looking at the output of @code_typed compute(100_000_000):
@code_typed compute(100_000_000) # fully inlined
CodeInfo(
1 ── %1 = Base.sle_int(1, n)::Bool
└─── goto #3 if not %1
2 ── goto #4
3 ── goto #4
4 ┄─ %5 = φ (#2 => n, #3 => 0)::Int64
└─── goto #5
5 ── goto #6
6 ── %8 = Base.slt_int(%5, 1)::Bool
└─── goto #8 if not %8
7 ── goto #9
8 ── goto #9
9 ┄─ %12 = φ (#7 => true, #8 => false)::Bool
│ %13 = φ (#8 => 1)::Int64
│ %14 = Base.not_int(%12)::Bool
└─── goto #15 if not %14
10 ┄ %16 = φ (#9 => %13, #14 => %28)::Int64
│ %17 = φ (#9 => 1.5, #14 => %21)::Float64
│ %18 = φ (#9 => 2.5, #14 => %22)::Float64
│ %19 = Base.add_float(%17, 2.25)::Float64
│ %20 = Base.add_float(%18, 4.75)::Float64
│ %21 = Base.add_float(%19, 2.25)::Float64
│ %22 = Base.add_float(%20, 4.75)::Float64
│ %23 = (%16 === %5)::Bool
└─── goto #12 if not %23
11 ─ goto #13
12 ─ %26 = Base.add_int(%16, 1)::Int64
└─── goto #13
13 ┄ %28 = φ (#12 => %26)::Int64
│ %29 = φ (#11 => true, #12 => false)::Bool
│ %30 = Base.not_int(%29)::Bool
└─── goto #15 if not %30
14 ─ goto #10
15 ┄ %33 = φ (#13 => %21, #9 => 1.5)::Float64
│ %34 = φ (#13 => %22, #9 => 2.5)::Float64
│ %35 = Core.tuple(%33, %34)::Tuple{Float64, Float64}
└─── return %35
) => Tuple{Float64, Float64}
There are no %new statements, which means there are no longer any allocations.
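If you prefer a quick numeric check over reading the IR, something like the following should also report zero bytes (keep in mind that @allocated only measures heap allocations, and we run the function once beforehand so that compilation is not counted):
compute(10)             # warm up: run once so compilation is not measured
@allocated compute(10)  # expected to report 0 bytes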
Well, sadly, it is not always a good idea to perform inlining. Excessive inlining can lead to worse performance for the following reasons:
Run-time cost due to the increased size of the compiled native code:
  the increased size of the compiled native code itself can be memory hungry
  bigger code may lead to poor locality of memory accesses, which makes cache misses more likely
  complex instruction sequences may hinder speculative execution optimizations
Compile-time overhead due to the increased complexity of the IR (the "Intermediate Representation" of the program) on which the succeeding optimization passes work
So compilers often have some kind of "cost model" that judges whether or not inlining a function call is profitable. Such a cost model is usually based on heuristics, and criteria like the following are commonly used:
1. simplicity of the function body
2. # of call-sites (i.e. static call count)
3. # of runtime calls (i.e. dynamic call count)
4. escapability of arguments[4]
In any case, this is a problem with no absolutely correct answer, so there are many other possibilities as well, such as employing machine learning techniques to predict the costs.
Anyway, although the idea of inlining sounds pretty simple, its application is actually very profound.
The Julia compiler also has its own cost model for inlining decisions. Basically, it judges inlineability based on 1. simplicity of the function body. This cost model sounds pretty simple, but it judges very reasonably in many cases, meaning we Julia programmers usually don't need to care about inlining.
On the other hand, there may still be rare occasions where you find that the Julia compiler doesn't perform inlining as expected when you check the output of code_typed or Cthulhu.jl in order to pursue performance.
As an example, let's take a look at the following program, which is a slightly modified version of the above example. In order to cheat the inlining cost model, /ₚ contains very contrived and unnatural error checks (I will explain the exact reason for this tweak at the end of this article):
struct Point
    x::Float64
    y::Float64
end
a::Point +ₚ b::Point = Point(a.x+b.x, a.y+b.y)
a::Point /ₚ b::Point = begin
    # error pass
    if false
        @label diverror
        error("/ₚ: division error detected !")
    end
    # do some error checks
    iszero(a.y) && @goto diverror
    iszero(b.y) && @goto diverror
    # do the main computation
    Point(a.x/a.y, b.x/b.y)
end
function compute(n)
    a = Point(1.5, 2.5)
    b = Point(2.25, 4.75)
    for i in 0:(n-1)
        a = (a +ₚ b) /ₚ b
    end
    return a.x, a.y
end
compute (generic function with 1 method)
For the sake of comparison with later versions, let's take a benchmark first:
# the initial version
@benchmark compute(100_000_000)
BenchmarkTools.Trial: 5 samples with 1 evaluation.
Range (min … max): 1.207 s … 1.252 s ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.216 s ┊ GC (median): 0.00%
Time (mean ± σ): 1.223 s ± 17.412 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
█ ██ █ █
█▁▁▁▁▁▁▁▁██▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
1.21 s Histogram: frequency by time 1.25 s <
Memory estimate: 0 bytes, allocs estimate: 0.
Although the performance seems quite reasonable already, let's assume we want to make it even faster for some reason. A first step in optimizing a Julia program is to take a look at the output of @code_typed:
@code_typed compute(100_000_000) # only partially inlined
CodeInfo(
1 ── %1 = Base.sub_int(n, 1)::Int64
│ %2 = Base.sle_int(0, %1)::Bool
└─── goto #3 if not %2
2 ── goto #4
3 ── goto #4
4 ┄─ %6 = φ (#2 => %1, #3 => -1)::Int64
└─── goto #5
5 ── goto #6
6 ── %9 = Base.slt_int(%6, 0)::Bool
└─── goto #8 if not %9
7 ── goto #9
8 ── goto #9
9 ┄─ %13 = φ (#7 => true, #8 => false)::Bool
│ %14 = φ (#8 => 0)::Int64
│ %15 = Base.not_int(%13)::Bool
└─── goto #15 if not %15
10 ┄ %17 = φ (#9 => %14, #14 => %30)::Int64
│ %18 = φ (#9 => $(QuoteNode(Point(1.5, 2.5))), #14 => %24)::Point
│ %19 = Base.getfield(%18, :x)::Float64
│ %20 = Base.add_float(%19, 2.25)::Float64
│ %21 = Base.getfield(%18, :y)::Float64
│ %22 = Base.add_float(%21, 4.75)::Float64
│ %23 = %new(Point, %20, %22)::Point
│ %24 = invoke :/ₚ(%23::Point, $(QuoteNode(Point(2.25, 4.75)))::Point)::Point
│ %25 = (%17 === %6)::Bool
└─── goto #12 if not %25
11 ─ goto #13
12 ─ %28 = Base.add_int(%17, 1)::Int64
└─── goto #13
13 ┄ %30 = φ (#12 => %28)::Int64
│ %31 = φ (#11 => true, #12 => false)::Bool
│ %32 = Base.not_int(%31)::Bool
└─── goto #15 if not %32
14 ─ goto #10
15 ┄ %35 = φ (#13 => %24, #9 => $(QuoteNode(Point(1.5, 2.5))))::Point
│ %36 = Base.getfield(%35, :x)::Float64
│ %37 = Base.getfield(%35, :y)::Float64
│ %38 = Core.tuple(%36, %37)::Tuple{Float64, Float64}
└─── return %38
) => Tuple{Float64, Float64}
We can observe:
the invoke :/ₚ(...)::Point statement indicates that there is a (static) function call of /ₚ, meaning /ₚ isn't inlined
the %new(Point, ...)::Point statement means that some object "allocation" happens there[5]
In this case, inlining /ₚ could still be beneficial for the following reasons:
when n is big, the costs of the repeated /ₚ calls may accumulate and become apparent
by inlining /ₚ, there seems to be a good chance to eliminate the allocations that happen at %new(Point, ...)::Point as well as those inside the call of /ₚ
What can we do in such a situation? Julia programmers can ask the compiler to perform inlining using either of the following:
the definition-site annotation
the call-site annotation
In a case where we "own" the function, we can use the definition-site annotation. The usage is very simple – we annotate @inline or @noinline on the function definition. For this case, we can use @inline to encourage the inlining of /ₚ:
@inline a::Point /ₚ b::Point = begin
    # error pass
    if false
        @label diverror
        error("/ₚ: division error detected !")
    end
    # do some error checks
    iszero(a.y) && @goto diverror
    iszero(b.y) && @goto diverror
    # do the main computation
    Point(a.x/a.y, b.x/b.y)
end
/ₚ (generic function with 1 method)
Let's run the benchmark again:
# the definition-site annotation version
@benchmark compute(100_000_000)
BenchmarkTools.Trial: 8 samples with 1 evaluation.
Range (min … max): 670.188 ms … 740.710 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 691.924 ms ┊ GC (median): 0.00%
Time (mean ± σ): 693.627 ms ± 20.987 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▁ ▁ ▁ ▁ ▁█ ▁
█▁▁▁▁▁█▁▁▁▁▁▁█▁▁█▁▁▁██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
670 ms Histogram: frequency by time 741 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
Yay, the runtime drops from about 1.2 s to about 0.69 s, roughly a 40% improvement over the previous version!
Next, let's consider a case where we don't "own" the function that we want to be inlined. Applied to this example, it would be a situation where Point, +ₚ and /ₚ are provided by a library that we don't maintain, and we are users of that library. In the real world we may see such situations rather often, since we usually compose many library utilities to build up an algorithm or application.
Under such circumstances, it is really not preferable to use the definition-site annotation. It is not impossible, since Julia allows us to overwrite a function definition at runtime[6], but it tends to be a source of bugs, and in Julia it may also incur unnecessary compilation overhead since it invalidates already-compiled code caches.
In that situation you can use the call-site annotation, which is available starting with Julia 1.8. The call-site annotation uses the same @inline/@noinline macros as the definition-site annotation, but the macros are applied to function calls rather than function definitions. There is no need to overwrite any function definition, so we don't need to worry about the problems of the definition-site annotation.
The call-site annotation was added to the Julia language only recently, and since it is not a very common feature in other languages, you may not be familiar with it yet. You can read its documentation here, but to put it simply, it is implemented according to the following design (the sketch below illustrates how these rules compose):
a call-site annotation affects all the function calls within the block to which the annotation is applied
a call-site annotation always takes precedence over the definition-site annotation
when call-site annotations are nested, the innermost annotation takes precedence
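For instance, here is a small sketch of how the rules compose; g, h and caller are hypothetical helpers, not part of the running example:
g(x) = x + 1
h(x) = 2x
function caller(x)
    @inline begin
        y = g(x)              # every call inside this block gets the inlining hint
        return @noinline h(y) # the innermost annotation wins: `h(y)` is hinted *not* to be inlined
    end
end
Note that these call-site hints also take precedence over whatever definition-site annotations g and h might carry.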
For this case we can use it like:
function compute(n)
    a = Point(1.5, 2.5)
    b = Point(2.25, 4.75)
    for i in 0:(n-1)
        a = @inline (a +ₚ b) /ₚ b
    end
    return a.x, a.y
end
compute (generic function with 1 method)
Again, we obtain roughly the same 40% performance improvement with the annotation!
# the call-site annotation version
@benchmark compute(100_000_000)
BenchmarkTools.Trial: 8 samples with 1 evaluation.
Range (min … max): 671.822 ms … 692.895 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 681.513 ms ┊ GC (median): 0.00%
Time (mean ± σ): 682.441 ms ± 8.071 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
██ █ █ █ █ █ █
██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁█▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁█▁▁▁█ ▁
672 ms Histogram: frequency by time 693 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
So, in what kinds of situations should we consider adding annotations? Here is my rule of thumb:
when a simple function called within a heavy loop isn't inlined: add @inline
when a complex function that is rarely called is inlined: add @noinline (see the sketch below)
when there is heavy compilation overhead for code full of @inline annotations: remove @inline
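As a small sketch of the second rule (report_failure and process! are hypothetical names), a relatively complex error path that is almost never taken can be kept out of the hot loop with a definition-site @noinline:
@noinline report_failure(i, x) = error("negative value $x found at index $i")

function process!(out, xs)
    for (i, x) in pairs(xs)
        x < 0 && report_failure(i, x)  # rare, complex path: better left as a call
        out[i] = sqrt(x)
    end
    return out
end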
Having said that, it would be even more ideal if there were no need for programmers to add such annotations at all. In other words, if a function that should be inlined is not inlined, or vice versa, it simply means there is room for improvement in the compiler's cost model.
So if you encounter such a situation when inspecting a Julia program, please feel free to report it as an issue on the JuliaLang/julia repository; it may lead to an improvement of the Julia compiler. In fact, there has been at least one example of such an improvement just recently :)
We can take a peek at Julia's inlining cost model using Cthulhu.jl. Let's see why /ₚ wasn't inlined without any annotation:
julia> using Cthulhu
julia> descend(/ₚ, (Point,Point); inline_cost=true)
/ₚ(a::Point, b::Point) in Main at untitled-e25fcb34ce5237c6bb8376cca04cf1b2:24
│ ─ %-1 = invoke /ₚ(::Point,::Point)::Point
26 1 ─ 0 goto #3 │
28 2 ┄ 0 invoke Main.error("/ₚ: division error detected !"::String)::Union{}
└── 0 unreachable │
31 3 ─ 0 %4 = Base.getfield(_2, :y)::Float64 │╻ getproperty
│ 2 %5 = Base.eq_float(%4, 0.0)::Bool ││╻ ==
└── 0 goto #5 if not %5 │
4 ─ 40 goto #2 │
32 5 ─ 0 %8 = Base.getfield(_3, :y)::Float64 │╻ getproperty
│ 2 %9 = Base.eq_float(%8, 0.0)::Bool ││╻ ==
└── 0 goto #7 if not %9 │
6 ─ 40 goto #2 │
34 7 ─ 0 %12 = Base.getfield(_2, :x)::Float64 │╻ getproperty
│ 0 %13 = Base.getfield(_2, :y)::Float64 ││
│ 20 %14 = Base.div_float(%12, %13)::Float64 │╻ /
│ 0 %15 = Base.getfield(_3, :x)::Float64 │╻ getproperty
│ 0 %16 = Base.getfield(_3, :y)::Float64 ││
│ 20 %17 = Base.div_float(%15, %16)::Float64 │╻ /
│ 0 %18 = %new(Main.Point, %14, %17)::Point │╻ Point
└── 0 return %18 │
Select a call to descend into or ↩ to ascend. [q]uit. [b]ookmark.
Toggles: [o]ptimize, [w]arn, [h]ide type-stable statements, [d]ebuginfo, [r]emarks, [i]nlining costs, [t]ype annotations, [s]yntax highlight for Source/LLVM/Native.
Show: [S]ource code, [A]ST, [T]yped code, [L]LVM IR, [N]ative code
Actions: [E]dit source code, [R]evise and redisplay
Advanced: dump [P]arams cache.
• %2 = invoke error(::String)::Union{}
↩
What is shown above is the body of /ₚ, but in the intermediate representation that the Julia compiler works on. Here, the third column from the left (the one composed of the numbers 0/2/20/40) shows the computed inlining cost of each SSA statement.
When there are no annotations, the Julia compiler usually inlines functions whose inlining costs don't sum up to more than 100, and otherwise doesn't. The sum of the inlining costs of /ₚ is 124, and this is why it was not inlined.
Taking a closer look, you will notice that the two goto #2 statements have each been assigned a high inlining cost of 40. These two goto #2 statements correspond to the @goto diverror in the original program: they are jump instructions that go backward against the control flow. Such a backward jump usually implies the existence of a loop, which often performs "complex" computation, and it is therefore assigned a high inlining cost. As a result, /ₚ was not inlined.
The core logic of the Julia compiler's inlining cost model lives in base/compiler/optimize.jl, and the inlining costs of the built-in functions are defined in base/compiler/tfuncs.jl. In particular, the backward-jump heuristic elaborated above corresponds to the logic defined here.
You may find other interesting insights if you fiddle with Cthulhu to see how the inlining cost model makes decisions on various kinds of code.
In this article we saw the general idea of inlining and its pros and cons, and we covered the two different inlining annotations that are available to Julia programmers.
There are still many other interesting topics that we couldn't cover. For example, in the context of Julia compiler development, it was quite an interesting finding for me that the success of inlining really depends on constant propagation, while the profitability of constant propagation also depends on the success of inlining.
But this is just an introduction to inlining, so let's wrap it up here. I hope you enjoyed this post, and the fact that the idea of inlining is pretty simple but its application is actually very profound!
[1] References:
[2] The example was adapted from this code snippet used in the documentation of the LuaJIT compiler.
[3] This allocation elimination by replacing structs with scalar values is called "SROA" (Scalar Replacement of Aggregates).
[4] JVM compilers often implement good escape analysis. Some of them seem to tweak their inlining cost depending on whether or not there is any argument that doesn't escape, since inlining such function calls often leads to successful SROA: https://www.usenix.org/legacy/events/vee05/full_papers/p111-kotzmann.pdf
[5] You may have noticed that @benchmark compute(100_000_000) nevertheless reported something that seems to conflict with this statement: Memory estimate: 0 bytes, allocs estimate: 0. That is because neither @benchmark nor @allocated accounts for anything allocated on stack memory, by design. Since Point is defined as an immutable type, the Julia compiler can optimize its allocation so that Point objects are allocated on the stack rather than the heap. This means the memory allocated for a Point is released as soon as control flow returns from the call, so there is no GC pressure to manage and release the allocated memory later. There still remain some computations to store and load data to and from stack memory, since the Point objects need to exist somewhere, but these computations are very cheap compared to the cost of heap allocation as a whole. This is why compute(100_000_000) was already quite fast without the inlining-annotation tweaks I showcased later on.
Now you may have wondered what happens if we define Point as a mutable struct instead. The good news is that we are working on this sort of memory optimization! You can take a peek at these HackMD documents, or even drop in at our weekly meeting if you're interested. Please ping me on the Julia Slack if that's the case.
[6] This kind of technique is often called "monkey patching". The actual method of monkey patching really depends on what kind of features are provided by the language or framework. In Julia, we can do monkey patching using Core.eval, which allows us to evaluate an arbitrary piece of code in the context of an already-defined module. For this specific case, if module A defines /ₚ, we can overwrite that definition with a new, equivalent definition except with the definition-site @inline annotation:
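A minimal sketch of what that could look like (assuming the library module is literally named A and that it defines Point and /ₚ exactly as shown earlier):
Core.eval(A, quote
    @inline a::Point /ₚ b::Point = begin
        # identical body to the original definition, now with the `@inline` annotation
        if false
            @label diverror
            error("/ₚ: division error detected !")
        end
        iszero(a.y) && @goto diverror
        iszero(b.y) && @goto diverror
        Point(a.x/a.y, b.x/b.y)
    end
end)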