.. rst-class:: sphx-glr-example-title
.. _sphx_glr_tutorials_schedule_primitives.py:
===============================
Schedule Primitives in CINN
===============================
In this tutorial, we will walk you through examples of using schedule primitives.
.. code-block:: python
import cinn
import numpy as np
# sphinx_gallery_thumbnail_path = './paddlepaddle.png'
Declare some variables for later use.
:code:`Expr` is short for expression.
.. code-block:: python
m = cinn.Expr(32)
n = cinn.Expr(8)
print(m, n)
# get the integer contained in an integer expression
print(m.int())
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
32 8
32
A schedule can be created from a list of Tensors.
.. code-block:: python
# declare an elementwise multiply
A = cinn.Placeholder('float32', 'A', (m, n))
B = cinn.Placeholder('float32', 'B', (m, n))
C = cinn.compute((m, n), lambda v: A(v[0], v[1]) * B(v[0], v[1]), name='C')
# create the stages for further schedule
stages = cinn.create_stages([C])
# lower will transform the computation to real code
fn = cinn.lower("fn", stages, [A.to_tensor(), B.to_tensor(), C])
print(fn)
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
function fn (_A, _B, _C)
{
for (i, 0, 32)
{
for (j, 0, 8)
{
C[i, j] = (A[i, j] * B[i, j])
}
}
}
One schedule is composed of multiple stages. We provide several
methods to schedule each stage.
split
------
:code:`split` can partition a specific axis into two axes by :code:`factor`.
.. code-block:: python
A = cinn.Placeholder('float32', 'A', (m, ))
B = cinn.compute((m, ), lambda v: A(v[0]) * 2., name='B')
stages = cinn.create_stages([B])
i0, i1 = stages[B].split(level=0, factor=4)
fn = cinn.lower("fn", stages, [A.to_tensor(), B])
print(fn)
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
function fn (_A, _B)
{
for (i_outer, 0, 8)
{
for (i_inner, 0, 4)
{
B[((4 * i_outer) + i_inner)] = (2 * A[((4 * i_outer) + i_inner)])
}
}
}
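To see what the lowered loops compute, the following plain-Python sketch (illustrative only, not part of the CINN API) replays the index arithmetic from the code above and checks that the two loops produced by :code:`split` cover exactly the original axis of extent 32.

.. code-block:: python

    # Illustrative check: split(level=0, factor=4) yields an outer loop of
    # extent 8 and an inner loop of extent 4, and 4 * i_outer + i_inner
    # enumerates the original indices 0..31 exactly once, in order.
    visited = [4 * i_outer + i_inner for i_outer in range(8) for i_inner in range(4)]
    assert visited == list(range(32))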
fuse
------
:code:`fuse` can fuse two specific axes into one axis.
It is the reverse operation of `split`.
.. code-block:: python
A = cinn.Placeholder('float32', 'A', (m, n))
B = cinn.compute((m, n), lambda v: A(v[0], v[1]) * 2., name='B')
stages = cinn.create_stages([B])
i0 = stages[B].fuse(level0=0, level1=1)
fn = cinn.lower("fn", stages, [A.to_tensor(), B])
print(fn)
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
function fn (_A, _B)
{
for (i_j_fused, 0, 256)
{
B[(i_j_fused / 8), (i_j_fused % 8)] = (2 * A[(i_j_fused / 8), (i_j_fused % 8)])
}
}
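Conversely, the single fused index is mapped back to the original two axes with a division and a modulo, exactly as in the lowered code above. A small plain-Python sketch (illustrative only, not part of the CINN API):

.. code-block:: python

    # Illustrative check: the fused loop of extent 256 reproduces every
    # (i, j) pair of the original 32 x 8 iteration space via divmod,
    # matching B[(i_j_fused / 8), (i_j_fused % 8)] in the lowered code.
    pairs = [divmod(i_j_fused, 8) for i_j_fused in range(256)]
    assert pairs == [(i, j) for i in range(32) for j in range(8)]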
tile
------
:code:`tile` can partition two adjacent axes into blocks.
.. code-block:: python
A = cinn.Placeholder('float32', 'A', (m, n))
B = cinn.Placeholder('float32', 'B', (m, n))
C = cinn.compute((m, n), lambda v: A(v[0], v[1]) * B(v[0], v[1]), name='C')
stages = cinn.create_stages([C])
i, j = stages[C].axis(0), stages[C].axis(1)
i_outer, i_inner, j_inner, j_outer = stages[C].tile(i, j, 4, 4)
fn = cinn.lower("fn", stages, [A.to_tensor(), B.to_tensor(), C])
print(fn)
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
function fn (_A, _B, _C)
{
for (i_outer, 0, 8)
{
for (i_inner, 0, 4)
{
for (j_outer, 0, 2)
{
for (j_inner, 0, 4)
{
C[((4 * i_outer) + i_inner), ((4 * j_outer) + j_inner)] = (A[((4 * i_outer) + i_inner), ((4 * j_outer) + j_inner)] * B[((4 * i_outer) + i_inner), ((4 * j_outer) + j_inner)])
}
}
}
}
}
reorder
---------
:code:`reorder` can rearrange the axes in the specified order.
.. code-block:: python
A = cinn.Placeholder('float32', 'A', (m, n))
B = cinn.Placeholder('float32', 'B', (m, n))
C = cinn.compute((m, n), lambda v: A(v[0], v[1]) * B(v[0], v[1]), name='C')
stages = cinn.create_stages([C])
i0, i1 = stages[C].axis(0), stages[C].axis(1)
stages[C].reorder([i1, i0])
fn = cinn.lower("fn", stages, [A.to_tensor(), B.to_tensor(), C])
print(fn)
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
function fn (_A, _B, _C)
{
for (j, 0, 8)
{
for (i, 0, 32)
{
C[i, j] = (A[i, j] * B[i, j])
}
}
}
unroll
------
:code:`unroll` unrolls a specific axis.
.. code-block:: python
A = cinn.Placeholder('float32', 'A', (m, n))
B = cinn.Placeholder('float32', 'B', (m, n))
C = cinn.compute((m, n), lambda v: A(v[0], v[1]) * B(v[0], v[1]), name='C')
stages = cinn.create_stages([C])
i1 = stages[C].axis(1)
stages[C].unroll(i1)
fn = cinn.lower("fn", stages, [A.to_tensor(), B.to_tensor(), C])
print(fn)
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
function fn (_A, _B, _C)
{
for (i, 0, 32)
{
C[i, 0] = (A[i, 0] * B[i, 0])
C[i, 1] = (A[i, 1] * B[i, 1])
C[i, 2] = (A[i, 2] * B[i, 2])
C[i, 3] = (A[i, 3] * B[i, 3])
C[i, 4] = (A[i, 4] * B[i, 4])
C[i, 5] = (A[i, 5] * B[i, 5])
C[i, 6] = (A[i, 6] * B[i, 6])
C[i, 7] = (A[i, 7] * B[i, 7])
}
}
compute_inline
----------------
:code:`compute_inline` marks a stage as inline, then the computation
body will be expanded and inserted at the location where the tensor
is referenced.
.. code-block:: python
A = cinn.Placeholder('float32', 'A', (m, n))
B = cinn.Placeholder('float32', 'B', (m, n))
C = cinn.compute((m, n), lambda v: A(v[0], v[1]) * B(v[0], v[1]), name='C')
# C1[i,j] = C[i,j] + B[i,j]
C1 = cinn.compute([m, n], lambda v: C(v[0], v[1]) + B(v[0], v[1]), "C1")
# C2[i,j] = C1[i,j] + B[i,j]
C2 = cinn.compute([m, n], lambda v: C1(v[0], v[1]) + B(v[0], v[1]), "C2")
stages = cinn.create_stages([C, C1, C2])
stages[C].compute_inline()
stages[C1].compute_inline()
fn = cinn.lower("fn", stages, [A.to_tensor(), B.to_tensor(), C2])
print(fn)
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
function fn (_A, _B, _C2)
{
for (i, 0, 32)
{
for (j, 0, 8)
{
C2[i, j] = ((2 * B[i, j]) + (A[i, j] * B[i, j]))
}
}
}
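The inlined expression :code:`(2 * B[i, j]) + (A[i, j] * B[i, j])` is just the algebraic composition of the three stages: C2 = C1 + B = (C + B) + B = A * B + 2 * B. A quick NumPy check (illustrative only, independent of CINN) confirms the two forms agree:

.. code-block:: python

    import numpy as np

    # Illustrative check: composing the three stages step by step gives the
    # same result as the single inlined expression printed above.
    a = np.random.rand(32, 8).astype('float32')
    b = np.random.rand(32, 8).astype('float32')
    c = a * b    # C[i, j]  = A[i, j] * B[i, j]
    c1 = c + b   # C1[i, j] = C[i, j] + B[i, j]
    c2 = c1 + b  # C2[i, j] = C1[i, j] + B[i, j]
    assert np.allclose(c2, 2 * b + a * b)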
bind
----------------
:code:`bind` can bind a specified axis to a thread axis.
.. code-block:: python
A = cinn.Placeholder('float32', 'A', (m, n))
B = cinn.Placeholder('float32', 'B', (m, n))
C = cinn.compute((m, n), lambda v: A(v[0], v[1]) * B(v[0], v[1]), name='C')
stages = cinn.create_stages([C])
stages[C].bind(0, "blockIdx.x")
stages[C].bind(1, "threadIdx.x")
fn = cinn.lower("fn", stages, [A.to_tensor(), B.to_tensor(), C])
print(fn)
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
function fn (_A, _B, _C)
{
for (i, 0, 32)
{
for (j, 0, 8)
{
C[i, j] = (A[i, j] * B[i, j])
}
}
}
compute_at
----------------
:code:`compute_at` can specify that a stage is computed within
another stage's loop scope.
The input param :code:`other` specifies the other stage.
The input param :code:`level` specifies which loop in that stage's
scope the computation is placed at.
.. code-block:: python
A = cinn.Placeholder('float32', 'A', (m, n, n))
B = cinn.Placeholder('float32', 'B', (m, n, n))
C = cinn.compute(
(m, n), lambda v: A(v[0], v[1], v[1]) * B(v[0], v[1], v[1]), name='C')
D = cinn.compute((m, n), lambda v: C(v[0], v[1]) + 1., name='D')
stages = cinn.create_stages([C, D])
print("---------Before compute_at---------")
fn = cinn.lower("fn", stages, [A.to_tensor(), B.to_tensor(), C, D])
print(fn)
print("---------After compute_at---------")
stages[C].compute_at(other=stages[D], level=1)
fn2 = cinn.lower("fn", stages, [A.to_tensor(), B.to_tensor(), C, D])
print(fn2)
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
---------Before compute_at---------
function fn (_A, _B, _C, _D)
{
for (i, 0, 32)
{
for (j, 0, 8)
{
C[i, j] = (A[i, j, j] * B[i, j, j])
}
}
for (i, 0, 32)
{
for (j, 0, 8)
{
D[i, j] = (1 + C[i, j])
}
}
}
---------After compute_at---------
function fn (_A, _B, _C, _D)
{
for (i, 0, 32)
{
for (j, 0, 8)
{
C[i, j] = (A[i, j, j] * B[i, j, j])
D[i, j] = (1 + C[i, j])
}
}
}
cache_read
----------
:code:`cache_read` can create a cache Tensor and load the original
Tensor's data into this buffer.
It replaces all reads of the original Tensor in the reader stages with the cache.
.. code-block:: python
A = cinn.Placeholder('float32', 'A', (m, n))
B = cinn.compute((m, n), lambda v: A(v[0], v[1]) * 2., name='B')
stages = cinn.create_stages([B])
ACR = stages[A.to_tensor()].cache_read("local", [B], stages)
fn = cinn.lower("fn", stages, [A.to_tensor(), ACR, B])
print(fn)
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
function fn (_A, _A_read_cache, _B)
{
for (i, 0, 32)
{
for (j, 0, 8)
{
A_read_cache[i, j] = A[i, j]
}
}
for (i, 0, 32)
{
for (j, 0, 8)
{
B[i, j] = (2 * A_read_cache[i, j])
}
}
}
cache_write
-----------
:code:`cache_write` can create a cache for writing to the
original tensor.
It stores the data in the cache buffer first, then
writes it to the output tensor.
.. code-block:: python
A = cinn.Placeholder('float32', 'A', (m, n))
B = cinn.compute((m, n), lambda v: A(v[0], v[1]) * 2., name='B')
stages = cinn.create_stages([B])
BCR = stages[B].cache_write("local", stages, B)
fn = cinn.lower("fn", stages, [A.to_tensor(), B, BCR])
print(fn)
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
function fn (_A, _B, _B_write_cache)
{
for (i, 0, 32)
{
for (j, 0, 8)
{
B_write_cache[i, j] = (2 * A[i, j])
}
}
for (i, 0, 32)
{
for (j, 0, 8)
{
B[i, j] = B_write_cache[i, j]
}
}
}
parallel
--------
:code:`parallel` marks one loop to execute in parallel. (Only used in x86 backends.)
.. code-block:: python
A = cinn.Placeholder('float32', 'A', (m, n))
B = cinn.compute((m, n), lambda v: A(v[0], v[1]) * 2., name='B')
stages = cinn.create_stages([B])
stages[B].parallel(0)
fn = cinn.lower("fn", stages, [A.to_tensor(), B])
print(fn)
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
function fn (_A, _B)
{
parallel for (i, 0, 32)
{
for (j, 0, 8)
{
B[i, j] = (2 * A[i, j])
}
}
}
vectorize
---------
:code:`vectorize` vectorizes the loop at the given :code:`level` by the given factor. (Only used in x86 backends.)
.. code-block:: python
A = cinn.Placeholder('float32', 'A', (m, n))
B = cinn.compute((m, n), lambda v: A(v[0], v[1]) * 2., name='B')
stages = cinn.create_stages([B])
stages[B].vectorize(0, 10)
fn = cinn.lower("fn", stages, [A.to_tensor(), B])
print(fn)
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
function fn (_A, _B)
{
for (i, 0, 4)
{
for (j, 0, 8)
{
B[Ramp((10 * i),1,10), Broadcast(j,10)] = (Broadcast(2,10) * A[Ramp((10 * i),1,10), Broadcast(j,10)])
}
}
}
--------------------------------------------------------------
An example of optimizing performance in CUDA backends
--------------------------------------------------------------
**In this section, we will show a practical example of optimizing performance using schedule primitives.**
We optimize an elementwise kernel using :code:`fuse`, :code:`split` and :code:`bind`.
.. code-block:: python
A = cinn.Placeholder('float32', 'A', (m, m))
B = cinn.compute((m, m), lambda v: A([v[0], v[1]]) * 2., name='B')
stages = cinn.create_stages([B])
fn0 = cinn.lower("fn", stages, [A.to_tensor(), B])
print("Original kernel before optimizing:\n", fn0)
stages[B].fuse(0, 1)
stages[B].split(level=0, factor=256)
stages[B].bind(0, "blockIdx.x")
stages[B].bind(1, "threadIdx.x")
fn1 = cinn.lower("fn", stages, [A.to_tensor(), B])
print("\n======================================\nThe optimized kernel:\n", fn1)
.. rst-class:: sphx-glr-script-out
Out:
.. code-block:: none
Original kernel before optimizing:
function fn (_A, _B)
{
for (i, 0, 32)
{
for (j, 0, 32)
{
B[i, j] = (2 * A[i, j])
}
}
}
======================================
The optimized kernel:
function fn (_A, _B)
{
for (i_j_fused_outer, 0, 4)
{
for (i_j_fused_inner, 0, 256)
{
B[((i_j_fused_inner / 32) + (8 * i_j_fused_outer)), (i_j_fused_inner % 32)] = (2 * A[((i_j_fused_inner / 32) + (8 * i_j_fused_outer)), (i_j_fused_inner % 32)])
}
}
}
Thus we get an optimized kernel.
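The index arithmetic in the optimized kernel maps each (block, thread) pair back to a unique (i, j) of the original 32 x 32 iteration space, so the result is unchanged. A plain-Python sketch of that mapping (illustrative only, not part of the CINN API):

.. code-block:: python

    # Illustrative check: after fuse + split(factor=256), every
    # (i_j_fused_outer, i_j_fused_inner) pair maps to a distinct (i, j),
    # following the index expressions in the lowered code above. With bind,
    # the outer loop becomes blockIdx.x and the inner loop threadIdx.x.
    pairs = set()
    for outer in range(4):          # i_j_fused_outer -> blockIdx.x
        for inner in range(256):    # i_j_fused_inner -> threadIdx.x
            i = inner // 32 + 8 * outer
            j = inner % 32
            pairs.add((i, j))
    assert pairs == {(i, j) for i in range(32) for j in range(32)}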
.. rst-class:: sphx-glr-timing
**Total running time of the script:** ( 0 minutes 0.441 seconds)