
Potential performance improvement? Something similar to Julia's @simd? #1493

paugier opened this issue Mar 18, 2020 · 8 comments

paugier (Contributor) commented Mar 18, 2020

Hi Serge,

I hope you are fine at home.

I investigated the bad result for Pythran mentioned in https://github.com/Thierry-Dumont/BenchmarksPythonJuliaAndCo/wiki/5-The-FeStiff-benchmark (see the resulting PR Thierry-Dumont/BenchmarksPythonJuliaAndCo#12).

I isolated two issues. The first one is the lack of a native struct (so a method call goes through Python getattr and 2 function calls instead of 1). Using __slots__ and PyPy removes most of this overhead!
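(For readers unfamiliar with the trick, here is a minimal illustration with hypothetical names, not the benchmark's actual class:)

# Hypothetical illustration: __slots__ replaces the per-instance __dict__
# with fixed attribute storage, which makes attribute access cheaper,
# especially on PyPy.
class Point2D:
    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y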

Then, even for the numerical kernel alone (the code is here: https://github.com/paugier/bench_integrate_callback/tree/master/struct_simd/only_simd), Pythran is slower than Julia on this example (Julia is ~2.4 times faster).

The implementation in Julia contains the same information as the one in Python, except for the macro @simd (https://docs.julialang.org/en/v1/base/base/#Base.SimdLoop.@simd), which is used to "annotate a for loop to allow the compiler to take extra liberties to allow loop re-ordering".

Even without @simd, Julia is faster than Pythran, but @simd seems to help Julia be faster still (here, Pythran runs in 0.30 µs):

Julia without simd 0.157 µs
Julia with simd    0.125 µs
ratio (without simd)/(with simd): 1.26
  1. There may be a (small) performance issue with this code and Pythran.
  2. For me, the macro @simd is somewhat similar to OpenMP comments, so I was wondering if we could also add such information in a comment (something like # simd for) just before the for loop. Could Pythran do something useful with that?
serge-sans-paille (Owner) commented:

The correct way to achieve that goal, the Pythran way, would be through OpenMP simd directives. I'll have a look!
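(For context, Pythran accepts OpenMP directives written as plain Python comments. A hypothetical sketch of what such an annotation could look like on the hot loop, assuming an omp simd directive were honored here, with made-up names:)

def inner_products(fgrads, out, dv):
    # hypothetical sketch -- the exact directive spelling is an assumption
    ii = 0
    for i in range(6):
        #omp simd
        for j in range(i + 1):
            out[ii] = dv * (fgrads[i] * fgrads[j]).sum()
            ii += 1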

serge-sans-paille (Owner) commented Mar 29, 2020

@paugier : using the appropriate vector form:

#pythran export compute(float[3], float[3], float[6,3,2], float[21])
import numpy as np

def compute(x, y, grads, out):
    det = -(x[1] - x[2]) * y[0] + (x[0] - x[2]) * y[1] - (x[0] - x[1]) * y[2]
    dv = 1.0 / (6.0 * det)
    ii = 0
    # view each of the 6 gradients as a flat vector of length 6
    fgrads = grads.reshape((6, 6))
    # fill the 21 entries of the packed lower-triangular matrix
    for i in range(6):
        for j in range(i + 1):
            out[ii] = dv * np.sum(fgrads[i] * fgrads[j])
            ii += 1

This gets vectorized with -DUSE_XSIMD just fine. Can you confirm?
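(For anyone reproducing this: assuming the file is named compute.py, a typical invocation would be pythran compute.py -DUSE_XSIMD, optionally adding -march=native.)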


serge-sans-paille (Owner) commented:

Note:

import numpy as np

def compute(x, y, grads, out):
    det = -(x[1] - x[2]) * y[0] + (x[0] - x[2]) * y[1] - (x[0] - x[1]) * y[2]
    dv = 1.0 / (6.0 * det)
    ii = 0
    # same packed lower-triangular fill, without reshaping grads first
    for i in range(6):
        for j in range(i + 1):
            out[ii] = dv * np.sum(grads[i] * grads[j])
            ii += 1

works just fine too.

serge-sans-paille (Owner) commented:

And I particularly like this form, even if it's arguably less intuitive:

#pythran export compute(float[3], float[3], float[6,3,2], float[21])
import numpy as np

def compute(x, y, grads, out):
    det = -(x[1] - x[2]) * y[0] + (x[0] - x[2]) * y[1] - (x[0] - x[1]) * y[2]
    dv = 1.0 / (6.0 * det)
    for i in range(6):
        for j in range(i + 1):
            # (i * (i + 1)) // 2 + j is the packed index of entry (i, j)
            # of the lower triangle, so no running counter is needed
            out[(i * (i + 1)) // 2 + j] = dv * np.sum(grads[i] * grads[j])
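(Quick sanity check, not from the original comment: the packed index visits 0..20 exactly once for 0 <= j <= i < 6.)

# verify the triangular packing covers 0..20 with no gaps or repeats
indices = [(i * (i + 1)) // 2 + j for i in range(6) for j in range(i + 1)]
assert indices == list(range(21))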

paugier (Contributor, author) commented Mar 29, 2020

Nice new versions! Unfortunately, I don't get any speedup compared to the original code. Worse, -DUSE_XSIMD doesn't accelerate these functions, and even makes your first version slower than without -DUSE_XSIMD!

Without -DUSE_XSIMD:

compute :  0.28 µs
compute1:  0.29 µs
compute2:  0.28 µs
compute3:  0.28 µs
Julia:   0.12 µs
ratio Pythran/Julia: 2.30

And with -DUSE_XSIMD:

compute :  0.28 µs
compute1:  0.38 µs
compute2:  0.28 µs
compute3:  0.28 µs
Julia:   0.12 µs
ratio Pythran/Julia: 2.32

serge-sans-paille (Owner) commented:

Are you using the fixed-size version in the pythran annotation?
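(For readers following along: the fixed-size annotation is the one quoted above; a generic annotation leaves the extents unspecified. A rough sketch of the two, where the generic spelling is my assumption of the usual Pythran syntax:)

# fixed-size export (as quoted in the earlier comment)
#pythran export compute(float[3], float[3], float[6,3,2], float[21])

# generic export, sizes left unspecified (assumed spelling)
#pythran export compute(float[], float[], float[:,:,:], float[])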

paugier (Contributor, author) commented Mar 30, 2020

The previous results were with the fixed-size version in the pythran annotation.

With the standard annotations, it gives:

Without -DUSE_XSIMD:

compute :  0.30 µs
compute1:  0.30 µs
compute2:  0.45 µs
Julia:   0.13 µs
ratio Pythran/Julia: 2.36

With -DUSE_XSIMD:

compute :  0.30 µs
compute1:  0.38 µs
compute2:  0.48 µs
Julia:   0.13 µs
ratio Pythran/Julia: 2.37

With -DUSE_XSIMD -march=native:

compute :  0.30 µs
compute1:  0.39 µs
compute2:  0.57 µs
Julia:   0.13 µs
ratio Pythran/Julia: 2.36

-DUSE_XSIMD and -march=native slow down the computation!

MordicusEtCubitus commented May 17, 2021

Hi,
This is mostly off-topic, but I've looked at the old just-in-time compiler used by the no-longer-maintained HOPE library.

It does its JIT compilation using only pure Python code, which I found quite amazing because the code is really simple.
Unfortunately, only a small subset of Python is supported, and the library is not able to compile the famous Mandelbrot example below:

import numpy as np
import hope
from timeit import default_timer as timer
from matplotlib.pyplot import imshow, show

@hope.jit
def mandel(x, y, max_iters):
    """
    Given the real and imaginary parts of a complex number,
    determine if it is a candidate for membership in the Mandelbrot
    set given a fixed number of iterations.
    """
    c = complex(x, y)
    z = 0.0j
    for i in range(max_iters):
        z = z*z + c
        if (z.real*z.real + z.imag*z.imag) >= 4:
            return i

    return max_iters

@hope.jit
def create_fractal(min_x, max_x, min_y, max_y, image, iters, w, h):
    height = h
    width = w

    pixel_size_x = (max_x - min_x) / width
    pixel_size_y = (max_y - min_y) / height

    for x in range(width):
        real = min_x + x * pixel_size_x
        for y in range(height):
            imag = min_y + y * pixel_size_y
            color = mandel(real, imag, iters)
            image[y, x] = color
    


image = np.zeros((1024, 1536), dtype = np.uint8)
start = timer()
create_fractal(-2.0, 1.0, -1.0, 1.0, image, 50, 1536, 1024)  # pass w and h matching image.shape
dt = timer() - start

print("Mandelbrot created in %f s" % dt)
imshow(image)
show()

But using its own examples, it appears to be nearly 2 times faster than Numba's default JIT:

import numpy as np
from scipy.integrate import odeint
import hope
import numba

P, d, B, G, A = 0, 0.0001, 0.0095, 0.0001, 0.0001
N = 10**3
def f_nat(y, t):
    dy = np.empty(3)
    dy[0] = P - B*y[0]*y[1] - d*y[0]
    dy[1] = B*y[0]*y[1] + G*y[2] - A*y[0]*y[1]
    dy[2] = d*y[0] + A*y[0]*y[1] - G*y[2]
    return dy

y0, t = np.array([500., 0., 5.]), np.linspace(0, 5., N)


@hope.jit
def f_hope(y, t, P, d, B, G, A):
    dy = np.empty(3)
    dy[0] = P - B*y[0]*y[1] - d*y[0]
    dy[1] = B*y[0]*y[1] + G*y[2] - A*y[0]*y[1]
    dy[2] = d*y[0] + A*y[0]*y[1] - G*y[2]
    return dy

@hope.jit
def f_opt(y, t, dy, P, d, B, G, A):
    dy[0] = P - B*y[0]*y[1] - d*y[0]
    dy[1] = B*y[0]*y[1] + G*y[2] - A*y[0]*y[1]
    dy[2] = d*y[0] + A*y[0]*y[1] - G*y[2]
    return dy

@numba.jit
def f_numb(y, t, P, d, B, G, A):
    dy = np.empty(3)
    dy[0] = P - B*y[0]*y[1] - d*y[0]
    dy[1] = B*y[0]*y[1] + G*y[2] - A*y[0]*y[1]
    dy[2] = d*y[0] + A*y[0]*y[1] - G*y[2]
    return dy

@numba.jit
def f_numb_opt(y, t, dy, P, d, B, G, A):
    dy[0] = P - B*y[0]*y[1] - d*y[0]
    dy[1] = B*y[0]*y[1] + G*y[2] - A*y[0]*y[1]
    dy[2] = d*y[0] + A*y[0]*y[1] - G*y[2]
    return dy

dy = np.empty(3)
print ("native python")
%timeit odeint(f_nat, y0, t)
print ("hope")
%timeit odeint(f_hope, y0, t, args=(P, d, B, G, A))
print ("hope without allocation")
%timeit odeint(f_opt, y0, t, args=(dy, P, d, B, G, A))
print ("numba")
%timeit odeint(f_numb, y0, t, args=(P, d, B, G, A))
print ("numba without allocation")
%timeit odeint(f_numb_opt, y0, t, args=(dy, P, d, B, G, A))

native python
952 µs ± 5.73 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
hope
187 µs ± 126 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
hope without allocation
170 µs ± 929 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
numba
320 µs ± 1.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
numba without allocation
289 µs ± 746 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

With a bigger value for N, this difference disappears on my laptop, showing no difference between native Python, Numba, and HOPE.

But there may be a nice optimization to be found in the HOPE JIT.

Honestly, I have not investigated further how the library works.
I just wanted to share this measurement. I have not tested it with Pythran yet :(
Shame on me. Will do it as soon as possible.

Hope it can help anyway.
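(For reference, a minimal sketch of how the same kernel could be exposed to Pythran, assuming a file named f_pythran.py compiled with the pythran CLI; not benchmarked here:)

#pythran export f_pythran(float[], float, float[], float, float, float, float, float)

def f_pythran(y, t, dy, P, d, B, G, A):
    # same right-hand side as f_opt above, writing into a preallocated dy
    dy[0] = P - B*y[0]*y[1] - d*y[0]
    dy[1] = B*y[0]*y[1] + G*y[2] - A*y[0]*y[1]
    dy[2] = d*y[0] + A*y[0]*y[1] - G*y[2]
    return dy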
