
Potential performance improvement? Something similar to Julia's @simd? #1493

paugier opened this issue Mar 18, 2020 · 8 comments

paugier (Contributor) commented Mar 18, 2020

Hi Serge,

I hope you are fine at home.

I investigated the bad result for Pythran mentioned in https://github.com/Thierry-Dumont/BenchmarksPythonJuliaAndCo/wiki/5-The-FeStiff-benchmark (see the resulting PR Thierry-Dumont/BenchmarksPythonJuliaAndCo#12).

I isolated two issues. The first one is the lack of a native struct (so a method call goes through Python getattr and 2 function calls instead of 1). Using __slots__ and PyPy removes most of this overhead!
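(For readers unfamiliar with the trick, here is a minimal illustration with hypothetical names, not the benchmark's actual class:)

# Hypothetical illustration: __slots__ replaces the per-instance __dict__
# with fixed attribute storage, which makes attribute access cheaper,
# especially on PyPy.
class Point2D:
    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y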

Then, even for the numerical kernel alone (the code is here: https://github.com/paugier/bench_integrate_callback/tree/master/struct_simd/only_simd), Pythran is slower than Julia on this example (Julia is ~2.4 times faster).

The implementation in Julia contains the same information as the one in Python, except for the macro @simd (https://docs.julialang.org/en/v1/base/base/#Base.SimdLoop.@simd), which is used to "annotate a for loop to allow the compiler to take extra liberties to allow loop re-ordering".

Even without @simd, Julia is faster than Pythran, but @simd seems to help Julia be faster still (here, Pythran runs in 0.30 µs):

Julia without simd 0.157 µs
Julia with simd    0.125 µs
ratio (without simd)/(with simd): 1.26
  1. There may be a (small) performance issue with this code and Pythran.
  2. For me, the macro @simd is somewhat similar to OpenMP comments, so I was wondering if we could also add such information in a comment (something like # simd for) just before the for loop. Could Pythran do something useful with that?
serge-sans-paille (Owner) commented:

The correct way to achieve that goal, the Pythran way, would be through OpenMP simd directives. I'll have a look!
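(For context, Pythran accepts OpenMP directives written as plain Python comments. A hypothetical sketch of what such an annotation could look like on the hot loop, assuming an omp simd directive were honored here, with made-up names:)

def inner_products(fgrads, out, dv):
    # hypothetical sketch -- the exact directive spelling is an assumption
    ii = 0
    for i in range(6):
        #omp simd
        for j in range(i + 1):
            out[ii] = dv * (fgrads[i] * fgrads[j]).sum()
            ii += 1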

serge-sans-paille (Owner) commented Mar 29, 2020

@paugier : using the appropriate vector form:

#pythran export compute(float[3], float[3], float[6,3,2], float[21])
import numpy as np

def compute(x, y, grads, out):
    det = -(x[1] - x[2]) * y[0] + (x[0] - x[2]) * y[1] - (x[0] - x[1]) * y[2]
    dv = 1.0 / (6.0 * det)
    ii = 0
    # view each of the 6 gradients as a flat vector of length 6
    fgrads = grads.reshape((6, 6))
    # fill the 21 entries of the packed lower-triangular matrix
    for i in range(6):
        for j in range(i + 1):
            out[ii] = dv * np.sum(fgrads[i] * fgrads[j])
            ii += 1

This gets vectorized with -DUSE_XSIMD just fine. Can you confirm?
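(For anyone reproducing this: assuming the file is named compute.py, a typical invocation would be pythran compute.py -DUSE_XSIMD, optionally adding -march=native.)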


serge-sans-paille (Owner) commented:

Note:

import numpy as np

def compute(x, y, grads, out):
    det = -(x[1] - x[2]) * y[0] + (x[0] - x[2]) * y[1] - (x[0] - x[1]) * y[2]
    dv = 1.0 / (6.0 * det)
    ii = 0
    # same packed lower-triangular fill, without reshaping grads first
    for i in range(6):
        for j in range(i + 1):
            out[ii] = dv * np.sum(grads[i] * grads[j])
            ii += 1

works just fine too.

serge-sans-paille (Owner) commented:

And I particularly like this form, even if it's arguably less intuitive:

#pythran export compute(float[3], float[3], float[6,3,2], float[21])
import numpy as np

def compute(x, y, grads, out):
    det = -(x[1] - x[2]) * y[0] + (x[0] - x[2]) * y[1] - (x[0] - x[1]) * y[2]
    dv = 1.0 / (6.0 * det)
    for i in range(6):
        for j in range(i + 1):
            # (i * (i + 1)) // 2 + j is the packed index of entry (i, j)
            # of the lower triangle, so no running counter is needed
            out[(i * (i + 1)) // 2 + j] = dv * np.sum(grads[i] * grads[j])
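(Quick sanity check, not from the original comment: the packed index visits 0..20 exactly once for 0 <= j <= i < 6.)

# verify the triangular packing covers 0..20 with no gaps or repeats
indices = [(i * (i + 1)) // 2 + j for i in range(6) for j in range(i + 1)]
assert indices == list(range(21))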

paugier (Contributor, author) commented Mar 29, 2020

Nice new versions! Unfortunately, I don't get any speedup compared to the original code. Worse, -DUSE_XSIMD doesn't accelerate these functions, and even makes your first version slower than without -DUSE_XSIMD!

Without -DUSE_XSIMD:

compute :  0.28 µs
compute1:  0.29 µs
compute2:  0.28 µs
compute3:  0.28 µs
Julia:   0.12 µs
ratio Pythran/Julia: 2.30

And with -DUSE_XSIMD:

compute :  0.28 µs
compute1:  0.38 µs
compute2:  0.28 µs
compute3:  0.28 µs
Julia:   0.12 µs
ratio Pythran/Julia: 2.32

serge-sans-paille (Owner) commented:

Are you using the fixed-size version in the pythran annotation?
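(For readers following along: the fixed-size annotation is the one quoted above; a generic annotation leaves the extents unspecified. A rough sketch of the two, where the generic spelling is my assumption of the usual Pythran syntax:)

# fixed-size export (as quoted in the earlier comment)
#pythran export compute(float[3], float[3], float[6,3,2], float[21])

# generic export, sizes left unspecified (assumed spelling)
#pythran export compute(float[], float[], float[:,:,:], float[])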

paugier (Contributor, author) commented Mar 30, 2020

The previous results were with the fixed-size version in the pythran annotation.

With the standard annotations, it gives:

Without -DUSE_XSIMD:

compute :  0.30 µs
compute1:  0.30 µs
compute2:  0.45 µs
Julia:   0.13 µs
ratio Pythran/Julia: 2.36

With -DUSE_XSIMD:

compute :  0.30 µs
compute1:  0.38 µs
compute2:  0.48 µs
Julia:   0.13 µs
ratio Pythran/Julia: 2.37

With -DUSE_XSIMD -march=native:

compute :  0.30 µs
compute1:  0.39 µs
compute2:  0.57 µs
Julia:   0.13 µs
ratio Pythran/Julia: 2.36

-DUSE_XSIMD and -march=native slow down the computation!

MordicusEtCubitus commented May 17, 2021

Hi,
This is mostly off-topic, but I've looked at the old just-in-time compiler used by the no-longer-maintained HOPE library.

It does its JIT compilation using only pure Python code, which I found quite amazing because the code is really simple.
Unfortunately, only a small subset of Python is supported, and the library is not able to compile the famous Mandelbrot example below:

import numpy as np
import hope
from timeit import default_timer as timer
from matplotlib.pyplot import imshow, show

@hope.jit
def mandel(x, y, max_iters):
    """
    Given the real and imaginary parts of a complex number,
    determine if it is a candidate for membership in the Mandelbrot
    set given a fixed number of iterations.
    """
    c = complex(x, y)
    z = 0.0j
    for i in range(max_iters):
        z = z*z + c
        if (z.real*z.real + z.imag*z.imag) >= 4:
            return i

    return max_iters

@hope.jit
def create_fractal(min_x, max_x, min_y, max_y, image, iters, w, h):
    height = h
    width = w

    pixel_size_x = (max_x - min_x) / width
    pixel_size_y = (max_y - min_y) / height

    for x in range(width):
        real = min_x + x * pixel_size_x
        for y in range(height):
            imag = min_y + y * pixel_size_y
            color = mandel(real, imag, iters)
            image[y, x] = color
    


image = np.zeros((1024, 1536), dtype = np.uint8)
start = timer()
create_fractal(-2.0, 1.0, -1.0, 1.0, image, 50, 1536, 1024)  # pass w and h matching image.shape
dt = timer() - start

print("Mandelbrot created in %f s" % dt)
imshow(image)
show()

But using its own examples, it appears to be nearly 2 times faster than Numba's default JIT:

import numpy as np
from scipy.integrate import odeint
import hope
import numba

P, d, B, G, A = 0, 0.0001, 0.0095, 0.0001, 0.0001
N = 10**3
def f_nat(y, t):
    dy = np.empty(3)
    dy[0] = P - B*y[0]*y[1] - d*y[0]
    dy[1] = B*y[0]*y[1] + G*y[2] - A*y[0]*y[1]
    dy[2] = d*y[0] + A*y[0]*y[1] - G*y[2]
    return dy

y0, t = np.array([500., 0., 5.]), np.linspace(0, 5., N)


@hope.jit
def f_hope(y, t, P, d, B, G, A):
    dy = np.empty(3)
    dy[0] = P - B*y[0]*y[1] - d*y[0]
    dy[1] = B*y[0]*y[1] + G*y[2] - A*y[0]*y[1]
    dy[2] = d*y[0] + A*y[0]*y[1] - G*y[2]
    return dy

@hope.jit
def f_opt(y, t, dy, P, d, B, G, A):
    dy[0] = P - B*y[0]*y[1] - d*y[0]
    dy[1] = B*y[0]*y[1] + G*y[2] - A*y[0]*y[1]
    dy[2] = d*y[0] + A*y[0]*y[1] - G*y[2]
    return dy

@numba.jit
def f_numb(y, t, P, d, B, G, A):
    dy = np.empty(3)
    dy[0] = P - B*y[0]*y[1] - d*y[0]
    dy[1] = B*y[0]*y[1] + G*y[2] - A*y[0]*y[1]
    dy[2] = d*y[0] + A*y[0]*y[1] - G*y[2]
    return dy

@numba.jit
def f_numb_opt(y, t, dy, P, d, B, G, A):
    dy[0] = P - B*y[0]*y[1] - d*y[0]
    dy[1] = B*y[0]*y[1] + G*y[2] - A*y[0]*y[1]
    dy[2] = d*y[0] + A*y[0]*y[1] - G*y[2]
    return dy

dy = np.empty(3)
print ("native python")
%timeit odeint(f_nat, y0, t)
print ("hope")
%timeit odeint(f_hope, y0, t, args=(P, d, B, G, A))
print ("hope without allocation")
%timeit odeint(f_opt, y0, t, args=(dy, P, d, B, G, A))
print ("numba")
%timeit odeint(f_numb, y0, t, args=(P, d, B, G, A))
print ("numba without allocation")
%timeit odeint(f_numb_opt, y0, t, args=(dy, P, d, B, G, A))

native python
952 µs ± 5.73 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
hope
187 µs ± 126 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
hope without allocation
170 µs ± 929 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
numba
320 µs ± 1.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
numba without allocation
289 µs ± 746 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

With a bigger value for N, this difference disappears on my laptop, showing no difference between native Python, Numba, and HOPE.

But there may be a nice optimization to be found in the HOPE JIT.

Honestly, I have not investigated further how the library works.
I just wanted to share this measurement. I have not tested it with Pythran yet :(
Shame on me. Will do it as soon as possible.

Hope it can help anyway.
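(For reference, a minimal sketch of how the same kernel could be exposed to Pythran, assuming a file named f_pythran.py compiled with the pythran CLI; not benchmarked here:)

#pythran export f_pythran(float[], float, float[], float, float, float, float, float)

def f_pythran(y, t, dy, P, d, B, G, A):
    # same right-hand side as f_opt above, writing into a preallocated dy
    dy[0] = P - B*y[0]*y[1] - d*y[0]
    dy[1] = B*y[0]*y[1] + G*y[2] - A*y[0]*y[1]
    dy[2] = d*y[0] + A*y[0]*y[1] - G*y[2]
    return dy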
