In a HLL like Julia or Mojo you use special types and annotations to nudge the c...

adgjlsfhk1 · on Sept 9, 2023

This just isn't true. Julia lets you write intrinsics (either LLVM intrinsics or native assembly code) just the same as C. For example, https://github.com/eschnett/SIMD.jl/blob/master/src/LLVM_int....

bjourne · on Sept 9, 2023

That is not "the same as C" and you certainly do not achieve the same performance as you do with C. Furthermore my point, which you missed, was that developers typically use different methods to vectorize performance-sensitive code in different languages (even Python has a SIMD wrapper but most people would use NumPy instead).

adgjlsfhk1 · on Sept 9, 2023

What's the difference? An LLVM (or assembly) intrinsic called from Julia and one called from C will have exactly the same performance. C isn't magic pixie dust that makes your CPU faster.

bjourne · on Sept 9, 2023

That SIMD.jl doesn't give you direct control over which SIMD instructions are emitted, and that SIMD code generated with that module is awful compared to what a C compiler would emit. The Mandelbrot benchmark is there. Prove me wrong by implementing it using SIMD.jl and achieving performance rivaling C. Bet you can't.

adgjlsfhk1 · on Sept 10, 2023

I wasn't talking about using SIMD.jl. I was talking about the implementation of the package (which is why I linked to a specific file in the package), which does directly (with some macros) generate SIMD intrinsics. As for the per-core performance difference you're seeing, it's only because your C code is using 32-bit floats compared to the 64-bit floats that Julia is using here.
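
To illustrate why element width matters for the comparison above: a fixed-width SIMD register holds twice as many 32-bit floats as 64-bit floats, so a kernel of the same shape processes twice the elements per instruction. This is a minimal sketch of that arithmetic; the helper `simd_lanes` and the 256-bit register width are assumptions chosen for illustration, not something taken from the benchmark code.

```c
#include <stddef.h>

/* Hypothetical helper: how many elements of a given size fit in one
   SIMD register. register_bits is an illustrative assumption
   (e.g. 256 for AVX2-class hardware). */
static size_t simd_lanes(size_t register_bits, size_t element_bytes) {
    return register_bits / (element_bytes * 8);
}
```

Calling `simd_lanes(256, sizeof(float))` and `simd_lanes(256, sizeof(double))` gives the lane counts for the two element types on platforms where `float` is 4 bytes and `double` is 8 bytes.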

jakobnissen · on Sept 10, 2023

He has a point. Currently there is no way in Julia of checking which CPU instructions are available. So in practice, it's impossible to write low-level assembly code in Julia.

IIUC, SIMD.jl only works because it only provides what is guaranteed by LLVM to work cross-platform, which is quite far from being able to use AVX2, for example.
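
For comparison, here is roughly what runtime CPU feature checking can look like on the C side. This is a sketch assuming GCC or Clang on x86, where the `__builtin_cpu_supports` builtin is available; on other compilers or architectures it falls back to reporting no support. The names `cpu_has_avx2` and `pick_kernel` are hypothetical and only show the dispatch pattern.

```c
#include <stdbool.h>

/* Returns true if the CPU this code is running on reports AVX2 support.
   Assumption: GCC/Clang on x86; anywhere else we conservatively say no. */
static bool cpu_has_avx2(void) {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();                      /* populate the feature data */
    return __builtin_cpu_supports("avx2") != 0;
#else
    return false;
#endif
}

/* Hypothetical dispatcher: choose a code path once, at run time,
   instead of baking the choice in at compile time. */
static const char *pick_kernel(void) {
    return cpu_has_avx2() ? "avx2" : "scalar";
}
```

Because the check happens when the program runs, the same binary can take the AVX2 path on one machine and the scalar path on another.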

DNF2 · on Sept 10, 2023

LoopVectorization exploits AVX-512 when available. How is that achieved?

jakobnissen · on Sept 16, 2023

IIRC it relies on HostCPUFeatures.jl which parses output from LLVM. However, this means it just crashes when used on a different CPU than it was compiled on (which can happen on compute clusters) and it crashes if the user sets JULIA_CPU_TARGET.