Code a GPU program

>code a GPU program
>it's slower than the corresponding CPU program

Attached: Nvidia_CUDA_Logo.jpg (300x182, 11K)

>using logic with complicated branching on something that will run in a GPU
shiggy diggy

>Want control over individual pixels for say 256 color modes or stuff like palette techniques
>have to use (((shaders)))

>cpu
>5 million tripcodes per second

>gpu
>25 million tripcodes per second

Why?

transferring data to and from the gpu is slow. the only good use case for gpus is sending the data once and doing a lot of computations on it while it's in the gpu's memory.
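a minimal sketch of the pattern (made-up kernel and sizes, nothing from a real codebase): pay the PCIe transfer once, keep the data resident, do all the work there.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Trivial stand-in for "a lot of computations".
__global__ void scaleKernel(float* data, size_t n, float factor)
{
    size_t i = blockIdx.x * (size_t) blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main()
{
    const size_t n = 1 << 24;    // 16M floats, 64 MiB
    float* host = (float*) malloc(n * sizeof(float));
    for (size_t i = 0; i < n; ++i) host[i] = 1.0f;

    float* device;
    cudaMalloc(&device, n * sizeof(float));

    const int threads = 256;
    const int blocks = (int) ((n + threads - 1) / threads);

    // Pay the host-to-device transfer cost ONCE...
    cudaMemcpy(device, host, n * sizeof(float), cudaMemcpyHostToDevice);

    // ...then run as many kernels as you like on the resident data.
    for (int pass = 0; pass < 1000; ++pass)
        scaleKernel<<<blocks, threads>>>(device, n, 1.000001f);

    // One transfer back at the end, not one per pass.
    cudaMemcpy(host, device, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("%f\n", host[0]);
    cudaFree(device);
    free(host);
    return 0;
}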

It's not that simple, you mong. You have to code it in a way that instructions don't run in series. Instead you have to make the majority of it run in parallel, which is batshit insanely hard.

Even making programs run efficiently on more than 2-4 cores is already a daunting task (see gaymen), good luck getting shit to run on 2,000+ cores efficiently.

Attached: cpu-vs-gpu-presentation-14-638.jpg (638x479, 103K)

>make program that runs on 6000 processors with a super low clock speed
>slower than a program that runs on 1 processor clocked at least 10000 times higher
WHEW LAD TELL ME MORE

absolute state, etc
also kys tripfag

Memory is slower and each thread is slower. A GPU is only better when your problem is extremely parallel in nature.

These

It should also be noted that, at the end of the day, despite having an assload of cores, consumer GPUs are not that much more powerful than consumer CPUs in terms of FP64 math. They have orders of magnitude more FP32 compute performance, but that makes it even harder to write efficient software for them (see rounding errors). AMD GPUs generally have more FP64, but coding for CUDA is easier. Once you get to modern 8-16 core AMD/Intel stuff, GPU acceleration becomes less attractive to devs, especially given Intel's 512-bit AVX i9s.

General AVX FP64 compute of a 4-core Haswell processor: 112 GFLOPS

Theoretical FP64 compute of an RTX 2080 Ti: 420 GFLOPS

Theoretical FP64 compute of a Quadro GV100: 8,330 GFLOPS
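back-of-the-envelope, those theoreticals are just cores × clock × 2 FLOPs per FMA, divided by the FP64 rate (boost clocks approximate):

2080 Ti: 4352 cores × ~1.545 GHz × 2 / 32 ≈ 420 GFLOPS (consumer Turing runs FP64 at 1/32 rate)
GV100: 5120 cores × ~1.63 GHz × 2 / 2 ≈ 8,330 GFLOPS (big Volta runs FP64 at 1/2 rate)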

That's why those fuckers go for ~$10,000

Attached: CPU-Floating-Point-Test-AMD-vs-Intel.png (568x499, 10K)

>CUDA bad you don't want it goyims.

>good luck getting shit to run on 2,000+ cores efficiently.
So how do game programmers program their video game graphics to be processed by those thousands of gpu cores?

highly specialized parallel operations
you process thousands of triangles or thousands of pixels at the same time
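the compute equivalent of a pixel shader is literally one thread per pixel. a toy example (made-up gradient fill, not any engine's actual code):

// One thread per pixel; every pixel is computed independently,
// which is exactly the shape of work thousands of cores eat up.
__global__ void gradientFill(uchar4* pixels, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height)
        return;

    unsigned char r = (unsigned char) (255 * x / width);
    unsigned char g = (unsigned char) (255 * y / height);
    pixels[y * width + x] = make_uchar4(r, g, 128, 255);
}

// launch: dim3 block(16, 16);
//         dim3 grid((width + 15) / 16, (height + 15) / 16);
//         gradientFill<<<grid, block>>>(pixels, width, height);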

>It's another /v/ thinks every new tech should be aimed at bideo geams episode

Attached: k.jpg (500x500, 56K)

>doesn't know about costs of data transfer to memory
>doesn't know about the concepts of SIMD
>doesn't know how to parallelize his algorithms
>doesn't know how branching works on a GPU

try to understand what's going on underneath
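for the branching one, a contrived sketch (made-up kernels): threads in a warp execute in lockstep, so a data-dependent branch makes the warp pay for both sides.

// Divergent: even/odd threads split every warp, so each warp
// executes BOTH branches with half its threads masked off.
__global__ void divergent(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;
    if (i % 2 == 0)
        data[i] = sinf(data[i]);
    else
        data[i] = cosf(data[i]);
}

// Uniform: threads 0-31 all take one path, 32-63 the other.
// Same work, but no warp ever diverges.
__global__ void uniform(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;
    if ((i / 32) % 2 == 0)
        data[i] = sinf(data[i]);
    else
        data[i] = cosf(data[i]);
}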

Basically this. In gaymes, devs throw enough triangles at the problem and you get good-looking graphics. Though this also leads to shitty game devs who think 100K-poly guns are somehow better than carefully crafted 1K-poly guns with high-res textures and better lighting.

Attached: 29l1jic.jpg (1352x668, 198K)

>batshit insanely hard
>even 4+ core programming is hard
Imagine being this much of a brainlet holy shit.

this but unironically. got a quick rundown on GPU programming this past semester and these were the things that were covered, since not knowing them will almost guarantee you write gpu code that is slower than its cpu equivalent

parallel programming is actually pretty hard to get 100% right. if you think you've never written a race condition, you're dangerous.
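case in point, the classic one (toy kernels, counter is a single int in global memory):

// Racy: every thread does a non-atomic read-modify-write on the
// same counter, so increments interleave and most get lost.
__global__ void racyCount(int* counter, const int* values, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && values[i] > 0)
        (*counter)++;          // load, add, store: three separate steps
}

// Fixed: the read-modify-write is done atomically in hardware.
__global__ void atomicCount(int* counter, const int* values, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && values[i] > 0)
        atomicAdd(counter, 1);
}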

The high-poly model is used to bake normal maps and other sorts of trickery onto the low-poly model, you noob.

Correctly written parallel programming does not have race conditions. You're thinking of asynchronous/concurrent programming.

>Correctly written parallel programming does not have race conditions
His point is that people rarely write correct code on their first attempt, unless it's something extremely trivial.

If even the average programmer could easily write code that ran on 2,000+ CUDA cores, then NVIDIA would have already made a CPU and both Intel and AMD would be six feet under.

Correctly written parallel code does not even present the opportunity for race conditions. You're thinking of asynchronous/concurrent programming.

The average programmer can; the average programmer (and program) simply has no need for such extreme parallelism.

>game programmers program their video game graphics

They don't

They use libraries and drivers that are written by people with phds.

>They use libraries and drivers that are written by people with phds.
lmao

>Correctly written parallel code does not even present the opportunity for race conditions. You're thinking of asynchronous/concurrent programming.
You're being a fucking autistic piece of cunt. Why the fuck do you think the __syncthreads() intrinsic exists in CUDA? Why the fuck do you think __threadfence() exists in CUDA? Even though your CODE is perfectly parallel, execution of it MIGHT STILL NOT BE.

Now fuck off, you're clearly a horrible CUDA programmer.
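for anyone following along, the textbook case for __syncthreads() is a shared-memory reduction (generic example, assuming 256 threads per block):

// Block-level sum. Without the barriers, a thread could read a
// partial sum its neighbour hasn't written yet: the parallel
// STRUCTURE is correct, the EXECUTION is not.
__global__ void blockSum(const float* in, float* out, int n)
{
    __shared__ float tile[256];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();    // the whole tile must be populated first

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2)
    {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();    // finish this round before the next
    }

    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];
}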

Critical sections are not race conditions you mongoloid.

the execution order of perfectly parallel code doesn't matter... because it's perfectly parallel

Attached: brainlet-autism.jpg (480x394, 26K)

You think the silly cone gives a flying fuck? It's going to do its job exactly as specified by the electrons flipping its transistors. No less, no more.

Attached: 73d0518a7de54abf86df3ba82e3b0760.png (1422x1600, 1.77M)

what?

>i was only pretending.png

>execution order doesn't matter

Attached: brainlet1.jpg (1462x2046, 121K)

If you need 100 units processed in parallel, it doesn't matter what order they're processed in; all you need to know is whether they're all finished or not
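which on the host side is a single sync point (trivial sketch, made-up kernel):

#include <cuda_runtime.h>

__global__ void process(float* units, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        units[i] *= 2.0f;    // each unit is independent
}

int main()
{
    float* units;
    cudaMalloc(&units, 100 * sizeof(float));
    cudaMemset(units, 0, 100 * sizeof(float));

    // The 100 threads run in whatever order the hardware likes;
    // all the host cares about is the one "everything is done" point.
    process<<<1, 128>>>(units, 100);
    cudaDeviceSynchronize();

    cudaFree(units);
    return 0;
}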

>what is data dependency
>what are memory fences

if your units depend on each other, they aren't parallel
do you go to school and learn terms for these things and think you're smart by throwing them around despite the fact you don't know what you're talking about?

>Even though your CODE is perfectly parallel
>__syncthreads()
>SYNCthreads
>SYNCHRONIZE THREADS
>PARALLEL
I'm calling Rob Pike as soon as it's appropriate tomorrow so he can beat the shit out of you.

>if your units depend on each other, they aren't parallel
I didn't say that they depend on each other, I said that the DATA may have dependencies. You shouldn't just ignore stuff like cache locality simply because HURR DURR IT'S PARALLEL SO ORDER DOESN'T MATTER, that's freaking retarded.

You obviously want to work WITH the system, having warps that work on data already prefetched into the cache scheduled for execution next, etc. That's why you have memory fences and synchronisation primitives in CUDA to begin with.

>do you go to school and learn terms for these things
You're the one throwing around terms like they mean anything in the practical world.

Unless you write your own perfect drivers and perfect assembly code for a GPU you're gonna have to trust nvidia's drivers and CUDA libraries to be 100% efficient and perfect (they're not). Even then there could be a few lines of your code that fuck everything up or severely limit efficiency while working with 2K+ cores.

>Rob Pike
That pink sweater NPC homo doesn't know anything about computing to begin with.

>hurr durr dynamic linking bad
>herp derp import directly from github without versioning good
>hurrrrrrrr return values is good error handling
>durrrrrrrr pointers good, pointer arithmetics bad

>HURR DURR IT'S PARALLEL SO ORDER DOESN'T MATTER
that's literally what parallel means

Why don't you go wank over your perfectly parallel theoretical Turing machines in your CS101 textbook then, while the rest of us deal with real-life scenarios.

>theoretical Turing machines
that's a funny word for GPU but ok

You're 0/2. Basically proving to people that you're bad at programming and have little comprehension of computers in general.

GPUs don't have infinite cache sizes, so by your own definition they're not perfectly parallel.

>moving the goal post

How the fuck is that moving the goal post? I've been saying this all along, see

>okay if it's perfectly parallel then fine but it's not possible for someone to craft such a thing in the real world
It may or may not be moving the goal post but it's certainly some kind of bullshit that changes the target.

Vulkan compute here, laughing irl

I remember picking up a Vulkan book but being put off because the first triangle was 300+ pages in.
Someday I will man up and learn it.

First of all, I don't agree with the autistic definition of parallel. I would definitely say that GPUs are parallel. But by your own definition, they're not "perfectly parallel". Whatever, fine.

Original user said that it's generally very hard to write parallel code correctly (obviously using the "less than perfect" definition of parallel), then you (or some other user) came in with autism cannons blazing and started ranting about how perfectly parallel code can never be suboptimal or whatever, because execution order doesn't matter in a perfect world.

Now, in the context of GPU programming, talking about ideal theoretical conditions is, in my book, moving the fucking goal post. In the context of OP, you need to actually consider hardware limitations and memory architecture in order to achieve optimal performance. Brushing it off saying "but hurr durr perfect parallelism" isn't contributing anything to the conversation.

It's good shit user
GPUs don't work statefully like OpenGL thinks they do anymore

They aren't called shaders, they're GPU programs now.

Attached: download-1.jpg (304x166, 11K)

I'm honestly excited for AMD's GCC 9 extensions that will allow for GPU programming using pragmas.
Probably will be a dead end like fixed function pipelines, but hope rides alone.
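no idea what the final gcc interface will look like, but presumably it plugs into the existing OpenMP target offloading model, i.e. something like:

#include <stdio.h>

int main(void)
{
    enum { N = 1 << 20 };
    static float a[N], b[N];

    for (int i = 0; i < N; ++i)
        b[i] = (float) i;

    // The pragma ships the loop to the device and handles the data
    // movement; no CUDA, no shaders. Whether the GCN backend makes
    // it fast is another question entirely.
    #pragma omp target teams distribute parallel for map(to: b) map(from: a)
    for (int i = 0; i < N; ++i)
        a[i] = 2.0f * b[i];

    printf("%f\n", a[42]);
    return 0;
}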

You literally don't know what you're on about, do you?

most parallel code is transforming array A to array B and doesn't involve much random memory access, so trying to control the exact order of execution for maximum cache efficiency isn't an issue
I wasn't even aware that you could
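for reference, the canonical shape of that (made-up kernel):

// Map array A to array B: thread i owns element i.
// No ordering, no sharing, nothing to synchronize.
__global__ void transform(const float* a, float* b, float k, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        b[i] = k * a[i] + 1.0f;
}

// launch: transform<<<(n + 255) / 256, 256>>>(a, b, 2.0f, n);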

See you fucking faggot.
Execution order of thread warps obviously matters (for optimal performance), even if your code is CORRECT.

Most people make the mistake of doing multiple synchronous kernel launches in a tight loop too, which isn't good.
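roughly this, I mean (made-up kernel, same idea):

__global__ void simulate(float* state, int n, int step)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        state[i] += step;    // stand-in for real per-step work
}

void runSteps(float* state, int n, int steps)
{
    dim3 block(256);
    dim3 grid((n + 255) / 256);

    // Bad: cudaDeviceSynchronize() inside the loop forces a full
    // host round trip after every single launch.
    // Better: launches on one stream already execute in order, so
    // queue them all and synchronize once at the end.
    for (int step = 0; step < steps; ++step)
        simulate<<<grid, block>>>(state, n, step);
    cudaDeviceSynchronize();
}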

But anyway, transforming array A to array B depends entirely on how large these arrays are, whether or not they fit in shared or global memory, whether you have other allocations currently in use, etc.

Alignment on reads and writes also matters a lot, misaligned accesses can slow down your code by a factor of 20 or more in worst case scenarios.

Both of these examples are "correct" parallel code, but the second one is considerably slower than the first.

__device__ static
void moveBytesFast(const void* src, size_t srcOffset, void* dst, size_t dstOffset, size_t size)
{
    const uint16_t numThreads = blockDim.x;
    const uint16_t threadNum = threadIdx.x;

    // Reinterpret as ulong4 so every access moves 32 bytes.
    const ulong4* source = (const ulong4*) (((const unsigned char*) src) + srcOffset);
    ulong4* destination = (ulong4*) (((unsigned char*) dst) + dstOffset);

    // Interleaved access: on each iteration, adjacent threads touch
    // adjacent elements, so a warp's loads and stores coalesce into
    // a few wide memory transactions.
    for (size_t i = 0, n = size / sizeof(ulong4); i < n; i += numThreads)
    {
        destination[i + threadNum] = source[i + threadNum];
    }
}


__device__ static
void moveBytesSlow(const void* src, size_t srcOffset, void* dst, size_t dstOffset, size_t size)
{
    const uint16_t numThreads = blockDim.x;
    const uint16_t threadNum = threadIdx.x;

    const ulong4* source = (const ulong4*) (((const unsigned char*) src) + srcOffset);
    ulong4* destination = (ulong4*) (((unsigned char*) dst) + dstOffset);

    // Chunked access: each thread copies its own contiguous slice,
    // so adjacent threads in a warp hit addresses a whole chunk
    // apart and their accesses can't be coalesced.
    const size_t chunk = size / sizeof(ulong4) / numThreads;
    for (size_t i = chunk * threadNum, n = chunk * (threadNum + 1); i < n; ++i)
    {
        destination[i] = source[i];
    }
}

Forgot to mention, both assume size is a multiple of sizeof(ulong4) * numThreads; it's a sketch, so there's no tail handling.

I don't think his point is so much about scheduling as it is about data locality. See the examples above: the reason the first one is so much faster is that it stripes/coalesces memory accesses in a way that allows data to be fetched efficiently for all threads simultaneously, whereas the second one leads to thrashing because of its memory access pattern.

See the examples and my explanation above. user clearly has a point. The order of things may be irrelevant in regards to correctness, but in the case of performance it definitely matters.

data locality isn't specific to parallel programming though, it matters for all code

Of course, that's very true. My point is merely that correct code is not synonymous with optimal code. People tend to think of getting their code "right" purely in terms of being bug-free, but since the motivation for using a compute accelerator or GPU is performance in the first place, I'd argue that getting it "right" also means writing performant code.