Apple is now leading Intel in IPC by a huge margin

>Apple is now leading Intel in IPC by a huge margin
Why is no one talking about this? Whether it matters or not in practice, it sure raises the question of why both Intel and AMD have such horrible cores.

Attached: ss.png (816x275, 40K)

Other urls found in this thread:

spec.org/cpu2017/Docs/overview.html#Q22
spec.org/cpu2017/Docs/benchmarks/625.x264_s.html
spec.org/cpu2006/Docs/464.h264ref.html
spec.org/cpu2006/results/res2018q1/cpu2006-20171224-51360.html
freepatentsonline.com/y2018/0239708.html
freepatentsonline.com/y2018/0239702.html
anandtech.com/show/12312
ieeexplore.ieee.org/document/6522302?reload=true
en.wikipedia.org/wiki/Out-of-order_execution
github.com/llvm-mirror/llvm/blob/master/lib/Target/AArch64/AArch64SchedExynosM3.td
github.com/llvm-mirror/llvm/blob/master/lib/Target/AArch64/AArch64SchedA53.td
github.com/llvm-mirror/llvm/blob/master/lib/Target/AArch64/AArch64SchedA57.td
github.com/llvm-mirror/llvm/blob/master/lib/Target/AArch64/AArch64SchedKryo.td

>Whether it matters or not in practice
It doesn't show up in practice, simple.

So how do you explain the SPEC results?

spec.org/cpu2017/Docs/overview.html#Q22

Q22. What will happen to SPEC CPU2006?
Three months after the announcement of CPU2017, SPEC will require all CPU2006 results submitted for publication on SPEC's web site to be accompanied by CPU2017 results. Six months after announcement, SPEC will stop accepting CPU2006 results for publication on its web site.

After that point, you may continue to use SPEC CPU2006. You may publish new CPU2006 results only if you plainly disclose the retirement (the link includes sample disclosure language).

Q23. Can I convert CPU2006 results to CPU2017?
There is no formula for converting CPU2006 results to CPU2017 results and vice versa; they are different products. There probably will be some correlation between CPU2006 and CPU2017 results (that is, machines with higher CPU2006 results often will have higher CPU2017 results), but the correlation will be far from perfect, because of differences in code, data sets, hardware stressed, metric calculations, and run rules.

SPEC encourages SPEC licensees to publish CPU2017 numbers on older platforms to provide a historical perspective on performance.

So what? Both the A12 and the Xeon were tested with the same benchmark. What does it matter that it's outdated?

spec.org/cpu2017/Docs/benchmarks/625.x264_s.html
>625.x264_s uses the Blender Open Movie Project's "Big Buck Bunny", Copyright 2008, Blender Foundation / www.bigbuckbunny.org. Each workload uses a portion of the movie.

To save space on the SPEC CPU media, the movie is first decoded to YUV format in a (non-timed) setup phase, using the decoder 'ldecod' from the H.264/AVC reference software implementation. (The H.264/AVC encoder was used in SPEC CPU2006 benchmark 464.h264ref.)

spec.org/cpu2006/Docs/464.h264ref.html
>For the reference workload, we are using two different files for input. Both are raw (uncompressed) video data in YUV-format.

Foreman (foreman_qcif.yuv): a standard sequence used in video compression, consisting of 120 frames with resolution 176x144 pixels.
SSS (sss.yuv): a sequence from a video game, consisting of 171 frames with resolution 512x320 pixels

"IPC" changes for workload, today compute use very heavy compute loads over SPEC2006.

>ARMshit
who fucking cares
call me when apple is beating desktop CPUs.

Well yes, that was part of the reason that SPEC2006 was chosen for the A12, since apparently iOS is very uncooperative when it comes to larger workloads.

But none of that changes the fact that, on these particular workloads, the A12 performed better.

But that's pretty much what they're doing. If the Xeon results are anything to go by, they're already beating such desktop processors as the i5 8400.

>a12 performed better in tests that utilize its specific instructions
Hmmmm let's check out those avx512 tests instead.

And the GPU performs better in several compute loads and DSP.

>a12 performed better in tests that utilize its specific instructions
It's not exactly like the H264 test is the only one that it's winning at.
>Hmmmm let's check out those avx512 tests instead.
You mean like the libquantum test?

Yeah, CPUs don't matter for anything anymore, amirite.

spec.org/cpu2006/results/res2018q1/cpu2006-20171224-51360.html
>8350
Anandtech turned the compiler's SIMD off for the Xeon.

>But that's pretty much what they're doing
They don't even beat mobile SoCs outside of these benchmarks.
There's either something wrong with the benchmarks, or it's just complete Apple-tier fraud.

Interesting, user. This might actually be the generation where ARM servers become relevant, if 7nm ARM chips really are this competitive.

>Whether it matters or not in practice
It does for servers and embedded. That is the only market in which ARM and x86 directly compete, however. On desktop systems with Windows there's simply such a huge pile of software that won't ever get ported that it doesn't matter.

We might see more and more ARM chips in laptops if this keeps up, though. I would gladly jump off the x86 meme seeing as my laptop runs Linux and doesn't need software that wouldn't run on ARM.

>it sure raises the question why both Intel and AMD have such horrible cores.
ARM has always been more energy efficient than x86 ever was. It's not really that surprising that a high-performance ARM chip manages to completely outperform an x86 one, and the reason they haven't before now was simply that x86 had many hundreds of billions more in R&D invested in them than ARM chips.

You are retarded. The Xeon 8176 has comparable IPC to any modern desktop processor.

>Hmmmm let's check out those avx512 tests instead.
AVX512 isn't necessarily that relevant, and the vast majority of the market where ARM and AMD EPYC are a threat doesn't need it. Those who DO need it are going to buy a Skylake-X chip, but they are a tiny minority.

Even if that is so, are you really saying there's nothing remarkable about wide SIMD being the only advantage desktop processors have over mobile ones?

>They don't even beat mobile SoCs outside of these benchmarks.
For actual CPU tests, that's not true.

Questions I have are:
Is IPC better at lower clocks due to being less bottlenecked by other system resources?
(AKA, if you reduced the Xeon's clocks to 2.5GHz, would IPC be better?)
How much is due to the manufacturing process and not the architecture?
Can the clock speed of the A12 ever be increased close to the 4GHz level?
Is performance per watt better as well?

>It's not really that surprising that a high-performance ARM chip manages to completely outperform an x86 one
I do think it is pretty significant, though. Seeing as how Intel has hardly managed to squeeze out any IPC improvements in a decade now, and AMD seems to be closing in on the same plateau, I had assumed that meant that most programs just don't have more ILP to exploit, but here Apple comes and hits almost 200% more work done per cycle. I mean, it's not like ARM is inherently more instruction parallel than x86 in any way.

>Is IPC better at lower clocks due to being less bottlenecked by other system resources?
To some extent, this is surely so, but not to the extent that the figures show. Especially seeing as how desktop processors have much beefier memory interfaces as well.
>How much is due to the manufacturing process and not the architecture?
Manufacturing process would only account for lower power consumption or higher clock frequencies, not IPC.
>Can the clock speed of the A12 ever be increased close to the 4GHz level?
Only Apple would know the answer to that.
>Is performance per watt better as well?
Haven't seen power metrics for SPEC2006 on x86, but pic related contains the energy usage for the A12 test.

Attached: SPECint_575px.png (671x1990, 89K)

What about SSE and other SIMD extensions that have been around for well over a decade? Those give huge speedups too.

That is a pretty significant advantage to be honest.

>Seeing as how Intel has hardly managed to squeeze out any IPC improvements in a decade now
To be honest, Intel has iterated on the same µarch since the C2D days. It is very possible that they've simply reached the limits of it, and need a proper redesign. Consider Bulldozer, Piledriver etc. which had cores based on old K10 ones, and those needed a complete revamp (enter: Zen).

>and AMD seems to be closing in on the same plateau
No evidence of that so far. AMD has a lot of low-hanging fruit to optimise the Zen architecture and the architecture itself is quite energy efficient. The most notable ones are stuff like memory latency, the Infinity Fabric's clocks and the weak FPU (which is a fair bit behind Intel's).

Pertaining to memory latency, AMD has recently filed two patents:

freepatentsonline.com/y2018/0239708.html
freepatentsonline.com/y2018/0239702.html

The first seems to be about NUMA and L3/L4 ("last-level") cache, while the second actually discusses cache coherence between several CCXes ("collections of processors"). In other words, it seems that AMD is working on reducing memory latency in NUMA systems and latency between CCXes. Zen+ has to stall the core for something like 80-100ns on a memory access, while Skylake-X and Coffee Lake only have to wait around 50-60ns. In other words, a *huge* part of Zen's single-core performance deficit against Intel's architectures comes down to memory latency, and if AMD could fix that then they'd be in good shape.
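
If you want a feel for where numbers like 80-100ns come from, memory latency is usually measured with a dependent pointer-chasing loop. Here's a minimal sketch in C; the array size, step count and shuffle are arbitrary choices for illustration, not Anandtech's or anyone else's actual methodology:

/* Minimal pointer-chasing sketch for measuring memory latency: every load
 * depends on the previous one, so out-of-order machinery can't hide it and
 * the time per step approximates the round trip to DRAM once the working
 * set is much larger than the caches. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64UL * 1024 * 1024 / sizeof(size_t))   /* ~64 MiB, well past L3 */

int main(void) {
    size_t *chain = malloc(N * sizeof(size_t));
    size_t *idx   = malloc(N * sizeof(size_t));
    if (!chain || !idx) return 1;

    /* Build one big random cycle so the hardware prefetcher can't help. */
    for (size_t i = 0; i < N; i++) idx[i] = i;
    srand(42);
    for (size_t i = N - 1; i > 0; i--) {           /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    for (size_t i = 0; i < N; i++) chain[idx[i]] = idx[(i + 1) % N];

    const size_t steps = 20 * 1000 * 1000;
    size_t p = 0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < steps; i++) p = chain[p];   /* serialised loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    printf("~%.1f ns per dependent load (p=%zu)\n", ns / steps, p);
    free(chain);
    free(idx);
    return 0;
}

The point is that every load depends on the previous one, which is exactly the situation where Zen's extra ~30-40ns per miss hurts single-core performance.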

The Infinity Fabric can be clocked higher, perhaps at 1:1 to RAM MT/s. It would also naturally clock higher when AMD jumps ship to DDR5. Both scenarios would provide fairly significant speedups knowing what we know about memory speed and Zen.

Can't wait for MBPs to drop Intel in 2020.

Attached: 1535613322748.png (1920x1080, 2.55M)

>AMD has a lot of low-hanging fruit to optimise the Zen architecture
Be that as it may, I'm not expecting any 50% increase in IPC to Zen 2 (and AMD has said that they don't expect that either), and even that wouldn't be enough to bring it to parity with the A12's IPC if the figures in OP are anything to go by.
>the Infinity Fabric's clocks
I don't have the source, but I believe AMD has stated that the IF clock being bound to the IMC is very integral to the whole design, and would require a major rework to change.
>weak FPU (which is a fair bit behind Intel's).
It's only behind in SIMD. In scalar FP, it is in fact completely demolishing Intel, since it has 4 FP pipes where Intel only has 2.

slide it

>Be that as it may, I'm not expecting any 50% increase in IPC to Zen 2 (and AMD has said that they don't expect that either), and even that wouldn't be enough to bring it to parity with the A12's IPC if the figures in OP are anything to go by.
Probably not, that would be unreasonable to expect. However, it is quite possible that Zen could get some reasonable gains to close the gap in 3-4 years time for example.

As I said earlier, ARM is just outright more energy efficient than x86 and has always been. AMD actually has their fingers in the ARM pie, having developed the K12 architecture (never released). We do know that AMD is still spending some R&D money on ARM: anandtech.com/show/12312

I think AMD could begin to release ARM processors if they find there's an opportunity. Certainly more easily than Intel, given that they already have some engineering experience with it whereas Intel has very little.

>I don't have the source, but I believe AMD has stated that the IF clock being bound to the IMC is very integral to the whole design
It is. "Double Data Rate" is the meaning of DDR, so 2400MT/s RAM has frequency of 1200MHz (despite the fact that it's advertised as "2400MHz"). The IMC is clocked at 1200MHz in the case of 2400MT/s RAM, and so the entire IF probably shares the same clock pulse to make communication with the IMC not awkward.

Not knowing the exact specifics... I would assume that it is possible for the IF to clock at exactly double the actual memory frequency, either together with the IMC itself or with a buffer between the IF and the IMC, purely to enable better inter-CCX communication. If the IMC clocks at double the actual memory frequency, the memory frequency itself would still have to be 1200MHz (in the case of 2400MT/s), and you'd have to place a buffer between the IMC and memory instead.

Either way, AMD needs to figure out a way to accelerate inter-CCX communication. Increasing IF clocks is one way to do that.
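
Just to make the clock arithmetic concrete, here's the relationship in a trivial C snippet. The 2:1 IF ratio is the hypothetical from the post above, not how current Zen actually behaves:

/* DDR transfers data on both clock edges, so the advertised MT/s rating is
 * twice the actual memory (and IMC) clock. The 2:1 IF figure is only the
 * hypothetical being discussed here. */
#include <stdio.h>

int main(void) {
    int mts     = 2400;         /* advertised "DDR4-2400" rating           */
    int mem_clk = mts / 2;      /* actual memory/IMC clock: 1200 MHz       */
    int if_1to1 = mem_clk;      /* today: IF runs 1:1 with the IMC         */
    int if_2to1 = mem_clk * 2;  /* hypothetical: IF at twice the IMC clock */

    printf("DDR4-%d: memory clock %d MHz, IF 1:1 = %d MHz, IF 2:1 = %d MHz\n",
           mts, mem_clk, if_1to1, if_2to1);
    return 0;
}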

>It's only behind in SIMD. In scalar FP, it is in fact completely demolishing Intel, since it has 4 FP pipes where Intel only has 2.
Also, thank you. I actually had no idea.

MACTODDLERS BTFO

>As I said earlier, ARM is just outright more energy efficient than x86 and has always been

>We find that ARM and x86 processors are simply engineering design points optimized for different levels of performance, and there is nothing fundamentally more energy efficient in one ISA class or the other. The ISA being RISC or CISC seems irrelevant.

ieeexplore.ieee.org/document/6522302?reload=true

>As I said earlier, ARM is just outright more energy efficient than x86 and has always been.
It is more energy efficient, yes, but there's nothing about that which would make it inherently better on an absolute single-threaded performance scale. Except possibly a few cycles shorter branch mispredict penalty thanks to simpler decoders (though it seems that most high-performance ARM implementations have similar mispredict penalties to x86 implementations anyway, for whatever reason), but that wouldn't be anywhere close to explaining this gap.
>having developed the K12 architecture (never released)
I know, I was really looking forward to seeing it, and was immensely disappointed when it was put on ice.

>We find that ARM and x86 processors are simply engineering design points optimized for different levels of performance, and there is nothing fundamentally more energy efficient in one ISA class or the other. The ISA being RISC or CISC seems irrelevant.
Not having read the paper, I can't exactly respond to that, but it doesn't seem plausible given that both Intel and AMD have stated that the main advantage of their µop caches is that they can turn off the decoders while executing code out of them, for efficiency gains. That would imply that the decoders require quite a lot of energy, which would indeed be a characteristic of x86.

Actually, reading the abstract and seeing that their test was of real-world energy usage tested with SNB processors, their results probably include the effects of the µop cache, in which case they're probably "correct", strictly speaking, but it seems hard to argue that the extra complexity doesn't come at a cost.

>no x265 with AVX-512 implementation
You might as well spam gookbench scores

Attached: 223_donate.jpg (574x460, 46K)

I'm trying to point out the greater microarchitecture, not specific features bolted on to it.

>I know, I was really looking forward to seeing it, and was immensely disappointed when it was put on ice.
I was too, that was really disappointing. AMD was nearing bankruptcy though, and it probably didn't make the R&D cut, and they found that Zen was good enough.

I haven't read the paper and was thinking I'd go to bed. It might also be behind a paywall.

However, having read the abstract at least... one has to ask why Intel's Atom processors never achieved the perf/watt of ARM processors despite Intel's massive investment in them.

The thing is, there's *technically* nothing wrong with making x86 as energy efficient as ARM, but it is MUCH harder to do exactly because the instruction set is a CISC one.

The fact that x86 is CISC has caused x86 to have to develop two instruction decoders (the x86 one and the µop one) simply to enable technologies such as pipelining, speculative execution etc., while on ARM and other RISC chips that is not necessary because you can pipeline those fairly easily.

>The fact that x86 is CISC
This gives CISC an unnecessarily bad name. There are tons of CISC ISAs that are quite alright. x86 stands out among any sort of ISA as being particularly horrible to decode.

Me again

The user above is correct. x86 introduces *a lot* of complexity just so it can effectively use a RISC-like internal µop execution, which enables x86 processors to pipeline, speculatively execute etc. Between x86 and ARM chips, the x86 one will always be more complex purely because it needs more silicon to do the same thing, and so making x86 as energy efficient as ARM requires the manufacturer to find very clever ways to shut down silicon in an x86 processor. The ARM processor on the other hand requires minimal work in that way, because the ISA itself is just less complex and thus more efficient, and lends itself easily to pipelining.

>The fact that x86 is CISC has caused x86 to have to develop two instruction decoders
There's only one instruction decoder in x86 chips. The very point with µops is that they don't need any explicit decoding.

I completely agree with you, yeah.

I didn't mean a decoder in the case of µops. It requires a regular instruction interpreter, like you'd find in any chip, that turns a µop into electrical signals.

>a RISC-like internal µop execution, which enables x86 processors to pipeline, speculatively execute etc
That's not exactly accurate. The vast majority of commonly used x86 instructions actually decode directly, 1-to-1, into internal µops, so they can in fact be pipelined and speculatively executed as-is. The problem isn't so much with pipelining them as it is with simply decoding them to begin with.

The whole point of consumer x86 is avx-512 now, which speeds up FP computations tremendously, and aplel can't touch that.

Also apple's high performance cores are an abomination due to severe overheating and thermal throttling issues. Performance can't be sustained for longer than a couple of minutes.

Hell, can Apple even match AVX 128/256 from Zen?

Attached: 1538686613348.png (500x650, 195K)

>The whole point of consumer x86 is avx-512
You're saying this ironically, surely? Among real-world non-HPC workloads, even packed SSE is fairly rare. For the most part, it's not like the SPEC subtests represent weird, exotic workloads that you don't see in practice.
>Also apple's high performance cores are an abomination due to severe overheating and thermal throttling issues. Performance can't be sustained for longer than a couple of minutes.
Not denying that at all, but that's not a problem with the cores (compared to desktop x86 implementations, they still draw very much less power), but with the system around them.

>The whole point of consumer x86 is avx-512 now
What the. If you truly believe that, please explain how Ryzen is hardly losing to Intel even though it doesn't even have AVX-256, let alone 512.

>Why
x86 is CISC and each instruction may very well represent multiple instructions in another architecture, especially a RISC one. IPC isn't comparable between architectures.
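
To make that concrete, here's a single C statement with typical -O2 output for each ISA shown in the comments. The exact registers and instruction selection vary by compiler and flags; this is illustrative, not authoritative:

/* One C statement, different instruction counts per ISA.
 *
 *   x86-64 (SysV): one instruction, cracked into load+add+store µops
 *   internally:
 *       add DWORD PTR [rdi], esi
 *
 *   AArch64 (AAPCS64): three instructions:
 *       ldr w8, [x0]
 *       add w8, w8, w1
 *       str w8, [x0]
 */
void bump(int *p, int x) { *p += x; }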

>draw less power
after it throttles to 1 GHz sure

It has AVX-256 by joining 2 128-bit modules together to perform 4 ops per cycle. In real world applications this matters greatly.

Neither does the SPEC benchmark compare actual instructions. Rather, what it indicates is that the A12 gets more work done per clock cycle.

>after it throttles to 1 GHz sure
Look at the power figures posted above. The test was run in an actively cooled environment so that the CPUs wouldn't throttle at all, and the power measurements are somewhere around 3-5 W. And that's for the whole SoC. A high-performance x86 core *alone* doesn't use that little.

>It has AVX-256 by joining 2 128-bit modules together to perform 4 ops per cycle.
No, it has AVX-256 by splitting AVX-256 instructions into two separate 128-bit ops internally, so except for the decode bandwidth (which is never a bottleneck) it's effectively exactly the same as using two 128-bit instructions.
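
At the source level, the equivalence looks like this. It's just an illustration of "one 256-bit op is two 128-bit ops" using standard intrinsics, not a claim about how the hardware schedules them; compile with AVX enabled (e.g. -mavx):

/* The two halves below compute the same result. On Zen 1 the single
 * 256-bit add is cracked into two 128-bit µops internally, so the work
 * done ends up effectively the same either way. */
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float r256[8], r128[8];

    /* One 256-bit add. */
    _mm256_storeu_ps(r256, _mm256_add_ps(_mm256_loadu_ps(a),
                                         _mm256_loadu_ps(b)));

    /* The same work as two explicit 128-bit adds. */
    _mm_storeu_ps(r128,     _mm_add_ps(_mm_loadu_ps(a),     _mm_loadu_ps(b)));
    _mm_storeu_ps(r128 + 4, _mm_add_ps(_mm_loadu_ps(a + 4), _mm_loadu_ps(b + 4)));

    for (int i = 0; i < 8; i++)
        printf("%g %g\n", r256[i], r128[i]);   /* identical output */
    return 0;
}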

SAMSHIT BTFO
QUALSHIT BTFO
ANDSHIT BTFO
INTLEL BTFO

>Anandtech use mtune=cortex-A53 for ARM

...

POO IN LOO

Attached: 1524974618019.jpg (643x960, 256K)

So? Cortex-A53 supports NEON.

>SEETHING X86FAG

The only question it raises is how much Apple pays street shitters to push their faked benchmarks.
Applel still gets BTFO by Qualcomm.

Attached: 1516405197456.webm (640x360, 1.97M)

def CortexA53Model : SchedMachineModel {
  let MicroOpBufferSize = 0;  // Explicitly set to zero since A53 is in-order.
  let IssueWidth = 2;         // 2 micro-ops are dispatched per cycle.
  let LoadLatency = 3;        // Optimistic load latency assuming bypass.
                              // This is overriden by OperandCycles if the
                              // Itineraries are queried instead.
  let MispredictPenalty = 9;  // Based on "Cortex-A53 Software Optimisation
                              // Specification - Instruction Timings"
                              // v 1.0 Spreadsheet
  let CompleteModel = 1;

  list<Predicate> UnsupportedFeatures = [HasSVE];
}

>Applel still gets BTFO by Qualcomm.
Only in high-level tests, which would imply that it's Apple's software that sucks. I haven't seen any pure CPU test in which Apple doesn't win.

Please stop being obtuse, what is it that you're trying to draw attention to? Even if SVE is unsupported, it still has NEON, which would put it on par with SSE.

>Only in high-level tests
Yes, Apple always loses in the ONLY metric that ever matters.

No one is denying that. That's not what this thread is about.

They compile for an in-order, 2-wide core with no micro-op buffer, while the Samsung and Qualcomm cores are out-of-order and 6- and 4-wide.

So. What.

>470.lbm is an interesting workload for the Apple CPUs as they showcase multi-factor performance advantages over competing Arm and Samsung cores. Qualcomm’s Snapdragon 820 Kryo CPU oddly enough still outperforms the recent Android SoCs. 470.lbm is characterised by extremely large loops in the hottest piece of code. Microarchitectures can optimise such workloads by having (larger) instruction loop buffers, where on a loop iteration the core would bypass the decode stages and fetch the instructions from the buffer. It seems that Apple’s microarchitecture has some kind of such a mechanism. The other explanation is also the vector execution performance of the Apple cores – lbm’s hot loop makes heavy use of SIMD, and Apple’s 3x execution throughput advantage is also likely a heavy contributor to the performance.
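
For a rough idea of what a "large, SIMD-heavy hot loop" means there, here's a toy stencil kernel in C. It is not lbm's actual code, just the same shape of workload: one big hot loop of independent FP work that a compiler will vectorise and that a loop buffer can replay without re-decoding every iteration:

#include <stdio.h>

/* Toy 1-D relaxation stencil, standing in for the kind of hot loop the
 * quote describes. With -O2/-O3 a compiler will vectorise this. */
static void relax(const double *restrict src, double *restrict dst,
                  int n, double c0, double c1) {
    for (int i = 1; i < n - 1; i++)
        dst[i] = c0 * src[i] + c1 * (src[i - 1] + src[i + 1]);
}

int main(void) {
    double a[1024], b[1024] = {0};
    for (int i = 0; i < 1024; i++) a[i] = i * 0.001;
    relax(a, b, 1024, 0.5, 0.25);
    printf("%f\n", b[512]);   /* keep the result live */
    return 0;
}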

Anandtech's compiler settings assume no out-of-order execution and a narrow issue width, which handicaps the wide Samsung/Qualcomm cores. Compare the Exynos M3 model:
def ExynosM3Model : SchedMachineModel {
  let IssueWidth = 6;             // Up to 6 uops per cycle.
  let MicroOpBufferSize = 228;    // ROB size.
  let LoopMicroOpBufferSize = 40; // Based on the instruction queue size.
  let LoadLatency = 4;            // Optimistic load cases.
  let MispredictPenalty = 16;     // Minimum branch misprediction penalty.
  let CompleteModel = 1;          // Use the default model otherwise.

  list<Predicate> UnsupportedFeatures = [HasSVE];

  // FIXME: Remove when all errors have been fixed.
  let FullInstRWOverlapCheck = 0;
}

Again, stop being obtuse. What is it that you're trying to say by stating these well-known facts?

>work
My point was that instructions are not equivalent to work. IPC isn't comparable between ISAs.

Of course, but who would oppose such an obvious truth?

en.wikipedia.org/wiki/Out-of-order_execution

Anandtech used these compiler flags for Android:

>Android: Toolchain: NDK r16 LLVM compiler, Flags: -Ofast, -mcpu=cortex-A53
The A53 is a low-power, in-order CPU.
This is the scheduling model for the A57, an out-of-order CPU:
def CortexA57Model : SchedMachineModel {
  let IssueWidth = 3;            // 3-way decode and dispatch
  let MicroOpBufferSize = 128;   // 128 micro-op re-order buffer
  let LoadLatency = 4;           // Optimistic load latency
  let MispredictPenalty = 14;    // Fetch + Decode/Rename/Dispatch + Branch

  // Enable partial & runtime unrolling. The magic number is chosen based on
  // experiments and benchmarking data.
  let LoopMicroOpBufferSize = 16;
  let CompleteModel = 1;

  list<Predicate> UnsupportedFeatures = [HasSVE];
}

>Autor anandtech use this compiler flags for android
Yes, we have been over this for like 10 posts now, and everyone is aware of this. What is your point? Also, why are you writing English like a retarded phoneposter?

Look at the IssueWidth and MicroOpBufferSize.
github.com/llvm-mirror/llvm/blob/master/lib/Target/AArch64/AArch64SchedExynosM3.td

github.com/llvm-mirror/llvm/blob/master/lib/Target/AArch64/AArch64SchedA53.td

github.com/llvm-mirror/llvm/blob/master/lib/Target/AArch64/AArch64SchedA57.td

github.com/llvm-mirror/llvm/blob/master/lib/Target/AArch64/AArch64SchedKryo.td

What is your point?

bump

>bumps 4 minutes after last post

Attached: wat-13.png (706x412, 278K)

Of course it's not. This thread is about some street shitter shilling his employer.

Attached: 1520343736427.png (1024x1024, 453K)

Please kill yourself. You are the cancer causing the inability of having actual interesting technical discussion on Jow Forums.

>I think AMD could begin to release ARM processors if they find there's an opportunity
I still fantasize about them releasing an ARM/x86 hybrid, with a design kinda like the threadripper 2990wx but the two extra dies are packed with like 50 little arm cores you could abuse for background processing and shit

Perhaps because without adequate cooling, those chips will never outmatch Intel's. It's also a closed environment.