
Discussion: ARM Cortex/Neoverse IP + SoCs (no custom cores)

Wait. For things like thermometers?? Why not throw in a hundred Neoverse-V3's and call it a day?
For a thermometer or other simple sensor equipment, anything beyond a microcontroller is surplus to need.

Even an A5x/A5xx/Cx Nano core would be overkill.

The point is to draw as little power as possible, leaving the rest for the sensor itself to do its job.
 
I know. That was my exact point. Hence the sarcastic mention of a hundred Neoverse V3's.
 
It's the new rage. Collecting a bushel of dollar-store digital pregnancy tests, flashing them with new firmware to enable their low-power wireless modules, and configuring them as a Beowulf cluster to software-render Crysis on the head node.

/S
 
That sure beats playing Skyrim on a refrigerator.
 
[Attached images: g1.png through g5.png]
12-Channel DDR5-8800
 

GFHK noted in their Q&A following their February report that ARM-based CPUs have relatively weak momentum in AI servers, attributing this to lower GPU scheduling efficiency compared to x86.

They indicated that companies including NVIDIA plan to develop x86 CPU solutions in response. (It remains unclear whether this means licensing x86 architecture or co-developing x86 with Intel.)
- Jukan (X.com)

Is this true? Why would the GPU scheduling efficiency necessarily be worse?
 
It wouldn't necessarily be worse. It's likely customers looking at Grace and comparing it to... anything else.
If it's true for other designs too, it's most likely an artifact of x86 having used GPGPUs for decades: whatever warts arose have already been treated.
 

If Nvidia has over the years evolved their GPUs to exploit x86's advantages and work around its warts, that might make it a little harder for non-x86 CPUs to slot into the role. Maybe it's the way x86 does interrupts; maybe its SIMD instructions are ideal for managing and mangling the flow of data.

I would be interested in learning more about this, out of idle curiosity. If this is a real issue, it sure seems like something Nvidia should be able to fix, given that they control their GPU architecture. And if ARM's Neoverse designs aren't up to snuff, designing their own ARM cores (with custom instructions, a different interrupt methodology, etc. if necessary) seems like something a company with a $4.5 trillion market cap could afford.
 
My intuition points at ARM's weak memory model (and, to some extent, its interrupt handling), which is both a blessing and a curse. The reason Apple can have its super-large L2 is that the cores aren't always fighting over memory: if they need to synchronize, they have to explicitly invoke a barrier or similar memory flush. That lets the fast, round-robin L2 run freely as long as no conflicts arise. This would never work with x86.

On the other side, x86's strong memory ordering allows a much finer grain of memory synchronization. As long as the processes are on the same page (memory joke), they can leave it to the memory system to work out the ordering and make sure all operations complete in the correct order as data moves between memory and caches.

Software and OSes, being mostly built on x86, have been refined to rely on x86's total store order (TSO). When the GPU serializes data to and from the CPU, it need not worry as much about synchronization; this happens mostly at the page level.

ARM, on the other hand, is not going to work at this level right out of the box, so software and OSes have to be rewritten or tweaked to make sure memory corruption never happens. There may be ways to level the playing field, but they aren't free, nor are they mature at this moment in time.
 
You could have chosen any other ARM platform to make your point, because Apple Silicon has TSO.
Apple Silicon since the M1 has a custom implementation of TSO on the AArch64 architecture, which can be toggled on and off via an MSR. This provides a unique opportunity to compare the performance of TSO and relaxed memory models on the same hardware platform.
 

Just because something enforces TSO doesn't mean it does so at the same cost. I chose to highlight Apple's L2 because it shows how weak memory ordering can be an advantage. With TSO enabled, it often loses only a few percentage points from its baseline, though the penalty can go much higher.

This leads to Apple crushing single-threaded transactional operations, yet falling off a bit as threading and complexity rise.
 
There's a good paper testing both TSO and weak ordering (WO) on the M1 Ultra under Linux that goes into depth.
This is largely solved in the M4 Pro/Max and, I think, the M3 Max. Apple implemented IBM Telum-style caching, so cores can now access the L2 caches of other clusters, obviously with higher latency.
 