
Discussion: ARM Cortex/Neoverse IP + SoCs (no custom cores)

Wait. For things like thermometers?? Why not throw in a hundred Neoverse-V3's and call it a day?
For a thermometer or other simple sensor equipment, anything beyond a microcontroller is surplus to need.

Even an A5x/A5xx/Cx Nano core would be overkill.

The point is to draw as little power as possible, leaving the rest for the sensor itself to do its job.
 
I know. That was my exact point. Hence the sarcastic mention of a hundred Neoverse V3's.
 
It's the new rage. Collecting a bushel of dollar-store digital pregnancy tests, flashing them with new firmware to enable their low-power wireless modules, and configuring them as a Beowulf cluster to software-render Crysis on the head node.

/S
 
That sure beats playing Skyrim on a refrigerator.
 
[Attached images: g1.png through g5.png]
12-Channel DDR5-8800
 

GFHK noted in their Q&A following their February report that ARM-based CPUs have relatively weak momentum in AI servers, attributing this to lower GPU scheduling efficiency compared to x86.

They indicated that companies including NVIDIA plan to develop x86 CPU solutions in response. (It remains unclear whether this means licensing x86 architecture or co-developing x86 with Intel.)
- Jukan (X.com)

Is this true? Why would the GPU scheduling efficiency necessarily be worse?
 
It wouldn't necessarily be worse. It's likely customers looking at Grace and comparing it to... anything else.
If it's true for other designs too, it's most likely an artifact of x86 having used GPGPUs for decades: whatever warts arose have already been treated.
 

If Nvidia has over the years evolved their GPUs to exploit x86's advantages and work around its warts, that might make it a little harder for non-x86 CPUs to slot into the role. Maybe it's the way x86 does interrupts; maybe its SIMD instructions are ideal for managing and mangling the flow of data.

I would be interested in learning more about this, out of idle curiosity. If this is a real issue, it sure seems like something Nvidia should be able to fix, given that they control their GPU architecture. And if ARM's Neoverse designs aren't up to snuff, designing their own ARM cores (with custom instructions, a different interrupt methodology, etc. if necessary) seems like something a company with a $4.5 trillion market cap could afford.
 
My intuition points at ARM's weak memory model (and, to some extent, its interrupt handling), which is both a blessing and a curse. The reason Apple can have its super-large L2 is that the cores aren't always fighting over memory: if they need to synchronize, they have to explicitly invoke a barrier or similar memory flush. That lets the fast, round-robin L2 run freely as long as no conflicts arise. This would never work with x86.

On the other side, x86's strong memory ordering allows a much finer grain of memory synchronization. As long as the processes are on the same page (memory joke), they can leave it to the memory system to work out the ordering and make sure all operations complete in the correct order as data moves between memory and caches.

Software and OSes, being mostly built on x86, have been refined to rely on x86's total store order (TSO). When the GPU serializes data to and from the CPU, it need not worry as much about synchronization; this happens mostly at the page level.

ARM, on the other hand, is not going to work at this level right out of the box, so software and OSes have to be rewritten or tweaked to make sure memory corruption never happens. There may be ways to level the playing field, but they aren't free, nor are they mature at this moment in time.
 
You could have chosen any other ARM platform to make your point, because Apple Silicon has TSO.
Apple Silicon since the M1 has a custom implementation of TSO on the AArch64 architecture, which can be toggled on and off via an MSR. This provides a unique opportunity to compare the performance of TSO and relaxed memory models on the same hardware platform.
 

Just because something enforces TSO doesn't mean it does so at the same cost. I chose to highlight Apple's L2 because it shows how weak memory ordering can be an advantage. With TSO enabled, it often loses only a few percentage points from its baseline, though the penalty can go much higher.

This leads to Apple crushing single-threaded transactional operations, yet falling off a bit as threading and complexity rise.
 
There's a good paper testing both TSO and weak ordering (WO) on the M1 Ultra under Linux that goes into depth.
This is largely solved in the M4 Pro/Max and, I think, the M3 Max. Apple implemented IBM Telum-style caching, so cores can now access the L2 caches of other clusters, obviously with higher latency.
 