AI Workload Survey
Hardware
Which hardware do you currently use?
*
Yes
No
Not Sure
NVIDIA GPUs (DGX/Cloud/PCIe)
NVIDIA H100
NVIDIA Grace Hopper (GH200)
NVIDIA B100 (GB100 chip)
NVIDIA GB200 NVL72
AMD GPUs (MI300/MI350, etc.)
Google TPU
AWS Trainium
AWS Inferentia
Groq
Cerebras
Tenstorrent
SambaNova
Any other hardware you use not mentioned above (please specify):
Low-Level Toolchains
Which low-level toolchains and technologies do you currently use?
*
Yes
No
Not Sure
CUDA C++
CUDA Graphs
CUDA JIT (NVRTC) - compiles CUDA source to PTX
Thrust (header-only parallel primitives library)
CUB (header-only library)
CUTLASS (header-only library)
GPUDirect (GPU-to-GPU)
GPUDirect Storage
NVSHMEM
NCCL
AMD HIP
AMD CK (Composable Kernel)
SCALE (compiles CUDA source for AMD GPUs)
Any other low-level toolchains and technologies you use not mentioned above (please specify):
Performance Libraries
Which performance libraries do you currently use?
*
Yes
No
Not Sure
cuBLAS (dense linear algebra)
cuBLASLt (mixed-precision support)
cuBLASXt (multi-GPU)
cuBLASMp (multi-node)
cuBLASDx (device-side)
cuDNN
cuTENSOR
cuSPARSE
cuFFT
cuRAND
cuSOLVER
cuDSS (direct sparse solver library)
Any other performance libraries you use not mentioned above (please specify):
Elaborate on your use of these libraries (e.g., which are used the most):
Python Libraries
Which Python libraries do you currently use?
*
Yes
No
Not Sure
RAPIDS cuDF
RAPIDS cuML
RAPIDS cuGraph
cuVS
Dask
RAPIDS Accelerator for Apache Spark
Any other Python libraries in use by your team not mentioned above (please specify):
Domain-Specific Languages (DSLs)
Which domain-specific languages do you currently use?
*
Yes
No
Not Sure
Triton Language
Mojo
cuTile
Any other DSLs in use by your team not mentioned above (please specify):
Graph Compilers
Which graph compilers do you currently use?
*
Yes
No
Not Sure
MLIR
XLA (Accelerated Linear Algebra)
TorchInductor
TVM Unity
GLOW
Any other graph compilers in use by your team not mentioned above (please specify):
Frameworks
Which frameworks do you currently use?
*
Yes
No
Not Sure
PyTorch
TensorFlow/Keras
JAX
PaddlePaddle
MindSpore
MXNet
Any other frameworks? (please specify):
Distributed Frameworks
Which distributed frameworks do you currently use?
*
Yes
No
Not Sure
DeepSpeed
FSDP (PyTorch, originally FairScale)
Ray Train
Megatron-LM
Any other distributed frameworks? (please specify):
Runtimes
Which runtimes do you currently use?
*
Yes
No
Not Sure
Dynamo-Triton (formerly Triton Inference Server, originally TensorRT Inference Server)
Dynamo
ONNX Runtime
TVM Runtime
TFLite
IREE
OpenVINO
Any other runtimes? (please specify):
Serving
Which serving technologies do you currently use?
*
Yes
No
Not Sure
KServe
Dynamo-Triton (formerly Triton Inference Server)
TorchServe
BentoML
MLflow
Any other serving technologies in use (please specify):
Utility Computing Vendors for ML
Which utility computing vendors do you currently use for ML?
*
Yes
No
Not Sure
Amazon Web Services
Microsoft Azure
Google Cloud
CoreWeave
Hot Aisle
TensorWave
Other computing vendors in use for ML (please specify):
Tools
Which developer tools do you currently use?
*
Yes
No
Not Sure
Nsight
cuda-gdb
Any other tools in use (please specify):
Rate the following properties of the technologies your team uses on a scale of 0-10, where 0 is not important and 10 is essential.
Ability to perform low-level hardware-specific optimizations: (0-10)
Availability and usability of robust debug tools: (0-10)
Ability to run on specific hardware (NVIDIA, AMD, etc.): (0-10)
Ability to run on multiple hardware platforms: (0-10)
Ability to run compiled binaries unmodified: (0-10)
Willingness to recompile for another platform with little to no code changes: (0-10)
Willingness to refactor if the gains are substantial enough: (0-10)
Describe the threshold that would motivate significant code changes (for example: 10x performance, half the cost per token, "never in a million years", or "I do it for fun often").
To what extent does your team consider the underlying hardware architecture? (0-10)