Benchmarks: cuda.core by danielfrg · Pull Request #2005 · NVIDIA/cuda-python

danielfrg · 2026-05-01T18:59:32Z

Description

This adds cuda.core benchmarks that match the ones we have been running for cuda.bindings.

I guess it's up for discussion whether we need these and what we want to compare them against.

Right now it's basically trying to measure the extra latency of the cuda.core layer by comparing its benchmarks to the cuda.bindings ones, with benchmark IDs matched 1:1 to that suite.

The main question, I think, is about the "caching" that we get from cuda.core on Device. Device instances are singletons, so after the first call Device(0) doesn't hit the driver. There are probably other similar cases.
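To make the caching concern concrete, here is a minimal pure-Python sketch of a per-ID singleton. `CachedDevice` and `simulated_driver_calls` are hypothetical names made up for illustration; this is not the cuda.core implementation, only a model of why a repeated `Device(0)` in a benchmark loop would measure a cache hit rather than a driver call.

```python
class CachedDevice:
    """Hypothetical stand-in for a per-ID singleton (not cuda.core's Device)."""

    _instances = {}             # device_id -> cached instance
    simulated_driver_calls = 0  # how often the pretend "driver" was queried

    def __new__(cls, device_id=0):
        cached = cls._instances.get(device_id)
        if cached is not None:
            return cached  # cache hit: no simulated driver call
        cls.simulated_driver_calls += 1  # cache miss: pretend to query the driver
        inst = super().__new__(cls)
        inst.device_id = device_id
        cls._instances[device_id] = inst
        return inst


first = CachedDevice(0)
second = CachedDevice(0)
print(first is second)                      # same object both times
print(CachedDevice.simulated_driver_calls)  # only the first call counted
```

In a timing loop, every iteration after the first takes the cache-hit branch, so the measured mean mostly reflects the dictionary lookup.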

I guess we could also add some sort of cleanup between runs, or spawn a fresh process per measurement, but that would come with latencies of its own.
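A rough sketch of the process-spawn idea, assuming a toy cached function in place of any real CUDA call: the child script, `get_device`, and `cold_call_ns` are all hypothetical. It times the first (uncached) call inside a fresh interpreter and compares it with the wall time of spawning that interpreter, which is the "other latency" mentioned above.

```python
import subprocess
import sys
import time

# Child script: times the very first call of a toy cached function.
CHILD = """
import time
_cache = {}
def get_device(i):
    if i not in _cache:
        _cache[i] = object()  # stand-in for the expensive first-time work
    return _cache[i]
t0 = time.perf_counter_ns()
get_device(0)
print(time.perf_counter_ns() - t0)
"""


def cold_call_ns():
    """Run the child in a fresh interpreter and return its cold-call time in ns."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        capture_output=True, text=True, check=True,
    )
    return int(result.stdout.strip())


start = time.perf_counter_ns()
cold_ns = cold_call_ns()
spawn_total_ns = time.perf_counter_ns() - start

print(f"cold call inside child: {cold_ns:,} ns")
print(f"spawn + run + teardown: {spawn_total_ns:,} ns")
```

The spawn cost is paid once per sample, so a harness built this way would have to time inside the child, as done here, rather than around the subprocess call.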

copy-pr-bot · 2026-05-01T18:59:36Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

rwgk · 2026-05-01T23:34:22Z

Do you have a side-by-side bindings-vs-core delta table that you could post here?

Quick "Low" findings from Cursor GPT-5.4 Extra High Fast

Low: benchmarks/cuda_core/compare.py and benchmarks/cuda_core/benchmarks/bench_ctx_device.py tell readers to consult BENCHMARK_PLAN.md, but there is no BENCHMARK_PLAN.md under benchmarks/cuda_core or elsewhere in the repo. The starred-row legend is useful, but the referenced deeper rationale document is missing.
Low: benchmarks/cuda_core/benchmarks/bench_ctx_device.py says Device() with no args returns the TLS-cached current device, but cuda_core/cuda/core/_device.pyx actually resolves that case by calling cuCtxGetDevice() when a context is active. The benchmark behavior itself is fine, and benchmarks/cuda_core/compare.py already treats that row as a different code path, but the benchmark comment is misleading about what work is really being measured.

danielfrg · 2026-05-05T00:39:41Z

Here is a table:

In this case the Delta % is not that relevant, because we are just adding Python calls in cuda.core, so IMO we might just remove that column. Delta ns is probably the most important number here.
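A small arithmetic sketch of why the percentage can mislead when the overhead is a roughly fixed number of nanoseconds. The 40 ns overhead and the base timings below are hypothetical round numbers, not measurements from this PR.

```python
def delta_percent(bindings_ns, overhead_ns):
    """Relative slowdown when a fixed overhead is added to a bindings mean."""
    return 100.0 * overhead_ns / bindings_ns


OVERHEAD_NS = 40  # hypothetical fixed wrapper cost per call

for base_ns in (100, 2_000, 60_000):
    pct = delta_percent(base_ns, OVERHEAD_NS)
    print(f"{base_ns:>7,} ns base, +{OVERHEAD_NS} ns -> +{pct:.2f}%")
```

The same absolute cost looks large on a 100 ns call and vanishes on a 60,000 ns one, which is the argument for reading Delta ns first.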

Delta = core mean - bindings mean (positive = cuda.core slower).
* marks benchmarks where the cuda.core path invokes a different driver
  symbol, makes an additional driver call, or hits a cuda.core-side cache
  — so Delta is not pure Python wrapper overhead on top of the same driver
  call. Unstarred rows compare like-for-like driver calls; their Delta is
  wrapper overhead. See BENCHMARK_PLAN.md (Audit notes) for per-row detail.

-----------------------------------------------------------------------------------------------------------------------
Benchmark                                   bindings (ns)       RSD       core (ns)       RSD      Delta ns     Delta %
-----------------------------------------------------------------------------------------------------------------------
ctx_device.ctx_get_current *                          114      4.3%             155      3.9%           +41        +36%
ctx_device.ctx_get_device                             117      1.9%               -         -             -           -
ctx_device.ctx_set_current                            103      2.6%             134      3.1%           +31        +30%
ctx_device.device_get *                               127      3.1%             129      4.1%            +1         +1%
ctx_device.device_get_attribute *                     192      1.4%              62      2.6%          -130        -68%
ctx_device.device_primary_ctx_retain                  228      3.4%               -         -             -           -
enum.curesult_construction                            168      5.9%               -         -             -           -
enum.curesult_member_access                            20      4.9%               -         -             -           -
enum.device_attribute_construction                    167      7.2%               -         -             -           -
event.event_create_destroy                            310      1.6%             484      2.5%          +174        +56%
event.event_query                                     208      2.6%             156      1.7%           -53        -25%
event.event_record                                    227      3.5%             174      2.4%           -53        -23%
event.event_synchronize                               226      2.3%             176      1.6%           -49        -22%
launch.launch_16_args *                             3,152      1.7%           2,458      1.2%          -694        -22%
launch.launch_16_args_pre_packed                    1,980      1.1%               -         -             -           -
launch.launch_2048b                                 2,433      1.5%               -         -             -           -
launch.launch_256_args *                           16,519      2.0%           7,671      1.3%        -8,849        -54%
launch.launch_512_args *                           31,422      1.3%          13,176      2.8%       -18,246        -58%
launch.launch_512_args_pre_packed                   3,798      1.3%               -         -             -           -
launch.launch_512_bools                            58,242      2.9%               -         -             -           -
launch.launch_512_bytes                            60,803      4.8%               -         -             -           -
launch.launch_512_doubles                          87,285      3.6%               -         -             -           -
launch.launch_512_ints                             61,578      4.4%               -         -             -           -
launch.launch_512_longlongs                        65,914      4.2%               -         -             -           -
launch.launch_empty_kernel *                        1,878      1.5%           1,849      1.5%           -28         -2%
launch.launch_small_kernel *                        2,230      1.3%           1,972      0.8%          -257        -12%
memory.mem_alloc_async_free_async *                   765      1.5%           1,175      2.3%          +409        +54%
memory.mem_alloc_free *                             2,549      1.7%             968      1.5%        -1,581        -62%
memory.memcpy_dtod                                  2,291      1.3%               -         -             -           -
memory.memcpy_dtoh                                  5,490      0.7%               -         -             -           -
memory.memcpy_htod                                  4,093      0.9%               -         -             -           -
module.func_get_attribute                             221      3.2%               -         -             -           -
module.module_get_function                            180      2.3%               -         -             -           -
module.module_load_unload                           8,744      1.0%               -         -             -           -
nvrtc.nvrtc_compile_program                     7,330,579      1.6%               -         -             -           -
nvrtc.nvrtc_create_program                            678      2.4%               -         -             -           -
nvrtc.nvrtc_create_program_100_headers             14,129      6.7%               -         -             -           -
pointer_attributes.pointer_get_attribute              511      3.1%               -         -             -           -
pointer_attributes.pointer_get_attributes           2,097      2.3%               -         -             -           -
stream.stream_create_destroy *                      3,531      2.0%           3,780      1.2%          +248         +7%
stream.stream_query                                   220      1.9%               -         -             -           -
stream.stream_synchronize                             241      2.3%             194      1.8%           -47        -20%
-----------------------------------------------------------------------------------------------------------------------
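For reference, a minimal sketch of how rows like these can be derived by joining two result sets on benchmark ID and applying Delta = core mean - bindings mean. The `compare` function and the dict layout are assumptions for illustration, not the actual benchmarks/cuda_core/compare.py; the input values are the rounded means shown in the table. Because they are rounded, recomputing from displayed values can differ by 1 ns on some rows (event.event_query gives -52 here against the -53 shown, presumably because the tool subtracts unrounded means).

```python
def compare(bindings_ns, core_ns):
    """Join two {benchmark_id: mean_ns} dicts on ID and compute Delta per row."""
    rows = []
    for bench_id in sorted(bindings_ns):
        b = bindings_ns[bench_id]
        c = core_ns.get(bench_id)
        if c is None:
            # No cuda.core counterpart yet: leave the core and Delta cells empty.
            rows.append((bench_id, b, None, None, None))
            continue
        delta_ns = c - b  # positive means cuda.core is slower
        rows.append((bench_id, b, c, delta_ns, 100.0 * delta_ns / b))
    return rows


# Rounded means copied from the table above.
bindings = {
    "ctx_device.ctx_get_current": 114,
    "ctx_device.ctx_get_device": 117,
    "ctx_device.ctx_set_current": 103,
    "ctx_device.device_get_attribute": 192,
}
core = {
    "ctx_device.ctx_get_current": 155,
    "ctx_device.ctx_set_current": 134,
    "ctx_device.device_get_attribute": 62,
}

for bench_id, b, c, d, pct in compare(bindings, core):
    if c is None:
        print(f"{bench_id:<36}{b:>8,}{'-':>8}{'-':>8}{'-':>8}")
    else:
        print(f"{bench_id:<36}{b:>8,}{c:>8,}{d:>+8,}{pct:>+7.0f}%")
```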

danielfrg · 2026-05-05T15:51:48Z

Updated the table with my last numbers, sorry about that.

The question here is whether we are OK with cuda.core being faster in some rows because of object construction and, I think, because it caches some of the device creation.

In general I would say yes, but I understand that one could argue the comparison is not fair.

danielfrg added 4 commits May 1, 2026 12:50

cuda.core benchmarks

ed099ad

cuda.core benchmarks

a711361

cuda.core benchmarks

2144446

cuda.core benchmarks

c25b82f

danielfrg self-assigned this May 1, 2026

danielfrg added the cuda.bindings and performance labels May 1, 2026

danielfrg added this to the cuda.core v1.0.0 milestone May 1, 2026

danielfrg requested review from leofang, mdboom and rwgk May 1, 2026 19:00

leofang added the P1 (Medium priority - Should do) and cuda.core labels and removed the cuda.bindings label May 5, 2026

Remove benchmark plan mentions

8fdf7a4

leofang mentioned this pull request May 6, 2026

Graph kernel nodes don't keep kernel argument objects alive #2039

Open

danielfrg wants to merge 5 commits into main from benchmarks-cuda-core


3 participants
