Benchmarks: cuda.core #2005

Open
danielfrg wants to merge 5 commits into main from benchmarks-cuda-core

@danielfrg
Contributor

Description

This PR adds cuda.core benchmarks that match the ones we have been running for cuda.bindings.

I guess it's up for discussion whether we need these and what we want to compare them against.

Right now it's basically trying to measure the extra latency of the cuda.core layer by comparing against the cuda.bindings benchmarks, matching benchmark IDs to that suite 1:1.

The main question, I think, is about the "caching" that we get from cuda.core on Device. Device instances are singletons, so after the first call Device(0) doesn't hit the driver. There are probably other similar cases.

I guess we could also introduce some sort of cleanup or process spawns, but that would come with other latencies.
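To make the caching concern concrete, here is a minimal sketch of how a per-ordinal singleton changes what a steady-state benchmark measures. `FakeDevice` and its simulated driver call are hypothetical stand-ins for illustration only; this is not the cuda.core implementation and it does not touch CUDA.

```python
import time


class FakeDevice:
    """Hypothetical per-ordinal singleton; NOT the real cuda.core Device."""

    _instances = {}
    driver_calls = 0  # counts simulated driver round trips

    def __new__(cls, ordinal=0):
        inst = cls._instances.get(ordinal)
        if inst is None:
            # Only the first construction pays the (simulated) driver cost.
            cls.driver_calls += 1
            time.sleep(0.001)
            inst = super().__new__(cls)
            inst.ordinal = ordinal
            cls._instances[ordinal] = inst
        return inst


def time_call_ns(fn):
    """Time a single call of fn in nanoseconds."""
    start = time.perf_counter_ns()
    fn()
    return time.perf_counter_ns() - start


# The cold call includes the simulated driver work; warm calls only hit the cache.
cold_ns = time_call_ns(lambda: FakeDevice(0))
warm_ns = min(time_call_ns(lambda: FakeDevice(0)) for _ in range(1000))

print(f"cold: {cold_ns} ns, warm (best of 1000): {warm_ns} ns")
print(f"simulated driver calls: {FakeDevice.driver_calls}")
```

A benchmark loop that runs many iterations mostly reports the warm number, so a row like this would be measuring a cache lookup rather than wrapper overhead on top of a driver call.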

@danielfrg danielfrg self-assigned this May 1, 2026
@copy-pr-bot
Contributor

copy-pr-bot Bot commented May 1, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@danielfrg danielfrg added the cuda.bindings (Everything related to the cuda.bindings module) and performance labels May 1, 2026
@danielfrg danielfrg added this to the cuda.core v1.0.0 milestone May 1, 2026
@danielfrg danielfrg requested review from leofang, mdboom and rwgk May 1, 2026 19:00
@rwgk
Contributor

rwgk commented May 1, 2026

Do you have a side-by-side bindings-vs-core delta table that you could post here?


Quick "Low" findings from Cursor GPT-5.4 Extra High Fast

  • Low: benchmarks/cuda_core/compare.py and benchmarks/cuda_core/benchmarks/bench_ctx_device.py tell readers to consult BENCHMARK_PLAN.md, but there is no BENCHMARK_PLAN.md under benchmarks/cuda_core or elsewhere in the repo. The starred-row legend is useful, but the referenced deeper rationale document is missing.

  • Low: benchmarks/cuda_core/benchmarks/bench_ctx_device.py says Device() with no args returns the TLS-cached current device, but cuda_core/cuda/core/_device.pyx actually resolves that case by calling cuCtxGetDevice() when a context is active. The benchmark behavior itself is fine, and benchmarks/cuda_core/compare.py already treats that row as a different code path, but the benchmark comment is misleading about what work is really being measured.

@danielfrg
Contributor Author

danielfrg commented May 5, 2026

Here is a table:

In this case the Delta % is not that relevant, because we are just adding Python calls in cuda.core, so we might just remove that column IMO. Delta ns might be the most important number here.

Delta = core mean - bindings mean (positive = cuda.core slower).
* marks benchmarks where the cuda.core path invokes a different driver
  symbol, makes an additional driver call, or hits a cuda.core-side cache
  — so Delta is not pure Python wrapper overhead on top of the same driver
  call. Unstarred rows compare like-for-like driver calls; their Delta is
  wrapper overhead. See BENCHMARK_PLAN.md (Audit notes) for per-row detail.

-----------------------------------------------------------------------------------------------------------------------
Benchmark                                   bindings (ns)       RSD       core (ns)       RSD      Delta ns     Delta %
-----------------------------------------------------------------------------------------------------------------------
ctx_device.ctx_get_current *                          114      4.3%             155      3.9%           +41        +36%
ctx_device.ctx_get_device                             117      1.9%               -         -             -           -
ctx_device.ctx_set_current                            103      2.6%             134      3.1%           +31        +30%
ctx_device.device_get *                               127      3.1%             129      4.1%            +1         +1%
ctx_device.device_get_attribute *                     192      1.4%              62      2.6%          -130        -68%
ctx_device.device_primary_ctx_retain                  228      3.4%               -         -             -           -
enum.curesult_construction                            168      5.9%               -         -             -           -
enum.curesult_member_access                            20      4.9%               -         -             -           -
enum.device_attribute_construction                    167      7.2%               -         -             -           -
event.event_create_destroy                            310      1.6%             484      2.5%          +174        +56%
event.event_query                                     208      2.6%             156      1.7%           -53        -25%
event.event_record                                    227      3.5%             174      2.4%           -53        -23%
event.event_synchronize                               226      2.3%             176      1.6%           -49        -22%
launch.launch_16_args *                             3,152      1.7%           2,458      1.2%          -694        -22%
launch.launch_16_args_pre_packed                    1,980      1.1%               -         -             -           -
launch.launch_2048b                                 2,433      1.5%               -         -             -           -
launch.launch_256_args *                           16,519      2.0%           7,671      1.3%        -8,849        -54%
launch.launch_512_args *                           31,422      1.3%          13,176      2.8%       -18,246        -58%
launch.launch_512_args_pre_packed                   3,798      1.3%               -         -             -           -
launch.launch_512_bools                            58,242      2.9%               -         -             -           -
launch.launch_512_bytes                            60,803      4.8%               -         -             -           -
launch.launch_512_doubles                          87,285      3.6%               -         -             -           -
launch.launch_512_ints                             61,578      4.4%               -         -             -           -
launch.launch_512_longlongs                        65,914      4.2%               -         -             -           -
launch.launch_empty_kernel *                        1,878      1.5%           1,849      1.5%           -28         -2%
launch.launch_small_kernel *                        2,230      1.3%           1,972      0.8%          -257        -12%
memory.mem_alloc_async_free_async *                   765      1.5%           1,175      2.3%          +409        +54%
memory.mem_alloc_free *                             2,549      1.7%             968      1.5%        -1,581        -62%
memory.memcpy_dtod                                  2,291      1.3%               -         -             -           -
memory.memcpy_dtoh                                  5,490      0.7%               -         -             -           -
memory.memcpy_htod                                  4,093      0.9%               -         -             -           -
module.func_get_attribute                             221      3.2%               -         -             -           -
module.module_get_function                            180      2.3%               -         -             -           -
module.module_load_unload                           8,744      1.0%               -         -             -           -
nvrtc.nvrtc_compile_program                     7,330,579      1.6%               -         -             -           -
nvrtc.nvrtc_create_program                            678      2.4%               -         -             -           -
nvrtc.nvrtc_create_program_100_headers             14,129      6.7%               -         -             -           -
pointer_attributes.pointer_get_attribute              511      3.1%               -         -             -           -
pointer_attributes.pointer_get_attributes           2,097      2.3%               -         -             -           -
stream.stream_create_destroy *                      3,531      2.0%           3,780      1.2%          +248         +7%
stream.stream_query                                   220      1.9%               -         -             -           -
stream.stream_synchronize                             241      2.3%             194      1.8%           -47        -20%
-----------------------------------------------------------------------------------------------------------------------
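For reference, the Delta columns can be reproduced with a few lines of Python. This is a sketch under assumptions, not the code in compare.py: the function names are made up here, the RSD is assumed to be the sample standard deviation divided by the mean, and Delta % is assumed to be relative to the bindings mean (which is consistent with the rows in the table above).

```python
from statistics import mean, stdev


def summarize(samples_ns):
    """Return (mean, RSD %) for a list of per-call timings in nanoseconds."""
    m = mean(samples_ns)
    return m, stdev(samples_ns) / m * 100.0


def delta(bindings_ns, core_ns):
    """Return (delta ns, delta %); positive values mean cuda.core is slower."""
    d = core_ns - bindings_ns
    return d, d / bindings_ns * 100.0


def compare(bindings, core):
    """Join two {benchmark_id: mean_ns} dicts on ID; a missing core entry yields None."""
    rows = {}
    for bench_id, b_ns in sorted(bindings.items()):
        c_ns = core.get(bench_id)
        rows[bench_id] = None if c_ns is None else delta(b_ns, c_ns)
    return rows


# Means copied from three rows of the table above.
bindings = {
    "ctx_device.ctx_set_current": 103,
    "event.event_create_destroy": 310,
    "stream.stream_query": 220,
}
core = {
    "ctx_device.ctx_set_current": 134,
    "event.event_create_destroy": 484,
}

rows = compare(bindings, core)
for bench_id, result in rows.items():
    if result is None:
        print(f"{bench_id:<32} -")
    else:
        print(f"{bench_id:<32} {result[0]:+d} ns  {result[1]:+.0f}%")
```

Rows that exist only in the bindings suite come out as `-`, matching how the table shows benchmarks that have no cuda.core counterpart yet.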

@leofang leofang added the P1 (Medium priority - Should do) and cuda.core (Everything related to the cuda.core module) labels and removed the cuda.bindings (Everything related to the cuda.bindings module) label May 5, 2026
@danielfrg
Contributor Author

Updated the table with my latest numbers, sorry about that.

The question here is whether we are OK with cuda.core being faster because of object construction and, I think, because it caches some of the device creation.

In general I would say yes, but I understand that one could argue the comparison is not fair.

