diff --git a/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/1-what-is-model-explorer.md b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/1-what-is-model-explorer.md new file mode 100644 index 0000000000..5d6609e8ca --- /dev/null +++ b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/1-what-is-model-explorer.md @@ -0,0 +1,154 @@ +--- +title: "What is Model Explorer and what will you learn?" + +weight: 2 + +### FIXED, DO NOT MODIFY +layout: "learningpathall" +--- + +## Model Explorer and Arm Adapters + +[Model Explorer](https://ai.google.dev/edge/model-explorer) is an open-source, web-based graph visualizer and debugger from Google AI Edge. It provides a hierarchical view of model graphs, lets you expand and collapse layers, search for nodes, inspect metadata, highlight inputs and outputs, compare models, and add overlays to graph nodes. The default version, [contained in this repository](https://github.com/google-ai-edge/model-explorer/), supports TFLite, TF, TFJS, MLIR, and PyTorch (Exported Program) formats. + +Model Explorer uses adapters and data providers to load formats beyond the built-in model types. The Arm adapters used in this Learning Path transform `.pte`, `.tosa`, `.vgf`, and `.etrecord` files into graph data that Model Explorer can display. In the final section, you also add `.etdp` runtime trace data as an overlay on top of the exported graph. + +## What you will do + +This learning path is about model artifacts, not model training or export. You start from small pre-generated files and use each one to understand when Model Explorer is useful and what insights it can give you. 
+ +The artifacts covered in this learning path are: + +| Artifact | Model Explorer support | Workflow layer | Use it to inspect | +| --- | --- | --- | --- | +| `.pte` | `pte-adapter-model-explorer` | ExecuTorch program | Delegate regions, backend partitioning, CPU fallback, and the deployed ExecuTorch graph | +| `.tosa` | `tosa-adapter-model-explorer` | Compiler/backend intermediate representation | Lowered operators, tensor shapes, quantized types, graph splits, and missed optimization opportunities | +| `.vgf` | `vgf-adapter-model-explorer` | Vulkan ML graph artifact | Inputs, outputs, constants, tensor metadata, graph connectivity, and SPIR-V graph modules | +| `.etrecord` | ExecuTorch ETRecord adapter | Export-time profiling context | Graph structure, debug handles, operator names, and delegate metadata used to map runtime events back to graph nodes | +| `.etdp` | ExecuTorch ETDump data provider | Runtime trace overlay | Timing data from a specific execution | + +{{% notice Note %}} +Model Explorer visualizes the specific artifact you generated or received. Small differences in the delegation target can result in a very different model graph. For example, delegating the same model to an Ethos-U55 may produce a very different graph from delegating it to an Ethos-U85. +{{% /notice %}} + +## Understand the model artifact flow + +Not every graph in this learning path needs to start with PyTorch and ExecuTorch. This section separates what is specific to the ExecuTorch flow from what applies more broadly. + +PTE is ExecuTorch-specific. A `.pte` file is the serialized ExecuTorch program loaded by the ExecuTorch runtime. Baseline portable-kernel, XNNPACK, Cortex-M, Ethos-U, and VGF-backend examples are all ExecuTorch deployment artifacts. + +The portable-kernel and XNNPACK routes are Cortex-A CPU routes. Cortex-M uses a separate ExecuTorch flow that applies CMSIS-NN-oriented passes for supported quantized operators. 
+ +For Ethos-U, the ExecuTorch backend uses the Ethos-U Vela compiler to compile TOSA flatbuffers into an Ethos-U command stream that is packaged into the final `.pte`. + +TOSA is not inherently ExecuTorch-specific. TOSA is an intermediate representation (IR) that can sit between a model frontend and an Arm backend compiler or converter. ExecuTorch can lower supported graph partitions to TOSA, but another framework could also provide its own TOSA exporter and generate `.tosa` files. + +```output + PyTorch model + | + v + ExecuTorch export + | + +-- Cortex-A portable kernels -> baseline .pte + | + +-- Cortex-A XNNPACK delegate -> XNNPACK .pte + | + +-- Cortex-M CMSIS-NN passes -> Cortex-M .pte + | + +-- Lower to TOSA -------------+ + | + | + | + Alternative (non PT/ET) v + frontend with TOSA export ---> TOSA (.tosa) + | + | + +-- Ethos-U Vela + | -> command stream + | -> Ethos-U .pte + | + +-- ML SDK Model Converter + -> VGF payload + -> VGF-backend .pte + -> standalone .vgf + for Vulkan ML workflows +``` + +For VGF, the ExecuTorch Arm VGF backend uses the [Arm ML SDK Model Converter](https://github.com/arm/ai-ml-sdk-model-converter) to produce a VGF backend payload from TOSA. But the Arm ML SDK Model Converter could be used to convert `.tosa` files generated from a different flow, so VGF is also not specific to just ExecuTorch. + +When ExecuTorch is used for VGF, a `.pte` is emitted as well. Use that VGF-backend `.pte` when you want to run through ExecuTorch. Use the standalone `.vgf` when you want to inspect or integrate the Vulkan ML artifact directly, such as in a neural graphics workflow. + +ETRecord and ETDump sit alongside these artifact views rather than replacing them. ETRecord is generated at export time and preserves the graph context needed for profiling attribution. ETDump is generated at runtime and records what actually happened when a `.pte` ran. 
Together, they let Model Explorer move from static inspection to runtime overlays: you can connect the graph structures you saw in the `.pte`, `.tosa`, and `.vgf` sections to operator and delegate events measured during execution. + +If you have used the [Arm Neural Graphics Model Gym](https://github.com/arm/neural-graphics-model-gym), then under the hood you have been using ExecuTorch to export your neural graphics model to VGF. If you are interested in learning more, try out the [Fine-tune neural graphics using Model Gym](https://learn.arm.com/learning-paths/mobile-graphics-and-gaming/model-training-gym/#:~:text=Upon%20completion%20of%20this%20Learning,and%20train%20neural%20graphics%20models) learning path, which briefly introduces Model Explorer. + +## Terminology + +A helpful glossary of different terms is provided below: + +| Term | Meaning in this learning path | +| --- | --- | +| Model | The neural network you want to deploy, before or after transformation by export and compiler tools. | +| Framework | The software used to define or train the model, such as PyTorch. | +| Export | The step that turns a framework model into a deployable or compiler-friendly representation. | +| Artifact | A file produced by export, lowering, compilation, or conversion, such as `.pte`, `.tosa`, or `.vgf`. | +| Runtime | The software on the target system that loads and executes a deployable artifact. ExecuTorch is the runtime for `.pte` files. | +| Compiler | A tool that transforms an intermediate representation into a lower-level target representation, such as Vela compiling TOSA for Ethos-U. | +| Lower | To transform a model or graph from a higher-level representation into a lower-level representation closer to a target backend. | +| Convert | To change one artifact format into another, such as TOSA to VGF. | +| Intermediate representation | A representation between the original model and the final target artifact. 
TOSA is the main intermediate representation in this learning path. | +| Delegate | A backend-specific execution path that handles supported parts of a graph. Unsupported parts can remain on a fallback path. | +| Kernel | The code that executes a model operator for a specific runtime or backend, such as a portable ExecuTorch kernel or a CMSIS-NN kernel. | +| Flatbuffer | A compact binary serialization format. TOSA flatbuffers store TOSA graphs; `.pte` files use a FlatBuffer-based ExecuTorch program format. | +| SPIR-V | Standard Portable Intermediate Representation - Vulkan. SPIR-V modules inside VGF files describe the Vulkan ML data graph used by the runtime. | +| ETRecord | An ExecuTorch export-time debug artifact that preserves graph and delegate metadata for profiling attribution. | +| ETDump | An ExecuTorch runtime trace artifact that can record operator, delegate, backend, timing, and cycle-count events from execution. The first Model Explorer overlay used in this Learning Path focuses on timing data. | + +## What models will I use? + +The hands-on sections will use a variety of pre-provided models. 
+ +```output +model-explorer-artifacts/ +├── README.md +├── LICENSE.md +├── pte/ +│ ├── mv2_cortex_m.pte +│ ├── opt125m_cortex_a_portable.pte +│ ├── opt125m_cortex_a_xnnpack.pte +│ ├── mv2_fp32_ethos_u85.pte +│ ├── mv2_int8_ethos_u85.pte +│ ├── mv2_lrn_int8_ethos_u85.pte +│ ├── small_upscaler_ptq_vgf.pte +│ ├── small_upscaler_qat_vgf.pte +│ └── add_sigmoid_vgf.pte +├── tosa/ +│ ├── mv2_fp32.tosa +│ ├── mv2_int8.tosa +│ ├── mv2_lrn_int8_1.tosa +│ ├── mv2_lrn_int8_2.tosa +│ ├── small_upscaler_ptq.tosa +│ └── small_upscaler_qat.tosa +├── vgf/ +│ ├── small_upscaler_ptq.vgf +│ ├── small_upscaler_qat.vgf +│ └── add_sigmoid.vgf +├── etrecord/ +│ ├── opt125m_portable.etrecord +│ ├── opt125m_xnnpack.etrecord +│ ├── mobilenetv2_fp32_ethosu.etrecord +│ ├── mobilenetv2_int8_ethosu.etrecord +│ └── mobilenetv2_lrn_int8_ethosu.etrecord +└── etdump/ + ├── opt125m_portable.etdp + ├── opt125m_xnnpack.etdp + ├── mobilenetv2_fp32_ethosu.etdp + ├── mobilenetv2_int8_ethosu.etdp + └── mobilenetv2_lrn_int8_ethosu.etdp +``` + +## What you have learned + +You have learned how Model Explorer uses adapters and data providers to load artifact formats beyond its built-in model types. You have also seen how `.pte`, `.tosa`, `.vgf`, `.etrecord`, and `.etdp` files fit into Cortex-A, Cortex-M, Ethos-U, Vulkan ML, and ExecuTorch profiling workflows. + +Next, you will install Model Explorer, launch it with the Arm adapters, and open the first `.pte` artifact. 
diff --git a/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/2-install-and-open-pte.md b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/2-install-and-open-pte.md new file mode 100644 index 0000000000..1ea87e648f --- /dev/null +++ b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/2-install-and-open-pte.md @@ -0,0 +1,167 @@ +--- +title: "Install Model Explorer and the adapters, and view a Cortex-M model graph" + +weight: 3 + +### FIXED, DO NOT MODIFY +layout: "learningpathall" +--- + +## Clone the repo of example models + +In this section, you install Model Explorer in a clean Python virtual environment, along with the Arm adapters, and confirm it is working using the PTE adapter. First, clone the repository of example models that you use throughout this learning path. + +Use a machine that can display a browser, for example a laptop. + +This repository uses Git LFS for model artifacts. After cloning, run `git lfs pull` to download the actual `.pte`, `.tosa`, and `.vgf` files. + +```bash +git clone https://github.com/arm-education/model-explorer-artifacts.git +cd model-explorer-artifacts +git lfs pull +``` + +## Create a virtual environment + +Use a separate environment to avoid dependency conflicts with any ExecuTorch build, notebook, or application environment you already use. + +If you use WSL on Windows, follow the Linux/macOS commands. 
+ +{{< tabpane code=true >}} + {{< tab header="Linux/macOS" language="bash">}} +python3 -m venv model_explorer_env +source model_explorer_env/bin/activate +python -m pip install --upgrade pip + {{< /tab >}} + {{< tab header="Windows PowerShell" language="powershell">}} +py -m venv model_explorer_env +.\model_explorer_env\Scripts\Activate.ps1 +python -m pip install --upgrade pip + {{< /tab >}} +{{< /tabpane >}} + +{{% notice Tip %}} +On Windows, if PowerShell blocks `Activate.ps1`, allow local activation scripts for your user account: + +```powershell +Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser +``` +{{% /notice %}} + +## Install Model Explorer + +Install Model Explorer and PyTorch in the active virtual environment. + +The Linux command pins PyTorch to the CPU wheel index because this Learning Path does not need CUDA. On macOS and Windows, the standard PyPI install is usually sufficient. + +{{< tabpane code=true >}} + {{< tab header="Linux" language="bash">}} +pip install torch --index-url https://download.pytorch.org/whl/cpu +pip install ai-edge-model-explorer + {{< /tab >}} + {{< tab header="macOS" language="bash">}} +pip install torch ai-edge-model-explorer + {{< /tab >}} + {{< tab header="Windows PowerShell" language="powershell">}} +pip install torch ai-edge-model-explorer + {{< /tab >}} +{{< /tabpane >}} + +## Install the Arm adapters + +{{% notice TODO before release %}} +Update this installation section after the ExecuTorch Model Explorer extension is released. + +The intended install flow is to install the PTE, ETRecord, and ETDump adapters/data provider as one ExecuTorch extension, then launch Model Explorer with that ExecuTorch extension alongside the TOSA and VGF adapters. Until that package is available, keep the ETRecord and ETDump install and launch instructions under review. 
+{{% /notice %}} + +Install the PTE, TOSA, and VGF adapters used for the static artifact sections: + +```bash +pip install pte-adapter-model-explorer +pip install tosa-adapter-model-explorer +pip install vgf-adapter-model-explorer +``` + +## Launch Model Explorer + +Launch Model Explorer with the Arm adapters. This is the recommended route for the static artifact sections because you will open `.pte`, `.tosa`, and `.vgf` artifacts. The tabs also show how to run the base Model Explorer without adapters and with a single adapter. + +The command opens the Model Explorer web page in your browser. + +{{< tabpane code=true >}} + {{< tab header="All adapters" language="bash">}} +model-explorer --extensions=pte_adapter_model_explorer,tosa_adapter_model_explorer,vgf_adapter_model_explorer + {{< /tab >}} + {{< tab header="PTE adapter" language="bash">}} +model-explorer --extensions=pte_adapter_model_explorer + {{< /tab >}} + {{< tab header="No adapters" language="bash">}} +model-explorer + {{< /tab >}} +{{< /tabpane >}} + +{{% notice Tip %}} +Use `CTRL + C` to stop Model Explorer. +{{% /notice %}} + +{{% notice Note %}} +If you have a specific interest in one model format (`.pte`, `.tosa`, or `.vgf`) or in a particular target (for example, Cortex-M, Cortex-A, Ethos-U, or Neural Graphics), you can skip to the appropriate section. +{{% /notice %}} + +## Open the Cortex-M PTE + +Start with a `.pte` generated for the Cortex-M backend. This `.pte` was generated for the MobileNetV2 model, a typical Convolutional Neural Network (CNN) used in embedded ML. + +The ExecuTorch Cortex-M backend prepares models for Arm Cortex-M microcontrollers, where memory and compute resources are much more limited than on application-class CPUs. It rewrites supported quantized operators so they can use CMSIS-NN, an Arm library of optimized neural network kernels for Cortex-M processors. 
CMSIS-NN exists to make common ML operations such as convolutions, fully connected layers, activations, and quantization-related operations run efficiently on small embedded CPUs. Treat this first `.pte` as a good way to learn the Model Explorer interface while seeing how an ExecuTorch graph can reflect Cortex-M-specific lowering. + +{{% notice Note %}} +The Cortex-M backend is a work-in-progress proof of concept. It is not intended for production use, and APIs may change without notice. However, the `.pte` is pre-generated for you in the provided repo. If you would like to find out more about the Cortex-M backend, use the [Cortex-M Backend Documentation](https://docs.pytorch.org/executorch/1.2/backends/arm-cortex-m/arm-cortex-m-overview.html), which also links to a Jupyter Notebook. +{{% /notice %}} + +In the Model Explorer UI, open: + +```output +model-explorer-artifacts/pte/mv2_cortex_m.pte +``` + +Your browser view should look like this: + +![Screenshot of Model Explorer with Arm Adapters and a loaded Cortex-M PTE.#center](model_explorer.png "Model Explorer with Arm Adapters") + +Click `View selected models`. The view should then look like this: + +![Screenshot of top-level view in Model Explorer of a loaded Cortex-M PTE.#center](cortex_m_top.png "Typical top-level graph view in Model Explorer") + +The right-hand panel shows the graph information, including the `op node count` and the `layer count`. The `op node count` is the number of operator nodes in the graph. The `layer count` is the number of hierarchical graph components represented in the current view, not necessarily the number of neural network layers in the original model. + +Double-click the `forward` layer to see the operators it contains. Click a specific operator, for example `cortex_m::quantize_per_tensor`, to see its attributes, inputs, and outputs in the right-hand panel. 
+ +![Screenshot of examining a specific Cortex-M operator in Model Explorer.#center](cortex_m_inspect.png "Inspecting specific operators with Model Explorer") + +When you inspect a `.pte` for the first time, focus on the higher-level graph information first: + +- **Operator names** show the work the ExecuTorch program will perform. +- **Inputs and outputs** show how tensors flow through the graph. +- **Tensor shapes and types** help you check whether the model was exported and quantized as expected. +- **Hierarchical layers** let you expand or collapse parts of the graph so you can move between an overview and individual operators. +- **Delegate or backend-specific names** show where a backend flow has changed the graph. In this Cortex-M example, names such as `cortex_m::...` indicate operators affected by the Cortex-M backend flow. + +You might also see lower-level `.pte` execution fields in the node attributes. These fields come from the serialized ExecuTorch program: + +| Field | What it means | +| --- | --- | +| `instruction type: KernelCall` | This instruction calls an ExecuTorch operator kernel. A kernel is the code that executes a model operator for a runtime or backend. | +| `instr_args_type: 1` | The internal FlatBuffer type tag for the instruction arguments. In this case, `1` identifies the arguments as a `KernelCall`. | +| `op_index` | An index into the `.pte` operator table. It tells ExecuTorch which operator this instruction calls. | +| `args: [471, 473, 474, ...]` | Indexes into the `.pte` values table. These entries identify the tensors or other values used as inputs and outputs by the instruction. | + +You do not need to memorize these internal fields. Use them as clues when you want to connect a visible graph node to the underlying ExecuTorch program structure. + +Click the `eye symbol` in the top left bar to select data to view on nodes and edges. 
Select the data you want to see, for example `op node id` and `op node attributes`, to include more data within the graph itself. + +## What you have learned + +You have installed Model Explorer, launched it with the Arm adapters, and opened your first `.pte` artifact. You have also learned how to inspect the graph overview, expand into the `forward` layer, read operator metadata, and interpret common low-level `.pte` fields such as `KernelCall`, `op_index`, and `args`. + +Next, you will keep the same browser tab open and compare portable and XNNPACK `.pte` artifacts to see how backend delegation changes the graph. diff --git a/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/3-compare-portable-xnnpack-pte.md b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/3-compare-portable-xnnpack-pte.md new file mode 100644 index 0000000000..68c98394ce --- /dev/null +++ b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/3-compare-portable-xnnpack-pte.md @@ -0,0 +1,117 @@ +--- +title: "Compare portable and XNNPACK PTE files" + +weight: 4 + +### FIXED, DO NOT MODIFY +layout: "learningpathall" +--- + +## Understand the CPU paths + +This section focuses on Cortex-A CPU deployment. Cortex-A processors are application-class CPUs used in systems such as phones, Raspberry Pi-class Linux devices, laptops, and cloud instances. + +The portable `.pte` uses ExecuTorch portable kernels. A portable kernel is a general ExecuTorch implementation of an operator. Portable kernels exist so ExecuTorch programs can run with a small runtime and broad operator coverage, even when no specialized backend is available. They are important for correctness, portability, fallback, and bring-up on new targets. + +Portable kernels are not usually the fastest CPU path. 
They prioritize broad support and a lightweight deployment model, rather than using every architecture-specific optimization available on a modern Cortex-A CPU. For transformer models such as OPT-125M, much of the runtime cost comes from linear layers and matrix multiplications. Those operations benefit strongly from optimized CPU kernels. + +[XNNPACK](https://github.com/google/XNNPACK) is the optimized CPU backend used by ExecuTorch for many Arm CPU deployments. During [export and lowering](https://docs.pytorch.org/executorch/stable/using-executorch-export.html), the XNNPACK partitioner finds supported parts of the graph and turns them into delegated regions. At runtime, those regions execute with XNNPACK instead of the default portable-kernel path. Operators that XNNPACK does not support, or graph sections that cannot be grouped into an XNNPACK region, remain on the default ExecuTorch path. + +On Arm CPUs, XNNPACK can use Arm KleidiAI micro-kernels. [KleidiAI](https://developer.arm.com/dev2/ai/kleidi-libraries) is Arm's open-source library of optimized low-level AI routines for Arm CPUs. It provides architecture-tuned compute kernels for operations such as matrix multiplication, using Arm features such as Neon, SVE2, and SME2 where supported. You do not call KleidiAI directly in this workflow; ExecuTorch delegates to XNNPACK, and XNNPACK can use KleidiAI-optimized kernels internally when the operator, data type, and hardware are supported. + +If you are interested in understanding how SME2 can accelerate performance of ExecuTorch models, take a look at [Profile ExecuTorch models with SME2 on Arm](https://learn.arm.com/learning-paths/cross-platform/sme-executorch-profiling/). 
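The partitioning behaviour described in this section can be illustrated with a toy sketch. The code below is not the XNNPACK partitioner and does not call any ExecuTorch API. It treats a graph as a flat list of operator names, assumes a hypothetical set of supported operators (`SUPPORTED`), and groups contiguous supported operators into delegate regions while leaving everything else on the default path. Real partitioning works on a graph with data dependencies and extra constraints, so treat this only as intuition for why unsupported operators fragment delegation.

```python
from itertools import groupby

# Hypothetical set of operators a backend accepts; not XNNPACK's real support list.
SUPPORTED = {"addmm", "bmm", "softmax", "add"}


def partition(ops, supported=SUPPORTED):
    """Group contiguous supported ops into delegate regions.

    Returns a list of (kind, ops) tuples, where kind is "delegate" or "default".
    """
    plan = []
    for is_supported, run in groupby(ops, key=lambda op: op in supported):
        run = list(run)
        if is_supported:
            # One delegate region per contiguous run of supported operators.
            plan.append(("delegate", run))
        else:
            # Unsupported operators stay on the default path, one entry each.
            plan.extend(("default", [op]) for op in run)
    return plan


if __name__ == "__main__":
    # Illustrative operator sequence, loosely modelled on an attention block.
    ops = ["addmm", "permute_copy", "bmm", "softmax", "bmm",
           "native_layer_norm", "addmm", "add"]
    for kind, region in partition(ops):
        print(f"{kind:8s} {region}")
```

With this illustrative input, the two operators outside the assumed support set split the sequence into three separate delegate regions instead of one.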
+ +To summarize the different CPU paths: + +| Artifact | Target CPU path | What to expect | +| --- | --- | --- | +| Cortex-M `.pte` | Cortex-M backend lowering with CMSIS-NN optimized kernels where supported | Quantized operator patterns and Cortex-M-specific names | +| Portable Cortex-A `.pte` | Default ExecuTorch portable kernels | Broad operator coverage, useful baseline, usually slower for heavy transformer compute | +| XNNPACK Cortex-A `.pte` | XNNPACK delegated regions, with portable fallback where needed | Faster supported CPU regions, possible graph fragmentation, larger `.pte` metadata | + +## Compare CPU deployment artifacts + +In this section, you compare two `.pte` files generated from the same model: [`facebook/opt-125m`](https://huggingface.co/facebook/opt-125m). + +OPT stands for Open Pre-trained Transformer. OPT-125M is a 125-million-parameter, decoder-only transformer language model from Meta. It is a small member of the OPT family, which makes it useful for demonstrations because it is large enough to contain transformer operations such as embeddings, attention, linear layers, matrix multiplication, reshapes, and masking, but small enough to inspect and run on edge-class Arm systems. + +The FP32 artifacts used here come from the [ExecuTorch on Arm Practical Labs](https://github.com/arm-education/executorch_on_arm_labs). + +Open these files in Model Explorer: + +```output +model-explorer-artifacts/pte/opt125m_cortex_a_portable.pte +model-explorer-artifacts/pte/opt125m_cortex_a_xnnpack.pte +``` + +## Open the portable kernel PTE + +Open `opt125m_cortex_a_portable.pte` in Model Explorer and inspect the graph structure. This file is the baseline ExecuTorch program without XNNPACK delegation. It shows how the model looks when the graph runs through the default ExecuTorch portable-kernel path. + +Inspect the graph and answer: + +- Are there any backend delegate regions? 
+- Do the operator names look like regular PyTorch/ATen operators or backend-specific operators? +- What are the model input and output shapes? +- Which transformer operator patterns appear repeatedly? +- Which shape or layout operators might become boundaries for optimized backend delegation? + +A small snippet image is shown below: + +![Screenshot of examining a portable Cortex-A PTE in Model Explorer.#center](portable.png "Inspecting portable Cortex-A PTE with Model Explorer") + +In this artifact, notice: + +- The graph has 600 operator nodes and no XNNPACK delegate regions. The visible operators are regular ExecuTorch `KernelCall` nodes. +- Most visible operator names use the `aten::` namespace. ATen is PyTorch's core operator library. +- The model has two fixed-shape inputs with shape `[1, 128]`, corresponding to a batch size of 1 and a fixed sequence length of 128 tokens. +- The output shape is `[1, 50272]`, which represents logits over the OPT vocabulary for the wrapped last-token output. +- Repeated transformer patterns are visible. Look for groups of `aten::addmm`, `aten::bmm`, `aten::_softmax`, `aten::native_layer_norm`, `aten::relu`, and residual `aten::add` operations. +- Many layout and shape-manipulation operators are present, such as `aten::permute_copy`, `aten::expand_copy`, `aten::unsqueeze_copy`, and `dim_order_ops::_clone_dim_order`. These are useful to notice because they can affect memory movement and can become boundaries around optimized backend regions. + +## Open the XNNPACK PTE + +Open `opt125m_cortex_a_xnnpack.pte` and compare it with the portable graph. + +Inspect the graph and answer: + +- Are there XNNPACK delegate regions? +- Is the delegated work one large block or many smaller blocks? +- Which inputs and outputs cross the delegate boundaries? +- Does any visible work remain outside the delegated regions? +- Which `aten::` operators remain on the default ExecuTorch path? 
+- Can you open a delegate subgraph and identify backend-level XNNPACK operators? + +The backend has clearly changed the execution plan: + +![Screenshot of examining an XNNPACK Cortex-A PTE in Model Explorer.#center](xnnpack.png "Inspecting XNNPACK Cortex-A PTE with Model Explorer") + +- The top-level graph is smaller than the portable graph, with about 335 operator nodes instead of about 600. +- The graph contains many `XnnpackBackend` nodes. In this artifact, these represent the delegated regions that will execute through XNNPACK. +- Model Explorer exposes the XNNPACK delegate subgraphs. Open a delegate subgraph to see backend-level operators such as `XNNFullyConnected`, `XNNBatchMatrixMultiply`, `XNNStaticTranspose`, `XNNAdd`, `XNNMultiply`, and `XNNSoftmax`. +- The input and output contract remains the same as the portable artifact: two `[1, 128]` inputs and one `[1, 50272]` output. +- Some `aten::` operators still remain at the top level, including shape, masking, normalization, and elementwise operations. These are the parts of the graph that stayed on the default ExecuTorch path. +- The graph is not one single XNNPACK region. OPT-125M is a transformer with attention, masking, reshapes, and layout changes, so delegation is useful but fragmented into many backend regions. + +![Screenshot of examining an XNNPACK subgraph in Model Explorer.#center](xnnpack_subgraph.png "Inspecting XNNPACK subgraph with Model Explorer") + +This is the key difference to notice: XNNPACK does not replace the whole `.pte`. It captures supported subgraphs and leaves the rest of the program in ExecuTorch. In performance work, the balance between large delegated regions and remaining default-path operators is often more important than the raw number of delegate nodes. + +## Compare the two artifacts + +Use this table to guide your comparison: + +| Question | What to look for | +| --- | --- | +| Did XNNPACK group expensive work? 
| A larger delegated region can indicate that supported transformer operations were grouped for optimized CPU execution. | +| Did the graph fragment? | Multiple small delegate regions can indicate unsupported operators or boundaries between supported regions. | +| What stayed on the CPU default path? | Remaining portable operators may explain residual latency or integration requirements. | +| Did the artifact size change? | Backend delegation can improve latency but may increase the `.pte` size. | + +## What you have learned + +You have compared the same FP32 OPT-125M model exported as a portable Cortex-A `.pte` and as an XNNPACK-delegated Cortex-A `.pte`. The portable artifact shows the baseline ExecuTorch execution plan with mostly `aten::` `KernelCall` nodes. The XNNPACK artifact shows how supported CPU subgraphs are replaced by `XnnpackBackend` regions while unsupported or awkward graph sections remain on the default ExecuTorch path. + +You have also seen that backend delegation is not all-or-nothing. For transformer models, shape changes, masking, normalization, and layout operations can fragment the graph, so performance analysis depends on both what was delegated and what stayed outside the delegate. + +Next, you will inspect Ethos-U `.pte` artifacts and see how NPU delegation differs from Cortex-A CPU delegation. diff --git a/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/4-inspect-ethosu-pte.md b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/4-inspect-ethosu-pte.md new file mode 100644 index 0000000000..9a0657891f --- /dev/null +++ b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/4-inspect-ethosu-pte.md @@ -0,0 +1,119 @@ +--- +title: "Inspect Ethos-U PTE delegation" + +weight: 5 + +### FIXED, DO NOT MODIFY +layout: "learningpathall" +--- + +## Inspect NPU delegation + +Ethos-U is Arm's microNPU family for embedded and edge AI acceleration. 
In ExecuTorch Arm Ethos-U flows, suitable quantized subgraphs are lowered for the Ethos-U backend and compiled through the Arm toolchain. + +Model Explorer is useful because the final `.pte` shows whether the ExecuTorch program contains a clean NPU delegate region, fragmented delegate regions, or CPU fallback. + +Ethos-U execution is heterogeneous: supported subgraphs are delegated to the NPU, while unsupported operators fall back to the CPU. The PyTorch blog [Efficient Edge AI on Arm CPUs and NPUs](https://pytorch.org/blog/efficient-edge-ai-on-arm-cpus-and-npus/) describes this flow as quantizing the model, lowering supported regions to TOSA, running Vela to produce an optimized Ethos-U command stream, and packaging the result into the final `.pte`. In this section, you inspect three MobileNetV2 artifacts in order: FP32 with no NPU delegation, INT8 with clean delegation, and INT8 with fragmented delegation. + +To see how the artifacts used in this section were generated, use the [ExecuTorch on Arm Practical Labs](https://github.com/arm-education/executorch_on_arm_labs). + +## Open the FP32 Ethos-U artifact + +First, use a MobileNetV2 model in FP32 form. + +Open: + +```output +model-explorer-artifacts/pte/mv2_fp32_ethos_u85.pte +``` + +Inspect the graph and answer: + +- Is there an Ethos-U delegate region? +- Are the operators still regular `aten::` operators? +- What are the input and output shapes? +- What does this tell you about targeting Ethos-U without quantization? + +![Screenshot of examining an FP32 Ethos-U PTE in Model Explorer.#center](ethos_fp32.png "Inspecting FP32 Ethos-U PTE with Model Explorer") + +Ethos-U execution expects quantized integer workloads. This artifact was generated from an FP32 MobileNetV2 model, so the graph is not in the form Ethos-U needs for NPU execution. As a result, the work falls back to the CPU instead of being packaged as an Ethos-U delegate region. 
+ +- The graph is much larger at the top level, with many visible `aten::convolution`, `aten::_native_batch_norm_legit_no_training`, and `aten::hardtanh` nodes. +- You should not see an `EthosUBackend` delegate node. +- The input and output shapes still match the image classification model: `[1, 3, 224, 224]` to `[1, 1000]`. + +This shows why quantization matters for Ethos-U. A model can be structurally valid and still fall back to CPU execution if it is not in a supported quantized form. + +## Open a delegated INT8 Ethos-U artifact + +Now we will use the same MobileNetV2 model, but quantized with the `EthosUQuantizer` into INT8. + +Open: + +```output +model-explorer-artifacts/pte/mv2_int8_ethos_u85.pte +``` + +Inspect the graph and answer: + +- Is there an Ethos-U delegate region? +- Is the NPU region one large block or several smaller blocks? +- Which inputs and outputs cross the delegate boundary? +- Does any visible work remain outside the delegated region? + +![Screenshot of examining an INT8 Ethos-U PTE in Model Explorer.#center](ethosu-int8-clean.png "Inspecting INT8 Ethos-U PTE with Model Explorer") + +A clean delegated example should have most supported quantized work inside the Ethos-U region. In this example, the compute-heavy quantized CNN operators are suitable for Ethos-U and should appear as one large delegated region. + +In Model Explorer, this artifact should look very compact at the top level: + +- The graph has one input with shape `[1, 3, 224, 224]` and one output with shape `[1, 1000]`. +- You should see a `quantized_decomposed::quantize_per_tensor` node near the start. +- You should see a single `EthosUBackend` delegate node for the main accelerated region. +- You should see a `quantized_decomposed::dequantize_per_tensor` node near the end. 
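The quantize and dequantize nodes that bracket the delegate can be pictured with a small sketch. The following is a minimal illustration of affine per-tensor quantization, assuming an INT8 range of -128 to 127. The `scale` and `zero_point` values are made up for illustration and are not read from the artifact, and the helper functions are hypothetical stand-ins rather than ExecuTorch APIs.

```python
def quantize_per_tensor(x, scale, zero_point, qmin=-128, qmax=127):
    """Map a float to an integer code: round(x / scale) + zero_point, then clamp."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize_per_tensor(q, scale, zero_point):
    """Map an integer code back to an approximate float value."""
    return (q - zero_point) * scale


# Illustrative parameters only; real values come from the quantizer's calibration.
scale, zero_point = 0.25, 0

q = quantize_per_tensor(1.0, scale, zero_point)
x = dequantize_per_tensor(q, scale, zero_point)
# A value outside the representable range saturates at the INT8 maximum.
saturated = quantize_per_tensor(100.0, scale, zero_point)
print(q, x, saturated)
```

In the clean delegated graph, everything between these two steps runs inside the single `EthosUBackend` node on integer data.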
+ +Most of the quantized MobileNetV2 compute is hidden behind one Ethos-U delegate call, so the top-level `.pte` graph mostly shows data entering the delegate, leaving the delegate, and returning to ExecuTorch. + +## Open a fragmented Ethos-U artifact + +To create this example, the original MobileNetV2 graph was modified by inserting an LRN (Local Response Normalization) layer. This is a useful example because the rest of the model still looks like the clean INT8 MobileNetV2 case, but the inserted LRN operation introduces work that the Ethos-U flow cannot keep inside one contiguous delegated region. + +Open: + +```output +model-explorer-artifacts/pte/mv2_lrn_int8_ethos_u85.pte +``` + +Look for: + +- Multiple delegate regions +- Operators between delegate regions +- Unsupported operations that force CPU fallback +- Extra tensor movement around backend boundaries + +![Screenshot of examining a fragmented INT8 Ethos-U PTE in Model Explorer.#center](ethos-u-int8-fragmented.png "Inspecting fragmented INT8 Ethos-U PTE with Model Explorer") + +Fragmentation often means that the model was only partly suitable for the target backend. Common causes include unsupported operators, unsupported tensor shapes, quantization issues, or target-specific compiler constraints. + +LRN is not natively supported by the Ethos-U flow used here, so it is decomposed into lower-level operations during lowering. Not all of those operations can be delegated to the NPU. Model Explorer should therefore show supported regions delegated to Ethos-U and unsupported work left on the CPU path. In summary: a single unsupported operation can break an otherwise clean NPU region into multiple segments, increasing transitions between CPU and NPU. + +In Model Explorer, compare it with the clean delegated artifact: + +- The graph still has one input with shape `[1, 3, 224, 224]` and one output with shape `[1, 1000]`. +- You should see two `EthosUBackend` delegate nodes instead of one. 
+- You should see quantize and dequantize nodes around the delegated regions. +- You should see an `aten::avg_pool3d` node between the delegate regions. This is the visible CPU-side work that breaks the otherwise contiguous NPU path. + +This is what fragmentation looks like in a `.pte`: the NPU still accelerates supported regions, but unsupported work splits the graph and creates extra boundaries between CPU execution and Ethos-U execution. These extra boundaries can cause extra overhead that leads to reduced performance. + +## Compare targets only with target-specific artifacts + +Remember, an artifact generated for one Ethos-U target does not fully explain another target. + +For example, an Ethos-U85 artifact does not provide full insight into Ethos-U55 behavior. Generate and inspect separate `.pte` files when comparing targets because operator support constraints, Vela behavior, memory configuration, MAC configuration, and fragmentation can differ. + +## What you have learned + +You have inspected three Ethos-U `.pte` artifacts and seen how quantization and operator support affect NPU delegation. The FP32 MobileNetV2 artifact stays on the CPU path because Ethos-U expects supported quantized integer workloads. The INT8 MobileNetV2 artifact shows the clean delegated pattern: quantize, run a compact `EthosUBackend` region, then dequantize. The LRN example shows fragmentation, where unsupported work splits one clean NPU region into multiple delegate regions with CPU work between them. + +Next, you will inspect TOSA artifacts directly to see the intermediate representation that sits between model lowering and backend compilation. 
diff --git a/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/6-inspect-tosa.md b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/6-inspect-tosa.md new file mode 100644 index 0000000000..e10bd9cc58 --- /dev/null +++ b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/6-inspect-tosa.md @@ -0,0 +1,165 @@ +--- +title: "Inspect TOSA artifacts" + +weight: 6 + +### FIXED, DO NOT MODIFY +layout: "learningpathall" +--- + +## Inspect the TOSA intermediate representation + +TOSA is the Tensor Operator Set Architecture. It is a stable operator-level intermediate representation (IR) used between model export and backend-specific compilation or conversion. + +You have already seen Ethos-U `.pte` files. Those `.pte` files show the final ExecuTorch program after supported regions have been delegated or left on the CPU path. TOSA lets you inspect an earlier stage: the graph representation that backend tools such as Vela or the Arm ML SDK Model Converter can consume. + +This is why this section does not include separate TOSA artifacts for the Cortex-A portable, Cortex-A XNNPACK, or Cortex-M examples. In the flow introduced at the start of this learning path, those routes do not need a TOSA intermediate representation: portable kernels stay in the ExecuTorch operator path, XNNPACK uses an ExecuTorch delegate for Cortex-A CPU acceleration, and Cortex-M uses its own Cortex-M/CMSIS-NN-oriented lowering path. TOSA becomes relevant for the backend routes that consume TOSA, such as Ethos-U and VGF. + +TOSA inspection is useful when you want to answer questions such as: + +- Did lowering produce the operators you expected? +- Are tensors in the expected shapes and data types? +- Did quantization change the graph structure? +- Did an unsupported operation split the graph into separate artifacts? +- Is this TOSA artifact ready to feed into the next backend tool? 
+ +Unlike the `.pte` views you inspected earlier, these graphs are not showing ExecuTorch runtime instructions or delegate calls. They show TOSA operators such as `CONV2D`, `DEPTHWISE_CONV2D`, `RESCALE`, `CLAMP`, and `RESHAPE`. + +## Inspect Ethos-U TOSA artifacts + +Start with the same MobileNetV2 cases you inspected as `.pte` files. + +Open the FP32 TOSA artifact: + +```output +model-explorer-artifacts/tosa/mv2_fp32.tosa +``` + +Inspect the graph and answer: + +- Is the input shape `[1, 3, 224, 224]`? +- Is the output shape `[1, 1000]`? +- Are tensor types FP32? +- Which convolution, add, clamp, and pooling patterns are visible? + +![Screenshot of examining an FP32 MobileNetV2 TOSA artifact in Model Explorer.#center](tosa_ethos_fp32.png "Inspecting FP32 TOSA with Model Explorer") + +This artifact shows that the FP32 model can be represented in TOSA. In Model Explorer, you should see a large graph with about 800 nodes. Most of the graph is made from constants and arithmetic around the MobileNetV2 operator structure: `CONV2D`, `DEPTHWISE_CONV2D`, `MUL`, `ADD`, `SUB`, `CLAMP`, one `AVG_POOL2D`, and a final `RESHAPE`. + +That does not mean the graph can run on Ethos-U. Ethos-U expects supported quantized integer workloads, so the FP32 `.pte` you inspected earlier did not contain an `EthosUBackend` delegate region. + +Now open the INT8 TOSA artifact: + +```output +model-explorer-artifacts/tosa/mv2_int8.tosa +``` + +Compare it with the FP32 TOSA graph: + +- The input and output shapes still match MobileNetV2: `[1, 3, 224, 224]` to `[1, 1000]`. +- Tensor types are INT8. +- You should see many `RESCALE` operations, which are common in quantized graphs. +- You should still see the core CNN structure, including `CONV2D`, `DEPTHWISE_CONV2D`, `ADD`, `AVG_POOL2D`, and `RESHAPE`. 
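As a rough picture of what a `RESCALE` node in the list above computes, the sketch below applies an integer multiplier and a right shift to an accumulator value and clamps the result to the INT8 range. It is a simplified illustration, assuming round-to-nearest by adding half the divisor and using made-up multiplier, shift, and zero-point values. It does not reproduce every detail of the TOSA specification, and `rescale` is a hypothetical helper.

```python
def rescale(value, multiplier, shift, input_zp=0, output_zp=0, qmin=-128, qmax=127):
    """Simplified fixed-point rescale: (value - input_zp) * multiplier / 2**shift + output_zp."""
    acc = (value - input_zp) * multiplier
    # Add half of the divisor before shifting so the result rounds to nearest.
    rounded = (acc + (1 << (shift - 1))) >> shift
    return max(qmin, min(qmax, rounded + output_zp))


# Illustrative parameters: multiplier / 2**shift = 2**30 / 2**33 = 1/8.
MULTIPLIER, SHIFT = 1 << 30, 33

in_range = rescale(1000, MULTIPLIER, SHIFT)    # 1000 / 8 fits in INT8
saturated = rescale(2000, MULTIPLIER, SHIFT)   # 2000 / 8 exceeds INT8 and is clamped
negative = rescale(-1000, MULTIPLIER, SHIFT)   # negative accumulators scale the same way
print(in_range, saturated, negative)
```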
+ +![Screenshot of examining an INT8 MobileNetV2 TOSA artifact in Model Explorer.#center](tosa_ethos_int8.png "Inspecting INT8 TOSA with Model Explorer") + +The INT8 graph is still a full MobileNetV2 graph, but the operator mix changes. You should see fewer floating-point arithmetic nodes and many `RESCALE` nodes. These are used in quantized graphs to move values between quantization scales after integer operations. The convolution and depthwise convolution operators use INT32 accumulation, which is typical for INT8 convolution workloads. + +This is the kind of TOSA graph that can be compiled by the Ethos-U Vela compiler into an Ethos-U command stream, then packaged into a `.pte`. + +## Inspect fragmented TOSA artifacts + +Next, inspect the TOSA artifacts from the LRN example: + +```output +model-explorer-artifacts/tosa/mv2_lrn_int8_1.tosa +model-explorer-artifacts/tosa/mv2_lrn_int8_2.tosa +``` + +You saw earlier that the LRN `.pte` contained two `EthosUBackend` delegate nodes with CPU-side work between them. These two TOSA files help explain why. + +Inspect both TOSA files and answer: + +- Why did this example produce more than one TOSA artifact? +- What are the input and output shapes for each fragment? +- Which fragment contains most of the MobileNetV2 CNN structure? +- Which fragment represents the graph region after the inserted LRN-related work? +- Are the fragment boundaries consistent with the two `EthosUBackend` regions you saw in the `.pte`? + +![Screenshot of examining the first fragmented INT8 TOSA artifact in Model Explorer.#center](tosa_ethos_int8_frag_1.png "Inspecting the first fragmented INT8 TOSA artifact") + +![Screenshot of examining the second fragmented INT8 TOSA artifact in Model Explorer.#center](tosa_ethos_int8_frag_2.png "Inspecting the second fragmented INT8 TOSA artifact") + +The first LRN TOSA file is the smaller fragment. It has two inputs, with shapes `[1, 1280, 7, 7]` and `[1, 1, 1280, 7, 7]`, and one `[1, 1000]` output. 
It contains the later part of the graph after the inserted LRN-related work, including a small number of `RESCALE`, `TABLE`, `MUL`, `AVG_POOL2D`, and `CONV2D` operations. + +The second LRN TOSA file is the larger fragment. It starts from the original image input shape `[1, 3, 224, 224]` and contains most of the quantized MobileNetV2 CNN structure. It has many `RESCALE`, `CONV2D`, and `DEPTHWISE_CONV2D` operations, and produces intermediate outputs with shapes `[1, 1280, 7, 7]` and `[1, 1, 1284, 7, 7]` that cross the break in the graph. + +The graph fragmentation has become visible as multiple backend-ready TOSA artifacts. That matches the fragmented `.pte` view, where two `EthosUBackend` delegate regions were separated by CPU-side work. + +## Use TOSA outside ExecuTorch + +TOSA is not limited to ExecuTorch. ExecuTorch can lower supported graph partitions to TOSA, but a different framework, internal compiler, or proprietary model frontend could also produce TOSA directly. + +That makes the TOSA adapter useful even when there is no `.pte` file in the workflow. For example, you might be: + +- Converting from a framework, ONNX graph, or internal model dialect into TOSA. +- Writing a compiler, graph optimizer, or Vela-like backend tool that consumes TOSA. +- Checking whether your frontend produced the TOSA operators, tensor shapes, layouts, and quantized types you expected. +- Looking for missed optimization opportunities, such as long chains of `ADD`, `MUL`, `RESHAPE`, or layout operations that could potentially be fused or lowered differently. +- Comparing two frontend or compiler versions to see whether the generated TOSA graph became simpler, more fragmented, or more backend-friendly. 
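One way to make the last two checks above concrete is to compare operator counts between two TOSA graphs. The sketch below assumes you have already collected each graph's operator names as a plain Python list, for example by reading them off Model Explorer. The operator lists are made-up sample data, and `diff_operator_counts` is a hypothetical helper, not part of any TOSA tool.

```python
from collections import Counter


def diff_operator_counts(before, after):
    """Return {operator: after_count - before_count} for operators whose count changed."""
    before_counts, after_counts = Counter(before), Counter(after)
    operators = sorted(set(before_counts) | set(after_counts))
    return {
        op: after_counts[op] - before_counts[op]
        for op in operators
        if after_counts[op] != before_counts[op]
    }


# Made-up operator lists standing in for two versions of a frontend's output.
graph_v1 = ["CONV2D", "RESCALE", "RESHAPE", "RESHAPE", "ADD", "MUL"]
graph_v2 = ["CONV2D", "RESCALE", "RESHAPE", "ADD"]

changes = diff_operator_counts(graph_v1, graph_v2)
print(changes)
```

A negative count for `RESHAPE` or `MUL` would suggest the newer graph is simpler, while new operators appearing may point to a changed lowering.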
+ +One valid artifact flow is: + +```output +Custom model format or framework + | +Frontend or compiler conversion + | + TOSA + | +Arm backend compiler or model converter + | +Target-specific artifact +``` + +This matters because TOSA provides a contract between the model frontend and the backend tool. If the TOSA graph has the expected operators, shapes, layouts, and quantized types, the backend compiler or converter has a clearer input to work from. If the graph looks noisy, fragmented, or unexpectedly generic, the issue may be in the frontend conversion, an earlier graph optimization pass, or the backend support boundary. + +## Inspect TOSA for VGF conversion + +TOSA can also feed Vulkan ML workflows. The Arm ML SDK Model Converter takes TOSA as input, applies transforms and optimizations, lowers to SPIR-V graph IR, and packages the result into a VGF file. + +The examples in this section use a small neural upscaling model. It takes a low-resolution image-like tensor and produces a higher-resolution output, which makes it a useful compact example for Vulkan ML and neural graphics workflows. + +These artifacts were generated in the [Quantize neural upscaling models with ExecuTorch](https://learn.arm.com/learning-paths/mobile-graphics-and-gaming/quantize-neural-upscaling-models/) learning path. Go through that learning path if you want to learn how to generate the `.tosa` and `.vgf` files yourself, and how to apply post-training quantization (PTQ) and quantization-aware training (QAT) before export. + +Open the TOSA artifacts used by the VGF examples: + +```output +model-explorer-artifacts/tosa/small_upscaler_ptq.tosa +model-explorer-artifacts/tosa/small_upscaler_qat.tosa +``` + +These small upscaler graphs are INT8 TOSA artifacts. 
In Model Explorer, look for: + +- Input shape `[1, 16, 16, 3]` +- Output shape `[1, 32, 32, 3]` +- `RESIZE` with bilinear mode +- `CONV2D` operations +- `RESCALE` operations from quantized arithmetic +- Similarities and differences between the PTQ and QAT artifacts + +![Screenshot of examining a PTQ small upscaler TOSA artifact in Model Explorer.#center](tosa_ptq.png "Inspecting a PTQ small upscaler TOSA artifact with Model Explorer") + +Do not expect a major visual difference between the PTQ and QAT TOSA graphs. They represent the same small upscaler architecture and were lowered to the same visible TOSA structure: 41 nodes, including one bilinear `RESIZE`, three `CONV2D` operations, four `RESCALE` operations, three `CONST_SHAPE` nodes, and constants for weights and quantization parameters. + +The important difference is how the quantized model parameters were produced. PTQ applies quantization after training, usually using calibration data to choose quantization parameters. It is simpler and faster to apply, so it is often the first option to try. QAT simulates quantization during training, so the model can adapt to quantization effects. It takes more work, but can recover accuracy when PTQ causes too much quality loss. + +In Model Explorer, that difference is not likely to appear as a different graph shape. Look instead at tensor metadata, constants, scales, shifts, zero-points, and any downstream accuracy or runtime behavior. + +These files are useful because they connect the TOSA view to the VGF view. TOSA shows the backend-neutral graph representation. The next section shows the VGF artifact produced for Vulkan ML and neural graphics integration. + +## What you have learned + +You have inspected TOSA as the intermediate representation between model lowering and backend compilation or conversion. 
For Ethos-U, TOSA helps explain why FP32 does not delegate, why INT8 can produce a compact NPU region, and why an inserted unsupported operation can fragment the graph. You have also seen that TOSA is not ExecuTorch-specific and can be produced by other frontends. + +Next, you will inspect VGF artifacts and see what the TOSA-to-VGF conversion produces for Vulkan ML workflows. diff --git a/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/7-inspect-vgf.md b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/7-inspect-vgf.md new file mode 100644 index 0000000000..8915867087 --- /dev/null +++ b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/7-inspect-vgf.md @@ -0,0 +1,103 @@ +--- +title: "Inspect VGF artifacts" + +weight: 7 + +### FIXED, DO NOT MODIFY +layout: "learningpathall" +--- + +## Inspect Vulkan ML artifacts + +VGF is the artifact used by Arm ML SDK for Vulkan workflows. It is useful in Vulkan ML and neural graphics workflows where you need to inspect the graph consumed by the Vulkan ML side of an application. + +In the previous section, you inspected the TOSA artifacts for a small neural upscaling model, obtained from the [Quantize neural upscaling models with ExecuTorch](https://learn.arm.com/learning-paths/mobile-graphics-and-gaming/quantize-neural-upscaling-models/) learning path. In this section, you inspect the VGF and PTE artifacts produced from that flow. + +Although that flow uses ExecuTorch, and hence it generates `.pte` files as well, VGF is not ExecuTorch-specific. If you have a `.tosa` file obtained from another flow, it can be converted to `.vgf` using the Arm ML SDK Model Converter. 
+ +## Compare PTE and VGF views + +There are two related but different Model Explorer views in a VGF workflow: + +| View | Open with | Layer inspected | What to look out for | +| --- | --- | --- | --- | +| VGF-backend `.pte` | PTE adapter | ExecuTorch program and deployment container | Where does the VGF backend call appear? Is there surrounding quantize, dequantize, or CPU work? | +| Standalone `.vgf` | VGF adapter | Vulkan ML backend graph | What operators, tensors, shapes, constants, descriptors, and graph connectivity will the Vulkan ML side consume? | + +Use the `.pte` file when you want to understand how ExecuTorch wraps and calls the VGF backend. Use the `.vgf` file when you want to inspect the Vulkan ML artifact directly. You can also expand the VGF backend delegate from the `.pte` file in Model Explorer, which shows the same graph as opening the `.vgf` file directly. + +## Open the VGF-backend PTE + +Start with the PTQ small upscaler packaged as an ExecuTorch program: + +```output +model-explorer-artifacts/pte/small_upscaler_ptq_vgf.pte +``` + +Inspect the graph and answer: + +- Is there a `VgfBackend` delegate node? +- What work happens before and after the backend call? +- Does the top-level graph show the internal upscaler operators? +- What does this view tell you about the ExecuTorch runtime path? + +![Screenshot of examining a VGF-backed PTE in Model Explorer.#center](ptq_vgf_pte.png "Inspecting a VGF-backed PTE with Model Explorer") + +In Model Explorer, this `.pte` should look compact. You should see a `quantized_decomposed::quantize_per_tensor` node, a single `VgfBackend` delegate node, a `quantized_decomposed::dequantize_per_tensor` node, and graph inputs and outputs. + +This view closely resembles the clean Ethos-U delegation case: aside from the graph inputs and outputs and the quantize and dequantize operators, the model is completely delegated to the backend.
+ +Click to expand the `VgfBackend` delegate graph: + +![Screenshot of examining the VGF backend subgraph inside a PTE in Model Explorer.#center](ptq_vgf_pte_subgraph.png "Inspecting the VGF backend subgraph inside a PTE") + +Now open the matched standalone VGF artifact: + +```output +model-explorer-artifacts/vgf/small_upscaler_ptq.vgf +``` + +It shows the same graph as the expanded `VgfBackend` delegate. + +Inspect the graph and answer: + +- What input tensor shape and Vulkan tensor format are shown? +- What output shape does the graph produce? +- Do you see the `Resize` and `Conv2D` structure from the TOSA graph? +- Where do `Rescale` operations appear? +- Which details are visible here that were hidden behind the `VgfBackend` node in the `.pte` view? + +The VGF graph shows the backend-level structure consumed by the Vulkan ML workflow. For the PTQ upscaler, you should see a small graph with the same high-level structure you saw in TOSA: + +- `Resize` +- `Rescale` +- `Conv2D` +- `Rescale` +- `Conv2D` +- `Rescale` +- `Conv2D` +- `Rescale` + +You should also see Vulkan tensor descriptor nodes such as `VK_DESCRIPTOR_TYPE_TENSOR_ARM`. The input descriptor has shape `[1, 16, 16, 3]` and format `VK_FORMAT_R8_SINT`. The graph produces an INT8 output with shape `[1, 32, 32, 3]`. + +This is the view to use when you care about the Vulkan ML integration contract: tensor shapes, tensor formats, graph connectivity, quantized operators, and backend-visible layout choices. + +## Look at other artifacts + +The repository also provides the quantization-aware training (QAT) version of the small upscaler, and a toy `add_sigmoid` model (`.pte` and `.vgf`) from the [Prepare models for neural graphics](https://learn.arm.com/learning-paths/mobile-graphics-and-gaming/preparing-models-for-nt/) learning path.
+ +```output +model-explorer-artifacts/pte/small_upscaler_qat_vgf.pte +model-explorer-artifacts/vgf/small_upscaler_qat.vgf +model-explorer-artifacts/pte/add_sigmoid_vgf.pte +model-explorer-artifacts/vgf/add_sigmoid.vgf +``` +You can use Model Explorer to inspect these graphs in the same way. + +## What you have learned + +You have inspected the same VGF workflow from two angles. The `.pte` view shows the ExecuTorch program and where it calls the VGF backend. The standalone `.vgf` view shows the Vulkan ML backend graph: tensor descriptors, graph connectivity, operator structure, quantization-related rescale operations, and the input/output contract used by a neural graphics application. + +At this point, you have used Model Explorer to inspect the main static artifacts in this learning path: `.pte` files for the deployed ExecuTorch program, `.tosa` files for backend-ready intermediate graphs, and `.vgf` files for Vulkan ML integration. These views answer what was exported, lowered, converted, and packaged. + +The final section adds runtime profiling context. You will load ETRecord and ETDump data to see how exported graph structure connects to measured operator and delegate events during execution. diff --git a/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/8-etrecord-etdump-overlays.md b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/8-etrecord-etdump-overlays.md new file mode 100644 index 0000000000..a31a755b2a --- /dev/null +++ b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/8-etrecord-etdump-overlays.md @@ -0,0 +1,291 @@ +--- +title: "Inspect ETRecord and ETDump overlays" + +weight: 8 + +### FIXED, DO NOT MODIFY +layout: "learningpathall" +--- + +## View ExecuTorch runtime profiling data + +In the previous sections, you inspected static artifacts. PTE, TOSA, and VGF views help you answer what was exported, lowered, compiled, converted, or packaged. 
+ +Runtime profiling answers a different set of questions. It tells you what happened when the artifact ran on a specific runtime, runner, target hardware, and tracing configuration. + +In this final section, you use the ExecuTorch extension for Model Explorer to view profiling data overlaid onto the model graph. To do this, the extension reads ETRecord and ETDump files. This closes the loop: you start from graph inspection and end by connecting that graph structure to measured runtime behavior. + +## ETRecord and ETDump + +In ExecuTorch, ETRecord is the static side of the profiling workflow. It is produced at export time and preserves graph, operator, debug handle, and delegate partition metadata. This is the information that lets a runtime event map back to a graph node. + +ETDump is the runtime side of the profiling workflow. It is produced while executing the model with ExecuTorch event tracing enabled. It can contain events such as `Method::execute`, `OPERATOR_CALL`, `DELEGATE_CALL`, backend-specific events, start times, durations, and cycle counts. + +The two artifacts are most useful together: + +| Artifact | Layer inspected | What it adds | +| --- | --- | --- | +| `.etrecord` | Export-time graph context | Graph structure, debug handles, operator names, and delegate partitions | +| `.etdp` | Runtime event trace | Timing data from a specific execution | +| `.pte` | ExecuTorch program | The packaged program and backend/delegate structure | + +Without a matching ETRecord, an ETDump can still contain useful timing data. With the ETRecord loaded as graph context, the timing data becomes much easier to interpret because runtime measurements can be overlaid on the exported graph. + +## Generate ETRecord and ETDump + +ETRecord and ETDump are created at different points in the ExecuTorch workflow: + +- Generate the ETRecord when you export or lower the model. +- Generate the ETDump when you run the exported `.pte` program with event tracing enabled. 
+- Analyze them together with the ExecuTorch Inspector, or load them together in Model Explorer when using the ETRecord and ETDump extensions. + +This learning path provides `.etrecord` and `.etdp` files for you to use, but if you are interested in learning how to generate your own, a brief overview is covered here, including links to the relevant documentation. The [Profile ExecuTorch models with SME2 on Arm](https://learn.arm.com/learning-paths/cross-platform/sme-executorch-profiling/) learning path may also be of interest. + +The [ExecuTorch ETRecord documentation](https://docs.pytorch.org/executorch/stable/etrecord.html) describes ETRecord as an ahead-of-time debug artifact. It contains the Edge dialect graph, debug handles, and delegate debug maps that allow runtime data to be linked back to graph nodes and, when available, Python source information. + +For recent ExecuTorch versions, the recommended pattern is to enable ETRecord generation during export or lowering, then retrieve the ETRecord from the resulting program manager: + +```python +from executorch.exir.program import to_edge +from torch.export import export + +exported_program = export(model, example_inputs) + +edge_program = to_edge( + exported_program, + generate_etrecord=True, +) + +# Apply backend partitioning or lowering here if your flow uses it. +executorch_program = edge_program.to_executorch() + +with open("model.pte", "wb") as f: + f.write(executorch_program.buffer) + +etrecord = executorch_program.get_etrecord() +etrecord.save("model.etrecord") +``` + +The [ExecuTorch ETDump documentation](https://docs.pytorch.org/executorch/stable/etdump.html) describes ETDump as the runtime trace artifact. To produce it from a native runner, build ExecuTorch with developer tools and event tracing enabled: + +```bash +cmake \ + -DEXECUTORCH_BUILD_DEVTOOLS=ON \ + -DEXECUTORCH_ENABLE_EVENT_TRACER=ON \ + ... 
+``` + +If you use a runner such as `executor_runner`, ETDump generation is usually exposed as a command-line option: + +```bash +./executor_runner \ + -model_path model.pte \ + -etdump_path model.etdp \ + -num_executions 1 +``` + +If you are integrating ETDump into your own C++ runner, create an `ETDumpGen`, pass it into runtime loading or the Module API, execute the model, then write the returned buffer to a file: + +```cpp +#include <cstdlib> +#include <fstream> +#include <memory> + +#include <executorch/devtools/etdump/etdump_flatcc.h> +#include <executorch/extension/module/module.h> + +using executorch::etdump::ETDumpGen; +using executorch::extension::Module; + +Module module( + "model.pte", + Module::LoadMode::Mmap, + std::make_unique<ETDumpGen>()); + +// Execute the model, for example with module.forward(...). + +if (auto* etdump = dynamic_cast<ETDumpGen*>(module.event_tracer())) { + const auto trace = etdump->get_etdump_data(); + + if (trace.buf && trace.size > 0) { + std::unique_ptr<void, decltype(&free)> guard(trace.buf, free); + std::ofstream file("model.etdp", std::ios::binary); + + if (file) { + file.write(static_cast<const char*>(trace.buf), trace.size); + } + } +} +``` + +For Python runtime experiments, ExecuTorch can also emit ETDump when loading a program with `enable_etdump=True`: + +```python +from pathlib import Path + +from executorch.runtime import Runtime + +runtime = Runtime.get() +program = runtime.load_program( + Path("model.pte"), + enable_etdump=True, + debug_buffer_size=int(1e7), +) + +forward = program.load_method("forward") +outputs = forward.execute(inputs) + +program.write_etdump_result_to_file("model.etdp", "debug_output.bin") +``` + +After generating the files, you can check them with the [ExecuTorch Inspector API](https://docs.pytorch.org/executorch/stable/model-inspector.html): + +```python +from executorch.devtools import Inspector + +inspector = Inspector( + etdump_path="model.etdp", + etrecord="model.etrecord", +) +inspector.print_data_tabular() +``` + +If you pass only an ETDump to the Inspector, you still get runtime events.
If you also pass the matching ETRecord, the Inspector can correlate events with graph operators and delegate metadata. + +## Load profiling overlays + +The ExecuTorch extension for Model Explorer contains an ETRecord adapter and an ETDump data provider. The ETRecord adapter opens the exported graph. The first version of the ETDump data provider overlays runtime timing data on top of that graph. + +Open the `.etrecord` first, then add the matching `.etdp` profiling data. Keep the pairs together: an ETDump from one export can be misleading if it is overlaid on a different ETRecord. + +The workflow is: + +```output +Open the ETRecord + | +Inspect export-time graph context + | +Load the matching ETDump profiling data + | +Overlay runtime timing data + | +Connect graph structure to runtime cost +``` + +## Inspect a portable kernel CPU profile + +Start with a portable CPU run of OPT-125M: + +```output +model-explorer-artifacts/etrecord/opt125m_portable.etrecord +model-explorer-artifacts/etdump/opt125m_portable.etdp +``` + +Inspect the graph and profiling overlay, then answer: + +- Are there any delegate partitions? +- Is most of the time in native `OPERATOR_CALL` events? +- Which repeated operators dominate the profile? +- How does the runtime view compare with the portable `.pte` view you inspected earlier? + +![Screenshot of examining portable OPT-125M ETRecord and ETDump overlays in Model Explorer.#center](portable_profile.png "Inspecting portable OPT-125M runtime overlays") + +This is a pure portable CPU run. The ETDump contains about 1,199 events, with `Method::execute` around 9,082 ms. There are no delegate calls. Almost all runtime is in native calls, with repeated `aten.addmm` events forming the main hotspot. + +Use this profile as the baseline. The graph has no accelerated delegate region to inspect, so the overlay points directly at CPU operator execution. 
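To answer which repeated operators dominate a profile, it can help to sum durations per operator name. The sketch below does that over a list of `(name, duration_ms)` pairs. The event values are made-up sample data, not the numbers from the provided ETDump, and `top_operators` is a hypothetical helper rather than an ExecuTorch API.

```python
from collections import defaultdict


def top_operators(events, limit=3):
    """Sum durations per event name and return the largest totals first."""
    totals = defaultdict(float)
    for name, duration_ms in events:
        totals[name] += duration_ms
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:limit]


# Made-up events for illustration only.
sample_events = [
    ("aten.addmm", 40.0),
    ("aten.addmm", 42.0),
    ("aten.native_layer_norm", 3.0),
    ("aten.bmm", 6.0),
    ("aten.addmm", 38.0),
]

ranking = top_operators(sample_events)
print(ranking)
```

The same grouping idea applies whether the durations are in milliseconds, as in the Cortex-A profiles, or in cycles, as in the Ethos-U profiles later in this section.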
+ +## Compare with XNNPACK delegation + +Now open the XNNPACK version of the same model: + +```output +model-explorer-artifacts/etrecord/opt125m_xnnpack.etrecord +model-explorer-artifacts/etdump/opt125m_xnnpack.etdp +``` + +Compare it with the portable profile: + +- Do you see `XnnpackBackend` delegate events? +- How much of the total runtime is inside `DELEGATE_CALL` events? +- Which native operators still run outside the delegate? +- How much faster is this run than the portable CPU baseline? + +![Screenshot of comparing XNNPACK OPT-125M ETRecord and ETDump overlays in Model Explorer.#center](xnnpack_profile.png "Inspecting XNNPACK OPT-125M runtime overlays") + +The XNNPACK ETDump contains about 813 events, with `Method::execute` around 125 ms. The profile includes 98 delegate calls to `XnnpackBackend`. Delegate calls account for about 116.7 ms, while native calls account for about 8.5 ms. + +This shows a clean CPU delegate acceleration pattern. The model still has some native work, but the expensive compute has moved into XNNPACK delegate calls. Compared with the portable run at about 9,082 ms, the runtime profile makes the acceleration effect visible. + +## Inspect the FP32 Ethos-U example + +Next, inspect the MobileNetV2 FP32 example. 
This export targeted Ethos-U, but the model falls back to CPU execution on Cortex-M because Ethos-U requires INT8 quantization: + +```output +model-explorer-artifacts/etrecord/mobilenetv2_fp32_ethosu.etrecord +model-explorer-artifacts/etdump/mobilenetv2_fp32_ethosu.etdp +``` + +Look for: + +- No `EthosUBackend` delegate calls +- Native `aten.convolution.default` events +- A large `Method::execute` total +- Runtime behavior that matches the FP32 `.pte` fallback view from the Ethos-U section + +![Screenshot of examining an FP32 MobileNetV2 ETRecord and ETDump overlay in Model Explorer.#center](fp32_profile.png "Inspecting FP32 MobileNetV2 runtime overlays") + +The ETDump contains about 309 events, with `Method::execute` around 395 million cycles. There are no delegate calls, and the native call sum accounts for almost all of the runtime. The profile is dominated by CPU-side convolution work. + +## Inspect clean Ethos-U delegation + +Now open the regular INT8 MobileNetV2 Ethos-U profile: + +```output +model-explorer-artifacts/etrecord/mobilenetv2_int8_ethosu.etrecord +model-explorer-artifacts/etdump/mobilenetv2_int8_ethosu.etdp +``` + +Inspect the overlay and answer: + +- Is there one `EthosUBackend` delegate call? +- What work remains outside the delegate? +- Does the largest runtime cost come from the NPU event or from CPU-side quantization? +- How does this compare with the clean INT8 `.pte` view? + +![Screenshot of examining a clean INT8 Ethos-U ETRecord and ETDump overlay in Model Explorer.#center](ethos_int8_profile.png "Inspecting clean INT8 Ethos-U runtime overlays") + +This is the clean Ethos-U delegation case. The ETDump is small, with about 13 events. `Method::execute` is around 6.59 million cycles. There is one `EthosUBackend` delegate call, and only a small number of native calls. + +The important observation is that successful delegation does not mean every runtime cost is inside the accelerator.
In this profile, the visible `DELEGATE_CALL` is about 96.9 thousand cycles, while CPU-side `quantize_per_tensor` accounts for about 6.47 million cycles. The graph is cleanly delegated, but the measured runtime is dominated by work around the delegate boundary. + +## Inspect fragmented Ethos-U delegation + +Finally, inspect the fragmented INT8 MobileNetV2 profile: + +```output +model-explorer-artifacts/etrecord/mobilenetv2_lrn_int8_ethosu.etrecord +model-explorer-artifacts/etdump/mobilenetv2_lrn_int8_ethosu.etdp +``` + +Compare it with the clean INT8 profile: + +- Are there two `EthosUBackend` delegate calls instead of one? +- Which CPU operators appear between or around the delegate regions? +- How large is the CPU fallback cost compared with the NPU cost? +- Do quantize and dequantize events appear around delegate boundaries? + +![Screenshot of examining fragmented INT8 Ethos-U ETRecord and ETDump overlays in Model Explorer.#center](ethos_lrn_profile.png "Inspecting fragmented INT8 Ethos-U runtime overlays") + +This profile shows why "delegated" is not always enough. The ETDump contains about 29 events, with two `EthosUBackend` delegate calls. The delegate calls are small compared with the overall profile, at about 105 thousand cycles and 47 thousand cycles. + +The overall `Method::execute` total is about 1.70 billion cycles because a large CPU `aten.convolution.default` island accounts for about 1.67 billion cycles. The profile also shows quantize and dequantize work around the delegate boundaries, including `dequantize_per_channel` at about 21.6 million cycles and `quantize_per_tensor` at about 6.47 million cycles. + +This is the runtime version of the fragmentation pattern you saw in the `.pte` and TOSA sections. The graph contains Ethos-U delegate regions, but unsupported or poorly placed CPU work dominates execution. + +## What you have learned + +ETRecord and ETDump add runtime context to static graphs. ETRecord gives you the graph and debug metadata. 
In the first Model Explorer overlay used in this Learning Path, ETDump contributes runtime timing data. Used together with the adapter and data provider in the new Model Explorer ExecuTorch extension, they show which parts of the graph actually cost time on the target. + +The OPT-125M profiles made the CPU case clear: the portable run stayed on native operators, while the XNNPACK run moved most of the work into delegate calls. The MobileNetV2 profiles showed the same pattern for Ethos-U. CPU fallback, clean Ethos-U delegation, and fragmented delegation are much easier to tell apart once the runtime data is overlaid on the exported graph. + +You have now completed the full artifact-inspection flow in this Learning Path. You started with `.pte` files to understand deployed ExecuTorch programs, moved through TOSA and VGF to inspect backend and Vulkan ML artifacts, and finished with ETRecord and ETDump overlays to connect the graph to runtime cost. The next steps page points to deeper workflows for generating, running, profiling, and optimizing your own models. diff --git a/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/_index.md b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/_index.md new file mode 100644 index 0000000000..51cc097e69 --- /dev/null +++ b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/_index.md @@ -0,0 +1,94 @@ +--- +title: Visualize ExecuTorch PTE, TOSA, VGF, ETRecord, and ETDump artifacts with Google's Model Explorer + +draft: true +cascade: + draft: true + +description: Learn how to inspect ExecuTorch PTE, TOSA, VGF, ETRecord, and ETDump model artifacts with Google Model Explorer and Arm adapters.
+ +minutes_to_complete: 90 + +who_is_this_for: This learning path is for Edge AI developers who need to inspect model artifacts after backend delegation, understand graph structure and delegate coverage, and use those insights to reason about performance and behavior. + +learning_objectives: + - Explain what Google Model Explorer is and how adapters add support for Arm model artifacts + - Install Model Explorer, launch it with the PTE, TOSA, and VGF adapters, and use the runtime overlay extension for ETRecord and ETDump + - Open ExecuTorch .pte files and compare portable CPU, XNNPACK CPU, and Ethos-U artifacts + - Use PTE visualization to reason about delegate regions, CPU fallback, graph fragmentation, and backend-specific changes + - Inspect TOSA flatbuffers as an intermediate representation used by Arm compiler and backend workflows + - Inspect VGF artifacts for Vulkan ML and neural graphics workloads + - Use ETRecord and ETDump overlays to connect exported graph structure with runtime profiling data + +prerequisites: + - Python 3.10 or later + - Basic familiarity with PyTorch, ExecuTorch, or model deployment workflows + +author: + - Matt Cossins + +### Tags +skilllevels: Introductory +subjects: ML +armips: + - Cortex-A + - Cortex-M + - Ethos-U + - Mali +tools_software_languages: + - Model Explorer + - ExecuTorch + - PyTorch + - Python + - TOSA + - VGF + - ETRecord + - ETDump + +operatingsystems: + - Linux + - macOS + - Windows + +shared_path: true +shared_between: + - embedded-and-microcontrollers + - mobile-graphics-and-gaming + - ai + +further_reading: + - resource: + title: Google Model Explorer + link: https://github.com/google-ai-edge/model-explorer + type: repository + - resource: + title: PTE adapter for Model Explorer + link: https://github.com/arm/pte-adapter-model-explorer + type: repository + - resource: + title: TOSA adapter for Model Explorer + link: https://github.com/arm/tosa-adapter-model-explorer + type: repository + - resource: + title: VGF 
adapter for Model Explorer + link: https://github.com/arm/vgf-adapter-model-explorer + type: repository + - resource: + title: ExecuTorch documentation + link: https://docs.pytorch.org/executorch/stable/index.html + type: documentation + - resource: + title: ExecuTorch ETRecord + link: https://docs.pytorch.org/executorch/stable/etrecord.html + type: documentation + - resource: + title: ExecuTorch ETDump + link: https://docs.pytorch.org/executorch/stable/etdump.html + type: documentation + +### FIXED, DO NOT MODIFY +# ================================================================================ +weight: 1 +layout: "learningpathall" +learning_path_main_page: "yes" +--- diff --git a/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/_next-steps.md b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/_next-steps.md new file mode 100644 index 0000000000..88b9543b9e --- /dev/null +++ b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/_next-steps.md @@ -0,0 +1,46 @@ +--- +# ================================================================================ +# FIXED, DO NOT MODIFY THIS FILE +# ================================================================================ +weight: 21 +title: "Next Steps" +layout: "learningpathall" +--- + +## PTE next steps + +Generate `.pte` files from your own models and inspect the exported graph before you start measuring performance. + +Use the XNNPACK delegate documentation to generate your own XNNPACK `.pte` from a model you care about. Then inspect both the portable and XNNPACK artifacts in Model Explorer. + +For CPU runtime profiling on modern Arm systems, continue with [Profile ExecuTorch models with SME2 on Arm](/learning-paths/cross-platform/sme-executorch-profiling/). + +## Ethos-U next steps + +Generate Ethos-U `.pte` files from your own quantized models and inspect delegation coverage. 
+ +If you are comparing Ethos-U55 and Ethos-U85, generate target-specific artifacts for each target. Do not infer one target's graph behavior from the other target's artifact. + +For a board-focused ExecuTorch workflow, continue with [Deploy ExecuTorch firmware on NXP FRDM i.MX 93 for Ethos-U65 acceleration](/learning-paths/embedded-and-microcontrollers/observing-ethos-u-on-nxp/). + +## TOSA next steps + +Generate TOSA from your own model export or compiler flow and inspect it before backend compilation. + +Use TOSA inspection when debugging custom frontend, ONNX-to-TOSA, or framework-dialect-to-TOSA conversion flows. Continue to Vela or another backend compiler after the TOSA graph looks correct. + +## VGF next steps + +Use the VGF adapter to validate tensor contracts before integrating a graph with a Vulkan ML or neural graphics application. + +Continue with the related neural graphics Learning Paths: + +- [Train neural graphics models with Model Gym](/learning-paths/mobile-graphics-and-gaming/model-training-gym/) +- [Quantize neural upscaling models](/learning-paths/mobile-graphics-and-gaming/quantize-neural-upscaling-models/) +- [Prepare models for neural technology](/learning-paths/mobile-graphics-and-gaming/preparing-models-for-nt/) + +## Runtime profiling next steps + +Use ExecuTorch Inspector APIs with ETDump and ETRecord to connect runtime events to graph structure outside Model Explorer. + +Use ETRecord and ETDump together when you want to move from "what did I compile?" to "what actually cost time?" If the runtime overlay points to unexpected CPU fallback, graph fragmentation, or delegate-boundary overhead, return to the `.pte`, `.tosa`, or `.vgf` views to inspect why. 
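
As a worked example of moving from "what did I compile?" to "what actually cost time?", the approximate figures reported in the runtime profiling section can be turned into shares of `Method::execute`. The following is a minimal sketch: the numbers are the rounded values quoted in this Learning Path, and the `share` helper and dictionary layout are illustrative assumptions, not output from any ExecuTorch tool.

```python
def share(part: float, total: float) -> float:
    """Return part as a percentage of total."""
    return 100.0 * part / total


# Approximate values quoted in the profiling section (rounded in the text).
opt125m_ms = {"portable_total": 9082.0, "xnnpack_total": 125.0,
              "xnnpack_delegate": 116.7, "xnnpack_native": 8.5}

# MobileNetV2 Ethos-U profiles, in cycles.
clean_int8 = {"total": 6.59e6, "delegate": 96.9e3, "quantize": 6.47e6}
fragmented_int8 = {"total": 1.70e9, "convolution": 1.67e9,
                   "delegates": 105e3 + 47e3}

# How much faster the XNNPACK run is than the portable baseline.
xnnpack_speedup = opt125m_ms["portable_total"] / opt125m_ms["xnnpack_total"]

# Share of each run spent in the event named in the text.
xnnpack_delegate_share = share(opt125m_ms["xnnpack_delegate"],
                               opt125m_ms["xnnpack_total"])
clean_quantize_share = share(clean_int8["quantize"], clean_int8["total"])
clean_delegate_share = share(clean_int8["delegate"], clean_int8["total"])
fragmented_conv_share = share(fragmented_int8["convolution"],
                              fragmented_int8["total"])

if __name__ == "__main__":
    print(f"XNNPACK speedup over portable: {xnnpack_speedup:.1f}x")
    print(f"XNNPACK delegate share: {xnnpack_delegate_share:.1f}%")
    print(f"Clean INT8 quantize share: {clean_quantize_share:.1f}%")
    print(f"Clean INT8 delegate share: {clean_delegate_share:.1f}%")
    print(f"Fragmented INT8 convolution share: {fragmented_conv_share:.1f}%")
```

Because the inputs are rounded, treat the results as order-of-magnitude guidance. In both Ethos-U profiles the largest share belongs to CPU-side work rather than the delegate call, which is the cue to return to the `.pte` or `.tosa` view.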
diff --git a/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/cortex_m_inspect.png b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/cortex_m_inspect.png new file mode 100644 index 0000000000..e24db386b7 Binary files /dev/null and b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/cortex_m_inspect.png differ diff --git a/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/cortex_m_top.png b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/cortex_m_top.png new file mode 100644 index 0000000000..43ec7b01bb Binary files /dev/null and b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/cortex_m_top.png differ diff --git a/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/ethos-u-int8-fragmented.png b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/ethos-u-int8-fragmented.png new file mode 100644 index 0000000000..986b808fcf Binary files /dev/null and b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/ethos-u-int8-fragmented.png differ diff --git a/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/ethos_fp32.png b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/ethos_fp32.png new file mode 100644 index 0000000000..1008ca99ff Binary files /dev/null and b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/ethos_fp32.png differ diff --git a/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/ethos_int8_profile.png b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/ethos_int8_profile.png new file mode 100644 index 0000000000..e5e54b6f78 Binary files /dev/null and 
b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/ethos_int8_profile.png differ diff --git a/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/ethos_lrn_profile.png b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/ethos_lrn_profile.png new file mode 100644 index 0000000000..4770f633da Binary files /dev/null and b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/ethos_lrn_profile.png differ diff --git a/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/ethosu-int8-clean.png b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/ethosu-int8-clean.png new file mode 100644 index 0000000000..aa745a1cb8 Binary files /dev/null and b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/ethosu-int8-clean.png differ diff --git a/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/fp32_profile.png b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/fp32_profile.png new file mode 100644 index 0000000000..cff707d7ff Binary files /dev/null and b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/fp32_profile.png differ diff --git a/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/model_explorer.png b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/model_explorer.png new file mode 100644 index 0000000000..0e0a5e206f Binary files /dev/null and b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/model_explorer.png differ diff --git a/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/portable.png b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/portable.png new file mode 100644 index 0000000000..1d25072af5 
Binary files /dev/null and b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/portable.png differ diff --git a/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/portable_profile.png b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/portable_profile.png new file mode 100644 index 0000000000..54f2117a86 Binary files /dev/null and b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/portable_profile.png differ diff --git a/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/ptq_vgf_pte.png b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/ptq_vgf_pte.png new file mode 100644 index 0000000000..c5e9bd7480 Binary files /dev/null and b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/ptq_vgf_pte.png differ diff --git a/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/ptq_vgf_pte_subgraph.png b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/ptq_vgf_pte_subgraph.png new file mode 100644 index 0000000000..5b82ead409 Binary files /dev/null and b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/ptq_vgf_pte_subgraph.png differ diff --git a/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/tosa_ethos_fp32.png b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/tosa_ethos_fp32.png new file mode 100644 index 0000000000..43c6f6bfb2 Binary files /dev/null and b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/tosa_ethos_fp32.png differ diff --git a/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/tosa_ethos_int8.png b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/tosa_ethos_int8.png new file mode 
100644 index 0000000000..2f6b15ca99 Binary files /dev/null and b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/tosa_ethos_int8.png differ diff --git a/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/tosa_ethos_int8_frag_1.png b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/tosa_ethos_int8_frag_1.png new file mode 100644 index 0000000000..5fe0d127f9 Binary files /dev/null and b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/tosa_ethos_int8_frag_1.png differ diff --git a/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/tosa_ethos_int8_frag_2.png b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/tosa_ethos_int8_frag_2.png new file mode 100644 index 0000000000..d1a045a1b2 Binary files /dev/null and b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/tosa_ethos_int8_frag_2.png differ diff --git a/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/tosa_ptq.png b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/tosa_ptq.png new file mode 100644 index 0000000000..63b252c0f3 Binary files /dev/null and b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/tosa_ptq.png differ diff --git a/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/xnnpack.png b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/xnnpack.png new file mode 100644 index 0000000000..293353f4ac Binary files /dev/null and b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/xnnpack.png differ diff --git a/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/xnnpack_profile.png 
b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/xnnpack_profile.png new file mode 100644 index 0000000000..c1101f64e9 Binary files /dev/null and b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/xnnpack_profile.png differ diff --git a/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/xnnpack_subgraph.png b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/xnnpack_subgraph.png new file mode 100644 index 0000000000..21a719814d Binary files /dev/null and b/content/learning-paths/cross-platform/explore-model-artifacts-with-model-explorer/xnnpack_subgraph.png differ