4 changes: 2 additions & 2 deletions .github/copilot-instructions.md
@@ -192,9 +192,9 @@ This pattern ensures proper encoding, timestamps, and file attributes are handle

## CI / Build Investigation

-**dotnet/android's primary CI runs on Azure DevOps (internal), not GitHub Actions.** When a user asks about CI status, CI failures, why a PR is blocked, or build errors:
+**dotnet/android PR validation runs on the public Azure DevOps `dotnet-android` pipeline on `dnceng-public`, not GitHub Actions.** When a user asks about CI status, CI failures, why a PR is blocked, or build errors:

-1. **ALWAYS invoke the `ci-status` skill first** — do NOT rely on `gh pr checks` alone. GitHub checks may all show ✅ while the internal Azure DevOps build is failing.
+1. **ALWAYS invoke the `ci-status` skill first.** The pipeline surfaces as ~39 `dotnet-android (...)` GitHub checks, but the skill adds build progress, ETA, per-stage failures, and failed-test names that `gh pr checks` alone doesn't give you.
2. The skill auto-detects the current PR from the git branch when no PR number is given.
3. For deep .binlog analysis, use the `azdo-build-investigator` skill.
4. Only after the skill confirms no Azure DevOps failures should you report CI as passing.
1 change: 0 additions & 1 deletion .github/skills/android-reviewer/SKILL.md
@@ -57,7 +57,6 @@ Review the CI results. **Never post ✅ LGTM if any required CI check is failing
- Investigate the failure using the **azdo-build-investigator** skill (for Azure DevOps pipeline failures) or GitHub Actions job logs.
- If the failure is caused by the PR's code changes, flag it as ❌ error.
- If the failure is a known infrastructure issue or pre-existing flake unrelated to the PR, note it in the summary but still use ⚠️ Needs Changes — the PR isn't mergeable until CI is green.
-- If **all public CI checks pass** but only the internal `Xamarin.Android-PR` check is failing, still use ⚠️ Needs Changes with a note that the internal pipeline may need a re-run. Do not give ✅ LGTM.
- If the PR description acknowledges the failure and documents a dependency (e.g., "blocked on X"), note it in the summary.

### 5. Load review rules
357 changes: 126 additions & 231 deletions .github/skills/ci-status/SKILL.md

Large diffs are not rendered by default.

88 changes: 88 additions & 0 deletions .github/skills/ci-status/references/azdo-queries.md
@@ -0,0 +1,88 @@
# AZDO queries (dnceng-public)

Deeper `az` commands for the `dotnet-android` build, beyond the core ones in SKILL.md. Shared setup:

```bash
ORG=https://dev.azure.com/dnceng-public; PROJECT=public
RES=499b84ac-1321-427f-aa17-267ca6975798 # Azure DevOps app id, for `az rest --resource`
```

`az devops invoke` works unauthenticated in the `build` area. In the `test` area only `--resource runs` is broken (it returns 404 on dnceng-public), so `runs` and `ResultsByBuild` go through `az rest`; other `test` resources such as `--resource results` work fine. `az rest` and artifact/log downloads need `az login`.
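Because `az rest` and the artifact/log downloads fail without a session, a small guard can check for one before running them. A minimal sketch: `require_az_login` is a hypothetical helper name, and it relies only on `az account show` exiting non-zero when no account is logged in.

```bash
# Hypothetical helper: fail early when the az CLI has no logged-in account.
# `az account show` exits non-zero if `az login` has not been run.
require_az_login() {
  if ! az account show --output none >/dev/null 2>&1; then
    echo "az rest / artifact downloads need a session: run 'az login' first" >&2
    return 1
  fi
}
```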

## ETA for an in-progress build

Build duration is dominated by hosted-agent queue time: every run has the same ~38 jobs, yet wall-clock time ranges from ~50 min to 3+ h. Pull recent green runs of definition `333`, take the **median** duration, and compute `ETA = startTime + median`; present the result as a rough window, not a precise time.

```bash
az devops invoke --area build --resource builds --org $ORG \
--route-parameters project=$PROJECT \
--query-parameters "definitions=333&statusFilter=completed&resultFilter=succeeded&\$top=10" \
--query "value[].{start:startTime, finish:finishTime}" -o json
```
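The median-and-ETA arithmetic can be sketched in shell. This is an illustration, not part of the skill: `median_minutes` and `eta_utc` are hypothetical helper names, the input is the same query switched to `-o tsv` (one `startTime<TAB>finishTime` pair per line), and the date parsing assumes GNU `date -d` (on macOS, `gdate` from coreutils).

```bash
# Hypothetical helpers for the ETA estimate; assume GNU `date -d`.

# Read "startTime finishTime" pairs (tab-separated, one build per line) on
# stdin and print the median build duration in whole minutes.
median_minutes() {
  while read -r start finish; do
    [ -n "$start" ] && [ -n "$finish" ] || continue
    echo $(( ( $(date -u -d "$finish" +%s) - $(date -u -d "$start" +%s) ) / 60 ))
  done | sort -n | awk '
    { d[NR] = $1 }
    END {
      if (NR == 0) exit 1   # no finished builds to compare
      # odd count: middle value; even count: mean of the two middle values
      if (NR % 2) print d[(NR + 1) / 2]; else print int((d[NR / 2] + d[NR / 2 + 1]) / 2)
    }'
}

# ETA = startTime of the in-progress build + median minutes, printed in UTC.
eta_utc() {  # $1 = startTime (ISO 8601), $2 = median minutes
  start_epoch=$(date -u -d "$1" +%s)
  date -u -d "@$(( start_epoch + $2 * 60 ))" +%Y-%m-%dT%H:%M:%SZ
}

# Usage (START_TIME = the in-progress build's startTime):
#   MED=$(az devops invoke --area build --resource builds --org $ORG \
#     --route-parameters project=$PROJECT \
#     --query-parameters "definitions=333&statusFilter=completed&resultFilter=succeeded&\$top=10" \
#     --query "value[].[startTime, finishTime]" -o tsv | median_minutes)
#   eta_utc "$START_TIME" "$MED"
```

The median, rather than the mean, keeps one unusually long queue wait from dragging the estimate out.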

## Failed-test error message / stack trace

`ResultsByBuild` (see SKILL.md) gives the failed tests' names and `runId`s. For error messages, list each run's failed results instead; the single-result-by-`testId` route returns null here. Repeat per distinct `runId`:

```bash
az devops invoke --area test --resource results --org $ORG \
--route-parameters project=$PROJECT runId=$RUN_ID \
--query-parameters "outcomes=Failed&\$top=20" \
--query "value[].{test:testCaseTitle, error:errorMessage, stack:stackTrace}" -o json
```
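To cover every failed test, the same call can be looped over the distinct `runId`s. A sketch: `failed_messages_per_run` is a hypothetical helper that takes the run IDs as arguments (for example, copied from the `ResultsByBuild` output), skips duplicates, and expects `ORG` and `PROJECT` from the shared setup.

```bash
# Hypothetical helper: fetch failed-result messages once per distinct runId.
# $@ = runIds (duplicates allowed; each run is queried only once).
failed_messages_per_run() {
  printf '%s\n' "$@" | sort -u | while read -r run_id; do
    [ -n "$run_id" ] || continue
    echo "== runId $run_id"
    az devops invoke --area test --resource results --org "$ORG" \
      --route-parameters project="$PROJECT" runId="$run_id" \
      --query-parameters "outcomes=Failed&\$top=20" \
      --query "value[].{test:testCaseTitle, error:errorMessage, stack:stackTrace}" -o json
  done
}
```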

## Per-flavor test breakdown — fields & run → job mapping

The breakdown in SKILL.md fetches `/tmp/runs.json` from `/_apis/test/runs?...&includeRunDetails=true`. Field meanings per run (one run = one test *flavor*, e.g. `Mono.Android.NET_Tests-NativeAOT`):

| Field | Source | Meaning |
|-------|--------|---------|
| `total` | `totalTests` | all tests in the run |
| `passed` | `passedTests` | passed |
| `failed` | `unanalyzedTests` | failed/aborted |
| `skipped` | `notApplicableTests` | skipped / inconclusive |
| `phase` | `pipelineReference.phaseReference.phaseName` | the pipeline phase the run belongs to |

`run.phase` equals a timeline **Phase** record's `refName`; that record's `name` is the human lane — e.g. `mac_apk_tests_net_2` → `macOS > Tests > APKs 2`. That join (`runs` × timeline phases) is what the breakdown `jq` does. **Matrix lanes that share one phase** (e.g. all `MSBuild+Emulator N` jobs are phase `mac_dotnetdevice_tests`) aggregate into a single breakdown block — use the per-job timing table to see which numbered job actually failed/timed out.
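A minimal `jq` sketch of that join, not the skill's actual breakdown script. Assumptions: `/tmp/runs.json` holds the `runs` response described above (a `value` array), the build timeline (`/_apis/build/builds/$BUILD_ID/timeline`) was saved to a file of your choosing, and `runs_by_lane` is a hypothetical helper name.

```bash
# Hypothetical helper: label each test run with its human-readable lane by
# joining run.phase (pipelineReference.phaseReference.phaseName) to the
# refName of the timeline's Phase records.
# $1 = runs JSON ({"value": [...]}), $2 = timeline JSON ({"records": [...]})
runs_by_lane() {
  jq -c -n --slurpfile runs "$1" --slurpfile tl "$2" '
    # refName -> name lookup built from the Phase records
    ($tl[0].records
      | map(select(.type == "Phase") | {key: .refName, value: .name})
      | from_entries) as $lane
    | $runs[0].value[]
    | {
        lane:    ($lane[.pipelineReference.phaseReference.phaseName // ""] // "?"),
        flavor:  .name,
        total:   .totalTests,
        passed:  .passedTests,
        failed:  .unanalyzedTests,
        skipped: .notApplicableTests
      }'
}
```

Runs whose phase has no matching Phase record come out with `lane: "?"` instead of being dropped, so nothing silently disappears from the breakdown.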

Quick per-run counts without the join:

```bash
az rest --method get --resource $RES \
--url "$ORG/$PROJECT/_apis/test/runs?buildUri=vstfs:///Build/Build/$BUILD_ID&api-version=7.1&includeRunDetails=true" \
--query "value[].{name:name, total:totalTests, passed:passedTests, failed:unanalyzedTests, skipped:notApplicableTests}" -o json
```

To enrich the breakdown with the **actual error message** under each failed test, replace `/tmp/failed.json` with per-run results that include `errorMessage` (the "Failed-test error message" query above) — key them by `runId` the same way the breakdown's `$ft` lookup does.

## Fetch a failed task's log

Take `log.id` from a `records[?result=='failed']` timeline entry, then (works unauthenticated via `az rest`):

```bash
az rest --method get --resource $RES \
--url "$ORG/$PROJECT/_apis/build/builds/$BUILD_ID/logs/$LOG_ID?api-version=7.1" --output-file "/tmp/azdo-$LOG_ID.log"
```
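To grab every failed record's log in one pass, the timeline query and the download can be chained. A sketch, assuming the shared setup plus `BUILD_ID`; `fetch_failed_task_logs` is a hypothetical helper, and it relies on JMESPath projections dropping records whose `log` is null.

```bash
# Hypothetical helper: download the log of every failed timeline record and
# print the local path of each file written.
fetch_failed_task_logs() {
  az rest --method get --resource "$RES" \
    --url "$ORG/$PROJECT/_apis/build/builds/$BUILD_ID/timeline?api-version=7.1" \
    --query "records[?result=='failed'].log.id" -o tsv |
  while read -r log_id; do
    [ -n "$log_id" ] || continue
    az rest --method get --resource "$RES" \
      --url "$ORG/$PROJECT/_apis/build/builds/$BUILD_ID/logs/$log_id?api-version=7.1" \
      --output-file "/tmp/azdo-$log_id.log"
    echo "/tmp/azdo-$log_id.log"
  done
}
```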

The per-flavor `run <flavor>` task log holds the MTP summary (`Test run summary: Zero tests ran` ⇒ the app crashed at startup); the per-test lifecycle and native crash are **not** here — they are in logcat (below).
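With several task logs downloaded, the startup-crash lanes can be picked out by that summary line. A one-function sketch; `zero_test_logs` is a hypothetical name, and the match string is exactly the summary quoted above.

```bash
# Hypothetical helper: list downloaded task logs whose MTP summary reports
# that no tests ran, i.e. lanes where the app most likely crashed at startup.
# $@ = log files, e.g. /tmp/azdo-*.log from the log download above
zero_test_logs() {
  grep -l 'Test run summary: Zero tests ran' "$@"
}
```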

## Crash culprit from logcat

`scripts/ci_failures.cs` flags crashed/incomplete/timed-out lanes, but the culprit test is only in the device **logcat**, published inside that lane's `Test Results - ...` build artifact (100 MB–2 GB — prefer the smaller `Debug` lane). Download it, then scan `logcat-<flavor>.txt`:

```bash
# List artifacts and their sizes (artifactsize is in bytes) to pick the failing lane:
az rest --method get --resource $RES \
--url "$ORG/$PROJECT/_apis/build/builds/$BUILD_ID/artifacts?api-version=7.1" \
--query "value[].{name:name, bytes:resource.properties.artifactsize}" -o json

az pipelines runs artifact download --run-id $BUILD_ID --org $ORG --project $PROJECT \
--artifact-name "Test Results - APKs .NET Debug - macOS 1" --path /tmp/cilogs

# The crasher is the LAST test that logged a start with no matching pass/fail,
# usually right before a native signal. `find` recurses into the artifact's
# subfolders (a bare `**` glob only does that with bash's `globstar` enabled):
find /tmp/cilogs -name 'logcat-*.txt' -exec grep -HnE \
'Running |\[PASS\]|\[FAIL\]|SIGSEGV|SIGABRT|tombstone|FATAL|art::|JNI DETECTED|Process .* died' {} + \
| tail -60
```

For a `Zero tests ran` lane the crash is at app startup (look for the first `SIGSEGV`/`tombstone`/`JNI DETECTED ERROR`, not a specific test); for a timeout the suspect is the last `Running <test>` with no result.
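For the timeout case, the "last `Running <test>` with no result" rule can be applied mechanically. A sketch under two assumptions: tests log `Running <test>` and `[PASS]`/`[FAIL]` on separate lines, as the `grep` pattern above expects, and they run one at a time, so any `[PASS]`/`[FAIL]` closes the most recent start. `last_unfinished_test` is a hypothetical helper.

```bash
# Hypothetical helper: print "<line>: <text>" of the last test start that was
# never followed by a [PASS]/[FAIL] line; exit 1 if every started test finished.
# $1 = one logcat file, e.g. /tmp/cilogs/<artifact>/logcat-<flavor>.txt
last_unfinished_test() {
  awk '
    /Running /          { pending = FNR ": " $0 }  # a test started
    /\[PASS\]|\[FAIL\]/ { pending = "" }           # ...and reported a result
    END { if (pending != "") print pending; else exit 1 }
  ' "$1"
}
```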
6 changes: 5 additions & 1 deletion .github/skills/ci-status/references/binlog-analysis.md
@@ -20,7 +20,11 @@ az pipelines runs artifact list --run-id $BUILD_ID --org $ORG_URL --project $PRO
az pipelines runs artifact list --run-id $BUILD_ID --org $ORG_URL --project $PROJECT --output json
```

-Look for artifact names containing `binlog`, `msbuild`, or `build-log`.
+Look for artifact names that contain build logs. On the `dotnet-android` (dnceng-public) pipeline the relevant ones are:
+- `Build Results - macOS` / `Build Results - Windows` / `Build Results - Linux` — contain the `.binlog` files (published mainly when a build stage fails or when `XA.PublishAllLogs` is set).
+- `Test Results - ...` — per-test-stage logs and artifacts. For the on-device `Package Tests` (APKs) stage these also include each device test's `build-<testName>.binlog`, `run-<testName>.binlog`, the `.trx`, and `logcat-<testName>.txt` (essential for native/JNI crash diagnosis).
+
+If a green build has no `Build Results - *` artifact, the binlogs weren't published; re-run with `XA.PublishAllLogs` or rely on the timeline/test queries instead.
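To narrow the artifact list to just those log artifacts, the list command above can take a JMESPath filter. A sketch that reuses that command's variables (`BUILD_ID`, `ORG_URL`, `PROJECT`); `log_artifacts` is a hypothetical helper name, and the name prefixes are the ones listed above.

```bash
# Hypothetical helper: print only the "Build Results - *" and "Test Results - *"
# artifact names for a run, one per line.
log_artifacts() {
  az pipelines runs artifact list --run-id "$BUILD_ID" --org "$ORG_URL" --project "$PROJECT" \
    --query "[?starts_with(name, 'Build Results - ') || starts_with(name, 'Test Results - ')].name" \
    -o tsv
}
```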

### Download

2 changes: 1 addition & 1 deletion .github/skills/ci-status/references/error-patterns.md
@@ -36,7 +36,7 @@ These are CI environment issues, not code problems.
| Network | `Unable to load the service index`, `Connection refused` |
| NuGet feed | `NU1301` (feed connectivity) |
| Agent issues | `The agent did not connect`, `##[error] The job was canceled` |
-| Timeout (job-level) | Job canceled after 55+ minutes |
+| Timeout (job-level) | `result: canceled` + `issues[]` says *"ran longer than the maximum time of N minutes"* |
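The job-level timeout signature can be checked against a saved build timeline. A `jq` sketch, assuming the timeline JSON (`/_apis/build/builds/$BUILD_ID/timeline`) is in a file; `timed_out_jobs` is a hypothetical helper, and it matches only the message fragment quoted in the table.

```bash
# Hypothetical helper: print the names of timeline records that were canceled
# because they hit the job time limit. $1 = timeline JSON ({"records": [...]}).
timed_out_jobs() {
  jq -r '.records[]
    | select(.result == "canceled")
    | select(any(.issues[]?; (.message // "") | contains("ran longer than the maximum time")))
    | .name' "$1"
}
```

Records canceled for other reasons (e.g. a manual cancel, with no such issue) are skipped, which keeps real timeouts separate from agent or user cancellations.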

## Decision Tree
