Commit 40b1b6c

Expand README for src/observability (#58887)
1 parent 660614b commit 40b1b6c

1 file changed

Lines changed: 202 additions & 5 deletions

src/observability/README.md

@@ -1,11 +1,208 @@
11
# Observability
22

3-
Observability, for lack of simpler term, is our ability to collect data about how the Docs operates. These tools allow us to monitor the health of our systems, catch any errors, and get paged if a system stops working.
3+
The observability subject provides logging, error tracking, and monitoring infrastructure for docs.github.com. These tools help monitor system health, catch errors, and provide operational visibility through structured logging and alerting.
44

5-
In this directory we have files that connect us to our observability tools, as well as high-level error handling that helps keep our systems resilient.
5+
## Purpose & Scope
66

7-
We collect data in our observability systems to track the health of the Docs systems, not to track user behaviors. User behavior data collection is under the `src/events` directory.
7+
This subject is responsible for:
8+
- Structured logging with logfmt format in production
9+
- Logger abstraction over `console.log` for server-side code
10+
- Error handling and resilience (catch and report errors)
11+
- Integration with Sentry for error tracking
12+
- Integration with StatsD for metrics
13+
- Integration with Failbot for alerts
14+
- Automatic request logging middleware
15+
- Request context tracking via `requestUuid`
816

9-
## Logging
17+
Note: This tracks system health, not user behavior. User behavior tracking is in [`src/events`](../events/README.md).
18+
19+
## Architecture & Key Assets
20+
21+
### Key capabilities and their locations
22+
23+
- `logger/index.ts` - `createLogger()`: Creates logger instance for a module
24+
- `logger/middleware/get-automatic-request-logger.ts` - Express middleware for automatic request logging
25+
- `middleware/handle-errors.ts` - Global Express error handler that logs and reports errors
26+
- `middleware/catch-middleware-error.ts` - Wraps async middleware to catch errors
27+
- `lib/failbot.ts` - Reports errors to Failbot for alerting
28+
- `lib/statsd.ts` - Sends metrics to StatsD for monitoring
29+
30+
## Setup & Usage
31+
32+
### Using the logger
33+
34+
Instead of `console.log`, use the logger:
35+
36+
```typescript
37+
import { createLogger } from '@/observability/logger'
38+
39+
// Pass import.meta.url to include filename in logs
40+
const logger = createLogger(import.meta.url)
41+
42+
// Log levels: error, warn, info, debug
43+
logger.info('Processing request', { userId: '123' })
44+
logger.error('Failed to process', { error })
45+
```
46+
47+
Log levels (highest to lowest severity):
48+
1. `error` - Errors that need attention
49+
2. `warn` - Warnings that may need attention
50+
3. `info` - Informational messages
51+
4. `debug` - Detailed debugging information
52+
53+
Set the `LOG_LEVEL` environment variable to filter logs:
54+
```bash
55+
LOG_LEVEL=info npm run dev # Filters out debug logs
56+
```
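
As a rough illustration of how level filtering like this can work, here is a minimal sketch. The names `LEVEL_RANK` and `shouldLog` are hypothetical and are not taken from `logger/index.ts`; the real logger may implement this differently.

```typescript
// Hypothetical sketch: these names are illustrative, not the real logger internals.
type LogLevel = 'error' | 'warn' | 'info' | 'debug'

// Lower rank means higher severity, matching the ordered list above.
const LEVEL_RANK: Record<LogLevel, number> = { error: 0, warn: 1, info: 2, debug: 3 }

// Returns true when a message at `level` should be emitted
// given the configured threshold (for example, the value of LOG_LEVEL).
function shouldLog(level: LogLevel, configured: LogLevel): boolean {
  return LEVEL_RANK[level] <= LEVEL_RANK[configured]
}
```

Under this sketch, a threshold of `info` lets `error`, `warn`, and `info` through and drops `debug`.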
57+
58+
### Benefits of structured logging
59+
60+
1. **Logfmt format in production** - Easy to query in Splunk with key-value pairs
61+
2. **Log level grouping** - Filter by severity (`error`, `warn`, `info`, `debug`)
62+
3. **Request context** - Every log includes `path` and `requestUuid`
63+
4. **Sentry integration** - Errors in Sentry include `requestUuid` to find related logs
64+
5. **Development clarity** - Simple string logs in development, structured in production
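
To make the key-value idea concrete, the following is a hand-rolled sketch of logfmt-style serialization. It is not the logfmt library the project depends on, and `toLogfmt` is a hypothetical helper; the field names are borrowed from the examples in this README.

```typescript
// Hypothetical sketch: a simplified logfmt-style serializer for illustration only.
type LogValue = string | number | boolean

function toLogfmt(fields: Record<string, LogValue>): string {
  return Object.entries(fields)
    .map(([key, value]) => {
      const text = String(value)
      // Quote values that are empty or contain whitespace, quotes, or '='
      // so that each key=value pair stays a single token.
      const needsQuotes = text === '' || /[\s"=]/.test(text)
      const escaped = text.replace(/\\/g, '\\\\').replace(/"/g, '\\"')
      return needsQuotes ? `${key}="${escaped}"` : `${key}=${text}`
    })
    .join(' ')
}

const line = toLogfmt({
  level: 'info',
  message: 'Cache hit',
  path: '/en',
  requestUuid: 'abc-123',
})
// line is: level=info message="Cache hit" path=/en requestUuid=abc-123
```

Because every field is a separate `key=value` token, a query can match on `requestUuid` or `level` without parsing free-form text.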
65+
66+
### Automatic request logging
67+
68+
Request logging happens automatically via middleware:
69+
- Development: `GET /en 200 2ms`
70+
- Production: Logfmt with full context including `requestUuid`
71+
72+
All application logs from the same request share the same `requestUuid`.
73+
74+
### Error handling
75+
76+
Wrap async middleware to catch errors:
77+
78+
```typescript
79+
import catchMiddlewareError from '@/observability/middleware/catch-middleware-error'
80+
81+
router.get('/path', catchMiddlewareError(async (req, res) => {
82+
  // Errors here are caught and handled
83+
  const data = await fetchData()
84+
  res.json(data)
85+
}))
86+
```
87+
88+
The global error handler in `middleware/handle-errors.ts` catches all Express errors.
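
For readers unfamiliar with the pattern, a wrapper of this kind usually forwards rejected promises to `next()` so that the error handler sees them. The sketch below assumes that behavior; `catchAsync` and the minimal `Next` type are hypothetical stand-ins, not the actual contents of `middleware/catch-middleware-error.ts`.

```typescript
// Hypothetical sketch: a generic async-middleware wrapper, not the real implementation.
type Next = (err?: unknown) => void
type AsyncHandler<Req, Res> = (req: Req, res: Res, next: Next) => Promise<void>

function catchAsync<Req, Res>(handler: AsyncHandler<Req, Res>) {
  return async (req: Req, res: Res, next: Next): Promise<void> => {
    try {
      await handler(req, res, next)
    } catch (error) {
      // Hand the error to the next error-handling middleware
      // instead of leaving an unhandled promise rejection.
      next(error)
    }
  }
}
```

Without a wrapper like this, an error thrown inside an `async` handler may never reach the error handler, depending on the Express version in use.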
89+
90+
## Data & External Dependencies
91+
92+
### Data inputs
93+
- Application logs from `logger.<method>()` calls
94+
- Request metadata (path, method, status, duration)
95+
- Error objects with stack traces
96+
- Request context (`requestUuid`, user agent, etc.)
97+
98+
### Dependencies
99+
- **Splunk** - Log aggregation and querying (index: `docs-internal`)
100+
- **Sentry** - Error tracking and alerting
101+
- **StatsD** - Metrics collection
102+
- **Failbot** - Error reporting and alerting
103+
- **Logfmt** - Log format library
104+
105+
### Data outputs
106+
- Structured logs sent to Splunk
107+
- Errors reported to Sentry with context
108+
- Metrics sent to StatsD
109+
- Alerts sent via Failbot
110+
111+
## Cross-links & Ownership
112+
113+
### Related subjects
114+
- [`src/events`](../events/README.md) - User behavior analytics (separate from observability)
115+
- [`src/frame`](../frame/README.md) - Middleware pipeline where error handlers run
116+
- All subjects - All should use `createLogger()` instead of `console.log`
117+
118+
### Internal documentation
119+
- Splunk dashboard: https://splunk.githubapp.com/en-US/app/gh_reference_app/search
120+
- For detailed logging guide, see `logger/README.md` in this directory
121+
- Sentry dashboard: (internal link)
122+
- On-call runbooks: (internal Docs Engineering repo)
123+
124+
### Ownership
125+
- Team: Docs Engineering
126+
- Note: We don't own Datadog or the observability infrastructure itself; we work with what the observability team provides.
127+
128+
## Current State & Next Steps
129+
130+
### Querying logs in Splunk
131+
132+
All queries should specify the index:
133+
```splunk
134+
index=docs-internal
135+
```
136+
137+
Find logs by request:
138+
```splunk
139+
index=docs-internal requestUuid="abc-123"
140+
```
141+
142+
Find errors:
143+
```splunk
144+
index=docs-internal level=error
145+
```
146+
147+
Find logs from a specific module:
148+
```splunk
149+
index=docs-internal module="src/search/middleware/general-search.ts"
150+
```
151+
152+
### Request context
153+
154+
Every log includes:
155+
- `requestUuid` - Unique ID for the request
156+
- `path` - Request path
157+
- `method` - HTTP method
158+
- `statusCode` - Response status
159+
- `duration` - Request duration
160+
- `module` - Source file (from `import.meta.url`)
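
One way to picture how shared context ends up on every log line is a logger function that is bound to the request's fields once and merges them into each call. This is only a sketch under that assumption; `bindContext` and `Fields` are hypothetical names, and the real logger may attach context through a different mechanism.

```typescript
// Hypothetical sketch: binding request context to a log function.
type Fields = Record<string, string | number>

function bindContext(context: Fields, emit: (fields: Fields) => void) {
  return (message: string, extra: Fields = {}): void => {
    // Context fields come first; per-call fields can add to or override them.
    emit({ ...context, ...extra, message })
  }
}

// Collect emitted entries in memory so the result is easy to inspect.
const emitted: Fields[] = []
const log = bindContext({ requestUuid: 'abc-123', path: '/en' }, (fields) => emitted.push(fields))

log('Cache hit', { key: 'k1' })
// emitted now holds one entry carrying requestUuid, path, key, and message.
```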
161+
162+
### Error reporting flow
163+
164+
1. Error occurs in application code
165+
2. Caught by `catchMiddlewareError` or global error handler
166+
3. Logged with `logger.error()` including stack trace
167+
4. Reported to Sentry with `requestUuid`
168+
5. Critical errors trigger Failbot alerts
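
The flow above can be sketched as a single function with the three destinations injected as callbacks. Everything here is hypothetical: the `Reporters` interface, the `reportError` name, and the `isCritical` flag are assumptions for illustration, since this README does not define how an error is classified as critical or how the real Sentry and Failbot clients are called.

```typescript
// Hypothetical sketch of the reporting flow; the callbacks stand in for the real clients.
interface Reporters {
  logError: (message: string, context: Record<string, unknown>) => void
  reportToSentry: (error: Error, tags: { requestUuid: string }) => void
  alertFailbot: (error: Error) => void
}

function reportError(
  error: Error,
  requestUuid: string,
  isCritical: boolean, // assumption: the caller decides what counts as critical
  reporters: Reporters,
): void {
  // Step 3: log the error with its stack trace.
  reporters.logError(error.message, { requestUuid, stack: error.stack })
  // Step 4: report to Sentry, tagged with requestUuid to find related logs.
  reporters.reportToSentry(error, { requestUuid })
  // Step 5: only critical errors raise an alert.
  if (isCritical) reporters.alertFailbot(error)
}
```

Injecting the reporters keeps the ordering testable without network access.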
169+
170+
### Adding observability to new code
171+
172+
1. Import and create a logger at the top of the file:
173+
```typescript
174+
import { createLogger } from '@/observability/logger'
175+
const logger = createLogger(import.meta.url)
176+
```
177+
178+
2. Log important events:
179+
```typescript
180+
logger.info('Cache hit', { key })
181+
logger.warn('Rate limit approaching', { count })
182+
logger.error('Database connection failed', { error })
183+
```
184+
185+
3. Wrap async middleware:
186+
```typescript
187+
import catchMiddlewareError from '@/observability/middleware/catch-middleware-error'
188+
router.use(catchMiddlewareError(myMiddleware))
189+
```
190+
191+
### Known limitations
192+
- Logs are verbose in production (logfmt includes full context)
193+
- `requestUuid` tracking requires middleware initialization
194+
- Development logs are simplified strings (less structured)
195+
196+
### Planned work
197+
- We have an epic to improve our logging
198+
199+
### Monitoring and alerting
200+
201+
Active monitoring:
202+
- Error rates tracked in Sentry
203+
- Performance metrics tracked in StatsD
204+
- Critical errors trigger Failbot alerts to #docs-ops
205+
- On-call rotation notified for production incidents
206+
207+
For on-call procedures and escalation, see internal Docs Engineering runbooks.
10208

11-
Please see the [logger README](./logger/README.md).
