We turned eleven checklists into Claude Skills. Six survived.

Anderson da SIlva · CTO · Thu, 14 May 2026 07:47:10 GMT

Three weeks ago I told the team to convert every internal checklist we run more than once a month into a Claude Skill. The bet was simple: if the checklist is good enough to live in a doc, it's good enough to live as a tool an agent can invoke.

Eleven checklists went in. Six are still active. Two got deprecated. Three never made it past the first run.

Here's what we kept, what we killed, and the pattern underneath both.

## What a Claude Skill is, in one sentence

A folder. A SKILL.md describing when to use it. Whatever scripts, prompts, or reference data the skill needs alongside. The agent reads the description, decides if the skill applies, and runs.

That's it. The shape is so light that the question stops being "can you build this?" and becomes "is this worth being a separate thing?"

## The six that worked

1. Supply-chain scan before commit. We had a paragraph in our CLAUDE.md about an obfuscated payload that snuck into one of our config files via an evil merge last year. The skill is the same paragraph plus a grep for the markers we know about. The agent now runs it automatically before any commit that touches *.config.{js,ts,mjs} or package.json. Caught nothing in three weeks. Still worth the eighty lines.

2. Migration applier. We have a tiny TypeScript script that bypasses drizzle-kit's interactive prompts and just runs a numbered SQL file against the right database. Wrapping it in a skill means I can say "apply 0014" and the agent figures out the env, the file, and the order. Used eleven times in three weeks. Saves me reading the script every time.

3. PR description from diff. Reads the staged commits, structures them into Summary / Schema changes / Frontend changes / Ops checklist / Test plan. Replaced a markdown template nobody updated. Used on every PR since.

4. CSV / JSON inspector. Given a path, returns row count, column distribution, null rates, a sample. Built for ad-hoc data debugging — we use it more than I'd like to admit. Killed two stale Python notebooks we never deleted.

5. Stripe payload validator. Pastes a Stripe webhook JSON, the skill confirms signature shape, identifies the event type, runs through the idempotency check our drainer expects, and tells you what would happen if it landed in prod. Two of our engineers use it more than the Stripe dashboard's "send test event" button.

6. IAB category lookup. We work with the IAB Tech Lab v3 taxonomy across AdCrier. The skill is the taxonomy plus a fuzzy search over names and descriptions. "Find the IAB id for hotel booking" returns the right four ids in under a second. Beats grepping the CSV every time.

## The two that got deprecated

Lint-then-fix runner. Sounded great. Run pnpm lint, parse the output, propose fixes. In practice it was slower than just running lint manually, because the agent had to read every file before fixing it. Lint already tells you exactly what to change. The skill added a confused middleman. Killed in week two.

Onboarding doc generator. Idea: paste a new repo, get a CLAUDE.md tailored to it. Useless in week one. By week two we realized the issue: CLAUDE.md is most useful when it captures the judgment calls a human made on first contact with a repo. Automating that strips out the part that made it valuable. The doc became generic and confidently wrong.

## The three that never launched

These are the more honest failures.

A. "Investigate flaky test." Couldn't ship because we couldn't articulate the heuristic. A human investigating a flaky test is doing a hundred small things — checking the seed, checking dependencies, checking external service mocks, checking if it's the test or the code under test. We tried to write that down for the skill and gave up after the third revision. The shape of the work refuses to compress.

B. "Cost-aware code generation." We wanted a skill that, given a feature, would prefer cheap implementations (cache hits, batching) over fast-to-write ones. Built it. The agent ignored it. The skill description ended up reading like a vague code-review preference, and the agent — correctly — weighted concrete user intent above abstract style preferences.

C. "Fraud-risk explainer." Translate an ipriskcache row into plain language for a customer support reply. Architecturally fine. We just couldn't get the tone right — replies were too clinical for customer-facing use, and the work of softening them was the same as just writing the reply. The skill saved nothing.

The honest takeaway: writing a Claude Skill is a forcing function. If you can't write the description in a paragraph, you

don't understand the work well enough to delegate it. Half of the dead skills above died because writing them taught us they shouldn't exist.

## The pattern underneath

Looking at the six that stuck, they share three properties:

The work has a deterministic core. Apply this SQL file. Sign this payload. Look up this category. The skill is a thin

wrapper around a thing that already works.

The judgment is small and well-defined. Choose the right migration file. Identify which Stripe event this is. Match a

description to a category. Big enough to be useful, small enough to articulate in one paragraph.

The output is structured. A list of categories, a JSON validation report, a PR description against a template. Not

"make this better" — make this into this shape. The skills that failed had at least one of these inverted. The flaky-test investigator had no deterministic core. The cost-aware generator had ill-defined judgment. The fraud-risk explainer had unstructured output.

## What we'd build differently

If I were starting over, I'd run every proposed skill through three questions before writing any code:

What's the deterministic core?
Can I describe the judgment in one paragraph?
What shape is the output, exactly?

If you can't answer the third question in concrete nouns ("a JSON object with these fields", "a list of IAB ids", "a markdown PR description"), don't build it yet. Go figure out what the output is supposed to look like first. We wasted two weeks on dead skills that would have failed the third question on day one.

## Numbers

📁 Skills built: 11 ✅ Still active after three weeks: 6 🗑️ Deprecated: 2 🚧 Never shipped: 3 ⏱️ Average session-time saved per active skill, per developer per week: ~22 min (self-reported) 🎯 Single biggest time-saver: PR description from diff (17 PRs in three weeks) 💀 Time wasted on dead skills: ~14 dev-hours

## Recommendation

Yes — if you have checklists you run more than once a month.

No — if you're using skills to automate work you don't yet understand. The format won't teach you the work. It compounds clarity you already have and exposes the lack of it where you don't.

Three weeks in we're more deliberate, not less. The folder is shorter than I expected. That's the right direction.

— Anderson & the WNC team