Gappy — The Judgment-Data Layer for Vertical AI

One judgment

Ship with a standard you can defend

Domain experts grade what your AI actually said — 15 minutes per task

judgment · legal-contracts-v2

input…

ai…

expert…

exception…

verdict NG · reason + exception logged · who / when / why

The real failure mode

Models ship demos. Judgment data ships products.

Abandoned after PoC

30–50%

of GenAI projects never get past proof of concept

Gartner 2024–25

Reach production

0%

of AI projects ever make it into prod

Gartner

#1 failure cause

Data

data readiness — not the model — tops every failure survey

Gartner · CDO Insights 2025

DIY eval set

$100k+

to expert-grade one eval set in-house — one-off, non-reusable

Expert eval rates 2025–26

Product

Six floes between you and open water

Everything you need to turn expert judgment into reusable eval data

🗺️

Expert Graph

Experts organized by industry, role, years, and facility type — vetted operators, not crowd labelers.

🧊

Task Builder

Turn real AI answers and edge cases from your product into judgment tasks in minutes.

⚖️

Judgment UI

Experts mark OK / NG, the reason, and the exceptions — about 15 minutes per task.

📜

Audit Trail

Who judged what, when, and why — a record you can show customers and auditors.

📦

Eval Export

Ship results as eval sets for CI, guardrail specs, and RAG-ready knowledge.

🎖️

Credential Layer

Scoped expert supervision — show exactly what was reviewed, no name-lending.
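As a rough illustration of what one judgment could look like as data, here is a minimal sketch. Gappy's actual schema is not published on this page, so every field name below (task_id, input_text, judged_by, and so on) and the JSONL export format are assumptions chosen to mirror the cards above: OK / NG verdict, reason, exceptions, and who / when / why.

```python
import json
from dataclasses import dataclass, asdict, field
from typing import List


@dataclass
class Judgment:
    """One expert judgment on one AI output (hypothetical schema)."""
    task_id: str
    input_text: str   # what the user asked
    ai_output: str    # what the AI actually said
    verdict: str      # "OK" or "NG"
    reason: str       # why the expert ruled that way
    judged_by: str    # who judged (audit trail)
    judged_at: str    # when, as an ISO 8601 timestamp (audit trail)
    exceptions: List[str] = field(default_factory=list)  # cases where the rule flips

    def __post_init__(self) -> None:
        # Reject anything other than the two verdicts the Judgment UI offers.
        if self.verdict not in ("OK", "NG"):
            raise ValueError(f"verdict must be 'OK' or 'NG', got {self.verdict!r}")


def to_jsonl(judgments: List[Judgment]) -> str:
    """Serialize judgments as one JSON object per line, a common eval-set format."""
    return "\n".join(json.dumps(asdict(j), ensure_ascii=False) for j in judgments)
```

Keeping who and when on the record itself, rather than in a separate log, means any exported eval row can be traced back to the expert call that produced it.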

Why this, not that

Domain judgment, made into data

Annotation vendors

Crowd labelers, general-purpose, noisy on specialized domains.

Gappy

Vetted domain experts who actually run the operation.

Advisors & consultants

Advice in a meeting you can't reuse or audit.

Gappy

Judgment work that becomes structured, reusable eval data.

In-house grading

Your engineers burning weeks on one-off eval sets.

Gappy

A workflow that produces eval sets, guardrails, and RAG knowledge.

Name on a slide

An expert title with no record of what they approved.

Gappy

An audit trail: who judged what, when, and why.

🐧 First Penguin program · 5 slots

The first one in gets the whole ocean.

Bring 100 real AI outputs from a feature you're trying to ship. In two weeks, we return a graded eval set and a guardrail spec, hand-built with a domain expert in your vertical. You keep the data.

100 expert OK / NG judgments with reasons + exceptions
An eval set you can run in CI
A guardrail spec for the cases your AI must not get wrong
A 30-minute readout on what's blocking production
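To make "an eval set you can run in CI" and "a guardrail spec" concrete, here is a minimal sketch of a release gate. It is not Gappy's export format: EvalCase, is_acceptable, must_pass, run_gate, and the 0.9 threshold are all hypothetical names and values, and generate stands in for whatever function calls your model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class EvalCase:
    """One graded case (hypothetical shape)."""
    case_id: str
    prompt: str
    # Check derived from the expert's reason; returns True if the output is acceptable.
    is_acceptable: Callable[[str], bool]
    # Guardrail flag: a case the AI must not get wrong.
    must_pass: bool = False


def run_gate(
    cases: List[EvalCase],
    generate: Callable[[str], str],
    min_pass_rate: float = 0.9,
) -> Tuple[bool, List[str]]:
    """Return (ok, failed_case_ids) for a CI step.

    The gate fails if the overall pass rate is below min_pass_rate
    or if any guardrail (must_pass) case fails.
    """
    failed = [c for c in cases if not c.is_acceptable(generate(c.prompt))]
    pass_rate = 1 - len(failed) / len(cases) if cases else 0.0
    guardrail_broken = any(c.must_pass for c in failed)
    ok = pass_rate >= min_pass_rate and not guardrail_broken
    return ok, [c.case_id for c in failed]
```

In a CI job, a wrapper script would call run_gate and exit non-zero when ok is False, so a guardrail regression blocks the release even if the average score still looks healthy.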

FAQ

Questions before you dive?

Or email mitsuki@gappy.jp

What is Gappy?

Gappy is the judgment-data layer for vertical AI. Domain experts grade your real AI outputs (OK / NG, reasons, exceptions), and we turn those calls into eval sets, guardrail specs, and RAG-ready knowledge, with a full audit trail of who judged what, when, and why.

What is a First Penguin?

In a penguin colony, the first one to dive into unknown water takes the risk and eats first. Our First Penguins are the 5 design partners who get their eval set hand-built, free, before anyone else. They shape the product and keep all the data.

Who is Gappy for?

Teams shipping vertical AI (AI dev shops, internal AI teams, and startups) in domains where a wrong answer carries liability: legal, healthcare, finance, tax, insurance, compliance. Your PoC works, but you can't certify it for production because the domain's judgment standards don't exist as data.

How is this different from annotation vendors?

Annotation vendors give you crowd labelers at scale; they're noisy on specialized domains. Gappy's judges are vetted operators in your vertical, and the output is structured for AI operations: eval sets, guardrails, and exceptions, not just labels.

How much does it cost?

Pricing is a platform fee plus per-judgment task fees; we'll validate exact numbers with First Penguins. The benchmark we beat: expert-grading one eval set in-house runs $100k+ and isn't reusable.

Who is behind Gappy?

We're a team that shipped vertical AI into production, hit this exact wall at the certification gate, and won the OpenAI Business Hackathon 2026. We're not arguing in the abstract that experts matter; we got punched by the missing judgment data ourselves.

The water's cold.
The fish are real.

5 First Penguin slots. 100 judgments. 2 weeks. ¥0.

🐧 Register as the 1st First Penguin

Expert judgments

100

per pilot eval set

Turnaround

2

weeks to delivery

First Penguins

5

slots open now

Pilot cost

¥0

you keep the data
