Files
myeasycms-v2/.claude/evals/feature-implementation.md
Giancarlo Buomprisco cfa137795b refactor: consolidate AGENTS.md and CLAUDE.md files, update tech stac… (#444)
* refactor: consolidate AGENTS.md and CLAUDE.md files, update tech stack and architecture details

- Merged content from CLAUDE.md into AGENTS.md for better organization.
- Updated tech stack section to reflect the current technologies used, including Next.js, Supabase, and Tailwind CSS.
- Enhanced monorepo structure documentation with detailed directory purposes.
- Streamlined multi-tenant architecture explanation and essential commands.
- Added key patterns for naming conventions and server actions.
- Removed outdated agent files related to Playwright and PostgreSQL, ensuring a cleaner codebase.
- Bumped version to 2.23.7 to reflect changes.
2026-01-18 10:44:40 +01:00

11 KiB

Agent Evaluation: Full Feature Implementation

This eval tests whether the agent correctly follows Makerkit patterns when implementing a complete feature spanning database, API, and UI layers.

Eval Metadata

  • Type: Capability eval (target: improvement over time)
  • Complexity: High (multi-step, multi-file)
  • Expected Duration: 15-30 minutes
  • Skills Tested: /feature-builder, /server-action-builder, /react-form-builder, /postgres-expert, /navigation-config

Task: Implement "Projects" Feature

Prompt

Implement a "Projects" feature for team accounts with the following requirements:

1. Database: Projects table with name, description, status (enum: draft/active/archived), and account_id
2. Server: CRUD actions for projects (create, update, delete, list)
3. UI: Projects list page with create/edit forms
4. Navigation: Add to team sidebar

Use the available skills for guidance. The feature should be accessible at /home/[account]/projects.

Reference Solution Exists

A correct implementation requires:

  • 1 schema file
  • 1 migration
  • 1 Zod schema file
  • 1 service file
  • 1 server actions file
  • 2-3 component files
  • 1 page file
  • Config updates (paths, navigation, translations)

Success Criteria (Grading Rubric)

1. Database Layer (25 points)

Criterion Points Grader Type Pass Condition
Schema file created in apps/web/supabase/schemas/ 3 Code File exists with .sql extension
Table has correct columns 5 Code Contains: id, account_id, name, description, status, created_at
RLS enabled 5 Code Contains enable row level security
Uses helper functions in policies 5 Code Contains has_role_on_account OR has_permission
Permissions revoked/granted correctly 4 Code Contains revoke all AND grant select, insert, update, delete
Status enum created 3 Code Contains create type with draft/active/archived

Anti-patterns to penalize (-3 each):

  • SECURITY DEFINER without access checks
  • Missing on delete cascade for account_id FK
  • No index on account_id

2. Server Layer (25 points)

Criterion Points Grader Type Pass Condition
Zod schema in _lib/schemas/ 3 Code File exists, exports schema with z.object
Service class pattern used 5 Code Contains class with methods, uses getSupabaseServerClient
Actions use enhanceAction 5 Code Import from @kit/next/actions, wraps handler
Actions have auth: true 3 Code Options object contains auth: true
Actions have schema validation 3 Code Options object contains schema:
Uses revalidatePath after mutations 3 Code Import and call revalidatePath
Logging with getLogger 3 Model Appropriate logging before/after operations

Anti-patterns to penalize (-3 each):

  • Manual auth checks instead of trusting RLS
  • await logger.info() (logger methods are not promises)
  • Business logic in action instead of service

3. UI Layer (25 points)

Criterion Points Grader Type Pass Condition
Components in _components/ directory 2 Code Path contains _components/
Form uses react-hook-form with zodResolver 5 Code Imports both, uses useForm({ resolver: zodResolver() })
No generics on useForm 3 Code NOT contains useForm<
Uses @kit/ui/form components 4 Code Imports Form, FormField, FormItem, FormLabel, FormControl, FormMessage
Uses Trans for strings 3 Code Import from @kit/ui/trans, uses <Trans i18nKey=
Uses useTransition for loading 3 Code const [pending, startTransition] = useTransition()
Has data-test attributes 3 Code Contains data-test= on form/buttons
Error handling with isRedirectError 2 Code Import and check in catch block

Anti-patterns to penalize (-3 each):

  • useForm<SomeType> with explicit generic
  • Using watch() instead of useWatch
  • Hardcoded strings without Trans
  • Missing FormMessage for error display

4. Integration & Navigation (15 points)

Criterion Points Grader Type Pass Condition
Page in correct route group 3 Code Path is app/home/[account]/projects/page.tsx
Uses await params pattern 3 Code Contains const { account } = await params
Path added to paths.config.ts 3 Code Contains projects path
Nav item added to team config 3 Code Entry in team-account-navigation.config.tsx
Translation key added 3 Code Entry in public/locales/en/common.json

5. Code Quality (10 points)

Criterion Points Grader Type Pass Condition
TypeScript compiles 5 Code pnpm typecheck exits 0
Lint passes 3 Code pnpm lint:fix exits 0
Format passes 2 Code pnpm format:fix exits 0

Grader Implementation

Code-Based Grader (Automated)

interface EvalResult {
  score: number;
  maxScore: number;
  passed: boolean;
  details: {
    criterion: string;
    points: number;
    maxPoints: number;
    evidence: string;
  }[];
  antiPatterns: string[];
}

async function gradeFeatureImplementation(): Promise<EvalResult> {
  const details = [];
  const antiPatterns = [];

  // 1. Check schema file
  const schemaFiles = glob('apps/web/supabase/schemas/*project*.sql');
  const schemaContent = schemaFiles.length > 0 ? read(schemaFiles[0]) : '';

  details.push({
    criterion: 'Schema file exists',
    points: schemaFiles.length > 0 ? 3 : 0,
    maxPoints: 3,
    evidence: schemaFiles[0] || 'No schema file found'
  });

  details.push({
    criterion: 'RLS enabled',
    points: schemaContent.includes('enable row level security') ? 5 : 0,
    maxPoints: 5,
    evidence: 'Checked for RLS statement'
  });

  // Check anti-patterns
  if (schemaContent.includes('security definer') &&
      !schemaContent.includes('has_permission') &&
      !schemaContent.includes('is_account_owner')) {
    antiPatterns.push('SECURITY DEFINER without access validation');
  }

  // 2. Check server files
  const actionFiles = glob('apps/web/app/home/[account]/projects/**/*actions*.ts');
  const actionContent = actionFiles.length > 0 ? read(actionFiles[0]) : '';

  details.push({
    criterion: 'Uses enhanceAction',
    points: actionContent.includes('enhanceAction') ? 5 : 0,
    maxPoints: 5,
    evidence: 'Checked for enhanceAction import/usage'
  });

  if (actionContent.includes('await logger.info')) {
    antiPatterns.push('await on logger.info (not a promise)');
  }

  // 3. Check UI files
  const componentFiles = glob('apps/web/app/home/[account]/projects/_components/*.tsx');
  const formContent = componentFiles.map(f => read(f)).join('\n');

  details.push({
    criterion: 'No generics on useForm',
    points: !formContent.includes('useForm<') ? 3 : 0,
    maxPoints: 3,
    evidence: 'Checked for useForm<Type> pattern'
  });

  if (formContent.includes('useForm<')) {
    antiPatterns.push('Explicit generic on useForm (should use zodResolver inference)');
  }

  // 4. Check integration
  const pathsConfig = read('apps/web/config/paths.config.ts');
  details.push({
    criterion: 'Path configured',
    points: pathsConfig.includes('projects') ? 3 : 0,
    maxPoints: 3,
    evidence: 'Checked paths.config.ts'
  });

  // 5. Run verification
  const typecheckResult = await exec('pnpm typecheck');
  details.push({
    criterion: 'TypeScript compiles',
    points: typecheckResult.exitCode === 0 ? 5 : 0,
    maxPoints: 5,
    evidence: `Exit code: ${typecheckResult.exitCode}`
  });

  // Calculate totals
  const score = details.reduce((sum, d) => sum + d.points, 0);
  const maxScore = details.reduce((sum, d) => sum + d.maxPoints, 0);
  const penaltyPoints = antiPatterns.length * 3;

  return {
    score: Math.max(0, score - penaltyPoints),
    maxScore,
    passed: (score - penaltyPoints) >= maxScore * 0.8, // 80% threshold
    details,
    antiPatterns
  };
}

Model-Based Grader (For Nuanced Criteria)

You are evaluating an AI agent's implementation of a "Projects" feature in a Makerkit SaaS application.

Review the following files and assess:

1. **Logging Quality** (0-3 points):
   - Are log messages descriptive and include relevant context (userId, projectId)?
   - Is logging done before AND after important operations?
   - Are error cases logged with appropriate severity?

2. **Code Organization** (0-3 points):
   - Is business logic in services, not actions?
   - Are files in the correct directories per Makerkit conventions?
   - Is there appropriate separation of concerns?

3. **Error Handling** (0-3 points):
   - Are errors handled gracefully?
   - Does the UI show appropriate error states?
   - Are redirect errors handled correctly?

Provide a score for each criterion with brief justification.

Trial Configuration

trials: 3  # Run 3 times to account for non-determinism
pass_threshold: 0.8  # 80% of max score
metrics:
  - pass@1: "Passes on first attempt"
  - pass@3: "Passes at least once in 3 attempts"
  - pass^3: "Passes all 3 attempts (reliability)"

Environment Setup

Before each trial:

  1. Reset to clean git state: git checkout -- .
  2. Ensure Supabase types are current: pnpm supabase:web:typegen
  3. Verify clean typecheck: pnpm typecheck

After each trial:

  1. Capture transcript (full conversation)
  2. Capture outcome (files created/modified)
  3. Run graders
  4. Reset environment

Expected Failure Modes

Document these to distinguish agent errors from eval problems:

Failure Likely Cause Is Eval Problem?
Missing RLS Agent didn't follow postgres-expert skill No
useForm<Type> Agent ignored react-form-builder guidance No
Wrong file path Ambiguous task description Maybe - clarify paths
Typecheck fails on unrelated code Existing codebase issue Yes - fix baseline
Agent uses different but valid approach Eval too prescriptive Yes - grade outcome not path

Iteration Log

Track eval refinements here:

Date Change Reason
Initial Created eval -

Notes

  • Grade outcomes, not paths: If agent creates a working feature with slightly different file organization, that's acceptable
  • Partial credit: A feature missing navigation but with working CRUD is still valuable
  • Read transcripts: When scores are low, check if agent attempted to use skills or ignored them entirely