How to build AI-powered apps with feature flags

Supa Developer · 4 min read

AI features are inherently experimental. You might train three different recommendation models and need to test which performs best with real users. Or deploy a new language model that works perfectly in testing but needs careful monitoring in production. A standard deployment has no graceful recovery path if something goes wrong.

Feature flags change that. They let you roll out AI features gradually, compare them against existing behaviour, and pull the ripcord instantly - without a deployment.

The core problem with shipping AI

Traditional features fail in predictable ways: a button doesn't render, an API returns a 500. AI features fail in subtler ones: engagement goes up but conversion goes down, response quality degrades for a specific input distribution, latency spikes under load.

These problems are hard to catch in staging. They need real users and real traffic to surface. That's exactly what gradual rollouts with feature flags give you: production exposure with a controlled blast radius.

Scenario 1: Rolling out a new ML model

You've trained a new recommendation algorithm that shows 15% better engagement in offline tests. Replacing the existing model for everyone immediately is a large, irreversible bet. Instead, roll it out behind a flag:

import {
  SupashipClient,
  FeaturesWithFallbacks,
} from '@supashiphq/javascript-sdk'

const FEATURE_FLAGS = {
  'recommendation-model': { version: 'v1' as 'v1' | 'v2' },
} satisfies FeaturesWithFallbacks

const client = new SupashipClient({
  sdkKey: process.env.SUPASHIP_SDK_KEY,
  environment: 'production',
  features: FEATURE_FLAGS,
  context: { userId: user.id },
})

async function getRecommendations(userId: string) {
  const { version } = await client.getFeature('recommendation-model')

  if (version === 'v2') {
    return newModel.predict(userId)
  }
  return legacyModel.predict(userId)
}

Start at 5% of users. Watch conversion alongside engagement - better engagement that hurts conversion is still a regression. Increase the rollout only when both metrics look healthy, and roll back immediately if either degrades.
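
That gating rule can be written down as a small guardrail check. This is a sketch, assuming you already compute per-variant engagement and conversion deltas relative to the v1 baseline; the interface and function names here are illustrative, not part of the Supaship SDK:

```typescript
// Hypothetical per-variant metrics, expressed as deltas vs. the v1 baseline.
interface VariantMetrics {
  engagementDelta: number // e.g. +0.15 means 15% better engagement
  conversionDelta: number // e.g. -0.02 means 2% worse conversion
}

type RolloutDecision = 'increase' | 'hold' | 'rollback'

// Both metrics must be healthy before the rollout percentage grows;
// a conversion regression triggers rollback even if engagement improved.
function decideRollout(m: VariantMetrics, tolerance = 0.01): RolloutDecision {
  if (m.conversionDelta < -tolerance) return 'rollback'
  if (m.engagementDelta > 0 && m.conversionDelta >= 0) return 'increase'
  return 'hold'
}
```

Running a check like this on a schedule (rather than eyeballing dashboards) makes the "roll back immediately" part of the playbook automatic.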

Using an object-valued flag here (rather than a boolean) means you can add a v3 later without touching the flag structure. The dashboard controls which version each cohort sees.

Scenario 2: Gradual AI chatbot rollout

Your team built an AI customer support chatbot that handles 90% of queries correctly in testing. Customer support is mission-critical - you can't afford to frustrate users with a bot that confidently gives wrong answers.

const FEATURE_FLAGS = {
  'ai-chat-support': false,
} satisfies FeaturesWithFallbacks

const client = new SupashipClient({
  sdkKey: process.env.SUPASHIP_SDK_KEY,
  environment: 'production',
  features: FEATURE_FLAGS,
  context: {
    userId: user.id,
    plan: user.subscriptionPlan,
  },
})

async function renderSupportInterface(user: User) {
  const aiChatEnabled = await client.getFeature('ai-chat-support')

  if (aiChatEnabled) {
    return renderAIChat({ fallbackToHuman: true })
  }
  return renderTraditionalForm()
}

Start with internal users or your most engaged customers - they're more forgiving and give better feedback. Monitor conversation success rates and escalation rates before expanding. The human fallback stays in place until you're confident the AI handles the full range of real queries.
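
The "monitor before expanding" step can be reduced to a few counters. A minimal sketch, assuming you track conversations, AI resolutions, and human escalations per rollout window; the thresholds and names are illustrative:

```typescript
// Illustrative counters collected over one rollout window.
interface ChatStats {
  conversations: number
  resolvedByAI: number
  escalatedToHuman: number
}

// Expanding the rollout is only safe when the AI resolves most
// conversations and escalations to humans stay rare.
function rolloutHealthy(
  s: ChatStats,
  minSuccessRate = 0.85,
  maxEscalationRate = 0.1,
): boolean {
  if (s.conversations === 0) return false // no data, no expansion
  const success = s.resolvedByAI / s.conversations
  const escalation = s.escalatedToHuman / s.conversations
  return success >= minSuccessRate && escalation <= maxEscalationRate
}
```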

Scenario 3: Configuration flags for AI behaviour

Not every AI flag is a simple on/off. Sometimes you want to tune the behaviour without deploying new code. Object-valued flags handle this well:

const FEATURE_FLAGS = {
  'ai-response-config': {
    model: 'gpt-4o-mini' as string,
    maxTokens: 512,
    temperature: 0.7,
  },
} satisfies FeaturesWithFallbacks

const config = await client.getFeature('ai-response-config')

const response = await openai.chat.completions.create({
  model: config.model,
  max_tokens: config.maxTokens,
  temperature: config.temperature,
  messages: [...],
})

Change the model, token limit, or temperature from the Supaship dashboard and the change takes effect on the next request - no deploy, no code review cycle for a configuration tweak.
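
Because the dashboard becomes a live input to your model calls, it's worth clamping values before using them. A defensive sketch, where the model allow-list and bounds are assumptions you'd tune to your own account:

```typescript
interface AIResponseConfig {
  model: string
  maxTokens: number
  temperature: number
}

// Illustrative allow-list; replace with the models you actually use.
const ALLOWED_MODELS = new Set(['gpt-4o-mini', 'gpt-4o'])

// Clamp dashboard-supplied values into safe bounds so a typo in the
// flag editor can't produce a 100k-token or temperature-5 request.
function sanitizeConfig(raw: AIResponseConfig): AIResponseConfig {
  return {
    model: ALLOWED_MODELS.has(raw.model) ? raw.model : 'gpt-4o-mini',
    maxTokens: Math.min(Math.max(Math.floor(raw.maxTokens), 1), 4096),
    temperature: Math.min(Math.max(raw.temperature, 0), 2),
  }
}
```

Passing `sanitizeConfig(config)` instead of `config` to the completion call keeps the no-deploy flexibility without trusting every keystroke in the flag editor.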

What to monitor during an AI rollout

AI features need broader monitoring than typical ones. Before increasing rollout percentage, make sure you're watching:

  - Error rate: model failures, API timeouts, unexpected output formats
  - Latency (p95 / p99): AI inference is slower than traditional logic; watch for regressions
  - Business outcome: conversion, retention, task completion - not just engagement
  - Fallback rate: how often users hit the non-AI path (a rising rate signals a problem)
  - User feedback signals: thumbs down, re-prompts, support tickets mentioning the feature

Set up alerts before enabling the flag in production. The window between 5% and 50% is where most production AI issues surface - you want to catch them there, not at 100%.
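
One of those alerts can be as simple as a fallback-rate threshold. A sketch, where the window size and 5% threshold are illustrative defaults, not recommendations from the Supaship docs:

```typescript
// Share of requests that fell back to the non-AI path in a window.
function fallbackRate(totalRequests: number, fallbacks: number): number {
  return totalRequests === 0 ? 0 : fallbacks / totalRequests
}

// Alert once there's enough traffic to trust the rate and it crosses
// the threshold; small windows produce noisy rates, so require volume.
function shouldAlert(
  totalRequests: number,
  fallbacks: number,
  threshold = 0.05,
  minRequests = 100,
): boolean {
  return totalRequests >= minRequests && fallbackRate(totalRequests, fallbacks) > threshold
}
```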

Getting started

  1. Wrap the AI feature in a flag before it ships. The fallback should always be the known-working path.
  2. Start at 1–5%, ideally with internal users or a willing beta group.
  3. Monitor the metrics that matter - business outcomes, not just technical ones.
  4. Increase gradually with a deliberate pause at each step.
  5. Clean up the flag once the feature is at 100% and stable.
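
Step 1's "the fallback should always be the known-working path" also applies at runtime: even with the flag on, the AI path can throw. A generic sketch of degrading to the legacy path on failure (the helper and hook names are illustrative):

```typescript
// Try the flagged AI path; fall back to the known-working
// implementation on any failure instead of surfacing an error.
async function withFallback<T>(
  aiPath: () => Promise<T>,
  legacyPath: () => Promise<T>,
  onError?: (err: unknown) => void, // hook for your fallback-rate metric
): Promise<T> {
  try {
    return await aiPath()
  } catch (err) {
    onError?.(err)
    return legacyPath()
  }
}
```

In scenario 1 this would look like `withFallback(() => newModel.predict(id), () => legacyModel.predict(id))`, so a model outage degrades to old recommendations rather than an error page.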

The pattern is the same whether you're rolling out a new LLM integration, a fine-tuned classification model, or an AI-powered search feature. The flag gives you the safety net; the gradual rollout gives you real-world evidence before you commit fully.


Ready to ship AI features with confidence? Try Supaship - feature flags built for modern development teams. Free forever up to 1M events/month. Pro plan is $30/month for your entire workspace.


Feedback

Got thoughts on this?

We're constantly learning how developers actually use these tools. Ideas, use cases, integration requests — every bit of feedback makes the platform better for everyone.

Thanks for being part of the journey — Supaship