How to Identify Which SaaS Vendors Are Training AI Models on Your Data
- Martin Snyder

If you cannot clearly verify how your SaaS vendors handle AI training, you should assume your data is at risk of being used to train their models.

The Transparency Gap No One Talks About
Most SaaS vendors today will tell you some version of the same story:
“We take your data seriously.”
“We prioritize privacy.”
“We do not use customer data inappropriately.”
But when it comes to AI, those statements are often incomplete.
Because the real question is not whether vendors protect your data.
It is whether they use it to train their models—directly, indirectly, or under specific conditions.
And in many cases, the answer is difficult to verify.
Why This Matters More Than Ever
AI has fundamentally changed how SaaS platforms interact with data.
Traditional SaaS applications stored and processed data within defined boundaries. AI-enabled platforms, however, may:
- Use data to improve models
- Retain prompts and outputs for analysis
- Share data with underlying AI providers
- Apply different policies based on licensing tiers
This introduces a new category of risk—one that is not always visible in standard security reviews.
Frameworks such as the NIST AI Risk Management Framework increasingly emphasize this distinction between data processing and model training, highlighting the need for clear governance around AI usage: https://www.nist.gov/itl/ai-risk-management-framework
The Core Challenge: Most Policies Are Ambiguous
If you review the documentation of most SaaS vendors, you will notice a pattern.
Statements about AI and data usage are often:
- Broad rather than specific
- Conditional rather than absolute
- Scattered across multiple documents
For example, a vendor may state that they do not use customer data for training—while also noting exceptions for:
- “Service improvement”
- “Aggregated or anonymized data”
- “Optional features or beta programs”
From a legal standpoint, these distinctions are meaningful.
From a security standpoint, they create uncertainty.
What “Training on Your Data” Actually Means
Before you can identify risk, you need to define it clearly.
Training can take multiple forms:
- Direct training: Using your data to update core models
- Indirect training: Using prompts or outputs to refine behavior
- Feature-level training: Improving specific AI capabilities based on usage
- Third-party training: Passing data to external AI providers
Not all of these are treated equally in vendor disclosures.
Some may be explicitly mentioned. Others may be implied or omitted.
This is why verification matters.
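For teams that track vendor assessments in code, these four forms can be captured in a small data model so that anything not explicitly denied stays visible as unverified. A minimal sketch in Python; the type and field names are illustrative, not drawn from any standard:

```python
from dataclasses import dataclass, field
from enum import Enum

class TrainingType(Enum):
    """The four forms of training a vendor disclosure may cover."""
    DIRECT = "direct"            # your data updates the core models
    INDIRECT = "indirect"        # prompts/outputs refine model behavior
    FEATURE_LEVEL = "feature"    # specific AI capabilities improved via usage
    THIRD_PARTY = "third_party"  # data passed to external AI providers

class Disclosure(Enum):
    EXPLICITLY_DENIED = "explicitly denied"
    CONDITIONAL = "conditional"      # e.g. "except for service improvement"
    NOT_MENTIONED = "not mentioned"  # implied or omitted

@dataclass
class VendorAssessment:
    vendor: str
    # Until verified, treat every training form as undisclosed.
    disclosures: dict = field(
        default_factory=lambda: {t: Disclosure.NOT_MENTIONED for t in TrainingType}
    )

    def unverified(self) -> list:
        """Training forms the vendor has not explicitly denied."""
        return [t for t, d in self.disclosures.items()
                if d is not Disclosure.EXPLICITLY_DENIED]

assessment = VendorAssessment("example-vendor")
print(assessment.unverified())  # all four forms, until verified otherwise
```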
A Practical Framework for Verifying Vendor Behavior
Instead of relying on high-level statements, organizations should adopt a structured approach to evaluating SaaS vendors.
1. Start With the AI-Specific Documentation
Do not rely solely on general privacy policies.
Look for:
- AI or “machine learning” policy pages
- Data processing addendums (DPAs)
- Product-specific documentation for AI features
Vendors often separate AI disclosures from general privacy statements.
If no AI-specific documentation exists, that is already a signal of limited transparency.
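To make this review repeatable across vendors, the three document types above can be recorded as a simple checklist. A minimal sketch, assuming you record a URL for each document as you locate it; the category names are illustrative:

```python
# Hypothetical per-vendor checklist; the categories mirror the list above,
# and each value holds the URL of the document once you find it.
ai_docs = {
    "ai_policy_page": None,   # AI / machine-learning policy page
    "dpa": None,              # data processing addendum
    "ai_feature_docs": None,  # product docs for AI features
}

missing = [name for name, url in ai_docs.items() if url is None]
if len(missing) == len(ai_docs):
    print("No AI-specific documentation found: limited transparency")
elif missing:
    print("Gaps to chase down:", ", ".join(missing))
```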
2. Identify Conditional Language
Pay close attention to how statements are framed.
A statement such as “We do not use customer data to train models,” followed by “except for improving our services,” should be treated as an indicator of potential ambiguity.
You are looking for:
- Clear opt-in vs opt-out mechanisms
- Explicit definitions of “training”
- Statements about prompt and output retention
If the language requires interpretation, it introduces risk.
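Part of this review can be automated as a first pass. A minimal sketch that flags conditional phrasing in policy text; the pattern list is illustrative and deliberately incomplete, and it supplements a careful human read rather than replacing one:

```python
import re

# Illustrative patterns only; real policies vary widely.
CONDITIONAL_PATTERNS = [
    r"\bexcept\b",
    r"\bunless\b",
    r"service improvement",
    r"improv\w+ our (services|products|models)",
    r"aggregated or anonymi[sz]ed",
    r"beta (program|feature)",
    r"\bmay\b.{0,40}\b(use|retain|share)\b",
]

def flag_conditional_language(policy_text: str) -> list:
    """Return the sentences that contain hedged or conditional
    phrasing worth a closer read."""
    sentences = re.split(r"(?<=[.!?])\s+", policy_text)
    return [
        s.strip() for s in sentences
        if any(re.search(p, s, re.IGNORECASE) for p in CONDITIONAL_PATTERNS)
    ]

policy = ("We do not use customer data to train models, "
          "except for improving our services.")
for hit in flag_conditional_language(policy):
    print("Review:", hit)
```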
3. Verify Enterprise Controls
Even when vendors allow customers to opt out of training, the control mechanisms matter.
Key questions include:
- Can administrators disable AI features entirely?
- Is there an organization-wide opt-out for training?
- Are settings applied consistently across all users?
- Are there differences based on licensing tiers?
In many cases, controls exist—but are not enabled by default.
Guidance from cybersecurity agencies reinforces the importance of verifying not just policy, but enforceability: https://www.cisa.gov/resources-tools/resources/ai-cybersecurity-guidelines
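These questions translate naturally into a per-vendor, per-tier control checklist you can score. A hedged sketch; the control names are illustrative and do not correspond to any specific vendor's admin settings:

```python
from dataclasses import dataclass

@dataclass
class AIControls:
    """Illustrative control checklist; field names are not tied to
    any particular vendor's admin console."""
    can_disable_ai_features: bool
    org_wide_training_opt_out: bool
    settings_apply_to_all_users: bool
    same_controls_on_all_tiers: bool
    opt_out_enabled_by_default: bool  # controls that exist but are off don't help

def control_findings(c: AIControls) -> list:
    findings = []
    if not c.can_disable_ai_features:
        findings.append("AI features cannot be disabled by admins")
    if not c.org_wide_training_opt_out:
        findings.append("no organization-wide training opt-out")
    if not c.settings_apply_to_all_users:
        findings.append("settings are not enforced for all users")
    if not c.same_controls_on_all_tiers:
        findings.append("controls differ by licensing tier")
    if c.org_wide_training_opt_out and not c.opt_out_enabled_by_default:
        findings.append("opt-out exists but is not enabled by default")
    return findings
```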
4. Assess Data Flow Beyond the Vendor
Many SaaS platforms rely on external AI providers.
This creates an additional layer of complexity.
You need to determine:
- Whether data is sent to third-party AI models
- What agreements govern that data
- Whether those providers use data for training
Even if your direct vendor does not train on your data, downstream providers might.
Without visibility into this chain, your risk picture remains incomplete.
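One way to keep track of that chain is to model it explicitly and walk it. A sketch under the assumption that you can enumerate each vendor's AI subprocessors from its DPA or subprocessor list; all names here are placeholders:

```python
# Hypothetical data-flow model: each vendor may route data to
# downstream AI providers, each with its own training posture.
VENDOR_CHAIN = {
    "example-saas-vendor": {
        "sends_data_to": ["example-llm-provider"],
        "trains_on_customer_data": False,
    },
    "example-llm-provider": {
        "sends_data_to": [],
        "trains_on_customer_data": None,  # unknown until verified
    },
}

def unverified_downstream(vendor: str, chain: dict) -> list:
    """Walk the provider chain and collect every party whose
    training behavior is not explicitly ruled out."""
    risky, queue, seen = [], [vendor], set()
    while queue:
        current = queue.pop()
        if current in seen or current not in chain:
            continue
        seen.add(current)
        if chain[current]["trains_on_customer_data"] is not False:
            risky.append(current)
        queue.extend(chain[current]["sends_data_to"])
    return risky

print(unverified_downstream("example-saas-vendor", VENDOR_CHAIN))
# -> ['example-llm-provider']: downstream risk even if the vendor is clean
```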
5. Validate Through Product Behavior
Policies tell you what should happen. Behavior tells you what actually happens.
Where possible, validate:
- Whether prompts are stored or retrievable
- Whether outputs persist across sessions
- Whether usage logs include AI interactions
- How quickly data can be deleted
This does not replace contractual verification, but it provides an additional layer of confidence.
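Where a vendor exposes an API, some of these checks can be scripted. A sketch of a deletion check; the endpoint, paths, and credentials are entirely hypothetical, so substitute the vendor's actual API if one exists:

```python
import time
import requests  # assumes the vendor exposes a REST API; many do not

BASE = "https://api.example-vendor.com/v1"     # hypothetical endpoint
HEADERS = {"Authorization": "Bearer <token>"}  # placeholder credential

def check_prompt_deletion(prompt_id: str, grace_seconds: int = 60) -> bool:
    """Delete a test prompt, wait, then verify it is actually gone.
    Returns True if observed behavior matches the stated policy."""
    requests.delete(f"{BASE}/prompts/{prompt_id}", headers=HEADERS)
    time.sleep(grace_seconds)
    resp = requests.get(f"{BASE}/prompts/{prompt_id}", headers=HEADERS)
    return resp.status_code == 404  # still retrievable -> retention gap
```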
The Role of Discovery in AI Transparency
One of the biggest challenges in verifying vendor behavior is knowing which vendors to evaluate in the first place.
Most organizations underestimate how many AI-enabled SaaS applications are in use.
Some are approved. Many are not.
Without discovery, your verification process is incomplete.
You may rigorously assess a handful of known vendors while missing dozens of tools that employees are actively using.
This is where Shadow AI and AI transparency intersect.
Where Waldo Security Fits
Waldo Security approaches this problem from a visibility-first perspective.
It identifies SaaS applications across the organization—including those introduced through email signups or OAuth connections—and surfaces where AI is likely in use.
This allows organizations to:
- Build a complete inventory of AI-enabled SaaS tools
- Prioritize which vendors require deeper verification
- Detect usage patterns that may introduce training risk
Waldo Security does not train AI models on customer data and operates entirely on metadata, aligning with a privacy-first approach to discovery.
A Different Standard for Trust
In the context of AI, trust cannot be based on general statements.
It needs to be based on:
- Clear documentation
- Verifiable controls
- Observable behavior
If a vendor cannot provide clarity across these dimensions, the risk is not theoretical.
It is simply unmeasured.
Final Thought: If You Haven’t Verified It, You Don’t Know
Most organizations assume their data is not being used for AI training.
Few can prove it.
As AI continues to integrate into SaaS platforms, this assumption becomes increasingly dangerous.
Because the question is no longer whether vendors could train on your data.
It is whether you have taken the steps to confirm that they are not.
To understand how organizations are uncovering hidden SaaS and AI usage, visit: https://www.waldosecurity.com/2025-saas-and-cloud-discovery-report


