McGarrah Technical Blog

Five Stages of a Successful Cloud Data Science Platform


Every enterprise I have worked in for over a decade has hit the same wall: data scientists need production data to build useful models, but the standard software development lifecycle assumes production data stays in production. The SDLC model that works perfectly for application development — synthetic data in dev, sanitized data in staging, real data only in prod — fundamentally breaks when applied to machine learning.

This is not a tooling problem. It is an organizational architecture problem that sits at the intersection of data engineering, security, compliance, and platform engineering. Getting it wrong means either your data scientists train on garbage data and produce garbage models, or your security team grants exceptions that erode your compliance posture. Getting it right requires a promotion framework purpose-built for the constraints of data science work.

The Conflict

In classic SDLC, the environment hierarchy is straightforward:

---
title: Classic SDLC Environment Promotion
---
graph LR
    DEV[Development<br/>Synthetic data<br/>Maximum flexibility] --> MID[Staging / QA / UAT<br/>Sanitized data<br/>Moderate controls]
    MID --> PROD[Production<br/>Real data<br/>Maximum security]
    style DEV fill:#4CAF50,color:#fff
    style MID fill:#FF9800,color:#fff
    style PROD fill:#f44336,color:#fff

Development environments never have production data. The middle environments (staging, QA, UAT — every enterprise names them differently, and the naming is hotly debated) may have sanitized copies. Production has the real data under the strictest controls. Security and flexibility are inversely correlated as you move up the stack.

This model works because application developers do not need real data to write and test code. A payment processing service can be fully tested with synthetic transactions. A user management system works fine with fake users.

Data science is different. Training a machine learning model on synthetic or sanitized data produces a model that has learned the patterns of synthetic data — not the patterns of your actual business. Feature engineering on sanitized datasets misses the edge cases, distributions, and correlations that exist in production. The entire value proposition of ML depends on learning from real data.

So the conflict becomes clear: a data scientist needs the interactive flexibility of a development environment — notebooks, experimentation, iterative exploration — combined with access to production data that requires the security controls of a production environment. These two requirements are architecturally opposed in the standard SDLC model.

I have had this argument with development managers, project managers, and security teams across healthcare (BCBSNC), financial services (Envestnet), and government (NC DIT). The conversation always starts the same way: “Why can’t they just use the staging data?” And the answer is always the same: because staging data is not production data, and the model quality difference is measurable.

The Five-Stage Framework

The resolution is a promotion framework designed specifically for data science workloads. Instead of forcing DS into the SDLC model, you build a parallel track that acknowledges the production data requirement from the start.

---
title: Five-Stage Data Science Platform
---
graph TD
    subgraph "**Infrastructure Track**"
        IDEV[Infrastructure<br/>Development] --> IPRE[Infrastructure<br/>Pre-Production]
    end
    subgraph "**Data Science Track** (Production Data)"
        DISC[Prod Discovery<br/>DS Interactive] --> INT[Prod Integration<br/>DS Automation]
        INT --> FINAL[Production<br/>Final Prod]
    end
    IPRE -.->|releases to| DISC
    IPRE -.->|releases to| INT
    IPRE -.->|releases to| FINAL
    style IDEV fill:#4CAF50,color:#fff
    style IPRE fill:#8BC34A,color:#fff
    style DISC fill:#FF9800,color:#fff
    style INT fill:#FF5722,color:#fff
    style FINAL fill:#f44336,color:#fff
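The promotion graph above can be sketched as data. This is a hypothetical model (the stage names and the `can_promote` helper are my own, not from any real tool), but it captures the key constraint: data science work moves only one stage at a time, while validated infrastructure fans out from pre-production to every environment that holds production data.

```python
# Hypothetical encoding of the five-stage promotion graph. Stage names
# follow the diagram; the edges are the only legal promotion paths.
ALLOWED_PROMOTIONS = {
    # Infrastructure track: dev work is validated in pre-prod, then
    # released to every environment that holds production data.
    "infra-dev": {"infra-preprod"},
    "infra-preprod": {"prod-discovery", "prod-integration", "final-prod"},
    # Data science track: interactive work is promoted to automation,
    # and only automation promotes to the customer-facing environment.
    "prod-discovery": {"prod-integration"},
    "prod-integration": {"final-prod"},
    "final-prod": set(),  # nothing promotes out of Final Prod
}

def can_promote(source: str, target: str) -> bool:
    """True if a release may legally move from source to target."""
    return target in ALLOWED_PROMOTIONS.get(source, set())
```

Note that `can_promote("prod-discovery", "final-prod")` is false: data science work cannot skip the Integration validation stage.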

Stage 1: Infrastructure Development

Where you develop and test the platform itself — new tools, infrastructure changes, configuration updates. No production data. Maximum flexibility. This is standard IaC development: Terraform modules, Helm charts, CI/CD pipelines, new service integrations.

This environment protects your data science users from infrastructure development cycles. You do not want a Terraform apply breaking a data scientist’s notebook session.

Likewise, this is often where you manually install a piece of software or infrastructure to evaluate it before committing to an IaC implementation.

Stage 2: Infrastructure Pre-Production

Validated infrastructure changes are promoted here before release to the production data environments. This is your gate between “infrastructure team is experimenting” and “infrastructure is ready for data science users.” Changes released from here deploy to all three production-data environments in close succession.

Promotion from the prior stage matters here precisely because manual operations performed during infrastructure development can leave the IaC with unintended side effects. This environment never receives anything except through a clean IaC promotion.
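The fan-out from pre-production can be sketched as a small release loop. This is a sketch under assumptions: `deploy` stands in for whatever actually applies the IaC (a Terraform apply, a pipeline trigger), and the in-order, stop-on-failure behavior is one reasonable policy, not the only one.

```python
# Illustrative only: the function and environment names are mine, not a
# real tool's. `deploy` is any callable that returns True on success.
PROD_DATA_ENVIRONMENTS = ("prod-discovery", "prod-integration", "final-prod")

def release_from_preprod(deploy, environments=PROD_DATA_ENVIRONMENTS):
    """Apply a validated infrastructure release to each production-data
    environment in close succession. Stops at the first failure so a bad
    release cannot drift the environments further apart; the caller is
    responsible for reconciling a partial rollout."""
    completed = []
    for env in environments:
        if not deploy(env):
            break
        completed.append(env)
    return completed
```

In practice `deploy` would shell out to your IaC tooling; here it is just a callable so the promotion policy itself is visible.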

Stage 3: Prod Discovery (Data Science Development)

This is where the SDLC model breaks and the DS model begins. Discovery has production data. It is the interactive, exploratory environment where data scientists do their work — notebooks, feature engineering, model experimentation, data exploration.

The “Prod” prefix is deliberate and important. It signals to security and compliance teams that this environment holds real data and requires production-grade controls. But it also has development-like flexibility: data scientists can install packages, run experiments, create copies of datasets for feature engineering, and iterate rapidly.

Key characteristics:

  - Production data, under production-grade access controls and audit logging
  - Interactive flexibility: notebooks, package installs, experiments, and dataset copies for feature engineering
  - Isolated from Final Prod, so exploration can never affect customer-facing systems

Stage 4: Prod Integration (Data Science Pre-Production)

No interactive work happens here. This is the automation layer — where data science work is promoted from Discovery for validation before reaching Final Prod.

Automated pipelines run here: model training jobs, data pipeline promotions, scheduled retraining. If something breaks in Integration, it does not affect customers in Final Prod. This is the same concept as a staging environment in SDLC, but with production data and stricter controls than Discovery.
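One concrete example of the validation that belongs in Integration is a metric gate on retrained models. Everything here is hypothetical (the metric names, tolerance, and function are illustrative), but the shape is common: the automated pipeline refuses to promote a candidate that underperforms the model currently serving in Final Prod.

```python
def passes_promotion_gate(candidate_metrics: dict, baseline_metrics: dict,
                          tolerance: float = 0.01) -> bool:
    """Gate a retrained model in Integration: every metric tracked for the
    current production model must be matched within `tolerance`.
    A metric missing from the candidate counts as a failure."""
    return all(
        candidate_metrics.get(name, float("-inf")) >= baseline - tolerance
        for name, baseline in baseline_metrics.items()
    )
```

A real pipeline would pull these metrics from an evaluation job against a held-out production dataset; the gate itself stays this simple.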

Stage 5: Production (Final Prod)

Where customers consume AI/ML insights. Hosts the final copies of data engineering datasets, trained models, and inference endpoints. The most restrictive controls, the least flexibility, the highest audit requirements.

Changes arrive here only through automated promotion from Integration. No interactive access. No ad-hoc queries. No notebook sessions.
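The "no interactive access" rule can be expressed as a simple policy check. This is an illustrative sketch, not any cloud provider's IAM syntax; real enforcement would live in IAM policies and network controls, but the decision logic looks like this.

```python
# Hypothetical action names; the point is the shape of the policy.
INTERACTIVE_ACTIONS = {"notebook-session", "adhoc-query", "package-install"}

def final_prod_allows(action: str, via_automated_promotion: bool) -> bool:
    """Policy for Final Prod: interactive actions are always denied, and
    changes are accepted only when they arrive through the automated
    promotion pipeline from Integration."""
    if action in INTERACTIVE_ACTIONS:
        return False
    return via_automated_promotion
```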

The Three-Stage Variant

Not every organization needs five stages. Startups and smaller teams can collapse this to three environments while preserving the core principle: production data in an interactive environment with appropriate controls.

graph LR
    DISC3[Prod Discovery<br/>DS Interactive<br/>+ Infra Dev] --> INT3[Prod Integration<br/>DS Automation]
    INT3 --> FINAL3[Production<br/>Final Prod]
    style DISC3 fill:#FF9800,color:#fff
    style INT3 fill:#FF5722,color:#fff
    style FINAL3 fill:#f44336,color:#fff

This variant merges infrastructure development into Discovery (accepting the risk of infra changes affecting DS users) and eliminates the separate infrastructure pre-production stage. It works when the team is small enough that infrastructure changes can be coordinated directly with the data scientists they affect.

An even more minimal variant drops Integration entirely:

graph LR
    DISC2[Prod Discovery<br/>DS Interactive] --> FINAL2[Production<br/>Final Prod]
    style DISC2 fill:#FF9800,color:#fff
    style FINAL2 fill:#f44336,color:#fff

This is the startup model: data scientists work directly in an environment with production data, and promotions go straight to Final Prod. It works until your first compliance audit asks how you validate ML pipeline changes before they affect customers.

Security and Compliance Implications

The “Prod” designation on Discovery is not just naming. It carries real consequences: production-grade access controls, full audit logging and monitoring on every dataset, and a defensible framing when auditors review the architecture.

The last point is pragmatic but important. When a SOC 2 auditor sees an environment called “Development” with production data, they flag it immediately. When they see “Prod Discovery” with documented controls, monitoring, and access restrictions, the conversation is about whether the controls are adequate — not whether the architecture is fundamentally wrong.
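To make "documented controls" concrete, here is a minimal sketch of environment-tagged audit logging, assuming a simple in-process sink; a real deployment would ship these records to an append-only audit store (CloudTrail, a SIEM, or similar) rather than keep them in memory.

```python
import functools
import json
import time

def audited(environment, sink):
    """Decorator sketch: record who performed which action on data in
    which environment. `sink` is any callable that accepts one JSON line;
    the environment tag is what lets an auditor filter for every touch
    of production data in Discovery."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(user, *args, **kwargs):
            sink(json.dumps({
                "ts": time.time(),
                "environment": environment,
                "user": user,
                "action": fn.__name__,
            }))
            return fn(user, *args, **kwargs)
        return inner
    return wrap
```

Wrapping every data-access function this way is crude compared to platform-level logging, but it shows the minimum an auditor expects: timestamp, identity, environment, action.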

Where I Have Applied This

This framework is not theoretical. It evolved from watching the problem manifest across multiple organizations over fifteen years — each role adding a layer to the thinking.

The Compliance Foundation (2006–2013)

Three roles established the security and data governance principles that underpin this framework. At BD Biosciences (2006–2007), I managed IT for a medical device manufacturing plant under FDA quality system regulations — where every system change required documented validation. At NC Department of Revenue (2007–2011), I managed taxpayer data under IRS Safeguard compliance (Publication 1075), passed a seven-month federal audit producing a 1,300-page report, and learned that production data requires production-grade controls regardless of who is accessing it. At SAS Institute (2011–2013), I administered validated systems for 80+ pharmaceutical customers under FDA CFR Part 11 — clinical trial data across dozens of companies where biostatisticians needed interactive access for analysis while the production systems required change-controlled, auditable deployments. The pharmaceutical tension between analytical flexibility and regulatory compliance is the same DS-vs-SDLC conflict described above, just with FDA enforcement authority behind it.

The common thread: regulated industries figured out decades ago that “interactive access to sensitive data” is not incompatible with “auditable, controlled environments.” You just have to design for both simultaneously rather than treating them as mutually exclusive.

The Catalyst (2013–2016)

Full Implementation (2019–present)

The Organizational Challenge

The technical architecture is the straightforward part. The hard part is getting three constituencies to agree:

  1. The CISO / Security team — wants minimal data exposure, maximum controls, and no exceptions to the standard SDLC model
  2. The CDO / Data Science team — wants production data access with development-like flexibility and minimal friction
  3. The business unit leaders — want ML models that actually work (which requires real data) without compliance risk

The five-stage framework gives each constituency what they need: security gets production-grade controls and audit logging on every environment with real data; data science gets interactive access to production data; business gets models trained on real data with a documented compliance posture.

The EMBA coursework I am completing at UNC Wilmington has reinforced something I learned through experience: the technical architecture is a negotiation artifact. The framework succeeds not because it is technically elegant, but because it gives each stakeholder a way to say yes without compromising their core requirements.

Implications

If you are building an AI organization — or evaluating whether your current platform can support one — this is the first infrastructure decision you need to get right. Everything downstream depends on it: model quality, compliance posture, and the speed at which your data scientists can iterate.

The standard SDLC model will not give you this. You need a purpose-built framework that acknowledges the fundamental difference between application development and data science: production data is not the destination — it is the starting point.

Categories: technical, ai

About the Author: Michael McGarrah is a Cloud Architect with 25+ years in enterprise infrastructure, machine learning, and system administration. He holds an M.S. in Computer Science (AI/ML) from Georgia Tech and a B.S. in Computer Science from NC State University, and is currently pursuing an Executive MBA at UNC Wilmington. LinkedIn · GitHub · ORCID · Google Scholar · Resume