How It Works
The AI stack is a layered architecture of interconnected services, platforms, and infrastructure components that collectively enable organizations to build, deploy, and operate artificial intelligence systems at scale. Understanding how these layers interact — from raw compute through model training to production inference — is essential for procurement officers, enterprise architects, and technical evaluators navigating the US AI services market. This page maps the structural mechanics of how AI service delivery operates across that stack, including where oversight enters, how delivery paths diverge, and what professional practitioners monitor in production.
Where Oversight Applies
AI stack deployments operate under a patchwork of regulatory and standards-based oversight rather than a single governing body. At the federal level, the National Institute of Standards and Technology (NIST) published the AI Risk Management Framework (AI RMF 1.0) in January 2023, establishing voluntary governance criteria across four core functions: Govern, Map, Measure, and Manage. Organizations operating in regulated industries — healthcare, financial services, federal contracting — face additional overlay requirements. Healthcare AI applications intersect with FDA oversight under the agency's software as a medical device (SaMD) framework, with 21 CFR Part 11 governing associated electronic records and signatures, while federal contractors are subject to NIST SP 800-53 security controls applied to AI-enabled systems.
Data governance is a parallel compliance layer. Training data pipelines touching personally identifiable information fall under the FTC Act Section 5 (unfair or deceptive practices) and, in California, the California Consumer Privacy Act (CCPA). The EU AI Act, while not US law, affects any US-based provider with European users, establishing risk tiers that classify general-purpose AI models above a defined capability threshold as subject to mandatory transparency obligations.
The broader service landscape — including managed AI services, AI security and compliance services, and responsible AI services — is shaped by these overlapping standards rather than a single unified regulator.
Common Variations on the Standard Path
The "standard" AI deployment path assumes a sequence of data ingestion, model training or selection, fine-tuning, and production inference through an API or embedded application. In practice, four major delivery variations alter this path significantly:
- Foundation model adoption (no training): Organizations access pretrained models from foundation model providers via AI API services, bypassing the training phase entirely. This path reduces time-to-deployment but constrains customization and raises data-privacy questions when prompts are transmitted to third-party endpoints.
- Fine-tuning on proprietary data: Rather than full model training, practitioners use fine-tuning services to adapt a foundation model to domain-specific vocabulary or task formats. This approach requires substantially less compute than training from scratch — fine-tuning a 7-billion-parameter model typically requires fewer than 8 A100 GPUs for a short run — but demands careful dataset curation.
- Retrieval-augmented generation (RAG): Retrieval-augmented generation services attach a live retrieval layer — typically a vector database — to a foundation model, enabling real-time grounding in organizational knowledge bases without weight updates. RAG is the dominant pattern for enterprise knowledge management deployments.
- On-premises or air-gapped deployment: Organizations in defense, intelligence, and regulated finance sectors route through on-premises AI deployment to avoid external data egress. This path requires self-managed GPU cloud services or dedicated hardware procurement, significantly increasing operational overhead.
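The RAG variation above hinges on one mechanical step: ranking stored passages by similarity to the query and prepending the best match to the prompt. The sketch below illustrates that retrieval step only; it substitutes a toy bag-of-words vector for a learned embedding, and the documents and question are invented for demonstration. Production systems use embedding models and a vector database instead.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a term-frequency vector."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

knowledge_base = [
    "Quarterly revenue grew 12 percent year over year.",
    "The corporate VPN requires multi-factor authentication for all staff.",
    "Model weights are stored in the artifact registry.",
]
question = "How do staff connect to the VPN?"
context = retrieve(question, knowledge_base, k=1)[0]
# The retrieved passage is prepended to the prompt, grounding the model's
# answer in organizational knowledge without any weight updates.
prompt = f"Context: {context}\nQuestion: {question}"
```

Because grounding happens at query time, updating the knowledge base is a data operation rather than a retraining cycle — the property that makes RAG attractive for fast-changing enterprise content.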
The comparison between open-source vs proprietary AI services cuts across all four paths and is a primary decision variable in enterprise architecture reviews.
What Practitioners Track
Production AI systems require continuous operational monitoring across dimensions that differ from conventional software. AI observability and monitoring covers four primary signal categories:
- Model drift: Statistical divergence between training-time and inference-time input distributions, measured via population stability index (PSI) or Kullback-Leibler divergence metrics.
- Latency and throughput: Inference latency at the 95th and 99th percentile thresholds, governed by AI service level agreements that typically specify p95 response times in milliseconds.
- Cost per inference: Token-based pricing on API-served models makes cost-per-query a first-class operational metric; AI stack cost optimization practitioners use batching strategies and model quantization to manage this.
- Safety and output quality: Hallucination rates, toxicity scores, and refusal rates are tracked through automated evaluation pipelines, increasingly standardized against NIST AI RMF Measure function criteria.
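Two of the signal categories above can be sketched concretely: input drift via the population stability index, and tail latency at the 95th percentile. The bin count, synthetic distributions, and sample sizes below are illustrative assumptions, not recommended monitoring settings.

```python
import math
import random

def psi(expected: list[float], actual: list[float], bins: int = 10) -> float:
    """Population stability index between a baseline and a live sample,
    using equal-width bins over the baseline's observed range."""
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(bins + 1)]

    def frac(sample: list[float], i: int) -> float:
        left, right = edges[i], edges[i + 1]
        n = sum(1 for x in sample
                if left <= x < right or (i == bins - 1 and x == right))
        return max(n / len(sample), 1e-6)  # floor avoids log(0) on empty bins

    return sum(
        (frac(actual, i) - frac(expected, i))
        * math.log(frac(actual, i) / frac(expected, i))
        for i in range(bins)
    )

def p95(latencies_ms: list[float]) -> float:
    """95th-percentile latency via the nearest-rank method."""
    ranked = sorted(latencies_ms)
    return ranked[max(0, math.ceil(0.95 * len(ranked)) - 1)]

random.seed(0)
baseline = [random.gauss(0.0, 1.0) for _ in range(5000)]   # training-time inputs
live = [random.gauss(0.5, 1.0) for _ in range(5000)]       # shifted mean: drift
drift_score = psi(baseline, live)

latencies = [random.expovariate(1 / 120) for _ in range(1000)]  # simulated ms
p95_latency = p95(latencies)
```

A common operational rule of thumb treats PSI below 0.1 as stable and above 0.25 as significant drift warranting investigation; the p95 figure is what an SLA's millisecond threshold would be checked against.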
MLOps platforms and tooling provide the operational scaffolding for this monitoring layer, integrating with AI data pipeline services upstream and alerting systems downstream.
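The cost-per-inference metric tracked above reduces to token accounting. The per-token rates and traffic figures below are placeholders for illustration, not any vendor's actual pricing.

```python
# Assumed per-token rates ($ per token) -- placeholders, not real vendor prices.
INPUT_RATE = 3.00 / 1_000_000
OUTPUT_RATE = 15.00 / 1_000_000

def cost_per_query(input_tokens: int, output_tokens: int) -> float:
    """Cost of one API inference under token-based pricing."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# A RAG-style query: a large grounded prompt producing a short answer.
single = cost_per_query(input_tokens=4000, output_tokens=300)   # 0.0165
monthly = single * 50_000                                       # 50k queries/month
```

The arithmetic makes the optimization levers visible: batching amortizes fixed overhead across queries, while quantization and smaller models lower the effective per-token rate.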
The Basic Mechanism
At its core, an AI stack converts raw data into predictions, classifications, or generated content through a sequence of discrete processing phases:
- Data ingestion and preprocessing: Raw data is collected, cleaned, and transformed into formats suitable for model consumption. This phase is handled by AI data pipeline services and may include vectorization for embedding-based retrieval.
- Model selection or training: A model is either selected from existing foundation model inventories or trained from scratch using AI model training services on GPU compute infrastructure.
- Evaluation and validation: The model is evaluated against held-out test sets using task-specific metrics (F1 score, BLEU, ROUGE, or domain-specific benchmarks).
- Deployment and serving: The validated model is packaged and served via an inference endpoint — either through AI API services for cloud-hosted models or through containerized serving stacks for on-premises AI deployment.
- Monitoring and iteration: Live performance signals feed back into retraining or prompt-engineering cycles, closing the operational loop.
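The evaluation phase in the sequence above reduces, for a binary classification task, to precision-recall arithmetic over a held-out test set. The sketch below computes the F1 score named in that phase; the labels and predictions are invented for illustration.

```python
def f1_score(y_true: list[int], y_pred: list[int]) -> float:
    """F1 score for binary labels: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Held-out test labels vs. model predictions (illustrative data).
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
score = f1_score(y_true, y_pred)  # precision 0.75, recall 0.75 -> F1 0.75
```

A validation gate of this kind typically sits between training and deployment: the model is promoted to serving only if the score clears a predefined threshold, and the same metric is recomputed during the monitoring phase to detect degradation.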
The AI Stack Components Overview maps how each layer of this mechanism corresponds to distinct service categories in the commercial market. The homepage provides the top-level orientation to how this reference authority organizes those categories for professional research and procurement use.