Models as a Service. Your weights. Our GPU.
Rent a dedicated NVIDIA Blackwell GPU in our Auckland facility. Ship us any model weights you have rights to — open-weight, fine-tuned, or proprietary — and we hand back a private, OpenAI-compatible endpoint. No shared tenancy. No tokens leaving the country.
Your model. Our hardware. Private API.
Rent a dedicated GPU in our Auckland facility. Bring any model weights you have rights to — we load them and give you a private, OpenAI-compatible endpoint. Nothing shared, nothing logged, nothing offshore.
Dedicated 24GB
Entry tier for small-to-mid models and prototyping.
Contact for pricing
NVIDIA RTX PRO 4000 Blackwell
24 GB GDDR7
Dedicated · single tenant
Up to 14B dense models
Quantised 30B-class (Q4/Q5)
Fine-tune serving & LoRA adapters
Embedding and reranker workloads
Internal prototypes and pilots
Dedicated 96GB
Flagship for 70B-class models and long-context workloads.
Contact for pricing
NVIDIA RTX PRO 6000 Blackwell
96 GB GDDR7
Dedicated · single tenant
70B-class dense models
100B+ MoE (active params permitting)
Long-context coding and agent stacks
Vision-language and multimodal models
Production workloads at single-tenant latency
From weights to API in days, not quarters.
Four steps. No procurement cycle for the GPU, no data leaving the country.
01
Pick a model
Tell us what you want to run. Open-weight model from HuggingFace, your own fine-tune, a LoRA adapter, or a stack of smaller models in memory. If it fits in 24GB or 96GB of VRAM, we can serve it.
02
Pick a GPU
RTX PRO 4000 Blackwell (24GB) for smaller models and pilots, or RTX PRO 6000 Blackwell (96GB) for 70B-class and long-context workloads. Both are single-tenant — your GPU is yours.
03
We deploy
We load the weights onto your GPU in our Auckland facility, wire up an OpenAI-compatible endpoint, and hand you an API key. Typical lead time is a few business days from signed agreement.
04
You ship
Point your existing OpenAI SDK at your private endpoint. Streaming, tool use, system prompts — all standard. We monitor the hardware; you own the model and the data.
The honest details.
Anything that fits the VRAM. The 24GB tier comfortably runs models up to about 14B at full precision, or 30B-class with quantisation (Q4/Q5). The 96GB tier handles 70B-class dense models at full precision, larger MoE models within active-parameter limits, and most vision-language models. Common picks: Llama 3.x, Qwen 3, DeepSeek, GPT-OSS, Mistral, Gemma — plus your fine-tunes and LoRAs.
You do. We need a model file (or a HuggingFace repo we can pull from) and confirmation that you have the licence to run it. We do not redistribute your weights, we do not log your prompts or responses, and we do not train on anything you send through the endpoint.
In our Auckland facility, hosted in Datacom Datacentres on NZ-controlled networking. Your traffic never leaves the country — no offshore failover, no cloud passthrough, no cross-border monitoring.
Yes. Swapping the loaded model on your dedicated GPU is part of the service — typical turnaround is a business day. If you want to hot-swap or run multiple models concurrently within the VRAM budget, we can configure multi-model serving (vLLM or TGI) on request.
OpenAI-compatible /v1/chat/completions with streaming, tool use, and system prompts. Your private endpoint is a unique subdomain on api.aifoundry.co.nz protected by an API key issued just for your deployment. Drop-in with the OpenAI SDK, LangChain, LiteLLM, or anything that speaks the OpenAI shape.
Month-to-month with a notice period, or a discounted term (6–24 months) for production workloads. Hardware setup and model loading are billed once at deployment. Pricing depends on the GPU tier, expected concurrency, and any custom serving requirements — talk to us for a quote.
Start building today.
Sign up, generate an API key, and send your first request in minutes. Cancel from the Stripe portal any time.