Job Description
Job Title: Victoria Metrics Architect
Location: Bellevue, WA / Overland Park, KS
Contract
Required Qualifications:
VictoriaMetrics: Expert Level
Candidates must have significant hands-on production experience operating VictoriaMetrics at scale, specifically VMCluster deployments handling sustained, high-cardinality workloads in live environments. This is the non-negotiable baseline for the role.
VMCluster internals at depth:
the write path from VMInsert through VMStorage replication, the query fan-out and merge behavior of VMSelect, and the performance implications of topology decisions on ingestion throughput and query latency.
Active time series lifecycle management:
how time series are created, sustained, and expired; the relationship between cardinality and memory pressure; and the ability to diagnose and remediate a cardinality explosion in a production environment.
MetricsQL fluency:
advanced aggregation, rollup window semantics, subquery patterns, and query design that reduces load on VMStorage at scale.
VMAgent at depth:
scrape configuration, stream aggregation for edge-side cardinality reduction, rate limiting, deduplication, and write buffering continuity during upstream unavailability.
VMAuth multi-tenancy:
per-tenant routing via VMUser custom resources, token-based authentication, and read/write path segregation.
VMAlert and VMAnomaly:
alerting and recording rule design, anomaly model selection, and integration with enterprise alert dispatch systems.
Federation design:
global query layer architecture, cross-cluster deduplication, and remote_write performance tuning under high-cardinality ingestion at sustained scale.
Storage architecture:
retention modelling, downsampling, backup and restore, and capacity planning for time-series workloads.
VictoriaMetrics Operator:
lifecycle management of all VM custom resource definitions and upgrade strategy on OpenShift.
Red Hat OpenShift: Production Depth
Substantial Kubernetes experience, with a material portion on Red Hat OpenShift in bare-metal or on-premises enterprise environments, not exclusively managed cloud Kubernetes.
OpenShift security model:
Security Context Constraints, Network Policy, namespace RBAC, and the constraints that apply to stateful, high-throughput workloads.
StatefulSet lifecycle, PersistentVolumeClaim management, and StorageClass selection for write-intensive time-series workloads.
OCP upgrade path management and the implications for Operator compatibility and cluster monitoring interactions.
Multi-cluster OpenShift topology:
hub-and-spoke architectures, cross-cluster networking, and remote scrape or remote_write connectivity across cluster boundaries.
Comfort designing for IPv6 and dual-stack network environments, which are increasingly common in carrier-grade infrastructure deployments.
GitOps and CI/CD Delivery
GitOps-native delivery as a professional standard:
all platform configuration managed in Git, no manual changes to production cluster state, and a clear promotion gate model from lab through to production.
ArgoCD at production scale:
application hierarchy design, sync policy configuration, health checks for custom resources, and multi-cluster application deployment.
Kustomize overlay strategy for multi-cluster and multi-tenant deployments, using base definitions with environment-specific patches.
GitLab CI/CD pipeline design:
manifest validation, environment promotion gates, and automated operator upgrade pipelines.
Terraform or equivalent infrastructure-as-code for provisioning supporting platform resources.
Security and Identity
HashiCorp Vault at production depth:
dynamic secrets, Vault Secrets Operator synchronisation, token lifecycle management, and PKI secrets engine integration for certificate issuance.
Enterprise PKI:
TLS certificate lifecycle, automated renewal, and CA distribution to distributed cluster workloads.
OIDC and OAuth2 integration:
platform service authentication via an enterprise identity provider, service account token federation, and the elimination of static credential patterns.
Zero Trust design as a default:
every interface between platform components authenticated and encrypted; no implicit trust between tenants, ingestion sources, or query consumers.
Telecommunications and Network Observability
Proven experience designing or operating observability platforms for telecommunications infrastructure: 5G core, RAN, transport, or carrier-grade edge environments.
FCAPS framework alignment:
mapping Fault, Configuration, Accounting, Performance, and Security monitoring requirements to metric taxonomies, alerting rules, and operational dashboards.
Heterogeneous vendor telemetry integration:
Prometheus exporter compatibility assessment, OpenMetrics format validation, and labelling standardization across multi-vendor sources.
Multi-vendor, multi-tenant metrics ingestion design:
label isolation strategy, per-vendor cardinality allocation, and data segregation enforcement at the proxy and routing layer.
Enterprise NOC integration:
alert routing design from evaluation engine through to ticketing or event management platforms, deduplication, suppression, and severity mapping.