Senior DevOps / Platform Engineering · v1.0 · 2026

New Project Intake Checklist

Complete this at the start of every new project. Every unchecked item is a risk. Fill in answers, then download as PDF for sign-off.

Kubernetes · CI/CD · Data Classification · Backup & DR · Security · Observability · Cost Planning

01
Project Context & Stakeholders
Who owns this, what does it do, who approves things
Purpose: Establish shared understanding before any infrastructure work begins. Misaligned expectations here cause the most expensive rework.
What is this project? STANDARD
High-level description: web app, data pipeline, ML model, internal tool, API service, etc.
Who is the business owner / sponsor? CRITICAL
The person accountable for budget and go/no-go decisions. Must be a named individual, not a team.
Who is the application-side technical lead?
Primary engineering contact for app-level decisions and architecture questions.
Who approves infrastructure changes in production? CRITICAL
CAB process, named individual, or approval ticket workflow?
What are the SLA / uptime requirements? CRITICAL
e.g. 99.9% uptime, RTO < 1h, RPO < 15min. Drives HA topology, replica count, and backup strategy.
What is the expected user base and traffic profile?
Internal only or external? Concurrent users? Peak load windows? Seasonal spikes?
What is the environment timeline?
When is dev needed? Staging? Production go-live? Any hard deadlines?
Are there any known dependencies on other teams or systems?
Shared databases, downstream APIs, external vendors, or blocked by another project?
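Worked example for the SLA question above: an uptime percentage is easier to discuss with the business owner once it is converted into a downtime allowance. The sketch below is a minimal illustration, not a mandated formula; the function name `allowed_downtime` and the 30-day period are assumptions chosen for the example.

```python
from datetime import timedelta


def allowed_downtime(availability_pct: float, period: timedelta) -> timedelta:
    """Return the maximum downtime an availability target permits over a period."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    # The unavailable fraction of the period is the downtime allowance.
    return period * (1 - availability_pct / 100)


# Example (assumed inputs): a 99.9% target measured over a 30-day month.
budget = allowed_downtime(99.9, timedelta(days=30))
print(f"Downtime allowance: {budget.total_seconds() / 60:.1f} minutes")
```

At 99.9% over 30 days the allowance works out to roughly 43 minutes, which is worth comparing against the stated RTO: a single incident that takes the full RTO < 1h to recover would already exceed it.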
02
Kubernetes & Cluster Access
We do not own the cluster — all access must be requested and approved
Important: We do not own the Kubernetes cluster. All namespace provisioning, RBAC, resource quotas, and network policies require approval from Aaron or Enterprise Ops. Do NOT assume access — request it early. Delays here block everything else.
Which cluster will this project run on? CRITICAL
Get the cluster name, API server URL, and environment (dev/staging/prod). Confirm it exists.
Who do we contact to request namespace creation? CRITICAL
Aaron? Enterprise Ops team? Ticket queue? Get the exact process and expected turnaround time.
Has a namespace been requested and approved? CRITICAL
Namespace name, cluster, environment. Do not begin CI/CD setup until this is confirmed in writing.
What RBAC roles are we being granted? IMPORTANT
view / edit / admin? Cluster-scoped or namespace-scoped? Who grants it and how?
What resource quotas are applied to our namespace? IMPORTANT
CPU limits, memory limits, PVC storage, number of pods/services. Request increases before dev starts.
What Kubernetes version is the cluster running?
Affects API compatibility. Validate Helm charts, CRDs, and operator versions against it.
What ingress controller is available? IMPORTANT
nginx, Traefik, ALB Ingress? Who manages it? Do we get a subdomain automatically?
Is there a service mesh? (Istio, Linkerd)
Affects mTLS, traffic routing, observability. If yes, are sidecars injected automatically?
Is there an internal container registry?
Harbor, ECR, Artifactory? Get push credentials before CI/CD setup begins.
Are there existing cluster-wide admission controllers? IMPORTANT
OPA/Gatekeeper, Kyverno? These may silently block deployments if policies are not met.
What is the node pool / node selector strategy?
Spot vs on-demand? GPU nodes? Taints/tolerations we need to configure?
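Worked example for the resource quota question above: before dev starts, compare the planned workload against the quota so that increases can be requested early. This is a rough sketch under stated assumptions; the `Resources` type, the `quota_shortfall` helper, and all numbers are hypothetical, and a real check would read the values from the cluster's ResourceQuota object.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    cpu_millicores: int
    memory_mib: int
    pods: int


def quota_shortfall(planned: Resources, quota: Resources) -> dict[str, int]:
    """Return how far the plan exceeds the quota, per resource.

    An empty dict means the planned workload fits within the quota.
    """
    shortfall = {}
    for field in ("cpu_millicores", "memory_mib", "pods"):
        excess = getattr(planned, field) - getattr(quota, field)
        if excess > 0:
            shortfall[field] = excess
    return shortfall


# Hypothetical numbers: 6 replicas at 500m CPU and 1536 MiB memory each.
quota = Resources(cpu_millicores=4000, memory_mib=8192, pods=20)
planned = Resources(cpu_millicores=6 * 500, memory_mib=6 * 1536, pods=6)
print(quota_shortfall(planned, quota))
```

With these assumed numbers, CPU and pod count fit but memory is over by 1024 MiB, which is the figure to put in the quota increase request.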
03
Data Classification & Compliance
Determines cluster tier, encryption, audit obligations, and backup retention
CRITICAL: Data classification determines EVERYTHING — which cluster tier, encryption requirements, audit logging, network isolation, and backup retention period. Get this answered in the first meeting. Wrong classification = compliance failure.
What is the data classification level? CRITICAL
Public / Internal / Confidential / Restricted / PII / PHI / PCI / ITAR / CUI?
Does this handle PII (Personally Identifiable Information)? CRITICAL
Names, emails, SSNs, IP addresses. Triggers GDPR / CCPA / state privacy law requirements.
Does this handle PHI (Protected Health Information)? CRITICAL
Medical records, diagnoses. Triggers HIPAA — dedicated cluster and BAA may be required.
Does this handle payment card data (PCI DSS)? CRITICAL
Card numbers, CVVs. Requires PCI-compliant cluster and strict network segmentation.
Does this handle export-controlled or controlled unclassified data (ITAR / CUI)? CRITICAL
Export-controlled technical data or US government Controlled Unclassified Information. May require FedRAMP authorization or a dedicated government cloud deployment.
What compliance frameworks apply? IMPORTANT
SOC 2, ISO 27001, NIST 800-53, HIPAA, FedRAMP, GDPR, CCPA?
Is encryption at rest required? IMPORTANT
LUKS, etcd encryption, encrypted PVCs, KMS-managed keys?
What audit logging is required? IMPORTANT
Who needs access to logs? How long must they be retained? (30d / 1yr / 7yr)
Are there data residency requirements? IMPORTANT
Data must stay in a specific country or region? Affects cloud region and cross-region backup strategy.
Data Classification → Infrastructure Tier Reference
Classification       | Cluster Tier       | Encryption    | Audit Logs | Backup Retention
Public / Internal    | Shared cluster     | TLS only      | 30 days    | 7 days
Confidential / PII   | Isolated namespace | TLS + at-rest | 90 days    | 30 days
PHI / PCI / HIPAA    | Dedicated cluster  | Full + KMS    | 1 year     | 7 years
ITAR / CUI / FedRAMP | Gov-dedicated      | FIPS 140-2    | 7 years    | 7 years
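The reference table above can be expressed as a small lookup, which is handy when intake answers are processed by a script. This is a sketch only: the `Tier` type and `required_tier` function are hypothetical names, and the rule that the most restrictive matching row wins when a project carries several labels is an assumption, not something the table states.

```python
from typing import NamedTuple


class Tier(NamedTuple):
    cluster_tier: str
    encryption: str
    audit_log_retention: str
    backup_retention: str


# Rows copied from the reference table, ordered least to most restrictive.
TIERS = [
    ({"Public", "Internal"}, Tier("Shared cluster", "TLS only", "30 days", "7 days")),
    ({"Confidential", "PII"}, Tier("Isolated namespace", "TLS + at-rest", "90 days", "30 days")),
    ({"PHI", "PCI", "HIPAA"}, Tier("Dedicated cluster", "Full + KMS", "1 year", "7 years")),
    ({"ITAR", "CUI", "FedRAMP"}, Tier("Gov-dedicated", "FIPS 140-2", "7 years", "7 years")),
]


def required_tier(labels: set[str]) -> Tier:
    """Return the most restrictive tier matching any of the given labels.

    Assumption: when several rows match, the last (strictest) one applies.
    """
    matched = [i for i, (names, _) in enumerate(TIERS) if names & labels]
    if not matched:
        raise ValueError(f"no tier defined for {sorted(labels)}")
    return TIERS[max(matched)][1]


# Example: an internal tool that also stores PII lands in the stricter row.
print(required_tier({"Internal", "PII"}).cluster_tier)
```

Raising on an unknown label, rather than defaulting to the shared cluster, keeps an unanswered classification question from silently selecting the weakest tier.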
04
CI/CD & Source Control
Pipeline toolchain, registry, deployment strategy, and secret management
Where is the source code hosted?
GitLab, GitHub, Bitbucket? Self-hosted or cloud? Group/org name and repo URL?
What CI/CD platform will we use? IMPORTANT
GitLab CI, GitHub Actions, Jenkins, ArgoCD, Tekton? Who manages the runners?
What is the deployment strategy? IMPORTANT
Rolling update, blue/green, canary? Who approves production deployments?
Do we use GitOps? (ArgoCD / Flux)
Is there an existing GitOps config repo? Who has push access to it?
What container registry will be used? IMPORTANT
Harbor, ECR, GCR, Artifactory? Push credentials needed before pipeline setup.
Will we use Helm charts or raw manifests?
Existing chart? Custom? Values files per environment? Chart repository location?
What environments need pipelines?
dev / staging / prod? Manual approval gates between stages?
What secret management solution is in use? CRITICAL
HashiCorp Vault, AWS Secrets Manager, Kubernetes Secrets, Sealed Secrets, External Secrets Operator?
Are there SAST / DAST / SCA scanning requirements? IMPORTANT
Checkmarx, Snyk, Trivy, OWASP ZAP? Must scans pass before a deployment is allowed?
05
Networking & Security
Ingress, egress, firewall rules, TLS, and zero-trust posture
What domains / subdomains does this project need? IMPORTANT
Who manages DNS? Route 53, Cloudflare, internal DNS? How long does provisioning take?
Who issues TLS certificates? IMPORTANT
cert-manager + Let's Encrypt, internal CA, purchased wildcard cert?
Does the app need outbound internet access?
If yes, what destinations? Proxy required? Egress firewall rules needed?
Does the app need to reach internal services / databases?
VPC peering, PrivateLink, VPN tunnel, service endpoints? Who requests connectivity?
Are NetworkPolicies required in the namespace? IMPORTANT
Default deny-all? Which pods can communicate with which? Who approves exceptions?
Is a WAF (Web Application Firewall) required?
AWS WAF, Cloudflare, ModSecurity? Who manages the rule sets?
What authentication method does the app use?
OAuth2, OIDC, SAML, LDAP, API keys? SSO / enterprise IdP integration required?
Is image signing required? IMPORTANT
Cosign, Notary, SBOM generation? Who signs and who verifies at deploy time?
Are there vulnerability scanning requirements for running containers?
Runtime scanning (Falco, Aqua, Prisma)? What happens on a critical CVE in prod?
06
Storage & Databases
Persistent volumes, databases, object storage, and retention policies
Does the app need persistent storage? IMPORTANT
PVCs? Which StorageClass? RWO vs RWX? Initial size and growth estimate?
What database(s) does this project use? CRITICAL
PostgreSQL, MySQL, MongoDB, Redis, Elasticsearch? Managed service or self-hosted in cluster?
Who manages the database? IMPORTANT
Provisioning, patching, backups, failover — platform team or app team? Clearly assign ownership.
Is object storage needed?
S3, MinIO, GCS, Azure Blob? Bucket naming, lifecycle policies, access control?
What are the data retention requirements?
How long must data be kept? Hot/warm/cold tiers? Legal hold requirements?
What is the expected data volume and growth rate?
Current size? Monthly growth rate? Used for storage class selection and cost projection.
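Worked example for the data volume question above: current size and monthly growth rate can be turned into a projection for sizing and cost estimates. The sketch assumes compound monthly growth, which may not match the real workload; the function names and the sample figures are hypothetical.

```python
from typing import Optional


def projected_size_gib(current_gib: float, monthly_growth_pct: float, months: int) -> float:
    """Project data volume, assuming growth compounds once per month."""
    if months < 0:
        raise ValueError("months must be non-negative")
    return current_gib * (1 + monthly_growth_pct / 100) ** months


def months_until_full(
    current_gib: float,
    monthly_growth_pct: float,
    capacity_gib: float,
    horizon_months: int = 120,
) -> Optional[int]:
    """Return the first month the projection exceeds capacity, or None within the horizon."""
    for month in range(horizon_months + 1):
        if projected_size_gib(current_gib, monthly_growth_pct, month) > capacity_gib:
            return month
    return None


# Example (assumed inputs): 100 GiB today, growing 10% per month.
print(f"After 12 months: {projected_size_gib(100, 10, 12):.0f} GiB")
print(f"A 200 GiB volume fills in month: {months_until_full(100, 10, 200)}")
```

The second function answers the question that usually matters for PVC sizing: how long the initial allocation lasts before a resize or migration is needed.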
07
Backup & Disaster Recovery
RTO, RPO, backup schedule, and tested restore procedures
CRITICAL: A backup strategy with no tested restore is not a backup strategy. Restore tests must be scheduled, executed, and documented before go-live. No exceptions.
What is the Recovery Time Objective (RTO)? CRITICAL
How long can the system be down before it becomes a business-impacting problem?
What is the Recovery Point Objective (RPO)? CRITICAL
How much data loss is acceptable? RPO = 0 requires synchronous replication.
What needs to be backed up? CRITICAL
Databases, PVCs, config maps, secrets, application state, object storage, Helm values?
What backup tool will be used? IMPORTANT
Velero (K8s), pg_dump, mysqldump, RDS snapshots, AWS Backup, Kasten K10?
What is the backup schedule? IMPORTANT
Hourly / daily / weekly? Incremental or full? Retention per tier (7d / 30d / 1yr)?
Where are backups stored? IMPORTANT
Different region? Different cloud account? Air-gapped? Encrypted at rest?
Has a restore procedure been documented and tested? CRITICAL
Full restore test must be completed and signed off before production go-live.
Is multi-region or multi-AZ deployment required?
Active-active, active-passive, or single-region with cross-region backup?
Is there a DR runbook? IMPORTANT
Who declares a DR event? Who executes the runbook? Is the contact tree documented?
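Worked example linking the RPO and backup schedule questions above: for periodic backups, the schedule itself bounds how much data can be lost. The sketch below uses a simplifying assumption that worst-case loss equals the backup interval plus the time a backup takes to complete; replication, WAL archiving, or continuous backup would change the calculation. The function names are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(
    backup_interval: timedelta,
    backup_duration: timedelta = timedelta(0),
) -> timedelta:
    """Estimate worst-case data loss for periodic backups.

    Assumption: a failure just before the next backup completes loses
    everything written since the previous backup started.
    """
    return backup_interval + backup_duration


def meets_rpo(
    rpo: timedelta,
    backup_interval: timedelta,
    backup_duration: timedelta = timedelta(0),
) -> bool:
    """Return True if the schedule's worst-case loss is within the RPO."""
    return worst_case_data_loss(backup_interval, backup_duration) <= rpo


# Example: hourly backups cannot satisfy an RPO of 15 minutes.
print(meets_rpo(timedelta(minutes=15), backup_interval=timedelta(hours=1)))
```

A check like this makes mismatches visible at intake: if the stated RPO is tighter than any feasible backup interval, the answer is replication rather than a faster cron schedule.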
08
Observability & Monitoring
Metrics, logs, traces, alerts, and on-call rotation
What monitoring stack is available? IMPORTANT
Prometheus + Grafana, Datadog, New Relic, CloudWatch, Dynatrace?
What logging stack is in use? IMPORTANT
ELK, Loki + Grafana, Splunk, CloudWatch Logs? How do developers query logs?
Is distributed tracing required?
Jaeger, Zipkin, AWS X-Ray, Datadog APM? OpenTelemetry instrumentation needed?
Who receives alerts and what is the on-call rotation? CRITICAL
PagerDuty, OpsGenie, Slack? Who is primary? Who is escalation? After-hours coverage?
What are the critical alert thresholds? IMPORTANT
CPU > X%, error rate > Y%, p99 latency > Zms, disk > W%?
Are SLOs / SLIs defined?
Error budget? Who owns the SLO dashboard? Who reviews it and at what cadence?
How long must logs be retained?
30 days hot + archive? Compliance-driven retention rules apply here too.
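Worked example for the SLO / error budget question above: an error budget is the number of failures the SLO tolerates, and the useful figure for review meetings is how much of it remains. This is a minimal sketch for a request-based SLO; the function name and the sample traffic numbers are assumptions.

```python
def error_budget_remaining(slo_pct: float, total_requests: int, failed_requests: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    # Failures the SLO tolerates for this volume of traffic.
    allowed_failures = total_requests * (1 - slo_pct / 100)
    if allowed_failures <= 0:
        raise ValueError("error budget is zero (SLO of 100% or no traffic)")
    return 1 - failed_requests / allowed_failures


# Example (assumed inputs): 99.9% SLO, 1,000,000 requests, 250 failures.
remaining = error_budget_remaining(99.9, 1_000_000, 250)
print(f"Error budget remaining: {remaining:.0%}")
```

With these assumed numbers about three quarters of the budget is left. A negative result means the budget is overspent, which is typically the trigger for whoever owns the SLO dashboard to pause risky releases.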
09
Cost & Resource Planning
Budget, tagging, capacity estimates, and autoscaling
What is the monthly infrastructure budget? IMPORTANT
Hard limit or soft target? Who approves overages? Who gets cost alerts?
What cost center / cloud account does this bill to?
AWS account ID, GCP project, Azure subscription? Mandatory cost tags required?
What resource requests/limits should pods start with?
CPU request/limit and memory request/limit per container. Right-size from load test results.
Is autoscaling required?
HPA (CPU/memory), KEDA (event-driven), VPA, Cluster Autoscaler? Min/max replicas?
Are spot / preemptible instances acceptable?
Acceptable for non-critical workloads. Not appropriate for stateful or latency-sensitive services.
Has a cost estimate been run for steady-state and peak?
Run the estimate through Infracost or the cloud provider's pricing calculator before provisioning begins.
10
Handoff, Documentation & Runbooks
Ensuring the system can be operated and handed off without the original author
Is there an architecture diagram? IMPORTANT
Draw.io, Lucidchart, C4 model. Must reflect actual deployed state, not aspirational design.
Is there a runbook for common operational tasks? IMPORTANT
Restart service, scale up/down, rotate secrets, trigger manual backup, check health.
Where is all documentation stored? IMPORTANT
Confluence, GitLab Wiki, GitHub Wiki? Link must be in the repo README.
Is there a defined on-call handoff process?
Who is on-call at launch? How are escalations handled after hours?
Has a go-live checklist been completed and signed off? CRITICAL
Infrastructure, security, monitoring, backup, and access all verified before launch.
Is there an incident response process defined?
Post-mortem template, severity levels, communication plan, RCA timeline.

Sign-Off & Approval
This checklist must be signed before infrastructure provisioning begins
Note: All six roles must sign. This document is a living artifact — re-sign when scope changes or when promoting between environments (dev → staging → prod).