GraphQL Infrastructure Plan

This document sets out the approach we will take to begin building and deploying the new GraphQL instances as part of the CMS project.

N.B. We have assumed that this will only be deployed to the UAT and Production environments.

Short to Medium Term

In the short to medium term, we will extend the current GraphQL clusters in GCP so that we can start building and deploying the new GraphQL instances there.

Infrastructure

Much of our current GCP infrastructure has sufficient capacity to host a new service, as the VPCs and subnets used by the clusters are heavily overprovisioned for what they currently run.

The VPCs (and their subnets) each have /22 CIDR ranges, meaning each can accommodate up to 1,024 unique IP addresses.
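The address arithmetic above can be checked with a short sketch using Python's standard `ipaddress` module. The `10.0.0.0/22` range is a hypothetical placeholder, since the real ranges are not listed in this plan, and the figure of four reserved addresses per primary subnet range is an assumption based on GCP's documented behaviour that should be confirmed for our subnets.

```python
import ipaddress

# Hypothetical example range; the real subnet ranges are not given in this plan.
subnet = ipaddress.ip_network("10.0.0.0/22")

# A /22 leaves 32 - 22 = 10 host bits, i.e. 2**10 addresses in total.
total_addresses = subnet.num_addresses

# Assumption: GCP reserves 4 addresses in each primary IPv4 subnet range.
GCP_RESERVED_PER_SUBNET = 4
usable_addresses = total_addresses - GCP_RESERVED_PER_SUBNET

print(f"total={total_addresses} usable={usable_addresses}")
```

Under those assumptions the usable figure is slightly below the headline 1,024, which does not change the conclusion that the range is overprovisioned.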

As with the VPCs and subnets, the node pool of the GraphQL Kubernetes cluster is similarly overprovisioned.

In addition, the firewall, routing and NAT capabilities of the current infrastructure are sufficient to accommodate the new implementation.

Kubernetes

Although the underlying architecture can accommodate the new GraphQL implementation, we will still need to create new Kubernetes configuration to differentiate between the incumbent and the new GraphQL instances.

As such, the following will need to be created to allow the new GraphQL to be deployed to this cluster:

New Services
- 1 for UAT
- 1 for Production
New Deployments
- 1 for UAT
- 1 for Production
- We should use this as an opportunity to increase the resource requests in these Deployments to prevent unwanted OOM kills.
New Ingress
- 1 for UAT
- 1 for Production
- A new Ingress is preferable to modifying the incumbent, as the two are then decoupled.
- New origin certs will be required for this.
New Pod Disruption Budget (PDB)
- 1 for UAT
- 1 for Production
New Horizontal Pod Autoscaler (HPA)
- 1 for UAT
- 1 for Production
- We should use this as an opportunity to review which metrics we scale on, as past experience suggests CPU is not an appropriate one.
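To illustrate why the choice of HPA metric matters, here is a minimal sketch of the core scaling rule described in the Kubernetes HPA documentation: desired replicas = ceil(current replicas × current metric value / target metric value). The pod count, utilisation figures, and the use of memory as the alternative metric are all hypothetical; this plan does not yet say which metric we will scale on. The real controller also applies a tolerance band and min/max replica bounds, which are omitted here.

```python
import math


def desired_replicas(current_replicas: int, current_value: float, target_value: float) -> int:
    """Core HPA rule: ceil(currentReplicas * currentMetricValue / desiredMetricValue)."""
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    return math.ceil(current_replicas * current_value / target_value)


# Hypothetical snapshot: 4 pods, CPU at 30% against a 60% target,
# memory at 90% against a 75% target.
cpu_based = desired_replicas(4, 30, 60)
memory_based = desired_replicas(4, 90, 75)

print(f"cpu_based={cpu_based} memory_based={memory_based}")
```

In this made-up snapshot a CPU-only autoscaler would scale down while memory pressure is rising, which is the kind of mismatch that can lead to OOM kills.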

We should name and label these objects appropriately, e.g. cms-graphql-<env>, so that they do not clash with the current GraphQL implementation.
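As a rough sketch of the object list and naming convention above, the following builds metadata-only skeletons for each object kind per environment. The environment identifiers (`uat`, `production`) and the `app.kubernetes.io/*` label keys are assumptions rather than agreed values, and the `apiVersion` strings assume a cluster recent enough to serve `policy/v1` and `autoscaling/v2`. The specs themselves are deliberately left out.

```python
# apiVersion for each object kind listed in this plan (assumes a recent cluster version).
KINDS = {
    "Service": "v1",
    "Deployment": "apps/v1",
    "Ingress": "networking.k8s.io/v1",
    "PodDisruptionBudget": "policy/v1",
    "HorizontalPodAutoscaler": "autoscaling/v2",
}

# Assumed environment identifiers; adjust to whatever we actually use.
ENVIRONMENTS = ("uat", "production")


def object_name(env: str) -> str:
    """Name following the cms-graphql-<env> convention."""
    return f"cms-graphql-{env}"


def manifest_skeletons(env: str) -> list:
    """Return metadata-only skeletons, one per object kind, for one environment."""
    name = object_name(env)
    skeletons = []
    for kind, api_version in KINDS.items():
        skeletons.append({
            "apiVersion": api_version,
            "kind": kind,
            "metadata": {
                "name": name,
                # Hypothetical label scheme to keep the new objects distinct from the incumbent.
                "labels": {
                    "app.kubernetes.io/name": "cms-graphql",
                    "app.kubernetes.io/instance": name,
                },
            },
        })
    return skeletons


for env in ENVIRONMENTS:
    for manifest in manifest_skeletons(env):
        print(manifest["kind"], manifest["metadata"]["name"])
```

Keeping the name and labels in one place makes it harder for a selector on the new objects to accidentally match the incumbent GraphQL pods.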

We should also look to get the metrics and logs from GKE into Grafana, our central monitoring platform. GCP can be added as a data source in Grafana. If that is not sufficient, we can then look at Prometheus, Promtail and Loki; however, that is a heavier setup and requires ongoing maintenance, which we would rather avoid for the current clusters.

Cloudflare

We will need to create appropriate DNS records for the new GraphQL services.

The DNS records for the current implementation take the form of graphql-<env>.raphadev.cc for staging environments and graphql.rapha.cc for Production.

Following on from the discussions around API naming conventions, it would be wise to stick to a similar format, along the lines of www.graphql.rapha.cc and uat.graphql.rapha.cc.
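The proposed format can be sketched as a small helper. The function name and the `production` and `uat` environment identifiers are hypothetical; only the two resulting hostnames come from the proposal above, and the final names remain subject to the naming discussion.

```python
# Base domain taken from the proposed hostnames in this plan.
BASE_DOMAIN = "graphql.rapha.cc"


def proposed_hostname(env: str) -> str:
    """Map an environment to its proposed hostname: Production uses 'www', others use the env name."""
    prefix = "www" if env == "production" else env
    return f"{prefix}.{BASE_DOMAIN}"


for env in ("production", "uat"):
    print(env, proposed_hostname(env))
```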

CI/CD

We will take the same approach for CI as we do for the current GraphQL.

Long Term

In the long term, we will look to move this service out of GCP and into a new Kubernetes cluster hosted on EKS, AWS's managed Kubernetes offering.

Before we can do that, though, a great deal of planning, Terraform work and development is required so that the clusters are maintainable going forward and can be used to deploy more than just GraphQL.
