Setting up Google FHIR to support research

Kyle Ellrott
15 min read · Mar 31, 2021

Brian Walsh, Eric Torstenson, Allison Heath, Robert Carroll and Kyle Ellrott

Fast Healthcare Interoperability Resources (FHIR, pronounced “fire”) is a data modeling language, a core model defined using that language, and an API specification with a primary focus on clinical data interoperability. The original draft of the FHIR specification was published by HL7, a not-for-profit, ANSI-accredited standards developing organization, in 2011, and the specification has continued to grow as more organizations adopt the standard.

The National Institutes of Health (NIH) has seen the value of FHIR not only for healthcare records, but also as a formalized schema to enable sharing of clinical and phenotypic data for research. Statements from the Office of Data Science Strategy recognize that FHIR “can accelerate the use of clinical data for research” and thus “NIH encourages funded researchers to explore the use of the FHIR standard to capture and integrate patient- and population-level data from clinical information systems for research purposes and to use it as common structure for sharing research data”. Because of this, many researchers are now trying to figure out how to incorporate FHIR into their data strategies. Luckily, there are a number of FHIR products that can be deployed to start building a compliant data repository. Possible FHIR solutions include Firely, IBM, hapi.fhir.org, Aidbox and smilecdr. Aside from software solutions, there are also ‘FHIR as a service’ offerings from Azure, AWS and Google via their Cloud Healthcare API.

Google FHIR, originally launched in April 2019, is a managed FHIR service backed by Google’s Cloud Healthcare API, and it may prove to be an easy, low-cost solution for managing FHIR records for NIH-funded researchers. Because it is built on Google’s cloud technologies, the Google FHIR server can be integrated with other Google Cloud services including BigQuery, Pub/Sub, Datasets and IAM. Google FHIR is simple to set up, so anyone can quickly dive into the world of FHIR. That isn’t to say that Google FHIR is feature complete. During testing, we found issues with the validation model, advanced search parameters and small quirks with id generation for newly posted records. Of course, Google FHIR is still marked as being in beta, so these issues may quickly disappear.

Despite these issues, getting started with Google FHIR is quite easy. It’s possible to create a FHIR service, then load and query a full FHIR instance, within 20 minutes. In this article we give an overview of how to work through these steps, including a data model utilized by the NIH Cloud-based Platform Interoperability Effort (NCPI) for research, along with test records to populate your system. With all of this in place you’ll be able to start testing against your own personalized FHIR service.

FHIR Data Model

The FHIR standard has seen 4 major releases and is currently defined by a specification called R4, the first normative release which includes backwards compatibility guarantees. The core FHIR data model is described in the ImplementationGuide’s Profiles. An Implementation Guide (IG) is a set of configuration files that define various aspects of the deployed FHIR data model on a server. An IG describes the content and data model of a FHIR resource using the FHIR modeling language.

For this guide, we’ll be using an early draft IG developed by the NIH Cloud-based Platform Interoperability (NCPI) FHIR Working Group. To get started, you’ll need to use git to clone the repository found at https://github.com/ncpi-fhir/ncpi-model-forge. In this repository you’ll find most of the definition files under site_root/input/resources. These files add extensions, profiles, terminology definitions and additional validation to the base R4 model.

Resources

Primitive data types, such as strings, integers and dates, form the atoms of FHIR data. These elements are combined into structured records that the FHIR model defines as Resources. Examples of Resources include Organizations, Patients and Observations. The collection of defined resource types can be found in the FHIR resource list. For medical records, the R4 model defines the Patient resource, which represents one person or individual. A number of core attributes are attached to a Patient, such as name, gender, address and birthDate. Most medical attributes are attached to a patient using resources from the Clinical Module, such as the Observation or Condition resource types. We use the FHIR Task resource to denote the pipeline that produced the omics files from the specimen.

Extensions

It is possible to extend FHIR beyond the elements defined in the R4 base model. To allow the storage of new data elements, FHIR provides Extensions, which can be seen as providing the ability to build ‘subclasses’ for the existing data classes. For example, in the case of research, oftentimes records will be defined by age of event, rather than date. In the NCPI model, the age of event extension can be found under site_root/input/resources/extensions/StructureDefinition-age-at-event.json.

Profiles

Profiles represent a set of constraints that can be put on a resource. Value constraints can take various forms, such as ‘age must be above 0’ or forcing a value field to have a term derived from a specified ontology. For ontology based constraints, the profile may point to an existing terminology. In the case of the NCPI Project Forge data model, disease values are coded using the Human Disease Ontology. This ontology, which was originally defined using Web Ontology Language (OWL) has been translated into a `CodeSystem` resource in the file site_root/input/resources/terminology/CodeSystem-doid.json, that declares each of the valid codes and their human readable display values for use as CodeableConcepts.

Data Types

Your project will contain instances of these resource types, tagged with your identifiers and described by a defined vocabulary of CodeableConcepts. The NCPI ImplementationGuide relies on a small set of key FHIR resources.

DocumentReference resources point to data objects specified by the GA4GH Data Repository Service, which provides a location-agnostic URL to digital assets.

Google FHIR Infrastructure Model

Google FHIR is a new service on the Google Cloud platform provided under their Healthcare API. Once created, it is attached to a cloud project and information related to it can be seen on the Google Cloud Console. A cloud project may have one or more FHIR datastores. Because the FHIR service is attached to a Google Cloud project, authentication and authorization are handled by Google Cloud Identity and Access Management (IAM). Additionally, billing of all FHIR operations will be passed to the billing account attached to that project.

Google Cloud bills its FHIR service by data size and number of requests. For comparison, the entire NHGRI cohort of metadata held in their AnVIL system, which contains almost 271,000 samples and 192,000 subjects across 215 workspaces, weighs in at less than 1 GB. Projects from the AnVIL system include the Centers for Common Disease Genomics (CCDG), whose metadata expressed as FHIR JSON is 549MB, the Centers for Mendelian Genomics (CMG), with a data set of 23MB, and finally the 1000 Genomes project, whose dataset is only 14MB. Given that the structured storage cost of Google FHIR is $0.39 per GB, with the first GB free, the cost of holding copies of these datasets should be minimal. Beyond storage costs, most of the fees would come from operations costs, which scale with usage of the system. For example, 100,000 standard requests (such as ‘get’ or ‘insert’) per month would cost $0.29, and 100,000 complex requests (such as search) would cost $0.52 per month.
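
To make the arithmetic above concrete, here is a rough sketch of a cost estimator. It assumes the quoted prices scale linearly and that the storage price is charged per month; the function name `estimate_monthly_cost` is ours, and current Google pricing should be checked before relying on any number it prints.

```shell
# Rough monthly cost estimate using the per-unit prices quoted above.
# Assumption: prices scale linearly and storage is billed per month.
estimate_monthly_cost() {
  # $1 = stored GB, $2 = standard requests, $3 = complex requests
  awk -v gb="$1" -v std="$2" -v cx="$3" 'BEGIN {
    billable_gb = (gb > 1) ? gb - 1 : 0       # first GB is free
    storage = billable_gb * 0.39              # $0.39 per GB
    ops = (std / 100000) * 0.29 + (cx / 100000) * 0.52
    printf "%.2f\n", storage + ops
  }'
}

# CCDG-sized metadata (about 0.55 GB) with 100,000 requests of each kind
estimate_monthly_cost 0.55 100000 100000
```

Under these assumptions the example prints 0.81: storage stays inside the free first GB, so the whole estimate comes from the request charges.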

Getting started with Google FHIR

Before getting started on these examples, there are a few things that will need to be ready. First, you will need a Google Cloud project that you own and that is connected to a valid billing account. You will also need the Google Cloud command line tools installed. We also assume that you are using a Unix-based operating system, such as macOS, Linux or Windows Subsystem for Linux (WSL).

As part of starting up and initializing the system, we will need to create a cloud project, a dataset and initialize the FHIR datastore. Part of this process will be uploading an ImplementationGuide. The first step will be to set up the environment variables that we will use in our scripting. We’ve provided values to use, but you can replace them with custom values. First, set the compute region to host your assets:

export GOOGLE_LOCATION=us-west2

Provide the human readable name for your project

export GOOGLE_PROJECT_NAME=fhir-test

You will also need your billing account, in XXXXXX-XXXXXX-XXXXXX format

export GOOGLE_BILLING_ACCOUNT=…

We will also define the human readable name for your FHIR endpoint. For these examples, we will be using ‘test1’:

export GOOGLE_DATASTORE=test1

Datasets are top-level containers in your project that are used to organize and control access to your datastores. Data stores are specific to a service, e.g. FHIR or DICOM. Next, we will provide a name, unique to your project, for the dataset, e.g. `my-fhir-test`.

export GOOGLE_DATASET=…

We will also need the path to your local copy of the implementation guide. This will be the absolute path to the directory that contains ‘site_root’.

git clone https://github.com/ncpi-fhir/ncpi-model-forge
export IMPLEMENTATION_GUIDE_PATH=$(pwd)/ncpi-model-forge

If you have FHIR-compliant Resources stored as newline-delimited JSON, specify the directory here. This is optional; otherwise we will use the examples provided in the ncpi-model-forge project. This should point to the parent directory of newline-delimited _Resource_.json files; for example, `Patient.json` would contain Patient resource data.

export PROJECT_DATA_PATH=…
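
Since every later command depends on these variables, it can help to confirm they are set before going further. The helper below is a small sketch of our own (the name `missing_fhir_vars` is not part of any Google tooling); it only checks for unset or empty values, so a placeholder left as ‘…’ would still pass. PROJECT_DATA_PATH is skipped because it is optional.

```shell
# Print the name of each tutorial variable that is unset or empty.
# The body runs in a subshell so the loop variables do not leak out.
missing_fhir_vars() (
  for name in GOOGLE_LOCATION GOOGLE_PROJECT_NAME GOOGLE_BILLING_ACCOUNT \
              GOOGLE_DATASTORE GOOGLE_DATASET IMPLEMENTATION_GUIDE_PATH; do
    eval "value=\${$name:-}"   # indirect lookup that works in POSIX sh
    [ -n "$value" ] || echo "$name"
  done
)

# Prints nothing when everything is set
missing_fhir_vars
```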

Create project

Before getting started, you may want to create a new project for your FHIR work.

gcloud projects create --name=$GOOGLE_PROJECT_NAME --quiet

For the rest of the commands, we’ll need to work with the unique ID Google assigned to the project. You can quickly see that ID using the command:

gcloud projects list --filter=name=$GOOGLE_PROJECT_NAME

To make the rest of the commands easy, we will capture that ID and assign it to an environmental variable.

export GOOGLE_PROJECT=$(gcloud projects list --filter=name=$GOOGLE_PROJECT_NAME --format="value(projectId)")

We’ll also use the gcloud command line to set the config to point at this project by default.

gcloud config set project $GOOGLE_PROJECT

If you haven’t already done it, you’ll need to attach a billing account to the project. This can be done on the website, which you will have to visit to set up your billing information. But if you already have the billing account information, you can attach it to the project from the command line:

gcloud beta billing projects link $GOOGLE_PROJECT --billing-account=$GOOGLE_BILLING_ACCOUNT

At this point, if you visit your project on the web console you should be able to see that the project exists and that it is attached to a billing account.

Enable the healthcare api and service account

Once that is done, on the main Google Cloud platform, at https://console.cloud.google.com/ go to the ‘APIs & Services’ tab, and on the main page should be a button to ‘Enable APIs and Services’. Clicking this button will bring up a list of services, including the Maps SDK and the YouTube Data API. Find the selection for the ‘Cloud Healthcare API’ and select it. Once on the page for the ‘Cloud Healthcare API’ click ‘Enable’ to add the API to the current project. Once the API is enabled, you should see the control panel. You can always find the Healthcare API management page at https://console.cloud.google.com/apis/api/healthcare.googleapis.com/overview

To get started, you must also create credentials with read/write permissions to the Healthcare API. This can be managed under the ‘IAM & Admin’ panel on the project. It is recommended that you create a service account for working with the API. In the edit permissions panel, click ‘Add Another Role’, find the ‘Cloud Healthcare’ set and select ‘Healthcare FHIR Resource Editor’.

If you want to enable the API using the command line, you can use the command:

gcloud services enable healthcare.googleapis.com

Next, you’ll need to get a service account. We’ll be storing the account name in the environmental variable GOOGLE_SERVICE_ACCOUNT.

export GOOGLE_SERVICE_ACCOUNT=$(gcloud projects get-iam-policy $GOOGLE_PROJECT --format="value(bindings.members)" --flatten="bindings[]" | grep serviceAccount)

With the service account in hand, we’ll need to assign bucket reader permissions so that it can be used to make queries.

gcloud projects add-iam-policy-binding $GOOGLE_PROJECT --member=$GOOGLE_SERVICE_ACCOUNT --role=roles/storage.objectViewer

Reviewing the console website, you should be able to verify the Healthcare API is enabled and that the service account has Storage Object Viewer permissions.

Create dataset, data store and load files

With the Healthcare API enabled and the account authorized to use it, we can now begin to set up the dataset attached to the service. In many ways these operations are similar to BigQuery. First create the dataset.

gcloud healthcare datasets create $GOOGLE_DATASET --location=$GOOGLE_LOCATION

Once the dataset is created, create a fhir-store, based on R4. The ‘enable-update-create’ parameter is useful for control over resource ids as we will see below.

gcloud beta healthcare fhir-stores create $GOOGLE_DATASTORE --dataset=$GOOGLE_DATASET --location=$GOOGLE_LOCATION --version R4 --enable-update-create

With the fhir-store created, we’ll start by bulk populating it with data. This will involve uploading data files into a Google bucket so it can be loaded from there. First create a bucket with the same name as the project.

gsutil mb -p $GOOGLE_PROJECT -c STANDARD -l $GOOGLE_LOCATION gs://$GOOGLE_PROJECT

Once the bucket has been created, copy the data from the implementation guide into the bucket.

gsutil cp -r $IMPLEMENTATION_GUIDE_PATH/site_root/input/*.json gs://$GOOGLE_PROJECT
gsutil cp -r $IMPLEMENTATION_GUIDE_PATH/site_root/input/resources gs://$GOOGLE_PROJECT

Now copy the data files into the bucket.

gsutil -m cp -r $PROJECT_DATA_PATH gs://$GOOGLE_PROJECT

If you go to your cloud console bucket viewer, you should be able to see the listing of directories, including `examples`, `extensions`, `profiles`, `search` and `terminology`.

Import ImplementationGuide

Here we load specific directories from the NCPI IG that we cloned and then copied into our Google storage bucket above. These commands, sent to the Cloud Healthcare API through gcloud, copy data directly from Google storage. First, load the project data:

gcloud healthcare fhir-stores import gcs $GOOGLE_DATASTORE --location=$GOOGLE_LOCATION --dataset=$GOOGLE_DATASET --gcs-uri="gs://$GOOGLE_PROJECT/*.json" --content-structure=resource-pretty

Load the extensions

gcloud healthcare fhir-stores import gcs $GOOGLE_DATASTORE --location=$GOOGLE_LOCATION --dataset=$GOOGLE_DATASET --gcs-uri="gs://$GOOGLE_PROJECT/resources/extensions/*.json" --content-structure=resource-pretty

Load the profiles

gcloud healthcare fhir-stores import gcs $GOOGLE_DATASTORE --location=$GOOGLE_LOCATION --dataset=$GOOGLE_DATASET --gcs-uri="gs://$GOOGLE_PROJECT/resources/profiles/*.json" --content-structure=resource-pretty

Load the search definitions

gcloud healthcare fhir-stores import gcs $GOOGLE_DATASTORE --location=$GOOGLE_LOCATION --dataset=$GOOGLE_DATASET --gcs-uri="gs://$GOOGLE_PROJECT/resources/search/*.json" --content-structure=resource-pretty

Load the terminologies.

gcloud healthcare fhir-stores import gcs $GOOGLE_DATASTORE --location=$GOOGLE_LOCATION --dataset=$GOOGLE_DATASET --gcs-uri="gs://$GOOGLE_PROJECT/resources/terminology/*.json" --content-structure=resource-pretty
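
The four directory imports differ only in the directory name, so they could also be written as a loop. This is just a sketch of that idea, assuming the bucket layout created earlier (gs://$GOOGLE_PROJECT/resources/<directory>/*.json); the function name `import_ig_resources` is ours.

```shell
# Import each IG resource directory in turn, stopping at the first failure.
# Assumes the variables exported earlier in this article are set.
import_ig_resources() (
  for dir in extensions profiles search terminology; do
    gcloud healthcare fhir-stores import gcs "$GOOGLE_DATASTORE" \
      --location="$GOOGLE_LOCATION" \
      --dataset="$GOOGLE_DATASET" \
      --gcs-uri="gs://$GOOGLE_PROJECT/resources/$dir/*.json" \
      --content-structure=resource-pretty || exit 1
  done
)
```

Running `import_ig_resources` issues the same four commands shown above, one after another. Quoting the gcs-uri keeps the shell from trying to expand the `*` itself.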

Now we can enable the implementation guide.

curl -X PATCH -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" -H "Content-Type: application/fhir+json;charset=utf-8" \
--data '{
"validationConfig": {
"enabledImplementationGuides": ["http://fhir.ncpi-project-forge.io/ImplementationGuide/NCPI-Project-Forge"]
}
}'\
"https://healthcare.googleapis.com/v1beta1/projects/$GOOGLE_PROJECT/locations/$GOOGLE_LOCATION/datasets/$GOOGLE_DATASET/fhirStores/$GOOGLE_DATASTORE?updateMask=validationConfig"

With all of the loading done, we can now verify that the dataset has been loaded correctly.

curl -X GET -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" "https://healthcare.googleapis.com/v1beta1/projects/$GOOGLE_PROJECT/locations/$GOOGLE_LOCATION/datasets/$GOOGLE_DATASET/fhirStores/$GOOGLE_DATASTORE"

In addition to querying the datastore via curl, you can view the implementation guide and its StructureDefinitions in the Google console.

Import examples and query data

gcloud healthcare fhir-stores import gcs $GOOGLE_DATASTORE --location=$GOOGLE_LOCATION --dataset=$GOOGLE_DATASET --gcs-uri="gs://$GOOGLE_PROJECT/resources/examples/*.json" --content-structure=resource-pretty

Querying FHIR

To test the system we can start running some queries. The FHIR standard defines URLs of the form VERB [base]/[type]/[id]. The Google implementation has a convention for the [base] that includes the project id, location and dataset. For more details see the docs. You may wish to provide a reverse proxy or API gateway with your own domain. For all of our queries the base URL of the FHIR server will be:

https://healthcare.googleapis.com/v1beta1/projects/$GOOGLE_PROJECT/locations/$GOOGLE_LOCATION/datasets/$GOOGLE_DATASET/fhirStores/$GOOGLE_DATASTORE/fhir

For the moment we’ll use the curl command line to issue commands, but at this point it is also possible to start playing around with a generic FHIR client like the Google console fhirviewer.

To make invoking these command lines easier, we’ll store the token in an environmental variable and create an alias for curl:

export TOKEN=$(gcloud auth application-default print-access-token)
alias fhir_curl='curl -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json; charset=utf-8"'
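
One caveat: bash does not expand aliases in non-interactive scripts by default, so the alias is mainly useful at an interactive prompt. If you want the same shortcut inside a script, a shell function is one option. The sketch below uses a different name, `fhir_request` (our own choice), so that it can coexist with the alias; it reads $TOKEN each time it is called.

```shell
# Function equivalent of the fhir_curl alias, usable inside scripts.
# Any extra arguments (flags, URL) are passed straight through to curl.
fhir_request() {
  curl -H "Authorization: Bearer $TOKEN" \
       -H "Content-Type: application/json; charset=utf-8" \
       "$@"
}
```

It is called the same way as the alias, for example `fhir_request -s "$FHIR_URL/Patient"`.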

We’ll also capture the FHIR base URL

export FHIR_URL="https://healthcare.googleapis.com/v1beta1/projects/$GOOGLE_PROJECT/locations/$GOOGLE_LOCATION/datasets/$GOOGLE_DATASET/fhirStores/$GOOGLE_DATASTORE/fhir"
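
To illustrate the [base]/[type]/[id] convention, here is a tiny helper that assembles resource URLs from $FHIR_URL. It is only a convenience sketch; the name `fhir_resource_url` is ours and nothing later in this article depends on it.

```shell
# Build a URL of the form [base]/[type]/[id]; the id is optional.
fhir_resource_url() {
  # $1 = resource type (e.g. Patient), $2 = optional resource id
  if [ -n "${2:-}" ]; then
    printf '%s/%s/%s\n' "$FHIR_URL" "$1" "$2"
  else
    printf '%s/%s\n' "$FHIR_URL" "$1"
  fi
}
```

For example, `fhir_resource_url Patient pt-001` prints the base URL followed by /Patient/pt-001, while `fhir_resource_url Patient` gives the type-level URL used for searches.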

First we will inspect the schema to verify that the elements of the IG that we added to the system are properly represented. A curl request to StructureDefinition returns a list of all the extensions and profiles. We’ll use the jq program to parse the JSON document and list the resource IDs.

fhir_curl -s -X GET $FHIR_URL/StructureDefinition | jq '.entry[] | .resource.id'
>>>
"cpi-phenotype"
"ncpi-family-relationship"
"ncpi-drs-document-reference"
"ncpi-drs-attachment"
"ncpi-disease"
"us-core-race"
"us-core-ethnicity"
"age-at-event"

With the data loaded, and the schema working, we can start making queries against the FHIR server. First, we will view the ids of ResearchSubject in the example project.

fhir_curl -s "$FHIR_URL/ResearchSubject?study=ResearchStudy/example-research-study-id" | jq '.entry[] | .resource.id'
>>>
"example-research-subject-id"

Then we can leverage outbound edges from the Study by using the _include search parameter and query for the ResearchSubjects while specifying a projection to include the Patient:

fhir_curl -s "$FHIR_URL/ResearchSubject?study=ResearchStudy/example-research-study-id&_include=ResearchSubject:individual" | jq '.entry[] | .resource.id'
>>>
"example-research-subject-id"
"pt-001"

Now that we know the Patient, we can query for Observations:

fhir_curl -s "$FHIR_URL/Observation?subject:Patient=pt-001" | jq -rc '.entry[] | .resource | [.code.coding[0].code, .valueCodeableConcept.coding[0].display]'
>>>
["HP:0000076","Clinical finding present (situation)"]
["FAMMEMB","child"]
["FAMMEMB","child"]

Specimens associated with Patient:

fhir_curl -s "$FHIR_URL/Specimen?subject:Patient=pt-001" | jq '.entry[] | .resource.id'
>>>
"example-specimen-id"

We can also leverage inbound edges by using the _revinclude search parameter and query for all Patients with Observations

fhir_curl -s "$FHIR_URL/Patient?_revinclude=Observation:subject:Patient" | jq '.entry[] | .resource.id'
>>>
"pt-004"
"pt-003"
"pt-002"
"pt-001"
"ph-001"
"fm-004"
"fm-003"
"fm-002"
"fm-001"

A complete illustration of FHIR based search and aggregation is beyond the scope of this article. We will have more to say on this subject in future posts.

We can also engage the write components of the API. For this example we will add a resource with a given ID. At this point we will see a small deviation in the way Google implements the FHIR API. The entity returned by the POST should use the id provided by the user in the payload. However, the Google implementation uses an internally generated UUID. This means that to read the record back we will need to record the result of the submission.

Create a new DocumentReference with {"id":"example-document-reference-id3"}. Note that the id returned is a generated UUID rather than the one we supplied:

curl -s -L https://git.io/JqUsb | fhir_curl -s -X POST -d@- "$FHIR_URL/DocumentReference/example-document-reference-id3" | jq .id
>>>
"273e2aa0-bbf9-46b8-bafb-701468375d6b"

If we attempt to access the document with the id we provided, we get a 404:

fhir_curl -s "$FHIR_URL/DocumentReference/example-document-reference-id3" | jq '.issue[] | .diagnostics'
>>> 404
"resource not found: DocumentReference/example-document-reference-id3"

However, as a workaround, since we specified enableUpdateCreate, we can issue a PUT and then retrieve the record using our own identifier:

curl -s -L https://git.io/JqUsb | fhir_curl -s -X PUT -d@- "$FHIR_URL/DocumentReference/example-document-reference-id3" | jq .id
>>>
"example-document-reference-id3"

Now accessing the document with its given id will work.

fhir_curl -s "$FHIR_URL/DocumentReference/example-document-reference-id3" | jq .id
>>>
"example-document-reference-id3"

Bulk Operations

In addition to the REST-based, resource-oriented endpoints, FHIR servers support a bulk data endpoint to enable upload and download without issuing thousands of requests. It is also possible to bulk load data from JSON files stored in Google Cloud Storage.

fhir_curl -X POST --data '{
"contentStructure": "RESOURCE_PRETTY",
"gcsSource": {
"uri": "gs://'"$GOOGLE_PROJECT"'/resources/examples/*.json"
}}' "https://healthcare.googleapis.com/v1beta1/projects/$GOOGLE_PROJECT/locations/$GOOGLE_LOCATION/datasets/$GOOGLE_DATASET/fhirStores/${GOOGLE_DATASTORE}:import"

The Google implementation also supports the bulk data access endpoint:

fhir_curl -s "$FHIR_URL/ResearchStudy/\$export" | jq ".total"
>>>
1

Missing FHIR Conformance

As mentioned previously, Google FHIR data validation is still in alpha and there are a number of operations that are still not fully conformant with the specification. The FHIR conformance statement lists elements that are still not present in their implementation. One example is the concept of Profiles, and how they can be utilized for document validation. While Profiles can be loaded into the Google FHIR server, the conformance statement says that ‘Profiles aren’t validated or enforced by the server.’ We can test this using rules that were introduced as part of the NCPI IG. That implementation guide introduces the ability to link records to GA4GH DRS URLs. The original definition was:

{
"key": "must-be-drs-uri",
"severity": "error",
"human": "attachment.url must start with ^drs://. A drs:// hostname-based URI, as defined in the DRS documentation, that tells clients how to access this object. The intent of this field is to make DRS objects self-contained, and therefore easier for clients to store and pass around. For example, if you arrive at this DRS JSON by resolving a compact identifier-based DRS URI, the `self_uri` presents you with a hostname and properly encoded DRS ID for use in subsequent `access` endpoint calls.",
"expression": "$this.url.matches('^drs://.*')",
"source": "http://fhir.ncpi-project-forge.io/StructureDefinition/ncpi-drs-attachment"
}

This definition states that DRS URLs must begin with the prefix ‘drs://’. But if we attempt to PUT a record with an invalid DRS URL, the record isn’t flagged as invalid:

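curl -s -L https://git.io/JqUcl | fhir_curl -s -o /dev/null -w '%{http_code}\n' -X PUT -d@- "$FHIR_URL/DocumentReference/example-document-reference-fail"

If the record had been validated against the profile, this would have printed a 4xx error status (the FHIR specification suggests 422 for a resource that violates a profile) rather than a success code.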
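
Because the server does not enforce the constraint, one option is to check URLs on the client before submitting records. The function below is a minimal sketch of that idea, not part of the NCPI tooling: it only tests for the drs:// prefix, which mirrors the ^drs://.* pattern in the rule above, and the name `is_drs_uri` is our own.

```shell
# Client-side approximation of the must-be-drs-uri rule.
# Returns success (0) only when the argument starts with drs://
is_drs_uri() {
  case "$1" in
    drs://*) return 0 ;;
    *)       return 1 ;;
  esac
}

# Example: warn about a URL that would violate the profile
is_drs_uri "https://example.org/object-1" || echo "not a DRS URI"
```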

Another missing feature is the _summary parameter. For example, if we wanted to confirm that all Subjects were loaded, it would be useful to get a count of all ResearchSubjects.

fhir_curl -s "$FHIR_URL/ResearchSubject?study=ResearchStudy/example-research-study-id&_summary=count"
>>>
{
"issue": [{
"code": "value",
"details": {
"text": "invalid_query"
},
"diagnostics": "generic::unimplemented: _summary argument is not supported yet.",
"severity": "error"
}],
"resourceType": "OperationOutcome"
}

Delete the project

Once you are done with your testing and are happy with what you’ve been able to do with little introduction, you’ll probably want to clear out your test project, to stop it from causing any additional fees. To delete your project, you can simply issue the following command. Note this action is not recoverable, please ensure you have no valuable work in the project prior to deleting it.

gcloud projects delete $GOOGLE_PROJECT --quiet

Kyle Ellrott

Assistant Professor at Oregon Health and Science University