Data

Grey Matter Data is a platform service for the versioned and encrypted storage of media blobs and assets.

Note that we mark sensitive values with "❗" so that it is clear what must be kept private, versus what is safely made public.

API Basics

Grey Matter Data uses JSON Web Tokens (JWT) for authentication. Each request to Grey Matter Data must include a cookie in the request headers that is based on the authentication JWT. Grey Matter Data tracks information through Event Objects. These Event Objects capture all changes and reflect the Kafka event queue that supports the system. Each Event Object is associated with a file through its Object ID (oid) parameter. The parameters of the Event Object form relationships between files in the system.

JWT Service

A JWT service, such as jwt-server, assumes that a proxy has already authenticated you and inserted the USER_DN header. The JWT service takes a redirect argument and a path argument. The path is the URL path over which the cookie will be sent. The redirect is a URL within that path. The cookie is written with the name userpolicy and with HttpOnly set to true, which prevents client scripts from accessing it.

The JWT includes claims in the following format:

  • Label: the name under which the token is logged.

  • Values: a hashtable from strings to lists of strings, used to evaluate the JWT against an objectPolicy.

Here is an example of the claims of a JWT representing a userPolicy.

{
  "label": "asRob",
  "values": {
    "age": [
      "driving"
    ],
    "citizenship": [
      "USA"
    ],
    "email": [
      "rob.johnson@email.com"
    ],
    "name": [
      "rob johnson"
    ],
    "org": [
      "www.deciphernow.com"
    ]
  }
}
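As a minimal sketch of how a client or test might inspect these claims, the snippet below decodes the payload segment of a JWT. It does not verify the signature, so it is suitable only for debugging. The helper names (`decodeJwtClaims`, `makeDemoToken`) are hypothetical and not part of Grey Matter Data, and the demo token is unsigned.

```javascript
// Decode the claims (payload) segment of a JWT without verifying its signature.
// Illustration only: a real service must verify the signature before trusting claims.
function decodeJwtClaims(token) {
  const segments = token.split(".");
  if (segments.length !== 3) {
    throw new Error("expected a JWT with three dot-separated segments");
  }
  // JWT segments are base64url-encoded JSON.
  const json = Buffer.from(segments[1], "base64url").toString("utf8");
  return JSON.parse(json);
}

// Build an unsigned demo token so the example is self-contained (hypothetical helper).
function makeDemoToken(claims) {
  const encode = (obj) => Buffer.from(JSON.stringify(obj)).toString("base64url");
  return `${encode({ alg: "none", typ: "JWT" })}.${encode(claims)}.`;
}

const demoClaims = {
  label: "asRob",
  values: { email: ["rob.johnson@email.com"], citizenship: ["USA"] },
};
const claims = decodeJwtClaims(makeDemoToken(demoClaims));
console.log(claims.label, claims.values.email);
```

In a browser this would not apply to the userpolicy cookie itself, because HttpOnly cookies are not readable by client scripts.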

Event

Grey Matter Data tracks all changes through JSON Event objects. Events represent a portion (limited by the user's security access) of the Kafka messaging queue that supports Grey Matter Data. Any modification to the system is carried out through the /write endpoint by supplying one or more Events that describe the required actions.

Event parameters define the relationships between files in Grey Matter Data. For example, the parentoid parameter defines folder-to-child relationships, so updating it effectively moves an Object from one folder to another. The derived parameter points to Object IDs related to the current oid. For example, thumbnails derived from an image can point back to that image through the derived Event parameter.

For a conceptual insight into why gm-data is designed the way that it is: Event Sourcing
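To make the event-sourcing idea concrete, here is a conceptual sketch that replays a list of Events to compute the current state of each Object. It is not how gm-data is implemented internally. The `replayEvents` and `childrenOf` helpers are hypothetical, and the oid and tstamp values are small placeholder numbers rather than real nanosecond timestamps.

```javascript
// Replay Events, oldest first, to derive the current state per oid.
// "C" and "U" store the latest version of an object; "D" removes it.
function replayEvents(events) {
  const state = new Map();
  const ordered = [...events].sort((a, b) => a.tstamp - b.tstamp);
  for (const ev of ordered) {
    if (ev.action === "D") {
      state.delete(ev.oid);
    } else if (ev.action === "C" || ev.action === "U") {
      // An update repeats all parameters, so the newest Event replaces the old one.
      state.set(ev.oid, ev);
    }
  }
  return state;
}

// List the children of a folder by following parentoid.
function childrenOf(state, parentoid) {
  return [...state.values()].filter((obj) => obj.parentoid === parentoid);
}

const eventLog = [
  { action: "C", oid: 2, tstamp: 1, parentoid: 1, name: "Folder A", isfile: false },
  { action: "C", oid: 3, tstamp: 2, parentoid: 1, name: "Folder B", isfile: false },
  { action: "C", oid: 42, tstamp: 3, parentoid: 2, name: "resume.pdf", isfile: true },
  // Moving resume.pdf from Folder A to Folder B is just an update of parentoid.
  { action: "U", oid: 42, tstamp: 4, parentoid: 3, name: "resume.pdf", isfile: true },
  { action: "D", oid: 2, tstamp: 5, parentoid: 1 },
];

const state = replayEvents(eventLog);
console.log(childrenOf(state, 3).map((obj) => obj.name));
```

Because the log is never rewritten, earlier states remain recoverable by replaying only the Events up to a chosen tstamp.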

Event Examples

Full Event Structure

[
  {
     "action": "string",         // Actions on a file to put it into its current state.
     "blobalgorithm": "string",  // The blob algorithm; if not specified, we store in S3 with SSECustomerKey.
     "checkedtstamp": "string",  // the tstamp of the previous version of the oid we compared against.
     "custom": "string",         // Put custom fields in here that gm-data does not understand, but need to be tracked by the application.
     "defaultfile": "string",     // For a directory, if we try to stream the directory, ${defaultfile} will be the assumed name.  index.html is the default value.
     "derived": "string",        // Fields to denote that a file is derived from another - ie: doc to pdf. (oid,tstamp) are a primary key for the original. Type allows us to track what kind of derivation it is, such as doc-to-txt.
     "description": "string",    // Description of what is in the file or directory.
     "encrypted": "string",      // Allow for fields to be encrypted on a case-by-case basis.
     "expiration": "string",     // tstamp as of which this record will not come back from queries.  May legally require purges at some point.
     "isfile": "string",          // Set this if it is a file.  You will get a directory if you don't set this.
     "mimetype": "string",       // same as content-type. Set for this object, and also for S3 blobs.
     "name": "string",           // file/dir name without pathing.
     "objectpolicy": "string",   // The rules that allow action (C R U D X P).
     "oid": "string",            // A numeric identifier assigned to a file/dir. Happens to be a nanoseconds timestamp. (oid, tstamp) are the primary key for these events.
     "parentoid": "string",      // oid of the parent directory.
     "purgetstamp": "string",    // In order to just purge a single event, supply a purgetstamp along with the oid.
     "references": "string",     // When uploading files, an array of updates can be supplied; so that we can upload files into directories that don't yet exist. Indices are negative relative values.
     "rname": "string",          // ${AWS_S3_BUCKET}/${AWS_S3_PARTITION}/${rname} is where this object is in S3. Name is assigned even if only Local is used.
     "schema": "string",         // Version of the schema from which this object came.
     "security": "string",       // Security labels are written here, along with their foreground/background pen colors.
     "sha256plain": "string",    // sha256 of the plaintext, can be used in the client to calculate a minimum number of files to send for update.
     "size": "string",           // Same as Content-Length.
     "tstamp": "string",         // Nanosecond timestamp of this object's creation time - unique per event.
     "userpolicy": "string"      // JWT claims used when creating this event.
  }
]

Upload Event Examples

Note: when action: "C" (create / upload) is specified, the system assigns the Object ID internally, so you should not specify the oid parameter. The parameters below are the bare minimum that must be specified for the create action to complete.

const events = [
  {
    action: "C",
    name: "New Folder",
    isfile: false,
    parentoid: 1,
    originalobjectpolicy: "(if (contains email bob.null@email.com) (yield-all))",
    security: {
      label: "DECIPHER//GMDATA",
      foreground: "#FFFFFF",
      background: "#00FF00"
    }
  }
];

Delete Event Example

Note: when action: “D” (delete) is specified, the system will backfill most of the Object parameters, thus it is only necessary to specify oid, action, parentoid, and objectpolicy.

const events = [
  {
    action: "D",
    oid: 42,
    parentoid: 1,
    originalobjectpolicy: "(if (contains email bob.null@email.com) (yield-all))",
  }
];

Rename Event Example

Note: when action: “U” (update) is specified, all parameters – except the few that are being updated and tstamp – must be specified, mimicking the previous Event associated with the Object ID that is being updated.

const events = [
  {
    action: "U",
    oid: 42,
    parentoid: 1,
    name: "New Name",
    originalobjectpolicy: "(if (contains email bob.null@email.com) (yield-all))",
    security: {
      label: "DECIPHER//GMDATA",
      foreground: "#FFFFFF",
      background: "#00FF00"
    }
  }
];

Move Event Example

Note: when action: “U” (update) is specified, all parameters – except the few that are being updated and tstamp – must be specified, mimicking the previous Event associated with the Object ID that is being updated.

const events = [
  {
    action: "U",
    oid: 42,
    parentoid: 2,
    name: "New Name",
    originalobjectpolicy: "(if (contains email bob.null@email.com) (yield-all))",
    security: {
      label: "DECIPHER//GMDATA",
      foreground: "#FFFFFF",
      background: "#00FF00"
    }
  }
];

This section introduced several Event Objects that Grey Matter Data uses to track information. Understanding these objects will help you perform the following actions: uploading (action: "C"), moving, renaming, or altering (action: "U"), and removing (action: "D").
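Because an update must repeat every parameter of the previous Event except tstamp, it can help to derive update Events from the last known Event instead of writing them by hand. The following is a sketch under that assumption. `buildUpdateEvent` is a hypothetical helper, and the tstamp value is a placeholder.

```javascript
// Build an update ("U") Event from the most recent Event of an object.
// Every previous parameter is carried over except tstamp, which is omitted.
function buildUpdateEvent(previous, changes) {
  const { tstamp, ...carried } = previous; // drop tstamp, keep everything else
  return { ...carried, ...changes, action: "U" };
}

const previousEvent = {
  action: "C",
  oid: 42,
  tstamp: 1000, // placeholder value
  parentoid: 1,
  name: "Old Name",
  isfile: false,
  originalobjectpolicy: "(if (contains email bob.null@email.com) (yield-all))",
  security: { label: "DECIPHER//GMDATA", foreground: "#FFFFFF", background: "#00FF00" },
};

// A rename changes only the name; a move changes only parentoid.
const renameEvent = buildUpdateEvent(previousEvent, { name: "New Name" });
const moveEvent = buildUpdateEvent(previousEvent, { parentoid: 2 });
console.log(renameEvent.name, moveEvent.parentoid);
```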

Object ID

Grey Matter Data tracks stored data through unique Object IDs that are assigned on upload of files into the system. Relationships between Object IDs are established through the parentoid parameter of the Event. Creating an update Event with a new parentoid effectively moves an Object to a new folder. Learn more in the /write endpoint section.

When an Event with action: "C" (create) is sent into the system on upload through the /write endpoint, oid does not need to be specified. The system assigns it to this Event internally.

API Overview

This section covers accessing and manipulating data within Grey Matter Data.

We begin with the overarching concepts of information retrieval and information modification. Then we dive deeper into specifics of each API endpoint and code examples.

Information Retrieval

When starting to use the API, you will most likely direct your first request at the root folder to get initial file listings. You can accomplish this with a GET request to the /list endpoint with a path of /1 (GET /list/1/).

The root directory has the well-known Object ID (oid) of 1 by default. This is the root folder for every user. However, because of the permissions granted through the authentication JWT, each user can only see and manipulate a subset of folders.

You can extract data from the system in the following three ways, leveraging numerous read endpoints:

  1. As a JSON Object mimicking internal Events Object through one of the read endpoints

  2. As a raw byte stream through the /stream endpoint, used to download the Object locally, and

  3. As a raw byte stream within an iFrame that displays the security metadata of the Object, through the /show endpoint, used to view the Object within the browser window.

More information about each of these methods can be found in the respective endpoint sections (/read, /stream, /show).

Information Modification

The only way to modify the content of Grey Matter Data is through the /write endpoint. When a request is sent to the /write endpoint, the request body has to carry form data with an appended {'meta': [Event]} object.

More details about data modification can be found in the /write endpoint section.

Prerequisites

When authenticating to the API, there is a prioritized set of options. Our JWT format allows for LDAP-like groups. Tokens are signed by a signer that we trust, and carry a label field holding the username or generic name to be logged in audits. The values field is a map[string][]string, that is, a set of multi-valued attributes similar to LDAP groups. This lets us write policies as boolean combinations of these attributes. In short, you need a userpolicy as a prerequisite for using this API. We look for it in the following order:

  • http parameter setuserpolicy set to a JWT, which we turn into a set-cookie and then redirect you back in with this parameter removed. This may be used in setups without a JWT server or an edge proxy.

  • http parameter userpolicy set to a JWT. This may be used in setups without a JWT server or an edge proxy.

  • cookie userpolicy set to a JWT. This may be used in setups without a JWT server or an edge proxy.

  • http header userpolicy set to a JWT, and is set by the edge server, usually using USER_DN header as input. This is used in conjunction with the JWT filter.

  • configurable header USER_DN, which we trust was securely set by the edge server(!!). This can be used to look up a JWT in the JWT server. This must be used with an edge proxy with inheaders enabled.

  • anonymous.
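The lookup order above can be sketched as a simple chain of checks. This illustrates the documented priority only and is not gm-data's actual code. The request shape (`query`, `cookies`, `headers`) and the lower-cased `user_dn` header key are assumptions made for the example.

```javascript
// Report where the userpolicy would come from, checking sources in the documented order.
// The request shape ({ query, cookies, headers }) is a hypothetical simplification.
function resolveUserPolicySource(req) {
  const query = req.query || {};
  const cookies = req.cookies || {};
  const headers = req.headers || {};
  if (query.setuserpolicy) return { source: "param:setuserpolicy", jwt: query.setuserpolicy };
  if (query.userpolicy) return { source: "param:userpolicy", jwt: query.userpolicy };
  if (cookies.userpolicy) return { source: "cookie:userpolicy", jwt: cookies.userpolicy };
  if (headers.userpolicy) return { source: "header:userpolicy", jwt: headers.userpolicy };
  // USER_DN is not a JWT itself; it would be used to look one up in the JWT server.
  if (headers.user_dn) return { source: "header:USER_DN", userDn: headers.user_dn };
  return { source: "anonymous" };
}

// A cookie wins over a header, and a request with no credentials is anonymous.
const fromCookie = resolveUserPolicySource({
  cookies: { userpolicy: "cookie.jwt.value" },
  headers: { userpolicy: "header.jwt.value" },
});
const fromNothing = resolveUserPolicySource({});
console.log(fromCookie.source, fromNothing.source);
```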

API File System Examples

This section covers multiple examples of http request configurations and explains the results they return.

Common GET Requests

All requests in this section can be accomplished by modifying the JavaScript code presented below.

const axios = require("axios");

// gmDataEndpoint is the base URL of your Grey Matter Data deployment, defined elsewhere.
const requestURL = `${gmDataEndpoint}/list/1`;
axios.get(requestURL, {
  // necessary to pass on server-set HttpOnly authentication cookies
  withCredentials: true
})
.then(resp => console.log(resp.data))
.catch(error => console.log(error));

| Request Method | Endpoint URL | Request Body | Credentials Include | Description |
| --- | --- | --- | --- | --- |
| GET | /list/1 | None | True | Get a listing of the root Object ID (oid) of 1, choosing a path / relative to it. The / symbol at the end of the listing path URL is mandatory. Each folder within the /1 root folder has its own security policy, which limits access to groups of users. Each user navigating to the /1 folder sees a folder landscape tailored to their security credentials. |
| GET | /list/1/Project1Folder | None | True | Returns listings for Project1Folder, a folder that is a child of the root folder. This folder may have security settings that render it invisible to some groups of users. |
| GET | /list/42/ | None | True | If the Project1Folder directory had an Object ID (oid) of 42, this would be an equivalent URL to list it. Note the / symbol at the end of the path. |
| GET | /props/42/ | None | True | Produces the metadata of the Project1Folder directory. Once we have found the Object that we are looking for, we can perform operations on it. |
| GET | /stream/900/ | None | True | Produces a byte stream of the Object with Object ID (oid) of 900. Presume this Object's name property is resume.pdf. |
| GET | /stream/42/resume.pdf | None | True | Streams resume.pdf by its path relative to the folder with Object ID (oid) 42, presuming the file lives in that folder. |
| GET | /props/900/ | None | True | The metadata of the Object named resume.pdf. Returns an Event Object with associated properties. |
| GET | /history/900/ | None | True | A list of Event Objects for every state of resume.pdf, ordered by the time stamp of the Event. |
| GET | /show/900/ | None | True | A convenience wrapper around stream that shows an HTML security banner with the file's security metadata around the byte stream. |

Common POST Requests

The GET requests above can be dispatched separately or in bulk using a POST request to the /read endpoint. This lets you minimize back-and-forth HTTP traffic to improve performance in low-bandwidth situations.

| Request Method | Endpoint URL | Request Body | Credentials Include | Description |
| --- | --- | --- | --- | --- |
| POST | /read | stringified([{URL:"/list/900/"}, {URL:"/list/42/"}]) | True | This endpoint requires a string-encoded array in the body of the request in the following form: [{URL:"/list/900/"}, {URL:"/list/42/"}]. A detailed example can be found in the Read endpoint section. The call yields an array with data identical to the same calls performed individually using GET requests. In this specific example, we list two directories simultaneously. This allows for quick file system exploration with significantly fewer requests. |
| POST | /read | stringified([{URL:"/history/900/?count=10"}, {URL:"/history/42/?count=10"}]) | True | Simultaneously gets the last 10 revisions of 2 separate Object IDs. |
| POST | /read | stringified([{URL:"/derived/900/"}, {URL:"/derived/42"}]) | True | Simultaneously gets the derived file metadata of 2 separate Object IDs. |
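As a small sketch of the "stringified" request body used by /read, the helper below turns a list of read URLs into the expected string-encoded array. `buildReadBody` is a hypothetical name. Sending the body (for example with axios or fetch, with credentials included) is left out so that the example runs without a server.

```javascript
// Build the body for POST /read: a string-encoded array of { URL } entries.
function buildReadBody(urls) {
  return JSON.stringify(urls.map((url) => ({ URL: url })));
}

// List two directories in one round trip.
const readBody = buildReadBody(["/list/900/", "/list/42/"]);
console.log(readBody);
// prints [{"URL":"/list/900/"},{"URL":"/list/42/"}]
```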

To get data into the system, a request with attached multipart/form-data needs to be sent to the /write endpoint. The transaction is an array of individual JSON Event Objects, in the order in which they need to be applied in the database (optionally including file objects in BLOB format appended to the form data when performing an upload). Detailed examples can be found in the /write endpoint section.

| Request Method | Endpoint URL | Request Body | Credentials Include | Description |
| --- | --- | --- | --- | --- |
| POST | /write | form data [{'meta':[Event1Object]}] | True | This endpoint requires form data with an appended array of Event Objects under the 'meta' property, specifying a modification to the system. A detailed example can be found in the /write endpoint section. |
| POST | /write | form data [{'meta':[Event1Object, Event2Object]}] | True | This endpoint can accept multiple Event Objects at the same time. |
| POST | /write | form data [{'meta':[Event1Object, Event2Object]}, {'blob':[BLOB1]}, {'blob':[BLOB2]}] | True | Multiple Event Objects can be accompanied by file blobs appended to the form data when performing an upload. |
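The form data layout above might be assembled as in the following sketch, which uses the FormData and Blob globals available in browsers and in Node.js 18 or later. It assumes that 'meta' carries the Events as a JSON string and that each file is appended under 'blob'. Confirm both details against the /write endpoint section. `buildWriteForm` and the file descriptor shape are hypothetical.

```javascript
// Assemble form data for POST /write: Events under "meta", optional files under "blob".
// Assumption: "meta" carries the Events as a JSON string.
function buildWriteForm(events, files = []) {
  const form = new FormData();
  form.append("meta", JSON.stringify(events));
  for (const file of files) {
    form.append("blob", new Blob([file.content], { type: file.mimetype }), file.name);
  }
  return form;
}

const uploadEvents = [
  {
    action: "C",
    name: "notes.txt",
    isfile: true,
    parentoid: 1,
    mimetype: "text/plain",
    originalobjectpolicy: "(if (contains email bob.null@email.com) (yield-all))",
    security: { label: "DECIPHER//GMDATA", foreground: "#FFFFFF", background: "#00FF00" },
  },
];
const writeForm = buildWriteForm(uploadEvents, [
  { name: "notes.txt", mimetype: "text/plain", content: "hello" },
]);
console.log([...writeForm.keys()]);
```

The resulting form would then be posted to /write with credentials included, as in the GET example earlier in this section.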

Common Request Error Response Codes

| HTTP Error Code | Common Causes |
| --- | --- |
| 400 | Bad Request is most often caused by a malformed Event Object in the form data sent to the /write endpoint. |
| 403 | Forbidden is most often caused by a JWT authentication token that does not match the Object's privileges. |
| 404 | Not Found is most often caused by an incorrect Object ID (oid) in the request. |
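A client might translate these codes into hints for the developer, as in this small sketch. The `explainStatus` helper is hypothetical, and the messages simply restate the common causes listed above.

```javascript
// Map the common error codes to the most likely cause described in the table above.
const COMMON_CAUSES = {
  400: "Bad Request: the Event Object in the /write form data is probably malformed.",
  403: "Forbidden: the JWT authentication token does not match the Object's privileges.",
  404: "Not Found: the Object ID (oid) in the request is probably incorrect.",
};

function explainStatus(status) {
  return COMMON_CAUSES[status] || `Unexpected status ${status}`;
}

console.log(explainStatus(403));
```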

Command Line Interface and Go Client Package

There is a command-line interface to support bulk and automated scenarios. It should ease the implementation burden for some very common tasks.

The CLI commands all need to connect in an authenticated manner, so there are environment variables associated with connecting. Here is an example of connecting to a PKI-enabled setup. The environment variables only need to be set once, in a wrapper script. The commands that follow the wrapper show typical usage once the environment variables are set up:

#!/bin/bash

# Name this script:
# gmdatatool.sh

## Environmental setup - depends on how gm-data TLS and address is configured
(
u=$(uname)
if [ "${u}" = "Darwin" ]
then
  b64="base64"
else
  b64="base64 -w 0"
fi
export MONGO_USE_TLS=false
export CLIENT_PORT=9443
export CLIENT_CN=localhost
export CLIENT_ADDRESS=localhost
export CLIENT_PREFIX=/services/gmdatax/latest
export CLIENT_USE_TLS=true
# wherever your certs are
export CLIENT_CERT=$(cat  ../../certs/localhost.crt    | ${b64})
export CLIENT_KEY=$(cat   ../../certs/localhost.key    | ${b64})
export CLIENT_TRUST=$(cat ../../certs/intermediate.crt | ${b64})

# Pass all arguments through, preserving the quoting of values that contain spaces
./gmdatatool.linux "$@"
)

# Create a directory that we control under self-service directory /world
./gmdatatool.sh mkdir --securitylabel "localuser owned" \
                     --securityfg "white" \
                     --securitybg "red" \
                     --policylabel "localuser owned" \
                     --objectpolicy '(if (contains email localuser@deciphernow.com)(yield-all)(yield R X))' \
                     /world/localuser@deciphernow.com
# Upload an entire application into /world/localuser@deciphernow.com
./gmdatatool.sh upload --securitylabel "SECRET" \
                     --securityfg "white" \
                     --securitybg "red" \
                     --policylabel "localuser owned" \
                     --objectpolicy '(if (contains email localuser@deciphernow.com)(yield-all)(yield R X))' \
                     /world/localuser@deciphernow.com  ../../static/ui

The tool ./gmdatatool.sh is a special-case use of the Go package github.com/greymatter-io/gm-data/client. The client is based around two important ideas:

  • Listening for changes in gm-data, and invoking callbacks when they happen.

  • Providing an API to respond to changes. Example uses:

    • Statically generated thumbnails

    • Run AWS Rekognition to upload derived files on images, such as object-labelling.

    • The written back files are json, and they point to the image that they are derived from

  • Responding to changes may happen through REST or Kafka.

There is a responder, with REST or Kafka constructors. The REST constructor filters out information based on objectPolicy (ie: it runs as a real user). The Kafka constructor runs on a privileged, unfiltered view of all events that happen on gm-data. Generally, the Kafka view is appropriate for back-end processes. The REST constructor is usable from front-end (ie: not originating from within Fabric itself, possibly even from web browsers calling the /notifications endpoint), or back-end.

// Create a client at the root
c, err := client.NewRESTResponder(
  logger,
  client.GetURL(),
  getClient(),
  listing.DefaultRootOID,
  policy.CurrentTstamp(),
  1000,
  time.Duration(2)*time.Second,
  client.CLIENT_IDENTITY.Str(),
  func(c *client.Responder, ev *listing.Event) error {
    return nil
  },
)
if err != nil {
  log.Printf("create client failed: %v", err)
  panic(err)
}

This responder polls every two seconds for new information (the time.Duration(2)*time.Second argument in the constructor call above) and gets up to 1000 events at a time. The callback allows us to inspect events with our code. Generally, when we see something interesting in the event (ev), we call different parts of the API:

  // Get an io.Reader on ev, as it is a file type that we are interested in
  blobData, err := c.StreamOf(ev.Oid, ev.Tstamp)

We may then go do something outside the scope of gm-data, such as turn a blob into a json file (ie: submit a jpg and get back a json description of it). Note that when we are doing listen and write-back like this, we typically end up setting Derived fields, so that we can track the lineage of why the file exists, and what created it. We can correlate a jpg of a face with a json about it, so that we can delete them both if we are asked to delete the file.

m := c.NewWriteMarshaler()
defer m.Close()
err = m.Append(&listing.EventArgs{
  Action:       policy.ActionUpdate,
  IsFile:       true,
  ParentOID:    ev.ParentOID,
  Name:         newFname,
  MimeType:     "application/json",
  ObjectPolicy: policy.ForReadAllFull,
  Derived: listing.Derived{
    Oid:    ev.Oid,
    Tstamp: ev.Tstamp,
    Type:   kind,
  },
  Security:      ev.Security,
  BlobAlgorithm: "none",
}, newFname)
...
req, err := c.NewWriteRequest(m)
...
res, evs, err := c.DoWriteRequest(req)
...

The client API supports the following functions, all of which are needed to respond to changes in gm-data with write-backs of new derived files. For things related to read endpoints:

  • NewRESTResponder/NewKafkaResponder - Listen on /notifications, which is the critical reason for having a client library, to respond to changes being made in gm-data.

  • StreamOf - Get the bytes for an (oid,tstamp), where tstamp is optional, so that you get the latest blob.

  • EventOf - Get the properties for an (oid,tstamp), or latest if tstamp is not included.

  • DerivedOf - Find out what is already derived from this file. This is how you could know that a thumbnail already exists for a file.

  • Self - Discover what we are authenticated as, which is important for troubleshooting.

  • HistoryOf - Every event pertaining to an oid. This is the lifecycle of the inode, across all changes (including name, parent, policy, security labels, etc).

Note that more complex paging options are not being used with these simple client libraries.

For things related to the write endpoint, which are a bit more difficult to write directly against the API for yourself than the read endpoints:

  • AppendTree - Perform a bulk upload of a large directory, where you have the opportunity to set security labels and policies individually

  • Append - A raw append to update an individual file or directory

Example use case:

  • GDPR requires that an individual can demand the removal of files "about" them.

  • In order to comply, if we have a jpg with attached metadata that says that the individual is named in the file, then we can issue a delete on both files.

  • This is possible because we track the Derived file pointers.

  • The /derived endpoint lets us find all files that point to us with a Derived pointer, so that we can find an entire tree of files that started from a single input file. Example: elasticSearchEntry derivedfrom facesIndex, facesIndex derivedfrom jpg
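The deletion use case above amounts to walking Derived pointers transitively. The sketch below shows that traversal over an in-memory list of Events. It is a conceptual illustration in JavaScript rather than a use of the Go client, and the oids, file names, derivation types, and `collectDerivedTree` helper are all hypothetical.

```javascript
// Collect every oid that is directly or transitively derived from rootOid.
// Each event is assumed to carry derived: { oid } pointing at its source object.
function collectDerivedTree(events, rootOid) {
  const bySource = new Map();
  for (const ev of events) {
    if (ev.derived && ev.derived.oid !== undefined) {
      const list = bySource.get(ev.derived.oid) || [];
      list.push(ev.oid);
      bySource.set(ev.derived.oid, list);
    }
  }
  // Breadth-first walk over the derived pointers, guarding against cycles.
  const found = [];
  const seen = new Set([rootOid]);
  const queue = [rootOid];
  while (queue.length > 0) {
    const current = queue.shift();
    for (const child of bySource.get(current) || []) {
      if (!seen.has(child)) {
        seen.add(child);
        found.push(child);
        queue.push(child);
      }
    }
  }
  return found;
}

const files = [
  { oid: 900, name: "face.jpg" },
  { oid: 901, name: "facesIndex.json", derived: { oid: 900, type: "jpg-to-faces" } },
  { oid: 902, name: "elasticSearchEntry.json", derived: { oid: 901, type: "faces-to-es" } },
  { oid: 903, name: "unrelated.txt" },
];

// Everything that must be deleted along with the original jpg.
const toDelete = [900, ...collectDerivedTree(files, 900)];
console.log(toDelete);
// prints [ 900, 901, 902 ]
```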

Environment And Deployment

The gm-data service builds a binary called gmdatax.linux that is configured entirely by environment variables (to avoid a requirement to mount files). The binary is, however, packaged with some other files:

  • ./runforever - a shell script that keeps ./gmdatax.linux in a restart loop to handle unintentional crashes of the binary. This allows us to catch things like array out of bounds, nil pointer dereference, or catastrophic resource exhaustion such as running out of file handles. It is these latter cases that drive the decision to allow the binary to die.

  • ./gmdatax.linux - the actual gmdata binary, that reads in environment variables.

  • ./VERSION - the version of this service

  • ./static/ - a bundle of runtime API user documentation and a test user interface. This directory is served literally out of gm-data under the URL /static/

  • ./certs/ - a directory that the binary can write certificates into on startup. The certificates originate from environment variables that carry full pem files as single-line base64.

  • ./logs/ - a place to write logs (in non-default cases), and may be mounted over to keep the root partition from running out of space.

gm-data will make every possible attempt to look at your configuration and immediately crash with a detailed explanation of what to actually do about it. This includes looking up hostnames in DNS to verify that they exist. Always look in the log files for gm-data if something does not seem right on startup. But it cannot detect inconsistency issues at a higher level, such as one service offering a cert that is then trusted by a service that will try to connect to it. That would require analyzing a larger set of environment variables that are destined for multiple services.

Basic Environment Variables

  • MASTERKEY❗ is mandatory. This is the key that is used to encrypt data.

  • JWT_PUB is the single-line base64 encode of the signing key that the gm-data server trusts to sign JWT tokens. This is a mandatory parameter. It is not an X509 certificate. It is an actual Elliptic Curve key that is suitable for ES512 in the JWT standard.

  • FILE_BUCKET is mandatory (aka: AWS_S3_BUCKET). This says where we write gm-data ciphertext out to AWS.

  • FILE_PARTITION is mandatory (aka: AWS_S3_PARTITION). This should be set to a value that is unique to a set of replicated Fabric clusters. It is literally a subdirectory in FILE_BUCKET. This exists so that we don't need to create lots of buckets constantly, yet can still distinguish which data in a bucket belongs to which installation.

  • AWS_REGION is required if USES3=true.

  • AWS_S3_ENDPOINT is only required in government setups that need to point to a different hostname for S3.

  • AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY❗ may be set to give AWS credentials in the case where IAM roles are not used for the EC2 instance. AWS_SECRET_ACCESS_KEY❗ is a secret, obviously.

When you disable S3 use like USES3=false, the bucket and partition are still used. The directory ./buckets/${FILE_BUCKET}/${FILE_PARTITION} should exist and be writable by the gm-data process ./gmdatax.linux.
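The rules above can be summarized as a small pre-flight check that a deployment script might run before launching the service. This is a sketch of the documented requirements only, not gm-data's actual startup validation. `validateBasicEnv` and `localBlobDir` are hypothetical helpers, and the environment values shown are placeholders.

```javascript
// Check the basic environment variables described above and report any problems.
function validateBasicEnv(env) {
  const problems = [];
  for (const name of ["MASTERKEY", "JWT_PUB", "FILE_BUCKET", "FILE_PARTITION"]) {
    if (!env[name]) problems.push(`${name} is mandatory`);
  }
  if (env.USES3 === "true" && !env.AWS_REGION) {
    problems.push("AWS_REGION is required when USES3=true");
  }
  return problems;
}

// With USES3=false, blobs go to a local directory built from bucket and partition.
function localBlobDir(env) {
  return `./buckets/${env.FILE_BUCKET}/${env.FILE_PARTITION}`;
}

const exampleEnv = {
  MASTERKEY: "placeholder-master-key",
  JWT_PUB: "placeholder-base64-public-key",
  FILE_BUCKET: "example-bucket",
  FILE_PARTITION: "example-partition",
  USES3: "false",
};
console.log(validateBasicEnv(exampleEnv), localBlobDir(exampleEnv));
```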

The JWT_PUB is the public part of an elliptic curve key. The private part is PRIVATE_KEY❗ for the JWT server. The parameters for use with the JWT libraries are rather specific because of the curve name secp521r1. The commands below show how we generate our keypairs: first a private key for gm-jwt-security to sign with (file jwtES512.key❗), then the public key derived from it (jwtES512.key.pub), which is set for gm-data as JWT_PUB.

openssl ecparam -genkey -name secp521r1 -noout -out jwtES512.key
openssl ec -in jwtES512.key -pubout -out jwtES512.key.pub

References Between Services

  • Prefix patterns. When gm-data needs to make reference to another service, these are relevant environment variables:

    • CLIENT_PREFIX is the URL that the gateway is mapping gm-data service to. This is done so that we can send back links that resolve properly in html files. We do this because we cannot hardcode even our own service name, and also cannot correctly give a relative path. Example: /services/gmdatax/latest

    • CLIENT_JWT_PREFIX is the URL that the gateway is mapping our peer service gm-jwt-security to. This is done so that we can send back links that resolve properly in html files. Example: /services/jwt-server/1.0, or /services/jwt-server-gov/1.0.

We have explicit dependencies on these things:

  • a JWT token issuer, that has a proper sidecar, and is reachable through the edge

  • a Mongo database, which is not mounted into the Fabric framework; so is not reached via a sidecar, or through the edge.

  • Kafka, which is not mounted into the Fabric framework; so is not reached via a sidecar, or through the edge.

TLS

Services that use TLS will end up creating a large number of environment variables. We follow a principle of passing in pem files as a single line of base64 of the original pem file. That means that we create such files as environment variables on the host that is preparing the deployment. Here is an example of setting up the trust for our Mongo dependency:

MONGO_TRUST=`cat server.trust.pem | base64 -w 0`
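The same single-line encoding can be produced and reversed programmatically, which may help when a deployment tool generates the environment instead of a shell. The following is a minimal sketch. The helper names are hypothetical and the PEM content is a placeholder, not a real certificate.

```javascript
// Encode a multi-line PEM file as one line of base64, suitable for an environment variable.
function pemToSingleLine(pem) {
  return Buffer.from(pem, "utf8").toString("base64");
}

// Reverse the encoding to recover the original PEM text.
function singleLineToPem(value) {
  return Buffer.from(value, "base64").toString("utf8");
}

// Placeholder PEM body; a real file would contain base64-encoded certificate data.
const pem = "-----BEGIN CERTIFICATE-----\nPLACEHOLDER\n-----END CERTIFICATE-----\n";
const encoded = pemToSingleLine(pem);
console.log(encoded.includes("\n"), singleLineToPem(encoded) === pem);
// prints false true
```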

When TLS connections to peer services are involved, this pattern in name suffixes arises:

  • ADDRESS (or HOST) - ip or hostname of the peer.

  • PORT - port for the peer.

  • USE_TLS - use TLS

  • CERT - a base64 single-line encode of the pem cert (which also happens to be multi-line base64).

  • KEY❗ - base64 single-line encode of the pem key (which is a multi-line base64). This is also a secret.

  • TRUST - this is similar to CERT. It may encode a concatenated list of pem files for certs.

  • CN - the ServerName expected. This is usually the same as the CN in the remote cert, but may also be an SNI name that matches a wildcard in the CN. If this is not set, then we will contact the server to try to grab the CN out of the remote certificate.

With that being said, these variables are grouped together.

  • MONGO related connect info

    • MONGOHOST - slightly violates our pattern. This can be a list of host:port pairs, like mongodata:27017,mongodata:27017. This is because in a clustered setting, connections are not made to individual machines, but to entire clusters. The PORT part is already taken care of.

    • MONGODB - is not strictly part of TLS, but we need to know the database that we are connecting to.

    • MONGO_USE_TLS - says whether to use the TLS variables to make a TLS connection.

    • MONGO_CERT - is the client PKI cert that we identify ourselves with.

    • MONGO_KEY❗ - is the key that goes with MONGO_CERT.

    • MONGO_TRUST - is the trust file to connect to Mongo servers.

    • MONGO_CN - is SNI name for the mongo cert, the manually set serverName expected. If this is not set, then we will contact the server to try to grab the CN out of the remote certificate.

    • MONGO_INITDB_ROOT_USERNAME - is the username we will use (not necessarily related to the root username however).

    • MONGO_INITDB_ROOT_PASSWORD❗ - the password for MONGO_INITDB_ROOT_USERNAME. This is a secret.

  • GMDATA TLS info, for our own service. This generally only happens when the sidecar egress is mTLS.

    • GMDATA_USE_TLS - Says whether to use TLS. This will need to be coordinated with how our sidecar is setup. Our sidecar EGRESS will need to be a client of this TLS connection.

    • GMDATA_CERT - The identity cert of gmdata that will be presented to sidecar.

    • GMDATA_KEY❗ - The key that goes with GMDATA_CERT

    • GMDATA_TRUST - The sidecar will need to present a cert that is signed by something in this TRUST

  • CLIENT_JWT_ENDPOINT prefixed environment variables are relevant to gmdata looking up userpolicyid (a random key to find a JWT) to get a userpolicy (an actual JWT token). This is only needed in cases where we have a jwt server indirectly via userpolicyid.

    • CLIENT_JWT_ENDPOINT_ADDRESS - is the hostname of the JWT server

    • CLIENT_JWT_ENDPOINT_PORT - is the port of the JWT server

    • CLIENT_JWT_ENDPOINT_USE_TLS

    • CLIENT_JWT_ENDPOINT_CERT

    • CLIENT_JWT_ENDPOINT_KEY❗

    • CLIENT_JWT_ENDPOINT_CN - Expected SNI name

    • CLIENT_JWT_ENDPOINT_TRUST

    • CLIENT_JWT_ENDPOINT_PREFIX - if we connect directly or to the sidecar, then this is just left at empty string "". But if we go through the edge, which is an unlikely case, this ends up needing to be set to the same value as CLIENT_JWT_PREFIX.

    • JWT_API_KEY❗ - is a base64 password that the JWT server will require to accept connections to resolve access codes for JWT tokens (userpolicyid) to actual JWT tokens (userpolicy).

Note that for the JWT server, we are trying to form a connection URL like:

# proto is either http or https depending on CLIENT_JWT_ENDPOINT_USE_TLS
# cert setup is the normal pattern:
# CLIENT_JWT_ENDPOINT_CERT
# CLIENT_JWT_ENDPOINT_KEY
# CLIENT_JWT_ENDPOINT_TRUST
GET ${proto}://${CLIENT_JWT_ENDPOINT_ADDRESS}:${CLIENT_JWT_ENDPOINT_PORT}${CLIENT_JWT_ENDPOINT_PREFIX}/policies

Internally, gm-data sees a userpolicyid header, and connects to that URL to try to get a userpolicy object, which may be too large to fit into an HTTP header. Notice that CLIENT_JWT_ENDPOINT_PREFIX exists only so that we can go through the edge instead of the sidecar. In the normal case CLIENT_JWT_ENDPOINT_PREFIX="", because we want to talk to the sidecar.
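The URL pattern above can be sketched in a few lines of Python. This is only an illustration of how the CLIENT_JWT_ENDPOINT_* variables combine, not gm-data's actual implementation; the function name build_policies_url is hypothetical, and treating any value other than "true" as plaintext is an assumption.

```python
import os


def build_policies_url(env=None):
    """Assemble the JWT lookup URL from CLIENT_JWT_ENDPOINT_* settings."""
    env = os.environ if env is None else env
    # Assumption: only the literal value "true" turns TLS on.
    use_tls = env.get("CLIENT_JWT_ENDPOINT_USE_TLS", "false").strip().lower() == "true"
    proto = "https" if use_tls else "http"
    address = env["CLIENT_JWT_ENDPOINT_ADDRESS"]
    port = env["CLIENT_JWT_ENDPOINT_PORT"]
    # The prefix may be the empty string; a trailing slash is dropped so
    # the result never contains "//policies".
    prefix = env.get("CLIENT_JWT_ENDPOINT_PREFIX", "").rstrip("/")
    return f"{proto}://{address}:{port}{prefix}/policies"


print(build_policies_url({
    "CLIENT_JWT_ENDPOINT_ADDRESS": "jwt-server-proxy",
    "CLIENT_JWT_ENDPOINT_PORT": "8080",
    "CLIENT_JWT_ENDPOINT_PREFIX": "",
}))
# prints http://jwt-server-proxy:8080/policies
```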

Examples:

  • Talk to our own local sidecar in plaintext to reach JWT (preferred):

    • CLIENT_JWT_ENDPOINT_PREFIX=/services/jwt-server/latest

    • CLIENT_JWT_ENDPOINT_ADDRESS=gmdata-proxy

    • CLIENT_JWT_ENDPOINT_PORT=8080

  • Talk to a JWT sidecar directly (not preferred):

    • CLIENT_JWT_ENDPOINT_PREFIX=

    • CLIENT_JWT_ENDPOINT_ADDRESS=jwt-server-proxy

    • CLIENT_JWT_ENDPOINT_PORT=8080

CLIENT_JWT_ENDPOINT_USE_TLS may require connecting with a sidecar-issued cert that may not exist at the time gm-data launches. So, note that using GMDATA_USE_TLS in the mesh may be complicated by this fact.

Miscellaneous parameters

  • DONT_PANIC - is an advanced parameter that says to only WARN, but do not CRASH when inconsistent environment variables are detected. If you run with this setting, you run the risk of creating a setup that we cannot support. Sometimes you need to temporarily ignore known problems. So, this should be disabled as soon as possible if it is ever used.

  • LESS_CHATTY_INFO - by default, we like less chatty logs. If you want a lot more logging information that includes the begin and end of sessions in which there were no problems, then you can set this to false.

  • GMDATAX_SESSION_MAX - is an admission control value. This imposes a limit on the number of outstanding requests gm-data will allow to be concurrently serviced. It is literally a maximum population at which gm-data just issues 503 to tell the client to get out of line, and come back later. It exists because if we run out of filehandles, the server will become unstable and crash in an irregular manner. If this server runs out of filehandles, then GMDATAX_SESSION_MAX should be lowered to a value that causes us to stop running out of filehandles. It may need to be raised if we get 503 errors that actually originate from gm-data itself. Our proxy may also issue 503 in the case of admission control, which complicates determining which one ran out. It is more likely that Envoy will run out of filehandles before gm-data will, because the front-end is dealing with a lot of services concurrently.
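As a rough illustration of the admission control idea behind GMDATAX_SESSION_MAX, here is a minimal Python sketch that uses a counting semaphore. It is not gm-data's code; the class name AdmissionControl and the (status, body) return shape are assumptions made for the example.

```python
import threading


class AdmissionControl:
    """Reject work once a fixed number of requests are already in flight."""

    def __init__(self, session_max):
        self._slots = threading.BoundedSemaphore(session_max)

    def handle(self, work):
        # Non-blocking acquire: when no slot is free, answer 503 right away
        # instead of queueing the request.
        if not self._slots.acquire(blocking=False):
            return 503, "server busy, retry later"
        try:
            return 200, work()
        finally:
            self._slots.release()


gate = AdmissionControl(session_max=1)
# A nested call stands in for a second request arriving while the first is in flight.
outer = gate.handle(lambda: gate.handle(lambda: "inner"))
print(outer)  # (200, (503, 'server busy, retry later'))
```

The limit counts requests in flight rather than requests per second, which is why it maps naturally onto a resource such as open filehandles.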

  • GMDATA_NAMESPACE - Typical value is world. In order to avoid having to create root access tokens to get the system bootstrapped, we allow for the creation of a self-service directory. If this value is /world then the home directory can be created here, on the condition that the directory is named after the field named in GMDATA_NAMESPACE_USERFIELD, which is typically email. For example: /world is created empty on init of gm-data. A user then uses static/ui to create the directory /world/rob.johnson@email.com, which is only allowed because he came in with a JWT token matching {values: {email: ["rob.johnson@email.com"]}}.

  • GMDATA_NAMESPACE_USERFIELD - Typical value is email.
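
The self-service rule described for GMDATA_NAMESPACE and GMDATA_NAMESPACE_USERFIELD can be sketched as a small check. This is a simplified illustration, not the real policy evaluation; the function may_create_home is hypothetical, and it assumes the claims have the label/values shape of a userPolicy.

```python
def may_create_home(path, claims, namespace="world", userfield="email"):
    """Return True if `path` is a home directory the caller may create for themselves."""
    parts = [p for p in path.split("/") if p]
    # Only direct children of the namespace directory qualify.
    if len(parts) != 2 or parts[0] != namespace.strip("/"):
        return False
    # The directory name must be one of the caller's values for the user field.
    return parts[1] in claims.get("values", {}).get(userfield, [])


rob = {"label": "asRob", "values": {"email": ["rob.johnson@email.com"]}}
print(may_create_home("/world/rob.johnson@email.com", rob))   # True
print(may_create_home("/world/someone.else@email.com", rob))  # False
```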

If an environment variable you are looking for was not mentioned here, it is probably not something that you should need to change in a normal setup. For more detail, see the auto-generated documentation on environment variables used in gm-data under Deploy - Environment Variables.

Kafka Connect

To point gm-data at Kafka in the simplest plaintext case, set the env vars relating to Kafka. At a minimum, point to the brokers and name the topics.

KAFKA_PEERS=kafka:9092
KAFKA_TOPIC_ERROR=gmdatax-error
KAFKA_TOPIC_READ=gmdatax-read
KAFKA_TOPIC_UPDATE=gmdatax-update
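
To sanity check these values before launch, something like the following Python sketch could be used. It only mirrors the documented shapes (a comma-delimited list of host:port for KAFKA_PEERS, non-whitespace tokens for the topics); the helper names parse_peers and kafka_settings are hypothetical and are not part of gm-data.

```python
def parse_peers(value):
    """Split a comma-delimited host:port list into (host, port) tuples."""
    peers = []
    for item in value.split(","):
        host, sep, port = item.strip().rpartition(":")
        if not sep or not host or not port.isdigit():
            raise ValueError(f"expected host:port, got {item!r}")
        peers.append((host, int(port)))
    return peers


def kafka_settings(env):
    """Collect the minimum Kafka settings: brokers plus the three topic names."""
    topics = {}
    for name in ("KAFKA_TOPIC_ERROR", "KAFKA_TOPIC_READ", "KAFKA_TOPIC_UPDATE"):
        topic = env.get(name, "")
        # Topic names are expected to be non-whitespace tokens.
        if not topic or any(ch.isspace() for ch in topic):
            raise ValueError(f"{name} must be a non-whitespace token")
        topics[name] = topic
    return {"peers": parse_peers(env["KAFKA_PEERS"]), "topics": topics}


settings = kafka_settings({
    "KAFKA_PEERS": "kafka:9092",
    "KAFKA_TOPIC_ERROR": "gmdatax-error",
    "KAFKA_TOPIC_READ": "gmdatax-read",
    "KAFKA_TOPIC_UPDATE": "gmdatax-update",
})
print(settings["peers"])  # [('kafka', 9092)]
```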

Deploy - Environment Variables

| Name | Default | Description | Example | Type |
| --- | --- | --- | --- | --- |
| DISABLE_LOOKUPS | false | don't dns check env vars representing hosts | true | |
| DONT_PANIC | false | disable panic when environment looks mis-configured | true | |
| LESS_CHATTY_INFO | true | chatty info logs will write something to the log when a transaction begins, when there are no problems | false | |
| CLIENT_JWT_PREFIX | /services/gm-jwt-security/1.0 | endpoint prefix for primary jwt service to resolve pointers to JWT tokens | /services/gm-jwt-security-gov/1.0 | |
| CLIENT_JWT_ENDPOINT_ADDRESS | | ip of jwt server in the network | | a hostname |
| CLIENT_JWT_ENDPOINT_PORT | | port of jwt server in the network | 8443 | an unsigned int |
| CLIENT_JWT_ENDPOINT_CERT | | JWT server client cert | base64 line pem written to certs/jwt.cert.pem | |
| CLIENT_JWT_ENDPOINT_KEY | | ❗JWT server client key | base64 line pem written to certs/jwt.key.pem | |
| CLIENT_JWT_ENDPOINT_TRUST | | JWT server trust | base64 line pem written to certs/jwt.trust.pem | |
| CLIENT_JWT_ENDPOINT_PREFIX | | prefix to reach the CLIENT_JWT_PREFIX when proxied | localhost | |
| CLIENT_JWT_ENDPOINT_USE_TLS | false | use tls to connect to jwt endpoint | true | |
| CLIENT_JWT_ENDPOINT_CN | | the server name expected for this cert | | |
| GMDATA_FABRIC_CLUSTER | default | the name of this fabric cluster | us-east | |
| ZEROLOG_LEVEL | WARN | logging level: INFO, DEBUG, WARN, ERR | INFO | |
| MASTERKEY | | ❗❗Master key for the encrypted content | som3r9doMg1bberish | master key for the data |
| AWS_REGION | | Bucket location | us-east-1 | some non-whitespace token |
| AWS_S3_BUCKET | | Bucket name, overridden by FILE_BUCKET | | AWS_S3_BUCKET= must match a token without whitespace or special chars |
| AWS_S3_PARTITION | | Subdirectory within the S3 bucket, overridden by FILE_PARTITION | username | |
| FILE_BUCKET | | Bucket name | | FILE_BUCKET= must match a token without whitespace or special chars |
| FILE_PARTITION | | Subdirectory within the file bucket | username | |
| AWS_S3_ENDPOINT | | Bucket host override | s3.region.aws.com | a hostname |
| AWS_REKOGNITION_ENDPOINT | | Bucket host override | rek.region.aws.com | a hostname |
| AWS_ACCESS_KEY_ID | | Set if not using IAM roles for the machine | AKAI... | iam roles used |
| AWS_SECRET_ACCESS_KEY | | ❗Set if not using IAM roles for the machine | AEFE... | iam roles used |
| USES3 | true | Use S3 | false | S3 bucket setup |
| S3_TASKS | 512 | Max number of concurrent S3 tasks | 64 | an unsigned int |
| KAFKA_PEERS | | Kafka nodes to talk to directly. A comma-delimited list of host:port pairs | localhost:9092 | a comma-delimited list of host:port |
| KAFKA_TOPIC_UPDATE | | Kafka topic for update events | gmdu | some non-whitespace token |
| KAFKA_TOPIC_READ | | Kafka topic for read events | gmdr | some non-whitespace token |
| KAFKA_TOPIC_ERROR | | Kafka topic for errors | gmde | some non-whitespace token |
| KAFKA_CONSUMER_GROUP | test1 | Kafka consumer group id | imageconverters | some non-whitespace token |
| KAFKA_CERT | | id cert | single line base64 of pem | KAFKA_CERT is expecting a single-line base64 encoded string |
| KAFKA_KEY | | id key | single line base64 of pem | KAFKA_KEY is expecting a single-line base64 encoded string |
| KAFKA_TRUST | | id trust | single line base64 of pem | KAFKA_TRUST is expecting a single-line base64 encoded string |
| KAFKA_USE_TLS | false | use tls for kafka directly | true | |
| KAFKA_CN | false | cn for kafka | true | |
| TEST_JWT_PRIV | | ❗❗test only! a base64 encoded single line of the private key for internal signing during tests | base64 encoded line | |
| JWT_PUB | | the single-line base64 encoding of the public key of jwt tokens we accept | export JWT_PUB=$(cat jwtRS256.key.pub \| base64 -w 0) | JWT_PUB is expecting a single-line base64 encoded string |
| JWT_PUB_1 | | the single-line base64 encoding of the public key of jwt tokens we accept | export JWT_PUB=$(cat jwtRS256.key.pub \| base64 -w 0) | JWT_PUB_1 is expecting a single-line base64 encoded string |
| JWT_PUB_2 | | the single-line base64 encoding of the public key of jwt tokens we accept | export JWT_PUB=$(cat jwtRS256.key.pub \| base64 -w 0) | JWT_PUB_2 is expecting a single-line base64 encoded string |
| JWT_PUB_3 | | the single-line base64 encoding of the public key of jwt tokens we accept | export JWT_PUB=$(cat jwtRS256.key.pub \| base64 -w 0) | JWT_PUB_3 is expecting a single-line base64 encoded string |
| JWT_PUB_4 | | the single-line base64 encoding of the public key of jwt tokens we accept | export JWT_PUB=$(cat jwtRS256.key.pub \| base64 -w 0) | JWT_PUB_4 is expecting a single-line base64 encoded string |
| JWT_NOT_BEFORE_SKEW_SECONDS | 86400 | seconds that not-before is in the past, to handle mutual clock skews | 60 | an unsigned int |
| MONGOHOST_MASTER | | Mongo host ip:port that we replicate with | m1:27017,m2:27017 | a comma-delimited list of host:port |
| MONGODB_MASTER | | Mongo database we replicate with | gmdatadev | some non-whitespace token |
| MONGOHOST | | Mongo host ip:port | m1:27017,m2:27017 | a comma-delimited list of host:port |
| MONGODB | gmdatax | Mongo database | gmdatadev | some non-whitespace token |
| MONGO_CERT | | Mongo TLS cert base64 | cat ./certs/server.cert.pem \| base64 -w 0 | MONGO_CERT is expecting a single-line base64 encoded string |
| MONGO_KEY | | ❗Mongo TLS cert key base64 | cat ./certs/server.key.pem \| base64 -w 0 | MONGO_KEY is expecting a single-line base64 encoded string |
| MONGO_TRUST | | Mongo TLS trust base64 | cat ./certs/server.trust.pem \| base64 -w 0 | MONGO_TRUST is expecting a single-line base64 encoded string |
| MONGO_CN | | Mongo SNI name | | |
| MONGO_SOURCE | | Mongo login source | $external | |
| MONGO_MECHANISM | | Mongo login mechanism | MONGODB-X509 | |
| MONGO_USE_TLS | false | Mongo use TLS | true | |
| MONGO_INITDB_ROOT_USERNAME | | MongoDB user id | mongoadmin | |
| MONGO_INITDB_ROOT_PASSWORD | | ❗MongoDB password | S0m3Pass | |
| TEST_LOAD_ITERATIONS | | number of iterations for load test | 10000 | an unsigned int |
| GMDATA_NAMESPACE | | A directory in the root that lets you create content as yourself | | |
| GMDATA_NAMESPACE_USERFIELD | | The field that matches up with the directory you can create | | |
| GMDATA_NAMESPACE_TEMPLATE | (if (contains %s "%s") (yield-all) (yield R X)) | The default template to create a user implicitly | | |
| DELETE_EXPIRED | false | Actually remove expired entries periodically to comply with privacy laws | | DELETE_EXPIRED= should be true or false |
| DELETE_EXPIRED_POLL_SECONDS | 600 | Number of seconds to poll for expired data | 3600 | an unsigned int |
| NOTIFICATION_CACHE_SIZE | 1000 | Number of items to cache when watching notifications on an oid | 100 | an unsigned int |
| MIMETYPES_OVERRIDE | | Supply an alternate mime.types | ./mime.types | |
| LISTING_DEBUG | false | Turn on debug for listing package | true | |
| BIND_ADDRESS | 0.0.0.0 | bind address for port | 127.0.0.1 | a hostname |
| BIND_PORT | 8181 | bind port | 9123 | an unsigned int |
| PRETTY_PRINT | true | pretty print returning json by default. set this to false in production, as it makes json larger. | false | |
| HTTP_TRANSPORT_CANCEL_HOURS | 4 | Hours before http call is cancelled | 24 | an unsigned int |
| USE_PPROF_CPU | true | CPU profiling in pprof | false | |
| USE_PPROF_MEM | true | mem profiling in pprof | false | |
| HTTP_CACHE_SECONDS | 10 | http default cache in seconds | 60 | an unsigned int |
| TRACE_LOG | | write a trace to file name | /logs/trace.out | |
| REKOGNITION_FACE_INDEX | | Set a face index for AWS Rekognition | hackathon | |
| LOG_OPEN_FILE_HANDLES | true | log open file handles to look for leaks | false | |
| GMDATAX_CATCH_PANIC | false | catch panics rather than restarting gmdatax | true | |
| GMDATAX_SESSION_MAX | 4096 | max http sessions in progress | 10000 | an unsigned int |
| JWT_API_KEY | | jwt api key | a password | JWT_API_KEY is expecting a single-line base64 encoded string |
| NAMED_BANNER | true | include name in banner | false | |
| GMDATA_CERT | | id cert | single line base64 of pem | GMDATA_CERT is expecting a single-line base64 encoded string |
| GMDATA_KEY | | id key | single line base64 of pem | GMDATA_KEY is expecting a single-line base64 encoded string |
| GMDATA_TRUST | | id trust | single line base64 of pem | GMDATA_TRUST is expecting a single-line base64 encoded string |
| GMDATA_USE_TLS | false | use tls for gmdata directly | true | |
| GMDATA_REQUIRE_CLIENT_CERT | true | demand a client cert | false | |
| GMDATA_AUTHENTICATION_HEADER | USER_DN | a header that is TRUSTED to contain an authenticated user id. disable with value '-'. | - | |
| POLICY_CACHE_LIFETIME | 60 | amount of time an object lives in objectpolicy cache | 30 | an unsigned int |
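
Many of the cert and key variables above expect a single-line base64 encoded string. As a hedged illustration of what that means, the following Python sketch produces the same kind of value as `base64 -w 0` and checks that a value has that shape. The helper names are hypothetical and the PEM bytes are a placeholder, not a real certificate.

```python
import base64


def single_line_base64(data):
    """Encode bytes as base64 with no line breaks, like `base64 -w 0`."""
    return base64.b64encode(data).decode("ascii")


def looks_like_single_line_base64(value):
    """Check that a value is one line and decodes cleanly as base64."""
    if not value or any(ch.isspace() for ch in value):
        return False
    try:
        base64.b64decode(value, validate=True)
    except ValueError:
        return False
    return True


# Placeholder content standing in for the bytes of a PEM file.
pem = b"-----BEGIN CERTIFICATE-----\nplaceholder\n-----END CERTIFICATE-----\n"
encoded = single_line_base64(pem)
print(looks_like_single_line_base64(encoded))        # True
print(looks_like_single_line_base64("not base64!"))  # False
```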
