# Data

Grey Matter Data is a platform service for the versioned and encrypted storage of media blobs and assets.

> Note that we mark sensitive values with "❗" so that it is clear what must be kept private, versus what is safely made public.

* [API Basics](#api-basics)
* [Authentication with JWT](#authentication-with-jwt)
  * [JWT Service](#jwt-service)
* [Event](#event)
  * [Event Examples](#event-examples)
    * [Full *Event* Structure](#full-event-structure)
    * [Upload *Event* Examples](#upload-event-examples)
    * [Delete *Event* Example](#delete-event-example)
    * [Rename *Event* Example](#rename-event-example)
    * [Move *Event* Example](#move-event-example)
* [Object ID](#object-id)
* [API Overview](#api-overview)
* [Information Retrieval](#information-retrieval)
* [Information Modification](#information-modification)
* [Prerequisites](#prerequisites)
* [API File System Examples](#api-file-system-examples)
* [Common GET Requests](#common-get-requests)
* [Common POST Requests](#common-post-requests)
* [Common Request Error Response Codes](#common-request-error-response-codes)
* [Command Line Interface and Go Client Package](#command-line-interface-and-go-client-package)
* [Environment And Deployment](#environment-and-deployment)
  * [Basic Environment Variables](#basic-environment-variables)
  * [References Between services](#references-between-services)
  * [TLS](#tls)
  * [Miscellaneous parameters](#miscellaneous-parameters)
  * [Kafka Connect](#kafka-connect)
* [Deploy - Environment Variables](#deploy---environment-variables)

## API Basics

Grey Matter Data uses JSON Web Tokens (JWT) for authentication. Each request to Grey Matter Data must include a cookie in its headers that is based on the authentication JWT. Grey Matter Data tracks information through Event Objects. These Event Objects capture all changes and reflect the Kafka event queue that supports the system. Each Event Object is associated with a file through its Object ID (`oid`) parameter. The parameters of the Event Object form relationships between files in the system.

### JWT Service

A JWT service, such as `jwt-server`, assumes that a system has already authenticated you via a proxy, which inserts the `USER_DN` header. The JWT service takes a `redirect` argument and a `path` argument. The `path` is the set of URLs over which the cookie will be sent. The `redirect` is a URL within that path. The cookie is written out with the name `userpolicy` and with `HttpOnly` set to true, which prevents client scripts from accessing it.

The JWT token includes claims with the following format:

* **Label:** the name that the token will be logged under goes here.
* **Values:** a hashtable from string to lists of strings is used to evaluate the JWT token against an objectPolicy.

Here is an example of a JWT token representing a userPolicy.

```javascript
{
  "label": "asRob",
  "values": {
    "age": [
      "driving"
    ],
    "citizenship": [
      "USA"
    ],
    "email": [
      "rob.johnson@email.com"
    ],
    "name": [
      "rob johnson"
    ],
    "org": [
      "www.deciphernow.com"
    ]
  }
}
```
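
The claims format above can be checked mechanically before a policy is evaluated against it. The following is a minimal sketch, not part of gm-data itself; the function name `hasPolicyValues` is hypothetical, and the check covers only the `values` shape described above (a map from string to lists of strings).

```javascript
// Hypothetical helper: returns true when `claims.values` is a plain object
// whose every entry is an array of strings.
function hasPolicyValues(claims) {
  if (claims === null || typeof claims !== "object") return false;
  const values = claims.values;
  if (values === null || typeof values !== "object" || Array.isArray(values)) {
    return false;
  }
  return Object.values(values).every(
    (list) => Array.isArray(list) && list.every((item) => typeof item === "string")
  );
}
```

A token whose `values` entries are bare strings rather than arrays would be rejected by this check.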

## Event

Grey Matter Data tracks all changes through JSON ***Event*** objects. Events represent a portion (limited by the user's security access) of the Kafka messaging queue that supports Grey Matter Data. Any modification to the system is carried out through the `/write` endpoint by supplying one or more Events that describe the required actions.

Event parameters define the relationships between files in Grey Matter Data. For example, the `parentoid` parameter defines folder-to-child relationships, so updating it effectively moves an Object from one folder to another. The `derived` parameter points to Object IDs related to the current `oid`. For example, thumbnails derived from an image can point back to that image through the `derived` Event parameter.

> For a conceptual insight into why gm-data is designed the way that it is: [Event Sourcing](https://greymatter.gitbook.io/grey-matter-documentation/1.3/usage/platform-services/data/internals/structure)

### Event Examples

#### Full *Event* Structure

```javascript
[
  {
     "action": "string",         // Actions on a file to put it into its current state.
     "blobalgorithm": "string",  // The blob algorithm; if not specified, we store in S3 with SSECustomerKey.
     "checkedtstamp": "string",  // the tstamp of the previous version of the oid we compared against.
     "custom": "string",         // Put custom fields in here that gm-data does not understand, but need to be tracked by the application.
     "defaultfile": "string",     // For a directory, if we try to stream the directory, ${defaultfile} will be the assumed name.  index.html is the default value.
     "derived": "string",        // Fields to denote that a file is derived from another - ie: doc to pdf. (oid,tstamp) are a primary key for the original. Type allows us to track what kind of derivation it is, such as doc-to-txt.
     "description": "string",    // Description of what is in the file or directory.
     "encrypted": "string",      // Allow for fields to be encrypted on a case-by-case basis.
     "expiration": "string",     // tstamp as of which this record will not come back from queries.  May legally require purges at some point.
     "isfile": "string",          // Set this if it is a file.  You will get a directory if you don't set this.
     "mimetype": "string",       // same as content-type. Set for this object, and also for S3 blobs.
     "name": "string",           // file/dir name without pathing.
     "objectpolicy": "string",   // The rules that allow action (C R U D X P).
     "oid": "string",            // A numeric identifier assigned to a file/dir. Happens to be a nanoseconds timestamp. (oid, tstamp) are the primary key for these events.
     "parentoid": "string",      // oid of the parent directory.
     "purgetstamp": "string",    // In order to just purge a single event, supply a purgetstamp along with the oid.
     "references": "string",     // When uploading files, an array of updates can be supplied; so that we can upload files into directories that don't yet exist. Indices are negative relative values.
     "rname": "string",          // ${AWS_S3_BUCKET}/${AWS_S3_PARTITION}/${rname} is where this object is in S3. Name is assigned even if only Local is used.
     "schema": "string",         // Version of the schema from which this object came.
     "security": "string",       // Security labels are written here, along with their foreground/background pen colors.
     "sha256plain": "string",    // sha256 of the plaintext, can be used in the client to calculate a minimum number of files to send for update.
     "size": "string",           // Same as Content-Length.
     "tstamp": "string",         // Nanosecond timestamp of this objects creation time - unique per event.
     "userpolicy": "string"      // JWT claims used when creating this event.
  }
]
```

#### Upload *Event* Examples

***Prerequisites***: Policy

> Before you can upload *anything*, you need a policy to apply to what you upload. This is much like the requirement in AWS to supply an IAM policy that specifies how a capability is locked down. It is mandatory to attach a policy to any modification to the system.

Applications or UIs may write these policies for you (e.g., a CAPCO UI editor if you work for the government), or you can write simple expressions directly without the assistance of a UI. A policy is a function that takes in your JWT claims as input, and may emit permissions as output. Every step is a function that takes args.

You can think of this as a function in another language:

```python
# The policy language is NOT Python, but this is an imperfect analogy
# The function may have been defined as a built-in, or as a user-added
# macro.
def f(x, y, z):
  return (R,X)

# And this is calling a function
f(x, y, z)
```

But for many reasons, we have a very tiny evaluation engine that executes precompiled statements. The syntax is LISP (not Python, or JavaScript, or whatever), because that lets us manipulate the code as a straightforward data structure, which is much harder in other syntaxes.

This is a simple function `f` with three argument values `x`, `y`, and `z`. Function evaluation is triggered by putting parentheses around the function and its args:

```
(f x y z)
```

This is a value that is not executed (no parentheses):

```
x
```

A function `g` with no args would be executed like:

```
(g)
```

> It isn't obvious, but this is a very common syntax for embedding dynamic code into larger programs, because its completely uniform syntax gives simple mathematical rules for manipulating the code as data. It is so easy to write these evaluators that they may show up as stored functions in databases (like MySQL) to filter for access. The runtimes are usually under a hundred lines, because there are no operator precedences or corner cases. You don't need to know what function `(g)` is ahead of time, other than looking it up in a symbol table to plug its args into it. A competitor language to this under standardization is OPA, which is far more complex than this, possibly too complex to expose to users; but OPA may end up in use here in future generations, with its use assisted by UIs. If you use a CAPCO UI, then you are spared from knowing much about this. But we must have policies attached to files.

The most important function is the `if` function:

```
(if BOOLEANCONDITION
    TRUEBRANCH
    FALSEBRANCH)
```

When the first argument to `if` is `true`, the whole expression actually reduces to:

```
TRUEBRANCH
```

And when it's false, it reduces to:

```
FALSEBRANCH
```

Many statements only have a `TRUEBRANCH` and just reduce to empty when `BOOLEANCONDITION` is false. As an example:

```
(if FALSE
    TRUEBRANCH)
```

This just reduces to an implicit `FALSEBRANCH`, in which nothing is done. The next most important function is `contains` (which is equivalent to `has some`). `contains` implicitly reads our JWT claims as input. Suppose our JWT claims look like:

```javascript
{
  "values": {
    "email": [ "rob.fielding@gmail.com" ]
  }
}
```

> The only constraint is that the JWT has a `values` field that is a map from string to an array of strings. That really is the *only* requirement that we impose. JWT servers ensure that their JWTs have such a values field that can be operated on by policy.

So suppose that, in our session, our JWT claims are as above, and the object is protected with the policy below:

```
(if (contains email rob.fielding@gmail.com)
    (allow-all)
    (allow-read)
)
```

The `contains` function looks in `jwt["values"]["email"]` and returns true if `rob.fielding@gmail.com` is in the set. It is in there, so the policy reduces to the function call:

```
    (allow-all)
```

What is that? It's a macro for `(yield C R U D X P)`. So, we have been given permission to Create, Read, Update, Delete, Execute, and Purge this object. Had our email been `"rob.fielding@greymatter.io"` instead, the policy would reduce to:

```
    (allow-read)
```

Which happens to be a macro for `(yield R X)`, which means that we can Read the metadata about a file, or "execute" the file, by downloading the file bytes.

Also, the `objectpolicy` field happens to be this LISP compiled down to JSON. This is great for Mongo, because Mongo just stores JSON in a binary encoding, to avoid pointless parsing. This is a YAML rendition of it:

```yaml
f: if
a:
  - f: contains
    a:
      - v: email
      - v: rob.fielding@gmail.com
  - f: allow-all
  - f: allow-read
```

> `f` is a function. `a` is an arg list. `v` is a value.
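
To make the mapping from the LISP text to the `f`/`a`/`v` structure concrete, here is a minimal sketch of such a compiler. It is an illustration only, not the compiler gm-data ships: the names `tokenize`, `parse`, and `compilePolicy` are hypothetical, and the handling of double-quoted values is an assumption based on the policy examples in this section.

```javascript
// Split a policy into parentheses, double-quoted strings, and bare atoms.
function tokenize(src) {
  return src.match(/[()]|"[^"]*"|[^\s()]+/g) || [];
}

// Recursive descent: "(" starts a function call, anything else is a value.
function parse(tokens) {
  const tok = tokens.shift();
  if (tok === undefined) throw new Error("unexpected end of policy");
  if (tok === ")") throw new Error("unexpected )");
  if (tok !== "(") return { v: tok.replace(/^"|"$/g, "") };
  const f = tokens.shift();
  const a = [];
  while (tokens[0] !== ")") {
    if (tokens.length === 0) throw new Error("missing )");
    a.push(parse(tokens));
  }
  tokens.shift(); // consume the closing ")"
  // A call with no args, such as (allow-all), carries only its name.
  return a.length > 0 ? { f, a } : { f };
}

function compilePolicy(src) {
  return parse(tokenize(src));
}

const compiled = compilePolicy(
  "(if (contains email rob.fielding@gmail.com) (allow-all) (allow-read))"
);
```

Under these assumptions, `compiled` has the same shape as the YAML rendition: an `if` node whose argument list holds a `contains` call followed by the two branch macros.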

Given this small set of functions, you can evaluate any JWT token against any policy attached to a candidate object.

* `(if BOOLEAN TRUEBRANCH FALSEBRANCH)`  ... boolean conditions
* `(contains FIELDNAME VAL1 VAL2 ...)` ... true if FIELDNAME in the JWT `values` contains any of the listed values
* `(and ARG0 ARG1 ARG2 ....)`  ... and for as many args as you want
* `(or  ARG0 ARG1 ARG2 ....)`  ... or for as many args as you want
* `(not ARG0)`   ... the not function.
* `(yield ARG0 ARG1 ARG2 ...)` ... when permissions are returned.

This approach completely separates any customer-specific authorization system away from gm-data. gm-data only knows what these primitive functions mean. That keeps things backwards compatible, because when you need new behavior, you ask for a new FUNCTION rather than changing the meaning of existing functions.

Almost every policy takes the following form by default: a file that is owned by its uploader and readable by anyone. The `/static/ui` does this by default when you upload files and make no modifications to the policy before it goes up:
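
The primitive functions listed above are small enough to sketch as an evaluator over the compiled `f`/`a`/`v` form. This is an illustrative sketch rather than the engine gm-data runs: `evaluatePolicy` and the `MACROS` table are hypothetical names, and the two macro expansions are taken from the descriptions of `allow-all` and `allow-read` earlier in this section.

```javascript
// Assumed macro expansions, as described for allow-all and allow-read.
const MACROS = {
  "allow-all": ["C", "R", "U", "D", "X", "P"],
  "allow-read": ["R", "X"]
};

// Evaluate a compiled policy node against JWT claims.
// Conditions evaluate to booleans; branches evaluate to permission lists.
function evaluatePolicy(node, claims) {
  if ("v" in node) return node.v;
  const args = node.a || [];
  switch (node.f) {
    case "if": {
      const branch = evaluatePolicy(args[0], claims) ? args[1] : args[2];
      // A missing branch reduces to "no permissions".
      return branch ? evaluatePolicy(branch, claims) : [];
    }
    case "contains": {
      const [field, ...wanted] = args.map((arg) => evaluatePolicy(arg, claims));
      const have = (claims.values || {})[field] || [];
      return wanted.some((value) => have.includes(value));
    }
    case "and":
      return args.every((arg) => evaluatePolicy(arg, claims));
    case "or":
      return args.some((arg) => evaluatePolicy(arg, claims));
    case "not":
      return !evaluatePolicy(args[0], claims);
    case "yield":
      return args.map((arg) => evaluatePolicy(arg, claims));
    default:
      if (node.f in MACROS) return [...MACROS[node.f]];
      throw new Error(`unknown function: ${node.f}`);
  }
}

const examplePolicy = {
  f: "if",
  a: [
    { f: "contains", a: [{ v: "email" }, { v: "rob.fielding@gmail.com" }] },
    { f: "allow-all" },
    { f: "allow-read" }
  ]
};
const granted = evaluatePolicy(examplePolicy, {
  values: { email: ["rob.fielding@gmail.com"] }
});
```

Because there are no variables or local definitions, the evaluator is a single recursive switch, which is what makes it cheap to run over many objects.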

> **This is the most important example.** The embedded `/static/ui` uses the information in the `GET /config` call to properly formulate policies like this:

```
(if (contains userDN "cn=daveg,dc=greymatter,dc=io")
    (allow-all)
    (allow-read)
)
```

"If it's my userDN, then I have full control. Everyone else has read access."

> If the JWT says `values["userDN"][i] == "cn=daveg,dc=greymatter,dc=io"` for any index `i`, then the boolean condition is true, and `(allow-all)` is invoked. If it does not match, then `(allow-read)` is invoked.
>
> You will often see `(yield R X)` in place of `(allow-read)`, or `(yield-all)` in place of `(allow-all)`. Do a `GET /macros/1/` against gm-data to see any macros added into the system.
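
A client that uploads on behalf of a user could build this owner-controls, everyone-reads policy from the user's DN. The helper below is a hypothetical sketch (the name `ownerPolicy` does not come from gm-data), and rejecting DNs that contain double quotes is an assumption made to keep the generated expression well-formed.

```javascript
// Build the default policy text for a given userDN.
function ownerPolicy(userDN) {
  if (userDN.includes('"')) {
    throw new Error("userDN must not contain double quotes");
  }
  return `(if (contains userDN "${userDN}") (allow-all) (allow-read))`;
}

const defaultPolicy = ownerPolicy("cn=daveg,dc=greymatter,dc=io");
```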

Gotchas:

* The booleans `and`, `or`, and `not` don't really make sense anywhere other than in the first argument to `if`, where the point is to return a true or false answer by checking strings in the JWT with the `contains` function (or its more general `has` function, such as `has some` (`or` across args) or `has every` (`and` across all args)). Each branch of an `if` should start with an `if` or a `yield` (which might be inside macros like `allow-read` or `allow-all`).
* This is *not* a general language.  There are no variables, or local function definitions.  That would happen if we migrate over to OPA, but the main use of this language is: given a JWT's digitally signed claims, iterate over many thousands of objects to instantly determine *what* a user is allowed to do on each of them.  We can filter directory listings at many thousands per second with this language.  And doing it this way prevents organizationally-specific authentication systems code from creeping into gm-data.

***Note***: when action: “C” (create / upload) is specified, the system backfills the Object ID internally on creation. In this case, you should not specify the `oid` parameter.

***Note***: when action: “C” (create / upload) is specified, the parameters below are the bare minimum that must be specified for the create action to complete.

```javascript
const events = [
  {
    action: "C",
    name: "New Folder",
    isfile: false,
    parentoid: 1,
    originalobjectpolicy: "(if (contains email bob.null@email.com) (yield-all))",
    security: {
      label: "DECIPHER//GMDATA",
      foreground: "#FFFFFF",
      background: "#00FF00"
    }
  }
];
```

#### Delete *Event* Example

***Note***: when action: “D” (delete) is specified, the system will backfill most of the Object parameters, thus it is only necessary to specify `oid`, `action`, `parentoid`, and `objectpolicy`.

```javascript
const events = [
  {
    action: "D",
    oid: 42,
    parentoid: 1,
    originalobjectpolicy: "(if (contains email bob.null@email.com) (yield-all))",
  }
];
```

#### Rename *Event* Example

***Note***: when action: “U” (update) is specified, all parameters – except the few that are being updated and `tstamp` – must be specified, mimicking the previous Event associated with the Object ID that is being updated.

```javascript
const events = [
  {
    action: "U",
    oid: 42,
    parentoid: 1,
    name: "New Name",
    originalobjectpolicy: "(if (contains email bob.null@email.com) (yield-all))",
    security: {
      label: "DECIPHER//GMDATA",
      foreground: "#FFFFFF",
      background: "#00FF00"
    }
  }
];
```

#### Move *Event* Example

***Note***: when action: “U” (update) is specified, all parameters – except the few that are being updated and `tstamp` – must be specified, mimicking the previous Event associated with the Object ID that is being updated.

```javascript
const events = [
  {
    action: "U",
    oid: 42,
    parentoid: 2,
    name: "New Name",
    originalobjectpolicy: "(if (contains email bob.null@email.com) (yield-all))",
    security: {
      label: "DECIPHER//GMDATA",
      foreground: "#FFFFFF",
      background: "#00FF00"
    }
  }
];
```

This section introduced several *Event* objects that Grey Matter Data uses to track information. Understanding these objects will help you perform the following actions: uploading (action: “C”), moving, renaming, or altering (action: “U”), and removing (action: “D”).
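
Since an update must repeat the previous Event except for `tstamp` and the fields being changed, rename and move Events can be derived from the last known Event. The helper below is a sketch under that reading of the notes above; `updateEvent` is a hypothetical name, not part of any gm-data client.

```javascript
// Derive an update ("U") Event from the previous Event for the same oid.
// `tstamp` is dropped; every other field is carried over unless overridden.
function updateEvent(previous, changes) {
  const { tstamp, ...rest } = previous;
  return { ...rest, ...changes, action: "U" };
}

const previousEvent = {
  action: "C",
  oid: 42,
  parentoid: 1,
  name: "Old Name",
  tstamp: 1570000000000000000,
  originalobjectpolicy: "(if (contains email bob.null@email.com) (yield-all))"
};

const renameEvent = updateEvent(previousEvent, { name: "New Name" });
const moveEvent = updateEvent(previousEvent, { parentoid: 2 });
```

A rename changes only `name`, and a move changes only `parentoid`; in both cases the `oid` stays the same.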

## Object ID

Grey Matter Data tracks stored data through unique Object IDs that are assigned when files are uploaded into the system. Relationships between Object IDs are established through the `parentoid` parameter of the *Event*. Creating an update *Event* with a new `parentoid` effectively moves an Object to a new folder. Learn more in the `/write` endpoint section.

When an *Event* with `action: "C"` (create) is sent into the system on upload through the `/write` endpoint, `oid` does not need to be specified. The system assigns it to this Event internally.

## API Overview

This section covers accessing and manipulating data within Grey Matter Data.

We begin with the overarching concepts of information retrieval and information modification. Then we dive deeper into specifics of each API endpoint and code examples.

## Information Retrieval

When starting to use the API, you will most likely direct your first request at the root folder to get initial file listings. You can accomplish this with a `GET` request to the `/list` endpoint with a path of `/1` (`GET /list/1/`).

The root directory has the well-known Object ID (`oid`) of 1 by default. This is the root folder for every user. However, due to the specific permissions prescribed through the authentication JWT, each user can only see and manipulate a subset of folders.

You can extract data from the system in the following three ways, leveraging numerous read endpoints:

1. As a `JSON` object mimicking the internal *Event* object, through one of the read endpoints,
2. As a raw byte stream through the `/stream` endpoint, used to download the Object locally, and
3. As a raw byte stream within an iFrame that displays the security metadata of the Object, through the `/show` endpoint, used to view the Object within the browser window.

More information regarding each of these methods can be found in the respective endpoint sections (`/read`, `/stream`, `/show`).
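
The read URLs in this document follow an `/ENDPOINT/OID/` pattern, as in `GET /list/1/`. A small helper can keep the trailing slash consistent. This is a sketch with hypothetical names (`objectURL`, `READ_ENDPOINTS`), and the base URL in the usage line is a placeholder, not a real deployment.

```javascript
// Endpoint names that appear in this document's request examples.
const READ_ENDPOINTS = ["list", "props", "stream", "show", "history", "derived"];

// Build `${base}/${endpoint}/${oid}/`, always ending in a slash.
function objectURL(base, endpoint, oid) {
  if (!READ_ENDPOINTS.includes(endpoint)) {
    throw new Error(`unknown endpoint: ${endpoint}`);
  }
  const trimmedBase = base.replace(/\/+$/, "");
  return `${trimmedBase}/${endpoint}/${oid}/`;
}

// Placeholder base URL, for illustration only.
const rootListing = objectURL("https://gmdata.example.com/", "list", 1);
```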

## Information Modification

The only way to modify the content of Grey Matter Data is through the `/write` endpoint. When a request is sent to the `/write` endpoint, the request body has to carry [form data](https://developer.mozilla.org/en-US/docs/Learn/HTML/Forms/Sending_and_retrieving_form_data) with an appended {'meta': \[*Event* ]} object.

More details regarding data modification can be found in the `/write` endpoint section.
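
As a rough sketch of what such a request body might look like, the helper below assembles form data with the Event array under `meta` and any file contents under `blob`. The function name `buildWriteForm` is hypothetical, serializing the Events as a JSON string is an assumption, and the code presumes a runtime with global `FormData` and `Blob` (modern browsers, Node.js 18+).

```javascript
// Assemble the form data for a /write request.
// Assumption: the Event array travels as a JSON string under "meta",
// and each uploaded file is appended under "blob".
function buildWriteForm(events, files = []) {
  const form = new FormData();
  form.append("meta", JSON.stringify(events));
  for (const { name, bytes } of files) {
    form.append("blob", new Blob([bytes]), name);
  }
  return form;
}

const writeForm = buildWriteForm(
  [{ action: "C", name: "New Folder", isfile: false, parentoid: 1 }],
  []
);
```

The resulting form would then be sent with a `POST` to `/write`, with credentials included so that the `userpolicy` cookie accompanies the request.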

## Prerequisites

When authenticating to the API, there is a prioritized set of options. Our JWT is a format that allows for LDAP-like groups. JWTs are signed by a signer that we trust, and have a `label` field that holds the username or generic name to be logged in audits. A JWT also has the `values` field, which is a `map[string][]string`, which is to say a set of multi-valued attributes, similar to LDAP groups. This is done so that we can write policies as boolean combinations of these attributes. In short, you need a `userpolicy` somehow as a prerequisite to make use of this API. We look for it in the following order:

* HTTP parameter `setuserpolicy` set to a JWT, which we turn into a `Set-Cookie` and then forward you back in with this parameter removed.  This may be used in setups without a JWT server or an edge proxy.
* HTTP parameter `userpolicy` set to a JWT.  This may be used in setups without a JWT server or an edge proxy.
* Cookie `userpolicy` set to a JWT.  This may be used in setups without a JWT server or an edge proxy.
* HTTP header `userpolicy` set to a JWT by the edge server, usually using the `USER_DN` header as input.  This is used in conjunction with the JWT filter.
* Configurable header `USER_DN`, which we trust was securely set by the edge server(!!).  This can be used to look up a JWT in the JWT server.  This must be used with an edge proxy with inheaders enabled.
* anonymous.
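
The lookup order above can be expressed as a simple cascade. The sketch below is not gm-data's implementation: the request shape (`query`, `cookies`, `headers`), the function name `resolveUserPolicy`, and the lowercase `user_dn` default header key are all assumptions for illustration.

```javascript
// Pick the credential source for a request, in the documented priority order.
// `request` is a hypothetical plain object: { query, cookies, headers }.
function resolveUserPolicy(request, userDNHeader = "user_dn") {
  const { query = {}, cookies = {}, headers = {} } = request;
  if (query.setuserpolicy) {
    return { source: "param:setuserpolicy", jwt: query.setuserpolicy };
  }
  if (query.userpolicy) {
    return { source: "param:userpolicy", jwt: query.userpolicy };
  }
  if (cookies.userpolicy) {
    return { source: "cookie:userpolicy", jwt: cookies.userpolicy };
  }
  if (headers.userpolicy) {
    return { source: "header:userpolicy", jwt: headers.userpolicy };
  }
  if (headers[userDNHeader]) {
    // Only the DN is known here; the JWT would still have to be looked up.
    return { source: "header:USER_DN", userDN: headers[userDNHeader] };
  }
  return { source: "anonymous" };
}
```

For example, a request that carries both a `userpolicy` cookie and a `userpolicy` header resolves to the cookie, because the cookie comes first in the list.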

## API File System Examples

This section covers multiple examples of HTTP request configurations and explains the results they return.

## Common GET Requests

All requests in this section can be accomplished by modifying the JavaScript code presented below.

```javascript
// gmDataEndpoint is assumed to hold the base URL of your gm-data deployment.
const requestURL = `${gmDataEndpoint}/list/1/`;
axios.get(requestURL, {
  // Necessary to pass on server-set HttpOnly authentication cookies.
  withCredentials: true
})
.then(resp => console.log(resp.data))
.catch(error => console.log(error));
```

| Request Method | Endpoint URL | Request Body | Credentials Include | Description |
| -------------- | ------------ | ------------ | ------------------- | ----------- |
| `GET` | `/list/1/` | None | True | Get a listing of the root Object ID (`oid`) of 1, choosing a path `/` relative to it. The `/` symbol at the end of a listing path URL is mandatory. Each folder within the `/1` root folder has its own security policy, which limits access to groups of users, so each user navigating to the `/1` folder sees a folder landscape tailored to their security credentials. |
| `GET` | `/list/1/Project1Folder/` | None | True | This returns listings for Project1Folder, a folder that is a child of the root folder. This folder may have unique security settings that render it invisible to some groups of users. |
| `GET` | `/list/42/` | None | True | If the Project1Folder directory had an Object ID (`oid`) of 42, then this would be an equivalent URL to list it. Note how we include the `/` symbol at the end of the path. |
| `GET` | `/props/42/` | None | True | This URL produces the metadata about the Project1Folder directory. Once we have found the Object that we are looking for, we can perform operations on it. |
| `GET` | `/stream/900/` | None | True | This produces a byte stream of the Object with Object ID (`oid`) of 900. Presume this Object's name property is resume.pdf. |
| `GET` | `/stream/42/resume.pdf` | None | True | The same byte stream, addressed by the path of resume.pdf relative to the folder with Object ID (`oid`) of 42. |
| `GET` | `/props/900/` | None | True | The metadata of the Object named resume.pdf. Returns an Event Object with associated properties. |
| `GET` | `/history/900/` | None | True | A list of Event Objects for every state of resume.pdf, ordered by the timestamp of the Event. |
| `GET` | `/show/900/` | None | True | A convenience wrapper around `/stream` that shows an HTML security banner with the file's security metadata around the byte stream. |

## Common POST Requests

The `GET` requests above can be dispatched separately, or in bulk using a `POST` request to the `/read` endpoint. This lets you minimize back-and-forth `HTTP` traffic to improve performance in low-bandwidth situations.

| Request Method | Endpoint URL | Request Body | Credentials Include | Description |
| -------------- | ------------ | ------------ | ------------------- | ----------- |
| `POST` | `/read` | `stringified([{URL:"/list/900/"}, {URL:"/list/42/"}])` | True | This endpoint requires a string-encoded array in the body of the request, in the following form: `[{URL:"/list/900/"}, {URL:"/list/42/"}]`. A detailed example can be found in the Read endpoint section. This call yields an array with data identical to the same calls performed individually using `GET` requests. In this specific example, we list two directories simultaneously. This allows for quick file system exploration with significantly fewer requests. |
| `POST` | `/read` | `stringified([{URL:"/history/900/?count=10"}, {URL:"/history/42/?count=10"}])` | True | Simultaneously gets the last 10 revisions of 2 separate Object IDs. |
| `POST` | `/read` | `stringified([{URL:"/derived/900/"}, {URL:"/derived/42/"}])` | True | Simultaneously gets derived file metadata from 2 separate Object IDs. |
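
The string-encoded array that `/read` expects can be produced from a plain list of paths. This is a minimal sketch; `buildReadBody` is a hypothetical helper name, and the only structure it relies on is the `[{URL: ...}]` form shown in the table above.

```javascript
// Turn a list of read paths into the stringified body for POST /read.
function buildReadBody(urls) {
  return JSON.stringify(urls.map((url) => ({ URL: url })));
}

const readBody = buildReadBody(["/list/900/", "/list/42/"]);
```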

To get data into the system, a request with attached multipart/form-data must be sent to the `/write` endpoint. The transaction is an array of individual JSON Event Objects, in the order in which they need to be applied in the database (optionally including file objects in BLOB format appended to the form data when performing an upload). Detailed examples can be found in the `/write` endpoint section.

| Request Method | Endpoint URL | Request Body | Credentials Include | Description |
| -------------- | ------------ | ------------ | ------------------- | ----------- |
| `POST` | `/write` | form data `[{'meta':[Event1Object]}]` | True | This endpoint requires form data with an appended array of Event Objects under the 'meta' property, specifying a modification to the system. A detailed example can be found in the `/write` endpoint section. |
| `POST` | `/write` | form data `[{'meta':[Event1Object, Event2Object]}]` | True | This endpoint can accept multiple Event objects at the same time. |
| `POST` | `/write` | form data `[{'meta':[Event1Object, Event2Object]}, {'blob':[BLOB1]}, {'blob':[BLOB2]}]` | True | This endpoint can accept multiple Event objects at the same time, with the file BLOBs for an upload appended to the form data. |

## Common Request Error Response Codes

| HTTP Error Code | Common Causes |
| --------------- | ------------- |
| 400 | Bad Request is most often caused by a malformed Event Object in the form data when using the `/write` endpoint. |
| 403 | Forbidden is most often caused when the JWT authentication token doesn't match the Object's privileges. |
| 404 | Not Found is most often caused when the Object ID (`oid`) specified in the request is incorrect. |

## Command Line Interface and Go Client Package

There is a command-line interface to support bulk and automated scenarios. It eases the implementation burden for some very common tasks:

* Upload a full tree of data, such as a full React application
* Create initial directories for the uploads to go into
* Run this from desktops or servers.  There are a few platforms available, all for Intel architecture:
  * OSX - `gmdatatool.osx`
  * Windows - `gmdatatool.exe`
  * Linux - `gmdatatool.linux`
  * README - `gmdatatool-readme.html`
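
A wrapper script can pick the matching binary from the output of `uname`. This is a minimal sketch: the function name `pick_gmdatatool` is hypothetical, and the Windows patterns assume a Git Bash, MSYS, or Cygwin shell.

```bash
# Map a kernel name (as printed by `uname -s`) to the matching gmdatatool binary.
# pick_gmdatatool is a hypothetical helper, not part of the gmdatatool distribution.
pick_gmdatatool() {
  local kernel="${1:-$(uname -s)}"
  case "${kernel}" in
    Darwin)               echo "gmdatatool.osx" ;;
    Linux)                echo "gmdatatool.linux" ;;
    MINGW*|MSYS*|CYGWIN*) echo "gmdatatool.exe" ;;
    *)
      echo "unsupported platform: ${kernel}" >&2
      return 1
      ;;
  esac
}

# Usage: run the binary for the current platform from the current directory.
# "./$(pick_gmdatatool)" "$@"
```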

The CLI commands all need to connect in an authenticated manner, so there are environment variables associated with connecting. Here is an example of connecting to a PKI-enabled setup. The environment variables only need to be set once, in a wrapper script. Once they are set up, the tool can be invoked through that script:

```bash
#!/bin/bash

# Name this script:
# gmdatatool.sh

## Environmental setup - depends on how gm-data TLS and address is configured
(
# GNU base64 wraps its output by default, so it needs -w 0 to emit a single line
u="$(uname)"
if [ "${u}" = "Darwin" ]
then
  b64="base64"
else
  b64="base64 -w 0"
fi
export MONGO_USE_TLS=false
export CLIENT_PORT=9443
export CLIENT_CN=localhost
export CLIENT_ADDRESS=localhost
export CLIENT_PREFIX=/services/gmdatax/latest
export CLIENT_USE_TLS=true
# wherever your certs are (b64 is left unquoted so that its flags split into words)
export CLIENT_CERT="$(cat  ../../certs/localhost.crt    | ${b64})"
export CLIENT_KEY="$(cat   ../../certs/localhost.key    | ${b64})"
export CLIENT_TRUST="$(cat ../../certs/intermediate.crt | ${b64})"

# "$@" forwards each argument intact, even when it contains spaces
./gmdatatool.linux "$@"
)
```

```bash
# Create a directory that we control under self-service directory /world
./gmdatatool.sh mkdir --securitylabel "localuser owned" \
                     --securityfg "white" \
                     --securitybg "red" \
                     --policylabel "localuser owned" \
                     --objectpolicy '(if (contains email localuser@deciphernow.com)(yield-all)(yield R X))' \
                     /world/localuser@deciphernow.com
```

```bash
# Upload an entire application into /world/localuser@deciphernow.com
./gmdatatool.sh upload --securitylabel "SECRET" \
                     --securityfg "white" \
                     --securitybg "red" \
                     --policylabel "localuser owned" \
                     --objectpolicy '(if (contains email localuser@deciphernow.com)(yield-all)(yield R X))' \
                     /world/localuser@deciphernow.com  ../../static/ui
```

The tool `./gmdatatool.sh` is a special-case use of the Go package `github.com/deciphernow/gm-data/client`. The client is based around two important ideas:

* Listening for changes in gm-data, and invoking callbacks when they happen.
* Providing an API to respond to changes.  Example uses:
  * Statically generated thumbnails
  * Running AWS Rekognition on images to upload derived files, such as object labelling.
  * The written-back files are JSON, and they point to the image that they are derived from.
* Responding to changes may happen through REST or Kafka.

There is a responder, with REST or Kafka constructors. The REST constructor filters out information based on objectPolicy (i.e., it runs as a real user). The Kafka constructor runs on a privileged, unfiltered view of all events that happen on gm-data. Generally, the Kafka view is appropriate for back-end processes. The REST constructor is usable from the front end (i.e., not originating from within Fabric itself, possibly even from web browsers calling the `/notifications` endpoint) or from the back end.

```go
// Create a client at the root
c, err := client.NewRESTResponder(
  logger,
  client.GetURL(),
  getClient(),
  listing.DefaultRootOID,
  policy.CurrentTstamp(),
  1000,
  time.Duration(2)*time.Second,
  client.CLIENT_IDENTITY.Str(),
  func(c *client.Responder, ev *listing.Event) error {
    return nil
  },
)
if err != nil {
  log.Printf("create client failed: %v", err)
  panic(err)
}
```

This responder will poll every two seconds for new information, and get up to 1000 events at a time. The callback allows us to inspect events with our code. Generally, when we see something interesting in the event (`ev`), we call different parts of the API:

```go
// Get an io.Reader on ev, as it is a file type that we are interested in
blobData, err := c.StreamOf(ev.Oid, ev.Tstamp)
```

We may then do something outside the scope of gm-data, such as turning a blob into a JSON file (e.g., submitting a jpg and getting back a JSON description of it). Note that when we listen and write back like this, we typically end up setting `Derived` fields, so that we can track the lineage of *why* the file exists and *what* created it. We can correlate a jpg of a face with a JSON file about it, so that we can delete them both if we are asked to delete the file.

```go
m := c.NewWriteMarshaler()
defer m.Close()
err = m.Append(&listing.EventArgs{
  Action:       policy.ActionUpdate,
  IsFile:       true,
  ParentOID:    ev.ParentOID,
  Name:         newFname,
  MimeType:     "application/json",
  ObjectPolicy: policy.ForReadAllFull,
  Derived: listing.Derived{
    Oid:    ev.Oid,
    Tstamp: ev.Tstamp,
    Type:   kind,
  },
  Security:      ev.Security,
  BlobAlgorithm: "none",
}, newFname)
// ...
req, err := c.NewWriteRequest(m)
// ...
res, evs, err := c.DoWriteRequest(req)
// ...
```

The client API supports the following functions, all of which are needed to respond to changes in gm-data with write-backs of new derived files. For the read endpoints:

* NewRESTResponder/NewKafkaResponder - Listen on `/notifications`, which is the critical reason for having a client library, to respond to changes being made in gm-data.
* StreamOf - Get the bytes for an `(oid,tstamp)`, where tstamp is optional, so that you get the latest blob.
* EventOf - Get the properties for an `(oid,tstamp)`, or latest if tstamp is not included.
* DerivedOf - Find out what is already derived from this file.  This is how you could know that a thumbnail already exists for a file.
* Self - Discover what we are authenticated as, which is important for troubleshooting.
* HistoryOf - Every event pertaining to an `oid`.  This is the lifecycle of the inode, across all changes (including name, parent, policy, security labels, etc).

> Note that more complex paging options are not being used with these simple client libraries.

For the write endpoint, which is a bit more difficult to write against directly than the read endpoints:

* AppendTree - Perform a bulk upload of a large directory, where you have the opportunity to set security labels and policies individually
* Append - A raw append to update an individual file or directory

Example use case:

* GDPR laws require that an individual be able to demand the removal of files "about" that individual.
* In order to comply, if we have a jpg with attached metadata that says that the individual is named in the file, then we can issue a delete on *both* files.
* This is possible because we track the `Derived` file pointers.
* The `/derived` endpoint lets us find all files that point to us with a `Derived` pointer, so that we can find an entire tree of files that started from a single input file.  Example: `elasticSearchEntry derivedfrom facesIndex, facesIndex derivedfrom jpg`
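
The transitive lookup in the example above can be sketched in shell. This sketch assumes the `Derived` pointers have already been collected (for instance from repeated `/derived` calls) into lines of the form `child parent`. That line format and the function name `derived_tree` are hypothetical; they are not a gm-data response format.

```bash
# Print a file and everything transitively derived from it, one name per line.
# $1 is the starting file; $2 holds "child parent" pairs, one pair per line.
# Assumes the pointers form a tree (no cycles).
derived_tree() {
  local root="$1" edges="$2" child parent
  echo "${root}"
  while read -r child parent; do
    if [ "${parent}" = "${root}" ]; then
      derived_tree "${child}" "${edges}"
    fi
  done <<< "${edges}"
}

edges="facesIndex jpg
elasticSearchEntry facesIndex"

# Lists jpg, then facesIndex, then elasticSearchEntry: the set to delete together.
derived_tree jpg "${edges}"
```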

## Environment And Deployment

The gm-data service is built as a binary called `gmdatax.linux` that is configured entirely by environment variables (to avoid a requirement to mount files). This binary is, however, packaged with some other files:

* `./runforever` - a shell script that keeps `./gmdatax.linux` in a re-start loop to handle non-intentional crashes of the binary.  This allows us to catch things like `array out of bounds`, `nil pointer dereference`, or catastrophic resource exhaustion such as `out of file handles`.  It is these latter cases that drive the decision to allow the binary to die.
* `./gmdatax.linux` - the actual gmdata binary, that reads in environment variables.
* `./VERSION` - the version of this service
* `./static/` - a bundle of runtime API user documentation and a test user interface.  This directory is served literally out of gm-data under the URL `/static/`.
* `./certs/` - a directory that the binary can write certificates into on startup.  The certificates originate from environment variables passed in as single-line base64 encodings of full `pem` files.
* `./logs/` - a place to write logs (in non-default cases), and may be mounted over to keep the root partition from running out of space.

> gm-data will make every possible attempt to look at your configuration and immediately crash with a detailed explanation of what to actually do about it. This includes looking up hostnames in DNS to verify that they exist. Always look in the log files for gm-data if something does not seem right on startup. But it cannot detect inconsistency issues at a higher level, such as one service offering a cert that is then trusted by a service that will try to connect to it. That would require analyzing a larger set of environment variables that are destined for multiple services.

### Basic Environment Variables

* `MASTERKEY`❗ is mandatory.  This is the key that is used to encrypt data.
* `JWT_PUB` is the single-line base64 encode of the signing key that the gm-data server trusts to sign JWT tokens.  This is a mandatory parameter.  It is not an X509 certificate.  It is an actual Elliptic Curve key that is suitable for `ES512` in the JWT standard.
* `FILE_BUCKET` is mandatory (aka: `AWS_S3_BUCKET`).  This says where we write gm-data ciphertext out to AWS.
* `FILE_PARTITION` is mandatory (aka: `AWS_S3_PARTITION`).  This should be set to a value that is unique to a set of replicated Fabric clusters.  It is literally a subdirectory in `FILE_BUCKET`.  This exists so that we don't need to create lots of buckets constantly, yet can still distinguish which bucket data belongs to which installation.
* `AWS_REGION` is required if `USES3=true`.
* `AWS_S3_ENDPOINT` is only required in government setups that need to point to a different hostname for S3.
* `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`❗ may be set to give AWS credentials in the case where IAM roles are not used for the EC2 instance. `AWS_SECRET_ACCESS_KEY`❗ is a secret, obviously.

> When you disable S3 use like `USES3=false`, the bucket and partition are still used. The directory `./buckets/${FILE_BUCKET}/${FILE_PARTITION}` should exist and be writable by the gm-data process `./gmdatax.linux`.
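
For a local (non-S3) run, that directory can be prepared before starting the binary. The following is a minimal sketch, assuming it is run from the directory that contains `./gmdatax.linux`. The function name `ensure_local_bucket` and its optional base-directory argument are hypothetical.

```bash
# Create ./buckets/${FILE_BUCKET}/${FILE_PARTITION} and confirm it is writable.
# $1 optionally overrides the base directory (defaults to the current directory).
ensure_local_bucket() {
  local base="${1:-.}"
  if [ -z "${FILE_BUCKET:-}" ] || [ -z "${FILE_PARTITION:-}" ]; then
    echo "FILE_BUCKET and FILE_PARTITION must both be set" >&2
    return 1
  fi
  local dir="${base}/buckets/${FILE_BUCKET}/${FILE_PARTITION}"
  mkdir -p "${dir}" || return 1
  if [ ! -w "${dir}" ]; then
    echo "${dir} is not writable by $(id -un)" >&2
    return 1
  fi
  echo "${dir}"
}

# Usage, with placeholder bucket and partition names:
# FILE_BUCKET=mybucket FILE_PARTITION=mypartition ensure_local_bucket
```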

The `JWT_PUB` is the public part of an elliptic curve key. The private part of it is `PRIVATE_KEY`❗ for the JWT server. The parameters for use with the JWT libraries are rather specific, due to the curve name `secp521r1`. The commands below generate the keypair for `gm-jwt-security`: a private key for signing (file `jwtES512.key`❗), and the public key derived from it (`jwtES512.key.pub`), which is set for gm-data as `JWT_PUB`.

```bash
openssl ecparam -genkey -name secp521r1 -noout -out jwtES512.key
openssl ec -in jwtES512.key -pubout -out jwtES512.key.pub
```

### References Between services

* Prefix patterns.  When gm-data needs to make reference to another service, these are relevant environment variables:
  * `CLIENT_PREFIX` is the URL that the gateway is mapping gm-data service to. This is done so that we can send back links that resolve properly in html files.  We do this because we cannot hardcode even our own service name, and also cannot correctly give a relative path.  Example: `/services/gmdatax/latest`
  * `CLIENT_JWT_PREFIX` is the URL that the gateway is mapping our peer service gm-jwt-security to.  This is done so that we can send back links that resolve properly in html files.  Example: `/services/jwt-server/1.0`, or `/services/jwt-server-gov/1.0`.

We have explicit dependencies on these things:

* a JWT token issuer that has a proper sidecar and is reachable through the edge
* a Mongo database, which is not mounted into the Fabric framework, so it is not reached via a sidecar or through the edge
* Kafka, which is not mounted into the Fabric framework, so it is not reached via a sidecar or through the edge

### TLS

Services that use TLS will end up creating a large number of environment variables. We follow a principle of passing in pem files as a single line of base64 of the original pem file. That means that we create such files as environment variables on the host that is preparing the deployment. Here is an example of setting up the trust for our Mongo dependency:

```bash
MONGO_TRUST=`cat server.trust.pem | base64 -w 0`
```
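
Note that `-w 0` is a GNU `base64` option; the wrapper script earlier in this document falls back to plain `base64` on Darwin for that reason. A portable sketch that strips the line breaks with `tr` instead is shown here. The helper name `pem_to_env` is hypothetical.

```bash
# Encode a PEM file as a single line of base64, without relying on GNU `-w 0`.
pem_to_env() {
  base64 < "$1" | tr -d '\n'
}

# Usage: export the Mongo trust file in the form gm-data expects.
# export MONGO_TRUST="$(pem_to_env server.trust.pem)"
```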

When TLS connections to peer services are involved, this pattern in name suffixes arises:

* `ADDRESS` (or `HOST`) - ip or hostname of the peer.
* `PORT` - port for the peer.
* `USE_TLS` - use TLS
* `CERT` - a base64 single-line encode of the pem cert (which also happens to be multi-line base64).
* `KEY`❗ - base64 single-line encode of the pem key (which is a multi-line base64).  This is also a secret.
* `TRUST` - this is similar to `CERT`.  It may encode a concatenated list of pem files for certs.
* `CN` - the ServerName expected.  This is usually the same as the CN in the remote cert, but may also be an SNI name that matches a wildcard in the CN. If this is not set, then we will contact the server to try to grab the CN out of the remote certificate.

With that being said, these variables are grouped together.

* MONGO related connect info
  * `MONGOHOST` - slightly violates our pattern.  This can be a list of host:port pairs, like `mongodata:27017,mongodata:27017`.  This is because in a clustered setting, connections are not made to individual machines, but to entire clusters.  The `PORT` part is already taken care of.
  * `MONGODB` - is not strictly part of TLS, but we need to know the database that we are connecting to.
  * `MONGO_USE_TLS` - says whether to use the TLS variables to make a TLS connection.
  * `MONGO_CERT` - is the client PKI cert that we identify ourselves with.
  * `MONGO_KEY`❗ - is the key that goes with `MONGO_CERT`.
  * `MONGO_TRUST` - is the trust file to connect to Mongo servers.
  * `MONGO_CN` - is the SNI name for the mongo cert, the manually set serverName expected.  If this is not set, then we will contact the server to try to grab the CN out of the remote certificate.
  * `MONGO_INITDB_ROOT_USERNAME` - is the username we will use (not necessarily related to the root username however).
  * `MONGO_INITDB_ROOT_PASSWORD`❗ - the password for `MONGO_INITDB_ROOT_USERNAME`.  This is a secret.
* GMDATA TLS info, for our own service. This generally only happens when the sidecar egress is mTLS.
  * `GMDATA_USE_TLS` - Says whether to use TLS.  This will need to be coordinated with how our sidecar is setup.  Our sidecar EGRESS will need to be a client of this TLS connection.
  * `GMDATA_CERT` - The identity cert of gmdata that will be presented to sidecar.
  * `GMDATA_KEY`❗ - The key that goes with `GMDATA_CERT`
  * `GMDATA_TRUST` - The sidecar will need to present a cert that is signed by something in this TRUST
* CLIENT\_JWT\_ENDPOINT prefixed environment variables are relevant to gmdata looking up `userpolicyid` (a random key to find a JWT) to get a `userpolicy` (an actual JWT token). This is only needed in cases where we reach a JWT server indirectly via `userpolicyid`.
  * `CLIENT_JWT_ENDPOINT_ADDRESS` - is the hostname of the JWT server
  * `CLIENT_JWT_ENDPOINT_PORT` - is the port of the JWT server
  * `CLIENT_JWT_ENDPOINT_USE_TLS`
  * `CLIENT_JWT_ENDPOINT_CERT`
  * `CLIENT_JWT_ENDPOINT_KEY`❗
  * `CLIENT_JWT_ENDPOINT_CN` - Expected SNI name
  * `CLIENT_JWT_ENDPOINT_TRUST`
  * `CLIENT_JWT_ENDPOINT_PREFIX` - if we connect directly or to the sidecar, then this is just left at empty string "".  But if we go through the edge, which is an unlikely case, this ends up needing to be set to the same value as `CLIENT_JWT_PREFIX`.
  * `JWT_API_KEY`❗ - is a base64 password that the JWT server will require to accept connections to resolve access codes for JWT tokens (`userpolicyid`) to actual JWT tokens (`userpolicy`).
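
The suffix pattern lends itself to a generic pre-flight check on the deployment host. The sketch below assumes that a peer with `USE_TLS=true` needs `CERT`, `KEY`, and `TRUST` to be set. That requirement and the function name `check_tls_group` are assumptions for illustration; gm-data performs its own validation on startup.

```bash
# Report which TLS variables are missing for one peer prefix (e.g. MONGO).
# Uses bash indirect expansion (${!name}) to read ${PREFIX}_${SUFFIX}.
check_tls_group() {
  local prefix="$1" suffix name missing=""
  name="${prefix}_USE_TLS"
  if [ "${!name:-false}" != "true" ]; then
    return 0   # TLS is off for this peer, so nothing else is required
  fi
  for suffix in CERT KEY TRUST; do
    name="${prefix}_${suffix}"
    if [ -z "${!name:-}" ]; then
      missing="${missing} ${name}"
    fi
  done
  if [ -n "${missing}" ]; then
    echo "missing:${missing}" >&2
    return 1
  fi
}

# Usage: check every group described above before launching.
# for p in MONGO GMDATA CLIENT_JWT_ENDPOINT; do check_tls_group "$p" || exit 1; done
```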

Note that for the JWT server, we are trying to form a connection URL like:

```bash
# proto is either http or https depending on CLIENT_JWT_ENDPOINT_USE_TLS
# cert setup is the normal pattern:
# CLIENT_JWT_ENDPOINT_CERT
# CLIENT_JWT_ENDPOINT_KEY
# CLIENT_JWT_ENDPOINT_TRUST
GET ${proto}://${CLIENT_JWT_ENDPOINT_ADDRESS}:${CLIENT_JWT_ENDPOINT_PORT}${CLIENT_JWT_ENDPOINT_PREFIX}/policies
```

Internally, gm-data sees a `userpolicyid` header, and connects to that URL to try to get a `userpolicy` object, which may be too large to have fit into an http header. Notice that the inclusion of `CLIENT_JWT_ENDPOINT_PREFIX` exists only to go through the edge instead of the sidecar. In the normal case `CLIENT_JWT_ENDPOINT_PREFIX=""`, because we want to talk to the sidecar.

Examples:

* Talk to our own local sidecar in plaintext to reach JWT (preferred):
  * `CLIENT_JWT_ENDPOINT_PREFIX=/services/jwt-server/latest`
  * `CLIENT_JWT_ENDPOINT_ADDRESS=gmdata-proxy`
  * `CLIENT_JWT_ENDPOINT_PORT=8080`
* Talk to a JWT sidecar directly (not preferred):
  * `CLIENT_JWT_ENDPOINT_PREFIX=`
  * `CLIENT_JWT_ENDPOINT_ADDRESS=jwt-server-proxy`
  * `CLIENT_JWT_ENDPOINT_PORT=8080`
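
To see which URL a given combination of settings produces, the composition can be reproduced in shell. This only illustrates how the variables combine according to the pattern above; it is not gm-data's implementation, and the function name `jwt_policies_url` is hypothetical.

```bash
# Build the URL used to resolve a userpolicyid, from the
# CLIENT_JWT_ENDPOINT_* variables described above.
jwt_policies_url() {
  local proto="http"
  if [ "${CLIENT_JWT_ENDPOINT_USE_TLS:-false}" = "true" ]; then
    proto="https"
  fi
  echo "${proto}://${CLIENT_JWT_ENDPOINT_ADDRESS}:${CLIENT_JWT_ENDPOINT_PORT}${CLIENT_JWT_ENDPOINT_PREFIX:-}/policies"
}

# First example above: through our own sidecar, in plaintext.
CLIENT_JWT_ENDPOINT_USE_TLS=false
CLIENT_JWT_ENDPOINT_PREFIX=/services/jwt-server/latest
CLIENT_JWT_ENDPOINT_ADDRESS=gmdata-proxy
CLIENT_JWT_ENDPOINT_PORT=8080
jwt_policies_url   # prints http://gmdata-proxy:8080/services/jwt-server/latest/policies
```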

> `CLIENT_JWT_ENDPOINT_USE_TLS` may require connecting to a sidecar-issued cert, that may not exist at the time gm-data launches. So, note that using `GMDATA_USE_TLS` in the mesh may be complicated by this fact.

### Miscellaneous parameters

* `DONT_PANIC` - is an advanced parameter that says to only WARN, but do not CRASH when inconsistent environment variables are detected. If you run with this setting, you run the risk of creating a setup that we cannot support. Sometimes you need to temporarily ignore known problems. So, this should be disabled as soon as possible if it is ever used.
* `LESS_CHATTY_INFO` - by default, we like less chatty logs. If you want a lot more logging information that includes the begin and end of sessions in which there were no problems, then you can set this to `false`.
* `GMDATAX_SESSION_MAX` - is an admission control value. This imposes a limit on the number of outstanding requests gm-data will allow to be concurrently serviced. It is literally a maximum population at which gm-data just issues `503` to tell the client to get out of line and come back later. It exists because if we run out of filehandles, the server will become unstable and crash in an irregular manner. If this server runs out of filehandles, then `GMDATAX_SESSION_MAX` should be lowered to a value that causes us to stop running out of filehandles. It may need to be raised if we get `503` errors that actually originate from gm-data itself. Our proxy may also issue `503` in the case of admission control, which complicates determining which one ran out. It is more likely that Envoy will run out of filehandles before gm-data will, because the front-end is dealing with a lot of services concurrently.
* `GMDATA_NAMESPACE` - Typical value is `world`. In order to avoid having to create root access tokens to get the system bootstrapped, we allow for the creation of a self-service directory. If this value is `/world`, then the home directory can be created here, on the condition that the directory is named after the field named in `GMDATA_NAMESPACE_USERFIELD`, which is typically `email`. For example: `/world` is created empty on init of gm-data. A user uses `static/ui` to create directory `/world/rob.johnson@email.com`, which is only allowed because they came in with a JWT token matching `{values: {email: ["rob.johnson@email.com"]}}`.
* `GMDATA_NAMESPACE_USERFIELD` - Typical value is `email`.
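
From the client side, a `503` from admission control means "come back later". A minimal retry sketch with doubling delays follows. The function `retry_on_503` is hypothetical; it expects a command that prints only the HTTP status code, for instance `curl -s -o /dev/null -w '%{http_code}'` against a URL of your choosing.

```bash
# Retry a command while it reports HTTP 503, doubling the delay each time.
# $1 = max attempts, $2 = initial delay in seconds, rest = a command that
# prints only the HTTP status code on stdout.
retry_on_503() {
  local max="$1" delay="$2" attempt=1 status
  shift 2
  while :; do
    status="$("$@")"
    if [ "${status}" != "503" ]; then
      echo "${status}"
      return 0
    fi
    if [ "${attempt}" -ge "${max}" ]; then
      echo "${status}"
      return 1   # still 503 after the last allowed attempt
    fi
    sleep "${delay}"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}

# Usage (GMDATA_URL is a placeholder for an endpoint you want to call):
# retry_on_503 5 1 curl -s -o /dev/null -w '%{http_code}' "${GMDATA_URL}"
```

Because the proxy may also answer `503`, a retry like this does not tell you which layer refused the request; it only spaces out the attempts.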

If an environment variable you are looking for was not mentioned here, it is likely not something that you should need to change in a normal setup. For more detail, see the auto-generated documentation on environment variables used in gm-data in [Deploy - Environment Variables](#deploy---environment-variables).

### Kafka Connect

To point gm-data at Kafka in the simplest plaintext case, set the Kafka environment variables. At a minimum, point to the brokers and name the topics.

```bash
KAFKA_PEERS=kafka:9092
KAFKA_TOPIC_ERROR=gmdatax-error
KAFKA_TOPIC_READ=gmdatax-read
KAFKA_TOPIC_UPDATE=gmdatax-update
```

## Deploy - Environment Variables

| Name                            | Default                                         | Description                                                                                            | Example                                        | Type                                                                     |                                                              |
| ------------------------------- | ----------------------------------------------- | ------------------------------------------------------------------------------------------------------ | ---------------------------------------------- | ------------------------------------------------------------------------ | ------------------------------------------------------------ |
| DISABLE\_LOOKUPS                | false                                           | don't dns check env vars representing hosts                                                            | true                                           |                                                                          |                                                              |
| DONT\_PANIC                     | false                                           | disable panic when environment looks mis-configured                                                    | true                                           |                                                                          |                                                              |
| LESS\_CHATTY\_INFO              | true                                            | chatty info logs will write something to the log when a transaction begins, when there are no problems | false                                          |                                                                          |                                                              |
| CLIENT\_JWT\_PREFIX             | /services/gm-jwt-security/1.0                   | endpoint prefix for primary jwt service to resolve pointers to JWT tokens                              | /services/gm-jwt-security-gov/1.0              |                                                                          |                                                              |
| CLIENT\_JWT\_ENDPOINT\_ADDRESS  |                                                 | ip of jwt server in the network                                                                        |                                                | a hostname                                                               |                                                              |
| CLIENT\_JWT\_ENDPOINT\_PORT     |                                                 | port of jwt server in the network                                                                      | 8443                                           | an unsigned int                                                          |                                                              |
| CLIENT\_JWT\_ENDPOINT\_CERT     |                                                 | JWT server client cert                                                                                 | base64 line pem written to certs/jwt.cert.pem  |                                                                          |                                                              |
| CLIENT\_JWT\_ENDPOINT\_KEY      |                                                 | ❗JWT server client key                                                                                 | base64 line pem written to certs/jwt.key.pem   |                                                                          |                                                              |
| CLIENT\_JWT\_ENDPOINT\_TRUST    |                                                 | JWT server trust                                                                                       | base64 line pem written to certs/jwt.trust.pem |                                                                          |                                                              |
| CLIENT\_JWT\_ENDPOINT\_PREFIX   |                                                 | prefix to reach the CLIENT\_JWT\_PREFIX when proxied                                                   | localhost                                      |                                                                          |                                                              |
| CLIENT\_JWT\_ENDPOINT\_USE\_TLS | false                                           | use TLS to connect to jwt endpoint                                                                     | true                                           |                                                                          |                                                              |
| CLIENT\_JWT\_ENDPOINT\_CN       |                                                 | the server name expected for this cert                                                                 |                                                |                                                                          |                                                              |
| GMDATA\_FABRIC\_CLUSTER         | default                                         | the name of this fabric cluster                                                                        | us-east                                        |                                                                          |                                                              |
| ZEROLOG\_LEVEL                  | WARN                                            | logging level: INFO, DEBUG, WARN, ERR                                                                  | INFO                                           |                                                                          |                                                              |
| MASTERKEY                       |                                                 | ❗❗Masterkey for the encrypted content                                                                  | som3r9doMg1bberish                             | master key for the data                                                  |                                                              |
| AWS\_REGION                     |                                                 | Bucket location                                                                                        | us-east-1                                      | some non-whitespace token                                                |                                                              |
| AWS\_S3\_BUCKET                 |                                                 | Bucket name, overridden by FILE\_BUCKET                                                                |                                                | AWS\_S3\_BUCKET= must match a token without whitespaces or special chars |                                                              |
| AWS\_S3\_PARTITION              |                                                 | Subdirectory within the S3 bucket, overridden by FILE\_PARTITION                                       | username                                       |                                                                          |                                                              |
| FILE\_BUCKET                    |                                                 | Bucket name                                                                                            |                                                | FILE\_BUCKET= must match a token without whitespaces or special chars    |                                                              |
| FILE\_PARTITION                 |                                                 | Subdirectory within the file bucket                                                                    | username                                       |                                                                          |                                                              |
| AWS\_S3\_ENDPOINT               |                                                 | Bucket host override                                                                                   | s3.region.aws.com                              | a hostname                                                               |                                                              |
| AWS\_REKOGNITION\_ENDPOINT      |                                                 | Rekognition host override                                                                              | rek.region.aws.com                             | a hostname                                                               |                                                              |
| AWS\_ACCESS\_KEY\_ID            |                                                 | Set if not using IAM roles for the machine                                                             | AKIA...                                        | iam roles used                                                           |                                                              |
| AWS\_SECRET\_ACCESS\_KEY        |                                                 | ❗Set if not using IAM roles for the machine                                                            | AEFE...                                        | iam roles used                                                           |                                                              |
| USES3                           | true                                            | Use S3                                                                                                 | false                                          | S3 bucket setup                                                          |                                                              |
| S3\_TASKS                       | 512                                             | Max number of concurrent S3 tasks                                                                      | 64                                             | an unsigned int                                                          |                                                              |
| KAFKA\_PEERS                    |                                                 | Kafka nodes to talk to directly.  A comma-delimited list of host:port pairs                            | localhost:9092                                 | a comma-delimited list of host:port                                      |                                                              |
| KAFKA\_TOPIC\_UPDATE            |                                                 | Kafka topic for update events                                                                          | gmdu                                           | some non-whitespace token                                                |                                                              |
| KAFKA\_TOPIC\_READ              |                                                 | Kafka topic for read events                                                                            | gmdr                                           | some non-whitespace token                                                |                                                              |
| KAFKA\_TOPIC\_ERROR             |                                                 | Kafka topic for errors                                                                                 | gmde                                           | some non-whitespace token                                                |                                                              |
| KAFKA\_CONSUMER\_GROUP          | test1                                           | Kafka consumer group id                                                                                | imageconverters                                | some non-whitespace token                                                |                                                              |
| KAFKA\_CERT                     |                                                 | id cert                                                                                                | single line base64 of pem                      | KAFKA\_CERT is expecting a single-line base64 encoded string             |                                                              |
| KAFKA\_KEY                      |                                                 | id key                                                                                                 | single line base64 of pem                      | KAFKA\_KEY is expecting a single-line base64 encoded string              |                                                              |
| KAFKA\_TRUST                    |                                                 | id trust                                                                                               | single line base64 of pem                      | KAFKA\_TRUST is expecting a single-line base64 encoded string            |                                                              |
| KAFKA\_USE\_TLS                 | false                                           | use TLS for kafka directly                                                                             | true                                           |                                                                          |                                                              |
| KAFKA\_CN                       | false                                           | cn for kafka                                                                                           | true                                           |                                                                          |                                                              |
| TEST\_JWT\_PRIV                 |                                                 | ❗❗test only! a base64 encoded single line of the private key for internal signing during tests         |                                                | base64 encoded line                                                      |                                                              |
| JWT\_PUB                        |                                                 | the single-line base64 encode of the public key of jwt tokens we accept                                | export JWT\_PUB=\`cat jwtRS256.key.pub \| base64 -w 0\` | JWT\_PUB is expecting a single-line base64 encoded string    |                                                              |
| JWT\_PUB\_1                     |                                                 | the single-line base64 encode of the public key of jwt tokens we accept                                | export JWT\_PUB=\`cat jwtRS256.key.pub \| base64 -w 0\` | JWT\_PUB\_1 is expecting a single-line base64 encoded string |                                                              |
| JWT\_PUB\_2                     |                                                 | the single-line base64 encode of the public key of jwt tokens we accept                                | export JWT\_PUB=\`cat jwtRS256.key.pub \| base64 -w 0\` | JWT\_PUB\_2 is expecting a single-line base64 encoded string |                                                              |
| JWT\_PUB\_3                     |                                                 | the single-line base64 encode of the public key of jwt tokens we accept                                | export JWT\_PUB=\`cat jwtRS256.key.pub \| base64 -w 0\` | JWT\_PUB\_3 is expecting a single-line base64 encoded string |                                                              |
| JWT\_PUB\_4                     |                                                 | the single-line base64 encode of the public key of jwt tokens we accept                                | export JWT\_PUB=\`cat jwtRS256.key.pub \| base64 -w 0\` | JWT\_PUB\_4 is expecting a single-line base64 encoded string |                                                              |
| JWT\_NOT\_BEFORE\_SKEW\_SECONDS | 86400                                           | seconds that not-before is in the past, to handle mutual clock skews                                   | 60                                             | an unsigned int                                                          |                                                              |
| MONGOHOST\_MASTER               |                                                 | Mongo host ip:port that we replicate with                                                              | m1:27017,m2:27017                              | a comma-delimited list of host:port                                      |                                                              |
| MONGODB\_MASTER                 |                                                 | Mongo database we replicate with                                                                       | gmdatadev                                      | some non-whitespace token                                                |                                                              |
| MONGOHOST                       |                                                 | Mongo host ip:port                                                                                     | m1:27017,m2:27017                              | a comma-delimited list of host:port                                      |                                                              |
| MONGODB                         | gmdatax                                         | Mongo database                                                                                         | gmdatadev                                      | some non-whitespace token                                                |                                                              |
| MONGO\_CERT                     |                                                 | Mongo TLS cert base64                                                                                  | cat ./certs/server.cert.pem \| base64 -w 0     | MONGO\_CERT is expecting a single-line base64 encoded string             |                                                              |
| MONGO\_KEY                      |                                                 | ❗Mongo TLS cert key base64                                                                             | cat ./certs/server.key.pem \| base64 -w 0      | MONGO\_KEY is expecting a single-line base64 encoded string              |                                                              |
| MONGO\_TRUST                    |                                                 | Mongo TLS trust base64                                                                                 | cat ./certs/server.trust.pem \| base64 -w 0    | MONGO\_TRUST is expecting a single-line base64 encoded string            |                                                              |
| MONGO\_CN                       |                                                 | Mongo SNI name                                                                                         |                                                |                                                                          |                                                              |
| MONGO\_SOURCE                   |                                                 | Mongo login source                                                                                     | $external                                      |                                                                          |                                                              |
| MONGO\_MECHANISM                |                                                 | Mongo login mechanism                                                                                  | MONGODB-X509                                   |                                                                          |                                                              |
| MONGO\_USE\_TLS                 | false                                           | Mongo use TLS                                                                                          | true                                           |                                                                          |                                                              |
| MONGO\_INITDB\_ROOT\_USERNAME   |                                                 | MongoDB user id                                                                                        | mongoadmin                                     |                                                                          |                                                              |
| MONGO\_INITDB\_ROOT\_PASSWORD   |                                                 | ❗MongoDB password                                                                                      | S0m3Pass                                       |                                                                          |                                                              |
| TEST\_LOAD\_ITERATIONS          |                                                 | number of iterations for load test                                                                     | 10000                                          | an unsigned int                                                          |                                                              |
| GMDATA\_NAMESPACE               |                                                 | A Directory in the root that lets you create content as yourself                                       |                                                |                                                                          |                                                              |
| GMDATA\_NAMESPACE\_USERFIELD    |                                                 | The field that matches up with the directory you can create                                            |                                                |                                                                          |                                                              |
| GMDATA\_NAMESPACE\_TEMPLATE     | (if (contains %s "%s") (yield-all) (yield R X)) | The default template to create a user implicitly                                                       |                                                |                                                                          |                                                              |
| DELETE\_EXPIRED                 | false                                           | Actually remove expired entries periodically to comply with privacy laws                               |                                                | DELETE\_EXPIRED= should be true or false                                 |                                                              |
| DELETE\_EXPIRED\_POLL\_SECONDS  | 600                                             | Number of seconds to poll for expired data                                                             | 3600                                           | an unsigned int                                                          |                                                              |
| NOTIFICATION\_CACHE\_SIZE       | 1000                                            | Number of items to cache when watching notifications on an oid                                         | 100                                            | an unsigned int                                                          |                                                              |
| MIMETYPES\_OVERRIDE             |                                                 | Supply an alternate mime.types                                                                         | ./mime.types                                   |                                                                          |                                                              |
| LISTING\_DEBUG                  | false                                           | Turn on debug for listing package                                                                      | true                                           |                                                                          |                                                              |
| BIND\_ADDRESS                   | 0.0.0.0                                         | bind address for port                                                                                  | 127.0.0.1                                      | a hostname                                                               |                                                              |
| BIND\_PORT                      | 8181                                            | bind port                                                                                              | 9123                                           | an unsigned int                                                          |                                                              |
| PRETTY\_PRINT                   | true                                            | pretty print returning json by default. set this to false in production, as it makes json larger.      | false                                          |                                                                          |                                                              |
| HTTP\_TRANSPORT\_CANCEL\_HOURS  | 4                                               | Hours before http call is cancelled                                                                    | 24                                             | an unsigned int                                                          |                                                              |
| USE\_PPROF\_CPU                 | true                                            | CPU profiling in pprof                                                                                 | false                                          |                                                                          |                                                              |
| USE\_PPROF\_MEM                 | true                                            | mem profiling in pprof                                                                                 | false                                          |                                                                          |                                                              |
| HTTP\_CACHE\_SECONDS            | 10                                              | http default cache in seconds                                                                          | 60                                             | an unsigned int                                                          |                                                              |
| TRACE\_LOG                      |                                                 | write a trace to file name                                                                             | /logs/trace.out                                |                                                                          |                                                              |
| REKOGNITION\_FACE\_INDEX        |                                                 | Set a face index for AWS Rekognition                                                                   | hackathon                                      |                                                                          |                                                              |
| LOG\_OPEN\_FILE\_HANDLES        | true                                            | log open file handles to look for leaks                                                                | false                                          |                                                                          |                                                              |
| GMDATAX\_CATCH\_PANIC           | false                                           | catch panics rather than restarting gmdatax                                                            | true                                           |                                                                          |                                                              |
| GMDATAX\_SESSION\_MAX           | 4096                                            | max http sessions in progress                                                                          | 10000                                          | an unsigned int                                                          |                                                              |
| JWT\_API\_KEY                   |                                                 | jwt api key                                                                                            | a password                                     | JWT\_API\_KEY is expecting a single-line base64 encoded string           |                                                              |
| NAMED\_BANNER                   | true                                            | include name in banner                                                                                 | false                                          |                                                                          |                                                              |
| GMDATA\_CERT                    |                                                 | id cert                                                                                                | single line base64 of pem                      | GMDATA\_CERT is expecting a single-line base64 encoded string            |                                                              |
| GMDATA\_KEY                     |                                                 | id key                                                                                                 | single line base64 of pem                      | GMDATA\_KEY is expecting a single-line base64 encoded string             |                                                              |
| GMDATA\_TRUST                   |                                                 | id trust                                                                                               | single line base64 of pem                      | GMDATA\_TRUST is expecting a single-line base64 encoded string           |                                                              |
| GMDATA\_USE\_TLS                | false                                           | use TLS for gmdata directly                                                                            | true                                           |                                                                          |                                                              |
| GMDATA\_REQUIRE\_CLIENT\_CERT   | true                                            | demand a client cert                                                                                   | false                                          |                                                                          |                                                              |
| GMDATA\_AUTHENTICATION\_HEADER  | USER\_DN                                        | a header that is TRUSTED to contain an authenticated user id. disable with value '-'.                  | -                                              |                                                                          |                                                              |
| POLICY\_CACHE\_LIFETIME         | 60                                              | amount of time an object lives in objectpolicy cache                                                   | 30                                             | an unsigned int                                                          |                                                              |
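
Several of the variables above (`KAFKA_CERT`, `MONGO_KEY`, `JWT_PUB`, `GMDATA_TRUST`, and so on) expect a PEM file as a single-line base64 string. The following is a minimal Python sketch of that encoding, equivalent in intent to piping the file through `base64 -w 0`. The helper name `pem_to_env_value` and the placeholder PEM bytes are illustrative assumptions, not part of Grey Matter Data.

```python
import base64


def pem_to_env_value(pem_bytes: bytes) -> str:
    """Return the base64 of `pem_bytes` on one line, with no wrapping."""
    # b64encode never inserts line breaks, which matches `base64 -w 0`.
    return base64.b64encode(pem_bytes).decode("ascii")


# Placeholder content standing in for a real PEM file read from disk.
pem = b"-----BEGIN PUBLIC KEY-----\n...\n-----END PUBLIC KEY-----\n"
value = pem_to_env_value(pem)
print(f"export JWT_PUB={value}")
```

Decoding the value with `base64.b64decode` should give back the original PEM bytes, which is a quick way to check a variable before deploying it.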

## Authentication, Authorization, ObjectPolicy, and Permission

### Policies

This service does not contact an authorization service (or authentication service for that matter). This service should stay that way. In order to have a secure service, encryption of data is not nearly enough. There needs to be a workable system for calculating access to objects. For the authentication part, we do not want a username/password for authentication purposes (though we might want a password for end-to-end encryption purposes):

* Treat incoming users as a set of digitally signed attributes only.  They are not usernames that we need to go look up elsewhere.
* This allows us to operate without having to further defer to more backend microservices for authorization.
* There is currently NO backend user service that we can defer to.
* There is a standard that does this, and it is possible to use a minimum subset of it so that it is easy to secure and implement.
  * JWT (JSON Web Token) is about as simple as a specification can be.  There is an RFC mess over top of it that brings in hazards, and we can choose not to implement that.
    * The basic spec contains a base64 header that hints at the signing algorithm (and we must CHECK that it is the only algorithm that we expect).
    * The basic spec allows expiration dates to be encoded.
    * The basic spec allows for arbitrary content in the payload.
    * All of the header and payload are json chunks, so that it is not hard to parse or document.

#### Example Of Generic Authentication

Some signing service, that is not us, generates an ECDSA signing key ❗:

```bash
-----BEGIN EC PRIVATE KEY-----
MIHcAgEBBEIB3hA+StvLndr3qCMhY8SOWu5MM/Oim2SqVA8GFWV+Lmnc03OuacyZ
...
uqTT+pE6m1KYIbuBrsv0TgIrYPWXMdPpTUaUGytBtw==
-----END EC PRIVATE KEY-----
```

From that ECDSA signing key, it can generate and publish an ECDSA public key:

```bash
-----BEGIN PUBLIC KEY-----
MIGbMBAGByqGSM49AgEGBSuBBAAjA4GGAAQBDZpYnSarEIirBqbqxzqpV+HyXkx0
...
K2D1lzHT6U1GlBsrQbc=
-----END PUBLIC KEY-----
```

If we trust this service to sign attributes, then we include this public key in our trust store of public keys. We currently only support one public key for this purpose. That server generates a digitally signed statement, which is designed to be passed in HTTP headers, in the way OAuth tokens are. The user logs in to an authorization service (with an X509 certificate or a password) and can SUGGEST to it what needs to be signed. That service should reject any assertions that the user makes about themselves that cannot be verified as true (according to its internal database), and the server is allowed to inject new attributes such as expiration.

```bash
{ "age": [ "adult" ], "email": "rob.johnson@email.com", "org": [ "decipher" ] }
```

So, the server responds to the user's request and gives the user this token (❗):

```bash
eyJhbGciOiJFUzUxMiIsInR5cCI6IkpXVCJ9.eyJhZ2UiOlsiYWR1bHQiXSwiZW1haWwiOiJyb2IuZmllbGRpbmdAZ21haWwuY29tIiwib3JnIjpbImRlY2lwaGVyIl19.AVDnBeIOgAsTblY1YbI4K7JQ_28zNbeVCS3fpKXkQMtHRJhSHZza9dgHuQhGwLn4gm_CngmPNRwkzDJHjg6AFJ12AMAxX4u04Im4EQQWOKAasOLr2A-3-uDaq18hU_s8siSA-24tru3WsqG_47bAuNYKt6m7mGQk3pBi2upWgYRJzRC7
```

Our app ONLY sees these tokens on incoming requests, where they are set in HTTP headers (cookies, authentication). The header and claims are not encrypted, only base64url-encoded. As a bearer token, it should NOT be leaked to anybody but the service that needs it. It decodes in plaintext as this:

Header:

```javascript
{ "alg": "ES512", "typ": "JWT" }
```

Claims:

```javascript
{ "age": [ "adult" ], "email": "rob.johnson@email.com", "org": [ "decipher" ] }
```

The header says how to interpret the tail-end, which contains the digital signature. It is critical that we only allow `alg` that matches our actual trust cert, which is `ES512`. We should also only be honoring claims that include expiration timestamps, to limit the time that a leaked token can be used.
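
The checks described above can be sketched with the Python standard library alone. This is a minimal illustration, not the service's implementation: the function names, the `now` parameter, and the placeholder signature are assumptions, and the sketch deliberately stops short of signature verification, which needs an ECDSA library and the trusted public key.

```python
import base64
import json

EXPECTED_ALG = "ES512"  # must match the key type in our trust store


def b64url_decode(segment: str) -> bytes:
    """Decode a base64url segment, restoring the padding that JWTs strip."""
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def b64url_encode(raw: bytes) -> str:
    """Encode bytes as unpadded base64url, as used in JWT segments."""
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def precheck_token(token: str, now: float) -> dict:
    """Run the cheap structural checks and return the claims.

    This does NOT verify the signature. Claims must not be honored
    until the signature has also been checked against the trusted key.
    """
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected header.claims.signature")
    header = json.loads(b64url_decode(parts[0]))
    if header.get("alg") != EXPECTED_ALG:
        raise ValueError("unexpected alg")
    claims = json.loads(b64url_decode(parts[1]))
    exp = claims.get("exp")
    if not isinstance(exp, (int, float)) or exp <= now:
        raise ValueError("missing or expired exp")
    return claims


# Build an illustrative token; the third segment is a placeholder only.
header = b64url_encode(json.dumps({"alg": "ES512", "typ": "JWT"}).encode())
claims = b64url_encode(json.dumps({"exp": 2000, "org": ["decipher"]}).encode())
token = f"{header}.{claims}.placeholder-signature"
print(precheck_token(token, now=1000))
```

Rejecting on `alg` before doing anything else keeps a token from choosing its own verification algorithm, and requiring `exp` bounds how long a leaked token stays useful.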

#### UserPolicy Specific

If a user comes in with a statement such as the one above, then we can accept it as a true statement. There needs to be some specific structure in these in order to be useful. Presume that we design the Claims payload to specifically look like this:

```javascript
{
  "exp": 43143143,
  "label": "asRobJohnsonWork",
  "values": {
    "org": [ "decipher", "ieee" ],
    "citizenship": [ "US" ],
    "email": [ "rob.johnson@email.com", "rob.johnson@another-email.com" ],
    "age": [ "adult" ],
    "clearance": [ "confidential" ]
  }
}
```

Some things about this document:

* It has an expiration date, after which we should not honor the signature
* The values are specific to an application family that wants to automatically process flexible policies.
  * It has a regular structure to it.  `map[string][]string` is the regular structure for the values.
  * This regular structure helps a domain-specific language to be written into objectPolicy for evaluation later
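
The regular structure of `values` is easy to check mechanically. The following Python sketch validates the `map[string][]string` shape; the function name `validate_values` is hypothetical, and whether the service rejects or coerces non-conforming claims is not stated here, so the strict rejection below is an assumption.

```python
def validate_values(values) -> dict:
    """Check that `values` maps field names to lists of strings."""
    if not isinstance(values, dict):
        raise TypeError("values must be a JSON object")
    for field, held in values.items():
        is_string_list = isinstance(held, list) and all(
            isinstance(v, str) for v in held
        )
        if not is_string_list:
            raise TypeError(f"{field!r} must be a list of strings")
    return values


claims = {
    "exp": 43143143,
    "label": "asRobJohnsonWork",
    "values": {
        "org": ["decipher", "ieee"],
        "citizenship": ["US"],
        "age": ["adult"],
    },
}
print(sorted(validate_values(claims["values"])))
```

Keeping every field a list, even when it holds a single value, means policy functions never need to special-case scalars.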

#### ObjectPolicy Specific

When an object is created in the system, a policy is attached to it. The policy is authored in JSON (and literally stored as BSON). It is actually a function: the inputs are the claims of the incoming token, and the output is a Permission. An objectPolicy like the samples below is attached to an object on each of its updates.

> These statements are rendered as LISP syntax for legibility. The `originalobjectpolicy` field can be set to this LISP statement directly, and the server will set the `objectpolicy` field to the equivalent, but much more verbose, JSON value. The field `originalobjectpolicy` might also be set to something that is not our LISP language, such as a proprietary ACL language that gm-data does not understand directly.

This is how you would allow anonymous access: even if the caller did not present us with a UserPolicy at all, give them R and X access (R is for reading properties; X is for streams and directory listings):

```
  (yield R X)
```

Note that this LISP statement compiles to the following JSON, which is what is actually set in the `objectpolicy` field:

```javascript
{
  "f": "yield",
  "a": [
    {"v": "R"},
    {"v": "X"}
  ]
}
```

> In general, the JSON field "f" is the first argument in a parenthesized list (`head` of the list in LISP terminology), and "a" is the list of remaining arguments (`tail` of the list in LISP terminology). The actual string values are represented with "v". This means that the transform between LISP and actual `objectpolicy` is straightforward, and goes in both directions. Double-quotes can be used in values to include spaces and parentheses in the actual value.
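
The LISP-to-JSON direction of that transform can be sketched in a few lines of Python. This is an illustration of the mapping described above, not the server's compiler: the function names are hypothetical, and the handling of nested lists (each becoming its own `f`/`a` object) and of empty argument lists is an assumption extrapolated from the flat `(yield R X)` case.

```python
import json
import re

# Tokens: parentheses, double-quoted strings, or bare words.
TOKEN = re.compile(r'\(|\)|"[^"]*"|[^\s()"]+')


def parse_policy(text: str) -> dict:
    """Convert a LISP-style policy string into the f/a/v JSON shape."""
    tokens = TOKEN.findall(text)
    node, pos = _parse(tokens, 0)
    if pos != len(tokens):
        raise ValueError("trailing tokens after policy")
    return node


def _parse(tokens, pos):
    if pos >= len(tokens):
        raise ValueError("unexpected end of policy")
    tok = tokens[pos]
    if tok == ")":
        raise ValueError("unexpected )")
    if tok != "(":
        # A plain value; strip the quotes from a quoted string.
        value = tok[1:-1] if tok.startswith('"') else tok
        return {"v": value}, pos + 1
    pos += 1
    if pos >= len(tokens) or tokens[pos] in ("(", ")"):
        raise ValueError("a list must start with a function name")
    head, pos = tokens[pos], pos + 1
    args = []
    while pos < len(tokens) and tokens[pos] != ")":
        arg, pos = _parse(tokens, pos)
        args.append(arg)
    if pos >= len(tokens):
        raise ValueError("missing )")
    return {"f": head, "a": args}, pos + 1


print(json.dumps(parse_policy("(yield R X)")))
```

Because every list maps to one object and every value to one `{"v": ...}` entry, the reverse direction is a straightforward recursive print of the same tree.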

If we need audited-but-public access, then we can demand an email, just so that it shows up in audit logs:

```
  (if (tells email) (yield R X) (if (has some email rob@example.com) (yield-all)))
```

This is a file that is owned by a particular email address (full access, including P for purge):

```
  (if (contains email rob.johnson@email.com) (yield C R U D X P))
```

This file is owned by a user and shared for read access with a group:

```
  (if (contains email rob.johnson@email.com)
      (yield C R U D X P)
      (if (contains org decipher)
        (yield R X)))
```

You have to be a non-dual citizen to use this resource. (ie: US citizen required, but must not also be a citizen somewhere else):

```
  (if (and
        (contains citizenship US)
        (not (has not citizenship US)))
      (yield R X))
```

Everything in this system is an event such as a Create, Update, or a Delete. When updating or deleting an object, it will search for the latest version of the object to evaluate the userPolicy against. Similarly, for a Create, it will look up the latest version of the parent directory and check the permissions against that. These evaluators take UserPolicy in as input and return permissions as output. Each of these evaluators is attached to each version of a resource that is access controlled.
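
The lookup rule above can be sketched over an in-memory list of events. This is a simplified illustration rather than the service's storage code: the `parentoid` and `tstamp` field names, the single-letter action codes, and both function names are assumptions made for the example.

```python
def latest_version(events, oid):
    """Return the newest event for `oid`, or None if it has no events."""
    versions = [e for e in events if e["oid"] == oid]
    return max(versions, key=lambda e: e["tstamp"], default=None)


def governing_event(events, action, oid, parentoid):
    """Pick the event whose policy decides whether `action` is allowed."""
    # A Create is checked against the parent directory; an Update or
    # Delete is checked against the latest version of the object itself.
    target = parentoid if action == "C" else oid
    return latest_version(events, target)


events = [
    {"oid": "1", "parentoid": "0", "tstamp": 100, "originalobjectpolicy": "(yield R X)"},
    {"oid": "2", "parentoid": "1", "tstamp": 110, "originalobjectpolicy": "(yield R X)"},
    {"oid": "2", "parentoid": "1", "tstamp": 120, "originalobjectpolicy": "(yield-all)"},
]

# Updating object 2 is governed by its newest version.
print(governing_event(events, "U", "2", "1")["originalobjectpolicy"])
# Creating a new object 3 inside directory 1 is governed by directory 1.
print(governing_event(events, "C", "3", "1")["originalobjectpolicy"])
```

Older versions stay in the list untouched, so the policy that applied at any earlier point in time can still be recovered.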

As a general pattern, expect that access control determines whether a file is known to exist, and ownership determines who has edit access to a file, while read access may be granted otherwise. That means that this pattern should be common:

```
  (if (COMPLICATED_ACCESS)
    (if (OWNERSHIP)
      (yield-all)
      (yield R X)
    )
  )
```

As an example, take a file where "US adult citizens can read files and metadata" and that is "jointly editable by two specific email addresses or anybody in a specific admin group". Anyone not meeting the access threshold will not even know that the file exists, because knowing would require R or X.

So, taking COMPLICATED\_ACCESS to be:

```
  (and (contains citizenship US) (contains age adult))
```

And OWNERSHIP to be:

```
  (or (contains email rob.johnson@email.com danielle.miller@email.com) (contains group "deciphernow admin"))
```

We then would have a full `originalobjectpolicy` of:

```
  (if (and (contains citizenship US) (contains age adult))
    (if (or (contains email rob.johnson@email.com danielle.miller@email.com) (contains group "deciphernow admin"))
      (yield-all)
      (yield R X)
    )
  )
```

When compiled to its equivalent JSON, this becomes the `requirements` field of `objectpolicy`.

#### Combining The Concepts

UserPolicy values:

```javascript
{ "age": [ "adult" ], "email": "rob.johnson@email.com", "org": [ "decipher" ] }
```

This user is allowed R and X access by the following policy, because the token tells us an email:

```
  (if (tells email) (yield R X) (if (contains role "administrator") (yield-all)))
```

And this object has full ownership by any bearer with an `email` equal to `rob.johnson@email.com`:

```
  (if (contains email rob.johnson@email.com)
      (yield C R U D X P)
      (if (contains org decipher)
          (yield R X)))
```

#### The ObjectPolicy Language

The current policy language has a minimal set of functions to demonstrate the capability. It will probably need some higher-level functions to compress common policy expressions.

* `(if $cond $trueBranch $falseBranch) : bool` Based on the boolean truth of `$cond`, executes either `$trueBranch` or `$falseBranch` and does not evaluate the other branch.
* `(and $arg+) : bool` Logical AND over any number of arguments (usually two).
* `(or $arg+) : bool` Logical OR over any number of arguments (usually two).
* `(not $arg) : bool` Logical NOT of one argument.
* `(contains $field $arg+) : bool` We presume that `$field` is a `[]string`. Returns true if any arg matches. (Note: we should probably have `contains-and` and `contains-or`, just to be less surprising.)
* `(has $op $field $arg+) : bool` We presume that `$field` is a `[]string`. `$op` can be either `eq` or `not`, so that we can express more than the equality test that `contains` performs. Multiple args are accepted, e.g. `(has not employer snapchat zynga)`. (Note: we probably want `has-or` and `has-and` to make it less ambiguous.)
* `(tells $field) : bool` We demand that this field is told to us in the UserPolicy values. This supports requiring that an attribute exists for auditing purposes. Example: `(tells email)`
* `(yield $arg+) : (bool,permissions-side-effect)` If we encounter this in the tree, we write each argument to the output list.
* `(allow-all) : (bool,permissions-side-effect)` Identical to `yield-all`.
* `(allow-read) : (bool,permissions-side-effect)` Identical to `yield R X`.
* `true : bool` Used in cases where we unconditionally yield permissions.
* `false : bool` Used in cases where we document that this branch of the expression fails.

When a true branch is taken, the whole expression reduces to the true branch only, until a leaf is reached. The leaf must be a macro that eventually yields possible permissions.
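The evaluation rules above can be sketched as a small interpreter. This is an illustration under stated assumptions, not the service's implementation: policies are written as nested Python lists rather than compiled JSON, `yield-all` is assumed to expand to `C R U D X P`, a missing false branch is assumed to evaluate to false, and `has` is left out because its multi-argument semantics are noted above as ambiguous. The names `evaluate` and `permissions` are hypothetical.

```python
ALL_PERMISSIONS = ["C", "R", "U", "D", "X", "P"]  # assumed expansion of yield-all

def as_list(value):
    # UserPolicy fields may be a single string or a list of strings.
    if value is None:
        return []
    return value if isinstance(value, list) else [value]

def evaluate(expr, user, granted):
    """Evaluate one policy expression; yielded permissions go into `granted`."""
    if expr == "true":
        return True
    if expr == "false":
        return False
    op, args = expr[0], expr[1:]
    if op == "if":
        # Only the branch that is taken gets evaluated.
        if evaluate(args[0], user, granted):
            return evaluate(args[1], user, granted)
        return evaluate(args[2], user, granted) if len(args) > 2 else False
    if op == "and":
        return all(evaluate(a, user, granted) for a in args)
    if op == "or":
        return any(evaluate(a, user, granted) for a in args)
    if op == "not":
        return not evaluate(args[0], user, granted)
    if op == "contains":
        values = as_list(user.get(args[0]))
        return any(a in values for a in args[1:])
    if op == "tells":
        return args[0] in user
    if op == "yield":
        for perm in args:
            if perm not in granted:
                granted.append(perm)
        return True
    if op in ("yield-all", "allow-all"):
        return evaluate(["yield"] + ALL_PERMISSIONS, user, granted)
    if op == "allow-read":
        return evaluate(["yield", "R", "X"], user, granted)
    raise ValueError(f"unsupported operator: {op}")

def permissions(policy, user):
    granted = []
    evaluate(policy, user, granted)
    return granted

USER = {"age": ["adult"], "email": "rob.johnson@email.com", "org": ["decipher"]}
READ_IF_EMAIL = ["if", ["tells", "email"], ["yield", "R", "X"],
                 ["if", ["contains", "role", "administrator"], ["yield-all"]]]
OWNER_POLICY = ["if", ["contains", "email", "rob.johnson@email.com"],
                ["yield", "C", "R", "U", "D", "X", "P"],
                ["if", ["contains", "org", "decipher"], ["yield", "R", "X"]]]

print(permissions(READ_IF_EMAIL, USER))
print(permissions(OWNER_POLICY, USER))
```

With the UserPolicy values from "Combining The Concepts", the first policy yields `R X` and the second yields all six permissions, since the bearer's email matches the owner clause.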

#### Possible Permissions

Note that we NEVER mutate data in this design. We insert events that sort on `(objectId, tstamp)` so that we maintain full history. This is what allows this permission system to have a uniform structure to it.

* "R" - Reading metadata about resources such as a file or dir, but not necessarily being able to retrieve files or list directories.  Without "R" we are not even told that the file exists.  It will not be in the listings.
* "X" - Execute.  This is not the same as Unix filesystem "execute".  What we mean is listing directories (same as in Unix) and opening the file stream (not the same as in Unix).
* "C", "U", "D" - For the mutation operations we need to look up the latest version of what we are changing before we allow the change to be appended.
  * When performing one of these on a dir, we check for "C" on the latest version of the parent dir.
  * When performing update on a dir or file, we check for "U" against the latest version that has this objectId that we are updating.
* "C" - Create.  This is used on directories.  When inserting a file, we check that we have C on the latest version of the parent directory.  Creates in the parent will be allowed by UserPolicy that yields C.
* "U" - Update.  This is used on everything.  Updates on this objectId will be allowed for UserPolicy that yields U.
* "P" - Purge.  This DOES alter the database.  Records are automatically hidden when overridden by new versions and when they expire, but they are still physically in the database.  For compliance situations, such as GDPR, we must be able to purge objects as well.  This means periodically complying by physically removing items that are either expired or that a user has demanded be removed.
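The "look up the latest version before appending" checks for "C" and "U" can be sketched as follows. This is a simplified illustration: events are plain dictionaries, the field names `oid`, `parentoid`, and `tstamp` are assumptions, and the `grants` callable stands in for evaluating an event's object policy against the caller's UserPolicy values.

```python
def latest(events, oid):
    # The current version of an object is its event with the highest tstamp.
    versions = [e for e in events if e["oid"] == oid]
    return max(versions, key=lambda e: e["tstamp"]) if versions else None

def may_create(events, parent_oid, grants):
    # Creating inside a directory needs "C" on the parent's latest version.
    parent = latest(events, parent_oid)
    return parent is not None and "C" in grants(parent)

def may_update(events, oid, grants):
    # Updating needs "U" on the latest version of the object itself.
    current = latest(events, oid)
    return current is not None and "U" in grants(current)

# Hypothetical history; "yielded" holds what the policy gives one caller.
EVENTS = [
    {"oid": "dir1", "parentoid": "root", "tstamp": 1, "yielded": {"R", "X", "C"}},
    {"oid": "file1", "parentoid": "dir1", "tstamp": 3, "yielded": {"R", "X", "U"}},
    {"oid": "dir1", "parentoid": "root", "tstamp": 5, "yielded": {"R", "X"}},
]

def grants(event):
    return event["yielded"]

print(may_create(EVENTS, "dir1", grants))
print(may_update(EVENTS, "file1", grants))
```

In this history the newer version of `dir1` no longer yields "C", so the create check fails even though an older version would have allowed it.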

The reason we never mutate data is that we need to be able to recreate a replica of the database from the update logs alone. Purge is an exception to the rule of never mutating data, but it can be safely included as long as we:

* Never create the same objectId more than once (in the update audit logs)
* Hold a purge request and only execute it after we actually find what we are asked to purge.  This way we don't lose purge requests on replication.
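The second rule can be illustrated with a small sketch in which a purge request that arrives before its object is held until the object shows up. The `Replica` class and its method names are hypothetical, and the event shape is an assumption; the real replication mechanism is not described in this section.

```python
class Replica:
    """Toy replica: an append-only event list plus held purge requests."""

    def __init__(self):
        self.events = []
        self.pending_purges = set()  # oids requested for purge but not yet found

    def append(self, event):
        self.events.append(event)
        self._run_purges()

    def request_purge(self, oid):
        # Hold the request; it is executed only once the object is found.
        self.pending_purges.add(oid)
        self._run_purges()

    def _run_purges(self):
        found = {e["oid"] for e in self.events} & self.pending_purges
        if found:
            # Physically remove every version of the purged objects.
            self.events = [e for e in self.events if e["oid"] not in found]
            self.pending_purges -= found

replica = Replica()
replica.request_purge("a")                 # arrives before the object does
replica.append({"oid": "b", "tstamp": 1})
replica.append({"oid": "a", "tstamp": 2})  # held purge now executes
print(replica.events, replica.pending_purges)
```

Because an objectId is never reused, a held request cannot accidentally match a different object that arrives later.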

Also, internally, we take advantage of the immutability of data so that we can run queries on snapshots in time. While we process a query as of the current time `Now()`, records that continue to arrive carry later timestamps.
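A snapshot query over immutable events can be sketched as below. It is a minimal illustration that assumes events are dictionaries with `oid` and `tstamp` fields and that `snapshot` is a hypothetical helper; it ignores expiry and permissions.

```python
def snapshot(events, as_of):
    """Return the latest version of each object at or before `as_of`."""
    view = {}
    # Sorting on (oid, tstamp) means later versions overwrite earlier ones.
    for event in sorted(events, key=lambda e: (e["oid"], e["tstamp"])):
        if event["tstamp"] <= as_of:
            view[event["oid"]] = event
    return view

HISTORY = [
    {"oid": "a", "tstamp": 1, "name": "v1"},
    {"oid": "b", "tstamp": 3, "name": "other"},
    {"oid": "a", "tstamp": 5, "name": "v2"},  # arrives after the query time
]

print(snapshot(HISTORY, as_of=4))
```

Events with a timestamp later than `as_of` are simply skipped, so a query pinned to a point in time returns the same answer regardless of what is appended afterwards.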
