Derived Files

A main requirement for gm-data is to facilitate analysis of stored data. This requirement is one of the motivations for using microservices in the first place. Consider a realistic scenario:

A phone company receives mp3 files of phone calls that go to messages.
Processes listen for such files and perform useful conversions.
Alice calls Bob. Alice set her language to Spanish. Bob set his language to English.
When Alice leaves a message for Bob, an mp3 file is written into the mail folder for Bob. The file is: /users/mail/Bob/message-0001.mp3
The metadata about the call was saved too. The file is: /users/mail/Bob/message-0001.mp3
The /notifications/ endpoint in gm-data reports that this file was created, and a voice-to-text microservice is listening to /notifications/.
The microservice converts the mp3 file to a Spanish text transcription and writes it back into gm-data: /users/mail/Bob/message-0001.mp3.audio-to-text.txt
Another microservice listening on /notifications/ knows how to convert Spanish text to English text. It writes: /users/mail/Bob/message-0001.mp3.audio-to-text.txt.to-english.txt
Bob's phone is sent an English text that reads "Hello, my plane was late. When is the meeting?". This is possible because the metadata has the source and target phone numbers, and the English translation of the mp3 is available.
Bob responds: "Tomorrow morning".
Alice eventually gets a text message that says: "Mañana por la mañana" ("tomorrow morning").
Every derived file, when created, sets the derived properties oid and tstamp, along with a dtype such as to-english or audio-to-text. These properties let listeners interpret the files in application-specific ways.
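The naming and tagging steps in this scenario can be sketched in a few lines of Python. This is only an illustration, not gm-data's implementation: the helper names `derived_path` and `derive` are hypothetical, the suffix pattern is inferred from the example paths above, and it is assumed that oid and tstamp identify the source file that the new file was derived from. The oid and tstamp values are placeholders.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Derived:
    oid: str     # assumed: object id of the source file
    tstamp: str  # assumed: timestamp of the source file version
    dtype: str   # kind of derivation, e.g. "audio-to-text"


def derived_path(source_path: str, dtype: str, ext: str = "txt") -> str:
    """Name a derived file by appending the dtype and an extension."""
    return f"{source_path}.{dtype}.{ext}"


def derive(source_path: str, source_oid: str, source_tstamp: str, dtype: str):
    """Return the path and the derived properties for a new derived file."""
    props = Derived(oid=source_oid, tstamp=source_tstamp, dtype=dtype)
    return derived_path(source_path, dtype), {"derived": asdict(props)}


# Placeholder identifiers; a real listener would take them from the notification.
mp3 = "/users/mail/Bob/message-0001.mp3"
text_path, text_props = derive(mp3, "15834d5cb7f4f430", "15834d5cb8040ca4", "audio-to-text")
english_path = derived_path(text_path, "to-english")

print(text_path)
print(english_path)
print(text_props)
```

Because each stage only appends its own dtype to the path it was given, stages can be chained without knowing about each other.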

There are similar tasks that involve thumbnailing uploaded images to help user interfaces, or extracting images out of office documents. The point is that the microservices pipeline exists to support applications and to keep the data store from becoming a black hole in which everything is unsearchable and lost. Something as simple as text searching needs a solution like this as well. Suppose that GDPR requires us to delete data about any individual who demands it.

Somebody uploads a picture, cat.jpg.
Because a listener processes the upload and records what it finds in a derived file, a search on "Grumpy Cat" returns cat.jpg, even though the file was not labeled with this fact when uploaded.
GrumpyCat approaches us and demands that we delete all pictures of him.
We use the derived fields to find all files about GrumpyCat, including the original picture. We have to delete these due to the GDPR rules.
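One way to picture the lookup in the last step is as a graph walk over derived fields. The sketch below is a minimal in-memory illustration, not the gm-data query API: the record layout (`oid`, `name`, `text`), the function names, the derived file names, and the `thumbnail` dtype are all hypothetical, and it is assumed that a derived file's `derived.oid` points at its source.

```python
from collections import defaultdict, deque


def sources_mentioning(files, term):
    """Oids of source files whose derived files mention the search term."""
    return {
        f["derived"]["oid"]
        for f in files
        if "derived" in f and term.lower() in f.get("text", "").lower()
    }


def with_descendants(files, root_oids):
    """Each root oid plus every file derived from it, directly or transitively."""
    children = defaultdict(list)
    for f in files:
        if "derived" in f:
            children[f["derived"]["oid"]].append(f["oid"])
    found, queue = set(root_oids), deque(root_oids)
    while queue:
        for child in children[queue.popleft()]:
            if child not in found:
                found.add(child)
                queue.append(child)
    return found


# Hypothetical records standing in for file metadata.
files = [
    {"oid": "a1", "name": "cat.jpg"},
    {"oid": "b2", "name": "cat.jpg.detect-celebrities.txt", "text": "Grumpy Cat",
     "derived": {"oid": "a1", "tstamp": "t1", "dtype": "detect-celebrities"}},
    {"oid": "c3", "name": "cat.jpg.thumbnail.jpg",
     "derived": {"oid": "a1", "tstamp": "t1", "dtype": "thumbnail"}},
    {"oid": "d4", "name": "dog.jpg"},
]

to_delete = with_descendants(files, sources_mentioning(files, "Grumpy Cat"))
print(sorted(to_delete))
```

The search hit lands on the derived file, the derived field leads back to the original picture, and the walk then collects every other file derived from that picture so nothing is left behind.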

So, /notifications/ and /derived/ are used together for this purpose. In the backend, it's possible to listen on the Kafka queue, but having /notifications/ as a REST endpoint allows us to perform such tasks even from unusual locations such as the front end. This can allow a user to annotate a file with metadata by hand (e.g., upload a movie along with a karaoke file that is shown over it so that timed lyrics are written out on the screen). Or it could simply mean manually identifying things in images and audio.

This is what to look for in a file that was derived from a source file. The derived field is similar to a parent reference. The difference is that it identifies the type of derivation, so that we know which processor was responsible for creating the file.

"derived": {
  "oid": "15834d5cb7f4f430",
  "tstamp": "15834d5cb8040ca4",
  "dtype": "detect-celebrities"
}
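A listener might use this field to decide whether a file is relevant to it. The following sketch assumes the fragment above appears as a top-level `derived` key in a JSON metadata object; the registry, the `handles` decorator, and the handler itself are hypothetical names for illustration, not part of gm-data.

```python
import json

HANDLERS = {}  # dtype -> function that processes files with that derivation


def handles(dtype):
    """Register a function as the processor for files with the given dtype."""
    def register(fn):
        HANDLERS[dtype] = fn
        return fn
    return register


@handles("detect-celebrities")
def index_celebrities(derived):
    # Hypothetical handler: a real one might update a search index.
    return f"index labels for source {derived['oid']} at {derived['tstamp']}"


def dispatch(metadata_json):
    """Route a file's metadata to a handler, or return None if nothing applies."""
    derived = json.loads(metadata_json).get("derived")
    if derived is None:
        return None  # no derived field: treated here as an original upload
    handler = HANDLERS.get(derived["dtype"])
    return handler(derived) if handler else None


sample = (
    '{"derived": {"oid": "15834d5cb7f4f430", '
    '"tstamp": "15834d5cb8040ca4", "dtype": "detect-celebrities"}}'
)
print(dispatch(sample))
print(dispatch('{"name": "cat.jpg"}'))
```

Keying the handlers on dtype keeps each listener independent: a new kind of derivation only needs a new entry in the registry.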


Last updated 4 years ago
