Content based indexing for Audio/Video/Images

vigsterkr · February 3, 2020, 5:05pm

So for a while I was wondering why we cannot index our audio/video/images locally, since we have so nice machine learning models already… see all the CNNs for images like VGG, YOLOv3 et.al.

here’s my idea about the whole thing and it would be great to discuss how would a final solution for this in tracker:

Since tracker-extract already supports gstreamer pipelines, my idea is basically create a simple gstreamer plugin that can apply an ML model on the content and then use the output and add it to the knowledge graph (now how would the actual ontology look like that’s another question, see later). The great thing is that there’s already efforts in exporting ML models into a common format called ONNX. There’s actually a nice library called onnxruntime that can execute these models, hence not requiring to have all the different ML libraries like tensorflow, pytorch etc available/linked to the plugin to be able to apply these models. in other words, what i’m working on now is to write a gst-onnx plugin that only depends on onnxruntime and you can just specify which model you wanna use for that pipeline. of course the devil is in the details, i.e. none of the actual onnx files provide all the information to be able to run the model, i.e. there are some pre- and postprocessing steps that are model specific… so this makes things a bit more complicated, but still some of the mature models could be already supported with this approach.

some of the problems/questions that should be discussed:

the tags/categories of the models are not standardized, i.e. each model has a certain set of categories that they can assign to an image. but these are model specific.
how to store the assigned category in rdf: namely how would you search based on content? all these categories are defined in english, so while searching “images of dogs” could be easily done, but the same is not true if you wanna support the same query in other languages. same for things like: “images of selfies” etc. so how would we support internationalization?

btw another good feature of onnxruntime that it supports various backends (like gpu or multithreaded cpu) for running the models

i’d be super interested to discuss this idea in further detail and have it somewhere on the roadmap to add it to tracker.