Content based indexing for Audio/Video/Images

vigsterkr · February 3, 2020, 5:05pm

So for a while I was wondering why we cannot index our audio/video/images locally, since we already have such nice machine learning models… see all the CNNs for images like VGG, YOLOv3 et al.

Here’s my idea about the whole thing; it would be great to discuss what a final solution for this in Tracker would look like:

Since tracker-extract already supports GStreamer pipelines, my idea is basically to create a simple GStreamer plugin that applies an ML model to the content and adds the output to the knowledge graph (what the actual ontology would look like is another question, see later). The great thing is that there are already efforts to export ML models into a common format called ONNX. There’s also a nice library called onnxruntime that can execute these models, so the plugin doesn’t need to have all the different ML libraries (TensorFlow, PyTorch, etc.) available or linked in order to apply them. In other words, what I’m working on now is a gst-onnx plugin that depends only on onnxruntime and lets you specify which model you want to use for that pipeline. Of course the devil is in the details: none of the actual ONNX files provide all the information needed to run the model, because there are pre- and postprocessing steps that are model specific. This makes things a bit more complicated, but some of the mature models could already be supported with this approach.
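To make the pre- and postprocessing point concrete, here is a minimal sketch in plain Python of the kind of per-model information that would have to travel alongside an ONNX file. `ModelSpec`, `preprocess_pixel` and `postprocess` are hypothetical names, not part of onnxruntime or of any gst-onnx plugin, and the mean/std values are the commonly used ImageNet normalization constants, chosen only as example values.

```python
import math
from dataclasses import dataclass


@dataclass
class ModelSpec:
    """Hypothetical sidecar metadata that a bare .onnx file does not carry."""
    input_size: tuple  # (width, height) the model expects
    mean: tuple        # per-channel mean subtracted during preprocessing
    std: tuple         # per-channel standard deviation used for scaling
    labels: list       # class index -> human-readable label


def preprocess_pixel(rgb, spec):
    """Scale one 8-bit RGB pixel to [0, 1], then normalize each channel."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, spec.mean, spec.std))


def postprocess(logits, spec, top_k=3):
    """Turn raw model outputs into (label, probability) pairs via softmax."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(spec.labels, probs), key=lambda lp: lp[1], reverse=True)
    return ranked[:top_k]


# Example values only: three made-up classes and ImageNet-style normalization.
spec = ModelSpec(
    input_size=(224, 224),
    mean=(0.485, 0.456, 0.406),
    std=(0.229, 0.224, 0.225),
    labels=["cat", "dog", "car"],
)
print(postprocess([1.0, 3.0, 0.5], spec, top_k=2))
```

A detection model such as YOLOv3 would need a very different `postprocess` (box decoding, non-maximum suppression), which is exactly why a single generic plugin cannot treat all models the same way.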

Some of the problems/questions that should be discussed:

The tags/categories of the models are not standardized: each model has its own set of categories that it can assign to an image.
How should the assigned category be stored in RDF, and how would you search based on content? All these categories are defined in English, so searching for “images of dogs” is easy, but supporting the same query in other languages is not. The same goes for things like “images of selfies”. So how would we support internationalization?
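One possible way to approach both questions, sketched here in plain Python, is to map every model-specific label onto a language-neutral concept identifier and attach translated names to the concept rather than to the label. Everything in this sketch is an assumption for illustration: the `concept:` identifiers, the model names and the tables are made up, and a real design might reuse an existing vocabulary instead.

```python
# Hypothetical mapping from (model, label) to a language-neutral concept ID.
MODEL_LABEL_TO_CONCEPT = {
    ("model-a", "dog"): "concept:dog",
    ("model-b", "canine"): "concept:dog",
    ("model-a", "tabby cat"): "concept:cat",
}

# Localized names hang off the concept, not off any model's label set.
CONCEPT_NAMES = {
    "concept:dog": {"en": "dog", "de": "Hund", "hu": "kutya"},
    "concept:cat": {"en": "cat", "de": "Katze", "hu": "macska"},
}


def concept_for_label(model, label):
    """Resolve a model-specific label to a shared concept ID (or None)."""
    return MODEL_LABEL_TO_CONCEPT.get((model, label))


def concepts_for_query(term, lang):
    """Find concepts whose name in the given language matches the search term."""
    term = term.casefold()
    return [cid for cid, names in CONCEPT_NAMES.items()
            if names.get(lang, "").casefold() == term]


def search(index, term, lang):
    """Return the paths whose stored concepts match the localized term."""
    wanted = set(concepts_for_query(term, lang))
    return sorted(path for path, concepts in index.items() if concepts & wanted)


# Two different models tagged these files; both dog labels share one concept.
index = {
    "a.jpg": {concept_for_label("model-a", "dog")},
    "b.jpg": {concept_for_label("model-b", "canine")},
    "c.jpg": {concept_for_label("model-a", "tabby cat")},
}
print(search(index, "Hund", "de"))
```

With this layout the store only ever holds concept IDs, so adding a language or swapping a model means editing a mapping table rather than re-indexing the files.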

Btw, another good feature of onnxruntime is that it supports various backends (like GPU or multithreaded CPU) for running the models.
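As a rough sketch of what backend selection could look like with the onnxruntime Python package: `pick_providers` and `open_session` are hypothetical helpers, the preference order and the thread count are assumptions, and `open_session` only works where onnxruntime is installed and a model file exists.

```python
# Assumed preference order: try the GPU first, fall back to the CPU.
PREFERRED = ["CUDAExecutionProvider", "CPUExecutionProvider"]


def pick_providers(available, preferred=PREFERRED):
    """Keep the preferred providers that are available, in preference order."""
    chosen = [p for p in preferred if p in available]
    return chosen or ["CPUExecutionProvider"]


def open_session(model_path, cpu_threads=2):
    """Create an inference session; requires the onnxruntime package."""
    import onnxruntime as ort  # imported lazily so pick_providers has no deps

    options = ort.SessionOptions()
    options.intra_op_num_threads = cpu_threads  # threads used inside one operator
    providers = pick_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, sess_options=options, providers=providers)


print(pick_providers(["CPUExecutionProvider"]))
```

For a background indexer, keeping the CPU thread count low is probably the more important knob, since the work should not compete with whatever the user is doing.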

I’d be super interested to discuss this idea in further detail and to have it somewhere on the roadmap for Tracker.

vigsterkr · February 3, 2020, 5:10pm

Ah, just another thought that came up now: in case one wanted to externalize the computation, i.e. run the models somewhere on a cluster, without revealing the content (I would never actually give anybody the raw data to do this :P), we could use homomorphic ML for this (see more details about this here).

sthursfield · February 3, 2020, 7:47pm

This sounds like a project that’s interesting, useful, large in scope and potentially never-ending.

Some more things it will be useful to know:

What metadata can we produce with onnxruntime? “What’s in the image” is one useful thing; is there anything else?
How ‘heavy’ is onnxruntime as a dependency? The last release binary for Linux contained a 63MB .so file, and then we’ll have some models too.

We need to be conservative about adding resource-heavy features to Tracker, so this would start off as something external, or at least ‘off by default’. The libtracker-miner library provides a TrackerDecorator class which lets you receive a signal every time a file is created/updated and do some processing that gets added to the tracker-store. However, this is private API inside tracker-miners.git now; it’s possible to make it public, but that also increases the ongoing maintenance effort, which is risky. So perhaps the best way to go is to make a branch of tracker-miners.git and implement a new ‘tracker-onnx’ daemon, similar to but separate from tracker-extract, and depending on onnxruntime. We’d disable this by default, but distros could make it available in a separate package and allow their users to opt in easily.

We probably would need to extend the ontologies. I think the first step is to make a list of what data we can actually extract reliably, and then think about how to usefully store it.
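As a starting point for that discussion, here is a hedged sketch of how a predicted label could be stored without touching the ontologies at all, by reusing the existing tag classes. Reusing `nao:Tag` for machine-generated labels is purely an assumption of this sketch, not a decided design; `sparql_literal` and `tag_update` are hypothetical helpers, and the `nao:` and `nie:` prefixes are assumed to be known to the store.

```python
def sparql_literal(text):
    """Quote a string as a SPARQL literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def tag_update(file_url, label):
    """Build an update that attaches a new nao:Tag with this label to a file."""
    return (
        "INSERT {\n"
        "  _:tag a nao:Tag ;\n"
        f"    nao:prefLabel {sparql_literal(label)} .\n"
        "  ?urn nao:hasTag _:tag .\n"
        "} WHERE {\n"
        f"  ?urn nie:url {sparql_literal(file_url)} .\n"
        "}"
    )


print(tag_update("file:///home/me/dog.jpg", "dog"))
```

This has obvious gaps: it creates a fresh tag on every run, it cannot tell user tags from model output, and it has nowhere to put a confidence score or the name of the model. Those gaps are a fair summary of what an ontology extension would need to cover.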

abustany · February 3, 2020, 10:19pm

Ha, you’re bringing up an interesting topic! I was wondering the same a few weeks ago (and was a bit jealous of Mac users having face recognition working out of the box), so I started toying with putting together Tracker and a face recognition library. Luckily the ontologies already have the needed classes for linking regions of interest in images to entities like contacts, and the Python ecosystem has some pretty mature packages for doing face recognition. The super alpha, but already kind-of-working proof of concept is hosted on GitHub at abustany/tracker-face-recognition (“Detect and recognize faces in pictures indexed by Tracker”, WIP, alpha quality). It does not include the “miner” part yet though (i.e. you have to run it manually). Basically you can make it index (= compute face recognition information for) all the pictures in a given directory, and then spawn an ugly UI to associate the faces with names. After you identify a few pictures, most faces should be (hopefully correctly) pre-identified for you.
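The pre-identification step can be pictured as a nearest-neighbour lookup over face embeddings. The sketch below is not the code of the project mentioned above; it only illustrates the general technique, and the three-dimensional vectors and the 0.6 distance threshold are made-up example values (real face embeddings have far more dimensions).

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def suggest_name(embedding, known, threshold=0.6):
    """Return the name of the closest known face, or None if nothing is close."""
    best_name, best_dist = None, threshold
    for name, examples in known.items():
        for example in examples:
            d = distance(embedding, example)
            if d < best_dist:
                best_name, best_dist = name, d
    return best_name


# Faces the user has already labelled by hand, one example each.
known = {
    "alice": [(0.1, 0.9, 0.2)],
    "bob": [(0.8, 0.1, 0.5)],
}
print(suggest_name((0.15, 0.85, 0.25), known))
```

Each face the user confirms adds another example to `known`, which is why suggestions tend to improve after only a few pictures have been identified.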

I don’t have much time at the moment to make progress on this, but I hope I’ll one day manage to implement a rough copy of TrackerDecorator in Python so that this thing can run in the background. And even with this, the question of the UI would still be open…

vigsterkr · February 4, 2020, 7:00am

Well, as said above, ideally any ML model that is ONNX exportable could be used. Here is a small list of possibilities (sorry, because of Discourse’s limitation I couldn’t add all the relevant links; new users can only add 2 links per post :P):

image classification (labeling)
face recognition
emotion labeling of faces
audio transcribing for content indexing
better text indexing: see BERT et al

and essentially any of these models here (a lot!)

jensgeorg · February 4, 2020, 9:17am

Slightly related (because face detection): Shotwell master has a D-Bus service (subprojects/shotwell-facedetect/org.gnome.ShotwellFaces1.xml · master · GNOME / shotwell · GitLab) that you can throw images at, and it will give you the bounding boxes of the faces (it uses OpenCV). It is still somewhat experimental and probably needs a couple of iterations, though.

sthursfield · February 4, 2020, 11:27am

All very interesting! I think these will serve different use cases and will have different audiences. From an end user perspective it could make sense to implement them as separate ‘plugins’, i.e. separate daemons like tracker-onnx-audio-transcribe, tracker-onnx-ocr, tracker-onnx-image-classify, etc.

The OpenCV face recognition is interesting too! Is it possible for Shotwell and/or Gnome Photos to add the necessary UI for users to match up the faces? It seems better than having a separate GUI program just for that.

sthursfield · February 5, 2020, 7:37pm

You should be able to use the C implementation of TrackerDecorator from Python. We do build the GI bindings for libtracker-miner as far as I know. The only roadblock is that libtracker-miner is probably going to be private in Tracker 3.0; this is to reduce maintenance load, as it seemed to be hardly used by anyone. You could probably use Meson to build tracker-miners as a subproject of tracker-face-recognition and get at the library that way, or you could keep tracker-face-recognition in a branch of tracker-miners.git.

The idea of making libtracker-miner private was not to force people to re-implement bits of it, so if the above ideas don’t work for some reason then we probably need to revisit our approach.
