GSoC 2026 Original Proposal - gnome-semsearch: Offline Semantic Search Daemon

Hi everyone!

My name is Darshan Samuel Nayak. I have submitted an Original Idea proposal for GSoC 2026, and I wanted to post the architectural summary here for asynchronous review by the Shell and Tracker teams, since I know Matrix is incredibly busy today with the deadline.

The Pitch: gnome-semsearch

Currently, standard desktop indexers rely on exact keyword matching. Proprietary alternatives (like Windows Recall) solve this using invasive screen recording or cloud-tethered AI.

I am proposing a lightweight, background system service that brings AI-powered natural language search to the GNOME desktop with a strict 100% offline, privacy-first guarantee. It locally indexes text files, markdown, and source code into vector embeddings, allowing users to query their files natively from the GNOME Shell overview (e.g., “Python script where I annotated the dataset”).

Proposed Tech Stack & Architecture

To keep battery drain minimal and avoid UI stuttering, this is treated as an Edge AI problem:

  • Inference: A quantized all-MiniLM-L6-v2 model (~22MB) running purely on the CPU via ONNX Runtime (ort crate). Inference takes milliseconds.
  • Storage: Local in-process vector storage using SQLite combined with the highly optimized sqlite-vec C-extension.
  • Chunking: tree-sitter bindings to parse ASTs, allowing code (C++, Python, Rust) to be chunked by semantic function/class boundaries rather than arbitrary character counts.
  • IPC: Exposing the search results directly to GNOME Shell via the org.gnome.Shell.SearchProvider2 D-Bus API using zbus.
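To make the chunking idea concrete, here is a minimal sketch in Python using the built-in `ast` module as a stand-in for tree-sitter (the real daemon would use tree-sitter grammars to cover C++, Python, and Rust uniformly; the helper name and sample source below are invented for illustration):

```python
import ast


def chunk_python_source(source: str) -> list[str]:
    """Split Python source into chunks along top-level function/class
    boundaries instead of arbitrary character counts."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno/end_lineno are 1-based, inclusive (Python 3.8+)
            chunks.append("\n".join(lines[node.lineno - 1:node.end_lineno]))
    return chunks


sample = '''
def annotate(dataset):
    return [x + 1 for x in dataset]

class Indexer:
    def run(self):
        pass
'''

print(len(chunk_python_source(sample)))  # → 2 (one function chunk, one class chunk)
```

Each chunk is then a self-contained semantic unit that can be embedded on its own, which is the whole point of AST-based chunking.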

Proposed system architecture flow

An Open Architectural Question: Tracker Integration

My initial draft designs this as a standalone Rust daemon with its own file watcher (notify crate). However, to adhere to GNOME’s best practices and avoid duplicating file-watching overhead, I am highly open to architecting the ML inference pipeline as an extension for tracker-miners, provided Tracker’s SPARQL backend can accommodate or link to dense vector storage.

I would love to hear thoughts from the core maintainers on whether a standalone SearchProvider or a Tracker integration is the preferred “GNOME Way” for an experimental feature like this.

I have already submitted my formal proposal PDF to the GSoC dashboard to meet today’s deadline. I know the review period is chaotic, but if any mentors from the Shell, Mutter, or Tracker teams find this architecture interesting, I would be incredibly grateful for your feedback!

Thanks for your time and all your work on GNOME!
Darshan

Hi Darshan,

This is a really great proposal — I like it a lot, especially the privacy-first offline approach.

I’d suggest starting with a standalone service, as it would be easier to develop, maintain, and iterate on. While deeper GNOME integration (like Tracker) would be powerful, it might slow things down at this stage.

A phased approach could work well: start standalone, then integrate later.

Best of luck with your GSoC proposal!

Thanks for the reply @MesterPerfect.
I was thinking of doing the project during my summer vacation. Since I’m still doing my undergrad studies, I thought GSoC would be the perfect opportunity to contribute to the open-source GNOME community and also develop my own skills.
I already submitted my proposal to GSoC, but I still need a mentor. Hope things work out.
And yes, I was also thinking of starting standalone and then pursuing full integration later.
I also wanted to know if there is some other forum or discussion where I can put this idea up.
Thanks again

Hi Retro,

That sounds like a great plan — I really encourage you to go ahead with it. It’s a strong idea and a very valuable direction, especially for GNOME and open-source in general.

Also, you might consider sharing your proposal on the Ubuntu community forum https://discourse.ubuntu.com/ . It’s quite active, and you could get useful feedback or even attract potential mentors.

Best of luck — I hope it works out for you!

Thank you for the timely reply. I’ll definitely check out that forum.

You should look for GNOME developers working on this space and ask whether they find the idea interesting and are interested in mentoring you in GSoC.

Will surely look into that. Thank you for your response.
Since I’m pretty new to this community, could you perhaps help me out in connecting with someone who might be interested in mentoring me for this project?

Hi Darshan,

We spoke briefly in the GNOME Shell Matrix room. I was not aware of this Discourse post at that moment, and I did not introduce myself properly: I am Carlos Garnacho, maintainer and main developer of LocalSearch and TinySparql.

I’ve got to say I like that diagram more than what I first imagined with your introduction in the Matrix room, as it looks really close to the current architecture, mostly with a couple extra “ML inference engine” boxes.

I have been looking into sqlite-vec myself, and I don’t think there should be much problem in wrapping sqlite-vec functionality using the extension points reserved in the SPARQL language. TinySparql is a thin wrapper over SQLite already, and endpoints are a natural feature of SPARQL. From an architectural point of view, I think it is better not to reinvent ways to connect to a SQLite database from a separate process.
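For readers unfamiliar with what wrapping sqlite-vec would buy here, a dependency-free sketch of the underlying idea follows: embeddings stored as BLOBs in a plain SQLite table, with brute-force cosine similarity computed in Python. sqlite-vec replaces the Python loop with an indexed query inside the C extension; the schema and values below are invented purely for illustration.

```python
import math
import sqlite3
import struct


def pack(vec):
    """Serialize a float vector into a BLOB (4 bytes per float)."""
    return struct.pack(f"{len(vec)}f", *vec)


def unpack(blob):
    return struct.unpack(f"{len(blob) // 4}f", blob)


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE chunks (path TEXT, embedding BLOB)")  # hypothetical schema
db.executemany(
    "INSERT INTO chunks VALUES (?, ?)",
    [("notes.md", pack([0.9, 0.1, 0.0])),
     ("train.py", pack([0.1, 0.9, 0.1]))],
)

query = [0.85, 0.2, 0.05]  # pretend this came from the embedding model
ranked = sorted(
    ((path, cosine(query, unpack(blob)))
     for path, blob in db.execute("SELECT path, embedding FROM chunks")),
    key=lambda t: t[1],
    reverse=True,
)
print(ranked[0][0])  # → notes.md (highest cosine similarity to the query)
```

The point is that the storage layer is just SQLite either way, which is why exposing this through TinySparql’s existing SQLite wrapper is architecturally cleaner than a second connection path.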

You are correct that inotify/fanotify handles are a scarce resource, and letting multiple independent indexers run would be a source of pain in some scenarios, so it does make sense to keep this under the LocalSearch umbrella.

I see that the scope of the proposal seems oriented to indexing markdown and code. The former, I think, is too narrow in scope, as there are a number of other formats that would also be nice to support (PDF and ODF come to mind), so there should be support for pipelining data from the existing metadata extractors at least.

Code indexing, however, has been an established non-goal for LocalSearch. While it would be fun if the project churned out a code indexer as a sort of easter egg, I’m not sure that many users would interact with the GNOME Shell search entry this way, even among developers. But the project looks sizeable already even without it.

This proposal has got potential, but I am already booked for another GSoC project… If my LocalSearch co-maintainer @sthursfield would like to co-mentor with me I would be open too, but otherwise I feel like I would be overreaching by accepting this project by myself, sorry… In any case, please reach us at the Tracker Matrix room.

Cheers,

Carlos

Hi Carlos,

Thank you so much for taking the time to look at the Discourse post and the diagram! It is incredibly encouraging to hear from the maintainer that the architecture aligns with how TinySparql and LocalSearch are actually designed. I completely agree that wrapping sqlite-vec directly via SPARQL endpoints is the cleanest, most native way to handle this without reinventing database connections.

I also completely agree with your feedback on the scope, and I am more than happy to pivot. Shifting the focus to pipelining text from the existing LocalSearch metadata extractors (specifically for PDF, ODF, and plain text) into the embedding engine makes much more sense for the average GNOME user’s workflow. I will update my proposal to reflect this change.

I completely understand your bandwidth constraints for GSoC, and I am really grateful you would even consider co-mentoring this. I will head over to the Tracker Matrix room right away to say hello, and we can see if Sam (@sthursfield) might be interested in co-mentoring the adjusted, document-focused scope.

Thanks again for the invaluable feedback and guidance!

Best regards,
Darshan

Hi! Thanks a lot for the interest in the project. I wish I had capacity to mentor this summer, but I just don’t have any free time to spare.

Hi Sam,

No worries at all, I completely understand! Thank you so much for getting back to me so quickly. Time is a scarce resource for maintainers, and I really appreciate you letting me know upfront.

I am still very passionate about bringing offline semantic search to the GNOME desktop. Even if this doesn’t become an official GSoC project this summer, I plan to continue working on the offline semantic search prototype and would love to contribute it to LocalSearch/TinySparql as an independent contributor down the line.

If you or Carlos happen to know anyone else in the wider GNOME community who might have the bandwidth to mentor this specific idea, that would be amazing.

Alternatively, if GNOME has any other open GSoC projects that are currently short on applicants or looking for a student, I am very flexible! I would be more than happy to pivot and dedicate my summer to another project where the organization needs help the most.

Thanks again for your time, and keep up the great work on Tracker!

Best regards,

Darshan

Good to hear you are keen. The great thing about open source is you can just make cool stuff without anyone needing to give permission :slight_smile:

I will give some advice from back when I had more time to think about desktop search. Coming up with an architecture for how a search engine works is one thing. What we never did effectively in GNOME so far is come up with assertions about how it should behave in different scenarios.

LocalSearch really only has this level of testing for full-text search: tests/functional-tests/test_fts_basic.py · main · GNOME / LocalSearch · GitLab

When you consider the number of possible inputs (filenames, document contents, media tags, photo tags, etc), you can see there’s a huge amount of nuance in how the search engine presents the results. If I search for “Cat”, is cat.jpg more relevant than a document containing the word “cat”?

Nobody ever invested much time/money in asking those questions, answering them and writing tests to ensure the search infrastructure behaves according to the design.

I did some thinking about this a couple of years ago; see: Sam Thursfield / example-desktop-content · GitLab. But I didn’t get very far.

This question is quite complex already today, and if you add additional data produced by machine-learning-based analysis, then it’s going to become an even bigger question!

Hi Sam,

Your point about behavioral assertions and search relevancy is incredibly insightful. You are highlighting the exact problem that plagues modern search engines: defining Ground Truth. It is one thing to generate vector embeddings, but figuring out how to blend those AI similarity scores (from 0.0 to 1.0) with standard SQLite FTS scores without creating a chaotic user experience is a huge UX challenge. Like you said, if a user searches “Cat”, deciding whether cat.jpg or a document about felines wins the top spot is highly nuanced.

Thank you for sharing test_fts_basic.py and your example-desktop-content repository. This actually gives me a fantastic, concrete milestone for my prototype.

Instead of just building the pipeline in a vacuum, I am going to use the concept behind your example-desktop-content repo to build a “Ground Truth” test directory. I can write a behavioral test script that runs specific queries against both standard FTS and the sqlite-vec engine. This will let me objectively measure where the semantic search actually improves the UX, where it introduces noise, and how we might eventually rank them together (perhaps using something like Reciprocal Rank Fusion).
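As a concrete starting point for that blending experiment, here is a minimal Reciprocal Rank Fusion sketch. RRF only looks at ranks, so it sidesteps the problem of FTS scores and cosine similarities living on incompatible scales; the `k=60` constant is the conventional default from the RRF literature, and the file names below are invented:

```python
def rrf(rankings, k=60):
    """Fuse several best-first ranked lists into one.
    score(doc) = sum over lists of 1 / (k + rank_in_that_list)."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


fts_results = ["cat.jpg", "pets.md", "report.pdf"]     # keyword (FTS) ranking
vector_results = ["felines.md", "cat.jpg", "pets.md"]  # semantic (sqlite-vec) ranking

print(rrf([fts_results, vector_results]))
# → ['cat.jpg', 'pets.md', 'felines.md', 'report.pdf']
```

Documents that appear near the top of both lists (like `cat.jpg` here) win, which matches the intuition that agreement between the keyword and semantic engines is a strong relevance signal.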

This gives me a much deeper, product-level perspective to think about while I hack on the sqlite-vec prototype. Thank you for pointing me in this direction—this is exactly the kind of context I was hoping to learn from the community!

I’ll be sure to share my findings once I have the prototype running against some baseline test assertions.

Cheers,
Darshan