Is it expected to get "recorded failures" at Tracker 3?

felipehw · October 7, 2020, 4:49am

Hi. I’m running Tracker 3.0.1 on Fedora 33.
The first bug seems fixed (https://gitlab.gnome.org/GNOME/tracker-miners/-/issues/141).
But when I run tracker3 status, I get:

$ tracker3 status
Currently indexed: 131644 files, 11385 folders
Remaining space on database partition: 18,0 GB (41,12%)
All data miners are idle, indexing complete
599 recorded failures

Is it expected to get “recorded failures” in Tracker 3? Or should I open issues until I reach zero “recorded failures”?

Note: A lot of these “recorded failures” happen with image files.
Note 2: GNOME offers to index “Home”. Is that a good idea, or will a lot of cache and configuration files be indexed?

garnacho · October 7, 2020, 8:18am

Hi @felipehw,

As a rule of thumb, yes, tracker3 status should list no failures. But note that there are a variety of reasons why a failure can happen:

Files are corrupted
Tracker Miners get the wrong MIME type from GIO/shared-mime-info
Of course, a bug in Tracker Miners
Or the libraries it depends on for metadata extraction

Any of these (or a combination) can lead to tracker-extract-3 failing on a file. Of course, if the source of the problem is unclear, a bug report at Issues · GNOME / LocalSearch · GitLab is always welcome.

To go through these files, you can use:

tracker3 status [pattern] to match failed files, and print more detailed status than a single line
tracker3 extract [file] to trigger metadata extraction from the command line. Add G_MESSAGES_DEBUG=Tracker and TRACKER_DEBUG=all for more verbose output.
coredumpctl or gdb to get backtraces
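
The verbose extraction run described above can also be scripted when there are many failed files to go through. A minimal Python sketch, assuming tracker3 is on PATH; the helper names extract_invocation and run_extract are hypothetical, and only the command and the two environment variables come from the post above.

```python
import os
import subprocess


def extract_invocation(path, base_env=None):
    """Return (argv, env) for a verbose `tracker3 extract` run on one file."""
    env = dict(os.environ if base_env is None else base_env)
    # Debug variables mentioned in the post above.
    env["G_MESSAGES_DEBUG"] = "Tracker"
    env["TRACKER_DEBUG"] = "all"
    return ["tracker3", "extract", path], env


def run_extract(path):
    """Run the extractor on one file and return stdout and stderr as one string."""
    argv, env = extract_invocation(path)
    result = subprocess.run(
        argv, env=env, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    return result.stdout
```

Calling run_extract on a path taken from the tracker3 status output gives the same text you would see when running the command by hand with both variables set.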

Thanks for helping improve Tracker!

garnacho · October 7, 2020, 8:23am

Forgot to reply here. $HOME is indexed non-recursively by default, and hidden folders/files are skipped. There’s no risk of it going through all of your files, nor config/caches.

This is for good reason: imagine Tracker Miners monitoring its own database :).

felipehw · October 7, 2020, 12:55pm

A last-minute question.
If I have:

Enabled for indexing: “xdg-documents” and “xdg-pictures”.
The value for “xdg-documents” is /home/user/Documents and for “xdg-pictures” is /home/user/Documents/pictures.

Will Tracker produce a duplicate index of files at /home/user/Documents/pictures?

felipehw · October 7, 2020, 1:18pm

Many of the problematic images are GIFs from web pages saved with Firefox. Each saved page comes with a *_files folder for its assets.

Could I use something like wildcards to avoid the indexing of *_files dirs?
For example:

$ gsettings get org.freedesktop.Tracker.Miner.Files ignored-directories-with-content
['.trackerignore', '.git', '.hg', '.nomedia', '*_files' ]

garnacho · October 7, 2020, 1:45pm

There will be a single copy of the indexed data. Tracker-miner-fs should also manage nested indexed folders properly.

felipehw:

Could I use something like wildcards to avoid the indexing of *_files dirs?
For example:
$ gsettings get org.freedesktop.Tracker.Miner.Files ignored-directories-with-content
['.trackerignore', '.git', '.hg', '.nomedia', '*_files' ]

Yup, those settings allow the * and ? wildcards, as per GPatternSpec. However, you probably want ignored-directories: ignored-directories-with-content ignores the parent directory of the matching file altogether. I assume you just want to skip *_files, not e.g. the whole downloads directory.
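
The difference between the two settings can be modelled in a few lines. This is a simplified Python sketch, not Tracker's implementation: the function names are hypothetical, and fnmatchcase only approximates GPatternSpec (it also accepts [seq] classes, which GPatternSpec does not).

```python
from fnmatch import fnmatchcase


def matches_any(name, patterns):
    """True if name matches at least one glob pattern (case-sensitive)."""
    return any(fnmatchcase(name, pattern) for pattern in patterns)


def skipped_by_ignored_directories(dir_name, patterns):
    # ignored-directories: the pattern is compared with the directory's own name.
    return matches_any(dir_name, patterns)


def skipped_by_ignored_directories_with_content(entries, patterns):
    # ignored-directories-with-content: the directory is skipped as a whole
    # when any entry inside it matches.
    return any(matches_any(entry, patterns) for entry in entries)


downloads = ["article.html", "article_files"]
print(skipped_by_ignored_directories("article_files", ["*_files"]))
print(skipped_by_ignored_directories("Downloads", ["*_files"]))
print(skipped_by_ignored_directories_with_content(downloads, ["*_files"]))
```

This prints True, False, True: with ignored-directories only article_files is skipped, while the same pattern in ignored-directories-with-content would drop the whole directory that contains it.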

Filing a Tracker Miners bug for those GIF files is nonetheless still appreciated. These files shouldn’t fail, unless corrupted.

felipehw · October 7, 2020, 9:56pm

There are just a bunch of errors in my Tracker log now … The majority are with ISOs of Sony consoles.

About an error in a seemingly OK PDF file:

$ tracker3 status pdf
URI: file:///var/mnt/home/home/dread/Documents/estudos/Hist%C3%B3ria%20-%20Mestrado%20-%20UFSC%20(2018-2019)/2018.2%20-%202%20semestre/disciplinas/HST%203333003%20-%20Est%C3%A1gio%20Doc%C3%AAncia/HST%207602%20->
Message: SparqlTimeSort helper: Failed time string conversion (strerror of errno (not necessarily related): No such file or directory)
SPARQL: INSERT DATA {  GRAPH tracker:FileSystem {    <file:///var/mnt/home/home/dread/Documents/estudos/Hist%C3%B3ria%20-%20Mestrado%20-%20UFSC%20(2018-2019)/2018.2%20-%202%20semestre/disciplinas/HST%203333003%>
GRAPH <http://tracker.api.gnome.org/ontology/v3/tracker#Documents> {
  <urn:contact:Francisco%20Carlos%20Teixeira%20da%20Silva> nco:fullname ?nco_fullname } };
DELETE WHERE {
GRAPH <http://tracker.api.gnome.org/ontology/v3/tracker#Documents> {
  <urn:contact:Francisco%20Carlos%20Teixeira%20da%20Silva> rdf:type ?rdf_type } };
INSERT DATA {
GRAPH <http://tracker.api.gnome.org/ontology/v3/tracker#Documents> {
<urn:contact:Francisco%20Carlos%20Teixeira%20da%20Silva> a nco:Contact ; 
  nco:fullname "Francisco Carlos Teixeira da Silva" .
<file:///var/mnt/home/home/dread/Documents/estudos/Hist%C3%B3ria%20-%20Mestrado%20-%20UFSC%20(2018-2019)/2018.2%20-%202%20semestre/disciplinas/HST%203333003%20-%20Est%C3%A1gio%20Doc%C3%AAncia/HST%207602%20-%20H>
_:608065 a nfo:PaginatedTextDocument , nfo:PaginatedTextDocument ; 
  nco:creator <urn:contact:Francisco%20Carlos%20Teixeira%20da%20Silva> ; 
  nie:description "" ; 
  nie:mimeType "application/pdf" ; 
  nie:isStoredAs <file:///var/mnt/home/home/dread/Documents/estudos/Hist%C3%B3ria%20-%20Mestrado%20-%20UFSC%20(2018-2019)/2018.2%20-%202%20semestre/disciplinas/HST%203333003%20-%20Est%C3%A1gio%20Doc%C3%AAncia/H>
  nfo:pageCount 41 ; 
  nie:plainTextContent "" ; 
  nie:title "Crise da ditadura militar e o processo de abertura política no Brasil, 1974-1985" ; 
  nie:contentCreated "0100-12-31T21:00:00-03:00" .
}
};

One of the errors with ISOs:

[Report]
Uri=file:///var/mnt/home/home/dread/Documents/games/consoles/Sony%20-%20PSP/roms/Tactics%20Ogre:%20Let%20Us%20Cling%20Together/Tactics%20Ogre%20patched.iso
Message=Could not get any metadata for uri:'file:///var/mnt/home/home/dread/Documents/games/consoles/Sony%20-%20PSP/roms/Tactics%20Ogre:%20Let%20Us%20Cling%20Together/Tactics%20Ogre%20patched.iso' and mime:'application/x-cd-image'

$ tracker3 extract Tactics\ Ogre*
file:///var/mnt/home/home/dread/Documents/games/consoles/Sony%20-%20PSP/roms/Tactics%20Ogre:%20Let%20Us%20Cling%20Together/Tactics%20Ogre.iso: No metadata or extractor modules found to handle this file
file:///var/mnt/home/home/dread/Documents/games/consoles/Sony%20-%20PSP/roms/Tactics%20Ogre:%20Let%20Us%20Cling%20Together/Tactics%20Ogre%20patched.iso: No metadata or extractor modules found to handle this file

felipehw · October 7, 2020, 10:57pm

Is it normal that tracker search and tracker3 search give radically different numbers of results?
E.g.:

$ tracker search "Word1" "Word2" --disable-snippets | wc -l
80
$ tracker3 search "Word1" "Word2" --disable-snippets | wc -l
5

In my tests (with the AND operator), tracker search (Tracker 2) finds a lot of PDFs, while tracker3 search doesn’t find this content inside PDFs.

Ah, the number of indexed files is different too!

$ tracker status
Currently indexed: 106524 files, 9570 folders
Remaining space on database partition: 17,3 GB (39,62%)
All data miners are idle, indexing complete

$ tracker3 status
Currently indexed: 96954 files, 9570 folders
Remaining space on database partition: 17,3 GB (39,62%)
All data miners are idle, indexing complete
8 recorded failures
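
The gap between the two counts can be worked out from the outputs above. A small Python sketch; the helper name indexed_counts is hypothetical, and the regular expression assumes the "Currently indexed" line format shown in both outputs.

```python
import re

STATUS_RE = re.compile(r"Currently indexed: (\d+) files, (\d+) folders")


def indexed_counts(status_output):
    """Return (files, folders) parsed from `tracker status` style output."""
    match = STATUS_RE.search(status_output)
    if match is None:
        raise ValueError("no 'Currently indexed' line found")
    return int(match.group(1)), int(match.group(2))


# The two status lines quoted above.
tracker2_status = "Currently indexed: 106524 files, 9570 folders"
tracker3_status = "Currently indexed: 96954 files, 9570 folders"

files2, folders2 = indexed_counts(tracker2_status)
files3, folders3 = indexed_counts(tracker3_status)
print(files2 - files3)  # 9570
```

The difference of 9570 equals the folder count reported by both versions. That may hint that the two versions count folders differently rather than that files went missing, but nothing in this thread confirms it.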

sthursfield · October 9, 2020, 9:04am

Is it normal that tracker search and tracker3 search give radically different numbers of results?

One change that is probably relevant is Limit the types of plain text that we index (!217) · Merge requests · GNOME / LocalSearch · GitLab

Tracker 3 is more conservative about indexing text/plain files because this reduces the risk of indexing huge sets of log files, data files, source code etc. and the corresponding high resource consumption that can cause.

There is a new config key which specifies allowed text/plain content:

$ gsettings get org.freedesktop.Tracker3.Extract text-allowlist
['*.txt', '*.md', '*.mdwn']
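
To see which text/plain files the default allowlist above would let through, here is a rough Python sketch. It assumes the patterns are glob-matched against the file's base name; the function name plain_text_allowed is hypothetical and this is not the extractor's actual code.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Default value shown by gsettings above.
DEFAULT_ALLOWLIST = ["*.txt", "*.md", "*.mdwn"]


def plain_text_allowed(path, allowlist=DEFAULT_ALLOWLIST):
    """True if the file's base name matches one of the allowlist patterns."""
    name = PurePosixPath(path).name
    return any(fnmatchcase(name, pattern) for pattern in allowlist)
```

Under this model notes.txt and README.md pass, while a log file such as app.log or a source file such as main.c is left out unless a matching pattern is added to the key.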

I’m open to extending the default here within reason, if there are more plain text formats that are usually worth indexing.

felipehw · October 9, 2020, 1:55pm

The files not found by tracker3 search aren’t text/plain; they are PDFs that contain Word1 and Word2 in their text content, not in the filename.

sthursfield · October 9, 2020, 2:06pm

OK, I notice you posted already about an error with PDFs. The log shows that the extractor is generating invalid SPARQL, specifically here:

<file:///var/mnt/home/home/dread/Documents/estudos/Hist%C3%B3ria%20-%20Mestrado%20-%20UFSC%20(2018-2019)/2018.2%20-%202%20semestre/disciplinas/HST%203333003%20-%20Est%C3%A1gio%20Doc%C3%AAncia/HST%207602%20-%20H>
_:608065 a nfo:PaginatedTextDocument , nfo:PaginatedTextDocument ;

Could you open an issue about this? It would be amazing if you could also attach one of the PDFs that trigger this issue so we can add it to the testsuite, but if that’s not possible or practical then don’t worry.

I expect this is the reason that you’re seeing fewer results from PDFs than before.

felipehw · October 9, 2020, 3:01pm

I opened an issue about the “problematic” PDF and attached it: https://gitlab.gnome.org/GNOME/tracker-miners/-/issues/146.
BTW: the PDFs that are missing from results aren’t reported as problematic by tracker3 status.

sthursfield · October 9, 2020, 6:42pm

Thanks!

the PDFs that are missing from results aren’t reported as problematic by tracker3 status.

That may be because the errors are missing. Or because there’s another issue at work. Try running tracker3 info on a file that doesn’t have any errors reported. Do you see the expected nie:plainTextContent output listed there? If not, the problem is that we can’t get the text. If yes, the problem is somewhere else.
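
This check can be automated over saved tracker3 info output. A minimal Python sketch, assuming each property is printed on a line of the form 'key' = 'value'; the function name has_plain_text_content is hypothetical.

```python
def has_plain_text_content(info_output):
    """True if the output lists a non-empty nie:plainTextContent property."""
    for line in info_output.splitlines():
        key, sep, value = line.strip().partition(" = ")
        # Only the line carrying the property name is inspected; a value
        # that continues on later lines still counts as non-empty here.
        if sep and key.strip("'") == "nie:plainTextContent" and value.strip("'"):
            return True
    return False
```

A False result for a PDF with no recorded failure points at text extraction, while True means the text was stored and the search problem lies elsewhere.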

felipehw · October 9, 2020, 7:26pm

I don’t see the expected nie:plainTextContent output!

$ tracker3 info /../18.pdf
Querying information for entity: '/../18.pdf'
  'file:///.../18.pdf'
Results:
  'tracker:extractorHash' = '9f3e4118b613f560ccdebcee36846f09695c584997fa626eb72d556f8470697f'
  'nfo:fileLastModified' = '2015-12-18T20:29:14Z'
  'nfo:fileLastModified' = '2015-12-18T20:29:14Z'
  'nfo:fileName' = '18.pdf'
  'nfo:fileSize' = '841006'
  'nfo:belongsToContainer' = 'urn:bnode:14181c08-299e-4069-b0e3-8532ab841cea'
  'nfo:fileLastAccessed' = '2018-08-30T04:47:45Z'
  'nie:isPartOf' = 'urn:bnode:14181c08-299e-4069-b0e3-8532ab841cea'
  'nie:interpretedAs' = 'urn:bnode:2ac49695-3d4a-4b13-998b-3563bf320e98'
  'nie:dataSource' = 'urn:bnode:fa6e734f-2c36-4d9b-88f2-6a534505dcb9'
  'nie:byteSize' = '841006'
  'nie:url' = 'file:///../18.pdf'
  'http://purl.org/dc/elements/1.1/source' = 'urn:bnode:fa6e734f-2c36-4d9b-88f2-6a534505dcb9'
  'http://purl.org/dc/elements/1.1/date' = '2015-12-18T20:29:14Z'
  'http://purl.org/dc/elements/1.1/date' = '2015-12-18T20:29:14Z'
  'http://purl.org/dc/elements/1.1/date' = '2018-08-30T04:47:45Z'
  'nrl:modified' = '1449'
  'nrl:modified' = '1449'
  'nrl:added' = '2020-10-09T01:34:54Z'
  'nrl:added' = '2020-10-09T01:34:54Z'
  'rdf:type' = 'http://www.w3.org/2000/01/rdf-schema#Resource'
  'rdf:type' = 'http://tracker.api.gnome.org/ontology/v3/nie#DataObject'
  'rdf:type' = 'http://tracker.api.gnome.org/ontology/v3/nfo#FileDataObject'
  'rdf:type' = 'http://www.w3.org/2000/01/rdf-schema#Resource'
  'rdf:type' = 'http://tracker.api.gnome.org/ontology/v3/nie#DataObject'
  'rdf:type' = 'http://tracker.api.gnome.org/ontology/v3/nfo#FileDataObject'

sthursfield · October 12, 2020, 3:12pm

That’s good news, as it means the problem is indeed somewhere in the extractor.
