Tracker3 ignores custom 'text-allowlist' extensions list

Hello,

I’m running tracker3 version 3.1.2 on Arch Linux. It seems that tracker3 ignores custom 'text-allowlist' extensions and extracts plain text only from files matching the default patterns ['*.txt', '*.md', '*.mdwn'].

I added other extensions to the list:

[arch]$ gsettings get org.freedesktop.Tracker3.Extract text-allowlist
['*.txt', '*.tech', '*.diag', '*.log', '*.md', '*.mdwn']

But '*.tech', '*.diag' and '*.log' are completely ignored, and plain text is not extracted from those files. Here is an example with two files from the same directory, one .txt and one .log; plain text has been extracted only from the .txt one:

[arch]$ ls -la *log*
-rw-r--r-- 1 ivan ivan  22008 Apr 14 17:22 log-ToR001.txt
-rw-r--r-- 1 ivan ivan 676903 Apr 14 17:23 putty-log_TOR001.log

[arch]$ xdg-mime query filetype log-ToR001.txt 
text/plain
[arch]$ file -b --mime-type log-ToR001.txt 
text/plain
[arch]$ xdg-mime query filetype putty-log_TOR001.log 
text/x-log
[arch]$ file -b --mime-type putty-log_TOR001.log 
text/plain

[arch]$ tracker3 info -c log-ToR001.txt
Querying information for entity: 'log-ToR001.txt'
  'file:///home/ivan/OneDrive/Cases/2021-04/5354807557/log-ToR001.txt'
Results:
  'tracker:extractorHash' = 'd35fd368fe97892c95134d493a67d39834817454eec787cddc36b8e1ca5612c3'
  'nfo:fileLastModified' = '2021-04-14T14:22:54Z'
  'nfo:fileLastModified' = '2021-04-14T14:22:54Z'
  'nfo:fileName' = 'log-ToR001.txt'
  'nfo:fileName' = 'log-ToR001.txt'
  'nfo:fileSize' = '22008'
  'nfo:belongsToContainer' = 'urn:bnode:aecc5093-b7bf-4584-8d66-c8ba10830e03'
  'nfo:fileLastAccessed' = '2021-06-24T16:00:18Z'
  'nie:isPartOf' = 'urn:bnode:aecc5093-b7bf-4584-8d66-c8ba10830e03'
  'nie:interpretedAs' = 'urn:bnode:193af09e-2a44-47ce-845a-4bb47152e4ef'
  'nie:dataSource' = 'urn:bnode:306154d9-03d5-463a-aa26-2361f226b268'
  'nie:byteSize' = '22008'
  'nie:url' = 'file:///home/ivan/OneDrive/Cases/2021-04/5354807557/log-ToR001.txt'
  'http://purl.org/dc/elements/1.1/source' = 'urn:bnode:306154d9-03d5-463a-aa26-2361f226b268'
  'http://purl.org/dc/elements/1.1/date' = '2021-04-14T14:22:54Z'
  'http://purl.org/dc/elements/1.1/date' = '2021-04-14T14:22:54Z'
  'http://purl.org/dc/elements/1.1/date' = '2021-06-24T16:00:18Z'
  'nrl:modified' = '436'
  'nrl:modified' = '436'
  'nrl:added' = '2021-06-24T18:46:06Z'
  'nrl:added' = '2021-06-24T18:46:06Z'
  'rdf:type' = 'http://www.w3.org/2000/01/rdf-schema#Resource'
  'rdf:type' = 'http://tracker.api.gnome.org/ontology/v3/nie#DataObject'
  'rdf:type' = 'http://tracker.api.gnome.org/ontology/v3/nfo#FileDataObject'
  'rdf:type' = 'http://www.w3.org/2000/01/rdf-schema#Resource'
  'rdf:type' = 'http://tracker.api.gnome.org/ontology/v3/nie#DataObject'
  'rdf:type' = 'http://tracker.api.gnome.org/ontology/v3/nfo#FileDataObject'

[arch]$ tracker3 info -c putty-log_TOR001.log
Querying information for entity: 'putty-log_TOR001.log'
  'file:///home/ivan/OneDrive/Cases/2021-04/5354807557/putty-log_TOR001.log'
Results:
  'nfo:fileLastModified' = '2021-04-14T14:23:07Z'
  'nfo:fileName' = 'putty-log_TOR001.log'
  'nfo:fileSize' = '676903'
  'nfo:belongsToContainer' = 'urn:bnode:aecc5093-b7bf-4584-8d66-c8ba10830e03'
  'nfo:fileLastAccessed' = '2021-06-24T16:00:18Z'
  'nie:isPartOf' = 'urn:bnode:aecc5093-b7bf-4584-8d66-c8ba10830e03'
  'nie:dataSource' = 'urn:bnode:306154d9-03d5-463a-aa26-2361f226b268'
  'nie:byteSize' = '676903'
  'nie:url' = 'file:///home/ivan/OneDrive/Cases/2021-04/5354807557/putty-log_TOR001.log'
  'http://purl.org/dc/elements/1.1/source' = 'urn:bnode:306154d9-03d5-463a-aa26-2361f226b268'
  'http://purl.org/dc/elements/1.1/date' = '2021-04-14T14:23:07Z'
  'http://purl.org/dc/elements/1.1/date' = '2021-06-24T16:00:18Z'
  'nrl:modified' = '70'
  'nrl:added' = '2021-06-24T18:46:06Z'
  'rdf:type' = 'http://www.w3.org/2000/01/rdf-schema#Resource'
  'rdf:type' = 'http://tracker.api.gnome.org/ontology/v3/nie#DataObject'
  'rdf:type' = 'http://tracker.api.gnome.org/ontology/v3/nfo#FileDataObject'

[arch]$ tracker3 extract putty-log_TOR001.log 

[arch]$ tracker3 extract log-ToR001.txt 
@prefix nie: <http://tracker.api.gnome.org/ontology/v3/nie#> .
@prefix nfo: <http://tracker.api.gnome.org/ontology/v3/nfo#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

<file:///home/ivan/OneDrive/Cases/2021-04/5354807557/log-ToR001.txt> nie:plainTextContent "2021-04-12T14:>
  a nfo:PlainTextDocument .

Any thoughts on how to force tracker3 to extract plain text from files whose extensions differ from the default ones?

Thank you!

There is one additional filter on the mimetypes being extracted, in /usr/share/tracker3-miners/extract-rules/15-text.rule:

[ExtractorRule]
ModulePath=libextract-text.so
MimeTypes=text/plain;text/markdown
FallbackRdfTypes=nfo:Document;nfo:PlainTextDocument;
Graph=tracker:Documents
Hash=d35fd368fe97892c95134d493a67d39834817454eec787cddc36b8e1ca5612c3

As seen in your output, the .log files get a distinct text/x-log mimetype that is not covered by this file. I suppose the other files also have distinct mimetypes that throw the plain text extractor off.
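
To check this for other files, a small shell helper can compare a mimetype against the MimeTypes= line of a rule file. This is only a sketch: rule_covers_mime is a made-up name, and it does exact string matching on the listed types, which is an assumption and may be stricter than whatever matching Tracker itself applies.

```shell
# Hypothetical helper (not part of Tracker): succeeds if the rule file's
# MimeTypes= line lists the given mimetype exactly.
rule_covers_mime() {
    rule_file=$1
    mime=$2
    # An empty mimetype can never be covered.
    [ -n "$mime" ] || return 1
    # Print the value after "MimeTypes=", split it on ';' into one type
    # per line, then look for an exact, literal, whole-line match.
    sed -n 's/^MimeTypes=//p' "$rule_file" | tr ';' '\n' | grep -qxF -- "$mime"
}
```

Run against the rule file shown above, `rule_covers_mime /usr/share/tracker3-miners/extract-rules/15-text.rule text/x-log` would fail, while the same call with text/plain would succeed. The mimetype to pass in is the one reported by `xdg-mime query filetype` for the file in question.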

With the configuration setting in place, it could make sense to consider generalizing the extractor to let additional mimetypes through. Please file a tracker-miners issue for that.

Bingo, that was it! I modified the file '/usr/share/tracker3-miners/extract-rules/15-text.rule', adding 'text/x-log' to 'MimeTypes', and Tracker3 successfully indexed the contents of all *.log files:

[ExtractorRule]
ModulePath=libextract-text.so
MimeTypes=text/plain;text/markdown;text/x-log
FallbackRdfTypes=nfo:Document;nfo:PlainTextDocument;
Graph=tracker:Documents
Hash=d35fd368fe97892c95134d493a67d39834817454eec787cddc36b8e1ca5612c3
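
For anyone who wants to script the same edit, here is a rough sketch. The function name add_mime_to_rule is invented for illustration, and it assumes the rule file has a single MimeTypes= line like the one above. Writing under /usr/share needs root, and since that file belongs to the distribution package, the change may well be overwritten by a package update.

```shell
# Hypothetical helper (not part of Tracker): append a mimetype to the
# MimeTypes= line of a rule file, unless it is already listed.
add_mime_to_rule() {
    rule_file=$1
    mime=$2
    # Nothing to do if the type is already present (exact match).
    if sed -n 's/^MimeTypes=//p' "$rule_file" | tr ';' '\n' | grep -qxF -- "$mime"; then
        return 0
    fi
    tmp=$(mktemp) || return 1
    # On the MimeTypes= line only: drop trailing semicolons, then
    # append ";<mime>". All other lines are copied through unchanged.
    if sed "/^MimeTypes=/ { s/;*\$//; s|\$|;$mime|; }" "$rule_file" > "$tmp"; then
        cat "$tmp" > "$rule_file"
        status=$?
    else
        status=1
    fi
    rm -f "$tmp"
    return $status
}
```

Calling `add_mime_to_rule 15-text.rule text/x-log` on a copy of the original rule file should turn its MimeTypes= line into the one shown above, and calling it a second time leaves the file unchanged.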

Thank you!

Here’s an issue: https://gitlab.gnome.org/GNOME/tracker-miners/-/issues/180
