Mac Spotlight indexing on SMB share with FSCrawler and Elasticsearch

rob_robot · September 30, 2020

This guide is based on the Samba WIKI Spotlight with Elasticsearch backend: https://wiki.samba.org/index.php/Spotlight_with_Elasticsearch_Backend

The goal of this project is to use the Mac finder to search SMB shares from Mac clients. The provided solutions gives us an index based full text search, something that I've been waiting for a long time. Recently added extensions in SAMBA finally made this possible.

To begin with I want to say that I'm nether an UNRAID nor docker expert, so please forgive me if there are better ways to solve this, but my solution seems to be working for now:

To realise this we need firstly the lastest beta of Unraid (6.9.0-beta25 at time of writing) as we need SAMBA 4.12 to use this feature. This revision of SAMBA is only shipped with unraid 4.12, therefore we need to install the beta first.

We as well need an Elasticsearch docker container that will work as the search backend. Then an FSCrawler docker that will crawl the data on a regular basis and feed the results to Elasticsearch that will then create the index. Lastly, enable the SAMBA settings for Spotlight search support.

The high-level interaction of the tools looks like this:

FSCrawler <-------------- DATA directory (SMB share)

|

| (sends data periodically and tracks changes in data directory)

|

Elasticsearch --------------> index <---------- SAMBA (4.12) <--------- Finder Spotlight search

Steps:

1.) install Elasticsearch

I used 7.9.1 from community applications

2.) Install Elasticsearch ingest plugin to search PDF and doc as described here:

Download the user scripts plug-in and define the script as follows:

#!/bin/bash

# execute command inside container
docker exec -i elasticsearch /usr/share/elasticsearch/bin/elasticsearch-plugin install --batch ingest-attachment

3.) Install FSCrawler

If you go to Settings in the Community Applications and allow additional search results from DockerHub, you can install a version of FSCrawler (I used toto1319/fscrawler, version 2.7):

https://hub.docker.com/r/toto1310/fscrawler/

In the template, you need to set the config and data directories. The data directory mount point in FSCrawler needs to match the real mount point in unraid as this path is written into the Elasticsearch index later on, and then needs to be valid for SAMBA to read it. I used /mnt/user/ to be able to search all shares later on.

image.png.3e72be2b537a2276f7ce58fce3af94d3.png

To start the docker, the following post argument needs to be added (turn on advanced mode in the template):

Post Arguments: fscrawler job_name --restart

The option "--restart" causes a full re-index of the whole share. This option is only needed for the first execution of the crawler, later on this option can be removed so that the crawler only monitors the data directory for changes and feeds these into the Elasticsearch index.

After the first run, FSCrawler creates a _settings.yaml file under /mnt/user/appdata/fscrawler/config/job_name/_settings.yaml

This file needs to be edited. I have the following content. Please change the IP for your Elasticsearch interface and add the excludes that you do not want to be crawled. The URL needs to match your mount point as this will serve as the "root" directory.

---
name: "job_name"
fs:
  url: "/mnt/user"
  update_rate: "15m"
  excludes:
  - "*/~*"
  - "/appdata/*"
  - "/domains/*"
  - "/isos/*"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: false
  ocr:
    language: "eng"
    enabled: false
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
elasticsearch:
  nodes:
  - url: "http://192.168.xxx.xxx:9200"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"

FSCrawler should now start crawling the data and create 2 indices (one for the folders and one for the files) under:

/mnt/user/appdata/elasticsearch/data/nodes/0/indices

For more information on FSCrawler, have a look at FSCrawler documentation https://hub.docker.com/r/toto1310/fscrawler/

4.) Configure SAMBA to enable spotlight. I have inserted this in the unraid Settings > SMB > SMB Extras section:

We need to add parameters in the global and individual share section. To do this you can add to the Samba extra configuration file the following. Please replace share with your share name:

[global]
# Settings to enable spotlight search
spotlight backend = elasticsearch
elasticsearch:address = 192.168.xxx.xxx
elasticsearch:port = 9200
elasticsearch:use tls = 0
#enable spotlight search in share
[share]
path = /mnt/user/share
spotlight = yes

Restart SAMBA (or the server) .

5.) Enjoy searching in Finder with Spotlight (share needs to be selected in finder).

6.) Background information:

Spotlight is accessing the Index with specific search queries. SAMBA has for this purpose a mapping file that translates Elasticsearch attributes to the Spotlight queries. I have not changed this mapping file, but it can be found here for reference:

/usr/share/samba/mdssvc/elasticsearch_mappings.json

There is also another mapping file that FSCrawler uses when creating the Elasticsearch index. This mapping can be found here if Elasticsearch 7.x is used. Also this mapping file was not modified by me:

/mnt/user/appdata/fscrawler/config/_default/7/_settings.json

7.) Testing:

List Elasticsearch indices on server (replace localhost with server IP):

curl http://localhost:9200/_aliases?pretty=true

List all content of index job_name_folder

curl -H 'Content-Type: application/json' -X GET http://192.168.xxx.xxx:9200/job_name_folder/_search?pretty

List all content of index job_name

curl -H 'Content-Type: application/json' -X GET http://192.168.xxx.xxx:9200/job_name/_search?pretty

Test if Samba search is working: (replace your user name with username), IP address and select a search string

mdfind -d=8 --user=username 192.168.xxx.xxx share 'kMDItemTextContent=="searchstring"'

8.) References:

Samba 4.12 release notes:

https://www.samba.org/samba/history/samba-4.12.0.html

Samba mdfind

https://www.samba.org/samba/docs/4.12/man-html/mdfind.1.html

fscrawler docker package:

https://hub.docker.com/r/toto1310/fscrawler

Edited September 30, 2020 by rob_robot

rob_robot · September 30, 2020

reserved

CuFk · October 21, 2020

Does this help speeding up folder browsing as well? We're having serious issues whilst browsning folders containing large amounts of hi-res photos etc.

Toskache · November 25, 2020

Nice Tutorial, thank you!

Unfortunately I have problems to setup fscrawler:

The docker configuration:

image.png.704ba4fc989da8b3b08e29870b52f5da.png

Starting the docker shows the following output in the docker-log:

16:46:01,584 [32mINFO [m [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [226.5mb/3.4gb=6.38%], RAM [4.7gb/15.6gb=30.16%], Swap [0b/0b=0.0].
16:46:01,611 [33mWARN [m [f.p.e.c.f.c.FsCrawlerCli] job [job_name] does not exist
16:46:01,611 [32mINFO [m [f.p.e.c.f.c.FsCrawlerCli] Do you want to create it (Y/N)?
Exception in thread "main" java.util.NoSuchElementException
at java.util.Scanner.throwFor(Scanner.java:862)
at java.util.Scanner.next(Scanner.java:1371)
at fr.pilato.elasticsearch.crawler.fs.cli.FsCrawlerCli.main(FsCrawlerCli.java:225)

And indeed, there is no directory "job_name":

root@nas:/mnt/user/appdata/fscrawler/config# ls -lah
total 0
drwxrwxrwx 1 nobody users 16 Nov 25 16:44 ./
drwxrwxrwx 1 root   root  12 Nov 25 09:32 ../
drwxr-xr-x 1 root   root   4 Nov 25 09:33 _default/

Creating that directory manualy has no effect.

Any ideas?

HagenS · March 7, 2021

Same error here for me when starting fscrawler. Any tips or is this tutorial orphaned?

ungeek67 · March 7, 2021

5 hours ago, HagenS said:

Same error here for me when starting fscrawler. Any tips or is this tutorial orphaned?

I got past this back in 6.9.0 rc2 but then stuck later on, was actually waiting for 6.9 to start this thread back up!

Make sure your settings json file is placed correctly, the container is starting with the arguments for "job_name" but no folder and/or config exists and you can't interactively send the "y"

I used the job name "unraid_data_spotlight" so my config looks like:

fscrawler docker post arguments (--restart commented out after first run)

fscrawler unraid_data_spotlight #--restart

/mnt/user/appdata/fscaler/unraid_data_spotlight/_settings.json

---
name: "unraid_data_spotlight"
fs:
  url: "/mnt/user"
  update_rate: "15m"
...

I think thats all I did to get it moving awhile back, let me know if that doesn't help and I'll blow away my current setup and do it again and actually take notes this time.

My issue is that all the tests are passing, including mdfind which is returning the expected results. But when I then try to use spotlight on macOS Big Sur I get either nothing.

mdutil -s /Volumes/media returns "Server search enabled" as expected.

Adding "elasticsearch:index = unraid_data_spotlight" to the samba extra config under [global] hasn't helped either.

Anyone get beyond this?

Edited March 7, 2021 by ungeek67

tarzan · March 9, 2021

Im super excited about this.. hope it will work when I update to 6.9 in a few weeks :)

rob_robot · March 10, 2021

On 3/7/2021 at 10:01 AM, HagenS said:

Same error here for me when starting fscrawler. Any tips or is this tutorial orphaned?

It is a bit like a chicken and egg problem. The file should get created after the first run, but after this time has passed I don't remember if I manually added the file or if I copied it from inside the docker (so not mapping the config file at all and then copying the file outside of the docker via docker command).

One way would be to manually create the file:

1.) Go to /mnt/user/appdata/fscrawler/config/ and create the folder "job_name" (permissions 999, root / root)

2.) Inside the new job_name folder, create a file called _settings.yaml and paste the content from my initial post. Please make sure to change the IP. address at the bottom of the file (- url). Later on there will be as well a 2nd file called _status.json, but I don't think this is needed initially.

image.png.17cf08d84e4aee7d7a69175455deda3b.png

parazit15 · March 10, 2021

Hi guys,

i am trying to get fscrawler to index my shares, but after a couple of hours, i always get this error and the docker stops.

# A fatal error has been detected by the Java Runtime Environment:

Anybody has an idea, what the error could be?

rob_robot · March 22, 2021

I didn't encounter this issue as far as I remember. Could it be some memory size issue? Is this the only error or are the additional error messages in the log file?

ankx7 · May 17, 2021

Hi! thanks in advance for the article!

i've a problem: mdfind don't find anything

my samba version is 4.12.3

The indexing is ok.

Command:

curl -H 'Content-Type: application/json' -X GET http://localhost:9200/myjob/_search?pretty

give correct result but mdfind dont find!

anyone similar problems?

my smb.conf

[global]
        workgroup = TESTSAMBA
        security = user
        netbios name  = REDHAT8
        passdb backend = tdbsam

        printing = cups
        printcap name = cups
        load printers = yes
        cups options = raw

        spotlight backend = elasticsearch
        elasticsearch:address = localhost
        elasticsearch:port = 9200

[testfolder]
        comment = folder di test
        path = /srv/samba/test
        valid users = testuser
        browseable = Yes
        read only = No
        spotlight = yes

thanks

ecat · June 14, 2021

On 5/17/2021 at 2:40 PM, ankx7 said:
Hi! thanks in advance for the article!

i've a problem: mdfind don't find anything

my samba version is 4.12.3

The indexing is ok.

Command:

curl -H 'Content-Type: application/json' -X GET http://localhost:9200/myjob/_search?pretty

give correct result but mdfind dont find!

anyone similar problems?

my smb.conf
[global]
        workgroup = TESTSAMBA
        security = user
        netbios name  = REDHAT8
        passdb backend = tdbsam

        printing = cups
        printcap name = cups
        load printers = yes
        cups options = raw

        spotlight backend = elasticsearch
        elasticsearch:address = localhost
        elasticsearch:port = 9200

[testfolder]
        comment = folder di test
        path = /srv/samba/test
        valid users = testuser
        browseable = Yes
        read only = No
        spotlight = yes
thanks

Jobname must be the same to share name, so change your 'myjob' to 'testfolder'

Bwx_Flo · October 20, 2021

Hi guys,

I did everything according to this tutorial and crawling seems to be working. But: I get an Error in the ES Docker Log, claiming:

{"type": "server", "timestamp": "2021-10-20T07:37:59,873+02:00", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "docker-cluster", "node.name": "911179884b65", "message": "master not discovered or elected yet, an election requires a node with id [tJO09zgcQSOaJvZadHyMXQ], have discovered [{911179884b65}{tJO09zgcQSOaJvZadHyMXQ}{FzZZFb76SSu6dgTTsjkWJw}{172.17.0.2}{172.17.0.2:9300}{dilmrt}{ml.machine_memory=67047288832, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}] which is a quorum; discovery will continue using [] from hosts providers and [{911179884b65}{tJO09zgcQSOaJvZadHyMXQ}{FzZZFb76SSu6dgTTsjkWJw}{172.17.0.2}{172.17.0.2:9300}{dilmrt}{ml.machine_memory=67047288832, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}] from last-known cluster state; node term 3, last-accepted version 48 in term 3", "cluster.uuid": "JqY853ThR_uPDSn3mURqJA", "node.id": "tJO09zgcQSOaJvZadHyMXQ" }

I guess what ES is saying is it is looking for a "master", but can't find a node with that ID. Question is: why would it search for that? How do I configure that and where does that specific ID come from? The same time that error occured I also get errors in FSCrawler, saying the directory he just crawled for about half an hour suddenly does not exist anymore.

04:24:27,517 [33mWARN [m [f.p.e.c.f.FsParserAbstract] Error while crawling /mnt/user/public: /mnt/user/public doesn't exists.

Can anybody make sense of this and maybe even help me fix it?

Thanks a lot for this tutorial and your help in advance!

Greetings from Germany,

Flo

Ralph456 · February 28, 2022

On 10/21/2020 at 8:15 AM, CuFk said:

Does this help speeding up folder browsing as well? We're having serious issues whilst browsning folders containing large amounts of hi-res photos etc.

GOOD QUESTION,

SO will this help browsing large raw files on mac os client?

ovcrash · August 23, 2023

Hi,

Did you guys get this working?

Elasticsearch is working, indexing and feeding the data into elasticsearch is working.

The samba configuration doesn't work. On my mac spotlight doesn't use the index, even if i use mdfind it doesn't find anything.

Mac Spotlight indexing on SMB share with FSCrawler and Elasticsearch

Recommended Posts

rob_robot

Link to comment

rob_robot

Link to comment

CuFk

Link to comment

Toskache

Link to comment

HagenS

Link to comment

ungeek67

Link to comment

tarzan

Link to comment

rob_robot

Link to comment

parazit15

Link to comment

rob_robot

Link to comment

ankx7

Link to comment

ecat

Link to comment

Bwx_Flo

Link to comment

Ralph456

Link to comment

ovcrash

Link to comment

Join the conversation