Mac Spotlight indexing on SMB share with FSCrawler and Elasticsearch


rob_robot


This guide is based on the Samba wiki page "Spotlight with Elasticsearch Backend": https://wiki.samba.org/index.php/Spotlight_with_Elasticsearch_Backend

 

The goal of this project is to use the Mac Finder to search SMB shares from Mac clients. The solution provides an index-based full-text search, something I've been waiting on for a long time. Recently added extensions in Samba finally make this possible.

To begin with, I want to say that I'm neither an Unraid nor a Docker expert, so please forgive me if there are better ways to solve this, but my solution seems to be working for now:

 

To realise this, we first need the latest beta of Unraid (6.9.0-beta25 at the time of writing), as we need Samba 4.12 to use this feature. This Samba revision only ships with the Unraid beta, therefore we need to install the beta first.

We also need an Elasticsearch Docker container that will serve as the search backend, then an FSCrawler Docker container that will crawl the data on a regular basis and feed the results to Elasticsearch, which in turn builds the index. Lastly, we enable the Samba settings for Spotlight search support.

 

The high-level interaction of the tools looks like this:

 

FSCrawler <-------------- DATA directory (SMB share)

      |

      | (sends data periodically and tracks changes in data directory)

      |

Elasticsearch -------------->  index  <---------- SAMBA (4.12) <--------- Finder Spotlight search

 

 

Steps:

 

1.) Install Elasticsearch

I used version 7.9.1 from Community Applications.
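
If you prefer setting the container up by hand instead of through the template, this is roughly what the template does (a minimal sketch; the container name, host port, and appdata path are assumptions matching the rest of this guide, not the exact template values):

# Single-node Elasticsearch 7.9.1, persisting its data to the array
docker run -d --name elasticsearch \
  -p 9200:9200 \
  -e "discovery.type=single-node" \
  -v /mnt/user/appdata/elasticsearch/data:/usr/share/elasticsearch/data \
  docker.elastic.co/elasticsearch/elasticsearch:7.9.1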

 

2.) Install the Elasticsearch ingest-attachment plugin so that PDF and Office documents can be searched.

Download the User Scripts plugin and define the script as follows:

 

#!/bin/bash

# execute command inside container
docker exec -i elasticsearch /usr/share/elasticsearch/bin/elasticsearch-plugin install --batch ingest-attachment
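
The plugin only becomes active after the Elasticsearch container is restarted. To verify it was installed, you can list the plugins (assuming the container is named elasticsearch as above):

docker exec -i elasticsearch /usr/share/elasticsearch/bin/elasticsearch-plugin list

# or via the REST API:
curl http://localhost:9200/_cat/plugins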

 

3.) Install FSCrawler

If you go to Settings in Community Applications and allow additional search results from DockerHub, you can install a version of FSCrawler (I used toto1310/fscrawler, version 2.7):

https://hub.docker.com/r/toto1310/fscrawler/

 

In the template, you need to set the config and data directories. The data directory mount point inside the FSCrawler container needs to match the real mount point in Unraid, as this path is written into the Elasticsearch index later on and must then be valid for Samba to read. I used /mnt/user/ to be able to search all shares later on.
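
For reference, the template settings correspond roughly to this docker run (a sketch; /root/.fscrawler as the in-container config path is an assumption based on FSCrawler's default settings directory):

docker run -d --name fscrawler \
  -v /mnt/user/appdata/fscrawler/config:/root/.fscrawler \
  -v /mnt/user:/mnt/user:ro \
  toto1310/fscrawler \
  fscrawler job_name --restart

The trailing "fscrawler job_name --restart" corresponds to the post arguments described below; the data share is mounted read-only since FSCrawler only reads it.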

 


 

To start the Docker container, the following post argument needs to be added (turn on advanced mode in the template):

 

Post Arguments: fscrawler job_name --restart

 

The option "--restart" causes a full re-index of the whole share. This option is only needed for the first execution of the crawler, later on this option can be removed so that the crawler only monitors the data directory for changes and feeds these into the Elasticsearch index. 

After the first run, FSCrawler creates a _settings.yaml file under /mnt/user/appdata/fscrawler/config/job_name/_settings.yaml

 

This file needs to be edited; mine has the following content. Please change the IP address to that of your Elasticsearch interface and add the excludes that you do not want crawled. The URL needs to match your mount point, as it serves as the "root" directory:

 

---
name: "job_name"
fs:
  url: "/mnt/user"
  update_rate: "15m"
  excludes:
  - "*/~*"
  - "/appdata/*"
  - "/domains/*"
  - "/isos/*"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: false
  ocr:
    language: "eng"
    enabled: false
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
elasticsearch:
  nodes:
  - url: "http://192.168.xxx.xxx:9200"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"

 

FSCrawler should now start crawling the data and create two indices (one for the folders and one for the files) under:

/mnt/user/appdata/elasticsearch/data/nodes/0/indices
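
You can also confirm that the two indices exist by querying Elasticsearch directly instead of looking at the data directory (standard _cat API; replace the IP with yours):

curl http://192.168.xxx.xxx:9200/_cat/indices?v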

 

For more information on FSCrawler, have a look at the documentation linked from the Docker Hub page: https://hub.docker.com/r/toto1310/fscrawler/

 

4.) Configure Samba to enable Spotlight. We need to add parameters in the global section and in each share's section. To do this, add the following to Unraid Settings > SMB > SMB Extras. Please replace "share" with your share name:

 

[global]
# Settings to enable spotlight search
spotlight backend = elasticsearch
elasticsearch:address = 192.168.xxx.xxx
elasticsearch:port = 9200
elasticsearch:use tls = 0
# Enable Spotlight search in the share
[share]
path = /mnt/user/share
spotlight = yes

 

Restart Samba (or the server).
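
To check that Samba picked up the new parameters, you can dump the effective configuration with testparm, which ships with Samba (-s skips the interactive prompt):

testparm -s | grep -i spotlight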

 

5.) Enjoy searching in Finder with Spotlight (the share needs to be selected in Finder).

 

 

6.) Background information:

Spotlight accesses the index with specific search queries. For this purpose, Samba has a mapping file that translates Elasticsearch attributes to Spotlight queries. I have not changed this mapping file, but it can be found here for reference:

/usr/share/samba/mdssvc/elasticsearch_mappings.json

 

There is another mapping file that FSCrawler uses when creating the Elasticsearch index. If Elasticsearch 7.x is used, it can be found here. I did not modify this mapping file either:

/mnt/user/appdata/fscrawler/config/_default/7/_settings.json

 

 

7.) Testing:

List Elasticsearch indices on server (replace localhost with server IP):

curl http://localhost:9200/_aliases?pretty=true

 

List all content of the index job_name_folder:

curl -H 'Content-Type: application/json' -X GET http://192.168.xxx.xxx:9200/job_name_folder/_search?pretty

 

List all content of the index job_name:

curl -H 'Content-Type: application/json' -X GET http://192.168.xxx.xxx:9200/job_name/_search?pretty

 

Test whether the Samba search is working (replace "username" with your user name, adjust the IP address, and choose a search string):

mdfind -d=8 --user=username 192.168.xxx.xxx share 'kMDItemTextContent=="searchstring"'
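
If mdfind returns nothing, it helps to first check whether the search string is findable in the index at all. FSCrawler stores the extracted text in the content field (per its default mapping), so a match query against it should return the file:

curl -H 'Content-Type: application/json' -X GET 'http://192.168.xxx.xxx:9200/job_name/_search?pretty' -d '{"query": {"match": {"content": "searchstring"}}}'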

 

 

8.) References:

Samba 4.12 release notes:

https://www.samba.org/samba/history/samba-4.12.0.html

 

Samba mdfind:

https://www.samba.org/samba/docs/4.12/man-html/mdfind.1.html

 

FSCrawler Docker package:

https://hub.docker.com/r/toto1310/fscrawler


Nice Tutorial, thank you!

Unfortunately, I have problems setting up FSCrawler.

The docker configuration:


 

Starting the docker shows the following output in the docker-log:

16:46:01,584 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [226.5mb/3.4gb=6.38%], RAM [4.7gb/15.6gb=30.16%], Swap [0b/0b=0.0].
16:46:01,611 WARN  [f.p.e.c.f.c.FsCrawlerCli] job [job_name] does not exist
16:46:01,611 INFO  [f.p.e.c.f.c.FsCrawlerCli] Do you want to create it (Y/N)?
Exception in thread "main" java.util.NoSuchElementException
at java.util.Scanner.throwFor(Scanner.java:862)
at java.util.Scanner.next(Scanner.java:1371)
at fr.pilato.elasticsearch.crawler.fs.cli.FsCrawlerCli.main(FsCrawlerCli.java:225)

And indeed, there is no directory "job_name":

root@nas:/mnt/user/appdata/fscrawler/config# ls -lah
total 0
drwxrwxrwx 1 nobody users 16 Nov 25 16:44 ./
drwxrwxrwx 1 root   root  12 Nov 25 09:32 ../
drwxr-xr-x 1 root   root   4 Nov 25 09:33 _default/

Creating that directory manually has no effect.

Any ideas? 

5 hours ago, HagenS said:

Same error here for me when starting fscrawler. Any tips or is this tutorial orphaned?

 

I got past this back on 6.9.0-rc2 but then got stuck later on; I was actually waiting for 6.9 to pick this thread back up!

 

Make sure your settings file is placed correctly: the container starts with the arguments for "job_name", but no such folder and/or config exists, and you can't interactively send the "y".

 

I used the job name "unraid_data_spotlight" so my config looks like:

 

FSCrawler Docker post arguments (--restart commented out after the first run):

fscrawler unraid_data_spotlight #--restart

 

/mnt/user/appdata/fscrawler/unraid_data_spotlight/_settings.yaml

---
name: "unraid_data_spotlight"
fs:
  url: "/mnt/user"
  update_rate: "15m"
...

 

 

I think that's all I did to get it moving a while back. Let me know if that doesn't help and I'll blow away my current setup, do it again, and actually take notes this time.

 

My issue is that all the tests are passing, including mdfind, which returns the expected results. But when I then try to use Spotlight on macOS Big Sur, I get nothing.

 

mdutil -s /Volumes/media returns "Server search enabled" as expected.

 

Adding "elasticsearch:index = unraid_data_spotlight" to the samba extra config under [global] hasn't helped either.

 

Anyone get beyond this?

 

 

On 3/7/2021 at 10:01 AM, HagenS said:

Same error here for me when starting fscrawler. Any tips or is this tutorial orphaned?

 

It is a bit of a chicken-and-egg problem. The file should get created after the first run, but after all this time I don't remember whether I manually added the file or copied it from inside the Docker container (i.e. not mapping the config file at all and then copying the file out of the container via a docker command).

 

One way would be to manually create the file (see the sketch after these steps):

1.) Go to /mnt/user/appdata/fscrawler/config/ and create the folder "job_name" (permissions 999, root / root)

2.) Inside the new job_name folder, create a file called _settings.yaml and paste the content from my initial post. Please make sure to change the IP address at the bottom of the file (- url). Later on there will also be a second file called _status.json, but I don't think this is needed initially.
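
A minimal sketch of these two steps on the Unraid console (adjust the job name to yours):

mkdir -p /mnt/user/appdata/fscrawler/config/job_name
# create the settings file and paste the YAML from the first post:
nano /mnt/user/appdata/fscrawler/config/job_name/_settings.yaml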

 


 

 


Hi! Thanks in advance for the article!

I have a problem: mdfind doesn't find anything.

My Samba version is 4.12.3, and the indexing is OK. The command:

curl -H 'Content-Type: application/json' -X GET http://localhost:9200/myjob/_search?pretty

gives the correct result, but mdfind doesn't find anything!

Anyone with similar problems?

My smb.conf:

[global]
        workgroup = TESTSAMBA
        security = user
        netbios name  = REDHAT8
        passdb backend = tdbsam

        printing = cups
        printcap name = cups
        load printers = yes
        cups options = raw

        spotlight backend = elasticsearch
        elasticsearch:address = localhost
        elasticsearch:port = 9200

[testfolder]
        comment = folder di test
        path = /srv/samba/test
        valid users = testuser
        browseable = Yes
        read only = No
        spotlight = yes

Thanks

On 5/17/2021 at 2:40 PM, ankx7 said:

I have a problem: mdfind doesn't find anything. [...] Anyone with similar problems?
 

The job name must be the same as the share name, so change your 'myjob' to 'testfolder'.
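
Alternatively, Samba's Elasticsearch backend also has an elasticsearch:index share option (mentioned earlier in this thread and documented in the smb.conf man page), so instead of renaming the job you could try pointing the share at the existing index. Untested on my side:

[testfolder]
        path = /srv/samba/test
        spotlight = yes
        elasticsearch:index = myjob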


Hi guys,

 

I did everything according to this tutorial and crawling seems to be working. But I get an error in the Elasticsearch Docker log:

 

{"type": "server", "timestamp": "2021-10-20T07:37:59,873+02:00", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "docker-cluster", "node.name": "911179884b65", "message": "master not discovered or elected yet, an election requires a node with id [tJO09zgcQSOaJvZadHyMXQ], have discovered [{911179884b65}{tJO09zgcQSOaJvZadHyMXQ}{FzZZFb76SSu6dgTTsjkWJw}{172.17.0.2}{172.17.0.2:9300}{dilmrt}{ml.machine_memory=67047288832, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}] which is a quorum; discovery will continue using [] from hosts providers and [{911179884b65}{tJO09zgcQSOaJvZadHyMXQ}{FzZZFb76SSu6dgTTsjkWJw}{172.17.0.2}{172.17.0.2:9300}{dilmrt}{ml.machine_memory=67047288832, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}] from last-known cluster state; node term 3, last-accepted version 48 in term 3", "cluster.uuid": "JqY853ThR_uPDSn3mURqJA", "node.id": "tJO09zgcQSOaJvZadHyMXQ" }

 

I guess what Elasticsearch is saying is that it is looking for a "master" but can't find a node with that ID. The question is: why would it search for that? How do I configure it, and where does that specific ID come from? At the same time that error occurred, I also got errors in FSCrawler saying that the directory it had just been crawling for about half an hour suddenly no longer exists.

 

04:24:27,517 WARN  [f.p.e.c.f.FsParserAbstract] Error while crawling /mnt/user/public: /mnt/user/public doesn't exists.

 

Can anybody make sense of this and maybe even help me fix it?

 

Thanks a lot for this tutorial and your help in advance!

 

Greetings from Germany,

Flo
