No index is currently used, the search relies on DuckDB's pattern matching capabilities with ILIKE. Therefore, no index is needed, and the data is always fresh. DuckDB's columnar storage makes this approach performant. However, it is true that performance will degrade gracefully as the dataset grows. Through testing, I found that you can handle quite a large dataset, over +100k and still maintain good performance.

I am considering introducing BM25 fts. However, this approach relies on indexes that would need to be rebuilt each time to ensure it is fresh. My initial idea for implementation is to make this optional through configuration and integrate it into the current search function. Something like: If the BM25 index exists, it will be used, otherwise, it will fall back to pattern matching.

Reply to this note

Please Login to reply.

Discussion

duckdb has indexes to implement these

ɪ ᴋɴᴏᴡ ᴛʜɪꜱ ʙᴇᴄᴀᴜꜱᴇ ɪ ʜᴀᴠᴇ ɪᴍᴘʟᴇᴍᴇɴᴛᴇᴅ ᴍᴀɴʏ ᴅɪꜰꜰᴇʀᴇɴᴛ ᴋɪɴᴅꜱ ɪɴᴄʟᴜᴅɪɴɢ ᴛᴡᴏ ɢʀᴀᴘʜ ᴠᴇʀᴛᴇx ᴛᴀʙʟᴇꜱ.

Not for I/LIKE as far I can understand from their docs. Seems they use art indexes but just for equality and IN conditions

ᴄᴏʟᴜᴍɴᴀʀ ᴛᴀʙʟᴇꜱ (ᴛʜᴇꜱᴇ ᴘʟᴀᴄᴇ ᴋᴇʏꜱ ᴀɴᴅ ᴠᴀʟᴜᴇꜱ ɪɴ ᴀ ᴅɪꜰꜰᴇʀᴇɴᴛ ᴛʏᴘᴇ ᴏꜰ ꜱᴛʀᴜᴄᴛᴜʀᴇ ᴛᴏ LSM ꜱᴛʏʟᴇ ᴜꜱᴇᴅ ɪɴ ʟᴇᴠᴇʟᴅʙ/ʟᴍᴅʙ/ʙᴀᴅɢᴇʀ ᴛʜᴀᴛ ꜱᴄᴀɴꜱ ɪɴ ᴀ ᴅɪꜰꜰᴇʀᴇɴᴛ ᴅɪʀᴇᴄᴛɪᴏɴ ᴀɴᴅ ᴄᴏᴍᴘᴀᴄᴛꜱ ᴅɪꜰꜰᴇʀᴇɴᴛʟʏ. ᴀꜱ ɪ ᴜɴᴅᴇʀꜱᴛᴀɴᴅ ɪᴛ, ᴛʜɪꜱ ᴍᴀᴋᴇꜱ ɢʀᴀᴘʜ-ʟɪᴋᴇ ǫᴜᴇʀɪᴇꜱ ᴀ ʙɪᴛ ᴍᴏʀᴇ ᴇꜰꜰɪᴄɪᴇɴᴛ.

ʙᴜᴛ ʏᴏᴜ ᴄᴀɴ ᴅᴏ ᴛʜɪꜱ ᴡɪᴛʜ ʙɪᴅɪʀᴇᴄᴛɪᴏɴᴀʟ ᴛᴀʙʟᴇꜱ ᴡɪᴛʜ ʀᴇɢᴜʟᴀʀ ᴋ/ᴠ ꜱᴛᴏʀᴇꜱ ᴀɴᴅ ᴡɪᴛʜ ᴛʜᴇ ꜱᴘʟɪᴛ ᴋ/ᴠ ꜱᴛᴏʀᴇ ɪɴ ʙᴀᴅɢᴇʀ, ʏᴏᴜ ᴄᴀɴ ᴘᴜᴛ ᴛʜᴇ ᴡʜᴏʟᴇ ᴠᴇʀᴛᴇx ᴛᴀʙʟᴇ ɪɴ ᴛʜᴇ ᴋᴇʏ ᴛᴀʙʟᴇ (ᴀꜱ ɪᴛ ɪꜱ ꜱᴇᴘᴀʀᴀᴛᴇ) ᴀɴᴅ ᴘᴇʀꜰᴏʀᴍ ᴄᴏᴍᴘʟᴇx ɢʀᴀᴘʜ ǫᴜᴇʀɪᴇꜱ ᴡɪᴛʜᴏᴜᴛ ᴇᴠᴇɴ ꜱᴛᴇᴘᴘɪɴɢ ᴏᴜᴛꜱɪᴅᴇ ᴏꜰ ᴛʜᴀᴛ ᴛᴀʙʟᴇ.

ɪ ᴀʟꜱᴏ ᴜꜱᴇ ᴛʜɪꜱ ꜱᴘᴇᴄɪᴀʟ ᴘʀᴏᴘᴇʀᴛʏ ᴏꜰ ʙᴀᴅɢᴇʀ'ꜱ ꜱᴘʟɪᴛ ᴋᴇʏ/ᴠᴀʟᴜᴇ ʟᴏɢꜱ ᴛᴏ ᴇɴᴀʙʟᴇ ᴏᴛʜᴇʀ ᴋɪɴᴅꜱ ᴏꜰ ɪɴᴅᴇxᴇꜱ ꜱᴜᴄʜ ᴀꜱ ᴛʜᴇ ᴏɴᴇ ᴛʜᴀᴛ ᴀʟʟᴏᴡꜱ ᴍᴇ ᴛᴏ ꜰɪʟᴛᴇʀ ᴏᴜᴛ ᴘᴜʙᴋᴇʏꜱ ᴀɴᴅ ᴛɪᴍᴇ ᴡɪɴᴅᴏᴡꜱ, ᴀꜱ ᴡᴇʟʟ ᴀꜱ ꜱᴏʀᴛ ᴛʜᴇ ʟɪꜱᴛ ᴏꜰ ᴇᴠᴇɴᴛ ɪɴᴅᴇx ꜱᴇʀɪᴀʟꜱ ʙᴇꜰᴏʀᴇ ᴇᴠᴇɴ ꜰᴇᴛᴄʜɪɴɢ ᴛʜᴇᴍ. ᴀ ꜱᴘᴇᴄɪᴀʟ ʜᴛᴛᴘ ᴇɴᴅᴘᴏɪɴᴛ ɪ ᴄʀᴇᴀᴛᴇᴅ ᴀʟꜱᴏ ᴜꜱᴇꜱ ᴛʜᴇ ᴍᴏɴᴏᴛᴏɴɪᴄ ᴇᴠᴇɴᴛ ꜱᴇʀɪᴀʟꜱ ᴀꜱ ᴀ ꜱʏɴᴄ ᴍᴇᴄʜᴀɴɪꜱᴍ, ᴀʟʟᴏᴡɪɴɢ ᴘᴇᴇʀꜱ ᴛᴏ ᴍᴏɴɪᴛᴏʀ ᴛʜᴇ ᴘʀᴏɢʀᴇꜱꜱ ꜱᴛᴀᴛᴇ ᴏꜰ ᴘᴇᴇʀꜱ ᴀɴᴅ ᴛʜᴇɴ ʀᴇǫᴜᴇꜱᴛ ʀᴀɴɢᴇꜱ ᴏꜰ ᴇᴠᴇɴᴛ ꜱᴇʀɪᴀʟꜱ ᴛᴏ ꜱʏɴᴄ ᴜᴘ, ᴡʜᴇɴ ᴇɴᴀʙʟᴇᴅ, ɪᴛ ᴅᴏᴇꜱ ᴛʜɪꜱ (ʙʏ ᴅᴇꜰᴀᴜʟᴛ) ᴇᴠᴇʀʏ 5 ꜱᴇᴄᴏɴᴅꜱ ᴀɴᴅ ᴀᴜᴛᴏᴍᴀᴛɪᴄᴀʟʟʏ ᴄᴀᴛᴄʜᴇꜱ ᴜᴘ ᴡɪᴛʜ ᴘᴇᴇʀꜱ ɪꜰ ᴛʜᴇ ɴᴏᴅᴇ ʜᴀꜱ ᴄʀᴀꜱʜᴇᴅ ᴏʀ ʟᴏꜱᴛ ᴄᴏɴɴᴇᴄᴛɪᴠɪᴛʏ ꜰᴏʀ ʟᴏɴɢ ᴘᴇʀɪᴏᴅꜱ, ᴡɪᴛʜ ɴᴏ ᴄʜᴀɴɢᴇ ᴛᴏ ᴛʜᴇ ǫᴜᴇʀʏ ᴘʀᴏᴛᴏᴄᴏʟ. ɪᴛ ᴀʟꜱᴏ ᴇxᴘʟᴏɪᴛꜱ ᴛʜᴇ ɪᴍᴍᴜᴛᴀʙɪʟɪᴛʏ ᴏꜰ ɴᴏꜱᴛʀ ᴇᴠᴇɴᴛꜱ, ꜱɪɴᴄᴇ ɪɴ ᴏᴛʜᴇʀ ᴛʏᴘᴇꜱ ᴏꜰ ᴅᴀᴛᴀʙᴀꜱᴇꜱ, ᴛʜᴇꜱᴇ ꜱᴇʀɪᴀʟꜱ ᴡᴏᴜʟᴅ ɴᴏᴛ ʙᴇ ᴀ ᴘʀᴏɢʀᴇꜱꜱ ɪɴᴅɪᴄᴀᴛᴏʀ.

ᴛʜɪꜱ ɪꜱ ᴀʟꜱᴏ ᴡʜʏ ɪ ᴄʜᴏᴏꜱᴇ ʙᴀᴅɢᴇʀ, ɪᴛ'ꜱ ᴛʜᴇ ᴍᴏꜱᴛ ᴠᴇʀꜱᴀᴛɪʟᴇ ᴀɴᴅ ᴘᴇʀꜰᴏʀᴍᴀɴᴛ ᴋ/ᴠ ꜱᴛᴏʀᴇ ʏᴏᴜ ᴄᴀɴ ɢᴇᴛ, ᴀɴᴅ ᴛʜᴇʀᴇ ɪꜱ ɴᴏ ᴏᴛʜᴇʀ ʟᴀɴɢᴜᴀɢᴇ ᴡɪᴛʜ ᴀ ᴡɪꜱᴄᴋᴇʏ ꜱᴘʟɪᴛ ᴛᴀʙʟᴇ ɪᴍᴘʟᴇᴍᴇɴᴛᴀᴛɪᴏɴ, ᴘᴇʀɪᴏᴅ. ꜱᴏ ɪᴛ'ꜱ ᴀʟꜱᴏ ᴀ ʙɪɢ ᴀᴅᴠᴀɴᴛᴀɢᴇ ꜰᴏʀ ɢᴏ ᴘʀᴏɢʀᴀᴍᴍᴇʀꜱ.