LRU is prone to bias from scrapers and the like

Discussion

i don't see how if they are only fetching new data

i actively block scrapers manually, and could easily detect them with their unending streams of slowly progressing since/until queries

i could maybe add a second field so it's not just the timestamp but an access count as well and that mitigates the bias because they only add one each time anyway

you could modify access timestamp on filter, and not request by ID. then my scraper wouldn’t have issues since it will use the designated endpoints

otherwise, you should also ignore filters that are not specific (just since/until) for counting last access
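A minimal sketch of the "ignore non-specific filters" idea above, assuming filters are plain dicts with NIP-01 style keys. The helper name `counts_as_access` and the exact policy (anything beyond `since`/`until`/`limit` counts as specific) are assumptions for illustration, not something any relay is confirmed to implement.

```python
# Keys that only bound the time window or result size. A filter made up
# solely of these looks like a bulk scrape rather than real demand.
TIME_ONLY_KEYS = {"since", "until", "limit"}


def counts_as_access(filter_: dict) -> bool:
    """Return True if the filter names something specific.

    Specific means at least one non-empty key outside TIME_ONLY_KEYS,
    e.g. "ids", "authors", "kinds" or a tag key such as "#e".
    """
    selective = {
        key for key, value in filter_.items()
        if key not in TIME_ONLY_KEYS and value
    }
    return bool(selective)


if __name__ == "__main__":
    print(counts_as_access({"since": 1700000000, "until": 1700003600}))
    print(counts_as_access({"authors": ["abcd"], "since": 1700000000}))
```

With a check like this, a since/until crawl would still be served, it just would not refresh the last accessed record of the events it touches.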

blocking them imo is a bad idea, but if they misbehave that makes sense

yeah, i think i'll just add an access counter and then the sort order will be by last accessed AND least accessed, and this will remove the LRU bias, the GC will sort the oldest ones first and then sort those by the least accessed
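The two-level ordering described above could look roughly like this. The `AccessRecord` type and `eviction_order` function are hypothetical names, and rounding timestamps into buckets (`bucket_seconds`) is an added assumption so that "equally old" entries actually tie and the access count gets to decide.

```python
from dataclasses import dataclass


@dataclass
class AccessRecord:
    event_id: str
    last_accessed: int  # unix seconds
    access_count: int


def eviction_order(records, bucket_seconds=3600):
    """Return records with the best eviction candidates first.

    Primary key: last access, rounded down to a bucket (oldest first).
    Secondary key: access count (least accessed first).
    """
    return sorted(
        records,
        key=lambda r: (r.last_accessed // bucket_seconds, r.access_count),
    )


if __name__ == "__main__":
    records = [
        AccessRecord("a", last_accessed=100, access_count=5),
        AccessRecord("b", last_accessed=200, access_count=1),
        AccessRecord("c", last_accessed=7300, access_count=0),
    ]
    for record in eviction_order(records):
        print(record.event_id)
```

Here `a` and `b` fall in the same hour bucket, so `b` goes first because it has been accessed less, while `c` is kept longest because it was touched most recently.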

stuff that might be better to keep will also tend to have higher access counts so it can be shuffled upwards away from the low water mark

this is for later work, anyhow, but as we have discussed the idea of making relays into caches for a bigger event store would require capping the storage use of the caches, evicting the least valuable data in the cache
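As a rough sketch of capping a cache's storage, assuming each entry carries its size, last accessed timestamp and access count as a tuple. The function name `evict_to_low_water` and the `high_water` trigger are assumptions; the thread itself only mentions a size limit target and a low water mark.

```python
def evict_to_low_water(entries, high_water, low_water):
    """Pick event ids to evict so total size drops to low_water or below.

    entries: list of (event_id, size_bytes, last_accessed, access_count).
    Nothing is evicted unless the total size exceeds high_water.
    """
    total = sum(size for _, size, _, _ in entries)
    if total <= high_water:
        return []

    victims = []
    # Oldest last access first, then least accessed among equals.
    for event_id, size, _, _ in sorted(entries, key=lambda e: (e[2], e[3])):
        if total <= low_water:
            break
        victims.append(event_id)
        total -= size
    return victims


if __name__ == "__main__":
    entries = [
        ("a", 40, 100, 3),
        ("b", 40, 50, 9),
        ("c", 40, 100, 1),
    ]
    print(evict_to_low_water(entries, high_water=100, low_water=50))
```

Using two thresholds means the collector runs in occasional batches rather than on every write, which is one common way to do it but not the only one.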

relay operators could then run independent cache relays as part of their service offering and subscribe to the big store and save on managing their relay's syncing with the broader network (their relays would push to the store when they store and pull when they process requests, refreshing entries that may have found their way down to the end of the list)

i'm working on making a bunker app at the moment, but i might switch up the access counter value field to contain an access counter alongside the last accessed timestamp, and then maybe reinstate the option of having a garbage collector and a size limit target. these can be done in the dynamic configuration so you can switch it up whenever you need to, such as after migrating to a bigger VPS

use a morris counter
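For reference, a Morris counter stores only a small exponent and increments it with probability 2^-exponent, so a single byte can approximate very large counts. A minimal sketch follows; the class name and the injectable `rng` parameter are illustration choices, not taken from any relay code.

```python
import random


class MorrisCounter:
    """Approximate counter that stores only a small exponent."""

    def __init__(self, rng=None):
        self.exponent = 0
        self._rng = rng if rng is not None else random.Random()

    def increment(self):
        # Bump the exponent with probability 2^-exponent.
        # The first increment always succeeds, since random() < 1.0.
        if self._rng.random() < 2.0 ** -self.exponent:
            self.exponent += 1

    def estimate(self):
        # Estimate of the true count: 2^exponent - 1.
        return 2 ** self.exponent - 1


if __name__ == "__main__":
    counter = MorrisCounter(random.Random(42))
    for _ in range(1000):
        counter.increment()
    print(counter.exponent, counter.estimate())
```

The estimate is noisy for any single counter, but for ranking entries by rough demand only the order of magnitude matters.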

i don't think it's gonna make much difference to performance or effectiveness of the GC decision process though, it will eliminate the most stale and least demanded in total where recent access is equal