Replying to Avatar ManiMe

Hey Mises, im taking a look at your indexer now.

https://github.com/misesdev/nostr-indexer

You should take a look at our GrapeRank engine. Looks like we’re trying to solve the same problem … in almost the same manner. … maybe we can colab?

https://github.com/Pretty-Good-Freedom-Tech/graperank-nodejs

Live demo at

https://grapevine.my

Sure, I'll take a look; we can collaborate!

The engine repository itself is private for now, but I'll make it public soon. It's equivalent to an in-memory database; the project is in C and is still a bit amateurish, haha.

Reply to this note

Please Login to reply.

Discussion

If I’m following you correctly, the search results will be stratified by how many hops away the content author is from the user initiating the search (or from you). So I’ll see content from a user 3 hops away from me ranked more highly than a user 4 hops away. Is that right?

Not exactly. I did some tests, and initially I wanted the search to be performed in a "network" starting from the user who is doing the search, considering the "distance" within that network as a factor for relevance — meaning, in a search, users closer to me, with more similarity to my query, would have higher relevance, and those farther away would have lower relevance.

In theory, it looks great, but when I tested it in practice, perhaps because the number of users was still small, the results were quite poor. So, I added a "similarity" property to the users resulting from the search and simply sorted the list at the end in descending order by the "similarity" property.

In the form described above, the results were much more accurate, so once I got the desired outcome, I optimized the code to make the search as fast as possible. To summarize, the most efficient approach given the current number of users is to take the list and perform a simple search sorted by similarity. I’m still not making good use of the graph properties, and I’ll revisit the analysis when the user base is large enough for graph-based search results to be more relevant.

This week, I’ll solve an issue in the method that calculates the similarity between users’ names in the list and the search term. There’s a bug causing it to return poor results quite often.

How do you calculate the similarity score? Is it PageRank or something like it?

And are you using a graph db like neo4j? (Which has powerful tools for calculation of various similarity scores)

I'm a few steps behind, my project is still a bit primitive haha

Right now, it only performs a simple breadth-first search (BFS) on the graph, where I calculate the similarity between user names and the search term using an algorithm I implemented based on this paper: http://www.catalysoft.com/articles/StrikeAMatch.html

I'm not using any database; I implemented everything from scratch in C to achieve the highest possible performance. By using the paper above and implementing it in the most optimized way possible with bitwise operations and everything, the similarity calculation became orders of magnitude faster than all existing methods—about 300% faster than cosine similarity, for example.

But it's still a pretty primitive project, as I mentioned. Right now, it's just a simple search in a graph and list. Later on, depending on the needs that come up, I might implement something like PageRank when considering user posts. For now, the search is just done by profile.

So do you query relays for follows and build the graph on the fly at the time of the search request?

In the case of the relay search, it doesn't involve a graph—it's just a simple list, since the number of relays is very small and is not likely to increase significantly. The similarity calculation is the same, but it's done by iterating over a simple array.

On the front end, a POST request is made to the relay, which eliminates any relays that don't respond to the request, listing only the active relays. This ensures that only active relays will appear in the search results. Another thing I implemented is a 2-second timeout for the requests made to the relays, which also ensures that only the relays closest to you, and thus more relevant, will appear.

I only use a graph for the users. The relays are mapped in kind 3 events, which contain a list of relays and are sent to the engine that saves them in a list loaded into RAM. The graph is also loaded into RAM at the application startup, and everything runs in memory.

nostr:npub1manlnflyzyjhgh970t8mmngrdytcp3jrmaa66u846ggg7t20cgqqvyn9tn we should dig into this to compare it to what we are doing