Even if it takes a screenshot, wouldn't it only be getting the text ripped out of the image?
My understanding is that the inputs are always reduced to a string of tokens. But some feedback would be better than nothing.
You can feed the Cursor agent images.