Is the doc bot docs, or not? (robinsloan.com)
176 points by tobr 12 hours ago | 101 comments
emil_sorensen 9 hours ago [-]
Docs bots like these are deceptively hard to get right in production. Retrieval is super sensitive to how you chunk/parse documentation and how you end up structuring documentation in the first place (see frontpage post from a few weeks ago: https://news.ycombinator.com/item?id=44311217).

You want grounded RAG systems like Shopify's here to rely strongly on the underlying documents, but also still sprinkle in a bit of the magic of latent LLM knowledge. The only way to get that balance right is evals. Lots of them. It gets even harder when you are dealing with a GraphQL schema like Shopify's, since most models struggle with that syntax more than with REST APIs.

FYI I'm biased: founder of kapa.ai here (we build docs AI assistants for 200+ companies incl. Sentry, Grafana, Docker, the largest Apache projects, etc.).

chrismorgan 9 hours ago [-]
Why do you say “deceptively hard” instead of “fundamentally impossible”? You can increase the probability it’ll give good answers, but you can never guarantee it. It’s then a question of what degree of wrongness is acceptable, and how you signal that. In this specific case, what it said sounds to me (as a Shopify non-user) entirely reasonable, it’s just wrong in a subtle but rather crucial way, which is also mildly tricky to test.
whatsgonewrongg 8 hours ago [-]
A human answering every question is also not guaranteed to give good answers; anyone that has communicated with customer service knows that. So calling it impossible may be correct, but not useful.

(We tend to have far fewer evals for such humans though.)

girvo 8 hours ago [-]
A human will tell you “I am not sure, and will have to ask engineering and get back to you in a few days”. None of these LLMs do that yet, they’re biased towards giving some answer, any answer.
unshavedyak 6 hours ago [-]
I agree with you, but man i can't help but feel humans are the same depending on the company. My wife was recently fighting with several layers of comcast support over cap changes they've recently made. Seemingly it's a data issue, since it's something new that theoretically hasn't propagated through their entire support chain yet, but she encountered a half dozen confidently incorrect people who lacked the information/training to know that they were wrong. It was a very frustrating couple of hours.

Generally i don't trust most low-paid (at no fault of their own) customer service centers any more than i do random LLMs. Historically their advice for most things is either very biased, incredibly wrong, or often both.

tenacious_tuna 2 hours ago [-]
In the case of unhelpful human support, I can leverage my experience in communicating with another human to tell if I'm being understood or not. An LLM is much more trial-and-error: I can't model the theory-of-mind behind its answers to tell if I'm just communicating poorly or whatever else is being lost in translation; there is no mind at play.
unshavedyak 2 hours ago [-]
That's fair, though with an LLM (at least one you're familiar with) you can shape its behavior. Which is not too different from some black box script that i can't control or reason through with human support. Granted, the LLM will have the same stupid black box script, so in both cases it's weaponized stupidity against the consumer.
dcre 7 hours ago [-]
This is not really true. If you give a decent model the docs in the prompt and tell it to answer based on the docs and say "I don't know" if the answer isn't there, it does so (most of the time).
SecretDreams 7 hours ago [-]
> most of the time

This is doing some heavy lifting

QuadmasterXLII 4 hours ago [-]
I have never seen this in the wild. Have you?
dcre 8 minutes ago [-]
Yes. All the time. I wrote a tool that does it!

https://crespo.business/posts/llm-only-rag/

  $ rgd ~/repos/jj/docs "how can I write a revset to select the nearest bookmark?"

  Using full corpus (length: 400,724 < 500,000)

  # Answer

  gemini-2.5-flash  | $0.03243 | 2.94 s | Tokens: 107643 -> 56

  The provided documentation does not include a direct method to select the
  nearest bookmark using revset syntax. You may be able to achieve this using
  a combination of  ancestors() ,  descendants() , and  latest() , but the
  documentation does not explicitly detail such a method.
dingnuts 4 hours ago [-]
I need a big ol' citation for this claim, bud, because it's an extraordinary one. LLMs have no concept of truth or theory of mind so any time one tells you "I don't know" all it tells you is that the source document had similar questions with the answer "I don't know" already in the training data.

If the training data is full of certain statements you'll get certain sounding statements coming out of the model, too, even for things that are only similar, and for answers that are total bullshit

simonw 4 hours ago [-]
Do you use LLMs often?

I get "I don't know" answers from Claude and ChatGPT all the time, especially now that they have thrown "reasoning" into the mix.

Saying that LLMs can't say "I don't know" feels like a 2023-2024 era complaint to me.

stavros 3 hours ago [-]
Ok, how? The other day Opus spent 35 of my dollars by throwing itself again and again at a problem it couldn't solve. How can I get it to instead say "I can't solve this, sorry, I give up"?
simonw 2 hours ago [-]
That sounds slightly different from "here is a question, say I don't know if you don't know the answer" - sounds to me like that was Opus running in a loop, presumably via Claude Code?

I did have one problem (involving SQLite triggers) that I bounced off various LLMs for genuinely a full year before finally getting to an understanding that it wasn't solvable! https://github.com/simonw/sqlite-chronicle/issues/7

stavros 2 hours ago [-]
It wasn't in a loop really, it was more "I have this issue" "OK I know exactly why, wait" $3 later "it's still there" "OK I know exactly why, it's a different reason, wait", repeat until $35 is gone and I quit.

I would have much appreciated if it could throw its hands up and say it doesn't know.

conception 2 hours ago [-]
I solve this in my prompt: I say that if it can't fix something in two tries, it should look online for how to do it, and if it still can't fix it after two more tries, it should pause and ask for my help. It works pretty well.
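
Something along these lines in the system prompt, for instance (the exact wording is just illustrative, not a quote of my actual prompt):

  If you cannot fix the problem within two attempts, search online for how to do it.
  If it is still not fixed after two more attempts, stop and ask me for help instead
  of continuing to guess.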
axus 6 hours ago [-]
Won't that be cool, when LLM-based AIs ask you for help instead of the other way around
whatsgonewrongg 7 hours ago [-]
You’re right that some humans will, and most LLMs won’t. But humans can be just as confidently wrong. And we incentivize them to make decisions quickly, in a way that costs the company less money.
bee_rider 5 hours ago [-]
Documentation is the thing we created because humans are forgetful and misunderstand things. If the doc bot is to be held to a standard more like some random discord channel or community forum, it should be called something without “doc” in the name (which, fwiw, might just be a name the author of the post came up with, I dunno what Shopify calls it).
intended 7 hours ago [-]
This moves the goalposts / raises a different issue. We can engage with the new point, but it concedes that docs bots are not docs bots.
skrebbel 9 hours ago [-]
Why RAG at all?

We concatenated all our docs and tutorials into a text file, piped it all into the AI right along with the question, and the answers are pretty great. Cost was, last I checked, roughly 50c per question. It probably scales linearly with how much documentation you have. This feels expensive, but compared to a human writing an answer it's peanuts. Plus (assuming the customer can choose to use the AI or a human), it's a great customer experience because the answer is there that much faster.

I feel like this is a no-brainer. Tbh with the context windows we have these days, I don't completely understand why RAG is a thing anymore for support tools.
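
For what it's worth, a minimal sketch of that setup (assuming an OpenAI-style chat API; the model name and the docs layout are placeholders):

  # stuff the whole docs corpus into the prompt and answer from it
  from pathlib import Path
  from openai import OpenAI

  docs = "\n\n".join(p.read_text() for p in Path("docs").rglob("*.md"))  # whole corpus
  client = OpenAI()

  def answer(question: str) -> str:
      resp = client.chat.completions.create(
          model="gpt-4o-mini",  # placeholder; any large-context model
          messages=[
              {"role": "system",
               "content": "Answer only from the documentation below. "
                          "If the answer is not there, say you don't know.\n\n" + docs},
              {"role": "user", "content": question},
          ],
      )
      return resp.choices[0].message.content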

cube2222 8 hours ago [-]
This works as long as your docs are below the max context size (and even then, as you approach larger context sizes, quality degrades).

Re cost though, you can usually reduce the cost significantly with context caching here.

However, in general, I’ve been positively surprised with how effective Claude Code is at grep’ing through huge codebases.

Thus, I think just putting a Claude Code-like agent in a loop, with a grep tool on your docs, and a system prompt that contains just a brief overview of your product and brief summaries of all the docs pages, would likely be my go to.
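
A rough sketch of that shape (the LLM call is stubbed out and the tool protocol is simplified to a single dict; a real implementation would use the provider's tool-calling API):

  import re
  from pathlib import Path

  DOCS = {str(p): p.read_text() for p in Path("docs").rglob("*.md")}

  def grep_docs(pattern: str, max_hits: int = 20) -> str:
      """Return matching lines (with file names) across the docs corpus."""
      hits = [f"{name}: {line.strip()}"
              for name, text in DOCS.items()
              for line in text.splitlines()
              if re.search(pattern, line, re.IGNORECASE)]
      return "\n".join(hits[:max_hits]) or "(no matches)"

  def call_llm(messages: list[dict]) -> dict:
      """Placeholder: returns either {'grep': '<pattern>'} or {'answer': '<text>'}."""
      raise NotImplementedError

  def answer(question: str, max_steps: int = 5) -> str:
      messages = [{"role": "system", "content": "Product overview and doc summaries go here. "
                                                "You may grep the docs before answering."},
                  {"role": "user", "content": question}]
      for _ in range(max_steps):
          reply = call_llm(messages)
          if "answer" in reply:
              return reply["answer"]
          messages.append({"role": "user",
                           "content": "grep results:\n" + grep_docs(reply["grep"])})
      return "I don't know."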

bee_rider 5 hours ago [-]
Oh man, maybe this would cause people to write docs that are easy to grep through. Let’s start up that feedback loop immediately, please.
cluckindan 1 hours ago [-]
How will you grep synonyms or phrases with different word choices?
bee_rider 53 minutes ago [-]
I’m hoping that the documentation will be structured in a way such that Claude can easily come up with good grep regexes. If Claude can do it, I can probably do it only a little bit worse.
Rygian 9 hours ago [-]
What you describe sounds like poor man's RAG. Or lazy man's. You're just doing the augmentation at each prompt.
cluckindan 9 hours ago [-]
With RAG the cost per question would be low single-digit pennies.
emil_sorensen 8 hours ago [-]
Accuracy drops hard with context length still. Especially in more technical domains. Plus latency and cost.
TZubiri 3 hours ago [-]
That is not particularly cheap, especially since it scales linearly with doc size, and therefore with time.

Additionally, the quality of loading up the context window decreases linearly as well: just because your model can handle 1M tokens doesn't mean that it WILL remember 1M tokens, it just means that it CAN.

RAG fixes this. In the simplest configuration, a RAG setup can be just an index: the only context you give the LLM is the table of contents, and you let it search through the index.

Should it be a surprise that this is cheaper and more efficient? Loading the whole context window is like a library having every book open at every page at the same time instead of using the Dewey Decimal System.
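
A minimal sketch of that simplest configuration (the LLM call is stubbed out, and the docs layout is a placeholder):

  from pathlib import Path

  sections = {p.stem: p.read_text() for p in Path("docs").rglob("*.md")}
  toc = "\n".join(sections.keys())  # the only context handed over up front

  def ask_llm(prompt: str) -> str:
      """Placeholder for whichever LLM API you use."""
      raise NotImplementedError

  def answer(question: str) -> str:
      # step 1: the model sees only the table of contents and picks a section
      pick = ask_llm(f"Table of contents:\n{toc}\n\n"
                     f"Which single section best answers: {question}\nReply with its name only.")
      # step 2: answer using only that section's text
      section = sections.get(pick.strip(), "")
      return ask_llm(f"Using only this section:\n{section}\n\nAnswer: {question}")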

llm_nerd 6 hours ago [-]
What you described is RAG. Inefficient RAG, but still RAG.

And it's inefficient in two ways:

- you're using extra tokens for every query, which adds up.

- you're making the LLM less precise by overloading it with potentially irrelevant extra info, making it harder for it to pull the specific relevant answer out of the haystack.

Filtering (e.g. embedding similarity & BM25) and re-ranking/pruning what you provide to RAG is an optimization. It optimizes the tokens, the processing time, and optimizes the answer in an ideal world. Most LLMs are far more effective if your RAG is limited to what is relevant to the question.
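
A minimal sketch of the filtering step (using the rank_bm25 package as one example scorer; an embedding-similarity score could be blended in the same way):

  from rank_bm25 import BM25Okapi

  chunks = ["...documentation chunk 1...", "...documentation chunk 2..."]  # placeholder corpus
  bm25 = BM25Okapi([c.lower().split() for c in chunks])

  def retrieve(question: str, k: int = 5) -> list[str]:
      # score every chunk against the query, keep only the top-k for the prompt
      scores = bm25.get_scores(question.lower().split())
      ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
      return [chunk for _, chunk in ranked[:k]]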

TZubiri 3 hours ago [-]
I don't think it's RAG. RAG is specifically separating the search space from the LLM context window or training set, and giving the LLM tools to search it at inference time.
llm_nerd 44 minutes ago [-]
In this case their Retrieval stage is "SELECT *", basically, so sure I'm being loose with the terminology, but otherwise it's just a non-selective RAG. Okay ..AG.

RAG is selecting pertinent information to supply to the LLM with your query. In this case they decided that everything was pertinent, and the net result is just reduced efficiency. But if it works for them, eh.

IceDane 8 hours ago [-]
Because LLMs still suck at actually using all that context at once. And surely you can see yourself that your solution doesn't scale. It's great that it works for your specific case, but I'm sure you can come up with a scenario where it's just not feasible.
PeterStuer 8 hours ago [-]
Indeed. Dabbling in 'RAG' (which for better or worse has become a tag for anything context retrieval) for more complex documentation and more intricate questions, you will very quickly realize that you need to go far beyond simple 'chunking', and you end up with a subsystem that constructs more than one very intricate knowledge graph to support the different kinds of questions users might ask. For example: a simple question such as "What exactly is an 'Essential Entity'?" is better handled by Knowledge Representation A, as opposed to "Can you provide a gap and risk analysis on my 2025 draft compliance statement (uploaded) in light of the current GDPR, NIS-2 and the AI Act?"

(My domain is regulatory compliance, so maybe this goes beyond pure documentation but I'm guessing pushed far enough the same complexities arise)

J_Shelby_J 2 hours ago [-]
“It’s just a chat bot Michael, how much can it cost?”

A philosophy degree later…

I ended up just generating a summary of each of our 1k docs, using the summaries for retrieval, running a filter to confirm each doc is relevant, and finally using the actual docs to generate an answer.
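
Roughly, in code (a sketch only; the LLM call is a placeholder, and the summaries are assumed to have been generated ahead of time):

  def ask_llm(prompt: str) -> str:
      """Placeholder for whichever LLM API you use."""
      raise NotImplementedError

  def answer(question: str, docs: dict[str, str], summaries: dict[str, str]) -> str:
      # 1) retrieve/filter on the cheap pre-generated summaries
      relevant = [name for name, summary in summaries.items()
                  if ask_llm(f"Could a doc summarized as follows answer '{question}'? "
                             f"Reply yes or no.\n\n{summary}").lower().startswith("yes")]
      # 2) answer from the full text of only the surviving docs
      context = "\n\n".join(docs[name] for name in relevant)
      return ask_llm(f"Using only these docs:\n\n{context}\n\nAnswer: {question}")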

dingnuts 4 hours ago [-]
This is sort of hilarious; to use an LLM as a good search interface, first build... a search engine.

I guess this is why Kagi Quick Answer has consistently been one of the best AI tools I use. The search is good, so their agent is getting the best context for the summaries. Makes sense.

PeterStuer 2 hours ago [-]
It is building a system that amplifies the strengths of the LLM by feeding it the right knowledge in the right format at inference time. Context design is both a search (as a generic term for everything retrieval) and a representation problem.

Just dumping raw reams of text into the 'prompt' isn't the best way to get great results. Now I am fully aware that anything I can do on my side of the API, the LLM provider can and eventually will do as well. After all, Search also evolved beyond 'pagerank' to thousands of specialized heuristic subsystems.

dworks 7 hours ago [-]
We're going to see increasingly more of these, and at some point it's going to cause a big scandal that pops the current AI bubble. It's really obvious that you can't use non-deterministic systems this way, but companies are hellbent on doing it anyway. This is why I won't take a role to implement "AI" in an existing product.
crystal_revenge 4 hours ago [-]
I don’t understand why people seem to be attacking the “non-determinism” of LLMs. First, I think most people are confusing “probabilistic” with “non-deterministic” which have very distinct meanings in CS/ML. Non-deterministic typically entails following multiple paths at once. Consider regex matching with NFAs or even the particular view of a list as a monad. The only case where LLMs are “non-deterministic” is when using sampling algorithms like beam search where multiple paths are considered simultaneously. But most LLM usage being discussed doesn’t involve beam search.

But even if one assumes people mean “probabilistic”, that’s also an odd critique given how probabilistic software has pretty much eaten the world. Most of my career has been building reliable product using probabilistic models.

Finally, there's nothing inherently probabilistic or non-deterministic about LLM generation; these are properties of the sampler applied. I did quite a lot of LLM benchmarking in recent years and almost always used greedy sampling, both for performance (things like GSM8K strongly benefit from choosing the maximum-likelihood path) and for reproducibility. You can absolutely set up LLM tools that have perfectly reproducible results. LLMs have many issues, but their probabilistic nature is not one of them.
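
To illustrate (a minimal sketch with the Hugging Face transformers library; gpt2 is just a stand-in model): with do_sample=False the decoder always picks the argmax token, so repeated runs produce identical output.

  from transformers import AutoModelForCausalLM, AutoTokenizer

  tok = AutoTokenizer.from_pretrained("gpt2")             # stand-in model
  model = AutoModelForCausalLM.from_pretrained("gpt2")

  ids = tok("Greedy decoding is deterministic because", return_tensors="pt").input_ids
  out = model.generate(ids, do_sample=False, max_new_tokens=20)  # greedy: no sampling step
  print(tok.decode(out[0], skip_special_tokens=True))            # same text every run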

thrwwXZTYE 3 hours ago [-]
There was an article on hackernews a few years back (before LLMs took over) about jobs that could be replaced by a sign saying "$default_result" 99% of the time.

Like being a cancer diagnostician. Or an inspector at a border crossing.

Using LLMs is currently a lot like going to a diagnostician who always responds "no, you're healthy". The answer is probably right. But we still pay people a lot to get that last 1%.

nlawalker 2 hours ago [-]
> But still we pay people a lot to get that last 1%.

If people paid for docs as an independent product, or had the foresight to evaluate the quality of the docs before making a purchase and use it as part of their criteria (or are able to do that at all), I think attitudes around docs and "docs bots" and their correctness, utility etc. would be a lot different.

meatmanek 1 hours ago [-]
I think you're being overly (and incorrectly) pedantic about the meaning of "non-deterministic" -- you're applying the fairly niche definition of the term as used on finite automata, when the people you're refuting are using it in the sense of https://en.wikipedia.org/wiki/Nondeterministic_algorithm: "In computer science and computer programming, a nondeterministic algorithm is an algorithm that, even for the same input, can exhibit different behaviors on different runs, as opposed to a deterministic algorithm." I think this usage of the term is more common than the finite automata sense. Dictionary.com doesn't have non-deterministic, but its (relevant) definition of deterministic is "of or relating to a process or model in which the output is determined solely by the input and initial conditions, thereby always returning the same results": https://www.dictionary.com/browse/deterministic

Under that definition of (non-)deterministic, ironically, an NFA is deterministic, because it always produces the same result for the same input.

crystal_revenge 27 minutes ago [-]
I'm not being pedantic, this is foundational comp-sci stuff drawing from the theory of computation (it's what the 'N' stands for in NP complete). That is not a particularly great (or relevant) wikipedia article you link to (you can look at the citations). The one on "Non-deterministic programming"[0] is probably better. But ultimately you can't just dismiss NFAs as these serve as the foundation for computational non-determinism. Automata theory isn't just some niche area of computing it's part of how we actually define what computation is.

We can just go straight to the Sipser (from the chapter 1, all emphasis is Sipser's)[1]:

> Nondeterminism is a useful concept that has had great impact on the theory of computation. So far in our discussion, every step of a computation follows in a unique way from the preceding step. When the machine is in a given state and reads the next input symbol, we know what the next state will be--it is determined. We call this deterministic computation. In a nondeterministic machine several choices may exist for the next state at any point.

> How does an NFA compute? Suppose that we are running an NFA on an input string and come to a state with multiple ways to proceed. For example, say that we are in state q_1 in NFA N_1 and that the next input symbol is a 1. After reading that symbol, the machine splits into multiple copies of itself and follows all the possibilities in parallel. Each copy of the machine proceeds and continues as before.

This is why the list monad also provides a useful way to explore non-determinism that mirrors in functional programming terms what NFAs do in a classical theory of computation framework.

To this point, LLMs can form this type of nondeterministic computing when they follow multiple paths at once doing beam search, but are unquestionably deterministic when doing greedy optimization, and still deterministic when using other single path sampling techniques and a known seed.

[0]. https://en.wikipedia.org/wiki/Nondeterministic_programming

[1]. https://cs.brown.edu/courses/csci1810/fall-2023/resources/ch...

d0mine 2 hours ago [-]
"chaotic system" might be more precise here: small variations in the input may result in arbitrary large differences in the output.
TZubiri 3 hours ago [-]
It's not entirely unrelated: the fact that the system is non-deterministic means that it necessarily is probabilistic.

A business can reduce temperature to 0 and choose a specific seed, and it's the correct approach in most cases, but still the answers might change!

On the other hand, it's true that there is some randomness that is independent of determinism. For example, changing the order of some words might yield different answers: the machine may be deterministic, but there are millions of ways to frame a question, and if the answer depends on trivial details of how the question is formatted, there's a randomness there. Similar to how there is randomness in who will win a chess match between two equally rated players, despite the game being deterministic.

crystal_revenge 2 hours ago [-]
> the system is non-deterministic means that it necessarily is probabilistic.

This is not correct. Both of the examples I gave were specifically chosen because they use non-determinism without any probabilistic framework associated.

Regex matching using non-deterministic finite automata requires absolutely zero usage of probability. You simply need to keep track of multiple paths and store whether or not any are in a valid state at the end of processing the string. The list monad is an even more generic model of non-determinism that, again, requires nothing probabilistic in its reasoning.

Non-deterministic things do often become probabilistic because typically you have to make a choice of paths, and that choice can have a probabilistic nature. But again, NFA regex matching is a perfect example where no "choice" is needed.

schnable 9 hours ago [-]
Reminds me of when I asked Gemini how to do some stuff in Google Docs App Script, and it just hallucinated the capability and code to make it work. Turns out what I wanted to do isn't supported at all.

I feel like we aren't properly using AI in products yet.

aDyslecticCrow 8 hours ago [-]
I asked about a niche JSON library for C. It apparently wasn't in the training data, so it just invented how it felt a JSON library would work.

I've also had a lot of issues with CMake where it just invents syntax and functions. Every new question has to be asked in a new chat context to clear the context poisoning.

It's the things that lack good docs I want to ask about. But that's where it's most likely to fail.

dingnuts 4 hours ago [-]
I think users should get a refund on the tokens when this happens
Night_Thastus 2 hours ago [-]
That would turn a business model that is already questionable in terms of profitability to one that would never, ever be profitable. Just sayin.
braebo 7 hours ago [-]
Yet Google raised my Workspace subscription cost by 25% last night because our current agreement is suddenly unworthy of all the new "ai value" they've added… value I didn't even know existed until I started paying for it. I don't even want to know what it's supposed to be referencing… I just want to dump it asap.
dsmmcken 6 hours ago [-]
The tool we use for our docs AI answers lets you mine that data for feature requests. It generates a report of what it didn't have answers for and summarizes them as potential feature gaps. (Or at least what it is aware it didn't have answers for).

People seem more willing to ask an AI about certain things than to be judged for asking the same question of a human, so in that regard it does seem to surface slightly different feature requests than we hear when talking to customers directly.

We use inkeep.com (not affiliated, just a customer).

rapind 5 hours ago [-]
> We use inkeep.com (not affiliated, just a customer).

And what do you pay? It's crazy that none of these AI CSRs have public pricing. There should just be monthly subscription tiers, which include some number of queries, and a cost per query beyond that.

hnlmorg 8 hours ago [-]
I've found LLMs (or at least every one I've tried this on) will always assume the customer is correct, and thus even if they're flat out wrong, the LLM will make up some bullshit to confirm the customer is still correct.

It's great when you're looking to do creative stuff. But terrible when you're looking to confirm the correctness of an approach, or asking for support on something you weren't even aware doesn't exist.

dworks 7 hours ago [-]
that's because its "answers" are actually "completions". can't escape that fact - LLMs will always "hallucinate".
xyst 6 hours ago [-]
> I feel like we aren't properly using AI in products yet.

Very similar sentiment at the height of the crypto/digital currency mania

shlomo_z 4 hours ago [-]
> so I did my customary dance of order-refund, order-refund, order-refund. My credit card is going to get locked one of these days.

I don't know the first thing about Shopify, but perhaps you can create a free "test" item so you don't actually need to make a credit card transaction.

dpifke 3 hours ago [-]
You elided the part where TFA claims you can't test "unconventional email formats" via test orders. The full quote is:

Shopify doesn’t pro­vide a way to test uncon­ven­tional email for­mats without actu­ally placing real orders, so I did my cus­tomary dance of order-refund, order-refund, order-refund. My credit card is going to get locked one of these days.

The person who wrote the above knows a lot about Shopify, so if you're going to contradict them, it'd be nice to point to some evidence as to why you think they're wrong.

Groxx 3 hours ago [-]
Anyone who has used these kinds of systems will have encountered tons of cases where those test systems weren't enough, so they use real purchases too

The test systems are broadly good and worth using, but no. Everyone uses real purchases too.

Bewelge 10 hours ago [-]
To be fair, for me at least, that weird chat bot only appears on https://help.shopify.com/ while the technical documentation is on shopify.dev/.

Every time I land on help.shopify.com I get the feeling it's one of those "Doc pages for sales people". Like it's meant to show "We have great documentation and you can do all these things" but never actually explains how to do anything.

I tried that bot a couple of months ago and it was utterly useless:

question: When using discountRedeemCodeBulkAdd there's a limit to add 100 codes to a discount. Is this a limit on the API or on the discount? So can I add 100 codes to the same discount multiple times?

answer: I wasn't able to find any results for that. Can you tell me a little bit more about what you're looking for?

Telling it more did not help. To me that seemed like the bot didn't even have access to the technical documentation. Finding it hard to believe that any search engine can miss a word like discountRedeemCodeBulkAdd if it actually is in the dataset: https://shopify.dev/docs/api/admin-graphql/latest/mutations/...

So it's a bit like asking sales people technical questions.

edit: Okay, I should have tried that before commenting. They seem to have updated it. When I ask the same question now it answers correctly (weirdly, in German; translated below):

The limit of 100 codes when using discountRedeemCodeBulkAdd refers to the number of codes you can add in a single API call, not to the total number of codes that can be assigned to a discount. A discount can contain up to 20,000,000 unique discount codes. You can therefore add 100 codes at a time to the same discount repeatedly until you reach the limit of 20,000,000 codes. Note that third-party apps or custom solutions cannot bypass or increase this limit.

~= It's a limit on the API endpoint, you can add up to 20M to a single discount.
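
In code terms, if the 100-per-call cap is the only constraint, batching is straightforward. A sketch (the actual GraphQL call is stubbed out; check the argument shape against the linked mutation reference):

  def run_bulk_add(discount_id: str, codes: list[str]) -> None:
      """Placeholder: send one discountRedeemCodeBulkAdd mutation via the Admin GraphQL API."""
      raise NotImplementedError

  def add_codes(discount_id: str, codes: list[str], batch_size: int = 100) -> None:
      # the mutation accepts at most 100 codes per call, so chunk the list
      for i in range(0, len(codes), batch_size):
          run_bulk_add(discount_id, codes[i:i + batch_size])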

debugnik 10 hours ago [-]
> weirdly in German

I keep seeing bots wrongly prompted with both the browser language and the text "reply in the user's language". So I write to a bot in English and I get a Spanish answer.

delusional 10 hours ago [-]
> So it's a bit like asking sales people technical questions.

Maybe that's the best anthropomorphic analogy of LLMs. Like good sales people completely disconnected from reality, but finely tuned to give you just the answer you want.

WJW 9 hours ago [-]
Well no, the problem was that the bot didn't give them the answer they wanted. It's more like "finely tuned to waffle around pretending to be knowledgeable, but lacking technical substance".

Kind of like a bad salesperson; the best salespeople I've had the pleasure of knowing were not afraid to learn the technical background of their products.

barrell 9 hours ago [-]
The best anthropomorphic analogy for LLMs is no anthropomorphic analogy :)
eszed 2 hours ago [-]
Anthropomorphizing sales people involves the same constraints, so I'd allow it.
dworks 7 hours ago [-]
to be fair?
simonw 8 hours ago [-]
This is a great example of the kind of question I'd love to be able to ask these documentation bots but that I don't trust them to be able to get right (yet):

> What’s the syntax, in Liquid, to detect whether an order in an email notification contains items that will be fulfilled through Shopify Collective?

I suspect the best possible implementation of a documentation bot with respect to questions like this one would be an "agent" style bot that has the ability to spin up its own environment and actually test the code it's offering in the answer before confidently stating that it works.

That's really hard to do - Robin in this case could only test the result by placing and then refunding an order! - but the effort involved in providing a simulated environment for the bot to try things out in might make the difference in terms of producing more reliable results.
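
Something like this, perhaps (a sketch only; both the LLM call and the sandbox runner are placeholders):

  def ask_llm(prompt: str) -> str:
      """Placeholder for whichever LLM API you use."""
      raise NotImplementedError

  def run_in_sandbox(snippet: str) -> tuple[bool, str]:
      """Placeholder: run the snippet in an isolated test environment, return (ok, output)."""
      raise NotImplementedError

  def answer_with_check(question: str, max_attempts: int = 3) -> str:
      feedback = ""
      for _ in range(max_attempts):
          snippet = ask_llm(f"{question}\n{feedback}\nReply with only the code.")
          ok, output = run_in_sandbox(snippet)
          if ok:
              return snippet          # only return answers that actually ran
          feedback = f"Your previous attempt failed with:\n{output}\nTry again."
      return "I couldn't produce a working answer."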

dworks 7 hours ago [-]
get a second agent to validate the return from the first agent. but it might get it wrong because reasons, so you need a third agent just to make sure. and then a fourth. and so on. this is obviously not a working direction.
simonw 7 hours ago [-]
That's why you give them the ability to actually execute the code in a sandbox. Then it's not AI checking AI, you're mixing something deterministic into the loop.
kmoser 6 hours ago [-]
That may certainly increase the agent's ability to get it right, but there will always be cases where the code it generates mimics the correct response, i.e. produces the output asked for, without actually working as intended, as LLMs tend to want to please as much as be correct.
gampleman 3 hours ago [-]
However I think it would remove the case of the bot outright making up non-existent stuff. It could still always be just plain wrong, but in a more human sort of way. A real support person may be wrong about some precise detail of what they're recommending, but is unlikely to just make up something plausible.
simonw 5 hours ago [-]
Not much harm done. The end user sees the response and either spots that it's broken or finds out it's broken when they try to run it.

They take a screenshot and make fun of the rubbish bot on social media.

If that happens rarely it's still a worthwhile improvement over today. If it happens frequently then the documentation bot is junk and should be retired.

dworks 5 hours ago [-]
you're hand-waving away all the other million use cases where returning false information isn't OK.
dworks 7 hours ago [-]
the return may still not reflect the sandbox reality.
ngriffiths 6 hours ago [-]
The doc bot goes in the same category as asking a human who has read the docs. In order of helpfulness you could get:

- "Oh yeah just write this," except the person is not an expert and it's either wrong or not idiomatic

- An answer that is reliably correct enough of the time

- An answer in the form "read this page" or quotes the docs

The last one is so much better because it directly solves the problem, which is fundamentally a search problem. And it places the responsibility for accuracy where it belongs (on the written docs).

bee_rider 6 hours ago [-]
I think the name, doc-bot, is just bad (actually I don’t know what Shopify even calls their thing, so maybe the confusion is on the part of the author of the post, and not some misleading thing from Shopify). A bot like that could fulfill the role of the community forum, which certainly isn’t nothing! But of course it isn’t the documentation.
bravesoul2 11 hours ago [-]
Need a real CC to test? Right there makes me lose respect for Shopify if true. Even Stripe lets you test :)
Bewelge 10 hours ago [-]
Not sure if I'm missing something but the way I'd always test orders is generate some 100% discount. You don't need any payment info then. I only ever needed a CC if I wanted to actually test something relating to payment. And on test stores you can mock a CC
bravesoul2 8 hours ago [-]
That's a good way too for most cases. Unless you need there to be an amount
ysofunny 1 hours ago [-]
it's lossy docs.

docs with JPEG artifacts, the more you zoom, the more specific your query, the worse the noise becomes

schaum 4 hours ago [-]
There is also https://gurubase.io/ which is sometimes used as a kind of "talk with the documentation"; it claims to validate the response somehow.
trjordan 6 hours ago [-]
The core argument here is: LLM docbots are wrong sometimes. Docs are not. That's not acceptable.

But that's not true! Docs are sometimes wrong, and even more so if you count errors of omission. From a user's perspective, dense / poorly structured docs are wrong, because they lead users to think the docs don't have the answer. If they're confusing enough, they may even mislead users.

There's always an error rate. DocBots are almost certainly wrong more frequently, but they're also almost certainly much much faster than reading the docs. Given that the standard recommendation is to test your code before jamming it in production, that seems like a reasonable tradeoff.

YMMV!

(One level down: the feedback loop for getting docbots corrected is _far_ worse. You can complain to support that the docs are wrong, and most orgs will at least try to fix it. We, as an industry, are not fully confident in how to fix a wrong LLM response reliably in the same way.)

apnorton 2 hours ago [-]
> There's always an error rate. DocBots are almost certainly wrong more frequently, but they're also almost certainly much much faster than reading the docs.

A lot of the discourse around LLM tooling right now boils down to "it's ok to be a bit wrong if you're wrong quickly" ... and then what follows is an ever-further bounds-pushing on how big "a bit" can be.

The promise of AI is "human-level (or greater)" --- we should only be using AI when it's as accurate (or more accurate) as human-generated docs, but the tech simply isn't there yet.

mananaysiempre 6 hours ago [-]
Docs are reliably fixable, so with enough effort they will converge to correctness. Doc bots are not and will not.
domk 11 hours ago [-]
Working with Shopify is an example of something where a good mental model of how it works under the hood is often required. This type of mistake, not realising that the tag is added by an app after an order is created and won't be available when sending the confirmation email, is an easy one to make, both for a human or an LLM just reading the docs. This is where AI that just reads the available docs is going to struggle, and won't replace actual experience with the platform.
goroutines 2 hours ago [-]
sounds like a good time to plug install.md (precise step-by-step docs / guides as MCP, with simple RAG) - which I think is the right direction when paired with coding agents.
anentropic 9 hours ago [-]
I would guess these narrow docs bots probably perform worse than ChatGPT et al in 'search' mode
nickphx 2 hours ago [-]
Placing live orders on your card is a violation of Shopify, Shopify merchant, and cardholder terms.
BossingAround 11 hours ago [-]
It's probably docs... If it can hallucinate an answer, it's docs with probably the most infuriating UX one can imagine.

I remember being taught that no docs is better (i.e. less frustrating to the user) than bad/incorrect docs.

pmg101 10 hours ago [-]
"Documentation - or, as I like to call it, lies."

After a certain number of years you learn that source code comments so often fall out of synch with the code itself that they're more of a liability than an asset.

walthamstow 9 hours ago [-]
At my last place the docs were in the repo with the code, and if you didn't update the docs in the same PR as the code it wouldn't get approved.

My current place? It's in Confluence, miles away from code and with no review mechanism.

taneq 9 hours ago [-]
“There’s lies, damn lies, and datasheets.”

Although, “All datasheets are wrong. Some datasheets are useful.”

deepdarkforest 10 hours ago [-]
I mean, that's the dirty secret of any RAG chatbot. The concept of "grounding" is arbitrary. It doesn't matter if you use embeddings, or a tool that runs your usual search and takes the top items, like most web search tools or Google's. It still relies on the model not hallucinating given this info, which is very hard, since too much info -> the model gets confused, but too little info -> the model assumes the info might not be there, so it's useless. The fine balance depends on the user's query, and all approaches like score cutoffs for embeddings etc. just don't generalize.

This is the same exact problem in coding assistants when they hallucinate functions or cannot find the needed dependencies etc.

There are better and more complex approaches that use multiple agents to summarize different smaller queries and then iteratively build up, etc. Internally we, and a lot of companies, have them, but for external customer queries they're way too expensive. You can't spend 30 cents on every query.

TZubiri 3 hours ago [-]
nots
PeterStuer 11 hours ago [-]
Confused. I just tried it in the Shopify Assistant and got:

There is no built-in Liquid property to directly detect Shopify Collective fulfillment in email notifications.

You can use the Admin GraphQL API to programmatically detect fulfillment source.

In Liquid, you must rely on tags, metafields, or custom properties that you set up yourself to mark Collective items.

If you want to automate this, consider tagging products or orders associated with Shopify Collective, or using an app to set a metafield, and then check for that in your Liquid templates.

What you can do in Liquid (email notifications):

If Shopify exposes a tag, property, or metafield on the order or line item that marks it as a Shopify Collective item, you could check for that in Liquid. For example, if you tag orders or products with "Collective", you could use:

  {% if order.tags contains "Collective" %}
    <!-- Show Collective-specific content -->
  {% endif %}
or for line items:

  {% for line_item in line_items %}
    {% if line_item.product.tags contains "Collective" %}
      <!-- Show something for Collective items -->
    {% endif %}
  {% endfor %}
In the author's 'wrong' vs. 'seems to work' answer, the only difference is the tag being on the line items vs. the order. The flow (template? he refers to it as 'some other cryptic Shopify process') that he uses in his tests does seem to add the 'Shopify Collective' tag to the line items, and potentially also to the order if the whole order is Shopify Collective fulfilled, but without further info we can only guess at his setup.

While using AI can always lead to non-perfect results, I feel the evidence presented here does not support the conclusion.

P.S. Given the reference to 'cryptic Shopify processes', I wonder how far the author would get with 'just the docs'.

hennell 8 hours ago [-]
So because you got a good response the conclusion is invalid? How does the user know if they got a good response or a bad one? Due to the parameters passed most LLMs are functionally non-deterministic, rarely giving the same answer twice even with the same question.

I just asked chatgpt "whats the best database structure for a users table where you have users and admins?" in two different browser sessions. One gave me sql with varchars and a role column using:

    role VARCHAR(20) NOT NULL CHECK (role IN ('user', 'admin')),
the other session used text columns and defined an enum to use first:

    CREATE TYPE user_role AS ENUM ('user', 'admin', 'superadmin');
    //other sql snipped
    role user_role NOT NULL DEFAULT 'user',
An AI Assistant should be better tuned but often isn't. That variance, to me, makes it feel wildly unhelpful for 'documentation', as two people end up with quite different solutions.
PeterStuer 8 hours ago [-]
So by extrapolation all of the IT books of the past were "wildly unhelpful" as no two of them presented the exact same solution to a problem, even all those pretending to be 'best practice'?

Your question is vague (a technical observation, not meant to be derogatory). In which DBMS? By what metric of 'best'? For which size of database? Does it need to support internationalization? Will the roles be updated or extended in the future, etc.?

You could argue an AI Assistant would need to ask you for this clarification when the question is vague, rather than make a guess. But in extremis this is not workable in practice. If every minute factor needed to be answered by the user before getting a result, only the true experts would ever reach the stage of getting an answer.

This is not just an AI problem, but a problem (human) business and technical analysts face every day in their work. When do you switch to proposing a solution rather than asking further details? It is BTW also why all those BPM or RPA platforms that promise to eliminate 'programming' and let the business analyst 'draw' a solution often fail miserably. They either have too narrow defaults or keep needing to be fed detail long past the BA's comfort zone.

redhale 10 hours ago [-]
I think you're making the author's point, though. If two users ask the bot the same question and get different answers, is the bot valuable? A dice roll that might be (or is even _probably_) correct is not what I want when going directly to the official docs.
PeterStuer 8 hours ago [-]
Not sure the author is giving the full account though, as his answer snippet was probably just a part of the same answer I got, framed and interpreted differently (the AI's are never this terse as to just whip out a few lines of code).

Besides, it is not even incorrect in the way he states it is. It is fully dependent on how he added the tags in his flow, as the complete answer correctly stated. He speculates on some timing issue in some 'cryptic Shopify process' adding the tag at a later stage, but this is clearly wrong as his "working answer" (which is also in the Assistant reply) does rely on the tag having been added at the same point in the process.

My pure and deliberately exaggerated speculation: he just blindly copied some flow template, then from the (same as I got?) Assistant's answer copy/pasted the first Liquid code box, tested it on one order and found it not doing what he wanted, which suited his confirmation bias regarding AI; later he tried pasting the second Liquid code box (or the same answer you will get from Gemini through Google Search), found 'it worked' on his one test order, and still blamed the Assistant for being 'wrong'.

dworks 7 hours ago [-]
it's non-deterministic. it gives different answers each time you ask, potentially, and small differences in your prompt yield different completions. it doesn't actually understand your prompt, you know.