Case Study
A 27x catalog scale-up
The classifier didn't just save time. It protected the catalog architecture from being quietly degraded by everyone who touched it downstream.
Role
TBD
Timeline
TBD
Team
PMs + 2 engineers
The problem
Mula was expanding its merchandise catalog from 300 products to 8,000, on a tight timeline. My job was to design the taxonomy that would govern how all of them were navigated, across three catalogs and three supplier structures that had never been unified. Eight thousand products can't be crammed into a taxonomy built for three hundred. This needed a new architecture, designed from the ground up to host a fundamentally different scale.
Bad taxonomy doesn't just frustrate users. It loses them quietly, before they ever reach out.
The architecture decision
Cross-referencing the three supplier trees made the answer clear: one had significantly stronger fundamentals. It was functional rather than material-based, consistently scoped, and built on a clean two-level hierarchy. Under the time constraint, building from scratch would have been thorough but slow. The smarter move was to use the strongest existing structure as a reference spine, then adapt and refine it into something that could genuinely host the expanded catalog. An architectural decision with years of downstream consequences, made in days.
Why mapping wasn't enough
With the spine defined, category mapping was supposed to handle most of the volume. The problem: supplier category names were meaningless as mapping rules. A category called Küchenzubehör (Kitchen Accessories) sounded clean on paper. The only way to understand its actual scope was to open it and check the products inside. What looked like one decision was dozens, with each article assessed individually.
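One way to make that mismatch visible is a small audit over product-level assignments. The sketch below is illustrative only: it assumes each product has already been paired with a supplier category and a target category, and every category name other than Küchenzubehör is hypothetical.

```python
from collections import defaultdict

def split_categories(assignments):
    """Return supplier categories whose products landed in more than one target category."""
    targets = defaultdict(set)
    for supplier_category, target_category in assignments:
        targets[supplier_category].add(target_category)
    # Keep only the supplier categories that a single mapping rule cannot cover.
    return {name: sorted(found) for name, found in targets.items() if len(found) > 1}

# Hypothetical (supplier category, target category) pairs, one per product.
assignments = [
    ("Küchenzubehör", "Food preparation"),
    ("Küchenzubehör", "Storage"),
    ("Küchenzubehör", "Food preparation"),
    ("Gläser", "Drinkware"),
]

needs_product_level_review = split_categories(assignments)
print(needs_product_level_review)
```

Any supplier category that shows up in the result cannot be mapped by name alone and needs a decision per article.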
When the developer executed the first batch of mappings, I could already spot errors in what had been built. At scale, those errors would repeat invisibly across thousands of products, introduced by people without the taxonomy knowledge I'd built. The consequences were concrete:
This wasn't a mapping problem. It was a system problem.
The build
I built a classifier in n8n to test whether categorization could be automated. Four problems surfaced on the way, each one teaching a distinct lesson.
01
Large batch, first run
No sample cap on the first run. Cost surfaced immediately and was hard to defend. The approach wasn't proven yet, and scaling before stability is a mistake that compounds.
Prove the approach on a sample. Scale is cheap once it works, expensive when it doesn't.
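A sample cap can be as simple as the sketch below. It is not the n8n workflow itself: the cap of 50, the seed, and the SKU format are assumptions chosen for illustration, and only the catalog size of 8,000 comes from the project.

```python
import random

def capped_sample(products, cap=50, seed=0):
    """Return at most `cap` products, picked reproducibly at random."""
    if len(products) <= cap:
        return list(products)
    rng = random.Random(seed)  # fixed seed so the same pilot set can be re-run
    return rng.sample(products, cap)

# Hypothetical catalog records; only the count mirrors the case study.
catalog = [{"sku": f"SKU-{i:04d}"} for i in range(8000)]

pilot = capped_sample(catalog, cap=50)
print(len(pilot), "of", len(catalog), "products go through the classifier first")
```

Fixing the seed keeps the pilot set stable between runs, so a change in results reflects a change in the prompt or input rather than a different sample.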
02
Images were the wrong input
I'd been feeding the model product name, description, and images, which forced the use of a vision-capable model that cost fifteen times more. Replacing the images with HTS codes (internationally standardised product classification codes that carry precise functional information) cut the cost fifteenfold. Performance improved as a side effect. Better input data, not a bigger model.
Match the input to the task. The right data point outperforms a bigger model.
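As a rough sketch of what a text-only input could look like, the function below assembles name, description, and HTS code and leaves images out entirely. The field names and the sample product are assumptions, and the HTS value is a placeholder rather than a real code.

```python
def build_classifier_input(product):
    """Assemble a text-only model input; image fields are deliberately ignored."""
    fields = [
        ("Name", product.get("name", "")),
        ("Description", product.get("description", "")),
        ("HTS code", product.get("hts_code", "")),
    ]
    # Skip empty fields so the model never sees blank labels.
    return "\n".join(f"{label}: {value}" for label, value in fields if value)

# Hypothetical product record; the HTS code is a placeholder, not a real code.
product = {
    "name": "Garlic press",
    "description": "Stainless steel press for crushing garlic cloves",
    "hts_code": "0000.00.0000",
    "image_urls": ["https://example.com/garlic-press.jpg"],
}

text_input = build_classifier_input(product)
print(text_input)
```

Because the result is plain text, it can be sent to a text-only model, which is where the cost difference described above comes from.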
03
Prompt instability
I iterated the prompt extensively: explicit instructions, guardrails, function prioritisation. Every fix shifted the errors somewhere else. The model wasn't failing on instructions. It was failing under architectural load.
When fixing one thing consistently breaks another, the architecture is wrong. More instructions won't break through that ceiling.
04
Separating understanding from labeling
Chain of Thought solved it structurally: two nodes instead of one. The first asks the model to describe the product's primary function in one sentence. The second feeds that reasoning into the classification prompt. Both failure modes disappeared at once. Uncategorized products gone, miscategorized products gone.
Separate reasoning from labeling, structurally, not just in the instructions. Understanding before labeling isn't a prompt trick. It's the right architecture for the task.
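The two-node structure can be sketched outside n8n as two separate model calls. Everything named here is an assumption for illustration: `llm` stands for any function that takes a prompt string and returns text, the prompt wording is invented, and `fake_llm` is a canned stand-in so the example runs without a model.

```python
UNCATEGORIZED = "UNCATEGORIZED"

def describe_function(llm, product_text):
    """Stage 1: ask only for the product's primary function, in one sentence."""
    prompt = (
        "Describe the primary function of this product in one sentence.\n\n"
        + product_text
    )
    return llm(prompt).strip()

def assign_category(llm, product_text, function_sentence, categories):
    """Stage 2: classify, with the stage-1 sentence supplied as context."""
    prompt = (
        "Choose exactly one category for the product below. "
        "Answer with the category name only.\n\n"
        "Categories: " + "; ".join(categories) + "\n"
        "Primary function: " + function_sentence + "\n\n"
        + product_text
    )
    answer = llm(prompt).strip()
    # Anything outside the allowed list is flagged instead of silently accepted.
    return answer if answer in categories else UNCATEGORIZED

def classify_product(llm, product_text, categories):
    """Run both stages in order: understand first, then label."""
    function_sentence = describe_function(llm, product_text)
    return assign_category(llm, product_text, function_sentence, categories)

def fake_llm(prompt):
    """Canned stand-in for a real model call (hypothetical responses)."""
    if prompt.startswith("Describe"):
        return "Crushes garlic cloves for cooking."
    return "Food preparation"

label = classify_product(fake_llm, "Name: Garlic press", ["Food preparation", "Storage"])
print(label)
```

Keeping the stages as separate calls means the labeling prompt always receives an explicit statement of function, rather than relying on the model to reason and label in a single step.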
The outcome
| Before | After |
|---|---|
| ~1.7 months of manual catalog work. | 3 days to build the classifier. |
| Already producing errors on the easiest part. | More accurate than the manual work had been. |
| Cost grew with every new supplier onboarding. | Cost near-zero per new onboarding. |
The taxonomy reached production intact, not as a degraded approximation shaped by downstream judgment calls, each made with less context than the one before. The next supplier onboarding doesn't start over. The infrastructure exists.
Prompt instability is a symptom of a flawed architecture
Unstable prompts are a diagnostic signal. When every fix shifts the errors rather than eliminating them, the architecture is the issue, not the instructions. I now check for that pattern first, before iteration costs accumulate. The input data matters as much as the model: the HTS code outperformed images not because it was cheaper to process, but because it was a more precise signal for the task.
What changes
Execution used to be where good design went to get compromised. A catalog architecture designed with care, handed off to a manual process, degraded by a thousand small decisions made without the full picture.