Structured product data, extracted by LLMs.
A Fortune 500 UK FMCG client ingested high volumes of product data from many third-party providers, but critical features lived only inside unstructured text. Phinest built an LLM-powered pipeline that aggregates sources, extracts attributes from text and images, and validates them against controlled vocabularies.
The pipeline fills in product features that were previously missing from structured fields, across millions of products.
structured product fields populated from previously unstructured text
items run through the extraction and validation pipeline
attributes extracted per product from descriptions and images
From unstructured text to analytics-ready data.
Product data arrived from multiple third-party providers in inconsistent formats, with key features buried in descriptive free text. Manual standardization was slow and costly, and the inconsistency blocked pricing analysis, inventory decisions, and internal ML models. The pipeline turns that raw input into structured, validated attributes.
What Phinest shipped
- A unified ingestion layer that aggregates product data from catalogs, external APIs, and product images
- LLM-powered extraction that parses descriptions, analyzes images with a vision model, and normalizes units
- Validation against controlled vocabularies that auto-corrects errors and detects bundled products
- A human-in-the-loop annotation tool for edge cases that feeds continuous improvement
Why it mattered
- Manual standardization, once prohibitively time-consuming and expensive, is now automated
- Consistent data formats unblock pricing analyses, inventory decisions, and internal ML models
- Clean, complete attributes flow into analytics, ML models, decision tools, and downstream systems
A multi-stage LLM extraction pipeline.
Each stage has a focused job: aggregate the sources, extract attributes with LLMs and a vision model, validate against controlled vocabularies, and route the hard cases to people.
Data ingestion
Product data from catalogs, external APIs, and product images is aggregated into a single unified pipeline
LLM extraction
Text parsing and a vision model pull attributes from descriptions and images, and units are then normalized and written to structured fields
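
The unit normalization step can be illustrated with a minimal Python sketch. Everything named here is an assumption for illustration, not the client pipeline's actual code: the `UNIT_FACTORS` table, the `normalize_quantity` helper, and the choice of grams and millilitres as canonical units are all hypothetical.

```python
import re

# Hypothetical conversion table: unit alias -> (canonical unit, multiplier).
# The real pipeline's units and aliases are not described in the source.
UNIT_FACTORS = {
    "g": ("g", 1.0),
    "gram": ("g", 1.0),
    "grams": ("g", 1.0),
    "kg": ("g", 1000.0),
    "ml": ("ml", 1.0),
    "cl": ("ml", 10.0),
    "l": ("ml", 1000.0),
    "litre": ("ml", 1000.0),
    "litres": ("ml", 1000.0),
}

# A number (optionally decimal) followed by a run of letters, e.g. "1.5 L".
QUANTITY_RE = re.compile(r"(\d+(?:\.\d+)?)\s*([a-zA-Z]+)")


def normalize_quantity(raw):
    """Return (value, canonical_unit) for the first recognized quantity, else None."""
    for match in QUANTITY_RE.finditer(raw):
        unit = match.group(2).lower()
        if unit in UNIT_FACTORS:
            canonical, factor = UNIT_FACTORS[unit]
            return (float(match.group(1)) * factor, canonical)
    return None
```

Under these assumptions, `normalize_quantity("1.5 L")` returns `(1500.0, "ml")`, and `normalize_quantity("6 x 330ml")` returns `(330.0, "ml")`: the sketch reports the per-unit size and ignores the pack count, which would be handled separately as a bundle.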
Validation
Quality assurance checks extracted values against allowed values, auto-corrects errors, and detects bundles before data is released
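
As a rough sketch of what checking against allowed values with auto-correction might look like, the Python below uses fuzzy string matching from the standard library's `difflib`. This is one possible technique, not necessarily the one used in the project. The `ALLOWED_VALUES` vocabulary, the `validate_attribute` helper, the status labels, and the `0.8` cutoff are all hypothetical.

```python
from difflib import get_close_matches

# Hypothetical controlled vocabulary; real attribute names and values
# are not given in the source.
ALLOWED_VALUES = {
    "packaging": ["bottle", "can", "carton", "pouch", "jar"],
}


def validate_attribute(name, value, cutoff=0.8):
    """Check a value against the vocabulary, correcting near-misses."""
    allowed = ALLOWED_VALUES.get(name)
    if allowed is None:
        return {"value": value, "status": "unknown_attribute"}

    cleaned = value.strip().lower()
    if cleaned in allowed:
        return {"value": cleaned, "status": "valid"}

    # Auto-correct only when exactly one close candidate is requested
    # and it clears the similarity cutoff.
    matches = get_close_matches(cleaned, allowed, n=1, cutoff=cutoff)
    if matches:
        return {"value": matches[0], "status": "corrected"}

    # Nothing close enough: keep the raw value and flag it.
    return {"value": value, "status": "rejected"}
```

With this vocabulary, `validate_attribute("packaging", "botle")` is corrected to `"bottle"`, while `validate_attribute("packaging", "crate")` is rejected because no allowed value is similar enough. A stricter cutoff trades fewer wrong corrections for more rejected values.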
Human-in-the-loop
An annotation tool routes edge cases to reviewers, and their decisions feed back into continuous model improvement
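
One way to decide which cases count as edge cases is a confidence threshold, sketched below in Python. The source does not say how routing is decided, so the `Extraction` record, its `confidence` score, the `route` function, and the `0.9` threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Extraction:
    product_id: str
    attribute: str
    value: str
    confidence: float  # hypothetical score in [0, 1] attached by the extractor
    status: str        # hypothetical validation outcome, e.g. "valid" or "rejected"


def route(extractions, min_confidence=0.9):
    """Split extractions into auto-accepted items and a human review queue."""
    auto_accepted, review_queue = [], []
    for item in extractions:
        # Send anything that failed validation or scored low to a reviewer.
        needs_review = item.status == "rejected" or item.confidence < min_confidence
        (review_queue if needs_review else auto_accepted).append(item)
    return auto_accepted, review_queue
```

In a setup like this, reviewer decisions on the queued items could be stored alongside the original extraction and reused later as evaluation or training examples.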
Have product data trapped in free text?
Phinest can build an extraction pipeline that turns unstructured descriptions and images into clean, validated fields your analytics and ML systems can actually use.

