Fortune 500 UK FMCG / LLM data extraction

Structured product data, extracted by LLMs.

A Fortune 500 UK FMCG (fast-moving consumer goods) client ingested high volumes of product data from many third-party providers, but critical features lived only inside unstructured text. Phinest built an LLM-powered pipeline that aggregates sources, extracts attributes from text and images, and validates them against controlled vocabularies.

Unstructured text to clean fields

99% data completeness

The pipeline fills in product features that were previously missing from structured fields, across millions of products.

99% data completeness

structured product fields populated from previously unstructured text

3M+ products processed

items run through the extraction and validation pipeline

9 key features

attributes extracted per product from descriptions and images

The work

From unstructured text to analytics-ready data.

Product data arrived from multiple third-party providers in inconsistent formats, with key features buried in descriptive free text. Manual standardization was slow and costly, and the inconsistency blocked pricing analysis, inventory decisions, and internal ML models. The pipeline turns that raw input into structured, validated attributes.

What Phinest shipped

A unified ingestion layer that aggregates product data from catalogs, external APIs, and product images
LLM-powered extraction that parses descriptions, analyzes images with a vision model, and normalizes units
Validation against controlled vocabularies that auto-corrects errors and detects bundled products
A human-in-the-loop annotation tool for edge cases that feeds continuous improvement

Why it mattered

Manual standardization, once prohibitively time-consuming and expensive, is now automated
Consistent data formats unblock pricing analyses, inventory decisions, and internal ML models
Clean, complete attributes flow into analytics, ML models, decision tools, and downstream systems

Solution architecture

A multi-stage LLM extraction pipeline.

Each stage has a focused job: aggregate the sources, extract attributes with LLMs and vision, validate against the rules, and route the hard cases to people.

Data ingestion

Product data from catalogs, external APIs, and product images is aggregated into a single unified pipeline
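
As an illustration of the aggregation step, here is a minimal sketch that merges records from several sources into one record per product. The field names (product_id, description, image_urls) and the aggregate_sources helper are assumptions made for this example, not the client's actual schema or Phinest's implementation.

```python
from collections import defaultdict


def aggregate_sources(*sources):
    """Merge per-source records into one unified record per product ID.

    Each source is a (name, records) pair. Descriptions and image URLs are
    collected rather than overwritten, so no provider's text is lost.
    """
    merged = defaultdict(
        lambda: {"descriptions": [], "image_urls": [], "sources": []}
    )
    for name, records in sources:
        for rec in records:
            item = merged[rec["product_id"]]
            item["sources"].append(name)  # keep provenance for later audits
            if rec.get("description"):
                item["descriptions"].append(rec["description"])
            item["image_urls"].extend(rec.get("image_urls", []))
    return dict(merged)


# Hypothetical sample data from two providers.
catalog = [{"product_id": "A1", "description": "Orange juice 1L carton"}]
api = [
    {"product_id": "A1", "description": "OJ, 1 litre", "image_urls": ["a1.jpg"]},
    {"product_id": "B2", "description": "Dish soap 500ml"},
]
unified = aggregate_sources(("catalog", catalog), ("api", api))
```

Collecting every description per product, instead of picking one, leaves the choice of which text to trust to the extraction stage.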

LLM extraction

A vision model and text parsing pull attributes from descriptions and images, then normalize units into structured fields
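
Unit normalization is the most mechanical part of this stage, so it is the easiest to sketch. The example below assumes a small, hypothetical unit table and a regex-based parser. The real pipeline may rely on the LLM output or a dedicated units library instead.

```python
import re

# Hypothetical conversion table: unit -> (base unit, multiplier).
UNIT_FACTORS = {
    "ml": ("ml", 1.0),
    "cl": ("ml", 10.0),
    "l": ("ml", 1000.0),
    "litre": ("ml", 1000.0),
    "liter": ("ml", 1000.0),
    "g": ("g", 1.0),
    "kg": ("g", 1000.0),
}

# Longer unit names come before their one-letter prefixes in the alternation.
QUANTITY_RE = re.compile(
    r"(\d+(?:\.\d+)?)\s*(ml|cl|litre|liter|kg|l|g)\b", re.IGNORECASE
)


def normalize_quantity(text):
    """Return the first quantity in `text` converted to ml or g, or None."""
    match = QUANTITY_RE.search(text)
    if match is None:
        return None
    value = float(match.group(1))
    base_unit, factor = UNIT_FACTORS[match.group(2).lower()]
    return {"value": value * factor, "unit": base_unit}
```

With this table, "1.5L bottle" and "1500 ml bottle" both normalize to the same structured value, which is what makes downstream comparisons possible.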

Validation

Quality assurance checks extracted values against the allowed values, auto-corrects errors, and detects bundles before data is released
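
One plausible way to validate against a controlled vocabulary and auto-correct near misses is fuzzy string matching. The sketch below uses Python's standard difflib for this. The vocabulary, the 0.8 cutoff, and the validate_value helper are illustrative assumptions, and the production system may use a different correction method.

```python
from difflib import get_close_matches

# Hypothetical controlled vocabulary for a "packaging" attribute.
ALLOWED_PACKAGING = ["bottle", "can", "carton", "pouch", "jar"]


def validate_value(raw, allowed, cutoff=0.8):
    """Check an extracted value against the allowed list.

    Returns the value with a status of "valid", "corrected" (a close match
    was substituted), or "rejected" (nothing similar enough was found).
    """
    value = raw.strip().lower()
    if value in allowed:
        return {"value": value, "status": "valid"}
    close = get_close_matches(value, allowed, n=1, cutoff=cutoff)
    if close:
        return {"value": close[0], "status": "corrected"}
    return {"value": None, "status": "rejected"}
```

A fairly high cutoff keeps the auto-correction conservative: a misspelling such as "botle" is mapped to "bottle", while unrelated text is rejected rather than forced into the vocabulary.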

Human-in-the-loop

An annotation tool routes edge cases to reviewers, and their decisions feed back into continuous model improvement
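
The routing logic can be sketched as a simple threshold rule. This example assumes each extraction carries a confidence score between 0 and 1 and uses a hypothetical threshold of 0.85. How the actual pipeline identifies edge cases is not described here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Extraction:
    product_id: str
    attribute: str
    value: Optional[str]  # None when validation rejected the value
    confidence: float     # assumed model confidence in [0, 1]


def route(extractions, threshold=0.85):
    """Split extractions into auto-accepted items and a human review queue."""
    accepted, review_queue = [], []
    for item in extractions:
        if item.value is None or item.confidence < threshold:
            review_queue.append(item)
        else:
            accepted.append(item)
    return accepted, review_queue


def apply_review(item, reviewed_value):
    """Record a reviewer's decision as a labeled example for later improvement."""
    return {
        "product_id": item.product_id,
        "attribute": item.attribute,
        "model_value": item.value,
        "label": reviewed_value,
    }
```

Storing the model's original value next to the reviewer's label gives a ready-made evaluation set, and the same records can later be used to refine prompts or models.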

Have product data trapped in free text?

Phinest can build an extraction pipeline that turns unstructured descriptions and images into clean, validated fields your analytics and ML systems can actually use.