Methodology

The full protocol behind the 2026 benchmark.

Test set construction

We captured 15,000 meal photos between November 2025 and February 2026. Each meal was prepared by a contracted chef, portioned to a target weight, and then weighed to ±0.1 g on a calibrated lab scale. Photos were taken with an iPhone 15 Pro at a fixed 45° angle under 5500 K lighting at 1,100 lux.
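Each photo therefore carries a ground-truth record pairing the dish label with its weighed portion. A minimal sketch of what one entry might look like (the PhotoRecord class and its field names are our illustration, not the benchmark's actual data format):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PhotoRecord:
    """Hypothetical schema for one test-set entry."""
    photo_id: str
    cuisine: str              # one of the ten benchmark cuisines
    tier: str                 # "standard" | "moderate" | "challenging"
    dish_name: str            # ground-truth dish label
    target_weight_g: float    # weight the chef portioned toward
    measured_weight_g: float  # lab-scale reading (±0.1 g); the scoring ground truth

record = PhotoRecord("IT-0042", "Italian", "standard",
                     "spaghetti carbonara", 320.0, 319.9)
```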

Cuisine distribution

The 15,000 photos were split equally across ten cuisines, 1,500 each: Italian, Japanese, Mexican, Indian, Levantine, Thai, West African, US, French, and Chinese. Within each cuisine, one third (500 photos) was assigned to each difficulty tier (standard / moderate / challenging).
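These numbers fully determine the sampling grid: 10 cuisines × 1,500 photos = 15,000, and 1,500 / 3 tiers = 500 photos per cuisine-tier cell. A short sketch of the resulting grid, assuming the cell counts are exact:

```python
from itertools import product

CUISINES = ["Italian", "Japanese", "Mexican", "Indian", "Levantine",
            "Thai", "West African", "US", "French", "Chinese"]
TIERS = ["standard", "moderate", "challenging"]
PHOTOS_PER_CUISINE = 1_500

# One third of each cuisine's 1,500 photos per tier -> 500 per cell.
per_cell = PHOTOS_PER_CUISINE // len(TIERS)
allocation = {pair: per_cell for pair in product(CUISINES, TIERS)}

assert sum(allocation.values()) == 15_000  # 10 cuisines x 1,500 photos each
```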

Submission protocol

Every app was tested from a freshly created account with no prior history. Each photo was submitted three times over a five-minute interval, and the median of the three results was used for scoring to filter out non-deterministic decoding artefacts.
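A minimal sketch of that filtering step, assuming the median is taken independently per nutrient field (the consensus_estimate helper and the field names are ours, not any app's API):

```python
from statistics import median

def consensus_estimate(estimates: list[dict[str, float]]) -> dict[str, float]:
    """Collapse three submissions of the same photo into one result by
    taking the per-field median, so a single mis-decode cannot skew scoring."""
    assert len(estimates) == 3
    fields = estimates[0].keys()
    return {f: median(e[f] for e in estimates) for f in fields}

# Three runs of the same photo; the third run mis-decodes the calories.
runs = [
    {"calories": 520.0, "protein_g": 31.0},
    {"calories": 540.0, "protein_g": 30.0},
    {"calories": 980.0, "protein_g": 30.5},  # outlier; the median ignores it
]
print(consensus_estimate(runs))  # {'calories': 540.0, 'protein_g': 30.5}
```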

Scoring

  • Recognition (30%) — Top-1 dish identification rate.
  • Portion (25%) — Mean absolute percentage error vs. weighed ground truth (lower is better).
  • Speed (20%) — Median shutter-to-result latency (lower is better), with P25/P75 reported.
  • Coverage (15%) — Breadth and depth of the food taxonomy.
  • Learning (10%) — Improvement in accuracy after 14 days of personal logs.
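Since the weights sum to 100%, they define a weighted composite. The methodology does not state how the lower-is-better metrics are normalized, so the sketch below assumes each component is first mapped onto a 0-to-1 scale where higher is better (e.g. 1 − MAPE for portion error); the function names and example numbers are illustrative, not the benchmark's published formula:

```python
WEIGHTS = {
    "recognition": 0.30,  # top-1 dish identification rate (already in [0, 1])
    "portion":     0.25,  # derived from MAPE vs. weighed ground truth
    "speed":       0.20,  # derived from median shutter-to-result latency
    "coverage":    0.15,  # taxonomy breadth and depth
    "learning":    0.10,  # accuracy gain after 14 days of personal logs
}

def mape(pred_g: list[float], true_g: list[float]) -> float:
    """Mean absolute percentage error against the weighed ground truth."""
    return sum(abs(p - t) / t for p, t in zip(pred_g, true_g)) / len(pred_g)

def composite(scores: dict[str, float]) -> float:
    """Weighted composite over component scores.

    Assumes each component has already been normalized to [0, 1] with
    higher = better; inverting lower-is-better metrics (MAPE, latency)
    beforehand is our assumption, not a stated part of the protocol.
    """
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

# Hypothetical app: 84% top-1 recognition, normalized portion score 0.78, etc.
print(composite({"recognition": 0.84, "portion": 0.78,
                 "speed": 0.90, "coverage": 0.70, "learning": 0.60}))  # 0.792
```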

What we deliberately do not measure

We do not score UI polish, brand recognition, or social features. These matter to some users, but they do not affect whether a logged meal is accurate.