Nutrients, Vol. 18, Pages 23: Large Language Models for Real-World Nutrition Assessment: Structured Prompts, Multi-Model Validation and Expert Oversight
Nutrients doi: 10.3390/nu18010023
Authors: Aia Ase, Jacek Borowicz, Kamil Rakocy, Barbara Piekarska
Background: Traditional dietary assessment methods face limitations including reporting bias and scalability challenges. Large language models (LLMs) offer potential for automated food classification, yet their validation in morphologically complex, non-English languages like Polish remains limited. Methods: We analyzed 1992 food items from a Polish long-term care facility (LTCF) cohort using three advanced LLMs (Claude Opus 4.5, Gemini 3 Pro, and GPT-5.1-chat-latest) with two prompting strategies: a structured double-step prompt integrating NOVA and World Health Organization (WHO) criteria, and a simplified single-step prompt. Classifications were compared against consensus judgments from two human experts. Results: All LLMs showed high agreement with human experts (90.3–94.2%), but there were statistically significant differences in all pairwise comparisons (χ² = 1174.5–1897.1; p < 0.001). The structured prompt produced very high Recall for UNHEALTHY items at the cost of lower Specificity, whereas the simplified prompt achieved higher overall Accuracy and a more balanced Recall–Specificity profile, indicating a trade-off between strict guideline adherence and alignment with general human judgment. Conclusions: Advanced LLMs demonstrate near-expert accuracy in Polish-language dietary classification, enhancing workflow efficiency by shifting effort toward validation. Expert oversight remains essential, and multi-model consensus alongside language-specific validation can improve AI reliability in nutrition assessment.
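The Recall–Specificity trade-off described in the Results can be made concrete with a minimal sketch. The function below computes the three reported metrics from a 2×2 confusion matrix for a binary HEALTHY/UNHEALTHY classification, treating UNHEALTHY as the positive class; the counts used here are purely illustrative and do not come from the study.

```python
# Minimal sketch: Recall, Specificity, and Accuracy for a binary
# HEALTHY/UNHEALTHY food classification, with UNHEALTHY as the
# positive class. All counts below are hypothetical examples.

def binary_metrics(tp, fp, tn, fn):
    """Return (recall, specificity, accuracy) for a 2x2 confusion matrix.

    tp: UNHEALTHY items correctly flagged as UNHEALTHY
    fp: HEALTHY items wrongly flagged as UNHEALTHY
    tn: HEALTHY items correctly labeled HEALTHY
    fn: UNHEALTHY items missed (labeled HEALTHY)
    """
    recall = tp / (tp + fn)                 # sensitivity for UNHEALTHY items
    specificity = tn / (tn + fp)            # correct HEALTHY classifications
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return recall, specificity, accuracy

# Illustrative profile resembling a strict, guideline-driven prompt:
# nearly all UNHEALTHY items are caught (high Recall), but some
# HEALTHY items are over-flagged (lower Specificity).
recall, spec, acc = binary_metrics(tp=480, fp=120, tn=1300, fn=20)
print(f"Recall={recall:.3f} Specificity={spec:.3f} Accuracy={acc:.3f}")
```

With these hypothetical counts, Recall is 0.960 while Specificity drops to about 0.915, mirroring the pattern the abstract attributes to the structured prompt: stricter criteria catch more UNHEALTHY items but over-flag some HEALTHY ones.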
