LLM Evaluation System and Culinary App

A two-component project combining an LLM Evaluation Framework (Python, 19 files, 50 test cases) with a mobile app ZjemTo (React Native, 29 screens).

Python, React Native, GPT-4o, Claude, Bielik, Ollama

v1.01 — Bielik Quantization for iPhone

Status: in research and planning phase

Problem: At scale, cloud hosting costs would sink an indie project, and self-hosting means request queues as the number of users grows. Decision: the model must run locally on a device with 8 GB RAM (iOS).

Quantization: IQ2_XXS vs IQ3_XXS

Variant | Size | Advantages | Risks
IQ2_XXS | ~3.1 GB | More memory headroom on 8 GB iPhone | At 2 bits the model may lose reasoning capability
IQ3_XXS | ~3.8 GB | Significantly better quality (big jump between 2 and 3 bits) | Tight: KV cache + iOS overhead may cause OOM
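
The trade-off in the table can be explored with a rough memory budget. The sketch below is only an illustration: the layer count, KV head count, head dimension, context length, runtime overhead, and per-app memory budget are placeholder assumptions, not measured values for Bielik or for any specific iPhone. Only the two model sizes come from the table.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                n_ctx: int, bytes_per_value: int = 2) -> float:
    """Size of the key/value cache in GB (keys and values, hence the factor 2)."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_value / 1e9


def fits(weights_gb: float, kv_gb: float, overhead_gb: float, budget_gb: float) -> bool:
    """True when weights, KV cache and runtime overhead stay within the budget."""
    return weights_gb + kv_gb + overhead_gb <= budget_gb


# Placeholder architecture and budget values, for illustration only.
kv = kv_cache_gb(n_layers=50, n_kv_heads=8, head_dim=128, n_ctx=4096)
for variant, weights_gb in {"IQ2_XXS": 3.1, "IQ3_XXS": 3.8}.items():
    print(variant, fits(weights_gb, kv, overhead_gb=0.5, budget_gb=5.0))
```

Shrinking n_ctx or quantizing the KV cache (a smaller bytes_per_value) are the two levers this formula exposes for the larger variant.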

Custom IMatrix for Polish Cuisine

The importance matrix (IMatrix) is calibrated on domain texts: Compendium Ferculorum, regional recipes, and synthetic data. The goal is for the weights that matter most for Polish culinary terms to keep more precision during quantization.

Hybrid Architecture

Aspect | Online (hybrid) | Offline (full local)
Cost | Lower than hosting full Bielik | Zero
Quality | Higher (cloud preprocessing) | Depends on quantization
Availability | Requires internet | Always

Project Description

A two-component project combining:

  1. LLM Evaluation Framework — A system for evaluating language models on culinary recipe extraction and generation
  2. ZjemTo — A React Native mobile app for discovering Polish and international recipes powered by AI

The project contains ~150 source files, 50 test cases, and advanced mechanisms for detecting and preventing AI hallucinations.


System Architecture

├── llm-eval-framework/          # Python evaluation system
│   ├── config/config.yaml       # Main configuration
│   ├── src/                     # Evaluation modules (19 Python files)
│   │   ├── models/              # AI model runners
│   │   ├── judge/               # Hallucination detection
│   │   ├── eval/                # Evaluation pipeline
│   │   ├── report/              # Report generation
│   │   └── utils/               # Cache, logging
│   ├── data/eval_cases/         # 50 test cases
│   ├── results/                 # Evaluation results
│   └── tests/                   # Unit tests
│
├── ZjemTo/                      # React Native app
│   ├── src/
│   │   ├── screens/             # 29 app screens
│   │   ├── services/            # AI integration layer
│   │   ├── components/          # 50+ UI components
│   │   ├── data/                # Recipes, historical data
│   │   └── config/              # API configuration
│   ├── server/                  # Express backend (AI proxy)
│   └── package.json
│
├── semantic_cache.py            # Redis cache tool
└── README.md

LLM Evaluation Framework

Model Runners

File | Description
base.py | Abstract ModelRunner class: extract/generate interface
openai_runner.py | GPT-4o, GPT-4, GPT-4o-mini support
claude_runner.py | Claude Sonnet and Haiku support
bielik_runner.py | Local Bielik model via Ollama with validation
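
A minimal sketch of what an abstract runner with an extract/generate interface could look like. Only the class name ModelRunner and the two method names come from the table above; the signatures, the model_name attribute, and the EchoRunner stand-in are assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class ModelRunner(ABC):
    """Common interface: every runner can extract and generate."""

    def __init__(self, model_name: str) -> None:
        self.model_name = model_name

    @abstractmethod
    def extract(self, raw_text: str) -> Dict[str, Any]:
        """Turn raw recipe text into structured data."""

    @abstractmethod
    def generate(self, recipe: Dict[str, Any]) -> str:
        """Turn structured data into a recipe text."""


class EchoRunner(ModelRunner):
    """Hypothetical stand-in that exercises the interface without calling any model."""

    def extract(self, raw_text: str) -> Dict[str, Any]:
        return {"title": raw_text.strip().splitlines()[0], "ingredients": []}

    def generate(self, recipe: Dict[str, Any]) -> str:
        return f"Przepis: {recipe['title']}"


runner = EchoRunner("echo")
print(runner.generate(runner.extract("Pierogi ruskie\nmąka, ziemniaki, twaróg")))
```

With a shared base class, the evaluation pipeline can loop over runners without knowing which provider sits behind each one.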

Evaluation Pipeline

File | Description
eval_runner.py | Main orchestrator (258 lines)
extraction.py | OCR → JSON extraction with validation
generation.py | JSON → recipe generation in Polish

Hallucination Detection

File | Description
judge_runner.py | GPT-4o evaluation (225 lines)
prompts.py | Three-phase judge prompts

Reports

File | Description
report_generator.py | Markdown + JSON reports with comparison tables
visualizer.py | Matplotlib charts: accuracy heatmaps, error distributions

Test Data

Dataset | Count | Description
Modern | 20 | Contemporary recipes
Traditional | 18 | Classic Polish dishes
Historical | 12 | Archaic language, adversarial

ZjemTo App

A React Native (Expo) mobile app with 29 screens, 50+ UI components, and integration with multiple AI models.

Architecture

Phone (Expo Go) → Express Server (:3001) → Ollama (:11434) for Bielik
                                              → OpenAI API for World mode

Screens

Screen | Description
HomeScreenPremium | Main hub with Bielik/World mode toggle
TraditionalRecipesScreen | Polish recipes from ingredients
RecreateTasteScreen | Dish description → AI identifies it
RegionsScreen | 16 Polish voivodeships
HistoricalRecipesScreen | 6 historical eras
CookingModeScreen | Step-by-step with timers
CameraScreen | Fridge photo → GPT-4o Vision
PlateScoreScreen | Nutritional value analysis
ReceiptScannerScreen | Receipt OCR scanning
SwipeDiscoveryScreen | Tinder-style recipe discovery

AI Services

Service | Responsibility
RecipeAIService | Recipe generation with validation and guardrails
BielikService | Communication with local Bielik via Express proxy
VisionService | GPT-4o Vision image analysis (ingredients, plate, steps)
CacheService | Semantic cache with Redis for repeated queries

Express Endpoints

Endpoint | Method | Description
/api/bielik/generate | POST | Recipe generation via Bielik
/api/bielik/extract | POST | Ingredient extraction from text
/api/vision/analyze | POST | GPT-4o Vision image analysis
/api/recipe/validate | POST | Recipe validation (Polish Guard)
/api/cache/lookup | GET | Semantic cache lookup
/api/health | GET | Health check + model status
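
A small sketch of how a client might build requests against these endpoints. The paths, methods, and port 3001 come from this page; the localhost host, the JSON payload shape (an "ingredients" list), and the build_request helper are assumptions, since the request schema is not documented here.

```python
import json
from typing import Optional
from urllib.request import Request

# Port taken from the architecture diagram; the host is an assumption.
BASE_URL = "http://localhost:3001"


def build_request(path: str, method: str = "GET",
                  payload: Optional[dict] = None) -> Request:
    """Build (but do not send) a request for one of the Express endpoints."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Content-Type": "application/json"} if data else {}
    return Request(BASE_URL + path, data=data, headers=headers, method=method)


health = build_request("/api/health")
# Hypothetical payload: the real field names are not specified on this page.
generate = build_request("/api/bielik/generate", "POST",
                         {"ingredients": ["ziemniaki", "cebula"]})
print(health.get_method(), health.full_url)
print(generate.get_method(), generate.full_url)
```

Passing either object to urllib.request.urlopen would send it once the Express server is running.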

Anti-Hallucination System

A 6-layer system for preventing AI hallucinations, designed specifically for Polish cuisine.

Layer 1: Judge System

The main evaluation system based on GPT-4o. It analyzes generated recipes for 8 types of hallucinations:

Hallucination Type | Description | Example
ingredient | Adding an ingredient not in the list | Pierogi recipe with avocado
quantity | Unrealistic quantities | 500g of salt for one serving
step | Fabricated cooking step | "Ferment for 3 weeks"
time | Incorrect cooking time | "Cook pasta for 2 hours"
temperature | Unrealistic temperature | "Bake at 50°C for 5 minutes"
tool | Non-existent tool | "Use a molecular dehydrator"
context | Wrong cultural context | "Traditional Polish ramen"
historical_anachronism | Historical anachronism | "Tomatoes in medieval Poland"

Severity classification:

Severity | Penalty | Description
HIGH | -20 pts | Critical: recipe is dangerous or absurd
MEDIUM | -10 pts | Significant: error affecting quality
LOW | -3 pts | Minor: cosmetic inaccuracy

Scoring system:

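PENALTIES = {"HIGH": 20, "MEDIUM": 10, "LOW": 3}

def judge(detected):
    """Return (score, verdict) for a list of hallucinations with a .severity field."""
    score = 100  # Starting point
    severities = [hallucination.severity for hallucination in detected]
    for severity in severities:
        score -= PENALTIES[severity]

    # Verdict:
    # FAIL: any HIGH or score < 50
    # WARN: 3 or more MEDIUM, or score < 70
    # PASS: score >= 70, no HIGH, fewer than 3 MEDIUM
    if "HIGH" in severities or score < 50:
        return score, "FAIL"
    if severities.count("MEDIUM") >= 3 or score < 70:
        return score, "WARN"
    return score, "PASS"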

Layer 2: Bielik Validation

Dedicated validation for the local Bielik model (Polish LLM):

  • Polish stem matching (cebula, cebule, cebuli)
  • Removal of untracked ingredients (not present in input)
  • Unit normalization (szklanka→ml, łyżka→g)
  • JSON response structure validation
  • 97 lines of validation code
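
The first two checks can be sketched as follows. This is a deliberately naive suffix stripper, not the project's actual validation code: the suffix list and both function names are assumptions, and a real implementation would likely need a proper Polish lemmatizer.

```python
# Hypothetical suffix list, longest endings first.
SUFFIXES = ("ami", "ach", "ów", "om", "ie", "ą", "ę", "a", "e", "i", "y", "u", "o")


def stem(word: str) -> str:
    """Strip one common inflectional ending, keeping at least 3 characters."""
    w = word.lower().strip()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def filter_untracked(generated: list, allowed: list) -> list:
    """Keep only generated ingredients whose stem matches an input ingredient."""
    allowed_stems = {stem(a) for a in allowed}
    return [g for g in generated if stem(g) in allowed_stems]


print(stem("cebula"), stem("cebule"), stem("cebuli"))
print(filter_untracked(["cebuli", "awokado"], ["cebula", "mąka"]))
```

All three inflected forms of "cebula" reduce to the same stem, so an ingredient written in a different grammatical case still counts as present in the input.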

Layer 3: Polish Guard

A system protecting the authenticity of Polish cuisine:

  • Database of ~48 Polish dishes with canonical ingredients
  • List of ~70 forbidden foreign dishes (sushi, pizza, ramen, tacos...)
  • "Grandma's test": Could a grandma from the 1980s cook this?
  • Auto-correction: foreign dish → Kotlet schabowy (breaded pork cutlet)

Ingredient mapping:

Forbidden | Substitute
Tofu | Twaróg (quark cheese)
Wasabi | Chrzan (horseradish)
Soy sauce | Maggi
Quinoa | Kasza gryczana (buckwheat groats)
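
The mapping above translates naturally into a lookup table. In this sketch the four substitutions come from the table, while the function name and the returned change log are assumptions added for illustration.

```python
# Substitutions taken from the ingredient mapping table.
SUBSTITUTES = {
    "tofu": "twaróg",
    "wasabi": "chrzan",
    "soy sauce": "Maggi",
    "quinoa": "kasza gryczana",
}


def apply_polish_guard(ingredients: list) -> tuple:
    """Replace forbidden ingredients and report which ones were changed."""
    corrected, changes = [], []
    for item in ingredients:
        replacement = SUBSTITUTES.get(item.strip().lower())
        if replacement is None:
            corrected.append(item)
        else:
            corrected.append(replacement)
            changes.append((item, replacement))
    return corrected, changes


corrected, changes = apply_polish_guard(["Tofu", "cebula", "Soy sauce"])
print(corrected)
print(changes)
```

Returning the change log alongside the corrected list lets the caller tell the user what was swapped instead of changing the recipe silently.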

Layer 4: FORBIDDEN Sections in Prompts

Direct instructions in system prompts prohibiting hallucinations:

Extraction:

  • DO NOT add ingredients "that should be there"
  • DO NOT guess quantities if not specified
  • DO NOT correct "errors" in the original text

Generation:

  • DO NOT fabricate cooking times
  • DO NOT add serving steps
  • DO NOT suggest substitutes

Historical:

  • DO NOT add ingredients unavailable in the given era
  • DO NOT modernize cooking techniques
  • DO NOT fabricate historical context

Layer 5: Allergen Validation

13 allergen categories monitored in every recipe:

gluten, milk, eggs, fish, crustaceans, peanuts,
tree nuts, soy, celery, mustard, sesame, lupin, mollusks

The system automatically suggests AI-powered ingredient substitutes for each allergen.
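
A sketch of how the allergen check might work. The 13 categories come from the list above; the keyword map from Polish ingredient fragments to categories is a small hypothetical sample, and the substring matching is an assumption, not the project's actual method.

```python
ALLERGENS = (
    "gluten", "milk", "eggs", "fish", "crustaceans", "peanuts", "tree nuts",
    "soy", "celery", "mustard", "sesame", "lupin", "mollusks",
)

# Hypothetical sample: ingredient fragment -> allergen category.
KEYWORDS = {
    "mąka pszenn": "gluten",
    "mlek": "milk",
    "jaj": "eggs",
    "seler": "celery",
    "musztard": "mustard",
}


def detect_allergens(ingredients: list) -> list:
    """Return the sorted allergen categories triggered by the ingredient list."""
    found = set()
    for item in ingredients:
        name = item.lower()
        for fragment, allergen in KEYWORDS.items():
            if fragment in name:
                found.add(allergen)
    return sorted(found)


print(detect_allergens(["Mąka pszenna 200 g", "2 jajka", "cebula"]))
```

Matching on fragments rather than full words tolerates Polish inflection ("jajko", "jajka", "jaj"), at the cost of possible false positives.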

Layer 6: Local Fallback Database

200+ ready-made Polish recipes stored locally on the device. When AI fails (timeout, error, Judge score too low), the system automatically falls back to local recipes. The user always gets a result — zero empty screens.
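
The fallback rule can be sketched as a wrapper around the AI call. The threshold of 70 reuses the PASS boundary from the scoring system and is an assumption, as are the function signature and the choice to return the first local recipe; the page does not say how the local recipe is selected.

```python
from typing import Callable, List, Tuple

# Assumption: "Judge score too low" means below the PASS boundary.
PASS_THRESHOLD = 70


def recipe_with_fallback(generate: Callable[[str], str],
                         judge: Callable[[str], int],
                         local_recipes: List[str],
                         query: str) -> Tuple[str, str]:
    """Return (recipe, source); source is "ai" or "local"."""
    try:
        recipe = generate(query)
        if judge(recipe) >= PASS_THRESHOLD:
            return recipe, "ai"
    except Exception:
        # Timeouts and other errors fall through to the local database.
        pass
    return local_recipes[0], "local"


def timing_out(query: str) -> str:
    raise TimeoutError("model did not answer in time")


print(recipe_with_fallback(timing_out, lambda r: 100, ["bigos"], "kapusta"))
print(recipe_with_fallback(lambda q: "pierogi", lambda r: 90, ["bigos"], "mąka"))
```

Because every failure path ends in the same return statement, the caller never has to handle an empty result.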


AI Models

Model | Role | Context
GPT-4o | Extraction, generation, judge, vision | Primary cloud model
GPT-4o-mini | Preprocessing, fast extraction | Cheap and fast
Claude Sonnet | Extraction and generation | Anthropic alternative
Claude Haiku | Fast extraction | Lightweight alternative
Bielik 11B | Polish model via Ollama | Local inference

Evaluation Research

The framework was designed for systematic comparison of models on Polish culinary data. Each test case contains:

  • Original recipe text (various formats: OCR, handwritten, print)
  • Expected extraction result (ground truth JSON)
  • Expected generation result (reference recipe)
  • Metadata: era, region, difficulty level

Metrics: extraction_accuracy, generation_fidelity, hallucination_rate, judge_score, latency_ms
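
A sketch of how a test case and one metric might be represented. The field names mirror the bullet list above but are assumptions about the actual schema. The metric definition (share of expected top-level fields reproduced exactly) is one possible reading of extraction_accuracy, not the framework's confirmed formula.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    """One evaluation case; field names are hypothetical."""
    case_id: str
    source_text: str
    expected_extraction: dict
    expected_generation: str
    era: str
    region: str
    difficulty: str


def extraction_accuracy(expected: dict, actual: dict) -> float:
    """Fraction of expected top-level fields that the model reproduced exactly."""
    if not expected:
        return 1.0
    hits = sum(1 for key, value in expected.items() if actual.get(key) == value)
    return hits / len(expected)


case = EvalCase(
    case_id="trad-001",
    source_text="Bigos: kapusta kiszona, kiełbasa",
    expected_extraction={"title": "Bigos", "ingredients": ["kapusta kiszona", "kiełbasa"]},
    expected_generation="(reference recipe text)",
    era="modern", region="mazowieckie", difficulty="easy",
)
model_output = {"title": "Bigos", "ingredients": ["kapusta kiszona"]}
print(extraction_accuracy(case.expected_extraction, model_output))
```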


Main Features

Ingredient Scanning

Fridge photo → GPT-4o Vision recognizes ingredients → recipe generation from available products. Full pipeline: CameraScreen → VisionService → RecipeAIService.

HeritageWizard

A 5-step OCR wizard for handwritten family recipes. The user photographs grandma's handwritten recipe; the system recognizes the text, extracts ingredients and steps, and generates a structured recipe.

Recreate the Taste

The user describes a culinary memory ("that soup at grandma's with those little dumplings") → AI identifies the dish (zupa z lanymi kluskami, soup with poured dumplings) → full recipe with regional context.

Cooking Mode

Interactive step-by-step mode with TTS (voice instruction reading), timers, Chef's Eye (step photo verification — does it look correct?) and "Ask AI" per step (explanations, substitutes).

DishSwiper

Tinder-style dish discovery — swipe right = like, swipe left = next. The system learns preferences and suggests increasingly relevant dishes.

Insta-Plate AI

Food aesthetics rating 1–10 with composition, color, and presentation analysis. Improvement suggestions ("add green herbs for contrast").

Regions

16 Polish voivodeships with regional specialties. Each region has canonical dishes, local ingredients, and culinary traditions.

Historical Recipes

6 historical periods (Middle Ages → modern day) with ingredient restrictions for each era. The system refuses to add tomatoes to a medieval recipe because they were unknown in Europe at the time.
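
A sketch of an era restriction check, using the tomato example. It relies on one coarse rule: New World crops could not appear in European kitchens before 1492. The ingredient set, the cutoff constant, and the era_end_year parameter are assumptions for illustration; the app's real per-era rules are not described on this page.

```python
# Crops that reached Europe only after the voyages to the Americas.
NEW_WORLD_CUTOFF = 1492
NEW_WORLD_INGREDIENTS = {"pomidor", "ziemniak", "kukurydza", "papryka"}


def anachronisms(ingredients: list, era_end_year: int) -> list:
    """Return ingredients that could not have been available in the given era."""
    if era_end_year > NEW_WORLD_CUTOFF:
        return []
    return [i for i in ingredients if i.lower() in NEW_WORLD_INGREDIENTS]


print(anachronisms(["Pomidor", "kapusta"], era_end_year=1400))
print(anachronisms(["pomidor"], era_end_year=1900))
```

The cutoff is only a lower bound: an ingredient can pass this check and still be anachronistic if it reached Poland much later than it reached Europe.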

Gamification System

XP for cooking, 5 levels (Beginner Cook → Master Chef), eco badges (seasonal ingredients, zero waste), daily culinary challenges.


© 2026 Jakub Prejzner
System v2.0 | 2026