LLM Evaluation System and Culinary App
A two-component project combining an LLM Evaluation Framework (Python, 19 files, 50 test cases) with a mobile app ZjemTo (React Native, 29 screens).
Table of Contents
- v1.01 — Bielik Quantization for iPhone
- Project Description
- System Architecture
- LLM Evaluation Framework
- ZjemTo App
- Anti-Hallucination System
- AI Models
- API Endpoints
- Evaluation Research
- Main Features
v1.01 — Bielik Quantization for iPhone
Status: in research and planning phase
Problem: At scale, cloud hosting costs are prohibitive for an indie project, and self-hosting leads to request queues as the user base grows. Decision: the model must run locally on an iOS device with ≥ 8 GB RAM.
Quantization: IQ2_XXS vs IQ3_XXS
| Variant | Size | Advantages | Risks |
|---|---|---|---|
| IQ2_XXS | ~3.1 GB | More memory headroom on 8 GB iPhone | At 2 bits the model may lose reasoning capability |
| IQ3_XXS | ~3.8 GB | Significantly better quality (big jump between 2 and 3 bits) | Tight: KV cache + iOS overhead may cause OOM |
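The OOM risk in the table can be made concrete with a rough memory estimate. The sketch below uses the standard key/value cache size formula for a transformer; the layer count, head count, head dimension, context length, and the per-app memory budget are all placeholder assumptions, not Bielik's published configuration or an iOS guarantee.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_value: int = 2) -> int:
    """Size of the key/value cache for one sequence.

    Two tensors (K and V) are stored per layer, each holding
    n_kv_heads * head_dim values per token.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


def fits_in_budget(model_gb: float, kv_bytes: int, budget_gb: float) -> bool:
    """True if weights plus KV cache stay within the assumed memory budget."""
    return model_gb + kv_bytes / 1e9 <= budget_gb


# Placeholder architecture values (assumptions, not a published config).
kv = kv_cache_bytes(n_layers=50, n_kv_heads=8, head_dim=128, context_len=4096)

# Hypothetical 5 GB budget for a single app on an 8 GB device.
for name, size_gb in [("IQ2_XXS", 3.1), ("IQ3_XXS", 3.8)]:
    verdict = "fits" if fits_in_budget(size_gb, kv, budget_gb=5.0) else "does not fit"
    print(name, verdict)
```

Changing `context_len` or `budget_gb` shows how quickly the headroom of the larger variant disappears.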
Custom IMatrix for Polish Cuisine
Calibration on texts: Compendium Ferculorum, regional recipes, synthetic data. IMatrix ensures that weights critical to Polish culinary terms are quantized with higher precision.
Hybrid Architecture
| Aspect | Online (hybrid) | Offline (full local) |
|---|---|---|
| Cost | Lower than hosting full Bielik | Zero |
| Quality | Higher (cloud preprocessing) | Depends on quantization |
| Availability | Requires internet | Always |
Project Description
A two-component project combining:
- LLM Evaluation Framework — A system for evaluating language models on culinary recipe extraction and generation
- ZjemTo — A React Native mobile app for discovering Polish and international recipes powered by AI
The project contains ~150 source files, 50 test cases, and advanced mechanisms for detecting and preventing AI hallucinations.
System Architecture
    ├── llm-eval-framework/          # Python evaluation system
    │   ├── config/config.yaml       # Main configuration
    │   ├── src/                     # Evaluation modules (19 Python files)
    │   │   ├── models/              # AI model runners
    │   │   ├── judge/               # Hallucination detection
    │   │   ├── eval/                # Evaluation pipeline
    │   │   ├── report/              # Report generation
    │   │   └── utils/               # Cache, logging
    │   ├── data/eval_cases/         # 50 test cases
    │   ├── results/                 # Evaluation results
    │   └── tests/                   # Unit tests
    │
    ├── ZjemTo/                      # React Native app
    │   ├── src/
    │   │   ├── screens/             # 29 app screens
    │   │   ├── services/            # AI integration layer
    │   │   ├── components/          # 50+ UI components
    │   │   ├── data/                # Recipes, historical data
    │   │   └── config/              # API configuration
    │   ├── server/                  # Express backend (AI proxy)
    │   └── package.json
    │
    ├── semantic_cache.py            # Redis cache tool
    └── README.md
LLM Evaluation Framework
Model Runners
| File | Description |
|---|---|
| base.py | Abstract ModelRunner class: extract/generate interface |
| openai_runner.py | GPT-4o, GPT-4, GPT-4o-mini support |
| claude_runner.py | Claude Sonnet and Haiku support |
| bielik_runner.py | Local Bielik model via Ollama with validation |
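As a minimal sketch of what the extract/generate interface in base.py might look like, assuming an abstract base class with two methods. The method signatures and the `EchoRunner` stand-in are hypothetical and are not taken from the repository.

```python
from abc import ABC, abstractmethod


class ModelRunner(ABC):
    """Common interface: every runner can extract and generate."""

    def __init__(self, model_name: str) -> None:
        self.model_name = model_name

    @abstractmethod
    def extract(self, ocr_text: str) -> dict:
        """Turn raw recipe text into structured, JSON-like data."""

    @abstractmethod
    def generate(self, recipe_json: dict) -> str:
        """Turn structured data into a recipe written in Polish."""


class EchoRunner(ModelRunner):
    """Hypothetical stand-in that calls no model; handy for wiring tests."""

    def extract(self, ocr_text: str) -> dict:
        lines = [ln.strip() for ln in ocr_text.splitlines() if ln.strip()]
        return {"ingredients": lines}

    def generate(self, recipe_json: dict) -> str:
        return "Składniki: " + ", ".join(recipe_json.get("ingredients", []))


runner = EchoRunner("echo")
print(runner.generate(runner.extract("mąka\njajka")))
```

With this shape, the evaluation pipeline can treat OpenAI, Claude, and Bielik runners interchangeably, since each one only has to implement the same two methods.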
Evaluation Pipeline
| File | Description |
|---|---|
| eval_runner.py | Main orchestrator (258 lines) |
| extraction.py | OCR→JSON extraction with validation |
| generation.py | JSON→recipe generation in Polish |
Hallucination Detection
| File | Description |
|---|---|
| judge_runner.py | GPT-4o evaluation (225 lines) |
| prompts.py | Three-phase judge prompts |
Reports
| File | Description |
|---|---|
| report_generator.py | Markdown + JSON reports with comparison tables |
| visualizer.py | Matplotlib charts: accuracy heatmaps, error distributions |
Test Data
| Dataset | Count | Description |
|---|---|---|
| Modern | 20 | Contemporary recipes |
| Traditional | 18 | Classic Polish dishes |
| Historical | 12 | Archaic language, adversarial |
ZjemTo App
A React Native (Expo) mobile app with 29 screens, 50+ UI components, and integration with multiple AI models.
Architecture
Phone (Expo Go) → Express Server (:3001) → Ollama (:11434) for Bielik
                                         → OpenAI API for World mode
Screens
| Screen | Description |
|---|---|
| HomeScreenPremium | Main hub with Bielik/World mode toggle |
| TraditionalRecipesScreen | Polish recipes from ingredients |
| RecreateTasteScreen | Dish description → AI identifies it |
| RegionsScreen | 16 Polish voivodeships |
| HistoricalRecipesScreen | 6 historical eras |
| CookingModeScreen | Step-by-step with timers |
| CameraScreen | Fridge photo → GPT-4o Vision |
| PlateScoreScreen | Nutritional value analysis |
| ReceiptScannerScreen | Receipt OCR scanning |
| SwipeDiscoveryScreen | Tinder-style recipe discovery |
AI Services
| Service | Responsibility |
|---|---|
| RecipeAIService | Recipe generation with validation and guardrails |
| BielikService | Communication with local Bielik via Express proxy |
| VisionService | GPT-4o Vision image analysis (ingredients, plate, steps) |
| CacheService | Semantic cache with Redis for repeated queries |
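The idea behind a semantic cache is to reuse a stored answer when a new query is close enough to an earlier one. The sketch below is an in-memory stand-in, not the Redis-backed CacheService: the bag-of-words `embed` function, the `SemanticCache` class, and the 0.8 similarity threshold are all illustrative assumptions.

```python
import math
from collections import Counter
from typing import Optional


def embed(text: str) -> Counter:
    """Toy embedding: bag of lowercase words (a real system would use a model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """In-memory stand-in for a Redis-backed semantic cache."""

    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold  # assumed similarity cutoff
        self.entries = []           # list of (vector, response) pairs

    def store(self, query: str, response: str) -> None:
        self.entries.append((embed(query), response))

    def lookup(self, query: str) -> Optional[str]:
        """Return the best cached response above the threshold, else None."""
        vec = embed(query)
        best = max(self.entries, key=lambda e: cosine(vec, e[0]), default=None)
        if best is not None and cosine(vec, best[0]) >= self.threshold:
            return best[1]
        return None


cache = SemanticCache()
cache.store("przepis na pierogi ruskie", "cached pierogi recipe")
print(cache.lookup("pierogi ruskie przepis na"))  # same words, different order
print(cache.lookup("sernik"))
```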
Express Endpoints
| Endpoint | Method | Description |
|---|---|---|
| /api/bielik/generate | POST | Recipe generation via Bielik |
| /api/bielik/extract | POST | Ingredient extraction from text |
| /api/vision/analyze | POST | GPT-4o Vision image analysis |
| /api/recipe/validate | POST | Recipe validation (Polish Guard) |
| /api/cache/lookup | GET | Semantic cache lookup |
| /api/health | GET | Health check + model status |
Anti-Hallucination System
A 6-layer system for preventing AI hallucinations, designed specifically for Polish cuisine.
Layer 1: Judge System
The main evaluation system based on GPT-4o. It analyzes generated recipes for 8 types of hallucinations:
| Hallucination Type | Description | Example |
|---|---|---|
| ingredient | Adding an ingredient not in the list | Pierogi recipe with avocado |
| quantity | Unrealistic quantities | 500g of salt for one serving |
| step | Fabricated cooking step | "Ferment for 3 weeks" |
| time | Incorrect cooking time | "Cook pasta for 2 hours" |
| temperature | Unrealistic temperature | "Bake at 50°C for 5 minutes" |
| tool | Non-existent tool | "Use a molecular dehydrator" |
| context | Wrong cultural context | "Traditional Polish ramen" |
| historical_anachronism | Historical anachronism | "Tomatoes in medieval Poland" |
Severity classification:
| Severity | Penalty | Description |
|---|---|---|
| HIGH | -20 pts | Critical: recipe is dangerous or absurd |
| MEDIUM | -10 pts | Significant: error affecting quality |
| LOW | -3 pts | Minor: cosmetic inaccuracy |
Scoring system:
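    # Illustrative wrapper; the function name is hypothetical.
    PENALTY = {"HIGH": 20, "MEDIUM": 10, "LOW": 3}

    def score_recipe(detected):
        score = 100  # Starting point
        for hallucination in detected:
            score -= PENALTY.get(hallucination.severity, 0)
        severities = [h.severity for h in detected]
        # Verdict:
        if "HIGH" in severities or score < 50:
            return score, "FAIL"  # any HIGH or score < 50
        if severities.count("MEDIUM") >= 3 or score < 70:
            return score, "WARN"  # ≥3 MEDIUM or score < 70
        return score, "PASS"  # score ≥ 70 and no HIGH
Layer 2: Bielik Validation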
Dedicated validation for the local Bielik model (Polish LLM):
- Polish stem matching (cebula→cebule→cebuli)
- Removal of untracked ingredients (not present in input)
- Unit normalization (szklanka→ml, łyżka→g)
- JSON response structure validation
- 97 lines of validation code
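The validation steps listed above can be sketched roughly as follows. This is not the code from bielik_runner.py: the vowel-stripping stemmer is a deliberately naive assumption (it does not handle consonant alternations such as mąka→mące), and the 250 ml value for szklanka is an assumed conversion, since the source only states that szklanka is normalized to ml.

```python
VOWELS = "aąeęioóuy"

# Assumed conversion factor; the source only says szklanka maps to ml.
UNIT_TO_ML = {"szklanka": 250}


def naive_stem(word: str) -> str:
    """Strip trailing vowels so cebula, cebule and cebuli share one stem."""
    w = word.lower().strip()
    while len(w) > 3 and w[-1] in VOWELS:
        w = w[:-1]
    return w


def drop_untracked(output_ingredients, input_ingredients):
    """Keep only output ingredients whose stem also appears in the input."""
    allowed = {naive_stem(i) for i in input_ingredients}
    return [i for i in output_ingredients if naive_stem(i) in allowed]


def normalize_unit(amount, unit):
    """Convert known volume units to ml; leave unknown units untouched."""
    factor = UNIT_TO_ML.get(unit.lower())
    return (amount * factor, "ml") if factor else (amount, unit)


print(drop_untracked(["cebuli", "awokado"], ["cebula", "mąka"]))
print(normalize_unit(2, "szklanka"))
```

A production stemmer for Polish would need a proper morphological dictionary; the point here is only to show how inflected forms can be matched before untracked ingredients are removed.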
Layer 3: Polish Guard
A system protecting the authenticity of Polish cuisine:
- Database of ~48 Polish dishes with canonical ingredients
- List of ~70 forbidden foreign dishes (sushi, pizza, ramen, tacos...)
- "Grandma's test": Could a grandma from the 1980s cook this?
- Auto-correction: foreign dish → Kotlet schabowy (breaded pork cutlet)
Ingredient mapping:
| Forbidden | Substitute |
|---|---|
| Tofu | Twaróg (quark cheese) |
| Wasabi | Chrzan (horseradish) |
| Soy sauce | Maggi |
| Quinoa | Kasza gryczana (buckwheat groats) |
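A minimal sketch of how the auto-correction and the ingredient mapping above could be applied, assuming simple dictionary and set lookups. The function names are hypothetical, and `FORBIDDEN_DISHES` holds only the four examples named in this section rather than the full list of about 70.

```python
# Substitutes taken from the mapping table above.
SUBSTITUTES = {
    "tofu": "twaróg",
    "wasabi": "chrzan",
    "soy sauce": "Maggi",
    "quinoa": "kasza gryczana",
}

# Small excerpt; the full list is described as about 70 dishes.
FORBIDDEN_DISHES = {"sushi", "pizza", "ramen", "tacos"}
DEFAULT_DISH = "Kotlet schabowy"


def guard_dish(name: str) -> str:
    """Replace a forbidden foreign dish with the default Polish dish."""
    words = set(name.lower().split())
    return DEFAULT_DISH if words & FORBIDDEN_DISHES else name


def guard_ingredients(ingredients):
    """Swap forbidden ingredients for their Polish substitutes."""
    return [SUBSTITUTES.get(i.lower(), i) for i in ingredients]


print(guard_dish("Traditional Polish ramen"))
print(guard_ingredients(["Tofu", "cebula"]))
```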
Layer 4: FORBIDDEN Sections in Prompts
Direct instructions in system prompts prohibiting hallucinations:
Extraction:
- DO NOT add ingredients "that should be there"
- DO NOT guess quantities if not specified
- DO NOT correct "errors" in the original text
Generation:
- DO NOT fabricate cooking times
- DO NOT add serving steps
- DO NOT suggest substitutes
Historical:
- DO NOT add ingredients unavailable in the given era
- DO NOT modernize cooking techniques
- DO NOT fabricate historical context
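One way to keep these rules consistent across prompts is to store them as data and append them to a base system prompt. The sketch below assumes that approach; `build_system_prompt` and the `FORBIDDEN:` section label are hypothetical and may not match prompts.py.

```python
# Rules copied from the three lists above.
FORBIDDEN = {
    "extraction": [
        'DO NOT add ingredients "that should be there"',
        "DO NOT guess quantities if not specified",
        'DO NOT correct "errors" in the original text',
    ],
    "generation": [
        "DO NOT fabricate cooking times",
        "DO NOT add serving steps",
        "DO NOT suggest substitutes",
    ],
    "historical": [
        "DO NOT add ingredients unavailable in the given era",
        "DO NOT modernize cooking techniques",
        "DO NOT fabricate historical context",
    ],
}


def build_system_prompt(task: str, base: str) -> str:
    """Append the FORBIDDEN section for the given task to a base prompt."""
    rules = "\n".join(f"- {rule}" for rule in FORBIDDEN[task])
    return f"{base}\n\nFORBIDDEN:\n{rules}"


print(build_system_prompt("generation", "You write recipes in Polish."))
```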
Layer 5: Allergen Validation
13 allergen categories monitored in every recipe:
gluten, milk, eggs, fish, crustaceans, peanuts, tree nuts, soy, celery, mustard, sesame, lupin, mollusks
The system automatically suggests AI-powered ingredient substitutes for each allergen.
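A keyword-based check is one simple way to flag allergens in an ingredient list. The sketch below assumes that approach; the `KEYWORDS` map is a small illustrative subset of Polish ingredient fragments invented for this example, not the project's actual data.

```python
# The 13 monitored categories listed above.
ALLERGENS = [
    "gluten", "milk", "eggs", "fish", "crustaceans", "peanuts", "tree nuts",
    "soy", "celery", "mustard", "sesame", "lupin", "mollusks",
]

# Assumed keyword fragments (illustrative subset only).
KEYWORDS = {
    "gluten": ["mąka pszenna", "bułka tarta"],
    "milk": ["mleko", "śmietana", "twaróg"],
    "eggs": ["jajk", "jajo"],
    "celery": ["seler"],
    "mustard": ["musztard", "gorczyc"],
}


def detect_allergens(ingredients):
    """Return the sorted allergen categories matched by any ingredient."""
    found = set()
    for ingredient in ingredients:
        low = ingredient.lower()
        for allergen, fragments in KEYWORDS.items():
            if any(fragment in low for fragment in fragments):
                found.add(allergen)
    return sorted(found)


print(detect_allergens(["2 jajka", "Mleko 3,2%", "sól"]))
```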
Layer 6: Local Fallback Database
200+ ready-made Polish recipes stored locally on the device. When AI fails (timeout, error, Judge score too low), the system automatically falls back to local recipes. The user always gets a result — zero empty screens.
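The fallback behaviour can be sketched as a wrapper around the AI call. In the sketch below, `generate` and `judge` are hypothetical callables supplied by the caller, the three-item `LOCAL_RECIPES` list stands in for the local database, and the cutoff of 70 is an assumption chosen to match the PASS threshold of the Judge System.

```python
import random

LOCAL_RECIPES = ["Pierogi ruskie", "Żurek", "Bigos"]  # stand-in for the local database
MIN_JUDGE_SCORE = 70  # assumed cutoff for "Judge score too low"


def recipe_with_fallback(generate, judge, ingredients, rng=random):
    """Try the AI path first; fall back to a local recipe on any failure.

    Returns a (recipe, source) pair where source is "ai" or "local".
    """
    try:
        recipe = generate(ingredients)
        if judge(recipe) >= MIN_JUDGE_SCORE:
            return recipe, "ai"
    except (TimeoutError, RuntimeError):
        pass  # timeout or model error: use the local database instead
    return rng.choice(LOCAL_RECIPES), "local"


def failing_generate(_ingredients):
    raise TimeoutError("model did not answer in time")


print(recipe_with_fallback(failing_generate, lambda r: 100, ["cebula"]))
```

Because every failure path ends in the same local lookup, the caller never has to handle an empty result.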
AI Models
| Model | Role | Context |
|---|---|---|
| GPT-4o | Extraction, generation, judge, vision | Primary cloud model |
| GPT-4o-mini | Preprocessing, fast extraction | Cheap and fast |
| Claude Sonnet | Extraction and generation | Anthropic alternative |
| Claude Haiku | Fast extraction | Lightweight alternative |
| Bielik 11B | Polish model via Ollama | Local inference |
Evaluation Research
The framework was designed for systematic comparison of models on Polish culinary data. Each test case contains:
- Original recipe text (various formats: OCR, handwritten, print)
- Expected extraction result (ground truth JSON)
- Expected generation result (reference recipe)
- Metadata: era, region, difficulty level
Metrics: extraction_accuracy, generation_fidelity, hallucination_rate, judge_score, latency_ms
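The test case structure and metric aggregation described above might be modelled along these lines. The field names, the `CaseResult` type, and the definition of `hallucination_rate` as the share of cases with at least one detected hallucination are assumptions made for illustration; `generation_fidelity` is left out because the source does not say how it is computed.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class EvalCase:
    source_text: str           # original recipe text (OCR, handwritten, print)
    expected_extraction: dict  # ground truth JSON
    expected_generation: str   # reference recipe
    era: str
    region: str
    difficulty: str


@dataclass
class CaseResult:
    extraction_accuracy: float
    judge_score: float
    latency_ms: float
    hallucinations: int  # number of hallucinations the judge detected


def summarize(results):
    """Aggregate per-case results into run-level metrics."""
    return {
        "extraction_accuracy": mean(r.extraction_accuracy for r in results),
        "judge_score": mean(r.judge_score for r in results),
        "latency_ms": mean(r.latency_ms for r in results),
        # Assumed definition: fraction of cases with at least one hallucination.
        "hallucination_rate": sum(1 for r in results if r.hallucinations) / len(results),
    }


results = [CaseResult(1.0, 100, 200, 0), CaseResult(0.5, 80, 400, 2)]
print(summarize(results))
```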
Main Features
Ingredient Scanning
Fridge photo → GPT-4o Vision recognizes ingredients → recipe generation from available products. Full pipeline: CameraScreen → VisionService → RecipeAIService.
HeritageWizard
A 5-step OCR wizard for handwritten family recipes. The user photographs grandma's handwritten recipe; the system recognizes the text, extracts ingredients and steps, and generates a structured recipe.
Recreate the Taste
The user describes a culinary memory ("that soup at grandma's with those little dumplings") → AI identifies the dish (zupa z lanymi kluskami, soup with poured dumplings) → full recipe with regional context.
Cooking Mode
Interactive step-by-step mode with TTS (voice instruction reading), timers, Chef's Eye (step photo verification — does it look correct?) and "Ask AI" per step (explanations, substitutes).
DishSwiper
Tinder-style dish discovery — swipe right = like, swipe left = next. The system learns preferences and suggests increasingly relevant dishes.
Insta-Plate AI
Food aesthetics rating 1–10 with composition, color, and presentation analysis. Improvement suggestions ("add green herbs for contrast").
Regions
16 Polish voivodeships with regional specialties. Each region has canonical dishes, local ingredients, and culinary traditions.
Historical Recipes
6 historical periods (Middle Ages → modern day) with ingredient restrictions for each era. For example, the system refuses to add tomatoes to a medieval recipe because they were unknown in Europe at the time.
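An era restriction like the tomato example can be expressed as a per-era set of banned ingredients. The sketch below assumes that representation; the `middle_ages` key and the `check_era` function are hypothetical, and the set lists only a few New World crops, which reached Europe after 1492.

```python
# New World crops arrived in Europe only after 1492,
# so a medieval recipe cannot contain them.
ERA_FORBIDDEN = {
    "middle_ages": {"pomidor", "ziemniak", "kukurydza", "papryka"},
}


def check_era(ingredients, era):
    """Return the ingredients that are anachronistic for the given era."""
    banned = ERA_FORBIDDEN.get(era, set())
    return [i for i in ingredients if i.lower() in banned]


print(check_era(["Pomidor", "kapusta"], "middle_ages"))
```

An empty result means the recipe passes the era check; eras without an entry are treated as unrestricted in this sketch.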
Gamification System
XP for cooking, 5 levels (Beginner Cook → Master Chef), eco badges (seasonal ingredients, zero waste), daily culinary challenges.