LLM Evaluation System and Culinary App

A two-component project combining an LLM Evaluation Framework (Python, 19 files, 50 test cases) with a mobile app ZjemTo (React Native, 29 screens).

Python, React Native, GPT-4o, Claude, Bielik, Ollama

v1.01 — Bielik Quantization for iPhone
Project Description
System Architecture
LLM Evaluation Framework
ZjemTo App
Anti-Hallucination System
AI Models
API Endpoints
Evaluation Research
Main Features

v1.01 — Bielik Quantization for iPhone

Status: in research and planning phase

Problem: Cloud hosting at scale is too expensive for an indie project, and self-hosting leads to request queues as the number of users grows. Decision: the model must run locally on iOS devices with ≥ 8 GB RAM.

Quantization: IQ2_XXS vs IQ3_XXS

Variant	Size	Advantages	Risks
`IQ2_XXS`	~3.1 GB	More memory headroom on 8 GB iPhone	At 2 bits the model may lose reasoning capability
`IQ3_XXS`	~3.8 GB	Significantly better quality (big jump between 2 and 3 bits)	Tight — KV cache + iOS overhead may cause OOM
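
The trade-off in the table is mostly arithmetic: weights, KV cache, and runtime overhead together must stay under whatever memory iOS grants the app. The sketch below is a rough estimate only. The layer count, KV head count, head dimension, overhead, and per-app budget are illustrative assumptions, not measured values for Bielik or iOS; only the two model sizes come from the table.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class KVConfig:
    # Hypothetical placeholders; replace with the real model configuration.
    n_layers: int = 50
    n_kv_heads: int = 8
    head_dim: int = 128
    bytes_per_value: int = 2  # fp16 cache entries


def kv_cache_bytes(cfg: KVConfig, context_tokens: int) -> int:
    """Keys and values: two vectors per layer, per KV head, per token."""
    per_token = 2 * cfg.n_layers * cfg.n_kv_heads * cfg.head_dim * cfg.bytes_per_value
    return per_token * context_tokens


def fits(model_gb: float, context_tokens: int, budget_gb: float,
         overhead_gb: float = 0.5, cfg: KVConfig = KVConfig()) -> bool:
    """True if weights + KV cache + assumed overhead stay within the budget."""
    total = model_gb + kv_cache_bytes(cfg, context_tokens) / GIB + overhead_gb
    return total <= budget_gb


# Assumed 5 GB per-app budget and a 4096-token context (both hypothetical).
for name, size_gb in (("IQ2_XXS", 3.1), ("IQ3_XXS", 3.8)):
    print(name, "fits" if fits(size_gb, 4096, budget_gb=5.0) else "does not fit")
```

Under these assumed numbers the 2-bit variant fits and the 3-bit variant does not, which is the "tight" case the table describes; real figures would have to be measured on a device.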

Custom IMatrix for Polish Cuisine

Calibration texts: Compendium Ferculorum, regional recipes, and synthetic data. The importance matrix (IMatrix) built from them is intended to keep the weights that matter most for Polish culinary terms at higher precision during quantization.

Hybrid Architecture

Aspect	Online (hybrid)	Offline (full local)
Cost	Lower than hosting full Bielik	Zero
Quality	Higher (cloud preprocessing)	Depends on quantization
Availability	Requires internet	Always

Project Description

A two-component project combining:

LLM Evaluation Framework — A system for evaluating language models on culinary recipe extraction and generation
ZjemTo — A React Native mobile app for discovering Polish and international recipes powered by AI

The project contains ~150 source files, 50 test cases, and a 6-layer system for detecting and preventing AI hallucinations.

System Architecture

├── llm-eval-framework/          # Python evaluation system
│   ├── config/config.yaml       # Main configuration
│   ├── src/                     # Evaluation modules (19 Python files)
│   │   ├── models/              # AI model runners
│   │   ├── judge/               # Hallucination detection
│   │   ├── eval/                # Evaluation pipeline
│   │   ├── report/              # Report generation
│   │   └── utils/               # Cache, logging
│   ├── data/eval_cases/         # 50 test cases
│   ├── results/                 # Evaluation results
│   └── tests/                   # Unit tests
│
├── ZjemTo/                      # React Native app
│   ├── src/
│   │   ├── screens/             # 29 app screens
│   │   ├── services/            # AI integration layer
│   │   ├── components/          # 50+ UI components
│   │   ├── data/                # Recipes, historical data
│   │   └── config/              # API configuration
│   ├── server/                  # Express backend (AI proxy)
│   └── package.json
│
├── semantic_cache.py            # Redis cache tool
└── README.md

LLM Evaluation Framework

Model Runners

File	Description
`base.py`	Abstract `ModelRunner` class — extract/generate interface
`openai_runner.py`	GPT-4o, GPT-4, GPT-4o-mini support
`claude_runner.py`	Claude Sonnet and Haiku support
`bielik_runner.py`	Local Bielik model via Ollama with validation
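
A minimal sketch of what an abstract runner with an extract/generate interface could look like. Only the class name `ModelRunner` and the two method names come from the table; the signatures, the constructor, and the `EchoRunner` stand-in are assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any


class ModelRunner(ABC):
    """Common interface: every backend exposes extract() and generate()."""

    def __init__(self, model_name: str) -> None:
        self.model_name = model_name

    @abstractmethod
    def extract(self, raw_text: str) -> dict[str, Any]:
        """Turn raw (for example OCR) recipe text into structured data."""

    @abstractmethod
    def generate(self, recipe_json: dict[str, Any]) -> str:
        """Turn structured recipe data into recipe text."""


class EchoRunner(ModelRunner):
    """Hypothetical offline stand-in used only to exercise the interface."""

    def extract(self, raw_text: str) -> dict[str, Any]:
        lines = [ln.strip() for ln in raw_text.splitlines() if ln.strip()]
        return {"title": lines[0] if lines else "", "ingredients": lines[1:]}

    def generate(self, recipe_json: dict[str, Any]) -> str:
        return recipe_json["title"] + ": " + ", ".join(recipe_json["ingredients"])


runner = EchoRunner("echo")
data = runner.extract("Pierogi\nmąka\nwoda")
print(runner.generate(data))
```

With a shared base class, the evaluation pipeline can loop over runners without knowing which provider sits behind each one.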

Evaluation Pipeline

File	Description
`eval_runner.py`	Main orchestrator (258 lines)
`extraction.py`	OCR→JSON extraction with validation
`generation.py`	JSON→recipe generation in Polish

Hallucination Detection

File	Description
`judge_runner.py`	GPT-4o evaluation (225 lines)
`prompts.py`	Three-phase judge prompts

Reports

File	Description
`report_generator.py`	Markdown + JSON reports with comparison tables
`visualizer.py`	Matplotlib charts: accuracy heatmaps, error distributions

Test Data

Dataset	Count	Description
Modern	20	Contemporary recipes
Traditional	18	Classic Polish dishes
Historical	12	Archaic language, adversarial

ZjemTo App

A React Native (Expo) mobile app with 29 screens, 50+ UI components, and integration with multiple AI models.

Architecture

Phone (Expo Go) → Express Server (:3001) → Ollama (:11434) for Bielik
                                              → OpenAI API for World mode
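
The last hop in the diagram, from the proxy to Ollama, can be illustrated in Python (the app's actual proxy is an Express server, so this is only an equivalent sketch). It uses Ollama's `/api/generate` endpoint with streaming disabled. The model tag `bielik` is a placeholder, and the helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # port taken from the diagram


def build_ollama_request(prompt: str, model: str = "bielik") -> urllib.request.Request:
    # "bielik" is a placeholder tag; use the tag the model was registered under.
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, timeout_s: float = 60.0) -> str:
    """Send the request; requires a running Ollama instance."""
    with urllib.request.urlopen(build_ollama_request(prompt), timeout=timeout_s) as resp:
        return json.loads(resp.read())["response"]
```

Keeping request construction separate from sending makes the payload easy to inspect and test without a model running.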

Screens

Screen	Description
`HomeScreenPremium`	Main hub with Bielik/World mode toggle
`TraditionalRecipesScreen`	Polish recipes from ingredients
`RecreateTasteScreen`	Dish description → AI identifies it
`RegionsScreen`	16 Polish voivodeships
`HistoricalRecipesScreen`	6 historical eras
`CookingModeScreen`	Step-by-step with timers
`CameraScreen`	Fridge photo → GPT-4o Vision
`PlateScoreScreen`	Nutritional value analysis
`ReceiptScannerScreen`	Receipt OCR scanning
`SwipeDiscoveryScreen`	Tinder-style recipe discovery

AI Services

Service	Responsibility
`RecipeAIService`	Recipe generation with validation and guardrails
`BielikService`	Communication with local Bielik via Express proxy
`VisionService`	GPT-4o Vision image analysis (ingredients, plate, steps)
`CacheService`	Semantic cache with Redis for repeated queries
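
The idea behind a semantic cache is to return a stored answer when a new query is close enough in meaning to an earlier one, rather than requiring an exact string match. The sketch below is an in-memory illustration, not the Redis-backed service itself: the bag-of-words `embed` function stands in for a real embedding model, and the 0.8 threshold is an assumption.

```python
import math
from collections import Counter
from typing import Optional


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real system would use an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold  # assumed similarity cut-off
        self._entries: list[tuple[Counter, str]] = []

    def put(self, query: str, answer: str) -> None:
        self._entries.append((embed(query), answer))

    def get(self, query: str) -> Optional[str]:
        """Return the answer of the most similar stored query, if close enough."""
        q = embed(query)
        best = max(self._entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None


cache = SemanticCache()
cache.put("zupa pomidorowa z ryżem", "cached recipe")
print(cache.get("pomidorowa zupa z ryżem"))  # same words, different order
```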

Express Endpoints

Endpoint	Method	Description
`/api/bielik/generate`	`POST`	Recipe generation via Bielik
`/api/bielik/extract`	`POST`	Ingredient extraction from text
`/api/vision/analyze`	`POST`	GPT-4o Vision image analysis
`/api/recipe/validate`	`POST`	Recipe validation (Polish Guard)
`/api/cache/lookup`	`GET`	Semantic cache lookup
`/api/health`	`GET`	Health check + model status

Anti-Hallucination System

A 6-layer system for preventing AI hallucinations, designed specifically for Polish cuisine.

Layer 1: Judge System

The main evaluation system based on GPT-4o. It analyzes generated recipes for 8 types of hallucinations:

Hallucination Type	Description	Example
`ingredient`	Adding an ingredient not in the list	Pierogi recipe with avocado
`quantity`	Unrealistic quantities	500g of salt for one serving
`step`	Fabricated cooking step	"Ferment for 3 weeks"
`time`	Incorrect cooking time	"Cook pasta for 2 hours"
`temperature`	Unrealistic temperature	"Bake at 50°C for 5 minutes"
`tool`	Non-existent tool	"Use a molecular dehydrator"
`context`	Wrong cultural context	"Traditional Polish ramen"
`historical_anachronism`	Historical anachronism	"Tomatoes in medieval Poland"

Severity classification:

Severity	Penalty	Description
`HIGH`	-20 pts	Critical — recipe is dangerous or absurd
`MEDIUM`	-10 pts	Significant — error affecting quality
`LOW`	-3 pts	Minor — cosmetic inaccuracy

Scoring system:

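PENALTIES = {"HIGH": 20, "MEDIUM": 10, "LOW": 3}

def judge_verdict(detected):
    """Score a recipe from its detected hallucinations (objects with a .severity)."""
    severities = [h.severity for h in detected]
    score = 100 - sum(PENALTIES[s] for s in severities)  # start at 100, subtract penalties

    # Rules are checked in order, so the first match wins:
    # FAIL  if any HIGH or score < 50
    # WARN  if ≥3 MEDIUM or score < 70
    # PASS  otherwise (score ≥ 70 and no HIGH)
    if "HIGH" in severities or score < 50:
        return score, "FAIL"
    if severities.count("MEDIUM") >= 3 or score < 70:
        return score, "WARN"
    return score, "PASS"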

Layer 2: Bielik Validation

Dedicated validation for the local Bielik model (Polish LLM):

Polish stem matching (cebula→cebule→cebuli)
Removal of untracked ingredients (not present in input)
Unit normalization (szklanka→ml, łyżka→g)
JSON response structure validation
97 lines of validation code
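
The first two checks in the list, stem matching and removal of untracked ingredients, could be sketched as follows. The suffix list is a deliberately crude assumption that handles only single-word ingredients and a handful of endings; it is not the project's validation code, and a production version would likely need a proper Polish stemmer or lemmatizer.

```python
# Common Polish inflection endings, longer ones first (hypothetical, incomplete).
SUFFIXES = ("ami", "ach", "om", "ów", "ą", "ę", "a", "e", "i", "y", "u", "o")


def stem(word: str) -> str:
    """Strip at most one ending, keeping at least three characters."""
    w = word.lower().strip()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def drop_untracked(generated: list[str], allowed: list[str]) -> list[str]:
    """Keep only generated ingredients whose stem matches an input ingredient."""
    allowed_stems = {stem(a) for a in allowed}
    return [g for g in generated if stem(g) in allowed_stems]


print(stem("cebula"), stem("cebule"), stem("cebuli"))
print(drop_untracked(["cebuli", "awokado", "mąki"], allowed=["cebula", "mąka"]))
```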

Layer 3: Polish Guard

A system protecting the authenticity of Polish cuisine:

Database of ~48 Polish dishes with canonical ingredients
List of ~70 forbidden foreign dishes (sushi, pizza, ramen, tacos...)
"Grandma's test": Could a grandma from the 1980s cook this?
Auto-correction: foreign dish → Kotlet schabowy (breaded pork cutlet)

Ingredient mapping:

Forbidden	Substitute
Tofu	Twaróg (quark cheese)
Wasabi	Chrzan (horseradish)
Soy sauce	Maggi
Quinoa	Kasza gryczana (buckwheat groats)
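
A small sketch of how the dish check and the ingredient mapping might fit together. The substitutes and the four forbidden dishes are taken from the text above; the function names and the simple word-based matching are assumptions.

```python
SUBSTITUTES = {  # from the mapping table
    "tofu": "twaróg",
    "wasabi": "chrzan",
    "soy sauce": "Maggi",
    "quinoa": "kasza gryczana",
}
FORBIDDEN_DISHES = {"sushi", "pizza", "ramen", "tacos"}  # the four examples named above
FALLBACK_DISH = "Kotlet schabowy"


def guard_dish(name: str) -> str:
    """Replace a forbidden foreign dish with the fallback dish."""
    words = set(name.lower().split())
    return FALLBACK_DISH if words & FORBIDDEN_DISHES else name


def guard_ingredients(ingredients: list[str]) -> list[str]:
    """Swap forbidden ingredients for their Polish substitutes."""
    return [SUBSTITUTES.get(item.lower().strip(), item) for item in ingredients]


print(guard_dish("Traditional Polish ramen"))
print(guard_ingredients(["Tofu", "cebula", "Soy sauce"]))
```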

Layer 4: FORBIDDEN Sections in Prompts

Direct instructions in system prompts prohibiting hallucinations:

Extraction:

DO NOT add ingredients "that should be there"
DO NOT guess quantities if not specified
DO NOT correct "errors" in the original text

Generation:

DO NOT fabricate cooking times
DO NOT add serving steps
DO NOT suggest substitutes

Historical:

DO NOT add ingredients unavailable in the given era
DO NOT modernize cooking techniques
DO NOT fabricate historical context

Layer 5: Allergen Validation

13 allergen categories monitored in every recipe:

gluten, milk, eggs, fish, crustaceans, peanuts,
tree nuts, soy, celery, mustard, sesame, lupin, mollusks

The system automatically suggests AI-powered ingredient substitutes for each allergen.
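
Allergen monitoring can be approximated with keyword matching over the ingredient list. In the sketch below the 13 category names come from the list above, while the keyword table is a tiny hypothetical sample covering only a few categories; a real table would need to be far more complete.

```python
ALLERGENS = (
    "gluten", "milk", "eggs", "fish", "crustaceans", "peanuts", "tree nuts",
    "soy", "celery", "mustard", "sesame", "lupin", "mollusks",
)

# Hypothetical, deliberately incomplete keyword lists (Polish ingredient names).
KEYWORDS = {
    "gluten": ("mąka pszenna", "bułka tarta", "makaron"),
    "milk": ("mleko", "masło", "śmietana", "twaróg"),
    "eggs": ("jajko", "jajka"),
    "celery": ("seler",),
    "mustard": ("musztarda", "gorczyca"),
}


def detect_allergens(ingredients: list[str]) -> list[str]:
    """Return matched allergen categories in the canonical order."""
    found = set()
    for item in ingredients:
        text = item.lower()
        for allergen, words in KEYWORDS.items():
            if any(word in text for word in words):
                found.add(allergen)
    return [a for a in ALLERGENS if a in found]


print(detect_allergens(["2 jajka", "Bułka tarta", "schab", "Masło klarowane"]))
```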

Layer 6: Local Fallback Database

200+ ready-made Polish recipes stored locally on the device. When AI fails (timeout, error, Judge score too low), the system automatically falls back to local recipes. The user always gets a result — zero empty screens.
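
The fallback behaviour described above can be sketched as a wrapper around the AI call. The function name, the exception types treated as failures, and the reuse of 70 (the PASS threshold from Layer 1) as the minimum Judge score are assumptions.

```python
import random
from typing import Callable, Optional

MIN_JUDGE_SCORE = 70  # assumption: reuse the PASS threshold from Layer 1


def recipe_with_fallback(
    generate: Callable[[], tuple[str, int]],
    local_recipes: list[str],
    rng: Optional[random.Random] = None,
) -> tuple[str, str]:
    """Return (recipe, source), where source is "ai" or "local"."""
    try:
        recipe, judge_score = generate()
        if judge_score >= MIN_JUDGE_SCORE:
            return recipe, "ai"
    except (TimeoutError, ConnectionError, ValueError):
        pass  # any AI failure falls through to the local database
    return (rng or random).choice(local_recipes), "local"


def flaky_model() -> tuple[str, int]:
    raise TimeoutError("model did not answer in time")


print(recipe_with_fallback(flaky_model, ["Żurek"]))
```

Because the local list is consulted whenever the AI path fails or scores too low, the caller always receives a recipe as long as the list is non-empty.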

AI Models

Model	Role	Context
GPT-4o	Extraction, generation, judge, vision	Primary cloud model
GPT-4o-mini	Preprocessing, fast extraction	Cheap and fast
Claude Sonnet	Extraction and generation	Anthropic alternative
Claude Haiku	Fast extraction	Lightweight alternative
Bielik 11B	Polish model via Ollama	Local inference

Evaluation Research

The framework was designed for systematic comparison of models on Polish culinary data. Each test case contains:

Original recipe text (various formats: OCR, handwritten, print)
Expected extraction result (ground truth JSON)
Expected generation result (reference recipe)
Metadata: era, region, difficulty level

Metrics: extraction_accuracy, generation_fidelity, hallucination_rate, judge_score, latency_ms
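
To make the test-case structure and two of the metrics concrete, here is one possible shape. The field names and both metric definitions are assumptions: the source names the metrics but does not define them, so F1 over ingredient names is only one plausible reading of extraction_accuracy.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    source_text: str            # original recipe text (OCR, handwritten, print)
    expected_extraction: dict   # ground-truth JSON
    expected_generation: str    # reference recipe
    era: str
    region: str
    difficulty: str


def ingredient_f1(predicted: set[str], expected: set[str]) -> float:
    """Assumed extraction metric: F1 over ingredient names."""
    if not predicted and not expected:
        return 1.0
    true_positives = len(predicted & expected)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)


def hallucination_rate(counts_per_output: list[int]) -> float:
    """Assumed definition: share of outputs with at least one hallucination."""
    if not counts_per_output:
        return 0.0
    return sum(1 for c in counts_per_output if c > 0) / len(counts_per_output)


print(ingredient_f1({"mąka", "jajka", "sól"}, {"mąka", "jajka", "woda"}))
print(hallucination_rate([0, 2, 0, 1]))
```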

Main Features

Ingredient Scanning

Fridge photo → GPT-4o Vision recognizes ingredients → recipe generation from available products. Full pipeline: CameraScreen → VisionService → RecipeAIService.

HeritageWizard

A 5-step OCR wizard for handwritten family recipes. The user photographs grandma's handwritten recipe; the system recognizes the text, extracts ingredients and steps, and generates a structured recipe.

Recreate the Taste

The user describes a culinary memory ("that soup at grandma's with those little dumplings") → AI identifies the dish (zupa z lanymi kluskami, soup with poured dumplings) → full recipe with regional context.

Cooking Mode

Interactive step-by-step mode with text-to-speech (instructions read aloud), timers, Chef's Eye (photo verification of a step: does it look correct?), and "Ask AI" per step (explanations, substitutes).

DishSwiper

Tinder-style dish discovery — swipe right = like, swipe left = next. The system learns preferences and suggests increasingly relevant dishes.

Insta-Plate AI

Rates food aesthetics on a 1–10 scale based on composition, color, and presentation, and offers improvement suggestions ("add green herbs for contrast").

Regions

16 Polish voivodeships with regional specialties. Each region has canonical dishes, local ingredients, and culinary traditions.

Historical Recipes

6 historical periods (Middle Ages → modern day) with ingredient restrictions for each era. The system refuses to add tomatoes to a medieval recipe because tomatoes were unknown in Europe at the time.
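
Era restrictions can be modelled as a per-era set of banned ingredients. The sketch below is hypothetical: it shows only two of the six eras, the era keys are invented, and the banned set is limited to three New World crops that were unknown in Europe before 1492.

```python
# Hypothetical mini-table; the app's real eras and restrictions are not shown here.
ERA_BANNED = {
    "medieval": {"pomidor", "ziemniak", "kukurydza"},  # tomato, potato, maize
    "modern": set(),
}


def anachronisms(ingredients: list[str], era: str) -> list[str]:
    """Return the ingredients that should not appear in the given era."""
    banned = ERA_BANNED[era]
    return [i for i in ingredients if any(b in i.lower() for b in banned)]


recipe = ["kapusta", "Pomidory", "groch"]
print(anachronisms(recipe, "medieval"))
print(anachronisms(recipe, "modern"))
```

Substring matching lets inflected forms such as "pomidory" match the base entry, at the cost of possible false positives on longer words.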

Gamification System

XP for cooking, 5 levels (Beginner Cook → Master Chef), eco badges (seasonal ingredients, zero waste), daily culinary challenges.
