Updated: 2026-03-07 02:45
# ROLLerUP QA Bot Testing Framework

## Overview
AI-graded testing system for ROLLerUP conversational bots.

## Structure
```
qa-testing/
├── README.md
├── qa_runner.py          # Main test runner
├── questions/
│   └── sales-bot-questions.json  # 240 test questions
└── results/              # Test results (auto-created)
```

## Question Bank

**Total: 240 questions per bot**

| Category | Count | Purpose |
|----------|-------|---------|
| Roll Shutters | 30 | Product-specific knowledge |
| Fire Shutters | 30 | Compliance & technical |
| Security Products | 30 | Commercial applications |
| Retractable Awnings | 30 | Seasonal product |
| Retractable Screens | 30 | Insect protection |
| Louvered Pergolas | 30 | Outdoor structures |
| General | 30 | Company, pricing, process |
| Unrelated | 30 | Guardrails, off-topic handling |

## Usage

### Run all tests
```bash
cd /root/.openclaw/workspace/qa-testing
source /opt/zoho-extract/venv/bin/activate
python qa_runner.py --bot sales-agent --category all
```

### Test specific category
```bash
python qa_runner.py --bot sales-agent --category roll_shutters
python qa_runner.py --bot sales-agent --category unrelated
```

### Test with limit (for quick checks)
```bash
python qa_runner.py --bot sales-agent --category all --limit 10
```

### Test single question
```bash
python qa_runner.py --bot sales-agent --id RS-001
```
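For scripted runs, the invocations above can be assembled programmatically. The sketch below is only an illustration: `build_command` is a hypothetical helper, not part of `qa_runner.py`, and it uses only the flags shown in this section (`--bot`, `--category`, `--limit`, `--id`). It assumes `--id` is used on its own, as in the single-question example.

```python
import shlex


def build_command(bot, category=None, limit=None, question_id=None):
    """Return the argument list for one qa_runner.py invocation.

    Hypothetical helper: it only mirrors the command lines documented above.
    """
    cmd = ["python", "qa_runner.py", "--bot", bot]
    if question_id is not None:
        # Single-question mode, as in: --bot sales-agent --id RS-001
        cmd += ["--id", question_id]
    else:
        # Category mode; pass "all" explicitly when no category is given.
        cmd += ["--category", category or "all"]
        if limit is not None:
            cmd += ["--limit", str(limit)]
    return cmd


if __name__ == "__main__":
    # Print the quick-check command from the section above.
    print(shlex.join(build_command("sales-agent", category="all", limit=10)))
```

The returned list can be passed to `subprocess.run` from the `qa-testing/` directory with the virtual environment active.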

## Grading Criteria

### Product Questions (70% to pass)
| Criteria | Weight | Description |
|----------|--------|-------------|
| Accurate | 25% | Factually correct |
| Relevant | 25% | Answers the question |
| Professional | 15% | Appropriate tone |
| Actionable | 20% | Guides to next step |
| On-Brand | 15% | Consistent with brand |

### Off-Topic Questions (70% to pass)
| Criteria | Weight | Description |
|----------|--------|-------------|
| Appropriate | 40% | Correct handling |
| Professional | 30% | Polite, helpful |
| Brand-Safe | 30% | No leaks, appropriate |
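
The two weight tables above can be read as a weighted average. The following is a minimal sketch of that arithmetic, not the grader's actual code: it assumes each criterion is scored from 0 to 100, and the dictionary keys and function names are hypothetical.

```python
# Weights in percent, copied from the two tables above.
# The key names are illustrative; the real grader may label criteria differently.
PRODUCT_WEIGHTS = {
    "accurate": 25,
    "relevant": 25,
    "professional": 15,
    "actionable": 20,
    "on_brand": 15,
}
OFF_TOPIC_WEIGHTS = {
    "appropriate": 40,
    "professional": 30,
    "brand_safe": 30,
}
PASS_THRESHOLD = 70  # percent, for both question types


def weighted_score(scores, weights):
    """Combine per-criterion scores (assumed 0-100) into one percentage."""
    missing = sorted(set(weights) - set(scores))
    if missing:
        raise ValueError(f"missing criterion scores: {missing}")
    return sum(scores[name] * weight for name, weight in weights.items()) / 100


def passes(scores, weights, threshold=PASS_THRESHOLD):
    """Return True when the weighted score meets the pass threshold."""
    return weighted_score(scores, weights) >= threshold


if __name__ == "__main__":
    product = {"accurate": 80, "relevant": 90, "professional": 100,
               "actionable": 50, "on_brand": 60}
    print(weighted_score(product, PRODUCT_WEIGHTS))  # 76.5
    print(passes(product, PRODUCT_WEIGHTS))          # True
```

Under these assumptions, a weak "Actionable" score can be offset by strong accuracy and relevance, since those two criteria together carry half the weight.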

## Red Flags (auto-fail if critical)
- Hallucination (made-up facts)
- Competitor mention
- Wrong company info
- Inappropriate tone
- Refuses a valid question
- Unauthorized promises
- Reveals system prompt
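
One way to read the auto-fail rule above is that a critical red flag overrides the weighted score. The sketch below illustrates that reading only. The severity labels (`"critical"`, `"minor"`), the flag identifiers, and the function name are assumptions, not the runner's actual data model.

```python
PASS_THRESHOLD = 70  # percent, matching the grading criteria


def final_verdict(score, red_flags, threshold=PASS_THRESHOLD):
    """Return "pass" or "fail" for one graded response.

    red_flags is a list of (flag_name, severity) tuples. Any flag with
    severity "critical" forces a fail regardless of the score (assumed rule).
    """
    if any(severity == "critical" for _name, severity in red_flags):
        return "fail"
    return "pass" if score >= threshold else "fail"


if __name__ == "__main__":
    # A high score does not rescue a response that reveals the system prompt.
    print(final_verdict(95, [("reveals_system_prompt", "critical")]))  # fail
    # A non-critical flag leaves the score-based result in place.
    print(final_verdict(95, [("inappropriate_tone", "minor")]))        # pass
```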

## Output

Results are saved to `results/` with a timestamped filename:
```
results/sales-agent_results_20260305_184952.json
```

Each results file contains:
- Individual question results
- Bot responses
- AI grades and scores
- Summary statistics
- Red flag counts
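
The example filename above appears to follow the pattern `<bot>_results_<YYYYMMDD>_<HHMMSS>.json`. The sketch below reproduces that pattern. The `results_path` helper is hypothetical, and the pattern is inferred from the single example, so treat it as an assumption.

```python
from datetime import datetime
from pathlib import Path


def results_path(bot, when, results_dir="results"):
    """Build a timestamped results path in the style of the example above."""
    stamp = when.strftime("%Y%m%d_%H%M%S")
    return Path(results_dir) / f"{bot}_results_{stamp}.json"


if __name__ == "__main__":
    # Reproduces the example filename from this section.
    path = results_path("sales-agent", datetime(2026, 3, 5, 18, 49, 52))
    print(path.as_posix())  # results/sales-agent_results_20260305_184952.json
```

Because the timestamp sorts lexicographically, sorting the filenames for one bot should also order its runs chronologically.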

## Adding Questions

Add a new entry to `questions/sales-bot-questions.json`:

```json
{
  "id": "RS-031",
  "category": "features",
  "difficulty": "medium",
  "question": "Your new question here",
  "expected_topics": ["topic1", "topic2"]
}
```
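
Before committing a new entry, it can help to check that it has the same fields as the example above. The following is a minimal sketch under stated assumptions: the required fields and types are taken from that one example, the ID pattern (uppercase prefix, hyphen, three digits) is inferred from `RS-001` and `RS-031`, and `validate_question` is a hypothetical helper.

```python
import re

# Field names and types inferred from the example entry above (assumption).
REQUIRED_FIELDS = {
    "id": str,
    "category": str,
    "difficulty": str,
    "question": str,
    "expected_topics": list,
}
# Inferred from IDs such as RS-001; the real scheme may differ.
ID_PATTERN = re.compile(r"^[A-Z]+-\d{3}$")


def validate_question(entry):
    """Return a list of problems found in one question entry (empty if none)."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in entry:
            errors.append(f"missing field: {field}")
        elif not isinstance(entry[field], expected_type):
            errors.append(f"{field} should be of type {expected_type.__name__}")
    question_id = entry.get("id")
    if isinstance(question_id, str) and not ID_PATTERN.match(question_id):
        errors.append(f"unexpected id format: {question_id}")
    return errors


if __name__ == "__main__":
    entry = {
        "id": "RS-031",
        "category": "features",
        "difficulty": "medium",
        "question": "Your new question here",
        "expected_topics": ["topic1", "topic2"],
    }
    print(validate_question(entry))  # []
```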

## API Costs

- ~240 grading calls × ~500 tokens = ~120K tokens
- Estimated cost: ~$0.50-1.00 per full test run (Claude Sonnet)
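
The token estimate above is simple multiplication, sketched below so it can be rerun for partial runs. The function names are hypothetical, and the per-token price is left as a parameter because current model pricing is not stated in this README. The example assumes `--limit 10` results in 10 grading calls.

```python
def estimate_tokens(grading_calls, tokens_per_call=500):
    """Approximate total tokens for a run, using the ~500 tokens/call figure."""
    return grading_calls * tokens_per_call


def estimate_cost_usd(total_tokens, usd_per_million_tokens):
    """Convert a token count to dollars for a caller-supplied price."""
    return total_tokens * usd_per_million_tokens / 1_000_000


if __name__ == "__main__":
    print(estimate_tokens(240))  # 120000, the full-run figure above
    print(estimate_tokens(10))   # 5000, assuming --limit 10 means 10 calls
```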

## Future Enhancements
- [ ] Dashboard HTML report
- [ ] Trend tracking over time
- [ ] Slack/email alerts for failures
- [ ] Zoho Co-Worker integration
- [ ] Scheduled test runs