API Documentation
Score, label, and evaluate any dataset with AI agents. One API call, structured results.
Quickstart
Score a batch of items in one API call. Here's a complete example:
curl -X POST https://scorehive.polsia.app/api/evaluate \
  -H "Content-Type: application/json" \
  -H "X-API-Key: sh_live_YOUR_KEY_HERE" \
  -d '{
    "name": "Search Quality Check",
    "context": "best pizza restaurants in NYC",
    "items": [
      {
        "title": "Best Pizza in NYC - Ultimate Guide",
        "url": "example.com/pizza-guide",
        "snippet": "Top 10 pizza places in New York City, rated by locals."
      },
      {
        "title": "NYC Weather Forecast",
        "url": "weather.com/nyc",
        "snippet": "Tomorrow will be sunny with highs near 72F."
      }
    ]
  }'
Response:
{
"success": true,
"evaluation": {
"id": 42,
"name": "Search Quality Check",
"status": "completed",
"total_items": 2,
"avg_score": 0.62
},
"results": [
{
"input": { "title": "Best Pizza in NYC - Ultimate Guide", "..." },
"scores": {
"relevance": 0.95,
"accuracy": 0.88,
"intent_alignment": 0.92
},
"overall_score": 0.92,
"confidence": 0.95,
"reasoning": "Directly relevant guide to NYC pizza restaurants.",
"flags": []
},
{
"input": { "title": "NYC Weather Forecast", "..." },
"scores": {
"relevance": 0.10,
"accuracy": 0.75,
"intent_alignment": 0.05
},
"overall_score": 0.31,
"confidence": 0.92,
"reasoning": "Weather forecast is unrelated to pizza restaurants.",
"flags": ["off_topic"]
}
]
}
That's it. Two items scored against three criteria, with confidence and reasoning for each.
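The same request can also be built from Python. The sketch below uses only the standard library. The helper name `build_evaluate_request` is illustrative and not part of any ScoreHive SDK, and the key is a placeholder.

```python
import json
import urllib.request

BASE_URL = "https://scorehive.polsia.app"


def build_evaluate_request(api_key, items, name=None, context=None):
    """Build (but do not send) a POST /api/evaluate request."""
    payload = {"items": items}
    if name is not None:
        payload["name"] = name
    if context is not None:
        payload["context"] = context
    return urllib.request.Request(
        BASE_URL + "/api/evaluate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )


request = build_evaluate_request(
    "sh_live_YOUR_KEY_HERE",  # placeholder key
    items=[
        {
            "title": "Best Pizza in NYC - Ultimate Guide",
            "url": "example.com/pizza-guide",
            "snippet": "Top 10 pizza places in New York City, rated by locals.",
        },
        {
            "title": "NYC Weather Forecast",
            "url": "weather.com/nyc",
            "snippet": "Tomorrow will be sunny with highs near 72F.",
        },
    ],
    name="Search Quality Check",
    context="best pizza restaurants in NYC",
)

# Sending requires a valid key and network access:
# with urllib.request.urlopen(request) as response:
#     body = json.load(response)
#     print(body["evaluation"]["avg_score"])
```

Building the request separately from sending it makes the payload easy to inspect or log before any network call is made.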
Authentication
All API requests require an API key. Pass it in one of two ways:
X-API-Key: sh_live_YOUR_KEY_HERE
Authorization: Bearer sh_live_YOUR_KEY_HERE
Getting an API Key
- Go to the Dashboard and scroll to API Key Management
- Enter the admin secret and click Authenticate
- Fill in a name and click Generate Key
- Copy the key immediately — it's shown only once
Keys start with sh_live_ and are hashed on our side. If you lose your key, you'll need to generate a new one.
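Because a key is shown only once, it helps to keep it outside your source code. The sketch below reads the key from an environment variable and builds either of the two header styles. The variable name `SCOREHIVE_API_KEY` and both helper names are assumptions made for illustration, and the prefix check assumes every key carries the `sh_live_` prefix seen in the examples.

```python
import os


def load_api_key(environ=os.environ):
    """Read the key from SCOREHIVE_API_KEY (a hypothetical variable name)."""
    key = environ.get("SCOREHIVE_API_KEY", "")
    if not key.startswith("sh_live_"):
        raise ValueError("SCOREHIVE_API_KEY is missing or lacks the sh_live_ prefix")
    return key


def auth_headers(api_key, style="x-api-key"):
    """Return the header for one of the two documented auth styles."""
    if style == "x-api-key":
        return {"X-API-Key": api_key}
    if style == "bearer":
        return {"Authorization": f"Bearer {api_key}"}
    raise ValueError(f"unknown auth style: {style!r}")


# A plain dict stands in for the real environment in this example.
key = load_api_key({"SCOREHIVE_API_KEY": "sh_live_YOUR_KEY_HERE"})
print(auth_headers(key))
print(auth_headers(key, style="bearer"))
```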
Base URL
https://scorehive.polsia.app
All endpoint paths below are relative to this base URL.
Endpoints
POST /api/evaluate
The core endpoint. Send an array of items and get back structured scores against a rubric. The JSON request body accepts these fields:
| Field | Type | Required | Description |
|---|---|---|---|
| items | array | Required | Array of items to score. Each item can be a string or an object. Max 50 items per request. |
| name | string | Optional | Name for this evaluation batch. Defaults to "Evaluation {date}". |
| context | string | Optional | Context or query that items should be evaluated against (e.g., a search query). |
| rubric | object | Optional | Custom scoring rubric. See Rubric Configuration. Uses default rubric if omitted. |
Items can be plain strings:
"items": [
  "The Earth revolves around the Sun",
  "Water boils at 50 degrees Celsius"
]
Or objects with any structure:
"items": [
  {
    "title": "Best Pizza in NYC",
    "url": "example.com/pizza",
    "snippet": "Top rated pizza places..."
  }
]
The response contains these fields:
| Field | Type | Description |
|---|---|---|
| success | boolean | Always true on success |
| evaluation | object | Evaluation metadata: id, name, status, total_items, avg_score |
| results | array | Scored items. Each has scores, overall_score, confidence, reasoning, flags |
Returns a paginated list of your past evaluations, newest first.
| Param | Type | Default | Description |
|---|---|---|---|
| limit | integer | 20 | Number of evaluations to return. Max 100. |
| offset | integer | 0 | Offset for pagination. |
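The limit and offset parameters can be combined to walk the full evaluation history. This is a minimal sketch that assumes a caller-supplied `fetch_page(limit, offset)` function, a hypothetical stand-in for the actual HTTP call, which returns the parsed response as a dict. The loop advances offset until the response's total has been reached.

```python
def iter_evaluations(fetch_page, limit=20):
    """Yield every evaluation by advancing offset until total is reached.

    fetch_page(limit, offset) is supplied by the caller and must return the
    parsed JSON response, including its "evaluations" and "total" fields.
    """
    if not 1 <= limit <= 100:
        raise ValueError("limit must be between 1 and 100")
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        evaluations = page["evaluations"]
        yield from evaluations
        offset += len(evaluations)
        # Stop on an empty page too, in case total changes between calls.
        if not evaluations or offset >= page["total"]:
            break


def fake_fetch_page(limit, offset):
    """Stand-in for a real HTTP call, serving 45 fake evaluations."""
    rows = [{"id": i} for i in range(45)]
    return {
        "success": True,
        "evaluations": rows[offset:offset + limit],
        "total": len(rows),
        "limit": limit,
        "offset": offset,
    }


ids = [e["id"] for e in iter_evaluations(fake_fetch_page, limit=20)]
print(len(ids))  # 45
```

Each call to `fetch_page` is expected to return a response of this shape: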
{
"success": true,
"evaluations": [
{
"id": 42,
"name": "Search Quality Check",
"status": "completed",
"total_items": 10,
"scored_items": 10,
"avg_score": 0.74,
"created_at": "2026-04-01T12:00:00Z"
}
],
"total": 156,
"limit": 20,
"offset": 0
}
Returns a single evaluation with its full list of scored items and metadata.
| Param | Type | Description |
|---|---|---|
| id | integer | Evaluation ID |
{
"success": true,
"evaluation": {
"id": 42,
"name": "Search Quality Check",
"rubric": { /* rubric config */ },
"status": "completed",
"total_items": 2,
"avg_score": 0.62,
"created_at": "2026-04-01T12:00:00Z"
},
"items": [
{
"id": 101,
"input_data": { "title": "Best Pizza..." },
"scores": { "relevance": 0.95, "accuracy": 0.88 },
"overall_score": 0.92,
"confidence": 0.95,
"reasoning": "Directly relevant...",
"flags": []
}
]
}
Returns aggregate statistics for your API key: totals, averages, score distribution, and 7-day trend.
{
"success": true,
"stats": {
"total_evaluations": 156,
"total_items_scored": 2340,
"global_avg_score": 0.72
},
"score_distribution": [
{ "bucket": "0.9-1.0", "count": 120 },
{ "bucket": "0.8-0.9", "count": 340 },
// ... 10 buckets from 0.0-0.1 to 0.9-1.0
],
"trend": [
{ "date": "2026-04-01", "evaluations": 12, "avg_score": 0.74 },
// ... last 7 days
]
}
Rubric Configuration
A rubric defines what criteria items are scored against. Each criterion has a weight (how much it contributes to the overall score) and a description (what the AI evaluates).
Default Rubric
If you don't pass a rubric field, ScoreHive uses this default:
{
"relevance": {
"weight": 0.4,
"description": "How relevant is this item to the query or context?"
},
"accuracy": {
"weight": 0.3,
"description": "How factually accurate is the content?"
},
"intent_alignment": {
"weight": 0.3,
"description": "How well does this align with the user intent?"
}
}
Custom Rubric
Pass your own rubric to score items on whatever criteria matter to your use case. Weights should sum to 1.0.
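A rubric can be checked on the client before it is sent. The sketch below is an illustration, not the service's own validation: the function name `validate_rubric` and the exact rules are assumptions based on the weight, description, and sum-to-1.0 requirements described in this section.

```python
import math


def validate_rubric(rubric):
    """Return a list of problems found in a rubric dict (empty means OK)."""
    if not isinstance(rubric, dict) or not rubric:
        return ["rubric must be a non-empty object"]
    problems = []
    weights = []
    for name, criterion in rubric.items():
        if not isinstance(criterion, dict):
            problems.append(f"{name}: criterion must be an object")
            continue
        weight = criterion.get("weight")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(weight, bool) or not isinstance(weight, (int, float)):
            problems.append(f"{name}: weight must be a number")
        else:
            weights.append(weight)
        if not isinstance(criterion.get("description"), str):
            problems.append(f"{name}: description must be a string")
    # Compare with a tolerance, since float sums are rarely exactly 1.0.
    if not problems and not math.isclose(sum(weights), 1.0, abs_tol=1e-9):
        problems.append(f"weights sum to {sum(weights):.2f}, expected 1.0")
    return problems


custom = {
    "content_quality": {"weight": 0.35, "description": "Is the content well-written and informative?"},
    "brand_safety": {"weight": 0.30, "description": "Is the content safe for brand placement?"},
    "audience_fit": {"weight": 0.20, "description": "Does this match the target audience?"},
    "freshness": {"weight": 0.15, "description": "Is the information current and up-to-date?"},
}
print(validate_rubric(custom))  # []
```

The rubric checked in this example is the one passed in the following request body: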
{
"rubric": {
"content_quality": {
"weight": 0.35,
"description": "Is the content well-written and informative?"
},
"brand_safety": {
"weight": 0.30,
"description": "Is the content safe for brand placement?"
},
"audience_fit": {
"weight": 0.20,
"description": "Does this match the target audience?"
},
"freshness": {
"weight": 0.15,
"description": "Is the information current and up-to-date?"
}
}
}
Response Format
Every scored item in the results array has the same structure:
| Field | Type | Range | Description |
|---|---|---|---|
| overall_score | float | 0.0 – 1.0 | Weighted average across all rubric criteria |
| scores | object | 0.0 – 1.0 each | Individual score per rubric criterion |
| confidence | float | 0.0 – 1.0 | AI confidence in the scoring. Lower values mean the item may need human review. |
| reasoning | string | — | 1-2 sentence explanation of the score |
| flags | string[] | — | Array of concern flags (e.g., off_topic, low_quality). Empty if no issues. |
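To make "weighted average" concrete, the sketch below recomputes an overall score from per-criterion scores and rubric weights. It is an illustration under the assumption that the average is a plain weight-normalized sum. The service's exact computation is not specified here, so a returned overall_score may differ from this value.

```python
DEFAULT_RUBRIC = {
    "relevance": {"weight": 0.4, "description": "How relevant is this item to the query or context?"},
    "accuracy": {"weight": 0.3, "description": "How factually accurate is the content?"},
    "intent_alignment": {"weight": 0.3, "description": "How well does this align with the user intent?"},
}


def weighted_overall(scores, rubric):
    """Weight-normalized average of the per-criterion scores."""
    total_weight = sum(criterion["weight"] for criterion in rubric.values())
    weighted_sum = sum(
        scores[name] * criterion["weight"] for name, criterion in rubric.items()
    )
    return weighted_sum / total_weight


# Scores of the first item in the Quickstart response.
pizza_scores = {"relevance": 0.95, "accuracy": 0.88, "intent_alignment": 0.92}
print(round(weighted_overall(pizza_scores, DEFAULT_RUBRIC), 2))  # 0.92
```

Dividing by the total weight keeps the result in the 0.0 to 1.0 range even if a rubric's weights do not sum exactly to 1.0.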
Score Interpretation
| Range | Label | Meaning |
|---|---|---|
| 0.7 – 1.0 | High | Strong match. Production-ready. |
| 0.4 – 0.69 | Medium | Partial match. May need review. |
| 0.0 – 0.39 | Low | Poor match. Likely irrelevant or incorrect. |
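The bands above translate into a small triage helper. In this sketch the cut-offs are taken as 0.7 and 0.4, which is an assumption about how values between the listed ranges (such as 0.695) are treated. The `needs_review` helper and its 0.8 confidence threshold are hypothetical and should be tuned to your own data.

```python
def score_label(score):
    """Map an overall_score to the High / Medium / Low bands."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    if score >= 0.7:
        return "High"
    if score >= 0.4:
        return "Medium"
    return "Low"


def needs_review(result, min_confidence=0.8):
    """Flag a result for human review (0.8 is a hypothetical threshold)."""
    return result["confidence"] < min_confidence or bool(result["flags"])


# The two results from the Quickstart response, reduced to the fields used.
pizza = {"overall_score": 0.92, "confidence": 0.95, "flags": []}
weather = {"overall_score": 0.31, "confidence": 0.92, "flags": ["off_topic"]}

for result in (pizza, weather):
    print(score_label(result["overall_score"]), needs_review(result))
# High False
# Low True
```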
Rate Limits & Pricing
| Limit | Value |
|---|---|
| Items per request | 50 |
| Concurrent scoring | 5 items in parallel |
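Datasets larger than the per-request limit need to be split across several requests. A minimal sketch of that batching step follows. The helper name `chunk_items` is illustrative, and how the batches are sent is left to the caller.

```python
MAX_ITEMS_PER_REQUEST = 50  # documented per-request limit


def chunk_items(items, size=MAX_ITEMS_PER_REQUEST):
    """Split items into consecutive batches of at most `size` elements."""
    if not 1 <= size <= MAX_ITEMS_PER_REQUEST:
        raise ValueError(f"size must be between 1 and {MAX_ITEMS_PER_REQUEST}")
    return [items[start:start + size] for start in range(0, len(items), size)]


batches = chunk_items([f"item {n}" for n in range(120)])
print([len(batch) for batch in batches])  # [50, 50, 20]
```

Each batch can then be sent as the items array of its own evaluation request.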
Error Handling
All errors follow a consistent format:
{
"success": false,
"message": "Description of what went wrong"
}
| Status | Meaning | Common Cause |
|---|---|---|
| 200 | Success | Request processed successfully |
| 400 | Bad Request | Missing items array, more than 50 items, invalid rubric structure |
| 401 | Unauthorized | Missing or invalid API key |
| 403 | Forbidden | API key has been deactivated |
| 404 | Not Found | Evaluation ID doesn't exist or belongs to another key |
| 500 | Server Error | Internal error during evaluation |
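Since every error shares the same shape, a client can handle all of them in one place. The sketch below turns a status code and raw body into either the parsed response or an exception. The names `ScoreHiveError` and `parse_response` are hypothetical and not part of an official client.

```python
import json


class ScoreHiveError(Exception):
    """Raised when a response does not indicate success."""

    def __init__(self, status, message):
        super().__init__(f"HTTP {status}: {message}")
        self.status = status
        self.message = message


def parse_response(status, body_text):
    """Return the parsed body on success, raise ScoreHiveError otherwise."""
    try:
        body = json.loads(body_text)
    except json.JSONDecodeError:
        body = {}
    if not isinstance(body, dict):
        body = {}
    if status == 200 and body.get("success") is True:
        return body
    raise ScoreHiveError(status, body.get("message", "no error message in response"))


try:
    parse_response(401, '{"success": false, "message": "Missing or invalid API key"}')
except ScoreHiveError as err:
    print(err.status, err.message)  # 401 Missing or invalid API key
```

Checking both the status code and the success field guards against a body that reports failure under an unexpected status.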