Why I Built Sea Land AI
The Origin: AI Has Become Part of My Workflow
The progress of reasoning models and AI Agents in 2025 has been nothing short of remarkable.
I'm a heavy user of these technologies myself. Over the past year, I've written a significant amount of code with various AI models. They've become an indispensable part of my daily workflow—from debugging to refactoring, from prototyping to documentation, AI is involved in almost every aspect of my work.
But the more I use them, the clearer the problems become.
Two Core Pain Points:
- Hallucination: AI confidently fabricates non-existent APIs, incorrect mathematical results, and fake references. This is especially deadly when coding: the model may assert that a library has a certain function when in reality no such function exists.
- Insufficient Reasoning: When facing moderately complex logical reasoning or math problems, current models often fall short. Steps get skipped, contradictions appear, and answers come out wrong.
I've tried many workarounds—switching models, tweaking prompts, adding verification pipelines—but these are all band-aids, not solutions.
So I decided: Rather than waiting for someone else to solve these problems, I'll do it myself.
If I can improve AI's accuracy and reasoning capabilities, I'll be the first beneficiary: a real boost to my own productivity.
Current Results
We evaluated Sea Land AI against Google's Gemini 3.0 Pro on four authoritative benchmarks:
| Benchmark | Description | Gemini 3.0 Pro | Sea Land AI |
|---|---|---|---|
| MathArena Apex | 12 top-tier math competition problems, 4 attempts each | 14.6% | 20.83% |
| SimpleQA | 1000 factual knowledge questions | 72.6% | 75.4% |
| IMO 2025 | 6 International Math Olympiad problems, max score 42 | 63.7% (26.8/42) | 76.2% (32/42) |
| ARC-AGI-2 | 120 abstract reasoning problems, 2 attempts each | 23.3% | 24.1% |
Important Notes:
- Testing Conditions: All tests were conducted with no tool usage, no web search, and no code execution. This is to evaluate the model's native reasoning capabilities, not its ability to use external tools.
- About Result Discrepancies:
  - SimpleQA: Our measured Gemini 3.0 Pro result (72.6%) is higher than the officially reported figure, possibly due to differences in prompt design.
  - MathArena Apex and ARC-AGI-2: Our measured Gemini results are lower than the official figures, possibly because of different testing setups: we average scores across multiple attempts with tools disabled, whereas the official results may use best-of-N scoring or allow tool use (a small illustration of the difference follows this list).
  - The exact reasons would need clarification from the Gemini team.
- Open Source: We've made all of our benchmark code publicly available, so anyone interested can read the code and reproduce the results themselves.
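To make the scoring difference concrete, here is a minimal sketch of how per-attempt averaging and best-of-N counting can diverge on the same raw outputs. The attempt data, function names, and numbers below are purely illustrative assumptions, not taken from our evaluation harness or from any of the benchmarks above.

```python
# Illustrative only: toy data, not real benchmark results.

def average_over_attempts(attempts_per_problem):
    """Each problem's score is the fraction of attempts that succeeded;
    the benchmark score is the mean over problems (the scheme we report)."""
    per_problem = [sum(a) / len(a) for a in attempts_per_problem]
    return sum(per_problem) / len(per_problem)

def best_of_n(attempts_per_problem):
    """A problem counts as solved if any single attempt succeeded."""
    solved = sum(1 for a in attempts_per_problem if any(a))
    return solved / len(attempts_per_problem)

# Toy example: 3 problems, 4 attempts each (1 = correct, 0 = incorrect).
attempts = [
    [1, 0, 0, 0],  # solved once in four tries
    [0, 0, 0, 0],  # never solved
    [1, 1, 0, 1],  # solved in three of four tries
]

print(f"average over attempts: {average_over_attempts(attempts):.1%}")  # 33.3%
print(f"best-of-4:             {best_of_n(attempts):.1%}")              # 66.7%
```

Same outputs, very different headline numbers, and the gap grows with the number of attempts. Comparisons are only meaningful when the scoring scheme is held fixed, which is why we call ours out explicitly.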
Our Goal
The core mission of Sea Land AI is simple:
Reduce AI hallucination. Improve AI logic and reasoning.
We believe this is the key to AI's broader adoption. Only when AI is accurate enough, and its reasoning reliable enough, will people truly trust and depend on it without having to manually verify every output. That will dramatically reduce the cost and effort of using AI.
Current Status
Sea Land AI is now open for beta testing.
Visit: sea-land.ai
If the beta results meet expectations, we will open public access in one week.
We welcome everyone to try it out and provide feedback.
— Jinming Hu (胡津铭), Founder of Sea Land