Olostep Parsers Turn URLs Into Stable JSON Contracts
Most scraping failures do not look like failures. The job succeeds, status is green, and the HTML still arrives.
Then your pipeline degrades quietly. The field you depend on goes missing. Price becomes null. Email extraction returns an empty list. Dates shift format. Alerts stop firing and enrichment records go empty. Reviewers stop trusting the system because everything looks healthy while the output gets worse.
That is the real shift: pages are not the product, fields are. Once you accept that, the fix is not more scraping. The fix is treating the output as a contract your systems can rely on.
What a parser gives you in practice
An Olostep Parser is a field-first extractor. You provide a URL and request JSON. You get structured output with predictable keys, plus metadata you can store for provenance.
That provenance matters in production: when someone asks why a record looks wrong, you want traceability, not guesswork. With the metadata stored, you can show exactly what was extracted and when.
A simple record you persist might look like this. It is your storage shape, not the full Olostep response.
```json
{
  "url_to_scrape": "https://example.com/contact",
  "created": 1745673871,
  "retrieve_id": "ret_...",
  "parser_id": "@olostep/extract-emails",
  "data": { "emails": ["team@example.com"] }
}
```
Olostep can return JSON inline as json_content or provide a hosted file via json_hosted_url when payloads are larger. That behavior is described in the Parsers documentation.
Parsers vs LLM extraction and the hybrid pattern
Olostep supports structured JSON through parsers and through LLM Extraction. They fit different situations.
If you have recurring runs and fields that must stay stable, parsers are a strong default.
If you have one-off pages, changing structures, or you need a custom schema quickly, LLM extraction is often the fastest way to prove the shape.
Many teams land on a hybrid. Parsers handle the hard fields that power downstream logic, while LLM extraction handles fuzzy enrichment like classification, normalization, and summaries. Set one expectation early: Olostep notes that LLM Extraction may require enablement on your account. Here is a practical way to think about the choice, with a sketch of the hybrid merge step after the table.
| Approach | Best for | What to watch |
|---|---|---|
| Parsers | Recurring runs and stable fields | Limited to supported extractors unless you go custom |
| LLM extraction | One-off schemas and fuzzy structure | Drift risk unless you enforce schema, cost and latency can vary |
| Hybrid | Parsers for stable fields and LLM for fuzzy enrichment | You decide what must stay contract stable |
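To make the hybrid concrete, here is a minimal sketch of the merge step in Python. The `merge_record` helper, its field names, and the stand-in data are illustrations, not Olostep APIs; in a real pipeline `parser_data` would come from a parser run and `llm_data` from an LLM Extraction call.

```python
# Hypothetical merge step for a hybrid pipeline: the parser supplies the
# contract-stable fields and LLM extraction layers fuzzy enrichment on top.
def merge_record(parser_data: dict, llm_data: dict) -> dict:
    return {
        # Contract fields come from the parser and are never overwritten.
        "emails": parser_data.get("emails", []),
        # Enrichment fields come from LLM extraction and may be absent.
        "summary": llm_data.get("summary"),
        "category": llm_data.get("category"),
    }

# Stand-in data to show the merged shape.
print(merge_record({"emails": ["team@example.com"]},
                   {"summary": "B2B SaaS", "category": "software"}))
```

The design point is that only the LLM side is allowed to be fuzzy; the parser fields stay contract stable, so drift in enrichment never corrupts the keys your downstream logic depends on.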
Prototype mode: lock the contract on one URL
Start with one URL and validate the JSON shape that downstream systems will depend on. This is where Scrapes work well, and the snippet below is a minimal runnable example with a hosted fallback.
```python
import requests, json

URL = "https://example.com/contact"

# One-off scrape with a parser attached, requesting JSON output.
r = requests.post(
    "https://api.olostep.com/v1/scrapes",
    headers={"Authorization": "Bearer <API_KEY>"},
    json={
        "url_to_scrape": URL,
        "formats": ["json"],
        "parser": {"id": "@olostep/extract-emails"},
    },
).json()

res = r["result"]
payload = res.get("json_content")
if payload:
    # Inline JSON can arrive as a string or as an already-parsed object.
    data = json.loads(payload) if isinstance(payload, str) else payload
elif res.get("json_hosted_url"):
    # Larger payloads come back as a hosted file instead of inline JSON.
    data = requests.get(res["json_hosted_url"]).json()
else:
    raise RuntimeError("No JSON returned")

print(data)
```
Once this contract looks right, you are ready to scale, because the downstream systems already have a stable shape to rely on.
Production mode: scale with batches
When you move from one URL to thousands, orchestration becomes the product. Olostep’s Batches workflow is built for that jump.
The docs describe a few production-oriented realities: you can run up to 10k URLs per batch, processing time is often described as roughly constant (commonly five to eight minutes), and new accounts may be limited to 100 items per batch.
The production mental model stays simple: start a batch and get a batch id, poll status until it completes, list batch items using cursor pagination so you can collect retrieve ids, then fetch results later using the Retrieve API. For item listing, follow the Batch Items endpoint.
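A compressed sketch of that loop is below, assuming the endpoint paths and response fields implied by the flow above (`/v1/batches`, an `items` list carrying a `retrieve_id` per item, a `cursor` for pagination); verify the exact names against the Batches and Retrieve docs before depending on them.

```python
import time
import requests

BASE = "https://api.olostep.com/v1"
HEADERS = {"Authorization": "Bearer <API_KEY>"}

# 1. Start a batch and keep the batch id (field names are assumptions).
batch = requests.post(f"{BASE}/batches", headers=HEADERS, json={
    "items": [{"custom_id": "row-0", "url": "https://example.com/contact"}],
    "parser": {"id": "@olostep/extract-emails"},
}).json()
batch_id = batch["id"]

# 2. Poll status until the batch completes.
while requests.get(f"{BASE}/batches/{batch_id}",
                   headers=HEADERS).json().get("status") != "completed":
    time.sleep(30)

# 3. List batch items with cursor pagination to collect retrieve ids.
retrieve_ids, cursor = [], None
while True:
    params = {"cursor": cursor} if cursor else {}
    page = requests.get(f"{BASE}/batches/{batch_id}/items",
                        headers=HEADERS, params=params).json()
    retrieve_ids += [item["retrieve_id"] for item in page["items"]]
    cursor = page.get("cursor")
    if not cursor:
        break

# 4. Fetch each result later through the Retrieve API.
for rid in retrieve_ids:
    result = requests.get(f"{BASE}/retrieve", headers=HEADERS,
                          params={"retrieve_id": rid}).json()
```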
That flow gives you traceable outputs at scale so you are not just scraping pages, you are producing records you can audit and reason about later.
Guardrails that keep JSON trustworthy
This is where most teams win or lose reliability. First, store provenance by default: persist url, created, retrieve id, parser id, and batch or custom ids to make debugging and audits realistic instead of painful.
Second, handle large payloads explicitly. The Retrieve docs mention size limits and the size_exceeded behavior, and they note that hosted URLs expire after 7 days. Treat hosted output as a download link, not storage.
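In code, both guardrails come down to resolving the hosted link at ingest time and writing the full record yourself. A minimal sketch, assuming the result fields from the earlier snippet; the JSONL file is a stand-in for whatever store you actually use.

```python
import json
import requests

def persist_record(res: dict, url: str, parser_id: str, path: str) -> None:
    """Resolve hosted JSON at ingest time and append a provenance-rich record."""
    payload = res.get("json_content")
    if not payload and res.get("json_hosted_url"):
        # The hosted link expires, so download now instead of storing the URL.
        payload = requests.get(res["json_hosted_url"]).json()
    record = {
        "url_to_scrape": url,
        "created": res.get("created"),          # assumed present in the result
        "retrieve_id": res.get("retrieve_id"),  # assumed present in the result
        "parser_id": parser_id,
        "data": json.loads(payload) if isinstance(payload, str) else payload,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")  # append-only JSONL stand-in store
```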
Third, monitor field drift, not just job success. Track missing keys, null spikes per field, and shape changes. Those checks catch the quiet failures that erode trust.
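A minimal drift check can compare each record against the expected keys and count missing and empty values per field. The `EXPECTED_KEYS` set and the sample records below are placeholders for your own contract and data.

```python
from collections import Counter

EXPECTED_KEYS = {"emails"}  # the contract for this parser's output

def check_drift(records: list[dict]) -> dict:
    """Count missing keys and null/empty values per field across a run."""
    missing, empty = Counter(), Counter()
    for rec in records:
        data = rec.get("data") or {}
        for key in EXPECTED_KEYS:
            if key not in data:
                missing[key] += 1
            elif not data[key]:  # null or empty list is a quiet failure
                empty[key] += 1
    return {"missing": dict(missing), "empty": dict(empty)}

# Alert on spikes in these counts, even though every job "succeeded".
print(check_drift([{"data": {"emails": []}}, {"data": {}}]))
# {'missing': {'emails': 1}, 'empty': {'emails': 1}}
```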
If no parser fits
Start with LLM Extraction to prove the schema. If it becomes recurring or high volume, request or build a custom parser so the contract stays stable. The custom path is discussed on the Parsers page.
Key takeaways
- Treat parser output as a JSON contract that downstream systems can depend on.
- Lock the contract on one representative URL before you scale.
- Use batches for volume and the Retrieve API for later access.
- Persist provenance (url, created, retrieve id, parser id) to keep audits and debugging real.
- Monitor field drift and store long term output instead of relying on expiring hosted URLs.
FAQs
How do I avoid losing hosted JSON when the link expires?
Hosted URLs are temporary, with a 7-day window. Persist the parsed JSON you need in your own storage as part of your saved record.
Can I combine parsers with LLM extraction in one workflow?
Yes. A common production pattern is parsers for stable fields like ids, prices, and emails, plus LLM Extraction for fuzzy fields like summaries, normalization, and classification.
What should I log so I can debug wrong outputs later?
At minimum, log url, created, retrieve id, parser id (or your LLM schema version), and the final parsed JSON. The Parsers documentation shows the metadata you can use for a clean provenance trail.