Valid JSONL can still make broken fine-tuning data

Parallelogram is a local-first tool for checking fine-tuning datasets. A file can parse as valid JSONL and still be unusable for training. Common problems include messages in the wrong role order, empty assistant answers, duplicated examples, examples that overflow the context window, and text encoding errors.
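To make these checks concrete, here is a minimal Python sketch. It is not Parallelogram's code. It assumes a chat-style record with a "messages" list of {"role", "content"} objects, an optional leading system message followed by alternating user and assistant turns, and a character count as a crude stand-in for a real token count. The function name check_jsonl and the max_chars limit are hypothetical.

```python
import hashlib
import json


def check_jsonl(text, max_chars=8000):
    """Return (line_number, problem) tuples for a chat-style JSONL dataset.

    Assumed record shape (not taken from the tool itself):
    {"messages": [{"role": "...", "content": "..."}, ...]}
    """
    problems = []
    seen = {}  # content hash -> first line number where it appeared
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append((lineno, "invalid JSON"))
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if (not isinstance(messages, list) or not messages
                or not all(isinstance(m, dict) for m in messages)):
            problems.append((lineno, "missing messages"))
            continue

        # Role order: optional system message, then user/assistant pairs.
        roles = [m.get("role") for m in messages]
        turns = roles[1:] if roles[0] == "system" else roles
        expected = ["user", "assistant"] * (len(turns) // 2)
        if not turns or turns != expected:
            problems.append((lineno, "bad role order"))

        # Empty assistant answers.
        contents = [str(m.get("content") or "") for m in messages]
        if any(r == "assistant" and not c.strip()
               for r, c in zip(roles, contents)):
            problems.append((lineno, "empty assistant answer"))

        # Exact duplicates, compared by a hash of the canonical JSON.
        canonical = json.dumps(messages, sort_keys=True).encode("utf-8")
        key = hashlib.sha256(canonical).hexdigest()
        if key in seen:
            problems.append((lineno, f"duplicate of line {seen[key]}"))
        else:
            seen[key] = lineno

        # Crude length check; a real check would count tokens.
        if sum(len(c) for c in contents) > max_chars:
            problems.append((lineno, "may overflow context window"))

        # U+FFFD usually means bytes were decoded with the wrong encoding.
        if any("\ufffd" in c for c in contents):
            problems.append((lineno, "possible encoding damage"))
    return problems
```

Every line that fails here except the non-JSON one would still pass a plain JSONL validator, which is the gap the tool is meant to cover.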

A public check found that the tool's website sent an HSTS header but was missing several basic trust signals: a Content Security Policy (CSP), frame protection, the nosniff header, Referrer-Policy, robots.txt, and security.txt. All of these have now been added, along with Permissions-Policy, a sitemap, and a SECURITY.md file in the code repository.
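A check of this kind can be approximated with a few lines of Python. This is a sketch under stated assumptions, not the check that was actually run. It takes a dictionary of response headers that has already been fetched, maps "nosniff" to the X-Content-Type-Options header, and treats either X-Frame-Options or a CSP frame-ancestors directive as frame protection. The function name missing_trust_headers is hypothetical.

```python
# Header names are standard; the choice of which ones to require
# follows the list in the text above.
REQUIRED_HEADERS = [
    "Strict-Transport-Security",   # HSTS
    "Content-Security-Policy",     # CSP
    "X-Content-Type-Options",      # carries the "nosniff" value
    "Referrer-Policy",
    "Permissions-Policy",
]


def missing_trust_headers(headers):
    """Return the names of expected security headers absent from `headers`.

    `headers` is a dict of response header names to values.
    Header names are compared case-insensitively.
    """
    lowered = {name.lower(): value for name, value in headers.items()}
    missing = [h for h in REQUIRED_HEADERS if h.lower() not in lowered]

    # Assumption: frame protection counts as present if either mechanism is set.
    csp = lowered.get("content-security-policy", "")
    if "x-frame-options" not in lowered and "frame-ancestors" not in csp.lower():
        missing.append("frame protection")
    return missing
```

Files such as robots.txt and security.txt are not headers, so they would need a separate check that requests those paths.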

The browser demo still makes no network calls while checking a dataset.

Key points

  • Valid JSONL does not mean the data is safe or useful for fine-tuning.
  • The tool checks for issues such as bad role order, empty answers, duplicates, context window overflow, and encoding problems.
  • The browser demo checks datasets without making network calls.
  • The website now includes CSP, Referrer-Policy, security.txt, and other basic security files.
  • For AI agent work, catching bad training data early can reduce wasted testing and training cost.

Quick term guide

local-first
An app design where your data is mainly stored and controlled on your own device.
fine-tuning
Taking an already-trained AI model and doing additional training to specialize it for a specific task.
context window
The maximum amount of text, measured in tokens, that an AI model can take in and use at one time.
trust signals
Signs that help users feel a service is real, safe, and careful with their data.
robots.txt
A small text file on a website that tells bots which pages they are or aren't allowed to visit.
permissions
Settings that define what files or actions a system or user is allowed to access.
repository
The folder that holds all the code files for a software project, often called a 'repo'.
training data
The collection of information used to teach an AI how to recognize patterns and answer questions.