Document Workflow

Detect a text file's encoding before import

Inspect BOM, strict UTF-8, NUL placement, control bytes, and ambiguity before choosing a decoder for CSV, log, source, or export imports.

Written and tested by SimpleWebUtils. Published: May 20, 2026. Reviewed: July 18, 2026.

How this workflow was checked

In Detect Text File Encoding from Byte Evidence, the “Distinguish a UTF-32LE export from its UTF-16LE prefix” fixture was run without repairing or simplifying its input. We verified the transition from “Inspect exact bytes manually” to “Keep ambiguous results ambiguous”, compared the final output with the expected values, and reviewed “Treating ASCII-compatible as ASCII origin” and “Accepting any continuation-shaped UTF-8” as failure paths.

The four-byte FF FE 00 00 signature and code-unit layout identified UTF-32LE; the detector did not stop at the shorter UTF-16LE prefix.


Problem

An importer must turn bytes into characters before it can parse columns, delimiters, or records. If its decoder does not match the source, a visible header such as name can acquire a hidden U+FEFF, a city such as Zürich can become ZÃ¼rich, and CJK text can turn into unrelated symbols. The extension .csv or .txt does not carry a reliable charset, and opening the file in an editor may silently decode and resave it before you inspect the evidence. Detection also has hard limits: a BOM is a strong signature but does not prove that every following code unit is valid; printable ASCII bytes are shared by UTF-8 and many legacy encodings; and similar single-byte charsets often cannot be distinguished without expected text or source metadata. A safe import workflow therefore preserves the original, records exact byte evidence, separates confirmation from heuristic classification, and tests a small destination import before modifying the full dataset.
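The two failure modes described above can be reproduced in a few lines. This is a minimal Python sketch; the sample row and the choice of cp1252 as the mismatched decoder are assumptions made for illustration, not values taken from a real import.

```python
# Illustrative bytes: a UTF-8 export that starts with a BOM (EF BB BF).
raw = "name,city\nAda,Zürich\n".encode("utf-8-sig")

# A plain UTF-8 decoder keeps the BOM as U+FEFF in front of the first header.
as_utf8 = raw.decode("utf-8")
first_header = as_utf8.split(",")[0]
print(repr(first_header))

# A mismatched single-byte decoder maps each byte of the two-byte UTF-8
# sequence for "ü" to its own character.
as_cp1252 = raw[3:].decode("cp1252")  # skip the three BOM bytes
print(as_cp1252.splitlines()[1])

# The BOM-aware decoder reproduces the expected values.
as_expected = raw.decode("utf-8-sig")
print(as_expected.splitlines()[1])
```

Both wrong decodes succeed without raising an error, which is why a known value is needed to tell them apart from the correct one.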

Sources and standards

These authoritative references define the formats or security boundaries used in this workflow. Tool-specific verification is documented separately above.

Encoding Standard
WHATWG

When to use this

A CSV header, customer name, accent, symbol, or CJK field changes after spreadsheet, database, or ETL import.
A receiving system requires UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE, or a BOM-free stream.
A source or translation repository must be audited before standardizing files on UTF-8.
An attachment has an uncertain file type and must be separated into plausible text or likely binary data before decoding.
You need a reproducible byte report for a vendor or data-pipeline incident rather than an editor's encoding label.

Steps

Step 1
Preserve the untouched source and destination requirement
Keep a byte-for-byte copy before opening or saving the file. Record the exact encoding names accepted by the importer, whether a BOM is required or forbidden, and one or more known values that a correct decode must reproduce.
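One way to make the preserved copy verifiable is to record its size and hash at the moment it is archived. The sketch below assumes a local file and a hypothetical helper name, `preserve_source`; it reads the file in binary mode only, so no decoder touches the bytes.

```python
import hashlib
import shutil
from pathlib import Path


def preserve_source(source: Path, archive_dir: Path) -> dict:
    """Copy the file unchanged and return its size and SHA-256 (hypothetical helper)."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    copy_path = archive_dir / source.name
    shutil.copy2(source, copy_path)  # copies content and timestamps

    data = copy_path.read_bytes()  # binary read: no decoding happens here
    if data != source.read_bytes():
        raise OSError("archived copy differs from the source")
    return {
        "copy": str(copy_path),
        "byte_count": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
    }
```

The returned record can sit next to the importer's accepted encoding names and the known values used later for comparison.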
Step 2
Inspect exact bytes manually
Select the original file in the character set detector and run the inspection. The file extension and MIME label are not evidence; the report starts with byte count and the leading 24-byte hex prefix.
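To cross-check the first lines of the report independently of the detector, the byte count and hex prefix can be computed directly. This is a sketch, not the detector's implementation; `byte_report` is a hypothetical name and the 24-byte default simply mirrors the prefix length mentioned above.

```python
def byte_report(data: bytes, prefix_len: int = 24) -> dict:
    """Return the byte count and the leading bytes as spaced, uppercase hex."""
    return {
        "byte_count": len(data),
        "hex_prefix": data[:prefix_len].hex(" ").upper(),  # bytes.hex(sep) needs Python 3.8+
    }


sample = bytes.fromhex("FF FE 00 00 69 00 00 00 64 00 00 00")
print(byte_report(sample))
```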
Step 3
Separate signatures from structure and heuristics
Treat EF BB BF, FE FF, FF FE, 00 00 FE FF, or FF FE 00 00 at byte zero as confirmed BOM evidence, and match the four-byte signatures before the two-byte ones because FF FE is also the start of FF FE 00 00. Read strict UTF-8 validation and UTF-16 or UTF-32 structure separately, because a complete signature can still precede malformed data.
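A signature check along these lines might look like the following sketch. It uses the BOM constants from Python's standard `codecs` module; the function name `detect_bom` and the label strings are assumptions. The table is ordered so that the four-byte signatures are tried before the two-byte ones.

```python
import codecs

# Longest signatures first, so FF FE 00 00 is not reported as UTF-16LE.
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),  # FF FE 00 00
    (codecs.BOM_UTF32_BE, "UTF-32BE"),  # 00 00 FE FF
    (codecs.BOM_UTF8, "UTF-8"),         # EF BB BF
    (codecs.BOM_UTF16_LE, "UTF-16LE"),  # FF FE
    (codecs.BOM_UTF16_BE, "UTF-16BE"),  # FE FF
]


def detect_bom(data: bytes):
    """Return (label, signature length) for a BOM at byte zero, or (None, 0)."""
    for signature, label in BOMS:
        if data.startswith(signature):
            return label, len(signature)
    return None, 0
```

A UTF-16LE file whose first character is U+0000 would begin with the same four bytes as the UTF-32LE signature, which is one reason the structure check stays a separate step.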
Step 4
Keep ambiguous results ambiguous
ASCII-compatible means several encodings can produce the same bytes. A legacy or single-byte candidate means strict UTF-8 failed without strong binary evidence; it does not prove Windows-1252, ISO-8859-1, Shift_JIS, or another named charset.
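The sketch below shows one way to keep the result labels honest for input without a BOM. It is a simplification under stated assumptions: the function name and label strings are hypothetical, and the binary-evidence check mentioned above is left out, so a real classifier would need that extra branch.

```python
def classify_without_bom(data: bytes) -> str:
    """Classify BOM-free bytes without naming a charset the bytes cannot prove."""
    if all(b < 0x80 for b in data):
        # Pure 7-bit content: UTF-8 and many legacy encodings produce these bytes.
        return "ASCII-compatible (origin unknown)"
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        # Strict UTF-8 failed; this still does not identify a specific charset.
        return "legacy or single-byte candidate (charset not identified)"
    return "valid UTF-8 with non-ASCII sequences"
```

Note that none of the labels names Windows-1252, ISO-8859-1, or Shift_JIS; choosing among those needs expected text or source metadata.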
Step 5
Try controlled candidate decoders
Start with a decoder supported by the byte evidence and source metadata. Compare known headers, accents, names, delimiters, and replacement characters. Do not overwrite the source while trying alternatives.
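A controlled trial can be scripted so that every candidate is decoded from the same in-memory bytes and nothing is written back. In this sketch the helper name `try_decoders`, the candidate list, and the known values are all illustrative assumptions.

```python
def try_decoders(data: bytes, candidates: list[str], known_values: list[str]) -> list[dict]:
    """Decode the same bytes with each candidate and compare against known values."""
    results = []
    for name in candidates:
        try:
            text = data.decode(name, errors="strict")
        except UnicodeDecodeError as exc:
            results.append({"decoder": name, "ok": False, "error": exc.reason})
            continue
        results.append({
            "decoder": name,
            "ok": True,
            "missing": [v for v in known_values if v not in text],
            "has_replacement_char": "\ufffd" in text,
        })
    return results


# Illustrative input: a cp1252 export with one accented value.
data = "name,city\nAda,Zürich\n".encode("cp1252")
results = try_decoders(data, ["utf-8", "cp1252"], ["name", "Zürich"])
```

For this sample, latin-1 would reproduce the known values as well, so a passing decoder is supporting evidence rather than proof of the original label.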
Step 6
Run a small destination import
Import a few representative rows containing ASCII, accents, CJK, emoji, empty fields, and quoted delimiters. Verify both visible text and the exact first field name before scaling the job.
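Before running the sample through the real destination, a local dry run can at least confirm the exact first field name. The sketch below uses Python's `csv` module on an invented three-row sample; it stands in for the destination importer and does not replace testing it.

```python
import csv
import io


def read_rows(data: bytes, encoding: str) -> list[list[str]]:
    """Decode the bytes strictly and parse them as CSV rows."""
    text = data.decode(encoding, errors="strict")
    return list(csv.reader(io.StringIO(text, newline="")))


# Invented sample: accents, a quoted delimiter, CJK, an empty field, and an emoji.
sample = 'name,city,note\r\nAda,Zürich,"a, quoted"\r\n東京,,😀\r\n'.encode("utf-8-sig")

rows = read_rows(sample, "utf-8-sig")
print(repr(rows[0][0]))                          # exact first field name
print(repr(read_rows(sample, "utf-8")[0][0]))    # same bytes, BOM left in place
```

Printing with `repr` matters here because a leading U+FEFF is invisible in most previews.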
Step 7
Convert only after validation
When the source decoder and destination contract are confirmed, convert once with an explicit output encoding. Reinspect the output bytes and keep the original plus the conversion record for rollback.
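A single explicit conversion with a round-trip check could be sketched as follows. The helper name `convert_once` is hypothetical. The codec names are Python's: `utf-32` consumes a leading BOM, `utf-8` writes no BOM, and `utf-8-sig` writes one.

```python
def convert_once(data: bytes, source_encoding: str, target_encoding: str) -> bytes:
    """Decode strictly, encode strictly, and verify the output decodes to the same text."""
    text = data.decode(source_encoding, errors="strict")
    converted = text.encode(target_encoding, errors="strict")
    # Reinspect the output instead of trusting the encoder.
    if converted.decode(target_encoding, errors="strict") != text:
        raise ValueError("round trip changed the text")
    return converted


source = bytes.fromhex("FF FE 00 00 69 00 00 00 64 00 00 00")
bom_free = convert_once(source, "utf-32", "utf-8")
with_bom = convert_once(source, "utf-32", "utf-8-sig")
```

Writing the result to a new path, rather than over the source, keeps the original available for rollback.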

Example

Distinguish a UTF-32LE export from its UTF-16LE prefix

Input

customers.bin
Leading bytes: FF FE 00 00 69 00 00 00 64 00 00 00
Expected first field: id

Output

Classification: UTF-32 LE
Evidence: confirmed four-byte BOM; valid UTF-32 structure
Do not classify as UTF-16LE from FF FE alone
Next step: test a UTF-32LE decoder against the id header
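The next step can be tried with Python's built-in codecs, as a sketch independent of the detector. The `utf-32` and `utf-16` codecs both read the BOM to choose a byte order, so the same twelve bytes can be decoded both ways and compared against the expected header.

```python
data = bytes.fromhex("FF FE 00 00 69 00 00 00 64 00 00 00")

# Four-byte code units after a four-byte BOM: matches the expected first field.
as_utf32 = data.decode("utf-32")
print(repr(as_utf32))

# Two-byte code units after a two-byte BOM: decodes without error,
# but interleaves U+0000 with the letters.
as_utf16 = data.decode("utf-16")
print(repr(as_utf16))
```

The UTF-16 decode raises no exception, so only the comparison with the known header exposes it as the wrong choice.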

Common mistakes

Trusting the extension or MIME type

.csv and text/plain describe format intent, not the byte-to-character mapping. Inspect the stream and source contract.

Treating ASCII-compatible as ASCII origin

Printable ASCII bytes are also valid UTF-8 and overlap many legacy charsets, so the original label remains unknown.

Accepting any continuation-shaped UTF-8

Overlong sequences, surrogate encodings, truncated sequences, and values above U+10FFFF are ill-formed even when some continuation bytes look plausible.
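Each of these cases can be checked against a strict decoder. The sketch below uses Python's built-in UTF-8 codec in strict mode; the byte strings are small hand-built examples of each category and the helper name is an assumption.

```python
ILL_FORMED = {
    "overlong encoding of '/'": b"\xc0\xaf",
    "encoded surrogate U+D800": b"\xed\xa0\x80",
    "truncated three-byte sequence": b"\xe2\x82",
    "value above U+10FFFF": b"\xf4\x90\x80\x80",
}


def is_strict_utf8(data: bytes) -> bool:
    """Return True only if the bytes are well-formed UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


for label, sample in ILL_FORMED.items():
    print(label, is_strict_utf8(sample))
```

Decoding with `errors="replace"` or `errors="ignore"` would hide all four problems, so validation should use strict mode.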

Forcing text decoding on binary data

NUL and control bytes that do not fall into the regular positions expected of UTF-16 or UTF-32 code units can indicate a binary file. Verify the file type before decoding.

Skipping the destination sample

A plausible local preview does not prove that the spreadsheet, database, or ETL job uses the same decoder or BOM rule.

FAQ

Why is a BOM result called confirmed but the structure can be invalid?

The signature bytes identify an intended Unicode encoding scheme. They do not prevent truncation, unpaired UTF-16 surrogates, or invalid UTF-32 scalar values later in the file.
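The following sketch shows three inputs that carry a complete BOM and still fail a strict decode. It relies on Python's `utf-16` and `utf-32` codecs; the byte strings are hand-built examples and `structure_error` is a hypothetical helper name.

```python
def structure_error(data: bytes, codec: str):
    """Return the decoder's reason string if the structure is invalid, else None."""
    try:
        data.decode(codec, errors="strict")
    except UnicodeDecodeError as exc:
        return exc.reason
    return None


CASES = [
    ("truncated UTF-16LE code unit", b"\xff\xfe\x69", "utf-16"),
    ("unpaired UTF-16LE high surrogate", b"\xff\xfe\x00\xd8\x69\x00", "utf-16"),
    ("UTF-32LE value above U+10FFFF", b"\xff\xfe\x00\x00\x00\x00\x11\x00", "utf-32"),
]

for label, data, codec in CASES:
    print(label, "->", structure_error(data, codec))
```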

Why does printable English text remain ASCII-compatible?

The same byte values represent those characters in ASCII, UTF-8, Windows-1252, ISO-8859 variants, and other ASCII-compatible encodings. Bytes alone cannot recover the original label.

Can NUL bytes prove UTF-16 or UTF-32?

No. Repeated endian-specific NUL lanes plus valid code-unit structure can support a heuristic for sufficiently long ASCII-heavy text, but binary formats can also contain NUL bytes.
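A NUL lane profile can be computed by counting zero bytes at each offset modulo four. This is a sketch of the heuristic only; the function name is an assumption, and the result is supporting evidence that still needs a code-unit structure check and a reasonably long, ASCII-heavy input.

```python
def nul_lane_fractions(data: bytes) -> list[float]:
    """Return the fraction of NUL bytes at each byte offset modulo 4."""
    fractions = []
    for offset in range(4):
        lane = data[offset::4]  # every fourth byte, starting at this offset
        fractions.append(lane.count(0) / len(lane) if lane else 0.0)
    return fractions


text = "id,name"
print(nul_lane_fractions(text.encode("utf-16-le")))  # NULs in the odd lanes
print(nul_lane_fractions(text.encode("utf-16-be")))  # NULs in the even lanes
print(nul_lane_fractions(text.encode("utf-32-le")))  # NULs in three of four lanes
```

A binary file can produce similar fractions by coincidence, which is why the profile supports a heuristic and never a confirmation.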

Should every valid UTF-8 file be converted?

No. If the destination already accepts the file's BOM policy and content, conversion adds risk without value.

Is the source file uploaded for inspection?

No. Normal inspection reads the file in the browser. Aggregate events exclude the file name and contents.

What evidence should I save with an import incident?

Record the source hash, byte count, leading hex, BOM result, strict UTF-8 result, selected decoder, destination setting, and a verified sample output without exposing sensitive row data.
