Document Workflow

Recover garbled text from the original file bytes

Diagnose mojibake from untouched bytes, test candidate decoders against known text, and avoid destructive resaves before importing repaired data.

Written and tested by SimpleWebUtils
Published: May 21, 2026
Reviewed: July 18, 2026

How this workflow was checked

Verification started with the guide's “Recover a UTF-8 customer name displayed as Windows-1252” fixture in Detect Text File Encoding from Byte Evidence. The run followed “Stop resaving the broken preview” through “Choose candidates from evidence and provenance”, compared the produced result to the documented expectation, and checked the distinct limits behind “Encoding the broken display again” and “Using Unicode normalization as a repair”.

Decoding the original 4A C3 BC byte sequence as UTF-8 recovered Jürgen, while re-encoding the already broken JÃ¼rgen display failed the byte comparison.

Problem

Mojibake appears when one encoding's bytes are interpreted through a different character map. UTF-8 bytes for ü can be displayed as Ã¼ when read as Windows-1252, and Korean text can become several Latin-looking fragments after the same mistake. Once that broken display is saved, the application may encode the wrong characters again, creating a second transformation and sometimes losing information through replacement characters. BOM handling is a separate issue: a hidden U+FEFF in the first CSV field can break a header even when the rest of the text decodes correctly. Unicode normalization is separate again; NFC and NFD can compare differently after correct decoding, but normalization cannot repair wrong byte interpretation. Recovery must begin with the earliest untouched bytes, use byte evidence and source metadata to choose decoder candidates, verify known names or headers, and distinguish reversible mojibake from data that has already been replaced or discarded.
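
As a minimal Python sketch of the first failure described above, the snippet below encodes correct text as UTF-8 and then decodes those bytes as Windows-1252. The `misread` helper and the sample strings are illustrative assumptions, and Python's `cp1252` codec stands in for whatever application performed the wrong decode.

```python
def misread(text: str, actual: str = "utf-8", assumed: str = "cp1252") -> str:
    """Encode text with its real encoding, then decode it with the wrong one."""
    return text.encode(actual).decode(assumed)


print(misread("Jürgen"))  # JÃ¼rgen
print(misread("한글"))     # í•œê¸€
```

Python's `cp1252` codec raises `UnicodeDecodeError` on the few byte values that Windows-1252 leaves undefined, so some inputs fail in this sketch even though an editor or browser would still display something for them.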

Sources and standards

These authoritative references define the formats or security boundaries used in this workflow. Tool-specific verification is documented separately above.

Encoding Standard (WHATWG)

When to use this

Text looked correct in the source system but became Ã¼, â€™, mojibake fragments, boxes, or replacement characters after export or import.
A multilingual CSV has correct delimiters but broken names, cities, comments, or headers.
The first column behaves differently from later columns because a BOM may have become part of its name.
A copied broken string exists, but you need to determine whether the untouched file still contains recoverable source bytes.
A repair script must be documented and tested before it changes a full customer or production dataset.

Steps

Step 1
Stop resaving the broken preview
Close the editor or spreadsheet without saving. Preserve the original export and calculate a hash if the incident matters operationally. Work on copies so each decoder trial starts from identical bytes.
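
One way to preserve the export and work on verified copies is sketched below in Python. The function names `sha256_of` and `make_working_copy` and the directory layout are assumptions for illustration, not part of the documented tool.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large exports do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_working_copy(original: Path, workdir: Path) -> Path:
    """Copy the original into workdir and confirm the bytes are identical."""
    workdir.mkdir(parents=True, exist_ok=True)
    copy = workdir / original.name
    shutil.copyfile(original, copy)
    if sha256_of(copy) != sha256_of(original):
        raise RuntimeError("working copy differs from the original")
    return copy
```

Recording the hash alongside the incident notes lets a later reviewer confirm that every decoder trial started from the same bytes.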
Step 2
Locate the first wrong boundary
Identify where bytes became text: export, HTTP response, file reader, spreadsheet import, database client, or terminal. Record the charset declared or assumed at that boundary.
Step 3
Inspect the original byte evidence
Run the original file through the character set detector. Record the BOM, strict UTF-8 status, NUL pattern, control-byte warning, leading hex, and whether the result is confirmed, strong, heuristic, compatible, or undetermined.
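
For readers who want to reproduce part of this inspection in a script, the Python sketch below collects similar evidence from raw bytes. It is an approximation under stated assumptions, not the detector's implementation: the `byte_evidence` name and the dictionary keys are hypothetical, and it does not assign the confirmed, strong, heuristic, compatible, or undetermined labels.

```python
import codecs

# UTF-32 marks are listed first because the UTF-32LE BOM begins with the UTF-16LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]


def byte_evidence(data: bytes) -> dict:
    """Summarize BOM, strict UTF-8 validity, NUL and control bytes, and leading hex."""
    bom = next((name for mark, name in BOMS if data.startswith(mark)), None)
    try:
        data.decode("utf-8", errors="strict")
        strict_utf8 = True
    except UnicodeDecodeError:
        strict_utf8 = False
    return {
        "bom": bom,
        "strict_utf8": strict_utf8,
        "has_non_ascii": any(b >= 0x80 for b in data),
        "nul_count": data.count(0),
        # C0 control bytes other than tab, LF, and CR (NUL is included here too).
        "control_bytes": sum(1 for b in data if b < 0x20 and b not in (0x09, 0x0A, 0x0D)),
        "leading_hex": data[:16].hex(" ").upper(),
    }


print(byte_evidence(bytes.fromhex("4A C3 BC 72 67 65 6E")))
```

Valid UTF-8 that contains only ASCII bytes is weak evidence, which is why the sketch reports `has_non_ascii` separately from `strict_utf8`.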
Step 4
Choose candidates from evidence and provenance
Prefer the encoding confirmed by an exact BOM or by valid non-ASCII UTF-8 sequences. For undetermined legacy text, use the producing application, locale, operating system, protocol header, and vendor documentation to choose a short candidate list.
Step 5
Compare known-text previews
Decode a copy with each candidate and compare expected headers, names, punctuation, CJK, emoji, and delimiters. Reject any preview containing unexpected replacement characters or implausible control characters.
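
The comparison can be scripted roughly as follows. This is a sketch, assuming a hypothetical `preview_candidates` helper and a short list of values you already know should appear. It decodes strictly, so a candidate that would need replacement characters is rejected as a failed decode instead of producing U+FFFD.

```python
def preview_candidates(data: bytes, candidates: list[str], expected: list[str]) -> dict[str, dict]:
    """Decode data with each candidate and report which known values are found."""
    results = {}
    for encoding in candidates:
        try:
            text = data.decode(encoding, errors="strict")
        except UnicodeDecodeError as exc:
            results[encoding] = {"ok": False, "reason": f"decode failed at byte {exc.start}"}
            continue
        missing = [value for value in expected if value not in text]
        # C1 control characters (U+0080 to U+009F) are implausible in names or headers.
        has_c1 = any(0x80 <= ord(ch) <= 0x9F for ch in text)
        results[encoding] = {
            "ok": not missing and not has_c1,
            "missing": missing,
            "c1_controls": has_c1,
        }
    return results


sample = "name,city\nJürgen,Köln\n".encode("utf-8")
for encoding, report in preview_candidates(sample, ["utf-8", "cp1252"], ["Jürgen", "Köln"]).items():
    print(encoding, report)
```

A passing candidate is still only a candidate: two encodings can both reproduce a small set of known values, so widen the expected list when more than one passes.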
Step 6
Handle BOM and normalization separately
Remove a BOM only when the destination rejects it or treats it as data. Apply NFC or another normalization form only after decoding is correct and the destination contract requires it.
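
The two concerns can be kept apart in code, as in this Python sketch. The sample bytes are a hypothetical CSV with a UTF-8 BOM and a name stored in decomposed form; `utf-8-sig` is the standard Python codec that drops one leading BOM when present.

```python
import codecs
import csv
import io
import unicodedata

# Hypothetical sample: UTF-8 BOM, then "Jürgen" written as u followed by U+0308.
raw = codecs.BOM_UTF8 + "name,city\nJu\u0308rgen,Köln\n".encode("utf-8")

# Plain "utf-8" keeps U+FEFF, so it becomes part of the first header name.
header_with_bom = next(csv.reader(io.StringIO(raw.decode("utf-8"))))
# "utf-8-sig" strips one leading BOM and otherwise behaves like "utf-8".
text = raw.decode("utf-8-sig")
header_clean = next(csv.reader(io.StringIO(text)))

print(header_with_bom[0] == "name")  # False
print(header_clean[0] == "name")     # True

# Normalization is a separate, later step applied to correctly decoded text.
composed = "J\u00fcrgen"  # precomposed form of the same name
name = text.splitlines()[1].split(",")[0]
print(name == composed)                                # False
print(unicodedata.normalize("NFC", name) == composed)  # True
```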
Step 7
Convert once and verify the destination
Write a new file with an explicit target encoding, normally UTF-8 when the destination supports it. Reinspect its bytes, run a small import, compare counts and key values, and retain the original for rollback.
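
A single-pass conversion might look like the sketch below. The `convert_once` name and its parameters are assumptions for illustration; the sample import and the count comparison against the destination still have to happen outside this function.

```python
from pathlib import Path


def convert_once(source: Path, target: Path,
                 source_encoding: str, target_encoding: str = "utf-8") -> str:
    """Decode source strictly, write target as a new file, and return the text."""
    text = source.read_bytes().decode(source_encoding, errors="strict")
    # Mode "xb" refuses to overwrite, so neither the original nor an
    # earlier output can be replaced by accident.
    with target.open("xb") as handle:
        handle.write(text.encode(target_encoding, errors="strict"))
    # Reinspect the written bytes instead of trusting the write.
    if target.read_bytes().decode(target_encoding, errors="strict") != text:
        raise RuntimeError("written file does not decode back to the same text")
    return text
```

Strict error handling on both the decode and the encode makes the conversion stop at the first byte or character that does not fit, instead of silently substituting a replacement.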

Example

Recover a UTF-8 customer name displayed as Windows-1252

Input

Visible broken value: JÃ¼rgen
Original bytes around the name: 4A C3 BC 72 67 65 6E
Detector: strict UTF-8, no BOM, strong evidence

Output

Decode the original bytes as UTF-8 -> Jürgen
Do not encode the visible JÃ¼rgen text as a new source
Verify the corrected name and surrounding CSV delimiters in a sample import
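
The fixture above can be reproduced in a few lines of Python. This is a sketch of the decode step only; the variable names are illustrative, and `cp1252` is Python's codec name for Windows-1252.

```python
original = bytes.fromhex("4A C3 BC 72 67 65 6E")

repaired = original.decode("utf-8", errors="strict")
broken_view = original.decode("cp1252")

print(repaired)     # Jürgen
print(broken_view)  # JÃ¼rgen
```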

Common mistakes

Encoding the broken display again

The visible mojibake is already text produced by a wrong decode. Re-encoding it without tracing the transformations can add another layer of corruption.
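
A short Python sketch shows the extra layer, assuming the broken display is saved as UTF-8 and then read as Windows-1252 again. The variable names are illustrative.

```python
original = bytes.fromhex("4A C3 BC 72 67 65 6E")
broken_display = original.decode("cp1252")  # JÃ¼rgen

# Saving the broken display as UTF-8 encodes Ã and ¼ as two bytes each.
resaved = broken_display.encode("utf-8")
print(resaved.hex(" ").upper())  # 4A C3 83 C2 BC 72 67 65 6E
print(resaved == original)       # False

# Reading the resaved file with the same wrong decoder adds a second layer.
print(resaved.decode("cp1252"))  # JÃƒÂ¼rgen
```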

Using Unicode normalization as a repair

Normalization reconciles equivalent Unicode sequences after correct decoding. It does not reinterpret source bytes.
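
As a quick check in Python, none of the four normalization forms turns the broken display into the intended name. The strings are written with escapes so the example does not depend on the encoding of the script file.

```python
import unicodedata

broken = "J\u00c3\u00bcrgen"  # the visible mojibake JÃ¼rgen
intended = "J\u00fcrgen"      # Jürgen

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    # Prints False for every form: normalization never reinterprets bytes.
    print(form, unicodedata.normalize(form, broken) == intended)
```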

Removing every BOM

A BOM can be required, optional, or forbidden by the destination. It is not the cause of every garbled character.

Guessing Windows-1252 from invalid UTF-8 alone

Many legacy encodings fail UTF-8 validation. Use provenance and known text before selecting one.
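
The ambiguity is easy to demonstrate in Python. The byte sequence below is a hypothetical legacy sample; it fails UTF-8 validation, yet several single-byte encodings decode it without error and give different text.

```python
data = bytes.fromhex("4A FC 72 67 65 6E")  # hypothetical legacy bytes

try:
    data.decode("utf-8", errors="strict")
except UnicodeDecodeError:
    print("not valid UTF-8")

# Every candidate decodes cleanly, so success alone does not identify the encoding.
previews = {encoding: data.decode(encoding) for encoding in ("cp1252", "cp1251", "koi8_r")}
for encoding, text in previews.items():
    print(encoding, text)
```

Only provenance or a known expected value can say which preview is right.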

Ignoring irreversible replacement

If an earlier system replaced unknown bytes with U+FFFD or question marks and discarded the source, the missing information may not be recoverable.
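
The loss can be shown with two different hypothetical byte sequences that end up as the same text once a lenient decoder has replaced the invalid byte. This Python sketch uses `errors="replace"` to imitate such a decoder.

```python
first = b"J\xfcrgen".decode("utf-8", errors="replace")
second = b"J\xe9rgen".decode("utf-8", errors="replace")

print(first == second)        # True: both are J + U+FFFD + rgen
print(first.encode("utf-8"))  # b'J\xef\xbf\xbdrgen', the original byte is gone
```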

FAQ

Can the detector fix mojibake automatically?

No. It classifies source byte evidence. Recovery still requires choosing and verifying the correct decoder, then writing a new file under an explicit target encoding.

Why is JÃ¼rgen a clue for UTF-8 read as Windows-1252?

The UTF-8 bytes C3 BC for ü map to the two Windows-1252 characters Ã (C3) and ¼ (BC), which display together as Ã¼. It is a useful clue, but the original bytes and source context should confirm it.

Can I recover text after U+FFFD appears?

Only if the original bytes or another lossless copy still exists. U+FFFD signals that a decoder replaced an invalid sequence; it does not retain the discarded byte value.

When should I normalize to NFC?

After correct decoding, and only when the destination's comparison, search, or storage contract expects NFC. Keep normalization separate from charset recovery.

Should the repaired file always be UTF-8 without BOM?

No. UTF-8 is a common target, but the receiving protocol decides whether a BOM is allowed and whether another encoding is required.

How do I verify a bulk repair safely?

Use a fixed sample with known multilingual values, compare row and field counts, check replacement characters, hash the source, and make the output a new file rather than overwriting the original.
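
Part of that checklist can be automated. The Python sketch below is one possible shape, assuming CSV input and hypothetical helper names `table_shape` and `verify_repair`; it compares row and field counts, counts replacement characters, and looks for known values, and it returns a list of problems instead of modifying any file.

```python
import csv
import io


def table_shape(text: str) -> list[int]:
    """Return the field count of every CSV row in text."""
    return [len(row) for row in csv.reader(io.StringIO(text, newline=""))]


def verify_repair(source: bytes, source_encoding: str,
                  repaired: bytes, target_encoding: str,
                  known_values: list[str]) -> list[str]:
    """Compare a repaired file against its source and report any problems."""
    problems = []
    before = source.decode(source_encoding, errors="strict")
    after = repaired.decode(target_encoding, errors="strict")
    if table_shape(before) != table_shape(after):
        problems.append("row or field counts changed")
    replaced = after.count("\ufffd")
    if replaced:
        problems.append(f"{replaced} replacement characters in output")
    for value in known_values:
        if value not in after:
            problems.append(f"known value missing: {value!r}")
    return problems
```

An empty list means the sample passed these checks only; it does not replace a small import into the real destination.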
