
Fix garbled text encoding before import

Diagnose mojibake, broken accents, garbled CJK text or symbols, and corrupted CSV imports by checking the file's encoding, its BOM, and its Unicode normalization form.

Problem

Garbled text usually appears after a file is decoded with the wrong character set, imported with an unexpected BOM, or compared before Unicode normalization. If you edit the file first, the original byte clues may disappear and the broken import becomes harder to diagnose.

When to use this

  • A CSV, log, translation file, or customer export shows mojibake after upload or import.
  • Names, accents, Korean, Japanese, Chinese, emojis, or symbols look correct in one app but broken in another.
  • You need to decide whether to remove a BOM, normalize Unicode, or reopen a file with a different encoding.

Steps

  1. Check the original file first

    Open the untouched source file in the character set detector before saving it in an editor. This preserves the encoding, BOM, and byte-pattern clues.

  2. Compare the detected encoding with the importer

    If the detector reports UTF-8, UTF-16, ASCII, or a BOM-marked file, verify that the spreadsheet, database, or ETL importer is configured for the same encoding.

  3. Remove the BOM only when the target mishandles it

    If the first header or field starts with a hidden marker (U+FEFF, which often shows up as ï»¿ when a UTF-8 BOM is read as Windows-1252 or Latin-1), remove the BOM before import. If the destination expects BOM-marked UTF-8 or UTF-16, keep it.

  4. Normalize Unicode for comparison bugs

    When text looks identical but search, equality checks, or filename lookups fail, normalize both sides to the same form: NFC for general web and database use, or NFD where a macOS file workflow expects decomposed names.

  5. Preview a small import before bulk processing

    After cleaning, import a small sample and verify headers, multilingual rows, symbols, and delimiters before processing the full file.
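The preview in the last step can be sketched in Python. This is a minimal illustration, not a specific importer: the helper name `preview_rows` and the sample rows are hypothetical, and it assumes you already know which encoding to try. A strict decode fails loudly when the chosen encoding cannot represent the bytes, which is usually what you want during a preview.

```python
import csv
import io

def preview_rows(raw, encoding, limit=5):
    """Decode raw CSV bytes and return at most `limit` parsed rows."""
    # Strict decoding (the default) raises UnicodeDecodeError on invalid bytes
    # instead of silently inserting replacement characters.
    text = raw.decode(encoding)
    reader = csv.reader(io.StringIO(text, newline=""))
    rows = []
    for row in reader:
        rows.append(row)
        if len(rows) >= limit:
            break
    return rows

# Hypothetical sample with a header, accented names, and Korean text.
sample = "id,name,city\r\n1,Jürgen,São Paulo\r\n2,지민,서울\r\n".encode("utf-8")

for row in preview_rows(sample, "utf-8"):
    print(row)
```

Check the printed rows for the same things the step lists: intact headers, readable multilingual values, and fields split on the expected delimiter.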

Example

Diagnose a garbled customer CSV

Input

customers.csv
Preview after import: JÃ¼rgen, SÃ£o Paulo, ì„œìš¸
First bytes: EF BB BF 69 64 2C 6E 61 6D 65

Output

Detected: UTF-8 with BOM
Likely issue: the importer decoded the UTF-8 bytes with a different encoding, or treated the BOM as part of the first header
Next step: import as UTF-8, and remove the BOM only if the first header still starts with a hidden marker afterwards.
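The BOM check in this example can be reproduced with Python's standard `codecs` constants. The sketch below is only an illustration of reading the first bytes; the function name `describe_bom` and its labels are made up for this page, and a missing BOM does not by itself tell you the encoding.

```python
import codecs

# Longer signatures come first: the UTF-32 LE BOM starts with the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32 LE with BOM"),
    (codecs.BOM_UTF32_BE, "UTF-32 BE with BOM"),
    (codecs.BOM_UTF8, "UTF-8 with BOM"),
    (codecs.BOM_UTF16_LE, "UTF-16 LE with BOM"),
    (codecs.BOM_UTF16_BE, "UTF-16 BE with BOM"),
]

def describe_bom(head):
    """Return a label for the BOM at the start of `head`, if any."""
    for signature, label in BOMS:
        if head.startswith(signature):
            return label
    return "no BOM found"

# The first bytes from the example above.
head = bytes.fromhex("EF BB BF 69 64 2C 6E 61 6D 65")

print(describe_bom(head))        # UTF-8 with BOM
print(head[3:].decode("ascii"))  # id,name
```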

Common mistakes

Saving the file before detection

Editors may rewrite encodings and line endings on save. Detect the original file before modifying it.
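One way to notice an accidental rewrite is to record a fingerprint of the raw bytes before anyone opens the file in an editor. The sketch below is an assumption about how you might do that, not part of any particular tool: `fingerprint` is a hypothetical helper, and the "resaved" bytes simulate an editor that dropped the BOM and converted CRLF to LF.

```python
import hashlib

def fingerprint(raw):
    """Summarize raw bytes so a later copy can be compared against them."""
    crlf = raw.count(b"\r\n")
    return {
        "sha256": hashlib.sha256(raw).hexdigest(),
        "size": len(raw),
        "crlf": crlf,
        "lf_only": raw.count(b"\n") - crlf,
    }

# Simulated original: UTF-8 BOM, CRLF line endings, one accented name.
original = b"\xef\xbb\xbfid,name\r\n1,J\xc3\xbcrgen\r\n"
# Simulated editor save: BOM removed and line endings converted to LF.
resaved = original[3:].replace(b"\r\n", b"\n")

before = fingerprint(original)
after = fingerprint(resaved)

print(before["sha256"] == after["sha256"])  # False
print(before["crlf"], after["crlf"])        # 2 0
```

With a real file, the bytes would come from reading it in binary mode, for example `open(path, "rb").read()`, so that nothing is decoded or rewritten along the way.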

Removing every BOM automatically

Some importers handle a BOM correctly. Remove it only when the destination treats it as data or rejects the file.
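The difference between an importer that handles the BOM and one that treats it as data can be seen with Python's two UTF-8 codecs. This is a small sketch with made-up sample data; `strip_utf8_bom` is a hypothetical helper for the case where the destination cannot be configured.

```python
import codecs

raw = codecs.BOM_UTF8 + "id,name\n1,Jürgen\n".encode("utf-8")

# Plain "utf-8" keeps the BOM as U+FEFF at the start of the first header.
header_plain = raw.decode("utf-8").splitlines()[0].split(",")[0]
# "utf-8-sig" drops a leading BOM if present and otherwise behaves like "utf-8".
header_sig = raw.decode("utf-8-sig").splitlines()[0].split(",")[0]

print(repr(header_plain))  # '\ufeffid'
print(repr(header_sig))    # 'id'

def strip_utf8_bom(data):
    """Return data without a leading UTF-8 BOM; all other bytes are untouched."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

A lookup such as `row["id"]` fails in the first case because the real key is `"\ufeffid"`, which is the typical symptom of a BOM treated as data.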

Ignoring Unicode normalization

Encoding can be correct while visually identical text still fails comparisons because NFC and NFD forms differ.
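A short Python sketch shows the effect. Both strings below render as the same word, but one stores the accented letter as a single code point and the other as a base letter plus a combining mark. The helper `same_text` is hypothetical and defaults to NFC as an assumption about the target system.

```python
import unicodedata

composed = "S\u00e3o"     # "ã" as one code point (NFC form)
decomposed = "Sa\u0303o"  # "a" followed by a combining tilde (NFD form)

print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 3 4

def same_text(a, b, form="NFC"):
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

print(same_text(composed, decomposed))  # True
```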

FAQ

What is mojibake?

Mojibake is garbled text caused when bytes are decoded with the wrong character encoding, such as UTF-8 text read as a legacy single-byte encoding.
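You can reproduce mojibake on purpose to see where it comes from. The sketch below assumes the wrong decoder was Windows-1252 (`cp1252` in Python), which is a common case but only one possibility; other legacy encodings produce different garbage from the same bytes.

```python
original = "Jürgen, São Paulo, 서울"

# The bytes on disk are valid UTF-8.
utf8_bytes = original.encode("utf-8")

# Decoding those bytes as Windows-1252 turns every byte into its own character,
# so each multi-byte UTF-8 sequence becomes two or three unrelated symbols.
garbled = utf8_bytes.decode("cp1252")

print(garbled)
print(len(original), len(garbled))  # 21 27
```

The garbled string is longer than the original because each accented letter expands to two characters and each Korean syllable to three.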

Should I always convert files to UTF-8?

UTF-8 is usually the safest target for web and database workflows, but first confirm what the destination system expects and whether it accepts a BOM.
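When conversion is the right call, it can be sketched as a decode with the known source encoding followed by a UTF-8 encode. The function `to_utf8` and its `keep_bom` flag are hypothetical, and the sketch assumes the source encoding has already been confirmed; guessing it here would only move the mojibake into the converted file.

```python
import codecs

def to_utf8(raw, source_encoding, keep_bom=False):
    """Re-encode raw bytes as UTF-8, optionally with a UTF-8 BOM."""
    text = raw.decode(source_encoding)
    # Drop a BOM that survived decoding so it is never written twice.
    if text.startswith("\ufeff"):
        text = text[1:]
    body = text.encode("utf-8")
    return codecs.BOM_UTF8 + body if keep_bom else body

# Hypothetical legacy export saved as Windows-1252.
legacy = "São Paulo".encode("cp1252")

print(to_utf8(legacy, "cp1252"))                 # b'S\xc3\xa3o Paulo'
print(to_utf8(legacy, "cp1252", keep_bom=True))  # b'\xef\xbb\xbfS\xc3\xa3o Paulo'
```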

Can Unicode normalization fix mojibake?

No. Normalization only reconciles equivalent Unicode representations of text that was already decoded correctly. Mojibake must be fixed by decoding the source bytes again with the correct encoding.
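The difference is easy to check in Python. In this sketch the garbled string is assumed to come from UTF-8 bytes that were decoded as Windows-1252; the repair only works when that guess is right and no bytes were lost along the way, so going back to the original file is still the safer route.

```python
import unicodedata

# "Jürgen" after its UTF-8 bytes were decoded as Windows-1252.
garbled = "J\u00c3\u00bcrgen"

# Normalization leaves the wrong characters exactly as they are.
normalized = unicodedata.normalize("NFC", garbled)
print(normalized == garbled)  # True

# Reversing the wrong decode recovers the bytes, which then decode as UTF-8.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired)  # Jürgen
```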