Fix garbled text encoding before import
Diagnose mojibake, broken accents, garbled CJK text and symbols, and CSV import corruption by checking the encoding, BOM markers, and Unicode normalization.
Problem
Garbled text usually appears after a file is decoded with the wrong character set, imported with an unexpected BOM, or compared before Unicode normalization. If you edit the file first, the original byte clues may disappear and the broken import becomes harder to diagnose.
When to use this
- A CSV, log, translation file, or customer export shows mojibake after upload or import.
- Names, accents, Korean, Japanese, Chinese, emojis, or symbols look correct in one app but broken in another.
- You need to decide whether to remove a BOM, normalize Unicode, or reopen a file with a different encoding.
Steps
- Step 1
Check the original file first
Open the untouched source file in the character set detector before saving it in an editor. This preserves the encoding, BOM, and byte-pattern clues.
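As a rough sketch of what checking the original first can look like outside a detector tool, the Python below opens a file read-only in binary mode and reports its leading bytes and any BOM. The helper names `sniff_bom` and `inspect_file` are illustrative, not part of any library. A missing BOM does not identify the encoding by itself.

```python
import codecs
from typing import Optional, Tuple

# BOM signatures, longest first so UTF-32 LE is not mistaken for UTF-16 LE
# (the UTF-32 LE signature begins with the UTF-16 LE signature).
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32 LE"),
    (codecs.BOM_UTF32_BE, "UTF-32 BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),
]


def sniff_bom(head: bytes) -> Optional[str]:
    """Return the name of the BOM that starts `head`, or None if there is none."""
    for signature, name in BOMS:
        if head.startswith(signature):
            return name
    return None


def inspect_file(path: str, count: int = 16) -> Tuple[str, Optional[str]]:
    """Read the first `count` bytes without modifying the file."""
    with open(path, "rb") as handle:  # binary mode: nothing is decoded or rewritten
        head = handle.read(count)
    return head.hex(" ").upper(), sniff_bom(head)


# The leading bytes from the customers.csv example in this guide.
example = bytes.fromhex("EF BB BF 69 64 2C 6E 61 6D 65")
print(sniff_bom(example))  # UTF-8
```

Because the file is opened with `"rb"` and never written, the original bytes stay available for later steps.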
- Step 2
Compare the detected encoding with the importer
Whatever the detector reports (UTF-8, UTF-16, ASCII, with or without a BOM), verify that the spreadsheet, database, or ETL importer is configured for the same encoding.
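One way to sanity-check a reported encoding against the importer's setting is to attempt a strict decode of the raw bytes under each candidate. This is a minimal sketch: `strict_decodes` is a hypothetical helper, and the candidate list is an assumption you would replace with the encodings your importer offers.

```python
from typing import Dict, Iterable


def strict_decodes(raw: bytes, candidates: Iterable[str]) -> Dict[str, bool]:
    """Report, per candidate encoding, whether `raw` decodes without errors."""
    results = {}
    for encoding in candidates:
        try:
            raw.decode(encoding, errors="strict")
            results[encoding] = True
        except UnicodeDecodeError:
            results[encoding] = False
    return results


utf8_bytes = "Jürgen".encode("utf-8")
print(strict_decodes(utf8_bytes, ["ascii", "utf-8", "cp1252"]))

legacy_bytes = "São Paulo".encode("cp1252")
print(strict_decodes(legacy_bytes, ["ascii", "utf-8", "cp1252"]))
```

A successful decode is only a necessary condition. Single-byte encodings such as Windows-1252 accept most byte sequences, so they will often "succeed" on UTF-8 data and produce mojibake. A failed strict UTF-8 decode, on the other hand, is strong evidence that the file is not UTF-8.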
- Step 3
Remove BOM only when the target mishandles it
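If the first header or field starts with a hidden marker (the BOM character U+FEFF, which shows up as ï»¿ when UTF-8 is misread as Windows-1252), the target is treating the BOM as data, so remove it before import. If the destination expects BOM-marked UTF-8 or UTF-16, keep it.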
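The difference between keeping and dropping a UTF-8 BOM can be sketched in a few lines of Python. The function name `strip_utf8_bom` is illustrative; the `utf-8-sig` codec is part of the Python standard library and skips a leading BOM when decoding.

```python
import codecs


def strip_utf8_bom(raw: bytes) -> bytes:
    """Return `raw` without a leading UTF-8 BOM; other bytes are left untouched."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw


# The leading bytes from the customers.csv example in this guide.
raw = bytes.fromhex("EF BB BF 69 64 2C 6E 61 6D 65")

# Plain "utf-8" keeps the BOM as U+FEFF, so the first header is not "id".
print(repr(raw.decode("utf-8")))      # '\ufeffid,name'

# "utf-8-sig" drops a leading BOM if present and is harmless if there is none.
print(repr(raw.decode("utf-8-sig")))  # 'id,name'

print(strip_utf8_bom(raw))            # b'id,name'
```

Work on a copy when stripping bytes, so the original file remains available if the destination turns out to need the BOM.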
- Step 4
Normalize Unicode for comparison bugs
When text looks identical but search, equality checks, or filenames fail, normalize both sides to the same form: NFC (composed) for general web/database use, or NFD (decomposed) for macOS-specific file workflows.
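A comparison bug of this kind can be reproduced with Python's standard `unicodedata` module. In this sketch, `same_text` is a hypothetical helper, and the sample strings are two representations of the same name: one with a precomposed ü, one with u followed by a combining diaeresis.

```python
import unicodedata


def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after normalizing both to the same Unicode form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


composed = "J\u00fcrgen"     # ü as a single code point (NFC form)
decomposed = "Ju\u0308rgen"  # u + combining diaeresis (NFD form)

print(composed == decomposed)           # False: the code points differ
print(len(composed), len(decomposed))   # 6 7
print(same_text(composed, decomposed))  # True: equal once both are NFC
```

Normalizing at the boundary (on import or before writing to the database) tends to be less error-prone than normalizing at every comparison site.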
- Step 5
Preview a small import before bulk processing
After cleaning, import a small sample and verify headers, multilingual rows, symbols, and delimiters before processing the full file.
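A small preview can be sketched with the standard `csv` module. The helpers `preview_rows` and `preview_problems` are illustrative names, the checks are a starting point rather than a complete validator, and the sample data is made up from the values used in this guide's example.

```python
import csv
import io
from itertools import islice


def preview_rows(stream, limit=5, delimiter=","):
    """Parse at most `limit` rows from an already-decoded text stream."""
    return list(islice(csv.reader(stream, delimiter=delimiter), limit))


def preview_problems(rows):
    """Return human-readable warnings for a previewed sample."""
    problems = []
    if rows and rows[0] and rows[0][0].startswith("\ufeff"):
        problems.append("first header starts with a BOM (U+FEFF)")
    if len({len(row) for row in rows}) > 1:
        problems.append("rows have differing column counts; check the delimiter")
    if any("\ufffd" in cell for row in rows for cell in row):
        problems.append("replacement character found; bytes were decoded lossily")
    return problems


# Made-up sample: BOM left in the header, and a short third row.
sample = io.StringIO("\ufeffid,name,city\n1,Jürgen,São Paulo\n2,서울\n")
rows = preview_rows(sample, limit=3)
print(rows)
print(preview_problems(rows))
```

For a real file, the stream would come from something like `open(path, encoding="utf-8", newline="")`, with the encoding chosen in the earlier steps.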
Example
Diagnose a garbled customer CSV
Input
customers.csv
Preview after import: Jürgen, São Paulo, 서울
First bytes: EF BB BF 69 64 2C 6E 61 6D 65
Output
Detected: UTF-8 with BOM
Likely issue: importer decoded UTF-8 bytes incorrectly or treated BOM as part of the first header
Next step: import as UTF-8, and remove the BOM only if the first header arrives prefixed with a hidden marker.
Common mistakes
Saving the file before detection
Editors may rewrite encodings and line endings on save. Detect the original file before modifying it.
Removing every BOM automatically
Some importers handle BOM correctly. Remove it only when the destination treats it as data or rejects the file.
Ignoring Unicode normalization
Encoding can be correct while visually identical text still fails comparisons because NFC and NFD forms differ.
FAQ
What is mojibake?
Mojibake is garbled text caused when bytes are decoded with the wrong character encoding, such as UTF-8 text read as a legacy single-byte encoding.
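The effect is easy to reproduce. The sketch below assumes Windows-1252 (`cp1252` in Python) as the legacy single-byte encoding; other legacy encodings produce different garbage from the same bytes.

```python
text = "Jürgen, São Paulo"

# Correct bytes, wrong decoder: each two-byte UTF-8 sequence
# becomes two unrelated Windows-1252 characters.
garbled = text.encode("utf-8").decode("cp1252")

print(garbled)                   # JÃ¼rgen, SÃ£o Paulo
print(len(text), len(garbled))   # 17 19
```

The garbled string is longer than the original because every non-ASCII character was stored as more than one byte and each byte was decoded as its own character.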
Should I always convert files to UTF-8?
UTF-8 is usually the safest target for web and database workflows, but first confirm what the destination system expects and whether it allows BOM markers.
Can Unicode normalization fix mojibake?
No. Normalization fixes equivalent Unicode representations after text is decoded correctly. Mojibake must be fixed by using the correct encoding or source bytes.
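To illustrate the difference, the sketch below applies every normalization form to a mojibake string and then reverses the wrong decode instead. `repair_mojibake` is a hypothetical helper; it only works when you know (or correctly guess) which wrong encoding was applied and when that decode lost no bytes. Going back to the source bytes remains the more reliable fix.

```python
import unicodedata


def repair_mojibake(garbled: str, wrong: str = "cp1252", right: str = "utf-8") -> str:
    """Undo a wrong decode by re-encoding with `wrong` and decoding with `right`.

    Raises UnicodeError if the assumed encodings do not fit the text.
    """
    return garbled.encode(wrong).decode(right)


original = "J\u00fcrgen"
garbled = original.encode("utf-8").decode("cp1252")  # simulate the bad import

# No normalization form recovers the original text.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, unicodedata.normalize(form, garbled) == original)  # always False

# Reversing the wrong decode does.
print(repair_mojibake(garbled) == original)  # True
```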