Document Workflow

Remove a BOM from a CSV or text file without re-encoding it

Inspect exact file bytes, remove one confirmed UTF-8, UTF-16, or UTF-32 BOM, preserve the remaining stream, and verify the result against the destination encoding contract.

Written and tested by SimpleWebUtils. Published: May 21, 2026. Reviewed: July 18, 2026.

How this workflow was checked

For “Repair a UTF-8 BOM CSV before database import”, we entered the documented fixture in Remove BOM from UTF-8, UTF-16, and UTF-32 Files and followed “Inspect file bytes instead of opening and resaving text” before “Resolve incomplete or missing signatures conservatively”. We compared the browser result with the stated output, then reviewed “Using decoded text mode when the original file exists” and “Removing FF FE before checking for UTF-32 LE” as separate failure boundaries.

Removing exactly EF BB BF reduced the file from 34 to 31 bytes and left 69 64 (id) as the first bytes of the header, without decoding or rewriting the rest of the CSV.

Problem

A Byte Order Mark is a sequence of bytes at the start of a Unicode text stream. UTF-8 uses EF BB BF as an optional encoding signature; UTF-16 and UTF-32 signatures can also identify whether multi-byte code units are serialized in big-endian or little-endian order. Some CSV importers, command interpreters, config parsers, and data pipelines expose those bytes as part of the first field or reject the required first token. The visible text may look normal while the importer compares a hidden-marker-prefixed header with id, or while a script expects #! at byte zero. Repair must stay at the byte boundary: decoding a UTF-16 or UTF-32 file and saving it as UTF-8 is a conversion, not BOM removal. Removing a UTF-16 or UTF-32 BOM without a separate encoding label can also erase the only byte-order signal. Keep the source, verify the target contract, remove only a complete supported signature at byte zero, and validate the downloaded artifact rather than trusting the editor preview.
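
The symptom described above can be reproduced in a few lines. This is a minimal Python sketch with a hypothetical in-memory sample, not the tool's own code. It decodes the same bytes twice: plain utf-8 keeps the signature as U+FEFF in the first field, while the standard-library utf-8-sig codec consumes one leading signature. Both calls decode the data, so this illustrates the failure only; it is not the byte-level repair.

```python
import csv
import io

# Hypothetical sample: UTF-8 signature followed by a two-column CSV.
raw = b"\xef\xbb\xbfid,name\n1,Ada\n"

def first_header(data: bytes, encoding: str) -> str:
    """Decode the bytes and return the first header field as a parser sees it."""
    reader = csv.reader(io.StringIO(data.decode(encoding), newline=""))
    return next(reader)[0]

with_marker = first_header(raw, "utf-8")         # plain UTF-8 keeps U+FEFF
without_marker = first_header(raw, "utf-8-sig")  # consumes one leading signature

print(len(with_marker))        # 3: U+FEFF, 'i', 'd'
print(with_marker == "id")     # False
print(without_marker == "id")  # True
```

The first field prints like id in most terminals, which is why the mismatch is easy to miss without a byte-level view.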

Sources and standards

These authoritative references define the formats or security boundaries used in this workflow. Tool-specific verification is documented separately above.

Encoding Standard
WHATWG

When to use this

A CSV importer reports an unknown first column even though the visible header appears to be id, name, or another expected field.
A UTF-8 script, JSON document, XML file, or config fails because the parser requires an ASCII token at byte zero.
A receiving interface explicitly requires UTF-8 without BOM or a named UTF-16LE, UTF-16BE, UTF-32LE, or UTF-32BE stream without a signature.
You need evidence of the exact leading bytes before and after a file-import repair rather than a text editor's encoding guess.
A copied string begins with a decoded U+FEFF and the original file bytes are no longer available, so an explicitly UTF-8 text result is acceptable.

Steps

Step 1
Keep the original and record the destination contract
Duplicate the source file before changing it. Record whether the receiving application expects UTF-8 without BOM, a specifically labelled LE or BE stream, or a BOM-marked Unicode file. Do not infer the requirement from one failed import.
Step 2
Inspect file bytes instead of opening and resaving text
Use File bytes mode with the original file. This reads an ArrayBuffer and can distinguish EF BB BF, FE FF, FF FE, 00 00 FE FF, and FF FE 00 00 without letting an editor decode or normalize the content first.
Step 3
Run the review manually
Select the file and start the inspection. File selection alone must not remove data. Confirm the detected encoding label, signature length, input size, and the before/after leading-byte rows.
Step 4
Resolve incomplete or missing signatures conservatively
If no signature is found, do not treat that as proof of UTF-8. If only part of a signature is present, such as EF BB, preserve every byte and investigate a truncated transfer or non-text file instead of guessing the missing byte.
Step 5
Stop before removing UTF-16 or UTF-32 byte order blindly
For UTF-16 or UTF-32, verify that the target already knows LE or BE from a charset label, schema, protocol, or import setting. If it does not, keep the BOM or convert the file with a deliberate encoding tool and a separate round-trip review.
Step 6
Download and verify the exact byte change
A successful UTF-8 repair should be exactly 3 bytes smaller; UTF-16 should be 2 bytes smaller; UTF-32 should be 4 bytes smaller. The first bytes after removal must equal the original bytes immediately after the signature, and every later byte must remain in the same order.
Step 7
Retry the real parser and check adjacent failures
Import a small copy and verify the first header, first data row, and expected column mapping. If a script still fails, inspect CRLF/LF endings. If characters are garbled, perform a separate charset investigation rather than repeatedly stripping bytes.
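
Steps 2 to 6 can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's implementation: the names SIGNATURES, Review, review_bom, verify, and the order_labelled flag are hypothetical. The sketch checks longer signatures first, refuses to remove a UTF-16 or UTF-32 signature unless the caller states that the byte order is known elsewhere, and verifies that the output is the original minus exactly the signature.

```python
from typing import NamedTuple, Optional

# Longest signatures first, so FF FE 00 00 is tested before FF FE.
SIGNATURES = (
    ("UTF-32 BE", b"\x00\x00\xfe\xff"),
    ("UTF-32 LE", b"\xff\xfe\x00\x00"),
    ("UTF-8", b"\xef\xbb\xbf"),
    ("UTF-16 BE", b"\xfe\xff"),
    ("UTF-16 LE", b"\xff\xfe"),
)

class Review(NamedTuple):
    label: Optional[str]  # None when no complete signature sits at byte zero
    removed: int          # bytes removed: 0, 2, 3, or 4
    output: bytes

def review_bom(data: bytes, order_labelled: bool = False) -> Review:
    """Remove one complete signature at byte zero, or change nothing."""
    for label, sig in SIGNATURES:
        if data.startswith(sig):
            if label != "UTF-8" and not order_labelled:
                # Step 5: keep the signature unless LE/BE is known elsewhere.
                return Review(label, 0, data)
            return Review(label, len(sig), data[len(sig):])
    return Review(None, 0, data)

def verify(original: bytes, review: Review) -> bool:
    """Step 6: the size delta and every remaining byte must match."""
    return (
        len(original) - len(review.output) == review.removed
        and original[review.removed:] == review.output
    )

sample = b"\xff\xfe\x00\x00i\x00\x00\x00"  # hypothetical UTF-32 LE "i"
kept = review_bom(sample)
stripped = review_bom(sample, order_labelled=True)
print(kept.label, kept.removed)          # UTF-32 LE 0
print(stripped.label, stripped.removed)  # UTF-32 LE 4
print(verify(sample, stripped))          # True
```

A label of None means only that no signature was found; as Step 4 notes, it says nothing about the file's actual encoding.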

Example

Repair a UTF-8 BOM CSV before database import

Input

File: customers.csv
Input size: 34 bytes
Leading bytes: EF BB BF 69 64 2C 6E 61 6D 65
Visible header: id,name

Output

Detected: UTF-8 BOM (EF BB BF)
Output size: 31 bytes
Leading bytes: 69 64 2C 6E 61 6D 65
Verification: the importer maps the first field to id
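
The byte arithmetic in this example can be checked with a short Python sketch. Only the ten leading bytes and the 34-byte size come from the example; the two data rows in the fixture below are invented to reach that size, and hex_row is a hypothetical helper.

```python
# Hypothetical fixture: the rows after the header are invented so that the
# total is 34 bytes, matching the documented input size.
source = b"\xef\xbb\xbfid,name\n1,Ada Lovelace\n2,Grace\n"

UTF8_SIGNATURE = b"\xef\xbb\xbf"

def hex_row(data: bytes, count: int = 10) -> str:
    """Format the first `count` bytes as space-separated uppercase hex."""
    return data[:count].hex(" ").upper()

# Slice off the signature only after confirming it is present at byte zero.
assert source.startswith(UTF8_SIGNATURE)
repaired = source[len(UTF8_SIGNATURE):]

print(len(source), hex_row(source))         # 34 EF BB BF 69 64 2C 6E 61 6D 65
print(len(repaired), hex_row(repaired, 7))  # 31 69 64 2C 6E 61 6D 65
print(source.endswith(repaired))            # True: later bytes are unchanged
```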

Common mistakes

Using decoded text mode when the original file exists

A text box holds decoded characters, not the source file's bytes. Use File bytes mode when byte preservation matters; decoded-text output is always UTF-8.

Removing FF FE before checking for UTF-32 LE

FF FE 00 00 is a four-byte UTF-32 LE signature. A detector that stops at FF FE removes only half the marker and corrupts the intended result.
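
The ordering problem is easy to demonstrate. The Python sketch below uses a hypothetical input and two hypothetical functions that differ only in check order. It follows this page's treatment of FF FE 00 00 as UTF-32 LE; the same four bytes could in principle be a UTF-16 LE signature followed by U+0000, which is one more reason to confirm the encoding with the destination contract.

```python
# Hypothetical input: UTF-32 LE signature followed by the letter "i".
data = b"\xff\xfe\x00\x00i\x00\x00\x00"

def strip_short_first(buf: bytes) -> bytes:
    """Flawed order: the two-byte check shadows the four-byte signature."""
    if buf.startswith(b"\xff\xfe"):
        return buf[2:]
    if buf.startswith(b"\xff\xfe\x00\x00"):
        return buf[4:]  # never reached: the branch above always matches first
    return buf

def strip_long_first(buf: bytes) -> bytes:
    """Check the four-byte signature before its two-byte prefix."""
    if buf.startswith(b"\xff\xfe\x00\x00"):
        return buf[4:]
    if buf.startswith(b"\xff\xfe"):
        return buf[2:]
    return buf

print(strip_short_first(data).hex(" "))            # 00 00 69 00 00 00
print(strip_long_first(data).hex(" "))             # 69 00 00 00
print(strip_long_first(data).decode("utf-32-le"))  # i
```

The flawed result is six bytes long, which is not a whole number of four-byte UTF-32 code units, so a strict UTF-32 decoder would reject it.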

Deleting U+FEFF away from the beginning

An interior U+FEFF is not a BOM: it does not sit at byte zero, and it may be content or a legacy zero-width no-break space used as a word joiner. Preserve it for code-point-level review.
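
For decoded text, the difference between removing one leading marker and deleting every U+FEFF can be sketched in Python. The function names and the sample string are hypothetical; the point is that a global replace would also delete the interior code point.

```python
from typing import List

BOM_CHAR = "\ufeff"

def strip_leading_marker(text: str) -> str:
    """Remove at most one U+FEFF at index 0; leave every other occurrence."""
    return text[1:] if text.startswith(BOM_CHAR) else text

def marker_positions(text: str) -> List[int]:
    """List the indices of every U+FEFF so interior ones can be reviewed."""
    return [i for i, ch in enumerate(text) if ch == BOM_CHAR]

# Hypothetical sample with one leading and one interior U+FEFF.
sample = "\ufeffid,name\n1,A\ufeffda\n"
cleaned = strip_leading_marker(sample)

print(marker_positions(sample))   # [0, 12]
print(marker_positions(cleaned))  # [11]
```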

Assuming BOM removal fixes mojibake

Garbled characters usually indicate a wrong decoder, not an unwanted leading signature. Detect or confirm the charset separately.

Replacing the source before the import succeeds

Keep the untouched export until the target accepts the repaired copy and field values, row counts, delimiters, and line endings have been verified.

FAQ

Why can a BOM change only the first CSV column?

The signature appears before the first header byte. A parser that does not consume it may include the decoded marker in the first field name while every later field remains normal.

Is a UTF-8 BOM invalid?

No. It is an optional signature and has no byte-order purpose in UTF-8. Whether it should be present depends on the file format or receiving protocol.

Why does the workflow warn for UTF-16 and UTF-32?

Their BOM can identify LE or BE serialization. Removing it from an otherwise unlabelled stream can erase information a decoder needs.

Can a file without a BOM still be UTF-16 or UTF-32?

Yes. A charset label, protocol, schema, or application setting can define the byte order without a BOM. Absence of a signature is not a charset verdict.

What happens when the file starts with only part of a BOM?

The tool reports a possible truncated signature and removes nothing. Repair requires knowing whether bytes were lost; guessing can destroy legitimate input.
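
One way to flag a possible truncated signature is sketched below in Python. The rule used here, a match on all but the last byte of a three-byte or four-byte signature, is an assumption chosen for illustration and is not necessarily how the tool decides. Complete signatures are checked first, and the function only classifies; it never removes bytes.

```python
# Longest signatures first, so a complete match is never misread as partial.
FULL = (
    ("UTF-32 BE", b"\x00\x00\xfe\xff"),
    ("UTF-32 LE", b"\xff\xfe\x00\x00"),
    ("UTF-8", b"\xef\xbb\xbf"),
    ("UTF-16 BE", b"\xfe\xff"),
    ("UTF-16 LE", b"\xff\xfe"),
)

def classify(data: bytes) -> str:
    """Report a complete signature, a possible truncated one, or none."""
    for label, sig in FULL:
        if data.startswith(sig):
            return f"complete {label}"
    # Heuristic (assumption): flag a signature missing only its last byte.
    # Two-byte signatures are skipped because a single FE or FF is too weak.
    for label, sig in FULL:
        head = sig[:-1]
        if len(head) >= 2 and data.startswith(head):
            return f"possible truncated {label}"
    return "none"

print(classify(b"\xef\xbb\xbfid,name"))  # complete UTF-8
print(classify(b"\xef\xbbid,name"))      # possible truncated UTF-8
print(classify(b"id,name"))              # none
```

A flag from this heuristic is a prompt for investigation only: EF BB followed by a different continuation byte is also the start of a valid UTF-8 character, so the bytes may be legitimate content.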

Are the selected bytes sent to a conversion server?

During the normal tool workflow, the browser reads and rewrites the selected byte array locally. Aggregate analytics excludes file names, file bytes, pasted content, and result content.
