RFC 2047: MIME Part 3 — Message Header Extensions for Non-ASCII Text
Why This Exists
RFC 2045 and RFC 2046 solved the problem of carrying non-ASCII content in message bodies with Content-Transfer-Encoding. But email headers — Subject, From display names, To display names, and others — are governed by RFC 5322, which restricts them to 7-bit US-ASCII.
This creates an obvious problem: billions of email users write in languages that require non-ASCII characters. Without RFC 2047, you could not send an email with:
- A Subject line in Chinese, Japanese, Korean, Arabic, Hebrew, or Thai
- A sender display name with accented characters (René, Müller, Björk)
- Any header field containing characters outside the ASCII range
RFC 2047 defines the encoded-word syntax: a compact way to embed non-ASCII text inside ASCII-only headers, readable by any MIME-aware mail client.
How It Works
The Encoded-Word Format
An encoded-word has this structure:
=?charset?encoding?encoded-text?=
The three components are:
| Component | Purpose | Values |
|---|---|---|
| charset | The character set of the original text | UTF-8, ISO-8859-1, ISO-2022-JP, etc. |
| encoding | How the text is encoded into ASCII | B (base64) or Q (quoted-printable variant) |
| encoded-text | The encoded representation | ASCII characters only |
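As a rough illustration of this structure, the following Python sketch splits an encoded-word into its three fields. The function name split_encoded_word is hypothetical, and the check is deliberately loose: it does not validate the charset token or the encoded text against the full RFC 2047 grammar.

```python
def split_encoded_word(word):
    """Return (charset, encoding, encoded_text), or None if the shape is wrong."""
    # An encoded-word starts with "=?" and ends with "?=".
    if not (word.startswith("=?") and word.endswith("?=")):
        return None
    # Between the delimiters there are exactly three fields separated by "?".
    fields = word[2:-2].split("?")
    if len(fields) != 3:
        return None
    charset, encoding, encoded_text = fields
    # The encoding letter is case-insensitive; only B and Q are defined.
    if encoding.upper() not in ("B", "Q"):
        return None
    return charset, encoding.upper(), encoded_text


parts = split_encoded_word("=?UTF-8?Q?Caf=C3=A9_menu?=")
```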
B Encoding (Base64)
Uses standard base64 encoding. Best for text that is heavily non-ASCII, such as CJK scripts:
    ; Subject: "Meeting confirmation" in Japanese
    Subject: =?UTF-8?B?5Lya6K2w44Gu56K66KqN?=
    ; From display name in Chinese
    From: =?UTF-8?B?5byg5LiJ?= <zhang@example.com>
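A minimal Python sketch of B encoding, assuming the whole text fits in a single encoded-word (length limits are not handled here). The helper name encode_b is hypothetical; only the standard base64 module is used.

```python
import base64


def encode_b(text, charset="utf-8"):
    """Wrap text in a single B-encoded word (no length check)."""
    raw = text.encode(charset)                       # text -> bytes in the charset
    payload = base64.b64encode(raw).decode("ascii")  # bytes -> base64 ASCII
    return f"=?{charset.upper()}?B?{payload}?="


subject = encode_b("会議の確認")
display_name = encode_b("张三")
```

The two results correspond to the Subject and From display name in the example above.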
Q Encoding (Quoted-Printable Variant)
A modified quoted-printable encoding optimized for headers. Like body QP, non-ASCII bytes become =XX hex pairs. Key difference: spaces are encoded as underscores (_):
    ; Subject: "Café menu" with accented e
    Subject: =?UTF-8?Q?Caf=C3=A9_menu?=
    ; From display name: "René Dupont"
    From: =?UTF-8?Q?Ren=C3=A9_Dupont?= <rene@example.com>
    ; Subject: "Grüße aus Berlin" (German greetings)
    Subject: =?UTF-8?Q?Gr=C3=BC=C3=9Fe_aus_Berlin?=
Q encoding is more human-readable when most of the text is ASCII with just a few non-ASCII characters. B encoding is more compact when most characters are non-ASCII.
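The Q rules can be sketched in a few lines of Python. This is an illustration rather than a complete encoder: the name encode_q and the SAFE set are assumptions of this sketch. SAFE is the conservative set of characters that RFC 2047 allows unencoded inside a display name (letters, digits, and `!*+-/`), so some characters that could legally stay literal in a Subject, such as `:`, are hex-encoded here as well.

```python
# Bytes left as-is; everything else becomes "_" (space) or =XX.
SAFE = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789!*+-/"
)


def encode_q(text, charset="utf-8"):
    """Wrap text in a single Q-encoded word (no length check)."""
    out = []
    for byte in text.encode(charset):
        if byte == 0x20:
            out.append("_")              # space -> underscore
        elif byte in SAFE:
            out.append(chr(byte))        # safe ASCII stays readable
        else:
            out.append(f"={byte:02X}")   # anything else -> =XX (uppercase hex)
    return f"=?{charset.upper()}?Q?{''.join(out)}?="


menu = encode_q("Café menu")
```

Because ASCII letters pass through unchanged, the result stays readable for mostly-ASCII text, which is the trade-off described above.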
Where Encoded-Words Can Appear
Encoded-words are allowed in specific positions within headers:
- Subject, Comments, Keywords: Anywhere text is expected (as a replacement for an atom or quoted-string).
- From, To, Cc, Bcc, Reply-To, Sender: Only in the display name portion, never in the email address itself.
- Content-Description: Allowed for describing MIME parts.
Encoded-words are not allowed inside quoted-strings, in the local-part or domain of an email address, or as parameter values in structured headers like Content-Type (use RFC 2231 for that).
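In Python, one way to respect the display-name-only rule is the standard library's email.utils.formataddr, which encodes a non-ASCII display name and leaves the address untouched. The sketch below checks the result by round-tripping it instead of asserting an exact string, since the choice between B and Q (and the letter case of the charset) is left to the library.

```python
from email.header import decode_header, make_header
from email.utils import formataddr, parseaddr

# Only the display name is encoded; the address stays plain ASCII.
header_value = formataddr(("René Dupont", "rene@example.com"), charset="utf-8")

# Round trip: split name and address, then decode the encoded-word(s).
raw_name, address = parseaddr(header_value)
display_name = str(make_header(decode_header(raw_name)))
```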
Key Technical Details
Length Limits
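Each encoded-word must not exceed 75 characters, counting the charset, the encoding letter, the encoded text, and the =? and ?= delimiters. If the text needs more room, it must be split into multiple encoded-words separated by folding whitespace (CRLF + space or tab). Each encoded-word must contain a whole number of characters, so a multi-byte character can never be divided between two encoded-words: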
    ; Long subject split across two encoded-words
    Subject: =?UTF-8?B?5LuK5pel44Gu5Lya6K2w44Gr44Gk44GE44Gm?=
     =?UTF-8?B?44GU5qGI5YaF44GE44Gf44GX44G+44GZ?=
When two adjacent encoded-words are separated only by linear whitespace, the whitespace between them is ignored during decoding. This allows seamless splitting of long text across multiple encoded-words.
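The splitting step might look like the following Python sketch. It assumes a stateless charset such as UTF-8 and uses B encoding only; the name encode_b_chunks is hypothetical. It does not account for the field name on the first line, so a caller still has to decide where to fold.

```python
import base64

MAX_ENCODED_WORD = 75  # RFC 2047 limit, delimiters included


def encode_b_chunks(text, charset="utf-8"):
    """Split text into B-encoded words of at most 75 characters each."""
    prefix, suffix = f"=?{charset.upper()}?B?", "?="
    budget = MAX_ENCODED_WORD - len(prefix) - len(suffix)  # room for base64 text
    max_bytes = budget // 4 * 3  # base64 maps every 3 bytes to 4 characters
    words, chunk = [], b""
    for char in text:
        encoded = char.encode(charset)
        # Start a new encoded-word before a character that would not fit,
        # so no multi-byte character is ever split.
        if chunk and len(chunk) + len(encoded) > max_bytes:
            words.append(prefix + base64.b64encode(chunk).decode("ascii") + suffix)
            chunk = b""
        chunk += encoded
    if chunk:
        words.append(prefix + base64.b64encode(chunk).decode("ascii") + suffix)
    return words


long_subject = "今日の会議についてご案内いたします" * 3
words = encode_b_chunks(long_subject)
folded = "\r\n ".join(words)  # one encoded-word per folded line
```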
Charset Selection
Always use UTF-8 for new messages. The other charsets exist for legacy reasons:
| Charset | Use Case | Recommendation |
|---|---|---|
| UTF-8 | Covers all Unicode characters | Always use this |
| ISO-8859-1 | Western European legacy | Do not use in new messages |
| ISO-2022-JP | Japanese legacy encoding | Still seen from some Japanese mail clients |
| GB2312 | Simplified Chinese legacy | Do not use in new messages |
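A small Python check illustrates why the legacy charsets are limiting: each can represent only part of Unicode, while UTF-8 covers all of it. The helper name representable is hypothetical.

```python
def representable(text, charset):
    """Return True if every character of text exists in the given charset."""
    try:
        text.encode(charset)
    except UnicodeEncodeError:
        return False
    return True


western_ok = representable("Müller", "iso-8859-1")
chinese_in_latin1 = representable("张三", "iso-8859-1")
chinese_in_utf8 = representable("张三", "utf-8")
```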
Interaction with Header Folding
RFC 5322 limits header lines to 998 characters and recommends keeping them to 78 or fewer, not counting the CRLF. RFC 2047 is stricter for any line that contains an encoded-word: 76 characters at most. Encoded-words interact with folding: you can break between encoded-words at whitespace boundaries, but you must never break in the middle of an encoded-word. The =?...?= wrapper must be on a single line.
Decoding Rules
When a mail client encounters an encoded-word, it:
- Extracts the charset, encoding type, and encoded text from the =?charset?encoding?text?= wrapper.
- Decodes the text using base64 (B) or quoted-printable (Q).
- Interprets the resulting bytes according to the declared charset.
- Displays the decoded Unicode text to the user.
If the client does not recognize the charset, it should display the encoded-word as-is rather than displaying garbled text.
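The steps above, including the fallback for unknown charsets and the rule that whitespace between adjacent encoded-words is dropped, can be sketched in Python as follows. The names decode_word and decode_header_value are hypothetical, and the regular expression is a simplification of the RFC 2047 grammar. Production code would more likely rely on a library decoder such as Python's email.header module.

```python
import base64
import re

ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([BbQq])\?([^?\s]*)\?=")
HEX_ESCAPE = re.compile(rb"=([0-9A-Fa-f]{2})")


def decode_word(match):
    """Decode one encoded-word; return it unchanged if it cannot be decoded."""
    charset, encoding, payload = match.groups()
    try:
        if encoding.upper() == "B":
            raw = base64.b64decode(payload, validate=True)
        else:
            # Q: underscore means space, =XX means the byte with hex value XX.
            q_bytes = payload.replace("_", " ").encode("ascii")
            raw = HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), q_bytes)
        return raw.decode(charset)
    except (LookupError, ValueError):
        # Unknown charset or malformed payload: show the encoded-word as-is.
        return match.group(0)


def decode_header_value(value):
    """Decode every encoded-word found in a header field value."""
    unfolded = re.sub(r"\r?\n(?=[ \t])", "", value)  # undo header folding
    # Whitespace between two adjacent encoded-words is not part of the text.
    joined = re.sub(r"(\?=)[ \t]+(?==\?)", r"\1", unfolded)
    return ENCODED_WORD.sub(decode_word, joined)


subject = decode_header_value("Re: =?UTF-8?Q?R=C3=A9union_du_15_mars?=")
```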
Examples
A Complete Message with Encoded Headers
    MIME-Version: 1.0
    From: =?UTF-8?Q?Ren=C3=A9_Dupont?= <rene@example.fr>
    To: =?UTF-8?B?5bGx55Sw5aSq6YOO?= <yamada@example.jp>
    Subject: =?UTF-8?Q?Re:_R=C3=A9union_du_15_mars?=
    Content-Type: text/plain; charset=utf-8
    Content-Transfer-Encoding: quoted-printable

    Bonjour Taro,

    Confirmons la r=C3=A9union pour le 15 mars.
Note: the header uses RFC 2047 encoded-words (=?...?=), while the body uses regular quoted-printable encoding (=XX without the wrapper). These are different mechanisms for different parts of the message.
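For comparison, here is a sketch of building a similar message with Python's standard email package, which applies both mechanisms automatically when the message is serialized. The exact header bytes (B versus Q, letter case, fold points) are chosen by the library and may differ from the hand-written example, so the sketch verifies the result by parsing it back.

```python
from email import message_from_bytes, policy
from email.headerregistry import Address
from email.message import EmailMessage

msg = EmailMessage()
msg["From"] = Address("René Dupont", "rene", "example.fr")
msg["To"] = Address("山田太郎", "yamada", "example.jp")
msg["Subject"] = "Re: Réunion du 15 mars"
# cte selects the body's Content-Transfer-Encoding.
msg.set_content(
    "Bonjour Taro,\n\nConfirmons la réunion pour le 15 mars.\n",
    cte="quoted-printable",
)

wire = msg.as_bytes()  # headers become encoded-words, body becomes QP

# Parse the serialized bytes again to confirm nothing was lost.
parsed = message_from_bytes(wire, policy=policy.default)
```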
Encoding Comparison
The same text — "München" — encoded both ways:
; Q encoding: readable, good for mostly-ASCII text =?UTF-8?Q?M=C3=BCnchen?= ; B encoding: compact but opaque =?UTF-8?B?TcO8bmNoZW4=?=
Common Mistakes
- Encoding the email address itself. Only the display name can be encoded. =?UTF-8?Q?user?=@example.com is invalid and will be rejected or misinterpreted. For internationalized email addresses, see RFC 6531/6532.
- Missing space between encoded-words and regular text. An encoded-word must be separated from adjacent text by whitespace. Hello=?UTF-8?Q?World?= is malformed; it should be Hello =?UTF-8?Q?World?=.
- Breaking an encoded-word across lines. The entire =?...?= token must fit on one line. If you need to fold, split into multiple encoded-words at word boundaries.
- Using RFC 2047 in Content-Type parameters. Encoded-words are not valid in structured header parameters like filename= or name=. Use RFC 2231 parameter encoding instead: filename*=UTF-8''R%C3%A9sum%C3%A9.pdf.
- Exceeding the 75-character limit. Each encoded-word must be 75 characters or fewer. Long text must be split into multiple encoded-words. Oversized encoded-words may be silently truncated by mail servers.
- Double-encoding. Encoding text that is already encoded produces garbage like =?UTF-8?Q?=3D=3FUTF-8=3FQ=3F...?=. Ensure your encoding pipeline runs exactly once.
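Some of these mistakes can be caught mechanically before sending. The following Python sketch is a hypothetical lint for an unstructured field value such as a Subject; the function name and the problem labels are invented for illustration, and the checks are heuristics, not a full RFC 2047 validator.

```python
import re

ENCODED_WORD = re.compile(r"=\?[^?\s]+\?[BbQq]\?[^?\s]*\?=")


def lint_unstructured(value):
    """Return a list of problem labels found in an unstructured field value."""
    problems = []
    for match in ENCODED_WORD.finditer(value):
        word = match.group(0)
        start, end = match.span()
        if len(word) > 75:
            problems.append("too-long")
        # An encoded-word must be delimited by whitespace on both sides.
        glued_before = start > 0 and not value[start - 1].isspace()
        glued_after = end < len(value) and not value[end].isspace()
        if glued_before or glued_after:
            problems.append("missing-whitespace")
        # "=3D=3F" is a Q-encoded "=?", a typical sign of encoding twice.
        if "=3D=3F" in word.upper():
            problems.append("double-encoded")
    return problems


report = lint_unstructured("Hello=?UTF-8?Q?World?=")
```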
Deliverability Impact
- Incorrect encoding triggers spam filters. Malformed encoded-words in Subject lines are a red flag. Spam filters have seen decades of broken encoding from spam tools. Clean, standards-compliant encoding signals legitimate sending software.
- Display name encoding affects trust. If the From display name contains non-ASCII characters that are not properly encoded, recipients see raw =?UTF-8?Q?...?= text instead of a readable name. This looks suspicious and hurts open rates.
- Subject line rendering is critical for engagement. A garbled Subject line due to wrong charset or broken encoding means the recipient cannot read it. The email gets ignored or reported as spam.
- Always use UTF-8. Legacy charsets like ISO-8859-1 cannot represent all characters. If a system mixes charsets across different headers, clients may display some correctly and others as mojibake. Standardize on UTF-8 everywhere.
- Test across clients. Outlook, Gmail, Apple Mail, and Thunderbird all have slightly different RFC 2047 decoding behaviors, especially around edge cases like long encoded-words and mixed encoding/non-encoding in a single header.