Understanding Unicode and String Encoding in Java


In today’s globalized world, applications must support multiple languages and scripts — from English and Hindi to Chinese, Arabic, or emojis. Java’s string handling is built on Unicode, ensuring consistent behavior across platforms and locales.

But developers still encounter encoding issues when reading/writing files, converting between bytes and characters, or dealing with legacy systems. This tutorial explains how Unicode and encoding work in Java, with practical tips for building robust, international-ready applications.


🧠 What is Unicode?

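Unicode is a universal character standard that assigns a unique number (a code point) to each character it covers, across the world's scripts and symbol sets. Encodings such as UTF-8 and UTF-16 then define how those code points are stored as bytes.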

Examples:

  • 'A' → U+0041
  • '你' → U+4F60
  • '😊' → U+1F60A
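As a minimal sketch of the mapping above, the following prints each code point of a string in U+XXXX notation. The class and method names are illustrative only, and the characters are written as \u escapes so the snippet does not depend on the source file's encoding.

```java
import java.util.ArrayList;
import java.util.List;

class CodePointNames {
    // Formats every code point of the input in the conventional U+XXXX notation.
    static List<String> toNotation(String text) {
        List<String> result = new ArrayList<>();
        text.codePoints().forEach(cp -> result.add(String.format("U+%04X", cp)));
        return result;
    }

    public static void main(String[] args) {
        // 'A', the CJK character U+4F60, and the emoji U+1F60A (as a surrogate pair)
        System.out.println(toNotation("A\u4F60\uD83D\uDE0A")); // [U+0041, U+4F60, U+1F60A]
    }
}
```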

🔤 Java's Internal String Representation

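Java's String API exposes text as a sequence of UTF-16 code units (char values). Since Java 9 the JVM may store Latin-1-only strings more compactly in memory, but this is invisible to the API.

  • Characters from U+0000 to U+FFFF (the Basic Multilingual Plane) are stored in a single char (16-bit).
  • Supplementary characters (above U+FFFF) are represented using surrogate pairs, i.e. two chars.

String emoji = "😊";
System.out.println(emoji.length()); // 2 (due to surrogate pair)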

🔧 Converting Between Strings and Bytes

Java uses the Charset class for encoding/decoding.

String → Byte Array

byte[] utf8Bytes = "नमस्ते".getBytes(StandardCharsets.UTF_8);

Byte Array → String

String decoded = new String(utf8Bytes, StandardCharsets.UTF_8);
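Putting the two conversions together, a round trip might look like the sketch below. The helper names are hypothetical; the point is that the same charset is used in both directions and that the byte count differs from the char count.

```java
import java.nio.charset.StandardCharsets;

class RoundTrip {
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The Hindi greeting from the example above, written as escapes (six code points)
        String original = "\u0928\u092E\u0938\u094D\u0924\u0947";
        byte[] bytes = encode(original);
        System.out.println(original.length());               // 6 chars, all in the BMP
        System.out.println(bytes.length);                    // 18: each of these code points takes 3 bytes in UTF-8
        System.out.println(decode(bytes).equals(original));  // true
    }
}
```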

🛠️ Common Encodings in Java

Charset      Description
UTF-8        Most popular; variable-length (1 to 4 bytes per code point)
UTF-16       Java's internal representation (2 or 4 bytes per code point)
ISO-8859-1   Western European (Latin-1); one byte per character
US-ASCII     Basic Latin (0–127)
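To see how these charsets differ in practice, the following sketch encodes the same short string with each of them and compares the sizes. The class name and sample text are illustrative.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodedSizes {
    static int size(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String text = "caf\u00E9"; // "cafe" with an acute accent on the e (U+00E9)
        System.out.println(size(text, StandardCharsets.UTF_8));      // 5: U+00E9 takes two bytes
        System.out.println(size(text, StandardCharsets.ISO_8859_1)); // 4: one byte per character
        System.out.println(size(text, StandardCharsets.UTF_16));     // 10: 2-byte byte-order mark + 4 x 2 bytes
        System.out.println(size(text, StandardCharsets.UTF_16BE));   // 8: same encoding without the byte-order mark

        // US-ASCII cannot represent U+00E9, so getBytes silently substitutes '?'
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        System.out.println(new String(ascii, StandardCharsets.US_ASCII)); // caf?
    }
}
```

The silent '?' substitution is worth noticing: encoding into a charset that cannot represent the text loses data without raising an error.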

📤 Writing Unicode-Aware Files

Files.write(Paths.get("out.txt"), "こんにちは".getBytes(StandardCharsets.UTF_8));

Ensure that every tool that later touches the file (text editors, databases, other programs) reads it with the same encoding it was written in.
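For larger outputs, a character writer with an explicit charset is a common alternative to building the byte array yourself. The sketch below assumes a temporary file as the target; the class and method names are hypothetical.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Utf8Writer {
    // Writes text through a character writer that encodes as UTF-8.
    static void write(Path target, String text) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            writer.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("greeting", ".txt"); // illustrative temp file
        // The Japanese greeting from the example above, as escapes, plus a newline
        write(file, "\u3053\u3093\u306B\u3061\u306F\n");
        System.out.println(Files.size(file)); // 16: five 3-byte characters plus one newline byte
        Files.delete(file);
    }
}
```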


📥 Reading Unicode Files Safely

List<String> lines = Files.readAllLines(Paths.get("data.txt"), StandardCharsets.UTF_8);
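One reason this call is safer than decoding raw bytes yourself is that it fails loudly: Files.readAllLines throws a MalformedInputException when the file contains bytes that are not valid in the requested charset, instead of silently substituting replacement characters. The sketch below uses that behavior to check a file; the helper name and the temporary file are assumptions for illustration.

```java
import java.io.IOException;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class StrictReading {
    // Returns true if the whole file decodes cleanly as UTF-8.
    static boolean readsAsUtf8(Path file) throws IOException {
        try {
            Files.readAllLines(file, StandardCharsets.UTF_8);
            return true;
        } catch (MalformedInputException e) {
            return false; // the file contains a byte sequence that is not valid UTF-8
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("legacy", ".txt"); // illustrative temp file
        String text = "caf\u00E9 au lait";

        Files.write(file, text.getBytes(StandardCharsets.ISO_8859_1));
        System.out.println(readsAsUtf8(file)); // false: the lone 0xE9 byte is not valid UTF-8

        Files.write(file, text.getBytes(StandardCharsets.UTF_8));
        System.out.println(readsAsUtf8(file)); // true

        Files.delete(file);
    }
}
```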

🔎 Understanding Surrogate Pairs

Characters above U+FFFF (like emojis) are represented using two chars in UTF-16.

String smile = "😊";
System.out.println(smile.length()); // 2
System.out.println(smile.codePointCount(0, smile.length())); // 1
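To look inside the pair, the Character class offers helpers for testing and combining surrogates. The sketch below rebuilds the code point by hand; firstCodePoint is a hypothetical helper that does what String.codePointAt(0) already does, shown only to make the mechanics visible.

```java
class SurrogateInspection {
    // Combines a leading surrogate pair into one code point, or returns the single char.
    static int firstCodePoint(String text) {
        char first = text.charAt(0);
        if (Character.isHighSurrogate(first)
                && text.length() > 1
                && Character.isLowSurrogate(text.charAt(1))) {
            return Character.toCodePoint(first, text.charAt(1));
        }
        return first;
    }

    public static void main(String[] args) {
        String smile = "\uD83D\uDE0A"; // U+1F60A written as its two surrogates
        System.out.println(Character.isHighSurrogate(smile.charAt(0)));  // true
        System.out.println(Character.isLowSurrogate(smile.charAt(1)));   // true
        System.out.println(Integer.toHexString(firstCodePoint(smile)));  // 1f60a
        System.out.println(Character.charCount(0x1F60A));                // 2 chars needed
        System.out.println(Character.charCount('A'));                    // 1 char needed
    }
}
```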

🧪 Working with Code Points

Use code-point-aware methods such as codePoints(), codePointAt(), and codePointCount() to handle supplementary characters correctly:

String str = "𝒜𝒷𝒸"; // fancy script
str.codePoints().forEach(cp -> System.out.println(Character.toChars(cp)));
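A typical place where this matters is truncation: cutting a string at a char index can split a surrogate pair. The following sketch truncates by code points instead, using offsetByCodePoints. The truncate helper is hypothetical and assumes a non-negative limit.

```java
class CodePointTruncation {
    // Keeps at most maxCodePoints code points without splitting a surrogate pair.
    static String truncate(String text, int maxCodePoints) {
        if (text.codePointCount(0, text.length()) <= maxCodePoints) {
            return text;
        }
        int end = text.offsetByCodePoints(0, maxCodePoints);
        return text.substring(0, end);
    }

    public static void main(String[] args) {
        // Three mathematical script letters (U+1D49C, U+1D4B7, U+1D4B8), two chars each
        String script = new StringBuilder()
                .appendCodePoint(0x1D49C)
                .appendCodePoint(0x1D4B7)
                .appendCodePoint(0x1D4B8)
                .toString();

        String cut = truncate(script, 2);
        System.out.println(cut.length());                         // 4
        System.out.println(cut.codePointCount(0, cut.length()));  // 2

        // A naive substring(0, 3) ends on a dangling high surrogate
        String broken = script.substring(0, 3);
        System.out.println(Character.isHighSurrogate(broken.charAt(2))); // true
    }
}
```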

📉 Common Encoding Pitfalls

  • Mismatched encodings: Writing in UTF-8 but reading in ISO-8859-1.
  • Truncated characters: When multibyte sequences are cut mid-way.
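  • Unescaped Unicode: \uXXXX escapes are translated by the compiler before the source is parsed, so an escape such as \u000A (line feed) or \u0022 (double quote) inside a string literal breaks the code. A supplementary character needs two escapes, one per surrogate (for example \uD83D\uDE0A for 😊).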
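The first two pitfalls are easy to reproduce in memory. The sketch below (hypothetical helper names) decodes UTF-8 bytes with the wrong charset, and then decodes a byte array that was cut in the middle of a multibyte sequence.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class EncodingPitfalls {
    // Encodes as UTF-8 but decodes as ISO-8859-1: the classic mismatch.
    static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Keeps only the first byteLimit bytes of the UTF-8 form, then decodes them.
    static String decodeTruncated(String text, int byteLimit) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] cut = Arrays.copyOf(utf8, Math.min(byteLimit, utf8.length));
        return new String(cut, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9"; // e with acute accent; bytes C3 A9 in UTF-8

        // Each byte becomes its own Latin-1 character: U+00C3 followed by U+00A9
        System.out.println(misread(eAcute).equals("\u00C3\u00A9"));            // true

        // Only the first byte of the two-byte sequence survives, so the decoder
        // substitutes the replacement character U+FFFD
        System.out.println(decodeTruncated(eAcute, 1).equals("\uFFFD"));       // true
    }
}
```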

🔄 Refactoring Example: Avoid Platform Defaults

❌ Risky

byte[] bytes = str.getBytes(); // uses the default charset: platform-dependent before Java 18, UTF-8 by default since

✅ Safer

byte[] bytes = str.getBytes(StandardCharsets.UTF_8);
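The same refactoring applies to stream readers, which also have constructors that fall back to the default charset. A minimal sketch, with an in-memory stream standing in for a real input source and a hypothetical firstLine helper:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitCharsetStreams {
    // Reads the first line, decoding as UTF-8 regardless of the JVM's default charset.
    static String firstLine(InputStream in) throws IOException {
        // new InputStreamReader(in) without a charset would use the default charset instead
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(Charset.defaultCharset()); // varies by JVM version and configuration

        byte[] data = "\u4F60\u597D\nsecond line".getBytes(StandardCharsets.UTF_8);
        String line = firstLine(new ByteArrayInputStream(data));
        System.out.println(line.length()); // 2: two CJK characters, one char each
    }
}
```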

📌 What's New in Java for Unicode/Encoding?

Java 7+

  • StandardCharsets added for encoding safety.
  • Files.readAllLines() and write() methods with charset support.

Java 11+

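  • New string methods isBlank(), strip(), and lines(). isBlank() and strip() use Unicode-aware whitespace rules (Character.isWhitespace), whereas the older trim() only removes characters up to U+0020.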
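The difference from the older trim() shows up with whitespace outside ASCII. The sketch below (requires Java 11 or later; the helper name is illustrative) pads a word with EM SPACE, U+2003.

```java
class WhitespaceHandling {
    // Returns { length after trim(), length after strip() } for comparison.
    static int[] lengths(String text) {
        return new int[] { text.trim().length(), text.strip().length() };
    }

    public static void main(String[] args) {
        String padded = "\u2003hello\u2003"; // EM SPACE on both sides
        int[] result = lengths(padded);
        System.out.println(result[0]);               // 7: trim() leaves U+2003 in place
        System.out.println(result[1]);               // 5: strip() removes Unicode whitespace
        System.out.println("\u2003".isBlank());      // true
        System.out.println("a\r\nb\nc".lines().count()); // 3
    }
}
```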

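Java 18+

  • UTF-8 became the default charset for the standard APIs (JEP 400), so calls that omit a charset no longer depend on the operating system's locale by default.

Java 21 (Preview)

  • String Templates were a preview feature for interpolated strings and were later withdrawn (they are not in JDK 23). They handled emojis and other Unicode text like any other String and did not change encoding behavior.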

✅ Best Practices

  • Always specify a charset explicitly when reading/writing strings or files.
  • Use StandardCharsets.UTF_8 — it’s safe and widely compatible.
  • Prefer codePointCount() and codePoints() when working with emojis or non-BMP characters.
  • Avoid getBytes() without arguments.
  • Validate user input encoding early.
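One way to validate incoming bytes early is a strict CharsetDecoder that reports errors instead of replacing them. This is a sketch under the assumption that the input is expected to be UTF-8; isValidUtf8 is a hypothetical helper, not a JDK method.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Validation {
    // Returns true only if the bytes form a complete, well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] input) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(input));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidUtf8("ok \u263A".getBytes(StandardCharsets.UTF_8))); // true
        System.out.println(isValidUtf8(new byte[] { (byte) 0xC3 }));  // false: truncated two-byte sequence
        System.out.println(isValidUtf8(new byte[] { (byte) 0xFF }));  // false: 0xFF never appears in UTF-8
    }
}
```

By contrast, new String(bytes, StandardCharsets.UTF_8) never fails; it substitutes U+FFFD for invalid input, which hides the problem rather than rejecting it.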

🔚 Conclusion and Key Takeaways

  • Java is Unicode-native, using UTF-16 for internal string representation.
  • Understand the distinction between characters, code points, and bytes.
  • Use standard APIs to control encoding and prevent data loss or corruption.
  • Internationalization starts with proper string and encoding handling.

❓ FAQ

1. What encoding does Java use internally for strings?

UTF-16.

2. What's the difference between a character and a code point?

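A code point is the number Unicode assigns to a character. A Java char is a 16-bit UTF-16 code unit, and one code point takes one or two chars. What a user perceives as a single character can also consist of several code points, for example a letter followed by a combining accent.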

3. Why does emoji.length() return 2?

Because length() counts UTF-16 code units (chars), and most emojis lie above U+FFFF and therefore need a surrogate pair.

4. How to safely convert strings to bytes?

Use getBytes(StandardCharsets.UTF_8).

5. What happens if I read UTF-8 as ISO-8859-1?

The result will be garbled text (mojibake).

6. Is new String(bytes) safe?

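Not reliably: new String(bytes) decodes with the default charset. Pass the charset the bytes were encoded with, for example new String(bytes, StandardCharsets.UTF_8).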

7. How to count actual characters in a string?

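Use codePointCount() instead of length() to count code points. Note that a user-perceived character, such as a letter plus a combining accent, can still span several code points.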
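When the goal is to count what a reader would see as characters, java.text.BreakIterator can walk character boundaries instead of code points. The sketch below uses a letter with a combining accent; the count helper is hypothetical, and results for more complex sequences (such as multi-part emojis) may differ between JDK versions.

```java
import java.text.BreakIterator;

class VisibleCharacters {
    // Counts character boundaries as reported by BreakIterator's character instance.
    static int count(String text) {
        BreakIterator boundaries = BreakIterator.getCharacterInstance();
        boundaries.setText(text);
        int count = 0;
        while (boundaries.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String accented = "e\u0301"; // 'e' followed by COMBINING ACUTE ACCENT
        System.out.println(accented.length());                              // 2 chars
        System.out.println(accented.codePointCount(0, accented.length()));  // 2 code points
        System.out.println(count(accented));                                // 1 visible character
    }
}
```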

8. How to detect encoding of a file?

Java doesn’t detect encoding automatically. Use external libraries or metadata.

9. Are emojis supported in Java?

Yes. They are ordinary Unicode characters; most lie above U+FFFF and are therefore stored as surrogate pairs.

10. Is UTF-8 always the best choice?

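For most modern systems, yes. It is compact for ASCII-heavy text, backward compatible with ASCII, and widely supported. Text that is mostly CJK takes three bytes per character in UTF-8 versus two in UTF-16, so size alone is not always in its favor.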