# UTF-8 Migration Progress

**Goal:** Make facsimile fully UTF-8 aware so box-drawing characters (├─│└) and other multi-byte UTF-8 sequences display and edit correctly.

## Problem
Fortran's string operations work on bytes, not characters. A UTF-8 character like `├` is 3 bytes but should be treated as 1 character and displayed as 1 column.

**Example:**
- `"Hello"` → 5 bytes, 5 chars, 5 display columns ✓ (works)
- `"├──"` → 9 bytes, 3 chars, 3 display columns ✗ (broken before migration)
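
The byte/character mismatch in the example can be reproduced with a few lines of Fortran. This is a minimal sketch, not the code in `utf8_module.f90`: the module and function names (`utf8_count_sketch`, `count_utf8_chars`) are made up for illustration, and it assumes the compiler stores default-kind characters as single bytes (as gfortran does). It counts characters by skipping UTF-8 continuation bytes, which have the bit pattern `10xxxxxx`.

```fortran
module utf8_count_sketch
  implicit none
contains
  ! Hypothetical helper: count UTF-8 characters in a byte string by
  ! counting every byte that is NOT a continuation byte (10xxxxxx).
  pure function count_utf8_chars(s) result(n)
    character(len=*), intent(in) :: s
    integer :: n, i, b
    n = 0
    do i = 1, len(s)
      ! Mask to 0..255 in case ichar() returns a negative value
      b = iand(ichar(s(i:i)), 255)
      ! 192 = 11000000, 128 = 10000000
      if (iand(b, 192) /= 128) n = n + 1
    end do
  end function count_utf8_chars
end module utf8_count_sketch
```

With `s` holding the 9 bytes of `├──`, `len(s)` reports 9 while `count_utf8_chars(s)` reports 3, which is exactly the gap the migration has to close.
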

## ✅ Completed

### 1. Core UTF-8 Infrastructure
- **`src/utils/utf8_module.f90`** - COMPLETE
  - ✅ `utf8_char_count()` - Count UTF-8 characters
  - ✅ `utf8_char_at()` - Extract character at position
  - ✅ `utf8_char_to_byte_index()` - Convert char pos → byte pos
  - ✅ `utf8_byte_to_char_index()` - Convert byte pos → char pos
  - ✅ `utf8_display_width()` - Calculate screen columns needed
  - ✅ `utf8_char_byte_length()` - Get byte length of UTF-8 char
  - ✅ Handles 1-4 byte UTF-8 sequences
  - ✅ Handles wide characters (CJK = 2 columns)
  - ✅ Handles combining characters (0 width)
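
Most of the helpers above rest on one rule: the lead byte of a UTF-8 sequence encodes how many bytes the character occupies. The sketch below illustrates that rule only. The name `lead_byte_length` is hypothetical, and the real `utf8_char_byte_length()` may have a different signature and different handling of invalid input.

```fortran
module utf8_lead_byte_sketch
  implicit none
contains
  ! Hypothetical helper: sequence length (1-4) implied by a lead byte.
  ! Continuation bytes and invalid lead bytes return 1 so that a caller
  ! stepping through a string always makes progress.
  pure function lead_byte_length(c) result(n)
    character(len=1), intent(in) :: c
    integer :: n, b
    b = iand(ichar(c), 255)
    if (b < 128) then
      n = 1          ! 0xxxxxxx: ASCII
    else if (b < 192) then
      n = 1          ! 10xxxxxx: continuation byte, not a valid lead
    else if (b < 224) then
      n = 2          ! 110xxxxx
    else if (b < 240) then
      n = 3          ! 1110xxxx (box-drawing characters land here)
    else if (b < 248) then
      n = 4          ! 11110xxx
    else
      n = 1          ! 11111xxx: never valid in UTF-8
    end if
  end function lead_byte_length
end module utf8_lead_byte_sketch
```
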

### 2. Cursor Semantics
- **`src/editor_state_module.f90`** - COMPLETE
  - ✅ Documented: `cursor%column` = UTF-8 character position (NOT byte index)
  - ✅ Added detailed comments explaining the semantics
  - ✅ Example: In `"├──"`, column=2 refers to the second character (the first `─`), which starts at byte 4
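
The column-to-byte mapping in that example can be written as a short scan. This is a sketch under assumptions, not the project's `utf8_char_to_byte_index()`: the name `char_to_byte`, the 1-based convention, and the return values for out-of-range positions are choices made here for illustration.

```fortran
module utf8_index_sketch
  implicit none
contains
  ! Hypothetical helper: 1-based byte index where the char_pos-th
  ! character starts. Returns len(s)+1 when char_pos is one past the
  ! last character (cursor at end of line) and 0 when out of range.
  pure function char_to_byte(s, char_pos) result(byte_pos)
    character(len=*), intent(in) :: s
    integer, intent(in) :: char_pos
    integer :: byte_pos, i, seen, b
    byte_pos = 0
    if (char_pos < 1) return
    seen = 0
    do i = 1, len(s)
      b = iand(ichar(s(i:i)), 255)
      ! Every non-continuation byte starts a new character
      if (iand(b, 192) /= 128) then
        seen = seen + 1
        if (seen == char_pos) then
          byte_pos = i
          return
        end if
      end if
    end do
    if (char_pos == seen + 1) byte_pos = len(s) + 1
  end function char_to_byte
end module utf8_index_sketch
```

For the 9-byte string `├──`, columns 1, 2, and 3 map to bytes 1, 4, and 7, and the end-of-line column 4 maps to byte 10.
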

### 3. Text Buffer UTF-8 Helpers
- **`src/buffer/text_buffer_module.f90`** - COMPLETE
  - ✅ Added `use utf8_module`
  - ✅ `buffer_get_line_char_count()` - Get character count of line
  - ✅ `buffer_char_at()` - Get character at char position in line
  - ✅ `buffer_byte_to_char_col()` - Convert byte col → char col
  - ✅ `buffer_char_to_byte_col()` - Convert char col → byte col

### 4. Basic Cursor Movement
- **`src/commands/command_handler_module.f90`** - PARTIAL
  - ✅ `move_cursor_left()` - Uses `buffer_get_line_char_count()`
  - ✅ `move_cursor_right()` - Uses `buffer_get_line_char_count()`
  - ✅ Both functions now work with character positions

### 5. Module Imports
- **`src/terminal/renderer_module.f90`** - PARTIAL
  - ✅ Added `use utf8_module`
  - ✅ Added `buffer_get_line_char_count` to imports

### 6. Renderer Display (HIGH PRIORITY)
- **`src/terminal/renderer_module.f90`** - COMPLETE
  - ✅ `render_line()` - Uses UTF-8 character positions and display width
  - ✅ Converts character positions to byte positions for slicing
  - ✅ Uses `utf8_display_width()` for padding calculations
  - ✅ Cursor screen positioning uses display width calculations
  - ✅ Both active and inactive cursors positioned correctly

**Impact:** UTF-8 characters now display correctly!
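
Padding and cursor placement depend on display width rather than character count. The sketch below shows one way such a width calculation could look. It is not the project's `utf8_display_width()`: the names are hypothetical, and the wide and zero-width ranges are a deliberately small, incomplete subset chosen for illustration (a full implementation would follow the Unicode East Asian Width data).

```fortran
module utf8_width_sketch
  implicit none
contains
  ! Hypothetical helper: screen columns needed for a UTF-8 byte string.
  pure function display_width(s) result(w)
    character(len=*), intent(in) :: s
    integer :: w, i, k, n, b, cp
    w = 0
    i = 1
    do while (i <= len(s))
      b = iand(ichar(s(i:i)), 255)
      if (b < 192 .or. b >= 248) then
        n = 1
        cp = b                  ! ASCII, or a stray/invalid byte
      else if (b < 224) then
        n = 2
        cp = iand(b, 31)
      else if (b < 240) then
        n = 3
        cp = iand(b, 15)
      else
        n = 4
        cp = iand(b, 7)
      end if
      ! Fold in the low 6 bits of each continuation byte
      do k = 1, n - 1
        if (i + k > len(s)) exit
        cp = cp * 64 + iand(iand(ichar(s(i+k:i+k)), 255), 63)
      end do
      w = w + codepoint_width(cp)
      i = i + n
    end do
  end function display_width

  ! Simplified width table (assumption: only a few common ranges).
  pure function codepoint_width(cp) result(w)
    integer, intent(in) :: cp
    integer :: w
    if (cp >= 768 .and. cp <= 879) then
      w = 0                     ! U+0300..U+036F combining diacritics
    else if ((cp >= 12352 .and. cp <= 12543) .or. &   ! U+3040..U+30FF kana
             (cp >= 19968 .and. cp <= 40959) .or. &   ! U+4E00..U+9FFF CJK
             (cp >= 44032 .and. cp <= 55203) .or. &   ! U+AC00..U+D7A3 Hangul
             (cp >= 65281 .and. cp <= 65376)) then    ! U+FF01..U+FF60 fullwidth
      w = 2
    else
      w = 1
    end if
  end function codepoint_width
end module utf8_width_sketch
```

A renderer would then pad a line with `target_columns - display_width(line)` spaces instead of `target_columns - len(line)`.
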

## 📋 TODO (Remaining Work)

### HIGH PRIORITY - Renderer Fixes
Files: `src/terminal/renderer_module.f90`

**Specific locations that need fixing:**
- Line 83: `len(line_content)` → needs UTF-8 char count
- Line 208: `len(line)` → needs UTF-8 char count
- Lines 219-220: Padding calculation needs display width
- Line 245: `len(line)` → needs UTF-8 char count
- Lines 480, 487, 504, 517: Cursor screen position calculations
- Lines 570-573, 597-600: Viewport scrolling with character positions
- Line 754: `len(line_content)` → needs UTF-8 char count
- Lines 959-960: Viewport range calculation
- Lines 1036, 1129, 1136, 1156, 1197: More cursor positioning

### MEDIUM PRIORITY - Word Movement
Files: `src/commands/command_handler_module.f90`

Functions to update:
- `move_cursor_word_left()` (line ~1105)
- `move_cursor_word_right()` (line ~1176)
- `extend_selection_word_left()` (line ~3447)
- `extend_selection_word_right()` (line ~3521)
- `delete_word_backward()` (line ~3680)
- `delete_word_forward()` (line ~690)

**Issue:** Word boundaries are detected with byte operations, which breaks on multi-byte UTF-8 characters

### MEDIUM PRIORITY - Editing Operations
Files: `src/commands/command_handler_module.f90`

Functions to update:
- `insert_char()` - Insert at character position
- `delete_char()` - Delete character (not byte)
- `delete_selection()` - Use character positions
- `insert_newline()` - Character position aware
- All text manipulation that uses `line(i:i)` slicing

**Issue:** Inserting/deleting at byte positions can split UTF-8 sequences
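
To illustrate what "delete character (not byte)" means in practice, here is a minimal sketch that removes a whole UTF-8 sequence at a character position. The names `delete_char_at` and `is_continuation` are hypothetical and this is not the existing `delete_char()`; it only shows the boundary handling that byte-based `line(i:i)` slicing misses.

```fortran
module utf8_delete_sketch
  implicit none
contains
  ! Hypothetical helper: remove the char_pos-th character (all of its
  ! bytes) from line. Out-of-range positions return line unchanged.
  pure function delete_char_at(line, char_pos) result(out)
    character(len=*), intent(in) :: line
    integer, intent(in) :: char_pos
    character(len=:), allocatable :: out
    integer :: i, seen, first, last
    first = 0
    seen = 0
    ! Find the byte where the requested character starts
    do i = 1, len(line)
      if (.not. is_continuation(line(i:i))) then
        seen = seen + 1
        if (seen == char_pos) then
          first = i
          exit
        end if
      end if
    end do
    if (first == 0) then
      out = line
      return
    end if
    ! Extend over the continuation bytes that belong to it
    last = first
    do while (last < len(line))
      if (.not. is_continuation(line(last+1:last+1))) exit
      last = last + 1
    end do
    out = line(:first-1) // line(last+1:)
  end function delete_char_at

  ! True for bytes of the form 10xxxxxx
  pure function is_continuation(c) result(yes)
    character(len=1), intent(in) :: c
    logical :: yes
    yes = iand(iand(ichar(c), 255), 192) == 128
  end function is_continuation
end module utf8_delete_sketch
```

Deleting character 2 of `├──` this way removes bytes 4 through 6 and leaves the valid 6-byte string `├─`, whereas deleting byte 2 would leave a corrupted sequence.
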

### MEDIUM PRIORITY - Selection Operations
Files: `src/commands/command_handler_module.f90`

Functions to update:
- `extend_selection_left/right/up/down()` - Character boundaries
- `select_word_at_cursor()` - UTF-8 word boundaries
- `get_selected_text()` - Extract text by character positions
- Selection rendering in renderer_module

**Issue:** Selection ranges use byte positions, which breaks on multi-byte UTF-8 characters

### LOWER PRIORITY - Search & Find
Files: `src/prompts/*.f90`, `src/commands/command_handler_module.f90`

Functions to update:
- `find_next_occurrence()` - Search with UTF-8 awareness
- `select_next_match()` - Match by characters
- Search prompt operations

**Issue:** Pattern matching needs UTF-8 awareness

### LOWER PRIORITY - Other Operations
Various files:

- Smart home: Character-based indentation detection
- Go to column: User enters character position
- Transpose characters: Swap UTF-8 characters
- Bracket matching: Find brackets in UTF-8 text
- Line operations (move, duplicate): Should already work

## Testing Strategy

### Test Files
- `/tmp/test_unicode.txt` - Box drawing characters
- `/tmp/ctrl_d_pagination_test.txt` - For ctrl-d testing

### Test Cases
1. **Display:** Open UTF-8 file, verify box chars show correctly
2. **Cursor Movement:** Arrow keys move by character (not byte)
3. **Editing:** Type at UTF-8 char boundaries
4. **Selection:** Select text containing UTF-8 chars
5. **Search:** Find UTF-8 characters with ctrl-d
6. **Word Movement:** Alt-left/right across UTF-8 words

### Success Criteria
- Box drawing characters (├─│└) display correctly
- Cursor doesn't get "stuck" in middle of UTF-8 sequence
- Typing doesn't corrupt UTF-8 sequences
- Selections work across UTF-8 boundaries
- File saves/loads preserve UTF-8 content

## Notes

### Design Decisions
1. **Cursor column = character position** (not byte position)
   - More intuitive for users
   - Matches behavior of other editors

2. **Display width vs character count**
   - Most chars: 1 char = 1 column
   - CJK chars: 1 char = 2 columns
   - Combining: 1 char = 0 columns

3. **Viewport in character positions**
   - Viewport uses character positions
   - Converted to byte positions when rendering

### Performance Considerations
- UTF-8 operations have overhead vs byte operations
- Caching line char counts could help
- Most operations stay O(n) in line length

### Edge Cases to Handle
- Cursor at end of line (column = char_count + 1)
- Empty lines (char_count = 0)
- Files with invalid UTF-8 (treat as bytes)
- Mixed width characters (CJK)
- Combining characters
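
The "invalid UTF-8 (treat as bytes)" case needs some way to decide whether a line is well formed before applying character-based logic. The following is a rough sketch of such a check, with a hypothetical name (`is_valid_utf8`). It verifies only lead bytes and continuation-byte structure and does not reject every ill-formed case (for example, overlong 3-byte forms and encoded surrogates pass), so it is a starting point rather than a complete validator.

```fortran
module utf8_validate_sketch
  implicit none
contains
  ! Hypothetical helper: structural UTF-8 check. Returns .false. on a
  ! bad lead byte, a truncated sequence, or a missing continuation byte.
  pure function is_valid_utf8(s) result(ok)
    character(len=*), intent(in) :: s
    logical :: ok
    integer :: i, k, n, b
    ok = .false.
    i = 1
    do while (i <= len(s))
      b = iand(ichar(s(i:i)), 255)
      if (b < 128) then
        n = 1
      else if (b >= 194 .and. b <= 223) then
        n = 2
      else if (b >= 224 .and. b <= 239) then
        n = 3
      else if (b >= 240 .and. b <= 244) then
        n = 4
      else
        return                       ! continuation or invalid lead byte
      end if
      if (i + n - 1 > len(s)) return ! sequence runs past end of string
      do k = 1, n - 1
        b = iand(ichar(s(i+k:i+k)), 255)
        if (iand(b, 192) /= 128) return
      end do
      i = i + n
    end do
    ok = .true.
  end function is_valid_utf8
end module utf8_validate_sketch
```

A caller could fall back to plain byte semantics (column = byte index) for any line where this returns `.false.`.
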

## Current Build Status
- ✅ Builds successfully
- ✅ UTF-8 module complete and tested (10/10 tests passing)
- ✅ Basic cursor movement works (character-based, not byte-based)
- ✅ Display rendering works (box chars render correctly)
- ✅ Character insertion works at UTF-8 boundaries
- ⏳ Remaining: viewport, word movement, editing ops, selections

## Test Results

### Unit Tests
Created `test/test_utf8_integration.f90` with 10 comprehensive tests:
- ✅ All 10 tests passing
- Covers: char counting, byte↔char conversion, display width, buffer integration

### Manual Testing
Tested with `/tmp/test_utf8_simple.txt` containing box-drawing chars (├──):
- ✅ Box characters display correctly in editor
- ✅ Cursor moves by CHARACTER positions (not bytes)
  - Moving right through `├` (3 bytes) increments column by 1
  - Moving right through `─` (3 bytes) increments column by 1
- ✅ Character insertion works at correct UTF-8 boundaries

Last updated: 2025-11-04