
UTF-8 Migration Progress

Goal: Make facsimile fully UTF-8 aware so box-drawing characters (├─│└) and other multi-byte UTF-8 sequences display and edit correctly.

Problem

Fortran's string operations work on bytes, not characters. A UTF-8 character like ├ is 3 bytes but should be treated as 1 character and displayed as 1 column.

Example:

  • "Hello" → 5 bytes, 5 chars, 5 display columns ✓ (works)
  • "├──" → 9 bytes, 3 chars, 3 display columns ✗ (broken before migration)
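The gap between bytes and characters can be shown with a minimal sketch. The module and function names below are hypothetical; the actual signature of utf8_char_count() is not shown in this document. The sketch counts every byte that is not a continuation byte (bit pattern 10xxxxxx), which gives the character count for well-formed UTF-8.

```fortran
module utf8_count_sketch
    implicit none
    private
    public :: sketch_char_count   ! hypothetical name, not the project's API
contains
    ! Count UTF-8 characters by counting every byte that is NOT a
    ! continuation byte. Assumes the input is well-formed UTF-8.
    pure function sketch_char_count(s) result(n)
        character(len=*), intent(in) :: s
        integer :: n, i, b
        n = 0
        do i = 1, len(s)
            b = iand(ichar(s(i:i)), 255)        ! byte value in 0..255
            if (iand(b, 192) /= 128) n = n + 1  ! skip 10xxxxxx bytes
        end do
    end function sketch_char_count
end module utf8_count_sketch
```

For "├──", len() reports 9 while sketch_char_count() reports 3, which is the mismatch the migration has to account for.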

✅ Completed

1. Core UTF-8 Infrastructure

  • src/utils/utf8_module.f90 - COMPLETE
    • utf8_char_count() - Count UTF-8 characters
    • utf8_char_at() - Extract character at position
    • utf8_char_to_byte_index() - Convert char pos → byte pos
    • utf8_byte_to_char_index() - Convert byte pos → char pos
    • utf8_display_width() - Calculate screen columns needed
    • utf8_char_byte_length() - Get byte length of UTF-8 char
    • ✅ Handles 1-4 byte UTF-8 sequences
    • ✅ Handles wide characters (CJK = 2 columns)
    • ✅ Handles combining characters (0 width)

2. Cursor Semantics

  • src/editor_state_module.f90 - COMPLETE
    • ✅ Documented: cursor%column = UTF-8 character position (NOT byte index)
    • ✅ Added detailed comments explaining the semantics
    • ✅ Example: In "├──", column=2 refers to the second character, the first ─ (which starts at byte 4)
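The column-to-byte mapping in the example above can be sketched as follows. This is an illustration under assumptions, not the code in utf8_module: the names are hypothetical, the input is assumed to be well-formed UTF-8, and a position past the last character maps to len(s) + 1 so that a cursor at end of line still gets a usable byte index.

```fortran
module utf8_index_sketch
    implicit none
    private
    public :: sketch_lead_length, sketch_char_to_byte   ! hypothetical names
contains
    ! Byte length implied by a lead byte. Stray continuation bytes and
    ! invalid lead bytes count as 1 so that a scan always advances.
    pure function sketch_lead_length(c) result(n)
        character(len=1), intent(in) :: c
        integer :: n, b
        b = iand(ichar(c), 255)
        if (b < 128) then
            n = 1                               ! 0xxxxxxx (ASCII)
        else if (iand(b, 224) == 192) then
            n = 2                               ! 110xxxxx
        else if (iand(b, 240) == 224) then
            n = 3                               ! 1110xxxx
        else if (iand(b, 248) == 240) then
            n = 4                               ! 11110xxx
        else
            n = 1
        end if
    end function sketch_lead_length

    ! Byte index of the first byte of the char_pos-th character (1-based).
    pure function sketch_char_to_byte(s, char_pos) result(byte_pos)
        character(len=*), intent(in) :: s
        integer, intent(in) :: char_pos
        integer :: byte_pos, k
        byte_pos = 1
        do k = 1, char_pos - 1
            if (byte_pos > len(s)) exit
            byte_pos = byte_pos + sketch_lead_length(s(byte_pos:byte_pos))
        end do
        byte_pos = min(byte_pos, len(s) + 1)
    end function sketch_char_to_byte
end module utf8_index_sketch
```

With "├──" as input, character positions 1, 2, 3 map to bytes 1, 4, 7, and position 4 (end of line) maps to byte 10.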

3. Text Buffer UTF-8 Helpers

  • src/buffer/text_buffer_module.f90 - COMPLETE
    • ✅ Added use utf8_module
    • buffer_get_line_char_count() - Get character count of line
    • buffer_char_at() - Get character at char position in line
    • buffer_byte_to_char_col() - Convert byte col → char col
    • buffer_char_to_byte_col() - Convert char col → byte col

4. Basic Cursor Movement

  • src/commands/command_handler_module.f90 - PARTIAL
    • move_cursor_left() - Uses buffer_get_line_char_count()
    • move_cursor_right() - Uses buffer_get_line_char_count()
    • ✅ Both functions now work with character positions

5. Module Imports

  • src/terminal/renderer_module.f90 - PARTIAL
    • ✅ Added use utf8_module
    • ✅ Added buffer_get_line_char_count to imports

6. Renderer Display (HIGH PRIORITY)

  • src/terminal/renderer_module.f90 - COMPLETE
    • render_line() - Uses UTF-8 character positions and display width
    • ✅ Converts character positions to byte positions for slicing
    • ✅ Uses utf8_display_width() for padding calculations
    • ✅ Cursor screen positioning uses display width calculations
    • ✅ Both active and inactive cursors positioned correctly

Impact: UTF-8 characters now display correctly!

📋 TODO (Remaining Work)

HIGH PRIORITY - Renderer Fixes

Files: src/terminal/renderer_module.f90

Specific locations that need fixing:

  • Line 83: len(line_content) → needs UTF-8 char count
  • Line 208: len(line) → needs UTF-8 char count
  • Line 219-220: Padding calculation needs display width
  • Line 245: len(line) → needs UTF-8 char count
  • Line 480, 487, 504, 517: Cursor screen position calculations
  • Line 570-573, 597-600: Viewport scrolling with character positions
  • Line 754: len(line_content) → needs UTF-8 char count
  • Line 959-960: Viewport range calculation
  • Line 1036, 1129, 1136, 1156, 1197: More cursor positioning

MEDIUM PRIORITY - Word Movement

Files: src/commands/command_handler_module.f90

Functions to update:

  • move_cursor_word_left() (line ~1105)
  • move_cursor_word_right() (line ~1176)
  • extend_selection_word_left() (line ~3447)
  • extend_selection_word_right() (line ~3521)
  • delete_word_backward() (line ~3680)
  • delete_word_forward() (line ~690)

Issue: Word boundaries are detected with byte operations, which breaks on multi-byte UTF-8 characters

MEDIUM PRIORITY - Editing Operations

Files: src/commands/command_handler_module.f90

Functions to update:

  • insert_char() - Insert at character position
  • delete_char() - Delete character (not byte)
  • delete_selection() - Use character positions
  • insert_newline() - Character position aware
  • All text manipulation that uses line(i:i) slicing

Issue: Inserting or deleting at a byte offset can split a multi-byte UTF-8 sequence
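As an illustration of the editing issue, the sketch below deletes one whole character instead of one byte. It is not the project's delete_char(); the names are hypothetical and the line is assumed to be well-formed UTF-8. It finds the first byte of the requested character, extends over any continuation bytes that follow, and removes that whole byte range.

```fortran
module utf8_delete_sketch
    implicit none
    private
    public :: sketch_delete_char   ! hypothetical name, not the project's API
contains
    ! True for UTF-8 continuation bytes (bit pattern 10xxxxxx).
    pure logical function is_continuation(c)
        character(len=1), intent(in) :: c
        is_continuation = iand(iand(ichar(c), 255), 192) == 128
    end function is_continuation

    ! Return line with its char_pos-th character (1-based) removed.
    ! An out-of-range char_pos returns the line unchanged.
    pure function sketch_delete_char(line, char_pos) result(out)
        character(len=*), intent(in) :: line
        integer, intent(in) :: char_pos
        character(len=:), allocatable :: out
        integer :: i, seen, first, last

        first = 0
        seen = 0
        do i = 1, len(line)
            if (.not. is_continuation(line(i:i))) then
                seen = seen + 1
                if (seen == char_pos) then
                    first = i               ! lead byte of the target character
                    exit
                end if
            end if
        end do
        if (first == 0) then
            out = line
            return
        end if

        last = first
        do while (last < len(line))
            if (.not. is_continuation(line(last+1:last+1))) exit
            last = last + 1                 ! swallow trailing continuation bytes
        end do
        out = line(:first-1) // line(last+1:)
    end function sketch_delete_char
end module utf8_delete_sketch
```

Deleting character 2 of "a├b" removes all three bytes of ├ and leaves "ab", whereas removing a single byte at index 2 would leave two orphaned continuation bytes in the line.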

MEDIUM PRIORITY - Selection Operations

Files: src/commands/command_handler_module.f90

Functions to update:

  • extend_selection_left/right/up/down() - Character boundaries
  • select_word_at_cursor() - UTF-8 word boundaries
  • get_selected_text() - Extract text by character positions
  • Selection rendering in renderer_module

Issue: Selection ranges use byte positions, which breaks on multi-byte UTF-8 characters

LOWER PRIORITY - Search & Find

Files: src/prompts/*.f90, src/commands/command_handler_module.f90

Functions to update:

  • find_next_occurrence() - Search with UTF-8 awareness
  • select_next_match() - Match by characters
  • Search prompt operations

Issue: Pattern matching needs UTF-8 awareness so that matches start and end on character boundaries

LOWER PRIORITY - Other Operations

Various files:

  • Smart home: Character-based indentation detection
  • Go to column: User enters character position
  • Transpose characters: Swap UTF-8 characters
  • Bracket matching: Find brackets in UTF-8 text
  • Line operations (move, duplicate): Should already work

Testing Strategy

Test Files

  • /tmp/test_unicode.txt - Box drawing characters
  • /tmp/ctrl_d_pagination_test.txt - For ctrl-d testing

Test Cases

  1. Display: Open UTF-8 file, verify box chars show correctly
  2. Cursor Movement: Arrow keys move by character (not byte)
  3. Editing: Type at UTF-8 char boundaries
  4. Selection: Select text containing UTF-8 chars
  5. Search: Find UTF-8 characters with ctrl-d
  6. Word Movement: Alt-left/right across UTF-8 words

Success Criteria

  • Box drawing characters (├─│└) display correctly
  • Cursor doesn't get "stuck" in middle of UTF-8 sequence
  • Typing doesn't corrupt UTF-8 sequences
  • Selections work across UTF-8 boundaries
  • File saves/loads preserve UTF-8 content

Notes

Design Decisions

  1. Cursor column = character position (not byte position)

    • More intuitive for users
    • Matches behavior of other editors
  2. Display width vs character count

    • Most chars: 1 char = 1 column
    • CJK chars: 1 char = 2 columns
    • Combining: 1 char = 0 columns
  3. Viewport in character positions

    • Viewport uses character positions
    • Converted to byte positions when rendering
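The "display width vs character count" decision can be illustrated with a rough sketch. Everything here is an assumption for illustration: the names are hypothetical, the width table covers only a few ranges (combining diacritical marks, Hiragana/Katakana, CJK Unified Ideographs, Hangul syllables), and continuation bytes are not validated. A complete implementation would rely on the Unicode East Asian Width data, and how utf8_display_width() does this internally is not shown in this document.

```fortran
module utf8_width_sketch
    implicit none
    private
    public :: sketch_display_width   ! hypothetical name, not the project's API
contains
    ! Columns used by one code point, from a deliberately tiny table.
    pure function codepoint_width(cp) result(w)
        integer, intent(in) :: cp
        integer :: w
        if (cp >= 768 .and. cp <= 879) then
            w = 0        ! U+0300..U+036F combining diacritical marks
        else if (cp >= 12352 .and. cp <= 12543) then
            w = 2        ! U+3040..U+30FF Hiragana and Katakana
        else if (cp >= 19968 .and. cp <= 40959) then
            w = 2        ! U+4E00..U+9FFF CJK Unified Ideographs
        else if (cp >= 44032 .and. cp <= 55203) then
            w = 2        ! U+AC00..U+D7A3 Hangul syllables
        else
            w = 1
        end if
    end function codepoint_width

    ! Sum of column widths over all characters in s.
    pure function sketch_display_width(s) result(cols)
        character(len=*), intent(in) :: s
        integer :: cols, i, k, b, n, cp
        cols = 0
        i = 1
        do while (i <= len(s))
            b = iand(ichar(s(i:i)), 255)
            if (b < 128) then
                n = 1; cp = b
            else if (iand(b, 224) == 192) then
                n = 2; cp = iand(b, 31)
            else if (iand(b, 240) == 224) then
                n = 3; cp = iand(b, 15)
            else if (iand(b, 248) == 240) then
                n = 4; cp = iand(b, 7)
            else
                n = 1; cp = b            ! invalid lead byte: one column
            end if
            if (i + n - 1 > len(s)) then
                n = 1; cp = b            ! truncated sequence: one column
            end if
            do k = 1, n - 1
                ! append the low 6 bits of each continuation byte
                cp = ior(ishft(cp, 6), iand(iand(ichar(s(i+k:i+k)), 255), 63))
            end do
            cols = cols + codepoint_width(cp)
            i = i + n
        end do
    end function sketch_display_width
end module utf8_width_sketch
```

Under these assumptions "├──" occupies 3 columns, a single CJK ideograph such as 日 occupies 2, and "e" followed by the combining acute accent U+0301 occupies 1.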

Performance Considerations

  • Character-based operations must scan a line's bytes, so they cost more than direct byte indexing
  • Caching per-line character counts could help
  • Most operations stay O(n) in line length

Edge Cases to Handle

  • Cursor at end of line (column = char_count + 1)
  • Empty lines (char_count = 0)
  • Files with invalid UTF-8 (treat as bytes)
  • Mixed width characters (CJK)
  • Combining characters
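For the invalid UTF-8 case, one possible approach is to check a line structurally before choosing between character-based and byte-based handling. The sketch below is an assumption about how such a check could look, with a hypothetical name. It verifies only lead bytes and continuation counts; it does not reject overlong encodings, surrogate code points, or values above U+10FFFF.

```fortran
module utf8_validate_sketch
    implicit none
    private
    public :: sketch_is_valid_utf8   ! hypothetical name, not the project's API
contains
    ! Structural check: every lead byte must be followed by the number of
    ! continuation bytes it announces. Returns .false. at the first problem.
    pure function sketch_is_valid_utf8(s) result(ok)
        character(len=*), intent(in) :: s
        logical :: ok
        integer :: i, k, b, n
        ok = .false.
        i = 1
        do while (i <= len(s))
            b = iand(ichar(s(i:i)), 255)
            if (b < 128) then
                n = 1
            else if (iand(b, 224) == 192) then
                n = 2
            else if (iand(b, 240) == 224) then
                n = 3
            else if (iand(b, 248) == 240) then
                n = 4
            else
                return                       ! stray continuation or bad lead byte
            end if
            if (i + n - 1 > len(s)) return   ! sequence cut off at end of line
            do k = 1, n - 1
                if (iand(iand(ichar(s(i+k:i+k)), 255), 192) /= 128) return
            end do
            i = i + n
        end do
        ok = .true.
    end function sketch_is_valid_utf8
end module utf8_validate_sketch
```

A caller could fall back to plain byte positions for any line where this returns .false., which matches the "treat as bytes" note above.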

Current Build Status

  • ✅ Builds successfully
  • ✅ UTF-8 module complete and tested (10/10 tests passing)
  • ✅ Basic cursor movement works (character-based, not byte-based)
  • ✅ Display rendering works (box chars render correctly)
  • ✅ Character insertion works at UTF-8 boundaries
  • ⏳ Remaining: viewport, word movement, editing ops, selections

Test Results

Unit Tests

Created test/test_utf8_integration.f90 with 10 comprehensive tests:

  • ✅ All 10 tests passing
  • Covers: char counting, byte↔char conversion, display width, buffer integration

Manual Testing

Tested with /tmp/test_utf8_simple.txt containing box-drawing chars (├──):

  • ✅ Box characters display correctly in editor
  • ✅ Cursor moves by CHARACTER positions (not bytes)
    • Moving right through ├ (3 bytes) increments column by 1
    • Moving right through ─ (3 bytes) increments column by 1
  • ✅ Character insertion works at correct UTF-8 boundaries

Last updated: 2025-11-04
