Architecture Guide
What this page covers: UBI internals — on-flash layout, in-RAM data structures, initialization, wear-leveling, dual-bank metadata redundancy, recovery, and failure handling.
Prerequisites: Read the Overview first for the mental model (PEB, LEB, EC, VID, EBA).
What you will learn: How UBI maps logical blocks to physical blocks, how it recovers from crashes, and how wear is distributed across the flash.
30-Second Summary
UBI divides a flash partition into Physical Erase Blocks (PEBs). The first N PEBs (configurable, default 2) are reserved for mirrored device and volume metadata; all remaining PEBs hold user data. Each data PEB carries an Erase Counter (EC) header and, once mapped, a Volume Identifier (VID) header followed by the payload. At init, UBI scans every PEB and builds an in-RAM cache: red-black trees for free blocks, dirty blocks, and per-volume LEB-to-PEB mappings, plus a linked list of bad blocks. Writes always pick the free PEB with the lowest erase count (wear-leveling). Crash recovery relies on monotonically increasing sequence numbers in VID headers: when two PEBs claim the same LEB, the higher sqnum wins.
Core Invariants
These rules hold at all times after a successful ubi_device_init():
| Invariant | Description |
|---|---|
| One LEB, one PEB | Each mapped LEB points to exactly one active PEB. No two LEBs share a PEB. |
| Higher sqnum wins | During init, if two PEBs claim the same (vol_id, lnum), the one with the higher sequence number is kept; the other becomes dirty. |
| Erase before reuse | A dirty PEB must be erased before it can return to the free pool. No in-place overwrites. |
| Bad PEBs are terminal | Once a PEB is classified as bad, it never returns to the free or dirty pool (unless torture recovery succeeds). |
| Reserved PEBs are mirrors | The first N reserved PEBs hold identical copies of device + volume metadata. They are never used for data. |
| Free pool is EC-ordered | |
| Mutex serialization | All public API calls acquire a per-device mutex. UBI is thread-safe but not ISR-safe. |
Secure extension: For authenticated encryption of all on-flash structures, see the Secure Architecture Guide.
Flash Storage Primer
Raw flash memory (NAND or NOR) differs from block devices like SD cards or eMMC in several important ways:
Erase before write: a flash cell must be erased before it can be written. Erasing sets all bytes to the hardware-defined erased value (typically 0xFF for NOR flash, but this may differ on other technologies).
Erase granularity: erasure operates on large blocks (erase blocks), typically 4 KB to 256 KB.
Write granularity: writes operate on smaller units (write blocks), typically 1 to 16 bytes.
Limited endurance: each erase block supports a finite number of erase cycles (typically 10,000 to 100,000) before it becomes unreliable.
Bad blocks: blocks can fail at any point during the device lifetime.
Without wear-leveling, repeatedly writing to the same logical location would exhaust a small set of physical blocks while the rest remain unused. UBI solves this by dynamically remapping logical blocks to physical blocks, always choosing the least-worn block for new writes.
Key terminology:
| Term | Meaning |
|---|---|
| PEB | Physical Erase Block: a hardware erase unit on the flash chip |
| LEB | Logical Erase Block: a virtual block exposed to the application |
| EC | Erase Counter: tracks how many times a PEB has been erased |
| VID | Volume Identifier: metadata linking a PEB to a volume and LEB |
| EBA | Erase Block Association: the mapping table from LEBs to PEBs |
Architecture Overview
+-----------------------------------------------------+
| Application |
+-----------------------------------------------------+
| ^
| ubi_leb_write() | ubi_leb_read()
| ubi_volume_create() | ubi_volume_get_info()
| ubi_device_init() | ubi_device_get_info()
v |
+-----------------------------------------------------+
| UBI Layer |
| |
| +---------------+ +---------------+ +---------+ |
| | Volume Mgmt | | LEB I/O | | Wear- | |
| | create/remove | | read/write | | Level | |
| | resize/info | | map/unmap | | Engine | |
| +---------------+ +---------------+ +---------+ |
| |
| +----------------------------------------------+ |
| | PEB Management (RBT Cache) | |
| | free_pebs | dirty_pebs | bad_pebs | vols | |
| +----------------------------------------------+ |
+-----------------------------------------------------+
| ^
| flash_area_write() | flash_area_read()
| flash_area_erase() |
v |
+-----------------------------------------------------+
| Zephyr Flash Area API (Flash Map) |
+-----------------------------------------------------+
| ^
v |
+-----------------------------------------------------+
| Flash Hardware (NOR / NAND) |
+-----------------------------------------------------+
Source files:
| File | Role |
|---|---|
| | Public API: all structures and function declarations |
| | Device initialization: format, scan, mount |
| | Device runtime: get_info, erase_peb, deinit, test API |
| | Volume management: create, resize, remove, get_info |
| | LEB operations: read, write (copy-on-write), map, unmap (idempotent), is_mapped, get_size |
| | Red-black tree comparator and search helpers |
| | Shared internal types ( |
| | RBT and linked-list item types |
| | On-flash header structures and constants |
| | Metadata I/O: device and volume header read/write |
| | Data I/O: EC/VID header and LEB data read/write, flash write/erase fault injection |
| | Reserved PEB state types and API declarations |
| | Reserved PEB scanning, recovery, overwrite, and commit |
| | Single-handle-per-partition registry API |
| | Static bitfield registry preventing double-init of the same partition |
| | Memory abstraction layer API: device, volume, leaf, scratch allocators |
| | Static (k_mem_slab) and heap (k_malloc) backend implementations |
On-Flash Layout
UBI reserves the first N PEBs for device and volume metadata, stored in a dual-bank configuration for crash resilience. N is configurable via CONFIG_UBI_DEV_HDR_NR_OF_RES_PEBS (default 2, range 2–4). The remaining PEBs (N through total-1) are data blocks available for volume use.
Flash Partition (default: N=2 reserved PEBs)
+====================+====================+=====+====================+
| PEB 0 (Reserved) | PEB 1 (Reserved) | ... | PEB total-1 |
| Device Header Bank | Device Header Bank | | Data Block |
+====================+====================+=====+====================+
Reserved PEB Layout (reserved PEBs are mirrors):
Offset 0x000 +----------------------+
| Device Header (32 B) | magic, version, revision, vol_count, CRC
+----------------------+
Offset 0x020 | Volume 0 Hdr (48 B) | magic, vol_id, name, type, leb_count, CRC
+----------------------+
Offset 0x050 | Volume 1 Hdr (48 B) |
+----------------------+
| ... | (up to CONFIG_UBI_MAX_NR_OF_VOLUMES)
+----------------------+
Data PEB Layout (PEB N through PEB total-1):
Offset 0x000 +----------------------+
| EC Header (16 B) | magic, version, erase_counter, CRC
+----------------------+
Offset 0x010 | VID Header (32 B) | magic, vol_id, leb_num, sqnum, data_size, CRC
+----------------------+
Offset 0x030 | |
| User Data | up to (erase_block_size - 48) bytes
| |
+----------------------+
When a data PEB is free (not assigned to any volume), its VID header area is erased (filled with the hardware-reported erased byte value). The EC header is always present on valid PEBs.
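The offsets in the two layouts reduce to simple arithmetic. The sketch below assumes the header sizes shown in the diagrams (32 B device header, 48 B volume header, 16 B EC header, 32 B VID header). The macro and function names are illustrative and are not part of the UBI API.

```c
#include <assert.h>
#include <stdint.h>

/* Sizes taken from the layout diagrams; the macro names are hypothetical. */
#define SKETCH_DEV_HDR_SIZE 32u
#define SKETCH_VOL_HDR_SIZE 48u
#define SKETCH_EC_HDR_SIZE  16u
#define SKETCH_VID_HDR_SIZE 32u

/* Byte offset of volume header `index` inside a reserved PEB. */
static uint32_t sketch_vol_hdr_offset(uint32_t index)
{
	return SKETCH_DEV_HDR_SIZE + index * SKETCH_VOL_HDR_SIZE;
}

/* Byte offset of the user data area inside a data PEB. */
static uint32_t sketch_data_offset(void)
{
	return SKETCH_EC_HDR_SIZE + SKETCH_VID_HDR_SIZE;
}

/* Maximum payload of one LEB for a given erase block size. */
static uint32_t sketch_leb_capacity(uint32_t erase_block_size)
{
	assert(erase_block_size > sketch_data_offset());
	return erase_block_size - sketch_data_offset();
}
```

With a 4 KB erase block, for instance, each LEB can carry 4096 - 48 = 4048 bytes of user data.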
Header Structures
All headers are aligned to 16 bytes and protected by CRC-32/IEEE (crc32_ieee() from Zephyr’s <zephyr/sys/crc.h>). The CRC covers all fields except the hdr_crc field itself.
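As an illustration of the header check, the sketch below seals and validates a header whose last four bytes hold the CRC of everything before them. It uses a portable bitwise CRC-32 as a stand-in for crc32_ieee(), and it assumes the CRC is stored in little-endian byte order, which this page does not state. All helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320), standing in for crc32_ieee(). */
static uint32_t sketch_crc32(const uint8_t *data, size_t len)
{
	uint32_t crc = 0xFFFFFFFFu;

	for (size_t i = 0; i < len; i++) {
		crc ^= data[i];
		for (int bit = 0; bit < 8; bit++) {
			/* Shift right; XOR in the polynomial when the dropped bit was 1. */
			crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
		}
	}
	return ~crc;
}

/* Store the CRC of bytes [0, len - 4) in the last four bytes (little-endian assumed). */
static void sketch_hdr_seal(uint8_t *hdr, size_t len)
{
	assert(hdr != NULL && len > 4);
	uint32_t crc = sketch_crc32(hdr, len - 4);

	for (int i = 0; i < 4; i++) {
		hdr[len - 4 + i] = (uint8_t)(crc >> (8 * i));
	}
}

/* Return true if the stored CRC matches the rest of the header. */
static bool sketch_hdr_crc_ok(const uint8_t *hdr, size_t len)
{
	assert(hdr != NULL && len > 4);
	uint32_t stored = 0;

	for (int i = 0; i < 4; i++) {
		stored |= (uint32_t)hdr[len - 4 + i] << (8 * i);
	}
	return stored == sketch_crc32(hdr, len - 4);
}
```

Because the CRC field is always the last field of a header, the same pair of helpers would apply to all four header types by passing the header size.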
Erase Counter (EC) Header — 16 bytes
Present on every data PEB. Tracks how many times this block has been erased.
Offset Size Field
------ ---- -----
0x00 4 magic (0x55424923)
0x04 1 version (1)
0x05 3 padding
0x08 4 ec erase counter value
0x0C 4 hdr_crc CRC-32 of bytes 0x00..0x0B
Volume Identifier (VID) Header — 32 bytes
Present on data PEBs that are mapped to a volume. Links a PEB to a specific volume and LEB.
Offset Size Field
------ ---- -----
0x00 4 magic (0x55424921)
0x04 1 version (1)
0x05 3 padding
0x08 4 lnum logical erase block number within the volume
0x0C 4 vol_id volume identifier
0x10 8 sqnum global sequence number (monotonically increasing)
0x18 4 data_size size of user data in bytes
0x1C 4 hdr_crc CRC-32 of bytes 0x00..0x1B
The sqnum field is critical for crash recovery. During the PEB scan at init, if two PEBs claim the same (vol_id, lnum) pair, the one with the higher sqnum wins.
Device Header — 32 bytes
Stored on all reserved PEBs (default: PEB 0 and PEB 1). Describes the overall UBI device.
Offset Size Field
------ ---- -----
0x00 4 magic (0x55424925)
0x04 1 version (1)
0x05 3 padding
0x08 4 offset offset of the first volume header
0x0C 4 size device size
0x10 4 revision header revision counter (incremented on each metadata update)
0x14 4 vol_count number of volumes
0x18 4 vol_id_watermark monotonic volume ID counter (never reused)
0x1C 4 hdr_crc CRC-32 of bytes 0x00..0x1B
Volume Header — 48 bytes
One per volume, stored sequentially after the device header on the reserved PEBs.
Offset Size Field
------ ---- -----
0x00 4 magic (0x55424926)
0x04 1 version (1)
0x05 1 vol_type 0 = static, 1 = dynamic
0x06 2 padding
0x08 4 vol_id unique volume identifier
0x0C 4 leb_count number of LEBs allocated to this volume
0x10 12 padding
0x1C 16 name null-terminated volume name (max 16 bytes including '\0')
0x2C 4 hdr_crc CRC-32 of bytes 0x00..0x2B
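The four layouts above can be written down as C structures and checked at compile time. This is a sketch only: the type and padding-field names are hypothetical, the real definitions in the UBI sources may differ, and the code relies on the natural alignment of common ABIs rather than packing attributes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_ec_hdr {
	uint32_t magic;          /* 0x55424923 */
	uint8_t  version;
	uint8_t  padding[3];
	uint32_t ec;
	uint32_t hdr_crc;
};

struct sketch_vid_hdr {
	uint32_t magic;          /* 0x55424921 */
	uint8_t  version;
	uint8_t  padding[3];
	uint32_t lnum;
	uint32_t vol_id;
	uint64_t sqnum;
	uint32_t data_size;
	uint32_t hdr_crc;
};

struct sketch_dev_hdr {
	uint32_t magic;          /* 0x55424925 */
	uint8_t  version;
	uint8_t  padding[3];
	uint32_t offset;
	uint32_t size;
	uint32_t revision;
	uint32_t vol_count;
	uint32_t vol_id_watermark;
	uint32_t hdr_crc;
};

struct sketch_vol_hdr {
	uint32_t magic;          /* 0x55424926 */
	uint8_t  version;
	uint8_t  vol_type;       /* 0 = static, 1 = dynamic */
	uint8_t  padding1[2];
	uint32_t vol_id;
	uint32_t leb_count;
	uint8_t  padding2[12];
	char     name[16];
	uint32_t hdr_crc;
};

/* Sizes and key offsets must match the tables above. */
_Static_assert(sizeof(struct sketch_ec_hdr) == 16, "EC header is 16 bytes");
_Static_assert(sizeof(struct sketch_vid_hdr) == 32, "VID header is 32 bytes");
_Static_assert(sizeof(struct sketch_dev_hdr) == 32, "device header is 32 bytes");
_Static_assert(sizeof(struct sketch_vol_hdr) == 48, "volume header is 48 bytes");
_Static_assert(offsetof(struct sketch_vid_hdr, sqnum) == 0x10, "sqnum at 0x10");
_Static_assert(offsetof(struct sketch_vol_hdr, name) == 0x1C, "name at 0x1C");
_Static_assert(offsetof(struct sketch_vol_hdr, hdr_crc) == 0x2C, "hdr_crc at 0x2C");
```

Placing the CRC last in every structure keeps the covered range contiguous, which matches the "CRC-32 of bytes 0x00..N" entries in the tables.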
In-RAM Data Structures
When ubi_device_init() runs, it scans the flash and builds an in-RAM cache of PEB states. This cache is the heart of UBI — all runtime decisions (which PEB to write to, which blocks are dirty, etc.) are made from these structures without re-reading flash.
Overview
struct ubi_device (128 B)
|
|-- mutex Zephyr mutex for thread safety
|-- mtd Flash partition config (partition_id, block sizes)
|
|-- free_pebs (Red-Black Tree, keyed by erase counter)
| |
| | Holds PEBs that are erased and available for new writes.
| | The minimum node (lowest EC) is selected for writes (wear-leveling).
| |
| | ec:3 Nodes are struct ubi_rbt_item {
| | / \ .key = erase_counter,
| | ec:1 ec:7 .value.pnum = PEB index
| | / \ }
| | ec:5 ec:12
| |
| `-- Each node points to a physical PEB on flash:
| ec:1 --> PEB 5 [EC hdr: ec=1 | VID: 0xFF (empty) | ...]
| ec:3 --> PEB 8 [EC hdr: ec=3 | VID: 0xFF (empty) | ...]
| ec:5 --> PEB 14 [EC hdr: ec=5 | VID: 0xFF (empty) | ...]
|
|-- dirty_pebs (Red-Black Tree, keyed by erase counter)
| |
| | Holds PEBs that contain stale data and need erasure before reuse.
| | Populated when a LEB is overwritten or unmapped.
| |
| | ec:4
| | / \
| | ec:2 ec:9
| |
| `-- Each node points to a PEB with outdated data:
| ec:2 --> PEB 3 [EC hdr: ec=2 | VID: old data | ...]
| ec:4 --> PEB 11 [EC hdr: ec=4 | VID: old data | ...]
|
|-- bad_pebs (Singly-Linked List)
| |
| | Holds PEBs with I/O errors (invalid EC headers, failed erases/writes).
| | Entries are struct ubi_list_item { .pnum, .erase_count }
| |
| `-- [PEB 22, ec:~7] --> [PEB 45, ec:~3] --> NULL
|
| NOTE: Bad block list is NOT persisted to flash.
| It is lost on reboot and rebuilt during the next init scan.
|
|-- vols (Red-Black Tree, keyed by volume ID)
| |
| | Maps volume IDs to struct ubi_volume pointers.
| |
| | vol_id:0 Nodes are struct ubi_rbt_item {
| | / \ .key = volume_id,
| | vol_id:1 vol_id:5 .value.vol = &ubi_volume
| | }
| |
| `-- Each ubi_volume (44 B) contains:
|
| struct ubi_volume
| |-- vol_id Unique volume identifier
| |-- cfg { name[16], type (static|dynamic), leb_count }
| |-- eba_tbl_count Number of mapped LEBs
| `-- eba_tbl (Red-Black Tree, keyed by LEB number)
| |
| | Per-volume mapping from logical to physical blocks.
| |
| | leb:2 Nodes are struct ubi_rbt_item {
| | / \ .key = LEB_number,
| | leb:0 leb:5 .value.pnum = PEB_index
| | }
| |
| `-- Each node points to the PEB holding that LEB's data:
| leb:0 --> PEB 7 [EC hdr | VID: vol=0,leb=0,sq=42 | payload]
| leb:2 --> PEB 19 [EC hdr | VID: vol=0,leb=2,sq=50 | payload]
| leb:5 --> PEB 31 [EC hdr | VID: vol=0,leb=5,sq=55 | payload]
|
|-- global_sqnum Monotonically increasing sequence number for writes
`-- vol_id_watermark Monotonic volume ID counter (mirrors dev_hdr.vol_id_watermark)
How the Structures Relate to Flash
Every PEB on flash is tracked by exactly one of these structures at any time:
+------------------+
| Physical Flash |
+------------------+
| PEB 0 (reserved)|----> Device + Volume headers (Bank 1) \
| PEB 1 (reserved)|----> Device + Volume headers (Bank 2) > N reserved
| ... (if N > 2) |----> Cold spares /
|------------------|
free_pebs RBT --------->| PEB N (free) | EC hdr present, VID = 0xFF
free_pebs RBT --------->| PEB N+1 (free) | EC hdr present, VID = 0xFF
|------------------|
vol[0].eba_tbl -------->| PEB 4 (vol0/L0) | EC hdr + VID(vol=0,leb=0) + data
vol[0].eba_tbl -------->| PEB 5 (vol0/L1) | EC hdr + VID(vol=0,leb=1) + data
|------------------|
vol[1].eba_tbl -------->| PEB 6 (vol1/L0) | EC hdr + VID(vol=1,leb=0) + data
|------------------|
dirty_pebs RBT -------->| PEB 7 (dirty) | EC hdr + VID (stale data)
|------------------|
bad_pebs list --------->| PEB 8 (bad) | Unreadable or failed I/O
+------------------+
Rule: PEB 0..N-1 are always reserved (N = CONFIG_UBI_DEV_HDR_NR_OF_RES_PEBS).
Every other PEB is in exactly ONE of:
- free_pebs (erased, ready for use)
- Some volume's eba_tbl (in use, holds live data)
- dirty_pebs (contains stale data, awaiting erasure)
- bad_pebs (defective, excluded from use)
Memory Usage
| Structure | Size per entry | Allocated via |
|---|---|---|
| | 136 B | |
| | 44 B | |
| | 16 B | |
| | 12 B | |
Under the static backend (CONFIG_UBI_MEM_BACKEND_STATIC, default), all pools are pre-allocated at compile time. Under the heap backend, allocations are dynamic. See Configuration — Memory Sizing Guide for pool sizing details.
Memory Backends
All UBI runtime allocations route through the ubi_mem abstraction layer (lib/src/ubi_mem.h), which supports two backends selected via Kconfig:
ubi_mem (CONFIG_UBI_MEM_BACKEND_STATIC)
|
|-- device_slab [K_MEM_SLAB: D blocks of sizeof(ubi_device)]
|-- volume_slab [K_MEM_SLAB: D×V blocks of sizeof(ubi_volume)]
|-- leaf_slab [K_MEM_SLAB: D×(P+V) blocks of sizeof(ubi_leaf_item)]
`-- scratch_slab [K_MEM_SLAB: 1 block of DEV_HDR_SIZE + V×VOL_HDR_SIZE]
D = CONFIG_UBI_MAX_NR_OF_DEVICES
V = CONFIG_UBI_MAX_NR_OF_VOLUMES
P = CONFIG_UBI_MAX_NR_OF_DATA_PEBS
ubi_rbt_item (16 B) and ubi_list_item (12 B) share 16-byte blocks via union ubi_leaf_item. PEB state transitions (dirty→bad, bad→free, mapped→bad) retype items in-place rather than freeing and re-allocating, eliminating allocation failures on critical error paths.
When the static backend is used, ubi_device_init() validates that the flash geometry fits within the configured pool limits before scanning PEBs.
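The pool dimensions listed for the static backend can be computed from the three Kconfig values. The sketch below only reproduces that arithmetic; the struct and function names are hypothetical, and the header sizes (32 B and 48 B) are taken from the on-flash layout.

```c
#include <assert.h>
#include <stddef.h>

/* Block counts and scratch size for the static backend (names are illustrative). */
struct sketch_pools {
	size_t device_blocks;   /* D */
	size_t volume_blocks;   /* D x V */
	size_t leaf_blocks;     /* D x (P + V) */
	size_t scratch_bytes;   /* DEV_HDR_SIZE + V x VOL_HDR_SIZE */
};

/* d = max devices, v = max volumes, p = max data PEBs. */
static struct sketch_pools sketch_compute_pools(size_t d, size_t v, size_t p)
{
	assert(d > 0);

	struct sketch_pools pools = {
		.device_blocks = d,
		.volume_blocks = d * v,
		.leaf_blocks = d * (p + v),
		.scratch_bytes = 32 + v * 48,
	};
	return pools;
}
```

For one device with 4 volumes and 64 data PEBs this gives 68 leaf blocks and a 224-byte scratch buffer.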
PEB Lifecycle
A Physical Erase Block moves through the following states during normal operation:
stateDiagram-v2
[*] --> Free : ubi_device_init() (fresh flash)
Free --> Allocated : leb_write() / leb_map()
Allocated --> Dirty : leb_write() (overwrite) / leb_unmap()
Dirty --> Free : ubi_device_erase_peb() (ec += 1)
Free --> Bad : I/O error
Allocated --> Bad : I/O error
Dirty --> Bad : I/O error
Bad --> Free : Torture recovery (rare)
Detailed ASCII reference:
+-------+
ubi_device_ | | ubi_device_init()
erase_peb() -->| FREE |<-- (fresh flash: all PEBs start here)
(ec += 1) | |
+---+---+
|
| leb_write() or leb_map()
| (rb_get_min selects lowest EC)
v
+-----------+
| |
| ALLOCATED | In a volume's eba_tbl
| (in use) | VID header links to vol_id + leb_num
| |
+-----+-----+
|
| leb_write() (overwrite) or leb_unmap()
| Old PEB moved to dirty_pebs
v
+-------+
| |
| DIRTY | Stale data, awaiting erasure
| |
+---+---+
|
| ubi_device_erase_peb()
| (erase flash, increment EC, write new EC hdr)
v
+-------+
| FREE | Back in free_pebs, ready for reuse
+-------+
At ANY point, if a flash I/O operation fails:
+-------+
I/O error | |
------------> | BAD | Moved to bad_pebs linked list
| | Excluded from all future operations
+-------+
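The lifecycle can also be expressed as a transition table, which is convenient for assertions in tests. This is a minimal sketch of the state diagram above; the enum and function names are hypothetical and do not come from the UBI sources.

```c
#include <assert.h>
#include <stdbool.h>

enum sketch_peb_state {
	SKETCH_FREE,
	SKETCH_ALLOCATED,
	SKETCH_DIRTY,
	SKETCH_BAD,
	SKETCH_NR_STATES
};

/* Return true if the lifecycle diagram contains an edge from `from` to `to`. */
static bool sketch_transition_ok(enum sketch_peb_state from, enum sketch_peb_state to)
{
	static const bool allowed[SKETCH_NR_STATES][SKETCH_NR_STATES] = {
		[SKETCH_FREE]      = { [SKETCH_ALLOCATED] = true, [SKETCH_BAD] = true },
		[SKETCH_ALLOCATED] = { [SKETCH_DIRTY] = true, [SKETCH_BAD] = true },
		[SKETCH_DIRTY]     = { [SKETCH_FREE] = true, [SKETCH_BAD] = true },
		[SKETCH_BAD]       = { [SKETCH_FREE] = true },  /* torture recovery */
	};

	assert(from < SKETCH_NR_STATES && to < SKETCH_NR_STATES);
	return allowed[from][to];
}
```

Note that there is no direct edge from ALLOCATED back to FREE: a block that has held data always passes through DIRTY and an erase first.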
Device Initialization
ubi_device_init() is the most complex function in UBI. It handles two fundamentally different scenarios: initializing a brand-new (never-used) flash device, and re-mounting an existing device after a reboot.
Flow Overview
ubi_device_init(mtd, NULL, &ubi)
|
v
Allocate ubi_device, init mutex, init RBTs
|
v
Check: is device mounted?
(read reserved PEBs 0..N-1, look for valid device headers)
|
+--- NO (fresh flash) -------> Phase 0: First-Time Mount
| |
+--- YES (reboot) --+ |
| | v
| | Write device header to reserved PEBs
| | Erase data PEBs N..total-1
| | Write EC headers (ec=0) to each
| | |
v v |
+--------------------------------------------+
| Phase 1: Read Device Header |
| Read device header from reserved PEBs |
| For each volume in vol_count: |
| Read volume header |
| Allocate ubi_volume + ubi_rbt_item |
| Insert into vols RBT |
+--------------------------------------------+
|
v
+--------------------------------------------+
| Phase 2: Compute Average Erase Count |
| Scan PEBs N..total-1 |
| Read EC headers, sum valid erase counts |
| ec_avg = ec_sum / ec_count |
| (Used as fallback EC for bad blocks) |
+--------------------------------------------+
|
v
+--------------------------------------------+
| Phase 3: PEB Scan & Classification |
| For each PEB from N to total-1: |
| |
| 3.1 EC header invalid? |
| --> bad_pebs (ec = ec_avg) |
| |
| 3.2 EC valid, VID erased (empty)? |
| Probe data area prefix: |
| - prefix erased → free_pebs |
| - prefix non-erased → dirty_pebs |
| (uncommitted write) |
| |
| 3.3 EC valid, VID invalid CRC? |
| --> bad_pebs (ec from EC hdr) |
| |
| 3.4 EC valid, VID valid: |
| 3.4.1 Track max sqnum for global_seqnr|
| 3.4.2 Volume not found in vols RBT? |
| --> dirty_pebs (orphaned) |
| 3.4.3 LEB >= vol.leb_count? |
| --> dirty_pebs (out of range) |
| 3.4.4 LEB not in vol.eba_tbl? |
| --> insert into vol.eba_tbl |
| 3.4.5 LEB already in vol.eba_tbl? |
| Compare sqnum: |
| - new < existing: new-->dirty |
| - new > existing: old-->dirty, |
| new replaces in eba_tbl |
+--------------------------------------------+
|
v
Return ubi_device*
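Phases 2 and 3 of the flow above can be sketched as two small pure functions. The code assumes the scan has already reduced each PEB to a handful of facts (header validity, whether the data prefix is erased, and so on). All names are hypothetical, and returning 0 when no EC header is valid is an assumption rather than documented behavior.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum sketch_vid_state { SKETCH_VID_ERASED, SKETCH_VID_INVALID, SKETCH_VID_VALID };

enum sketch_peb_class {
	SKETCH_CLASS_FREE,
	SKETCH_CLASS_DIRTY,
	SKETCH_CLASS_BAD,
	SKETCH_CLASS_LIVE   /* valid VID: continues to the EBA insert / sqnum comparison */
};

/* Facts gathered about one PEB during the scan. */
struct sketch_scan {
	bool ec_valid;
	enum sketch_vid_state vid;
	bool data_prefix_erased;
	bool volume_known;
	bool lnum_in_range;
};

/* Phase 2: average of the valid erase counters (0 if none are valid). */
static uint32_t sketch_ec_average(const uint32_t *ec, const bool *valid, size_t n)
{
	uint64_t sum = 0;
	uint32_t count = 0;

	for (size_t i = 0; i < n; i++) {
		if (valid[i]) {
			sum += ec[i];
			count++;
		}
	}
	return count ? (uint32_t)(sum / count) : 0;
}

/* Phase 3, steps 3.1 to 3.4.3. */
static enum sketch_peb_class sketch_classify(const struct sketch_scan *s)
{
	assert(s != NULL);

	if (!s->ec_valid) {
		return SKETCH_CLASS_BAD;                       /* 3.1 */
	}
	if (s->vid == SKETCH_VID_ERASED) {                     /* 3.2 */
		return s->data_prefix_erased ? SKETCH_CLASS_FREE : SKETCH_CLASS_DIRTY;
	}
	if (s->vid == SKETCH_VID_INVALID) {
		return SKETCH_CLASS_BAD;                       /* 3.3 */
	}
	if (!s->volume_known || !s->lnum_in_range) {
		return SKETCH_CLASS_DIRTY;                     /* 3.4.2, 3.4.3 */
	}
	return SKETCH_CLASS_LIVE;
}
```

Keeping the classification free of flash I/O makes each branch of the decision tree easy to unit-test in isolation.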
First-Time Mount vs. Reboot
| Aspect | First-Time Mount | Reboot (Re-mount) |
|---|---|---|
| Device header on reserved PEBs | Not present | Already written |
| Phase 0 | Erase all data PEBs, write EC headers with ec=0 | Skipped entirely |
| Phase 1–3 | Runs (all PEBs will be free) | Runs (reconstructs volumes from existing data) |
| Volume data | None (empty EBA tables) | Reconstructed from VID headers on flash |
| Dirty PEBs | None | May exist from incomplete writes before reboot |
| Bad PEBs | Detected from Phase 3 scan | Detected fresh (previous list was in RAM only) |
Sequence Number Conflict Resolution
When two PEBs claim the same (vol_id, leb_num) pair (e.g., a write was interrupted and both the old and new PEB survive), UBI resolves the conflict using the sqnum field in the VID header:
The PEB with the higher sqnum is the newer write and is kept in the EBA table.
The PEB with the lower sqnum is moved to dirty_pebs for later erasure.
This ensures that even after an unexpected power loss, the most recent successful write survives.
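The rule can be sketched as a single comparison. The struct and function names below are hypothetical. The text does not say what happens when two sequence numbers are equal, so the sketch keeps the existing mapping in that case as an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One PEB's claim on a (vol_id, lnum) pair, as read from its VID header. */
struct sketch_claim {
	uint32_t pnum;
	uint64_t sqnum;
};

/*
 * Compare a newly scanned claim against the one already in the EBA table.
 * The newer claim ends up in *kept; the return value is the PEB that
 * should be moved to dirty_pebs.
 */
static uint32_t sketch_resolve(struct sketch_claim *kept, struct sketch_claim found)
{
	assert(kept != NULL);

	if (found.sqnum > kept->sqnum) {
		uint32_t loser = kept->pnum;

		*kept = found;
		return loser;
	}
	return found.pnum;   /* older claim loses (equal sqnum: assumption) */
}
```

Because the outcome depends only on sqnum, the result is the same regardless of the order in which the scan encounters the two PEBs.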
Erased-State Detection
UBI does not assume that erased flash reads as 0xFF. The erased byte value is queried at runtime via Zephyr’s flash_area_erased_val() API. Two internal helpers abstract all erased-state checks:
ubi_get_erased_val(mtd, &val): queries the hardware-reported erased byte value for the partition, once.
ubi_buf_is_erased(buf, len, val): returns true if every byte in buf equals val.
During PEB scan, the erased value is obtained once and passed to all classification helpers. Reserved PEB scan likewise derives the erased magic pattern from the actual erased byte value.
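A buffer check of this kind is a plain byte comparison against the queried value. The function below is a hypothetical stand-in for the internal helper, not its actual implementation; treating an empty buffer as erased is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Return true if every byte of buf equals erased_val (true for len == 0). */
static bool sketch_buf_is_erased(const uint8_t *buf, size_t len, uint8_t erased_val)
{
	assert(buf != NULL || len == 0);

	for (size_t i = 0; i < len; i++) {
		if (buf[i] != erased_val) {
			return false;
		}
	}
	return true;
}
```

Passing the erased value as a parameter, instead of hard-coding 0xFF, is what lets the same check work on flash technologies that erase to a different value.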
Thread Safety
Since v0.5.0, all public API functions acquire a per-device Zephyr mutex (struct k_mutex) before accessing any shared state. This means:
Multiple threads can safely call UBI functions on the same device concurrently.
The mutex provides mutual exclusion (one thread at a time), not read-write differentiation.
The mutex is initialized in ubi_device_init() and held for the duration of each API call.
Callers do not need to provide their own locking.
Single Handle Per Partition
Only one struct ubi_device * handle may be active per flash partition at any time. ubi_device_init() returns -EBUSY if a handle for the given partition_id already exists. The guard is released when ubi_device_deinit() completes.
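A guard like this can be built from a single bitfield indexed by partition ID. The sketch below is illustrative only: the names, the limit of 32 partitions, and the -EINVAL result for an out-of-range ID are assumptions, and a real implementation would also need to protect the bitfield against concurrent callers.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define SKETCH_MAX_PARTITIONS 32u

/* One bit per partition ID; a set bit means a handle is active. */
static uint32_t sketch_claimed;

/* Claim a partition: 0 on success, -EBUSY if a handle already exists. */
static int sketch_registry_claim(uint8_t partition_id)
{
	if (partition_id >= SKETCH_MAX_PARTITIONS) {
		return -EINVAL;
	}

	uint32_t bit = UINT32_C(1) << partition_id;

	if (sketch_claimed & bit) {
		return -EBUSY;
	}
	sketch_claimed |= bit;
	return 0;
}

/* Release a partition so that it can be initialized again. */
static void sketch_registry_release(uint8_t partition_id)
{
	assert(partition_id < SKETCH_MAX_PARTITIONS);
	sketch_claimed &= ~(UINT32_C(1) << partition_id);
}
```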
Deinit Contract
ubi_device_deinit() acquires the device mutex before freeing resources. Any in-flight operations that already hold the mutex will complete before teardown proceeds. The caller must ensure that no other thread will start new operations after calling deinit.
Wear-Leveling
UBI implements a greedy minimum-erase-count wear-leveling strategy.
Write Path
When writing to a LEB, UBI always selects the free PEB with the lowest erase counter:
struct rbnode *min = rb_get_min(&ubi->free_pebs);
Since free_pebs is a red-black tree keyed by erase count, rb_get_min() returns the least-worn block in O(log n) time.
Erase Path
When erasing dirty PEBs, UBI also processes the one with the lowest erase counter first:
struct rbnode *min = rb_get_min(&ubi->dirty_pebs);
After erasing, the PEB’s erase counter is incremented and it is moved back to free_pebs.
Effect
This two-sided greedy approach naturally distributes wear across all PEBs:
Least-worn blocks are consumed first for writes, giving them more cycles.
Least-worn dirty blocks are recycled first, keeping the counter distribution tight.
Over time, all PEBs converge toward a similar erase count.
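The convergence claim can be checked with a small simulation. The model below is deliberately simplified and is not the UBI implementation: it uses 8 PEBs, overwrites a single LEB repeatedly, reclaims dirty blocks immediately after every write, and replaces the red-black trees with a linear scan. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NR_PEBS 8

enum sketch_state { SKETCH_ST_FREE, SKETCH_ST_MAPPED, SKETCH_ST_DIRTY };

struct sketch_peb {
	uint32_t ec;
	enum sketch_state state;
};

/* Linear scan standing in for rb_get_min(): lowest-EC PEB in `state`, or -1. */
static int sketch_min_ec(const struct sketch_peb *pebs, size_t n, enum sketch_state state)
{
	int best = -1;

	for (size_t i = 0; i < n; i++) {
		if (pebs[i].state == state && (best < 0 || pebs[i].ec < pebs[best].ec)) {
			best = (int)i;
		}
	}
	return best;
}

/* Overwrite one LEB `writes` times; return max EC minus min EC afterwards. */
static uint32_t sketch_simulate(size_t writes)
{
	struct sketch_peb pebs[SKETCH_NR_PEBS] = {0};   /* all free, ec = 0 */
	int mapped = -1;
	int dirty;

	for (size_t w = 0; w < writes; w++) {
		/* Write path: take the least-worn free PEB. */
		int target = sketch_min_ec(pebs, SKETCH_NR_PEBS, SKETCH_ST_FREE);

		assert(target >= 0);
		pebs[target].state = SKETCH_ST_MAPPED;
		if (mapped >= 0) {
			pebs[mapped].state = SKETCH_ST_DIRTY;
		}
		mapped = target;

		/* Erase path: reclaim dirty PEBs, least-worn first. */
		while ((dirty = sketch_min_ec(pebs, SKETCH_NR_PEBS, SKETCH_ST_DIRTY)) >= 0) {
			pebs[dirty].ec++;
			pebs[dirty].state = SKETCH_ST_FREE;
		}
	}

	uint32_t lo = pebs[0].ec;
	uint32_t hi = pebs[0].ec;

	for (size_t i = 1; i < SKETCH_NR_PEBS; i++) {
		lo = pebs[i].ec < lo ? pebs[i].ec : lo;
		hi = pebs[i].ec > hi ? pebs[i].ec : hi;
	}
	return hi - lo;
}
```

In this model the writes rotate through the blocks, so the spread between the most-worn and least-worn PEB stays within one erase cycle. Real workloads with long-lived static data would not be this uniform, since a greedy strategy never relocates blocks that are not rewritten.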
Write Flow (Mermaid)
Copy-on-write: the new PEB is fully written before the old mapping is swapped. On write failure, the previous mapping and data remain intact. The write order is EC → DATA → VID; the VID header acts as the commit point that makes the new mapping visible.
flowchart TD
Start["ubi_leb_write(vol_id, lnum, buf, len)"]
Lookup["Look up LEB in volume EBA table"]
SelectFree["Select free PEB with lowest EC\n(rb_get_min on free_pebs)"]
NoFree{"Free PEB available?"}
ErrNospc["Return -ENOSPC"]
WriteEC["Write EC header on new PEB"]
WriteData["Write user data payload"]
WriteVID["Write VID header\n(vol_id, lnum, sqnum++, data_size)\n— commit point —"]
WriteFail{"Write succeeded?"}
MarkBad["Mark new PEB as bad\nRetry with next free PEB"]
SwapEBA["Swap EBA: LEB → new PEB"]
WasOverwrite{"Was overwrite?"}
OldDirty["Move old PEB to dirty_pebs"]
Done["Return 0"]
Start --> Lookup --> SelectFree
SelectFree --> NoFree
NoFree -- No --> ErrNospc
NoFree -- Yes --> WriteEC --> WriteData --> WriteVID --> WriteFail
WriteFail -- No --> MarkBad --> SelectFree
WriteFail -- Yes --> SwapEBA --> WasOverwrite
WasOverwrite -- Yes --> OldDirty --> Done
WasOverwrite -- No --> Done
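The role of the VID header as commit point can be illustrated with a toy model of a power cut. The sketch below tracks only which parts of a PEB were written; the types, the cut points, and the function names are hypothetical and are not taken from the UBI sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Point at which a simulated power loss interrupts the write. */
enum sketch_cut {
	SKETCH_CUT_BEFORE_EC,
	SKETCH_CUT_BEFORE_DATA,
	SKETCH_CUT_BEFORE_VID,
	SKETCH_NO_CUT
};

struct sketch_cow_peb {
	bool has_ec;
	bool has_data;
	bool has_vid;
	uint64_t sqnum;   /* meaningful only when has_vid is true */
};

/* Write in EC -> DATA -> VID order, stopping where the power cut occurs. */
static void sketch_write_peb(struct sketch_cow_peb *peb, uint64_t sqnum, enum sketch_cut cut)
{
	assert(peb != NULL);

	if (cut == SKETCH_CUT_BEFORE_EC) {
		return;
	}
	peb->has_ec = true;
	if (cut == SKETCH_CUT_BEFORE_DATA) {
		return;
	}
	peb->has_data = true;
	if (cut == SKETCH_CUT_BEFORE_VID) {
		return;
	}
	peb->has_vid = true;   /* commit point: the mapping becomes visible */
	peb->sqnum = sqnum;
}

/* After a reboot scan: does the new PEB replace the old one for this LEB? */
static bool sketch_new_wins(const struct sketch_cow_peb *old_peb,
			    const struct sketch_cow_peb *new_peb)
{
	return new_peb->has_vid && new_peb->sqnum > old_peb->sqnum;
}
```

A PEB whose write stopped before the VID header has data but no VID, so the init scan would treat it as an uncommitted write and the old mapping stays in place.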
Read Flow (Mermaid)
flowchart TD
Start["ubi_leb_read(vol_id, lnum, offset, buf, len)"]
FindVol["Find volume in vols RBT"]
FindLEB["Look up LEB in volume EBA table"]
IsMapped{"LEB mapped?"}
ErrInval["Return -EINVAL"]
ReadFlash["Read from PEB at data offset + user offset"]
Done["Return 0"]
Start --> FindVol --> FindLEB --> IsMapped
IsMapped -- No --> ErrInval
IsMapped -- Yes --> ReadFlash --> Done
Erase / Reclaim Flow (Mermaid)
flowchart TD
Start["ubi_device_erase_peb()"]
HasDirty{"dirty_pebs non-empty?"}
NoDirty["Return 0 (nothing to reclaim)"]
SelectMin["Select dirty PEB with lowest EC\n(rb_get_min on dirty_pebs)"]
Erase["Erase PEB on flash"]
EraseFail{"Erase succeeded?"}
MarkBad["Mark PEB as bad"]
IncEC["Increment erase counter"]
WriteEC["Write new EC header"]
MoveToFree["Move PEB to free_pebs"]
Done["Return 0"]
Start --> HasDirty
HasDirty -- No --> NoDirty
HasDirty -- Yes --> SelectMin --> Erase --> EraseFail
EraseFail -- No --> MarkBad --> Done
EraseFail -- Yes --> IncEC --> WriteEC --> MoveToFree --> Done
Dual-Bank Mechanism
UBI stores device and volume metadata on reserved PEBs as mirrors. The number of reserved PEBs is configurable via CONFIG_UBI_DEV_HDR_NR_OF_RES_PEBS (default 2, range 2–4). Two PEBs are always kept active (containing identical copies); additional PEBs serve as cold spares that are promoted when an active PEB fails.
PEB Classification
| State | Description |
|---|---|
| Active | Contains a valid device header (correct magic + CRC). Participates in dual-bank writes. |
| Spare | Erased/empty (hardware erased value). Never written until an active PEB fails. |
| Corrupt | Contains invalid data (bad magic or CRC). Candidate for in-place recovery or abandonment. |
Write Sequence
When metadata changes (volume created, removed, or resized), UBI writes to both active PEBs sequentially:
1. Erase active reserved PEB (bank 1)
2. Write updated headers to active reserved PEB (bank 1)
3. Erase active reserved PEB (bank 2)
4. Write updated headers to active reserved PEB (bank 2)
If a write fails (dead PEB), UBI promotes a cold spare to replace it.
Init-Time Recovery
At ubi_device_init(), UBI scans all reserved PEBs (indices 0..N-1):
scan_reserved_pebs()
|
+-- All N PEBs valid? --> Normal init (no recovery needed)
|
+-- >= 1 active + corrupt or spare PEBs?
| |
| +-- Read full content from active PEB (highest revision)
| +-- For each corrupt PEB: erase + write canonical data
| | +-- Erase/write succeeds --> PEB recovered in-place
| | +-- Erase/write fails ----> PEB is dead, promote spare
| +-- At least 2 active PEBs after recovery? --> Init succeeds
|
+-- 0 active PEBs? --> Init fails (unrecoverable)
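The recovery decision tree can be approximated in a few lines. This sketch simplifies the real procedure: it models each reserved PEB as a small record, uses an io_ok flag to stand in for whether erase + write succeeds, repairs corrupt PEBs first, and then promotes spares until two banks are active. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum sketch_bank_state { SKETCH_BANK_ACTIVE, SKETCH_BANK_SPARE, SKETCH_BANK_CORRUPT };

struct sketch_bank {
	enum sketch_bank_state state;
	uint32_t revision;   /* meaningful only while ACTIVE */
	bool io_ok;          /* models whether erase + write on this PEB succeeds */
};

/* Index of the active bank with the highest revision, or -1 if none. */
static int sketch_pick_canonical(const struct sketch_bank *b, size_t n)
{
	int best = -1;

	for (size_t i = 0; i < n; i++) {
		if (b[i].state == SKETCH_BANK_ACTIVE &&
		    (best < 0 || b[i].revision > b[best].revision)) {
			best = (int)i;
		}
	}
	return best;
}

/* Return the number of active banks after recovery, or -1 if unrecoverable. */
static int sketch_recover(struct sketch_bank *b, size_t n)
{
	int src = sketch_pick_canonical(b, n);
	int active = 0;

	assert(b != NULL);
	if (src < 0) {
		return -1;   /* 0 active PEBs: init fails */
	}

	/* Pass 1: in-place recovery of corrupt banks, then count active banks. */
	for (size_t i = 0; i < n; i++) {
		if (b[i].state == SKETCH_BANK_CORRUPT && b[i].io_ok) {
			b[i].state = SKETCH_BANK_ACTIVE;
			b[i].revision = b[src].revision;
		}
		if (b[i].state == SKETCH_BANK_ACTIVE) {
			active++;
		}
	}

	/* Pass 2: promote spares while fewer than two banks are active. */
	for (size_t i = 0; i < n && active < 2; i++) {
		if (b[i].state == SKETCH_BANK_SPARE && b[i].io_ok) {
			b[i].state = SKETCH_BANK_ACTIVE;
			b[i].revision = b[src].revision;
			active++;
		}
	}
	return active;
}
```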
Runtime Recovery
Volume operations (ubi_vol_hdr_append, ubi_vol_hdr_remove, ubi_vol_hdr_update) call validate_reserved_pebs() before committing. If a degraded state is detected, recovery is attempted transparently.
Read-Only Degraded Mode
When only 1 active PEB remains and 0 spares are available, the system enters read-only degraded mode.
All public mutators pass through a central mutation gate (ubi_mutation_allowed() in ubi_internal.h) before performing any flash I/O. The gate classifies each operation into one of three mutation classes and applies the degraded-mode policy:
| Mutation class | Operations | Degraded-mode policy |
|---|---|---|
| | | Blocked ( |
| | | Allowed |
| | | Allowed |
ubi_device_erase_peb() is intentionally allowed in degraded mode. After its normal dirty-PEB maintenance cycle, it attempts to recover the reserved PEB bank by calling ubi_dev_hdr_read(), which internally scans all reserved PEBs and attempts erase+rewrite of any corrupt copies. If recovery succeeds, the read_only_degraded flag is cleared and the device returns to normal operation. This allows self-healing without requiring a reboot — the application’s regular garbage-collection loop serves as the recovery trigger.
Read-only operations are not gated and always succeed:
| Operation | Degraded mode behavior |
|---|---|
| | Works normally |
| | Works normally |
| | Works normally |
| | Works normally ( |
| | Works normally |
State Summary
| Active PEBs | Spares | State | Can update metadata? |
|---|---|---|---|
| 2 | N−2 | Healthy | Yes |
| 1 | ≥1 | Degraded | Yes (spare promoted during recovery) |
| 1 | 0 | Critical | No (read-only mode) |
| 0 | any | Dead | No (cannot init) |
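The state summary maps directly onto a small classification function. The enum and function names below are hypothetical and serve only to restate the table in code.

```c
#include <assert.h>
#include <stdbool.h>

enum sketch_health {
	SKETCH_HEALTHY,
	SKETCH_DEGRADED,
	SKETCH_CRITICAL,
	SKETCH_DEAD
};

/* Classify the reserved-PEB bank from its active and spare counts. */
static enum sketch_health sketch_bank_health(unsigned int active, unsigned int spares)
{
	assert(active <= 2);   /* at most two banks are active at a time */

	if (active == 2) {
		return SKETCH_HEALTHY;
	}
	if (active == 1) {
		return spares >= 1 ? SKETCH_DEGRADED : SKETCH_CRITICAL;
	}
	return SKETCH_DEAD;
}

/* Metadata can be updated only in the Healthy and Degraded states. */
static bool sketch_can_update_metadata(enum sketch_health health)
{
	return health == SKETCH_HEALTHY || health == SKETCH_DEGRADED;
}
```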
PEB State Transitions
+-------------------+
| SPARE (empty) |
| erased |
+--------+----------+
|
| (promoted during recovery
| or overwrite when active fails)
v
+-------------------+ power loss / bit rot +-------------------+
| ACTIVE | --------------------------> | CORRUPT |
| valid dev hdr + | | bad magic or CRC |
| valid vol hdrs | | |
+--------+----------+ +--------+----------+
^ |
| erase + write canonical content |
+<------------------------------------------------+
| (in-place recovery from other active)
|
+-- erase/write fails --> PEB is DEAD (stays corrupt)
Volume Management
Volume Types
| Type | Enum | Description |
|---|---|---|
| Static | | Fixed content. Cannot be resized after creation. |
| Dynamic | | Content can change. Supports runtime resizing. |
Create
ubi_volume_create() reads the persisted vol_id_watermark from the device header, assigns it as the new volume’s ID, bumps the watermark, and writes the updated device header plus new volume header to both active reserved PEBs atomically. The watermark is monotonic — IDs are never reused, even after volume removal. If vol_id_watermark reaches UINT32_MAX, create returns -ENOSPC. The volume is then added to the in-RAM vols RBT. The PEBs for the volume are not pre-allocated — they are claimed from free_pebs on-demand when LEBs are written or mapped.
If a volume with the same name and identical configuration (type, leb_count) already exists, the function returns successfully with the existing volume’s ID (idempotent behavior). If a volume with the same name but different configuration exists, the function returns -EEXIST. Volume creation is transactional: RAM structures are allocated before the flash commit, so a failed create leaves no persistent metadata.
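The create rules (monotonic IDs, idempotent re-create, -EEXIST on a configuration mismatch, -ENOSPC when IDs run out) can be sketched against an in-RAM table. This model omits the flash commit entirely; the types, the four-entry table, and the -ENOMEM result for a full table are assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_NAME_LEN 16

struct sketch_vol {
	char name[SKETCH_NAME_LEN];
	uint8_t type;
	uint32_t leb_count;
	uint32_t vol_id;
};

struct sketch_dev {
	struct sketch_vol vols[4];
	size_t vol_count;
	uint32_t vol_id_watermark;
};

/* Create a volume or return the ID of an identical existing one. */
static int sketch_volume_create(struct sketch_dev *dev, const char *name,
				uint8_t type, uint32_t leb_count, uint32_t *vol_id)
{
	assert(dev != NULL && name != NULL && vol_id != NULL);

	for (size_t i = 0; i < dev->vol_count; i++) {
		struct sketch_vol *v = &dev->vols[i];

		if (strncmp(v->name, name, SKETCH_NAME_LEN) != 0) {
			continue;
		}
		if (v->type != type || v->leb_count != leb_count) {
			return -EEXIST;   /* same name, different configuration */
		}
		*vol_id = v->vol_id;      /* idempotent: report the existing volume */
		return 0;
	}

	if (dev->vol_id_watermark == UINT32_MAX) {
		return -ENOSPC;           /* volume IDs exhausted */
	}
	if (dev->vol_count == sizeof(dev->vols) / sizeof(dev->vols[0])) {
		return -ENOMEM;           /* assumption: not specified in the text */
	}

	struct sketch_vol *v = &dev->vols[dev->vol_count++];

	memset(v, 0, sizeof(*v));
	strncpy(v->name, name, SKETCH_NAME_LEN - 1);
	v->type = type;
	v->leb_count = leb_count;
	v->vol_id = dev->vol_id_watermark++;   /* IDs are never reused */
	*vol_id = v->vol_id;
	return 0;
}
```

Calling the function twice with the same name and configuration returns the same ID and leaves the watermark unchanged, which is what makes the operation safe to repeat at every boot.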
Resize
ubi_volume_resize() is only supported for dynamic volumes and rejects leb_count == 0. It updates the leb_count in the volume header on both active reserved PEBs and adjusts the in-RAM configuration. Shrink is transactional: the flash metadata update commits before trimming EBA entries and reclaiming PEBs to dirty. Grow checks capacity accounting (bad_peb_count subtracted from usable PEBs).
Remove
ubi_volume_remove() removes the volume header from the active reserved PEBs, then reclaims mapped PEBs to dirty_pebs and frees in-RAM structures. Reclaim and index cleanup after a successful metadata remove are best-effort — errors are logged but the operation returns success once the flash metadata is gone.