fix(cam_hal): replace recursive cam_take() with bounded loop prevents stack-overflow in cam_task #758

Open
wants to merge 1 commit into
base: master
from

Conversation

RubenKelevra

@RubenKelevra RubenKelevra commented Jun 30, 2025

Description

The camera driver’s ISR (ll_cam_send_event) writes incoming DMA/VSYNC events into a queue. When that queue overflows, it logs the message below and stops the camera:

void IRAM_ATTR ll_cam_send_event(cam_obj_t *cam, cam_event_t cam_event, BaseType_t * HPTaskAwoken)
{
    if (xQueueSendFromISR(cam->event_queue, (void *)&cam_event, HPTaskAwoken) != pdTRUE) {
        ll_cam_stop(cam);
        cam->state = CAM_STATE_IDLE;
        ESP_CAMERA_ETS_PRINTF(DRAM_STR("cam_hal: EV-%s-OVF\r\n"), cam_event==CAM_IN_SUC_EOF_EVENT ? DRAM_STR("EOF") : DRAM_STR("VSYNC"));
    }
}
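
To make the overflow path concrete, here is a minimal host-side sketch in plain C. It is not the driver code: the FreeRTOS queue is replaced by a tiny fixed-size buffer, and `cam_sim_t`, `queue_send`, `send_event` and `QUEUE_LEN` are hypothetical names chosen for illustration. It only shows the behaviour described above: a full queue rejects the event, and the sender then stops the camera and marks it idle.

```c
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_LEN 2  /* hypothetical depth, kept small to force an overflow */

typedef enum { EV_EOF, EV_VSYNC } event_t;
typedef enum { STATE_IDLE, STATE_BUSY } state_t;

typedef struct {
    event_t items[QUEUE_LEN];
    size_t  count;
} event_queue_t;

typedef struct {
    event_queue_t queue;
    state_t       state;
    unsigned      overflows;  /* how often an event had to be dropped */
} cam_sim_t;

/* Non-blocking send: returns false when the queue is full,
 * standing in for a failed xQueueSendFromISR(). */
static bool queue_send(event_queue_t *q, event_t ev)
{
    if (q->count == QUEUE_LEN) {
        return false;
    }
    q->items[q->count++] = ev;
    return true;
}

/* Stand-in for ll_cam_send_event(): on overflow, go idle and count it. */
static void send_event(cam_sim_t *cam, event_t ev)
{
    if (!queue_send(&cam->queue, ev)) {
        cam->state = STATE_IDLE;  /* the frame in flight is lost */
        cam->overflows++;
    }
}
```

With a depth of 2, the third event sent without a consumer draining the queue is the one that flips the state to idle, which is the situation that leaves an incomplete frame behind.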

If a frame is incomplete (for example after such an overflow), cam_take() will recursively retry fetching a frame until it finds a JPEG end marker:

    if (dma_buffer) {
        if(cam_obj->jpeg_mode){
...
            ESP_LOGW(TAG, "NO-EOI");
            cam_give(dma_buffer);
            TickType_t ticks_spent = xTaskGetTickCount() - start;
            if (ticks_spent >= timeout) {
                return NULL; /* We are out of time */
            }
            return cam_take(timeout - ticks_spent);//recurse!!!!
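
The timeout arithmetic in this excerpt relies on unsigned subtraction. As a small sketch, assuming a 32-bit unsigned tick counter (`tick_t` below is a hypothetical stand-in for `TickType_t`, and both helper names are made up for illustration), the elapsed time stays correct even if the counter wraps between `start` and `now`:

```c
#include <stdint.h>

typedef uint32_t tick_t;  /* assumption: stands in for a 32-bit TickType_t */

/* Elapsed ticks since `start`. Unsigned arithmetic wraps modulo 2^32,
 * so the result is still right if the counter rolled over once. */
static tick_t ticks_elapsed(tick_t start, tick_t now)
{
    return now - start;
}

/* Remaining budget, clamped to zero once the timeout is used up. */
static tick_t ticks_remaining(tick_t start, tick_t now, tick_t timeout)
{
    tick_t spent = ticks_elapsed(start, now);
    return (spent >= timeout) ? 0 : timeout - spent;
}
```

For example, with `start = 0xFFFFFFF0` and `now = 0x10` the elapsed value is 32 ticks, not a huge number.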

Each recursive call consumes additional stack space. Because the camera task stack size defaults to only 2048 bytes:

    config CAMERA_TASK_STACK_SIZE
        int "CAM task stack size"
        default 2048

repeated recursion after several “EV‑EOF‑OVF” messages can exhaust the stack of the cam_task. Once the guard bytes at the end of the task’s stack are overwritten, FreeRTOS reports a stack overflow, as seen in the logs reported by @turenkomv.
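
The detection mechanism mentioned here can be modelled on a host machine. This is a simplified sketch, not FreeRTOS source: the `0xA5` fill byte matches the `0xa5a5a5a5` word visible in the backtrace posted later in this thread, while `GUARD_BYTES` and all function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FILL_BYTE   0xA5u  /* pattern written into a fresh task stack */
#define GUARD_BYTES 16u    /* hypothetical number of bytes checked at the limit */

/* Prepare a simulated stack: every byte starts as the fill pattern. */
static void stack_init(uint8_t *stack, size_t size)
{
    memset(stack, FILL_BYTE, size);
}

/* The stack grows downward, so index 0 is the limit.
 * Overflow is reported once any guard byte no longer holds the pattern. */
static bool stack_overflowed(const uint8_t *stack)
{
    for (size_t i = 0; i < GUARD_BYTES; i++) {
        if (stack[i] != FILL_BYTE) {
            return true;
        }
    }
    return false;
}

/* Count untouched bytes from the limit upward: the smallest amount of
 * free stack seen so far (the "high-water mark"). */
static size_t stack_high_water(const uint8_t *stack, size_t size)
{
    size_t free_bytes = 0;
    while (free_bytes < size && stack[free_bytes] == FILL_BYTE) {
        free_bytes++;
    }
    return free_bytes;
}
```

In this model, each extra level of recursion overwrites more of the pattern from the top down, so the high-water value shrinks until the guard region itself is hit.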

So the stack overflow likely occurs when the camera generates bad frames (due to event queue overflow) and cam_take() repeatedly recurses while trying to recover.

To fix this we switch to a loop instead:

  • Re-wrote cam_take() as a single while-loop – stack depth is now constant and independent of retry count.
  • Added strict timeout tracking (remaining = timeout - elapsed); function can never block longer than the caller’s timeout.
  • ESP32-S3 only
    • capped GDMA reset storms to 3 attempts (MAX_GDMA_RESETS)
    • logs one “giving up” warning, then yields (vTaskDelay(1)) to avoid busy-spin after hardware is wedged.
  • Non-S3 targets
    • emit early ESP_LOGW when a NULL frame ever appears, then yield one tick per loop to prevent CPU thrash.
  • Maintained existing JPEG-EOI trimming and YUV→GRAY memcpy paths; behavior unchanged on successful frames.
  • Inline comment links to esp32-camera commit 984999f / issue esp_camera_fb_get() fails on ESP32S3 after device joins an AP #620 for future context.
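
The loop structure described in the list above can be sketched as follows. This is a host-side illustration under stated assumptions, not the code in this PR: `sim_t`, `sim_fetch` and `take_frame` are hypothetical, the camera is replaced by a scripted simulation with a fake clock, and every fetch is assumed to cost one tick (which plays the role of the one-tick yield).

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_RESETS 3u  /* mirrors the cap of 3 attempts described above */

typedef uint32_t tick_t;  /* assumption: 32-bit tick counter */

typedef struct {
    bool valid;    /* false models a NULL frame pointer */
    bool has_eoi;  /* true when the JPEG end marker was found */
} frame_t;

/* Hypothetical simulated camera: scripted failures plus a fake clock. */
typedef struct {
    tick_t   clock;        /* simulated tick counter */
    unsigned null_left;    /* NULL frames still to deliver */
    unsigned no_eoi_left;  /* truncated frames still to deliver */
    unsigned resets;       /* reset attempts performed */
} sim_t;

/* Each fetch costs one tick, so a failing source cannot stall time.
 * A real fetch would block for at most `max_wait` ticks. */
static frame_t sim_fetch(sim_t *sim, tick_t max_wait)
{
    frame_t f = { .valid = true, .has_eoi = true };
    (void)max_wait;
    sim->clock++;
    if (sim->null_left > 0) {
        sim->null_left--;
        f.valid = false;
        f.has_eoi = false;
    } else if (sim->no_eoi_left > 0) {
        sim->no_eoi_left--;
        f.has_eoi = false;
    }
    return f;
}

/* Returns true and fills *out on success, false once the timeout is spent.
 * All retries happen in one loop, so stack depth does not grow. */
static bool take_frame(sim_t *sim, tick_t timeout, frame_t *out)
{
    const tick_t start = sim->clock;
    unsigned resets = 0;

    for (;;) {
        tick_t elapsed = sim->clock - start;
        if (elapsed >= timeout) {
            return false;                  /* out of time */
        }
        tick_t remaining = timeout - elapsed;

        frame_t f = sim_fetch(sim, remaining);
        if (!f.valid) {
            if (resets < MAX_RESETS) {     /* bounded recovery attempts */
                resets++;
                sim->resets++;
            }
            continue;                      /* the timeout check ends the loop */
        }
        if (!f.has_eoi) {
            continue;                      /* incomplete JPEG: drop and retry */
        }
        *out = f;
        return true;
    }
}
```

Whether the source delivers 5 or 5000 bad frames, the function uses one stack frame and returns no later than the caller's timeout; only the number of loop iterations changes.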

Related

Fixes crash reported by @turenkomv here: esphome/esphome#8832 (comment)

Testing

The ESP32S I got seems to be less affected, but runs the new code fine: I've patched the version 2.0.15 currently used in ESPHome with my changes.

@turenkomv would you be so kind as to test this on your hardware?

Checklist

Before submitting a Pull Request, please ensure the following:

  • 🚨 This PR does not introduce breaking changes.
  • All CI checks (GH Actions) pass. (?)
  • Documentation is updated as needed.
  • Tests are updated or added as necessary.
  • Code is well-commented, especially in complex areas.
  • Git history is clean — commits are squashed to the minimum necessary.

@RubenKelevra
Author

@Fexiven a review would be appreciated!

@me-no-dev me-no-dev requested a review from Copilot July 1, 2025 07:33

@Copilot Copilot AI left a comment

Pull Request Overview

This PR addresses a stack overflow issue in the camera driver by replacing the recursive cam_take() calls with a bounded loop.

  • Replaces recursive frame fetching with a loop that tracks elapsed time.
  • Introduces GDMA reset attempts (capped at 3) for ESP32-S3 and adjusts behavior accordingly.

@me-no-dev
Member

@RubenKelevra please have a look at the copilot suggestion, this error in CI: https://github.com/espressif/esp32-camera/actions/runs/15981710752/job/45109710238?pr=758#step:4:824

LGTM otherwise

@turenkomv

turenkomv commented Jul 1, 2025

I'd like to test it, but I'm not sure how to update the esp32_camera driver in ESPHome to use the version from your pull request.

I tried doing it like this: esphome/esphome@a72405c — but it doesn't seem like anything actually changed. Could you please guide me on how to properly test your version?

@me-no-dev
Member

@turenkomv that is because the PR has not yet been merged. Once that happens and a new version is released, you will be able to grab it from the component manager.

fix(cam_hal): replace recursive cam_take() with bounded loop – prevents stack-overflow in cam_task

The old cam_take() used recursion to retry if
* no JPEG EOI was found or
* a NULL frame pointer was returned (GDMA glitch on ESP32-S3).

Under heavy loss conditions this could overflow the cam_task stack
and reboot the whole system.

* Re-wrote cam_take() as a single while-loop – stack depth is now
  constant and independent of retry count.
* Added strict timeout tracking (`remaining = timeout - elapsed`);
  function can never block longer than the caller’s timeout.
* ESP32-S3 only
  * capped GDMA reset storms to 3 attempts (`MAX_GDMA_RESETS`)
  * logs one “giving up” warning, then yields (`vTaskDelay(1)`)
    to avoid busy-spin after hardware is wedged.
* Non-S3 targets
  * emit early `ESP_LOGW` when a NULL frame ever appears,
    then yield one tick per loop to prevent CPU thrash.
* Maintained existing JPEG-EOI trimming and YUV→GRAY memcpy paths;
  behaviour unchanged on successful frames.
* Inline comment links to esp32-camera commit 984999f / issue espressif#620
  for future context.
@RubenKelevra RubenKelevra force-pushed the fix_cam-task_stack-overflow branch from 887b719 to c4204d3 Compare July 1, 2025 13:09
@RubenKelevra
Author

@RubenKelevra please have a look at the copilot suggestion, this error in CI: https://github.com/espressif/esp32-camera/actions/runs/15981710752/job/45109710238?pr=758#step:4:824

LGTM otherwise

Woops, good catch, thank you!

@RubenKelevra
Author

RubenKelevra commented Jul 1, 2025

I'd like to test it, but I'm not sure how to update the esp32_camera driver in ESPHome to use the version from your pull request.

I tried doing it like this: esphome/esphome@a72405c — but it doesn't seem like anything actually changed. Could you please guide me on how to properly test your version?

Thanks for your report, but I don't see an error in your configuration. Note that this patch is not expected to make the camera record frames correctly (yet); it only fixes the crash caused by the stack overflow.

So it's expected that you still see "EV-EOF-OVF" messages, but no crash.

Edit: Maybe it doesn't like building from a branch, try this tag instead:

https://github.com/RubenKelevra/espressif_esp32-camera/releases/tag/fix_cam-task-stack-overflow-0.1

@turenkomv

turenkomv commented Jul 1, 2025

I tried setting the tag fix_cam-task-stack-overflow-0.1, but nothing changed.
turenkomv/esphome@61bc003

After a variable number of messages, I still get the same error:

cam_hal: EV-EOF-OVF
***ERROR*** A stack overflow in task cam_task has been detected.

Is it possible that the new driver version isn't actually being pulled into the build?

log:

[16:30:21]ESP-ROM:esp32s3-20210327
[16:30:21]Build:Mar 27 2021
[16:30:21]rst:0xc (RTC_SW_CPU_RST),boot:0x8 (SPI_FAST_FLASH_BOOT)
[16:30:21]Saved PC:0x40376f5d
[16:30:21]SPIWP:0xee
[16:30:21]mode:DIO, clock div:1
[16:30:21]load:0x3fce3818,len:0x1750
[16:30:21]load:0x403c9700,len:0x4
[16:30:21]load:0x403c9704,len:0xbe4
[16:30:21]load:0x403cc700,len:0x2d34
[16:30:21]entry 0x403c9908
[16:30:21]I (26) boot: ESP-IDF 5.1.6 2nd stage bootloader
[16:30:21]I (26) boot: compile time May 21 2025 10:42:20
[16:30:21]I (26) boot: Multicore bootloader
[16:30:21]I (29) boot: chip revision: v0.2
[16:30:21]I (33) boot.esp32s3: Boot SPI Speed : 80MHz
[16:30:21]I (38) boot.esp32s3: SPI Mode       : DIO
[16:30:21]I (42) boot.esp32s3: SPI Flash Size : 16MB
[16:30:21]I (47) boot: Enabling RNG early entropy source...
[16:30:21]I (53) boot: Partition Table:
[16:30:21]I (56) boot: ## Label            Usage          Type ST Offset   Length
[16:30:21]I (64) boot:  0 otadata          OTA data         01 00 00009000 00002000
[16:30:21]I (71) boot:  1 phy_init         RF data          01 01 0000b000 00001000
[16:30:21]I (78) boot:  2 app0             OTA app          00 10 00010000 007c0000
[16:30:21]I (86) boot:  3 app1             OTA app          00 11 007d0000 007c0000
[16:30:21]I (93) boot:  4 nvs              WiFi data        01 02 00f90000 0006d000
[16:30:21]I (101) boot: End of partition table
[16:30:21]I (105) esp_image: segment 0: paddr=00010020 vaddr=3c0b0020 size=2b28ch (176780) map
[16:30:21]I (145) esp_image: segment 1: paddr=0003b2b4 vaddr=3fc9a800 size=04d64h ( 19812) load
[16:30:21]I (150) esp_image: segment 2: paddr=00040020 vaddr=42000020 size=a4a00h (674304) map
[16:30:21]I (272) esp_image: segment 3: paddr=000e4a28 vaddr=3fc9f564 size=012b8h (  4792) load
[16:30:21]I (274) esp_image: segment 4: paddr=000e5ce8 vaddr=40374000 size=16784h ( 92036) load
[16:30:21]I (308) boot: Loaded app from partition at offset 0x10000
[16:30:21]I (308) boot: Disabling RNG early entropy source...
[16:30:21]I (309) octal_psram: vendor id    : 0x0d (AP)
[16:30:21]I (313) octal_psram: dev id       : 0x02 (generation 3)
[16:30:21]I (319) octal_psram: density      : 0x03 (64 Mbit)
[16:30:21]I (325) octal_psram: good-die     : 0x01 (Pass)
[16:30:21]I (330) octal_psram: Latency      : 0x01 (Fixed)
[16:30:21]I (335) octal_psram: VCC          : 0x01 (3V)
[16:30:21]I (340) octal_psram: SRF          : 0x01 (Fast Refresh)
[16:30:21]I (346) octal_psram: BurstType    : 0x01 (Hybrid Wrap)
[16:30:21]I (352) octal_psram: BurstLen     : 0x01 (32 Byte)
[16:30:21]I (357) octal_psram: Readlatency  : 0x02 (10 cycles@Fixed)
[16:30:21]I (363) octal_psram: DriveStrength: 0x00 (1/1)
[16:30:21]I (369) MSPI Timing: PSRAM timing tuning index: 5
[16:30:21]I (374) esp_psram: Found 8MB PSRAM device
[16:30:21]I (379) esp_psram: Speed: 80MHz
[16:30:21]I (383) cpu_start: Multicore app
[16:30:21]I (809) esp_psram: SPI SRAM memory test OK
[16:30:21]I (818) cpu_start: Pro cpu start user code
[16:30:21]I (818) cpu_start: cpu freq: 240000000 Hz
[16:30:21]I (819) app_init: Application information:
[16:30:21]I (819) app_init: Project name:     m5stack-camera-1
[16:30:21]I (819) app_init: App version:      2025.7.0-dev
[16:30:21]I (819) app_init: Compile time:     Jul  1 2025 16:27:31
[16:30:21]I (819) app_init: ELF file SHA256:  7bc2f0d95...
[16:30:21]I (819) app_init: ESP-IDF:          5.3.2
[16:30:21]I (820) efuse_init: Min chip rev:     v0.0
[16:30:21]I (820) efuse_init: Max chip rev:     v0.99 
[16:30:21]I (820) efuse_init: Chip rev:         v0.2
[16:30:21]I (820) heap_init: Initializing. RAM available for dynamic allocation:
[16:30:21]I (820) heap_init: At 3FCA49B0 len 00044D60 (275 KiB): RAM
[16:30:21]I (821) heap_init: At 3FCE9710 len 00005724 (21 KiB): RAM
[16:30:21]I (821) heap_init: At 3FCF0000 len 00008000 (32 KiB): DRAM
[16:30:21]I (821) heap_init: At 600FE100 len 00001EE8 (7 KiB): RTCRAM
[16:30:21]I (822) esp_psram: Adding pool of 8192K of PSRAM memory to heap allocator
[16:30:21]I (822) spi_flash: detected chip: gd
[16:30:21]I (823) spi_flash: flash io: dio
[16:30:21]W (823) i2c: This driver is an old driver, please migrate your application code to adapt `driver/i2c_master.h`
[16:30:21]I (823) sleep: Configure to isolate all GPIO pins in sleep state
[16:30:21]I (824) sleep: Enable automatic switching of GPIO sleep configuration
[16:30:21]I (824) main_task: Started on CPU0
[16:30:21]I (825) main_task: Calling app_main()
[16:30:21]I (899) main_task: Returned from app_main()
[16:30:23]cam_hal: EV-EOF-OVF
[16:30:23]cam_hal: EV-EOF-OVF
[16:30:23]cam_hal: EV-EOF-OVF
[16:30:23]cam_hal: EV-EOF-OVF
[16:30:23]cam_hal: EV-EOF-OVF
[16:30:23]cam_hal: EV-EOF-OVF
[16:30:23]cam_hal: EV-EOF-OVF
[16:30:23]cam_hal: EV-EOF-OVF
[16:30:24]cam_hal: EV-EOF-OVF
[16:30:24]cam_hal: EV-EOF-OVF
[16:30:24]cam_hal: EV-EOF-OVF
[16:30:24]
[16:30:24]***ERROR*** A stack overflow in task cam_task has been detected.
[16:30:24]
[16:30:24]
[16:30:24]Backtrace: 0x4037701e:0x3fcb2af0 0x4037e1ed:0x3fcb2b10 0x4037f162:0x3fcb2b30 0x4038009b:0x3fcb2bb0 0x4037f294:0x3fcb2be0 0x4037f28a:0xa5a5a5a5 |<-CORRUPTED
[16:30:24]
[16:30:24]
[16:30:24]
[16:30:24]
[16:30:24]ELF file SHA256: 7bc2f0d95
[16:30:24]
[16:30:24]Rebooting...

config:

esphome:
  name: m5stack-camera-1
  friendly_name: M5Stack Camera 1

external_components:
  # - source: github://pr#8843
  #   components: [ esp32_camera ]
  #   refresh: 0s
  - source:      
      type: git
      url: https://github.com/turenkomv/esphome
      ref: dev
    components: [ esp32_camera ]
    refresh: 0s

esp32:
  board: esp32-s3-devkitm-1
  flash_size: 16MB
  framework:
    type: esp-idf
    version: latest
    sdkconfig_options:
      CONFIG_ESP_DEFAULT_CPU_FREQ_MHZ_240: y
      #CONFIG_CAMERA_JPEG_MODE_FRAME_SIZE_AUTO: n
      #CONFIG_JPEG_MODE_FRAME_SIZE_CUSTOM: y
      #CONFIG_CAMERA_JPEG_MODE_FRAME_SIZE: '2097152'

psram:
  mode: octal
  speed: 80MHz

# Enable logging
logger:
  baud_rate: 0

# Enable Home Assistant API
api:
  encryption:
    key: !secret api_encryption_key

ota:
  - platform: esphome
    id: ota_esphome
    password: !secret ota_password

wifi:
  ssid: !secret wifi_ssid
  password: !secret wifi_password

captive_portal:

esp32_camera:
  name: M5Stack Camera 1
  external_clock:
    pin: GPIO11
    frequency: 20MHz
  i2c_pins:
    sda: GPIO17
    scl: GPIO41
  data_pins:
    - GPIO6  # D0
    - GPIO15 # D1
    - GPIO16 # D2
    - GPIO7  # D3
    - GPIO5  # D4
    - GPIO10 # D5
    - GPIO4  # D6
    - GPIO13 # D7
  vsync_pin: GPIO42
  href_pin: GPIO18
  pixel_clock_pin: GPIO12
  reset_pin: GPIO21
  resolution: FHD
  jpeg_quality: 2
  frame_buffer_count: 2
  max_framerate: 3 fps

esp32_camera_web_server:
  - port: 8080
    mode: SNAPSHOT

light:
  - platform: status_led
    name: "Status LED"
    pin:
      number: GPIO14
      inverted: true

Also, a few times right after flashing, I got this error:

[16:29:13]cam_hal: EV-EOF-OVF
[16:29:13]Guru Meditation Error: Core  1 panic'ed (LoadProhibited). Exception was unhandled.
[16:29:13]
[16:29:13]Core  1 register dump:
[16:29:13]PC      : 0x40056f8c  PS      : 0x00060833  A0      : 0x8037e2d5  A1      : 0x3fcb2da0  
[16:29:13]A2      : 0x3fcb2e10  A3      : 0x00001800  A4      : 0x00000004  A5      : 0x3fcb2e10  
[16:29:13]A6      : 0x00000000  A7      : 0x00000000  A8      : 0x00000000  A9      : 0x03cb2b74  
[16:29:13]A10     : 0x01ffffff  A11     : 0x3c0b35ec  A12     : 0x3c0b3994  A13     : 0x3fcb2d90  
[16:29:13]A14     : 0x3fcb2d70  A15     : 0x0000000c  SAR     : 0x00000018  EXCCAUSE: 0x0000001c  
[16:29:13]EXCVADDR: 0x00001800  LBEG    : 0x40056f5c  LEND    : 0x40056f72  LCOUNT  : 0xffffffff  
[16:29:13]
[16:29:13]
[16:29:13]Backtrace: 0x40056f89:0x3fcb2da0 0x4037e2d2:0x3fcb2db0 0x4037e924:0x3fcb2dd0 0x4200bc14:0x3fcb2e10
[16:29:13]
[16:29:13]
[16:29:13]
[16:29:13]
[16:29:13]ELF file SHA256: 7bc2f0d95
[16:29:13]
[16:29:13]Rebooting...

@RubenKelevra
Author

Is it possible that the new driver version isn't actually being pulled into the build?

Yeah. There can't be a stack overflow happening with the patch, because there's no longer recursion happening.

So the patch is not applied by ESPHome, for whatever reason.

Try removing the override in esphome/components/esp32_camera/__init__.py:

 # esphome/components/esp32_camera/__init__.py
-    if CORE.using_esp_idf:
-        add_idf_component(
-            name="espressif/esp32-camera",
-            repo="https://github.com/RubenKelevra/espressif_esp32-camera.git",
-            ref="fix_cam-task_stack-overflow")

Keep only the YAML override.

Then clean and remove the folders build and .idf-component-manager (the folder names are from memory right now, I'm on the go - hope that's correct) and retry. Hope that helps :)

@turenkomv

I rebuilt the firmware on my home computer, and I'm sure that this time the driver version from the PR is actually being pulled in.
image

@RubenKelevra
Author

I rebuilt the firmware on my home computer, and I'm sure that this time the driver version from the PR is actually being pulled in. image

And, what's the result? :)

@turenkomv

With exactly the same result 😔

@RubenKelevra
Author

@turenkomv Thanks.

I'll investigate further where the second issue may lie.

@RubenKelevra
Author

@turenkomv let's continue the discussion here. I'm not very hopeful that this is the fix for the issue, but it's worth a shot.
