Intermittent issues when running a test; adding a startup delay fixes it, but I don’t know why

I can’t figure out how to tag my post, so I’m hoping this works:
#iOS#
#Appium 1.20.2#
#iOS Simulator#
#Java#
#Catalina#
#Mac OS 10.15.5#

Howdy,

Summary:

I’m trying to set up automated UI testing for our app; for now the test just tries to click a button. This normally works, but I’m running into one of 4 intermittent issues about 10-50% of the time.

Basic details:

iOS app on iOS Simulator, Appium v1.20.2, using Java, on Mac OS 10.15.5
The app is a hybrid app, and we want to automate clicks in it.

Here’s the context:

We have an iOS app we are trying to write UI tests for. We’ve been building out a proof of concept, and it is now at a point where it is “working” but very unstable.

We use Java as the testing language, so we are using the org.openqa.selenium package.

Here’s a pastebin for what my code generally looks like (with changes to help protect IP): https://pastebin.com/AQjCnkhw

Behavior:

50% of the time: Everything works as expected

However, the rest of the time, we will get one of 4 behaviors:

  1. this.driver.context(this.webViewId) throws NoSuchContextException on all 5 retries, even though getContextHandles works properly. This persists when we get rid of the retry limit and try it many, many times. (The message is usually “Target not found.”)
  2. The findElement call fails with “org.openqa.selenium.support.ui.ExpectedConditions WARNING WebDriverException thrown by findElement(By.id: MyTopLevelWebViewDivId)”. In these cases, I call “getPageSource” before calling getWebViewDom, and the HTML clearly shows that MyTopLevelWebViewDivId exists.
  3. We get the error “Did not get any response for atom execution after 12000 ms.”
  4. Sometimes no exceptions get thrown, but the button is not clicked, and the Java test script returns with exit code 0.

However, if I put this line of code in the part I marked “POST_INIT”:

Thread.sleep(15000);
it seemingly works 100% of the time (I’ve tested it twice, 20 times in a row each, and several other times with fewer runs, and never got any of the issues).

The shorter the sleep, the more often the errors seem to occur. For instance, at 10000 ms, I get a failure rate of about 10-20%, whereas without a sleep, it happens 20-50% of the time.
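
What I would prefer over a fixed sleep is a bounded wait that returns as soon as the app is ready. Here is a minimal sketch of that idea in plain Java, so it stands alone. ReadyPoller and waitUntil are hypothetical names I made up for this post, and the readiness condition itself is exactly the part I don’t know how to write.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical helper (not part of Appium or Selenium): polls a condition until a deadline. */
final class ReadyPoller {

    /**
     * Checks the condition immediately, then every intervalMillis, until it
     * returns true or timeoutMillis has elapsed.
     *
     * @return true if the condition became true in time, false on timeout or interruption
     */
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long intervalMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            // Overflow-safe comparison of nanoTime values.
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and stop waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

At POST_INIT this would replace the sleep with something like ReadyPoller.waitUntil(condition, 15000, 500), where the 500 ms interval is an arbitrary choice. The open question is what the condition should be: getContextHandles already succeeds in the failing runs, so checking for the web view context there is apparently not a sufficient readiness signal.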

What I’ve tried:

In the “Context” section, I’ve put some variations on the code that I’ve tried to address the issues, including just looping forever when I got an exception, but all that did was loop forever. This leads me to believe that interacting with the Appium server at all, to change contexts or wait on elements, before it is ready puts it into a bad state.
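
To make the retry variations concrete, here is a minimal sketch of the bounded-retry shape in plain Java. It is not the exact code from the pastebin: BoundedRetry, retry, and the pause parameter are hypothetical names, and in the real test the action would be the this.driver.context(this.webViewId) call.

```java
import java.util.function.Supplier;

/** Hypothetical helper (not part of Appium or Selenium): retries an action a fixed number of times. */
final class BoundedRetry {

    /**
     * Runs the action up to maxAttempts times, pausing pauseMillis between
     * failed attempts. If every attempt throws, the last exception is rethrown.
     */
    static <T> T retry(Supplier<T> action, int maxAttempts, long pauseMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                // Selenium's WebDriverException family is unchecked, so this would catch it.
                last = e;
            }
            if (attempt < maxAttempts) {
                pause(pauseMillis);
            }
        }
        throw last;
    }

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting to retry", e);
        }
    }
}
```

Called as BoundedRetry.retry(() -> driver.context(webViewId), 5, 1000), with the 1000 ms pause being an arbitrary value, this gives up after 5 attempts, which is what I see in failure case 1. Raising the attempt count did not help: once the first attempt fails, every later one fails too.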

I also tried comparing the logs from runs where it was working against runs where I got exceptions. Maybe I just wasn’t analyzing them right, but I couldn’t find anything consistently different, except for the obvious points of failure.

I then tried a bunch of different capabilities, but none of these made an appreciable difference:
(“appium:waitForIdleTimeout”, 30)
(“appium:webviewConnectTimeout”, 30)
(“webviewConnectTimeout”, 30)
(“waitForIdleTimeout”, 30)
(“waitForQuiescence”, true)
There are a few I can’t remember that I tried, and I tried different combinations of the ones I listed as well. None of them got rid of the instability issues.
I also tried increasing NEW_COMMAND_TIMEOUT to 30, which didn’t get rid of the instability either.
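
For clarity, this is roughly how those values were set, shown as a plain map so the snippet stands alone. TriedCapabilities and build are hypothetical names; in my test the entries go onto the capabilities object passed to the driver. All of them appear in one map here only for illustration, since in practice I tried them in different combinations. I have not confirmed which unit (seconds or milliseconds) each capability expects, so 30 may not mean the same thing for each of them.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical holder for the capability values listed above. */
final class TriedCapabilities {

    /** Returns the capabilities I experimented with, in the order listed in this post. */
    static Map<String, Object> build() {
        Map<String, Object> caps = new LinkedHashMap<>();
        caps.put("appium:waitForIdleTimeout", 30);
        caps.put("appium:webviewConnectTimeout", 30);
        caps.put("webviewConnectTimeout", 30);
        caps.put("waitForIdleTimeout", 30);
        caps.put("waitForQuiescence", true);
        // NEW_COMMAND_TIMEOUT is the Java client constant for "newCommandTimeout".
        caps.put("newCommandTimeout", 30);
        return caps;
    }
}
```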

I then tried to debug the Appium server. I cloned the “appium” repo and started the server using node (running “node .” from the command line after building). I used VS Code to set breakpoints in the server. In particular, I wanted to narrow in on the failing setContext call, and the stack trace seems ROUGHLY to be the following:
Entry: lib/appium.js AppiumDriver.executeCommand line 547 res.value = await dstSession.executeCommand(cmd, ...args);
node_modules/appium-xcuitest-driver/build/lib/commands/context.js setContext line 147 await this.remote.selectPage(appIdKey, pageIdKey, skipReadyCheck);
node_modules/appium-remote-debugger/build/lib/mixins/connect.js selectPage line 229 if (!skipReadyCheck && !(await this.checkPageIsReady())) {
node_modules/appium-remote-debugger/build/lib/mixins/navigate.js checkPageIsReady line 119 readyState = await _bluebird.default.resolve(this.execute(readyCmd, true)).timeout(this.pageReadyTimeout);
node_modules/appium-remote-debugger/build/lib/mixins/execute.js execute line 136 const res = await this.rpcClient.send('Runtime.evaluate', {
node_modules/appium-remote-debugger/build/lib/rpc/rpc-client.js RpcClient.send() line 181 return await this.sendToDevice(command, opts, waitForResponse);
At this point it was difficult to dig further. Stepping into functions wasn’t straightforward, as the debugger didn’t necessarily take me into the functions I was trying to step into (I found myself in the function emit_hooks a lot). In addition, I was getting non-deterministic behavior in where the exceptions were thrown, sometimes on one line and sometimes on others, mostly depending on where I put my breakpoints (the exceptions were thrown sometime during RpcClient.initializePage). I do know that I eventually ended up in the catch statement for “RpcClient.send”, with the “Target not found.” message and so forth.
This led me to believe that the issue was on the XCUITest end, and that I was in over my head.

Does anyone have any idea what might be going on? Why does it work better when I put in a delay after initialization? 15 seconds is a long time and slows down our UI testing, so it’s not an ideal solution.

Thank you all for any help you can provide.

Example Logs for the different failure cases:

working example #1: https://paste.ee/p/8npT6
working example #2: 4 link limit for new users, let me know if this is needed
setContext failure: https://paste.ee/p/IcCp3
findElement failure: https://paste.ee/p/RcFqf
atom execution failure: this is really rare, don’t have a link
exit code 0, but doesn’t click button: 4 link limit
A new timeout failure I found while getting new logs: 4 link limit