Rule-based | 83.8 | 25 | 85.2 | 79 | 100 | 83.3 | https://arxiv.org/abs/2504.08942 | Lù et al. |
WebJudge | 73.7 | 66.7 | 69.8 | 72.6 | 92.3 | 75 | https://arxiv.org/pdf/2504.01382 | Xue et al. |
AER-C (GPT-4o) | 67.7 | 83.3 | 56 | 68.8 | 100 | 66.7 | https://arxiv.org/abs/2504.08942 | Lù et al. |
AER-V (GPT-4o) | 67.6 | 83.3 | 61.2 | 67.6 | 96.4 | 59.3 | https://arxiv.org/abs/2504.08942 | Lù et al. |
NNetNav (Llama-3.3 70B) | 52.5 | 20.8 | 54.5 | 54.3 | 77.3 | 43.2 | https://arxiv.org/abs/2504.08942 | Lù et al. |
Claude 3.7 S. (A) | 68.8 | 87.5 | 61 | 69.3 | 85 | 66.7 | https://arxiv.org/abs/2504.08942 | Lù et al. |
GPT-4o (A) | 69.8 | 77.8 | 63 | 70.2 | 94.6 | 63 | https://arxiv.org/abs/2504.08942 | Lù et al. |
GPT-4o Mini (A) | 61.5 | 80 | 57.9 | 63.5 | 84.2 | 49.4 | https://arxiv.org/abs/2504.08942 | Lù et al. |
Llama 3.3 (A) | 67.7 | 75 | 59.6 | 68.2 | 94.3 | 62.7 | https://arxiv.org/abs/2504.08942 | Lù et al. |
Qwen2.5-VL (A) | 64.3 | 72.7 | 59.3 | 63.6 | 87.2 | 60.3 | https://arxiv.org/abs/2504.08942 | Lù et al. |
Claude 3.7 S. (S) | 69.4 | 71.4 | 64.8 | 69.3 | 85.3 | 66.7 | https://arxiv.org/abs/2504.08942 | Lù et al. |
GPT-4o (S) | 68.1 | 77.8 | 60.7 | 69.9 | 93.8 | 59.6 | https://arxiv.org/abs/2504.08942 | Lù et al. |
GPT-4o Mini (S) | 64.5 | 80 | 57.4 | 66.9 | 90.3 | 54.8 | https://arxiv.org/abs/2504.08942 | Lù et al. |
Qwen2.5-VL (S) | 64.5 | 70 | 58.5 | 62.9 | 93.8 | 64.4 | https://arxiv.org/abs/2504.08942 | Lù et al. |