This post is the final entry in our three-part series on building an AI-powered end-to-end testing framework, and it’s where we truly take you behind the curtain. If you’re just joining us, catch up with part 1 and part 2 first.
A quick heads-up before you dive in: this is a code-heavy installment, even more so than part 2.
If you enjoy seeing how software is actually built and tested in real life, you’ll feel right at home. We get into real examples, real workflows, and real implementation details that show exactly how this framework works under the hood.
If you’re a wealth firm CEO, COO, or financial advisor, you don’t need to read every line of code to get value from this post. In fact, this is an excellent resource to pass directly to your IT or development team.
It gives them a concrete, real-world blueprint for how modern, AI-assisted testing can be implemented responsibly and without sacrificing stability or trust.
Most importantly, even if you skim the technical sections, the business implications are easy to understand no matter your experience with software development. If you want to get straight to the takeaways, skip ahead to “Wrapping Up: The Problem We Solved.”
The final sections of this post translate the technical work into practical takeaways: what this kind of framework enables, how it supports faster innovation, and why it ultimately benefits our advisors and our team alike.
This is the inside view. And if you want to understand not just what your technology partners promise, but how great systems are actually built, this is where it all comes together.
With that said, let’s dive in.

Multi-User Authentication
Real apps have multiple user roles: admins, regular users, special-access accounts. Our framework handles this with worker-scoped authentication.
How It Works
// fixtures/auth-fixtures.ts
import { test as base } from '@playwright/test';
import type { Page } from '@playwright/test';

type UserType = 'DEFAULT' | 'ADMIN' | 'INTAKE_USER';

interface TestFixtures {
  userType: UserType;
}

export const test = base.extend<TestFixtures>({
  userType: ['DEFAULT', { option: true }],

  page: async ({ browser, userType }, use, workerInfo) => {
    const authFile = `tests/.auth/${userType}-worker-${workerInfo.workerIndex}.json`;

    // Reuse cached auth state if it's still valid (helper sketched below)
    if (isAuthCacheValid(authFile)) {
      const context = await browser.newContext({ storageState: authFile });
      const page = await context.newPage();
      await use(page);
      await context.close();
      return;
    }

    // Authenticate fresh and cache the storage state for this worker
    const context = await browser.newContext();
    const page = await context.newPage();
    await authenticateUser(page, userType);
    await context.storageState({ path: authFile });
    await use(page);
    await context.close();
  },
});

async function authenticateUser(page: Page, userType: UserType) {
  const username = process.env[`${userType}_USERNAME`] || process.env.DEFAULT_USERNAME;
  const password = process.env[`${userType}_PASSWORD`] || process.env.DEFAULT_PASSWORD;

  await page.goto('/signin');
  await page.getByPlaceholder('Email').fill(username!);
  await page.getByPlaceholder('Password').fill(password!);
  await page.getByRole('button', { name: 'Login' }).click();
  await page.waitForURL((url) => !url.pathname.includes('signin'));
}
Each Playwright worker maintains its own auth state, keyed by user type and worker index, so parallel workers never fight over the same session file.
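One helper above, isAuthCacheValid, isn’t shown. A minimal sketch, assuming the cached state is valid when the file exists and is younger than a TTL (the eight-hour figure is an illustrative assumption, not our actual session lifetime):
import * as fs from 'node:fs';

// Assumed TTL: treat cached auth as stale after 8 hours
const AUTH_CACHE_TTL_MS = 8 * 60 * 60 * 1000;

function isAuthCacheValid(authFile: string): boolean {
  if (!fs.existsSync(authFile)) return false;
  const ageMs = Date.now() - fs.statSync(authFile).mtimeMs;
  return ageMs < AUTH_CACHE_TTL_MS;
}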
Using It in Tests
// Switch users at the file level
test.use({ userType: 'ADMIN' });

// Or at the describe-block level
test.describe('Admin Features', () => {
  test.use({ userType: 'ADMIN' });

  test('admin can approve requests', async ({ page }) => {
    // page is authenticated as ADMIN
  });
});

// Multi-user in a single test, via a userPage factory fixture
test('cross-user workflow', async ({ page, userPage }) => {
  const adminPage = await userPage('ADMIN');
  const intakePage = await userPage('INTAKE_USER');
  // Interact with both simultaneously
});
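That userPage fixture isn’t part of the auth fixture shown earlier. Here’s a minimal sketch of how such a factory fixture could work, assuming UserType and authenticateUser are exported from auth-fixtures.ts; the fixture name and shape are illustrative:
// A hypothetical userPage factory fixture, layered on the auth fixture above.
// It returns a freshly authenticated page for any user type and closes
// every browser context it opened when the test finishes.
import type { BrowserContext, Page } from '@playwright/test';
import { test as authTest, authenticateUser } from './auth-fixtures';
import type { UserType } from './auth-fixtures';

type UserPageFactory = (userType: UserType) => Promise<Page>;

export const test = authTest.extend<{ userPage: UserPageFactory }>({
  userPage: async ({ browser }, use) => {
    const contexts: BrowserContext[] = [];
    await use(async (userType) => {
      const context = await browser.newContext();
      contexts.push(context);
      const page = await context.newPage();
      await authenticateUser(page, userType);
      return page;
    });
    // Clean up every context this test opened
    await Promise.all(contexts.map((c) => c.close()));
  },
});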
Smart Test Reporting
The Problem with Default Reporters
Standard Playwright reporters show retried tests as separate entries:
✗ TLH.1: Harvest Account (failed)
✓ TLH.1: Harvest Account (retry #1) (passed)
This is confusing. Did the test pass or fail? You have to read carefully to figure it out.
Our Fix: Consolidated Flaky Test Reporting
We built a custom reporter that groups retry attempts and shows what actually matters: the final outcome.
// reporters/summary-reporter.ts
import type {
  FullResult,
  Reporter,
  TestCase,
  TestResult,
} from '@playwright/test/reporter';

interface TestSummary {
  testId: string;
  name: string;
  status: TestResult['status'];
  retry: number;
  duration: number;
}

interface FinalTestResult {
  name: string;
  finalStatus: 'passed' | 'failed' | 'flaky'; // flaky = passed on retry
  attempts: number;
  passedOnRetry: boolean;
}

class SummaryReporter implements Reporter {
  private tests: TestSummary[] = [];

  onTestEnd(test: TestCase, result: TestResult) {
    // Track each attempt under a unique test ID
    const testId = `${test.parent.project()?.name}::${test.location.file}::${test.title}`;
    this.tests.push({
      testId,
      name: test.title,
      status: result.status,
      retry: result.retry,
      duration: result.duration,
    });
  }

  private consolidateResults(): FinalTestResult[] {
    // Group attempts by test ID
    const byTestId = new Map<string, TestSummary[]>();
    for (const test of this.tests) {
      const existing = byTestId.get(test.testId) || [];
      existing.push(test);
      byTestId.set(test.testId, existing);
    }

    const results: FinalTestResult[] = [];
    byTestId.forEach((attempts) => {
      attempts.sort((a, b) => a.retry - b.retry);
      const first = attempts[0];
      const last = attempts[attempts.length - 1];
      const failedFirst = first.status !== 'passed'; // covers failures and timeouts
      const passedLater = last.status === 'passed';
      const passedOnRetry = attempts.length > 1 && failedFirst && passedLater;
      results.push({
        name: last.name,
        finalStatus: passedOnRetry ? 'flaky' : passedLater ? 'passed' : 'failed',
        attempts: attempts.length,
        passedOnRetry,
      });
    });
    return results;
  }

  onEnd(result: FullResult) {
    const finalResults = this.consolidateResults();
    const passed = finalResults.filter((t) => t.finalStatus === 'passed');
    const flaky = finalResults.filter((t) => t.finalStatus === 'flaky');
    const failed = finalResults.filter((t) => t.finalStatus === 'failed');

    // Flaky tests are counted as successes
    const totalSuccessful = passed.length + flaky.length;

    console.log(`
═══════════════════════════════════════════════════════════════
TEST RUN SUMMARY
═══════════════════════════════════════════════════════════════
Total Tests: ${finalResults.length}
Passed: ${passed.length}
Flaky: ${flaky.length} ← passed on retry
Failed: ${failed.length}
Success Rate: ${totalSuccessful}/${finalResults.length}
`);

    // Highlight flaky tests as SUCCESS with a warning
    if (flaky.length > 0) {
      console.log('\n─── FLAKY TESTS (passed on retry) ───');
      flaky.forEach((test, i) => {
        console.log(`
⚡ ${i + 1}. ${test.name} ✓ PASSED
   Attempts: ${test.attempts}
   Result: Failed initially, passed on retry`);
      });
    }
  }
}

export default SummaryReporter;
Sample Output
═══════════════════════════════════════════════════════════════
TEST RUN SUMMARY
═══════════════════════════════════════════════════════════════
Overall Status: ✓ PASSED
─── TEST COUNTS ───
Total Tests: 5
Passed: 3 (60%)
Flaky: 1 (20%) ← passed on retry
Failed: 1 (20%)
Success Rate: 4/5 (80%) (includes 1 flaky)
─── FLAKY TESTS (passed on retry) ───
⚡ 1. TLH.1: Harvest Account ✓ PASSED
Project: chromium
Total Duration: 1m 45.2s (2 attempts)
Result: Failed initially, passed on retry #1
─── PASSED TESTS ───
✓ INVEST.1: Invest Index (29.9s)
✓ TLH.1: Harvest Account ⚡ (1m 45.2s)
═══════════════════════════════════════════════════════════════
TEST RUN PASSED - 4/5 tests passed (1 flaky) in 3m 12.5s
═══════════════════════════════════════════════════════════════
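Wiring a custom reporter into a run is a one-line config change. A minimal sketch, assuming the reporter file lives at the path from the comment above and keeping Playwright’s standard HTML report alongside it:
// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  // Our consolidated summary plus the standard HTML report
  reporter: [
    ['./reporters/summary-reporter.ts'],
    ['html'],
  ],
});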
Real Example: Tax Loss Harvesting Test
The Challenge
We needed to test a complex financial workflow. Here’s what made it hard:
- Multi-step process with slow API calls (60-120 seconds each)
- Dynamic UI that changes based on account data
- React-select dropdowns
- Tables with unpredictable data
- Multiple valid outcomes (opportunities might exist or not)
The Spec
Written in plain English so anyone can read it (even the people on our team with no engineering experience).
— TEST TLH.1: Harvest Account —
Steps:
1. Navigate to harvest
2. Look for Select Account and search for “Haley Fuller”
3. Confirm Account Summary shows positive values for Equities
4. Click on ‘Harvest Account’. This might take up to 60 seconds
5. Confirm ‘Selected Capital:’ shows positive value
6. Click on ‘Optimize Replacements’. This might take up to 120 seconds
7. Click on ‘Preview Orders’ to verify orders
8. Confirm there are some sells and buys in the orders preview
The Generated Page Object
Claude Code produced this from the spec:
// Excerpt: BasePage and the data getters (getEquitiesValue, getLotsCount,
// etc.) are defined elsewhere in the framework.
export class TLHPage extends BasePage {
  readonly url = '/harvest';

  async selectAccount(accountName: string): Promise<void> {
    const accountCombobox = this.page.getByRole('combobox').first();
    await accountCombobox.waitFor({ state: 'visible', timeout: 15000 });
    await accountCombobox.click();
    await this.page.keyboard.type(accountName);
    await this.page.waitForSelector('[role="option"]');

    const option = this.page
      .locator('[role="option"]')
      .filter({ hasText: new RegExp(accountName, 'i') })
      .first();

    if (await option.isVisible()) {
      await option.click();
    } else {
      await this.page.keyboard.press('Enter');
    }
    await this.page.waitForTimeout(2000); // Wait for data load
  }

  async clickHarvestAccount(): Promise<void> {
    const harvestButton = this.page.locator('button:has-text("Harvest Account")').first();

    // Scroll into view to handle overlapping elements
    const tlhSection = this.page.locator('text=Tax Loss Harvesting Options')
      .locator('xpath=ancestor::*[3]');
    await tlhSection.scrollIntoViewIfNeeded();
    await harvestButton.scrollIntoViewIfNeeded();

    // Try a normal click, fall back to a JS click if intercepted
    try {
      await harvestButton.click({ timeout: 5000 });
    } catch {
      await harvestButton.evaluate((btn) => (btn as HTMLButtonElement).click());
    }
    await this.waitForHarvestComplete();
  }

  async waitForHarvestComplete(): Promise<void> {
    // Wait for "Generating optimized replacements..." to appear and disappear
    const generatingText = this.page.locator('text=/Generating.*replacement/i');
    try {
      await generatingText.waitFor({ state: 'visible', timeout: 5000 });
      await generatingText.waitFor({ state: 'hidden', timeout: 120000 });
    } catch {
      // The operation may complete before we observe the spinner
    }

    // Also check the Preview Orders button state
    const previewButton = this.page.locator('button:has-text("Preview Orders")');
    const isEnabled = await previewButton.isEnabled().catch(() => false);
    console.log(`Preview Orders button enabled: ${isEnabled}`);
  }

  async hasHarvestingOpportunities(): Promise<boolean> {
    // Check multiple indicators
    const tabpanel = this.page.locator('[role="tabpanel"]').first();
    const tabpanelText = await tabpanel.textContent().catch(() => '');
    if (tabpanelText && tabpanelText.trim().length > 10) {
      return true;
    }
    const lots = await this.getLotsCount();
    return lots > 0;
  }
}
The Generated Test
// tlhPage is a TLHPage instance created in a beforeEach hook (not shown)
test('TLH.1: Harvest Account', async ({ page }) => {
  await tlhPage.goto();
  await tlhPage.selectAccount('Haley Fuller');

  const equities = await tlhPage.getEquitiesValue();
  expect.soft(equities, 'Equities should be positive').toBeGreaterThan(0);

  await tlhPage.clickHarvestAccount();

  const hasOpportunities = await tlhPage.hasHarvestingOpportunities();
  if (!hasOpportunities) {
    // No opportunities is a valid outcome, since market conditions vary
    console.log('No harvesting opportunities found');
    const tabVisible = await page
      .locator('[role="tab"]:has-text("Harvesting Opportunities")')
      .isVisible();
    expect(tabVisible).toBeTruthy();
    return;
  }

  const selectedCapital = await tlhPage.getSelectedCapitalValue();
  expect.soft(selectedCapital).toBeGreaterThan(0);

  const hasOptimizeButton = await tlhPage.isOptimizeReplacementsVisible();
  if (hasOptimizeButton) {
    await tlhPage.clickOptimizeReplacements();
  }

  const previewEnabled = await tlhPage.isPreviewOrdersEnabled();
  if (!previewEnabled) {
    console.log('Preview Orders disabled - no orders to preview');
    return;
  }

  const orderCounts = await tlhPage.clickPreviewOrders();
  expect.soft(orderCounts.buys).toBeGreaterThan(0);
  expect.soft(orderCounts.sells).toBeGreaterThan(0);
});
Patterns Worth Noting
1. Flexible selectors with regex and fallbacks
2. Graceful handling when no opportunities exist (test still passes)
3. Explicit timeouts for slow operations
4. Soft assertions for non-critical checks
5. JavaScript click fallback for intercepted elements
6. Console logging to help with debugging
What We Learned
1. Flexible Selectors Matter Most
This was the single biggest win for test stability:
// Before: breaks when text changes
page.locator('button:has-text("Submit Request")');

// After: survives minor changes
page.getByRole('button', { name: /submit.*request/i })
  .or(page.locator('[data-testid="submit-btn"]'));
2. Accept Some Flakiness
E2E tests will be flaky sometimes. That’s just reality. Instead of fighting it:
- Configure 1-2 retries (a config sketch follows this list)
- Track flaky tests separately
- Count them as passed (with a warning)
- Fix patterns that cause repeated flakiness
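In Playwright, that retry budget is a single config setting. A minimal sketch (the CI-versus-local split is a common convention, not necessarily our exact values):
// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  // Give tests a second (and on CI, a third) chance before calling them failed
  retries: process.env.CI ? 2 : 1,
});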
3. Page Objects Pay Off
The upfront investment is worth it:
- Fix a selector once, it works everywhere
- Tests read like documentation
- Non-developers can understand what tests do
4. Claude Code Works Well As a Partner
It’s good at:
- Generating boilerplate from patterns
- Finding alternative selectors when one fails
- Explaining error contexts
- Suggesting fixes based on DOM snapshots
5. Specs Double as Documentation
Test specs serve three purposes:
- Define what to test
- Document expected behavior
- Let non-technical people contribute
Wrapping Up: The Problem We Solved
We started with a simple tension: Our ability to innovate was outpacing our ability to validate. Our QA team, essential to shipping quality software, had become an unintentional bottleneck.
We reached the point where manual testing couldn’t scale with our ambitions.
The solution wasn’t to replace our QA team. Rather, it was to amplify what they could do. And we did that with the AI-enabled framework laid out in this three-part blog series.
How the Framework Meets Our Requirements
| Requirement | How We Met It |
| --- | --- |
| User interaction level | Playwright automates real browser interactions (clicks, navigation, form fills) exactly as users experience them |
| Human-readable specs | Markdown specifications that QA, PMs, and developers can all write and understand |
| AI-powered translation | Claude Code converts plain English into robust Playwright tests with proper waits, assertions, and error handling |
| Resilience to changes | Regex selectors with `.or()` fallbacks survive UI updates without breaking |
| Scalable execution | 4+ parallel workers with cached authentication run full suites in minutes |
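The scalable-execution row also comes down to a small config knob plus the worker-scoped auth fixture from earlier. A minimal sketch (the worker count is illustrative, not our exact setting):
// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  fullyParallel: true,
  // Cap workers on CI; locally Playwright sizes the pool from CPU cores
  workers: process.env.CI ? 4 : undefined,
});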
The Business Impact
| Metric | Before | After |
| --- | --- | --- |
| Time to write new test | 2-4 hours | 15-30 minutes |
| Test maintenance overhead | 40% of sprint | 10% of sprint |
| Test flakiness confusion | High | Eliminated |
| Non-developer contributions | 0 | 30% of specs |
| Regression suite runtime | Hours | Minutes |
| QA bottleneck | Severe | Eliminated |
What Changed for Our Team
For Our QA Engineers:
- Focus shifted from repetitive regression to exploratory testing
- They write specs in plain English, not code
- More time for edge case discovery and UX validation
For Our Developers:
- Tests don’t break with every UI change
- Clear specs document expected behavior
- Faster feedback loops in CI/CD
For Our Business:
- Innovation velocity restored
- Quality maintained (actually improved)
- Confidence in every release
The Core Idea
The framework works because it respects who does what: Humans define WHAT to test. AI figures out HOW to test it.
QA engineers understand user journeys. They know what matters. Claude Code handles the tedious translation to executable tests: selectors, waits, assertions, error handling.
For us, creating this framework was never about automation replacing humans. It has always been about finding a way to create automation that makes humans more effective.
Getting Started
1. Set up the infrastructure. Clone the test folder structure and install dependencies.
2. Configure the generation prompt. Customize selector strategies for your application.
3. Write your first spec. Plain English test steps.
4. Generate and run. Let Claude translate, then iterate.
# Example workflow
echo "your test steps" > tests/specs/my-feature.spec.md
./tests/scripts/generate-tests.sh my-feature
npx playwright test tests/generated/my-feature.spec.ts
The framework adapts to your application’s patterns over time. Each test you generate teaches Claude Code more about your codebase, making subsequent tests easier.
Final Thoughts
We no longer choose between velocity and quality, and our QA team isn’t a bottleneck anymore; it’s a force multiplier. The team defines testing strategy while automation handles execution at scale.
As much code as there is in this series, the real insight we gained wasn’t technical. It was recognizing that AI should augment human expertise, not replace it. Our QA team is more valuable than ever, freed from repetitive tasks to focus on what they do best.
