
DES303 Week 7: Testing Focus Integrity With Users - Fairness, Privacy, and Trust

Using model comparison as a research instrument

Seung Beom Yang
DES303 – Design Research Practice
INTEGRATED REFLECTIVE CYCLE
Experience: What did I test, build, and change?
Reflection: What did I realise about my method?
Theory: Why did this method fail, and what does that mean for my research?
Preparation: What will I test next?

Introduction

In Week 6, user testing and crit feedback revealed that the central problem in Tickers was not simply whether the prototype worked, but whether users could trust the system's judgement. The feedback showed that Focus Integrity could not be treated as a simple score. If it affected arena results, proof, money, reputation, or social pressure, then the score needed to feel explainable, fair, and contestable.

This week I used the model comparison system from Week 6 as a research instrument. The goal was not to prove that the score was correct, but to test whether users understood and trusted the judgement. Instead of adding more arena features, I focused on whether the scoring layer felt fair, explainable, and acceptable to people using it.

By the end of the week, I realised that the assumption behind this method, that more evidence would make the score more trustworthy, was limited. The system could collect data, replay sessions, compare models, and generate structured evidence, but more data did not automatically create better understanding. The experiment showed that Ticker was becoming better at watching the user, but not necessarily better at understanding the task.

The main learning from Week 7 is that Focus Integrity should not start from surveillance data. It has to start from task understanding.

Experience

Why I chose to investigate Focus Integrity

The first crit made the Focus Integrity problem very clear. Even though the prototype was still early, people immediately questioned the accuracy and ethics of the score. This mattered because the whole Tickers system depends on Focus Integrity being believable. If the score is wrong, the arena result becomes unfair. If the score feels hidden or invasive, users may reject the system. If the score is connected to money, reputation, proof, or social pressure, then wrong judgement has real consequences.

So I tried to make my version task-based rather than only activity-based. Instead of asking whether the user was active, Ticker should ask whether the user's observable behaviour matches the task they claimed they were doing.

What is new this week?

Week 7 overlaps with Week 6 because it continues the Focus Integrity problem, but the purpose changed. Week 6 established the trust issue. Week 7 tested whether my first technical answer to that issue was strong enough.

Week 6 established | Week 7 tested
Focus Integrity had a trust problem | Whether tick-based evidence could make the score more trustworthy
Users questioned fairness, privacy, and judgement | I built TriScore, collectors, replay comparison, and user correction labels
The dashboard began as a response to critique | The dashboard became a research tool for testing the limits of the method
The next step was Trustworthy Focus Integrity | The result was that evidence alone was not enough without task context
FIGURE 1 · WEEK 6 PROBLEM TO WEEK 7 EXPERIMENT
Crit · Week 6 feedback: Trust in Focus Integrity became the central risk.
Problem · Score feels inaccurate: Wrong judgement could feel unfair, hidden, or ethically risky.
Decision · Test the score: The next cycle needed to test the method, not add arena features.
Week 7 · TriScore + Research Tick: Live scoring and replay comparison became the experiment.

Figure 1. Week 6 critique revealed that the biggest risk in Tickers was trust in Focus Integrity. Week 7 therefore focused on testing the scoring method itself rather than expanding the arena system. Source: author's own process diagram, 2026.

Experiment 2: Testing Focus Integrity With Friends

Research question

After Week 6, I realised that improving Focus Integrity was not only a technical problem. It was a trust problem. This week I tested the system with a small group of friends to understand whether the score felt fair, which evidence felt acceptable, and where the system misread real study behaviour.

Test detail | What I did
Participants | 4 friends / classmates
Session length | 10 minute focus sessions
Tasks | coding, essay writing, lecture watching, research browsing
Evidence collected | app/window context, input activity, screen change, optional camera presence, model score
Human label | I manually labelled each segment as aligned, partial, distracted, or unclear
User feedback | short post-test questions about fairness, trust, and privacy

Evidence of what I actually worked on this week

To make the week more evidential, I treated Focus Integrity as a technical design experiment rather than only a conceptual reflection. I documented the work in four layers: implementation evidence, session evidence, model-comparison evidence, and peer-interpretation evidence. This mattered because the question was not only whether I could describe a scoring system, but whether I could show how the system was built, what data it produced, where it failed, and how that failure changed the direction of the project.

Workstream | What I built or changed | Evidence to include | What it proves
Live tracker | Built the macOS Focus Integrity overlay with live score and signal rows | Screenshot of tracker running during a real session | The system was not only conceptual
Collector pipeline | Added or refined active window, input, camera, screen diff, Chrome, and system state collectors | Annotated architecture and development workspace | Focus Integrity was built from several evidence channels
Raw tick logging | Captured repeated session ticks with timestamps, app context, signal values, and verdicts | Raw JSON / log screenshot with labels | The system produced machine-readable behavioural evidence
Research replay | Replayed the same session through multiple models | Research Tick dashboard screenshot | I was comparing evidence strategies, not trusting one score
Segment labelling | Added aligned / partial / unclear / not aligned correction labels | User correction UI screenshot | The score became contestable
Model comparison | Compared model output against human labels | Exploratory model-label agreement graph | The models still struggled with ambiguity
Peer interpretation | Asked peers what the score meant and what felt invasive | Small feedback table | The issue was not only accuracy, but trust and explanation
FIGURE 2 · DEVELOPMENT WORKSPACE FOR THE WEEK 7 BUILD
Focus Integrity files
focus-integrity/
  collectors/
    ActiveWindowCollector.ts
    InputActivityCollector.ts
    CameraPresenceCollector.ts
    ScreenDiffCollector.ts
    ChromeActivityCollector.ts
    SystemStateCollector.ts
  pipeline/
    FeatureExtractor.ts
    CertaintyGate.ts
    AIJudge.ts
  scoring/
    TriScoreEngine.ts
    SessionStateMachine.ts
  research/
    ResearchTickReplay.ts
    modelComparison.ts
Local run log
[tracker] session started: DES303 Week 7 blog
[collector] active_window=Chrome title="DES303 brief"
[collector] input=low idle=38s screen_diff=0.04
[ai-judge] verdict=partial confidence=0.61
[triscore] integrity=74 engagement=42 confidence=58
[research] saved tick #034 to replay buffer
[replay] models=M0, MA, MS, MB, MAI, CPRES, MPRES
Figure 2. Development workspace for the Week 7 Focus Integrity experiment. This build receipt shows the implementation layer behind the live tracker, including collector modules, scoring logic, and replay/debug tooling. I included this because the Week 7 experiment was not only a visual interface test. It involved building the technical pipeline that allowed session evidence to be collected, scored, replayed, and compared. Source: author's own development environment, 2026.

TriScore model — building my first live Focus Integrity architecture

The main research instrument I used this week was the TriScore model. This was my first working version of Focus Integrity as a live system rather than only a product idea. The purpose of the model was to replay and inspect focus sessions in small repeated “ticks”, collect behavioural evidence from the desktop, and test whether the user's current activity still matched the task they had committed to.

This mattered because the Week 6 critique showed that Focus Integrity could not remain as a vague score. If Ticker was going to use the score to affect streaks, arena results, proof, or accountability, then I needed to understand how the judgement was being made. Building the TriScore model was therefore not just technical development. It was a design research experiment into whether focus can be judged fairly through observable signals.

The live system ran in this loop:

Collectors → Feature Extractor → Certainty Gate / AI Judge → TriScore Engine → State Machine → Intervention UI

This structure helped me test what would happen if Ticker judged focus continuously during a session. It also made the weakness of the approach more visible. The system could collect more evidence, but it still struggled to know what the evidence meant.

FIGURE 3 · LIVE TRISCORE MODEL ARCHITECTURE
Input · Raw signals: window, input, camera, screen, Chrome, system
Processing · Feature extractor: rolling buffer, app role, idle time, title match
Local gate · Certainty gate: Clear on-task, clear off-task, or uncertain.
Escalation · AI judge: Called only when local evidence is uncertain.
Score · TriScore engine: Integrity, Engagement, Confidence
OUTPUTS DURING THE LIVE SESSION
State · State machine: focused, drift, distracted, failed
Interface · Intervention UI: warning, grace period, session fail
Figure 3. The live TriScore model. This system was designed to run during a session and intervene in real time when the user appeared to drift from the declared task. Source: author's own architecture diagram, 2026.
FIGURE 4 · LIVE FOCUS INTEGRITY TRACKER
Live macOS Focus Integrity tracker showing a session score, active app evidence, input activity, camera presence, screen change, and AI judgement.
Figure 4. Live Focus Integrity tracker running during a focus session. This screenshot is included as evidence that Week 7 was not only a conceptual architecture exercise. The live tracker tested whether Ticker could collect session evidence in real time, calculate focus-related signals, and respond while the user was still working. However, this also revealed a design problem: a live judgement system becomes risky when the evidence is ambiguous.

The system collected raw signals from the desktop every few seconds through several collectors:

Collector | What it checked | Why it mattered
ActiveWindowCollector | Foreground app, bundle ID, window title, active duration | Shows what digital context the user is in
InputActivityCollector | Typing, mouse, scroll, idle time | Shows whether the user is actively interacting
CameraPresenceCollector | Face presence and gaze stability | Shows whether the user is physically present
ScreenDiffCollector | Screen visual change | Shows whether the workspace is changing or static
ChromeActivityCollector | Browser / extension violations | Shows possible distracting web use
SystemStateCollector | Screen lock, display sleep | Shows whether the session is abandoned or inactive

This collector table was useful because it made the system's assumptions visible. Each collector translated part of the user's behaviour into data, but none of them could fully explain intention. For example, the ActiveWindowCollector could tell that Chrome was open, but not whether I was researching, watching a lecture, checking documentation, or drifting away from the task. The InputActivityCollector could detect low typing, but low typing might mean reading, planning, thinking, or being distracted. This showed me that adding collectors improved visibility, but did not automatically improve understanding.

What TriScore proved | What TriScore did not prove
A live Focus Integrity tracker can collect desktop evidence during a session | That the evidence correctly understands intention
The system can calculate Integrity, Engagement, and Confidence | That those scores feel fair to the user
A certainty gate can reduce unnecessary AI calls | That ambiguous work can be judged safely
Live intervention is technically possible | That live intervention is always helpful
FIGURE 5 · RAW TICK EVIDENCE
Segment evidence list showing machine-readable focus session evidence such as active app, window evidence, input activity, screen change, and confidence.
timestamp | active app | input state | camera state | screen diff | AI verdict | integrity / engagement / confidence | human label
{
  "tick": 34,
  "timestamp": "2026-04-18T14:22:40+12:00",
  "declaredTask": "write DES303 Week 7 blog",
  "activeApp": "Google Chrome",
  "windowTitle": "DES303 Week 7 2026 - Canvas",
  "input": { "typing": "low", "idleSeconds": 38 },
  "presence": { "camera": "face_not_detected" },
  "screen": { "diffRatio": 0.04, "state": "static_reading" },
  "aiVerdict": { "label": "drift_risk", "confidence": 0.61 },
  "triScore": { "integrity": 74, "engagement": 42, "confidence": 58 },
  "humanLabel": "aligned / partial"
}
Figure 5. Raw tick evidence from one focus session. Each tick records a small behavioural snapshot: time, active app, input activity, screen change, presence signal, and model verdict. Annotating the log helped me see that the system could produce structured evidence, but the meaning of that evidence still depended on the declared task. Source: author's own Focus Integrity session log, 2026.
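The tick in Figure 5 can also be written down as a type, which makes it clearer what each replay model receives. This is a minimal sketch: the field names come from the log above, but the value types are inferred from that single example, and `looksLikeTick` is a hypothetical helper that checks only a few fields rather than the whole record.

```typescript
// Hypothetical shape for one tick, mirroring the field names in the Figure 5 log.
// Value types are inferred from that single example; the real schema may differ.
interface Tick {
  tick: number;
  timestamp: string;
  declaredTask: string;
  activeApp: string;
  windowTitle: string;
  input: { typing: string; idleSeconds: number };
  presence: { camera: string };
  screen: { diffRatio: number; state: string };
  aiVerdict: { label: string; confidence: number };
  triScore: { integrity: number; engagement: number; confidence: number };
  humanLabel: string;
}

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null;
}

// Partial runtime check: it verifies only a handful of fields, so a value
// that passes is "probably a tick", not a fully validated one.
function looksLikeTick(v: unknown): v is Tick {
  if (!isRecord(v)) return false;
  const input = v["input"];
  const triScore = v["triScore"];
  return (
    typeof v["tick"] === "number" &&
    typeof v["declaredTask"] === "string" &&
    typeof v["activeApp"] === "string" &&
    isRecord(input) &&
    typeof input["idleSeconds"] === "number" &&
    isRecord(triScore) &&
    typeof triScore["integrity"] === "number"
  );
}

// A shortened copy of tick 34 from the log, as it might arrive from a saved file.
const savedTick =
  '{"tick":34,"declaredTask":"write DES303 Week 7 blog","activeApp":"Google Chrome",' +
  '"input":{"typing":"low","idleSeconds":38},' +
  '"triScore":{"integrity":74,"engagement":42,"confidence":58}}';
const parsedTick: unknown = JSON.parse(savedTick);
const tickIsUsable = looksLikeTick(parsedTick);
```

The type describes the evidence, not its meaning: nothing in it says whether 38 idle seconds in Chrome is reading or drifting.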

The Feature Extractor then converted this raw telemetry into higher-level features using a rolling buffer of recent ticks — approximately a few minutes of history — and calculated things like app switching rate, dwell time, focus continuity, idle seconds, doom scrolling, typing bursts, face stability, downward glances, screen-diff ratios, and task keyword matching.
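A rolling buffer of this kind could be sketched as follows. This is an illustration, not the code in `FeatureExtractor.ts`: the class name, the three features shown, the keyword rule (words longer than three letters), and the buffer size are all assumptions chosen to keep the example short.

```typescript
// Reduced tick shape for this sketch; field names follow the Figure 5 log.
interface RawTick {
  activeApp: string;
  windowTitle: string;
  idleSeconds: number;
}

interface TickFeatures {
  appSwitches: number;       // app changes between neighbouring ticks in the buffer
  meanIdleSeconds: number;   // average idle time across the buffer
  titleMatchesTask: boolean; // a declared-task keyword appears in the newest window title
}

class RollingBuffer {
  private ticks: RawTick[] = [];
  private readonly capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  // Add the newest tick and drop the oldest once the buffer is full.
  push(t: RawTick): void {
    this.ticks.push(t);
    if (this.ticks.length > this.capacity) {
      this.ticks.shift();
    }
  }

  features(declaredTask: string): TickFeatures {
    let appSwitches = 0;
    let idleTotal = 0;
    let previousApp: string | null = null;
    for (const t of this.ticks) {
      idleTotal += t.idleSeconds;
      if (previousApp !== null && t.activeApp !== previousApp) {
        appSwitches++;
      }
      previousApp = t.activeApp;
    }
    const newest = this.ticks.length > 0 ? this.ticks[this.ticks.length - 1] : undefined;
    const title = newest ? newest.windowTitle.toLowerCase() : "";
    // Assumed keyword rule: words longer than three letters from the declared task.
    const keywords = declaredTask.toLowerCase().split(/\s+/).filter((w) => w.length > 3);
    return {
      appSwitches,
      meanIdleSeconds: this.ticks.length > 0 ? idleTotal / this.ticks.length : 0,
      titleMatchesTask: keywords.some((w) => title.indexOf(w) >= 0),
    };
  }
}

// Illustrative ticks (not real session data); capacity 3 keeps only the last three.
const buffer = new RollingBuffer(3);
buffer.push({ activeApp: "Google Chrome", windowTitle: "DES303 brief", idleSeconds: 10 });
buffer.push({ activeApp: "VS Code", windowTitle: "modelComparison.ts", idleSeconds: 0 });
buffer.push({ activeApp: "Google Chrome", windowTitle: "YouTube", idleSeconds: 20 });
buffer.push({ activeApp: "Google Chrome", windowTitle: "DES303 Week 7 2026 - Canvas", idleSeconds: 30 });
const extracted = buffer.features("write DES303 Week 7 blog");
```

Even in this small version, the features stay behavioural. A title match is the only place where the declared task enters, and it is a shallow keyword check.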

How the live TriScore model used AI

I added a certainty gate before the AI Judge so I wouldn't call AI on every tick. If local evidence was clear, the system decided locally. If evidence was unclear, the backend formatted an evidence bundle and Claude returned a structured judgement — on-task yes/no, confidence, and a short reason. That verdict fed back into the TriScore Engine and could move the user through warning, recovery, or failed states.
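The gate-then-judge step might look roughly like this. It is a sketch under stated assumptions: the thresholds, the `blockedSiteOpen` flag, and the confidence values for local decisions are invented for illustration, and the AI judge is replaced by a plain function so the example runs without a network call.

```typescript
type GateDecision = "on_task" | "off_task" | "uncertain";

// Assumed inputs to the gate; the real feature set is larger.
interface GateFeatures {
  titleMatchesTask: boolean;
  idleSeconds: number;
  blockedSiteOpen: boolean;
}

// Structured judgement, following the post: on-task yes/no, confidence, short reason.
interface Verdict {
  onTask: boolean;
  confidence: number;
  reason: string;
}

// Stand-in for the AI Judge. In a real build this would be an asynchronous backend call.
type Judge = (f: GateFeatures) => Verdict;

// Illustrative rules only: the 30-second idle threshold is an assumption.
function certaintyGate(f: GateFeatures): GateDecision {
  if (f.blockedSiteOpen) return "off_task";
  if (f.titleMatchesTask && f.idleSeconds < 30) return "on_task";
  return "uncertain";
}

function judgeTick(f: GateFeatures, askJudge: Judge): Verdict {
  const local = certaintyGate(f);
  if (local === "on_task") {
    return { onTask: true, confidence: 0.9, reason: "local: title matches task, recent input" };
  }
  if (local === "off_task") {
    return { onTask: false, confidence: 0.9, reason: "local: blocked site open" };
  }
  // Only uncertain ticks are escalated, which keeps AI calls off most ticks.
  return askJudge(f);
}

// Count how often the judge is actually called.
let judgeCalls = 0;
const stubJudge: Judge = () => {
  judgeCalls++;
  return { onTask: true, confidence: 0.61, reason: "stub: reading course material" };
};

const clearTick = judgeTick({ titleMatchesTask: true, idleSeconds: 5, blockedSiteOpen: false }, stubJudge);
const unclearTick = judgeTick({ titleMatchesTask: true, idleSeconds: 38, blockedSiteOpen: false }, stubJudge);
```

With these assumed rules, a tick like number 34 (matching title, 38 idle seconds) falls into the uncertain branch, which is exactly where a wrong verdict would be most costly.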

At this stage the system felt promising — it could respond while the user was working. But it created a serious issue: if the judgement was wrong, the system could interrupt the user unfairly.

One ambiguous test case: when the system could watch but not understand

This ambiguous case became the most important evidence of the week. It showed that the problem was not only model accuracy. The problem was meaning. The same signal could be interpreted in opposite ways depending on the task. This meant the system could be technically detailed and still unfair.

Observed signal | What the system might assume | What it could actually mean
Chrome is active | The user is browsing instead of working | The user is researching, reading documentation, or checking references
YouTube is open | The user is distracted | The user is watching a lecture, tutorial, or design reference
Screen is static | The user has stopped working | The user is reading, thinking, or analysing
Low typing activity | The user is idle | The user is planning, reading, or reviewing
Camera does not see the face | The user left the session | The user may be writing on paper or looking at another device

This was the point where the experiment failed productively. The failure was useful because it changed the direction of the project. I no longer wanted to keep adding more surveillance signals. I needed to redesign the system so that it understood the work before judging the behaviour.

FIGURE 6 · AMBIGUOUS SEGMENT DRILL-DOWN
Segment | Evidence | Model read | Human label | Why it was ambiguous
03/10 | Chrome active, low typing, static screen | Drift risk | Aligned / partial | I was reading course material for the blog
04/10 | YouTube open, no typing | Distracted | Partial | It could be tutorial/reference, but the system lacked task context
05/10 | Camera absent | Off-task | Unclear | I may have been writing notes on paper
Figure 6. Ambiguous segment drill-down. These three segments became the strongest evidence that tick-based Focus Integrity was incomplete. The model could detect behaviour, but it could not reliably interpret whether that behaviour made sense for the task. Source: author's own replay test and human labelling, 2026.
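One way to express the ambiguity above is to interpret a signal against the declared task instead of against the signal alone. The sketch below is hypothetical: the task kinds are the four I tested with, but the table of which signals count as expected for each task is an assumption made up for this example, not a rule the current build uses.

```typescript
type TaskKind = "coding" | "essay writing" | "lecture watching" | "research browsing";
type ObservedSignal = "youtube_open" | "low_typing" | "static_screen" | "camera_absent";
type Reading = "expected" | "ask_user";

// Hypothetical table: signals that are plausible for each kind of task.
// Anything not listed is treated as ambiguous rather than as distraction.
const expectedSignals: Record<TaskKind, ObservedSignal[]> = {
  "coding": [],
  "essay writing": ["low_typing", "static_screen"],
  "lecture watching": ["youtube_open", "low_typing"],
  "research browsing": ["low_typing", "static_screen"],
};

function interpret(task: TaskKind, signal: ObservedSignal): Reading {
  return expectedSignals[task].indexOf(signal) >= 0 ? "expected" : "ask_user";
}

// The same signal reads differently once the task is known.
const youtubeDuringLecture = interpret("lecture watching", "youtube_open");
const youtubeDuringCoding = interpret("coding", "youtube_open");
```

The second result is deliberately `ask_user` and not a distraction verdict: without more context, the fair response to an unexpected signal is a question.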

Research Tick / Replay model — the investigation system

I therefore created a separate Research Tick / Replay system. This was not designed to control the user live. It was designed to help me investigate which data was truly meaningful.

Because Week 6 had already introduced the score-check dashboard, I used Week 7 to push the dashboard further as a research tool. The main change was not simply adding more models, but using the replay system to ask why each model still failed in ambiguous task situations.

Model | Evidence used | Why I tested it | Week 7 learning
M0 | Duration only | Baseline | Too private to be meaningful
MA | App, input, dwell time | Privacy-preserving telemetry | Useful but weak for intention
MS | Screen heuristics | Task evidence | Stronger but privacy-heavy
MAI | Metadata + screen + AI | AI judgement | Helpful but can sound too certain
CPRES | Camera presence | Physical presence | Presence is not focus
MTRAJ | Sequence over time | Previous/next segment context | Better than single ticks, still not task-aware
MLLM | Local LLM judge | Private AI judgement | Promising, but only if task context is rich

This changed Focus Integrity from a single-score problem into a comparison problem. I was no longer asking whether one model was “right”. I was asking which evidence types improved judgement, which were too invasive, and which were too expensive or unreliable to use.

FIGURE 7 · RESEARCH TICK MODEL COMPARISON SCREEN
Research Tick score-check dashboard comparing Focus Integrity model outputs across different evidence strategies.
The same session replayed through different evidence models to compare score behaviour.
FIGURE 8 · TRISCORE VS RESEARCH TICK / REPLAY MODEL
Live TriScore | Research Tick / Replay
Runs during focus session. Checks the user while the session is active. | Runs after the session. Uses saved data instead of controlling the user.
Intervenes in real time. Warnings can affect the experience immediately. | Compares models offline. Several evidence strategies can be tested safely.
Uses current tick features. Works from the newest behavioural evidence. | Uses saved session logs. The same session can be replayed repeatedly.
Outputs three live scores. Integrity, engagement, and confidence. | Outputs model results. The result is comparison, not a live judgement.
Risk: wrong intervention. A bad reading can interrupt the user unfairly. | Risk: wrong model analysis. Mistakes affect research, not the user session.
Goal: keep user focused. The product outcome is live support. | Goal: discover what matters. The design outcome is a better evidence strategy.


Figure 8. Difference between the live TriScore model and the Research Tick / Replay model. TriScore controls the live experience, while the research model investigates which evidence strategy should inform future versions. Source: author's own comparison diagram, 2026.
FIGURE 9 · EXPLORATORY MODEL-LABEL AGREEMENT
Research Tick model results table showing model-label agreement scores, label distributions, per-segment labels, warnings, and confidence changes.
Research Tick score-over-time graph comparing Focus Integrity model outputs against human-labelled background bands.
Figure 9. Exploratory model-label agreement and score-over-time graph across the labelled replay segments. This is not a final accuracy claim because the sample is small. It is a design research check showing how different evidence models behave against human labels, and why ambiguous task interpretation still needs user correction. Source: author's own Research Tick replay test, 2026.
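For readers unfamiliar with the term, model-label agreement can be as simple as the share of segments where a model's label equals the human label. The sketch below assumes that exact-match definition, which may not be how the dashboard calculates it, and the label lists are invented for illustration rather than taken from my sessions.

```typescript
type SegmentLabel = "aligned" | "partial" | "unclear" | "not aligned";

// Share of segments where the model label equals the human label (0 to 1).
// Assumption: exact match only, so "partial" vs "aligned" counts as a miss.
function agreement(model: SegmentLabel[], human: SegmentLabel[]): number {
  if (model.length !== human.length) {
    throw new Error("label lists must be the same length");
  }
  if (model.length === 0) return 0;
  let matches = 0;
  for (let i = 0; i < model.length; i++) {
    if (model[i] === human[i]) matches++;
  }
  return matches / model.length;
}

// Illustrative labels for four segments (not real results).
const humanLabels: SegmentLabel[] = ["aligned", "aligned", "not aligned", "partial"];
const modelLabels: SegmentLabel[] = ["aligned", "partial", "not aligned", "aligned"];
const share = agreement(modelLabels, humanLabels); // 2 of 4 segments match
```

A single number like this hides which segments disagreed, which is why the per-segment drill-down mattered more to me than the overall score.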

How I tried to correct the data

My idea was that users could correct segments when the system was wrong — labelling them as aligned, partial, unclear, or not aligned. These corrections would become ground truth so the system could learn each user's focus pattern over time. This connects to data labelling in machine learning, where models need labelled examples to interpret raw data correctly.

FIGURE 10 · USER CORRECTION UI
Ticker overlay screen for quickly labelling a focus segment as aligned, partial, unclear, or not aligned.
Figure 10. User correction overlay for ambiguous Focus Integrity segments. The aligned, partial, unclear, and not aligned labels were designed to let the user contest the system's judgement. This helped, but it also showed that correction labels cannot solve the problem if the system does not understand the task context before judging.
FIGURE 11 · ORIGINAL RESEARCH METHOD
  01 User starts focus session
  02 Ticker records 10-second ticks
  03 Ticks are grouped into segments
  04 Each model predicts aligned, partial, unclear, or not aligned
  05 User corrects wrong labels
  06 Corrections become ground truth
  07 System learns user behaviour over time
Figure 11. My original Research Tick method. I thought user-corrected labels could help Focus Integrity learn each user's focus pattern over time. Source: author's own method diagram, 2026.
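Steps 03 to 06 of this method can be sketched in a few functions. The grouping rule here (consecutive ticks with the same predicted label form one segment) is an assumption picked for simplicity, and all names are hypothetical; the sketch only shows how a user correction could turn into a ground-truth example.

```typescript
type SegmentLabel = "aligned" | "partial" | "unclear" | "not aligned";

interface PredictedTick {
  index: number;
  predicted: SegmentLabel;
}

interface FocusSegment {
  startTick: number;
  endTick: number;
  predicted: SegmentLabel;
  corrected: SegmentLabel | null; // set only when the user overrides the model
}

interface GroundTruthExample {
  startTick: number;
  endTick: number;
  label: SegmentLabel;
}

// Step 03 (assumed rule): merge consecutive ticks that share a predicted label.
function groupIntoSegments(ticks: PredictedTick[]): FocusSegment[] {
  const segments: FocusSegment[] = [];
  let current: FocusSegment | null = null;
  for (const t of ticks) {
    if (current !== null && current.predicted === t.predicted) {
      current.endTick = t.index;
    } else {
      current = { startTick: t.index, endTick: t.index, predicted: t.predicted, corrected: null };
      segments.push(current);
    }
  }
  return segments;
}

// Step 06: only segments the user actually corrected become ground truth.
function groundTruth(segments: FocusSegment[]): GroundTruthExample[] {
  const examples: GroundTruthExample[] = [];
  for (const s of segments) {
    if (s.corrected !== null) {
      examples.push({ startTick: s.startTick, endTick: s.endTick, label: s.corrected });
    }
  }
  return examples;
}

// Illustrative predictions for six ticks (not real session data).
const predictions: SegmentLabel[] = ["aligned", "aligned", "partial", "partial", "partial", "aligned"];
const grouped = groupIntoSegments(predictions.map((p, i) => ({ index: i, predicted: p })));

// Step 05: the user says the middle segment was actually aligned.
const middle = grouped.length > 1 ? grouped[1] : undefined;
if (middle) middle.corrected = "aligned";

const examples = groundTruth(grouped);
```

Note what the example leaves out: the correction records that the model was wrong, but not why. The reason lives in the task, which this data structure never sees.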

Friend testing: fairness, privacy, and trust

After building the Research Tick replay system and user correction overlay, I ran a small formative test with four friends / classmates. This was not intended to prove that the model was accurate. The aim was to test whether people could understand what the Focus Integrity score was trying to communicate, which evidence signals felt acceptable, and where the judgement felt unfair or invasive.

I showed users the live tracker, replay dashboard, and ambiguous segments from short focus sessions. I then asked them to explain what they thought the score meant, which evidence they trusted, which evidence felt too invasive, and what they would want to happen if the system judged them incorrectly. This was a small qualitative check. NN/g argues that small formative usability tests are useful for finding design problems and supporting iteration, while larger samples are needed for quantitative claims (Nielsen, 2000, 2012).

EXPERIMENT 2 VISUAL · PARTICIPANT SESSION
A friend studying at a laptop during a Focus Integrity test session, with their face blurred for privacy.
Participant running a focus session with the desktop Focus Integrity tracker active. Face blurred for privacy.
EXPERIMENT 2 VISUAL · PLANNED DISTRACTION MOMENT
A friend checking a phone during a planned distraction moment in a Focus Integrity test session, with their face blurred for privacy.
Planned distraction moment used to test whether the system could identify drift without over-claiming attention.
Question | Scale
Did the score feel fair? | 1 to 5
Did the system explain itself clearly? | 1 to 5
Did any evidence feel too invasive? | 1 to 5
Would you trust this score if money/reputation was involved? | 1 to 5
What should happen when the system is uncertain? | Short answer

Participant | Task | System result | Human label | Fairness rating | Privacy comfort | Key comment
P1 | Coding | Mostly aligned | Aligned | 4/5 | 3/5 | The score felt okay, but I wanted to know why it dropped.
P2 | Essay writing | Partial | Aligned/unclear | 3/5 | 4/5 | Reading looks inactive, so the system needs a pause/reading mode.
P3 | Lecture watching | Distracted | Aligned | 2/5 | 3/5 | YouTube can be work depending on the task.
P4 | Research browsing | Partial | Partial | 4/5 | 2/5 | Screen evidence feels strong, but I would not want raw screenshots shared.
EXPERIMENT 2 VISUAL · TRUST AND FAIRNESS RATINGS
Average fairness: 3.3/5
Average trust with stakes: 2.0/5
Participant | Fairness | Trust
P1 | 4/5 | 3/5
P2 | 3/5 | 2/5
P3 | 2/5 | 1/5
P4 | 4/5 | 2/5
Post-session feedback showed that users cared less about perfect accuracy and more about whether the score could explain itself.
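The two averages come straight from the per-participant ratings, and the short check below reproduces them. The variable names are mine. Fairness sums to 13 across four participants, giving 3.25 (shown rounded as 3.3), and trust with stakes sums to 8, giving 2.0.

```typescript
// Ratings copied from the per-participant panel above (scale of 1 to 5).
const sessionRatings = [
  { participant: "P1", fairness: 4, trust: 3 },
  { participant: "P2", fairness: 3, trust: 2 },
  { participant: "P3", fairness: 2, trust: 1 },
  { participant: "P4", fairness: 4, trust: 2 },
];

function mean(values: number[]): number {
  if (values.length === 0) return 0;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

const averageFairness = mean(sessionRatings.map((r) => r.fairness)); // 13 / 4 = 3.25
const averageTrust = mean(sessionRatings.map((r) => r.trust));       // 8 / 4 = 2
```

With four participants these are descriptive numbers only. The gap between the two averages is the interesting part: the score felt moderately fair in a casual test, but much less trustworthy once stakes were mentioned.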
EXPERIMENT 2 VISUAL · BEFORE AND AFTER TABLE
Week 6 model assumption | Week 7 user response | Next design change
More evidence will make the score feel fairer | Users accepted monitoring only when the system could explain why a score changed | Show score reasons beside the number
Low typing or low input means drift | Essay writing and reading can look inactive even when the user is aligned | Add a pause / reading mode before warning the user
YouTube or browser use is usually distraction | Lecture watching and research browsing can be legitimate work | Compare evidence against the declared task, not against generic app categories
Camera presence makes the score stronger | Presence felt sensitive and did not always prove focus | Make camera evidence optional and clearly separate from core scoring
The model can make a final judgement | Users wanted correction, appeal, and an uncertain state | Make Focus Integrity explainable, contestable, and privacy-aware
Week 6 model assumptions became Week 7 user responses, then turned into concrete design changes for the next version of Focus Integrity.
Question | User response pattern | What I learnt
What do you think the score means? | Some users read it as an attention score rather than a task-alignment estimate. | The system needs clearer language. Focus Integrity should not sound like mind-reading.
Which evidence feels reasonable? | Active app, window title, and task-related screen context felt easiest to understand. | Users are more willing to accept evidence when it clearly connects to the declared task.
Which evidence feels invasive? | Camera presence felt more sensitive than app or window data. | Presence checking needs stronger consent, explanation, or an alternative.
What should happen if the score is wrong? | Users wanted correction, appeal, or clarification. | Focus Integrity needs to be contestable, not final.
Would you trust this in an arena? | Users said trust depends on seeing the reason behind the score. | Explanation matters more than the number alone.

The user test confirmed that the problem was not only technical accuracy. Even if the model produced a score, users still needed to understand why the score happened and what they could do when it was wrong. This supported my Week 7 finding that tick-based evidence is not enough by itself. The next version needs task context, uncertainty, and clarification before judgement. Combining implementation evidence, replay evidence, and peer interpretation also helped me triangulate the finding instead of relying on one evidence source (Whitenton, 2021).

Reflection on Action

My main learning

The biggest learning this week was that my Focus Integrity research method was too focused on accuracy and not enough on meaning. At the start of the week, I thought the main problem was that the score needed more evidence. My logic was:

more ticks → more collectors → more model comparison → more user correction → more reliable Focus Integrity

After building the TriScore model and the Research Tick / Replay system, I realised that this assumption was limited. The system could become more technically detailed, but that did not mean it became more fair or more useful.

The issue was not only whether the model had enough data. The issue was whether the model understood what the user was trying to do. Chrome could be distraction or research. A static screen could be inactivity or deep reading. Low typing could be avoidance or thinking. YouTube could be entertainment or learning. Without task context, the same behaviour could produce the wrong judgement.

This changed my understanding of Focus Integrity. I no longer saw it as a simple detection problem. I began to see it as a task-alignment problem. The system should not ask only, “Is the user active?” or “Does this behaviour look focused?” It should ask, “Does this behaviour make sense for the task the user committed to?”

FIGURE 13 · WHAT I THOUGHT VS WHAT I LEARNT
More ticks collected
What I thought: More data → Better model → More accurate Focus Integrity
What I learnt: More surveillance → Still weak task context → Unfair or unhelpful judgement
Figure 13. Week 7 reflection shift. I moved from assuming that more behavioural data would improve Focus Integrity to realising that data without task context can become surveillance without support. Source: author's own reflection diagram, 2026.

This shift was important because it changed the direction of the project. Before Week 7, I was trying to make Ticker a better focus detector. After Week 7, I started to question whether detection was the right starting point at all. The experiment showed that surveillance can create confidence without creating understanding. A system may look more intelligent because it collects more data, but if it does not understand the task, it can still make unfair or unhelpful judgements.

FIGURE 14 · METHOD SHIFT AFTER THE WEEK 7 EXPERIMENT
Version A: Tick detector | Version B: Task-aware verifier
Starts with behavioural data | Starts with task context
Watches app, input, camera, screen | Understands expected work pattern first
Warns when signals look suspicious | Asks clarification when evidence is ambiguous
Treats labels as correction after judgement | Uses task contract before judgement
Risk: surveillance without meaning | Goal: verification with explanation
Figure 14. Method shift after the Week 7 experiment. The failure of the tick-based method changed the project from a focus detector into a task-aware verifier. This is the main design movement produced by the week. Source: author's own reflection diagram, 2026.

Why this failed as a design direction

Ticker was not made only so that AI could check whether a user is focused. The larger purpose was to help the user focus further. The tick-based method did not do that well. It mainly monitored the user and reacted after it suspected drift. It did not properly structure the work before the session. It did not understand why the user chose a task, or help them decide what to do next.

My broader project tension is about AI increasing efficiency while surveillance becomes normalised. But this version mostly increased surveillance without enough improvement in efficiency. The more believable future is a system that becomes useful enough that people accept being watched.

OLD QUESTION

How can I make the score more accurate?

BETTER QUESTION

What does the system need to understand before it has the right to judge?

My positionality as a designer

This week also showed me something about my own design pattern. I am comfortable building systems — infrastructure, data models, backend logic, scoring engines, full product flows. This is a strength. It is also a weakness because I sometimes treat design uncertainty as an engineering problem. I responded to the trust problem by building more architecture: more collectors, more models, more replay systems. The deeper I went into tick-based detection, the more I realised the issue was not missing logic. It was missing context.

In Week 7, that strength became a blind spot. I made the system more complex before fully questioning whether behavioural detection was the right starting point.

Theory

Why behaviour data alone was not enough

The Week 7 experiment needed research grounding because Focus Integrity sits between productivity tracking, artificial intelligence, and surveillance. At first, I treated Focus Integrity as a detection problem: if the system collected enough behavioural signals, it could decide whether the user was focused or distracted. The experiment showed that this was too simple.

Context-aware computing

Dey (2001) defines context as information that helps characterise the situation of a person, place, or object. For Ticker, this means active app, window title, typing activity, and screen movement are not meaningful by themselves. They only become meaningful when connected to the user's actual task.

Dourish (2004) also reframes context as something produced through activity and interpretation, not a fixed background variable. This directly matched my ambiguous test cases. Chrome, YouTube, a static screen, or low typing cannot automatically be classified as distraction because the same signal can mean research, learning, planning, reading, or avoidance depending on the task.

Human-AI interaction

Human-AI interaction research helped me question how confident Focus Integrity should appear to be. If Ticker gives a score without showing uncertainty, the interface may feel more objective than it actually is. Amershi et al. (2019) argue that AI systems should make capabilities and limits clear, support correction, and adapt cautiously. This connects directly to my correction-label idea and to the next step: clarification questions when the evidence is ambiguous.
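A minimal sketch of what "showing limits and supporting correction" could mean in data terms. This is not taken from the Ticker prototype: the `Segment` class, its field names, and the three labels are hypothetical, chosen only to illustrate the idea that a user correction overrides the model's label and that the summary reports how much of the session the system was unsure about, instead of collapsing everything into one confident number.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    """One labelled stretch of a session (all fields are illustrative)."""
    seconds: int
    model_label: str  # "focused", "distracted", or "uncertain"
    user_correction: Optional[str] = None  # set when the user contests the label

    @property
    def final_label(self) -> str:
        # A user correction always wins over the model's guess.
        return self.user_correction or self.model_label


def summarise(segments: list[Segment]) -> dict:
    """Report the focused share together with how much was uncertain or corrected."""
    total = sum(s.seconds for s in segments)
    if total == 0:
        return {"focused_share": None, "uncertain_share": None, "corrected": 0}
    focused = sum(s.seconds for s in segments if s.final_label == "focused")
    uncertain = sum(s.seconds for s in segments if s.final_label == "uncertain")
    return {
        "focused_share": round(focused / total, 2),
        "uncertain_share": round(uncertain / total, 2),
        "corrected": sum(1 for s in segments if s.user_correction is not None),
    }


# Hypothetical session: one segment contested by the user, one left uncertain.
session = [
    Segment(600, "focused"),
    Segment(300, "distracted", user_correction="focused"),
    Segment(300, "uncertain"),
]
summary = summarise(session)
print(summary)
```

The design choice here is that uncertainty is reported as its own share, so an interface built on this summary could say "a quarter of this session could not be judged" and ask for clarification on that part.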

Surveillance and workplace monitoring

The surveillance literature made the project ethically more serious. Ajunwa et al. (2017) show how worker monitoring can become expansive when productivity behaviour is turned into judgement. Urquhart et al. (2022) also show that AI-enabled workplace surveillance raises questions about privacy, agency, and trust. This matters for Ticker because Focus Integrity could easily become a system that watches more simply because more signals are technically available.

Theory | What it changed in my design
Context is interpreted, not simply captured | Ticker needs task context before judging behaviour
AI systems should show limits and support correction | Focus Integrity needs user correction and clarification
Monitoring can affect privacy, agency, and trust | Ticker should avoid adding signals just because it can

Why this is not just a machine learning problem

Focus Integrity cannot be treated as a simple classification problem because the labels are not stable. “Focused” or “not focused” depends on task context, intention, tool choice, working style, and sometimes the user's internal state. The same behaviour can mean different things in different contexts, so the design problem is not only how to train a better model. It is how to define, explain, and contest the judgement before the system acts on it.

Preparation

Week 8 pivot: from detector to task-aware AI senior

Week 8 will not be another scoring experiment. It will be a framing and communication week. I need to prepare Crit 2 by showing the pivot clearly: Ticker is moving from a focus detector to a task-aware AI senior. The next prototype should show how the AI understands the task before the session starts, creates a focus contract, and asks clarification questions when the evidence is ambiguous.

FIGURE 15 · NEXT DIRECTION: FROM DETECTOR TO TASK-AWARE AI SENIOR

Old direction: AI as detector
  • User writes rough task
  • Ticker records ticks
  • AI judges behaviour
  • Warning or penalty

New direction: AI as task-aware senior
  • Task imported from existing tools
  • AI understands context
  • AI creates focus contract
  • Focus Integrity checks alignment
  • AI asks clarification when uncertain

Figure 15. Next direction after Week 7. Instead of starting from surveillance, the next model should start from task context and use Focus Integrity as one evidence layer. Source: author's own direction diagram, 2026.

NEXT EXPERIMENT QUESTION

Can Focus Integrity become more meaningful if Ticker understands the task before the session starts, instead of only judging behavioural ticks during the session?
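The new direction in Figure 15 can be pictured as a small sketch. Everything in it is an assumption made for illustration: `FocusContract`, `Tick`, `check_alignment`, the app names, and the keyword matching are hypothetical and do not describe the actual Ticker implementation. The point is only the shape of the flow: the contract is set before the session, each tick is checked against it, and ambiguous evidence produces a question instead of a warning.

```python
from dataclasses import dataclass, field


@dataclass
class FocusContract:
    """Agreed before the session starts (all fields are illustrative)."""
    task: str
    expected_apps: set[str] = field(default_factory=set)
    ambiguous_apps: set[str] = field(default_factory=set)
    keywords: set[str] = field(default_factory=set)


@dataclass
class Tick:
    """One behavioural sample taken during the session."""
    app: str
    window_title: str


def check_alignment(contract: FocusContract, tick: Tick) -> str:
    """Return 'aligned', 'ask', or 'unexpected' for a single tick."""
    app = tick.app.lower()
    title = tick.window_title.lower()
    if app in contract.expected_apps:
        return "aligned"
    if app in contract.ambiguous_apps:
        # The same app can mean research or avoidance, so look for task context first.
        if any(word in title for word in contract.keywords):
            return "aligned"
        return "ask"  # ask the user instead of issuing a warning
    return "unexpected"


# Hypothetical contract and ticks for a journal-writing session.
contract = FocusContract(
    task="Write DES303 reflection",
    expected_apps={"notion"},
    ambiguous_apps={"chrome", "youtube"},
    keywords={"des303", "reflection"},
)
ticks = [
    Tick("Notion", "Week 7 journal"),
    Tick("Chrome", "DES303 reading list"),
    Tick("YouTube", "Lo-fi mix"),
    Tick("Steam", "Store"),
]
results = [check_alignment(contract, t) for t in ticks]
print(results)
```

Even in this toy form, keyword matching on window titles is a crude stand-in for "understanding the task", which is why the "ask" branch matters more than the matching rule itself.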

Week 8 preparation plan

Week 8 task | Why it matters
Refine reverse brief | Clarify that Ticker is about verified effort, not only productivity
Prepare Crit 2 slide | Communicate the pivot from detector to AI senior
Prototype task import / focus contract | Show how task context enters the system
Add clarification flow | Show how the system handles uncertainty without unfair warning
Test with peers | Ask whether this feels more helpful, less invasive, or still controlling

Conclusion

Week 7 showed me that Focus Integrity cannot be designed as a simple accuracy problem. Users were willing to accept some monitoring when it helped them stay accountable, but they became less comfortable when the system could not explain why a score dropped. The most important finding was that trust depends on explanation, not only detection.

A lower-evidence model may be more acceptable if it is transparent, while a stronger evidence model may become too invasive if users cannot control what is captured. This shifts my next step from “make the model more powerful” to “make the judgement more explainable, contestable, and privacy-aware.”

References

  • Ajunwa, I., Crawford, K., & Schultz, J. M. (2017). Limitless worker surveillance. California Law Review, 105(3), 735–776. https://doi.org/10.15779/Z38BR8MF94
  • Amershi, S., Weld, D., Vorvoreanu, M., Fourney, A., Nushi, B., Collisson, P., Suh, J., Iqbal, S., Bennett, P. N., Inkpen, K., Teevan, J., Kikin-Gil, R., & Horvitz, E. (2019). Guidelines for human-AI interaction. Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 1–13. https://doi.org/10.1145/3290605.3300233
  • Design Research Practice. (2026). DES303 Week 7 2026 [Course handout, University of Auckland].
  • Dey, A. K. (2001). Understanding and using context. Personal and Ubiquitous Computing, 5(1), 4–7. https://doi.org/10.1007/s007790170019
  • Dourish, P. (2004). What we talk about when we talk about context. Personal and Ubiquitous Computing, 8(1), 19–30. https://doi.org/10.1007/s00779-003-0253-8
  • Nielsen, J. (2000, March 18). Why you only need to test with 5 users. Nielsen Norman Group. https://www.nngroup.com/articles/why-you-only-need-to-test-with-5-users/
  • Nielsen, J. (2012, June 3). How many test users in a usability study? Nielsen Norman Group. https://www.nngroup.com/articles/how-many-test-users/
  • Urquhart, L., Laffer, A., & Miranda, D. (2022). Working with affective computing: Exploring UK public perceptions of AI enabled workplace surveillance. In Proceedings of Ethicomp 2022, University of Turku (pp. 164–177). https://doi.org/10.48550/arXiv.2205.08264
  • Whitenton, K. (2021, February 21). Triangulation: Get better research results by using multiple UX methods. Nielsen Norman Group. https://www.nngroup.com/articles/triangulation-better-research-results-using-multiple-ux-methods/

Note on figures: Custom diagrams, reconstructed evidence views, and rating charts are based on the author's Week 7 development logs, replay tests, labelled segments, and peer testing notes. Figures 4, 7, and 10 are screenshots from the author's Ticker / Focus Integrity prototypes. Experiment 2 participant photographs are author-captured test documentation with face areas blurred before publication. All diagrams and prototype screenshots were built and rendered by the author, 2026.