Challenging the Concept of AI Detection in Academic Papers: Reliability, Fairness, and Academic Integrity in the Age of Generative Artificial Intelligence

QM Services

AI in Theory: Enhancing Human Intelligence Through Ethical Innovation

By NK Gosine

Abstract

Since the public release of ChatGPT in November 2022, educational institutions have increasingly relied upon artificial intelligence (AI) detection software to identify suspected AI-generated academic work. These systems are often presented as tools capable of preserving academic integrity by identifying the use of generative AI in student submissions. However, recent academic scholarship and institutional guidance have raised serious concerns regarding the reliability of such technologies. This paper examines the development of AI writing technologies, evaluates the current academic debate regarding AI detection, and considers the implications of false-positive findings for procedural fairness in higher education. Particular attention is given to an experimental observation whereby academic papers authored between 1995 and 2010 were repeatedly classified as AI-generated despite predating publicly available generative AI systems. It is argued that AI-detection technologies identify statistical linguistic characteristics rather than actual AI authorship and should therefore not be regarded as definitive evidence of academic misconduct.

Introduction

The emergence of generative artificial intelligence represents one of the most significant developments in education since the widespread adoption of the internet. The public release of ChatGPT by OpenAI in November 2022 transformed academic discourse by enabling students and researchers to generate coherent essays, summaries, and analytical commentary within seconds.¹

Educational institutions responded swiftly. Universities, colleges, and schools increasingly adopted AI-detection software, including Turnitin AI Detection, GPTZero, Originality.ai and similar systems, to identify potential misuse of generative AI in academic assessments.² The rationale was understandable: if students could use AI to produce assignments, institutions required mechanisms to detect such conduct.

However, unlike plagiarism-detection software, which identifies matching text through direct comparison against existing sources, AI-detection software attempts to infer authorship by analysing statistical characteristics of language.³ This distinction is critical. Whereas plagiarism software identifies evidence of copying, AI-detection systems merely estimate the probability that a document resembles machine-generated text.

The consequence is that AI-detection systems may generate false positives, that is, classify human-authored work as AI-generated. Recent academic literature increasingly suggests that this concern is neither theoretical nor insignificant.⁴

The Historical Impossibility Problem

One of the most significant challenges to the credibility of AI-detection systems emerges when historical documents are analysed.

During an independent experiment, academic papers written between 1995 and 2010 were submitted to several contemporary AI-detection platforms. Remarkably, numerous papers received substantial AI-authorship scores despite having been written years before modern generative AI became publicly available.

This observation raises a fundamental question: how can a document written in 1998 be identified as AI-generated when the technology in question did not exist?

This phenomenon may be described as the "Historical Impossibility Problem". If a document predating generative AI can be classified as AI-generated, then the detector cannot be identifying actual AI use. Instead, it must be identifying linguistic characteristics associated with AI-generated text.

Several explanations may account for this outcome.

First, AI systems were trained on vast quantities of human-generated content, including academic articles, textbooks, dissertations and journal publications.⁵ Consequently, modern AI writing often resembles traditional academic writing because academic writing itself formed part of the training data.

Second, many detection systems rely upon concepts such as perplexity and burstiness. Perplexity measures how predictable a text is to a language model, with lower values indicating more predictable text, while burstiness concerns variation in sentence length, structure and complexity across a document.⁶ Academic writing is often deliberately structured, formal and predictable, and may therefore score low on both measures. Ironically, these characteristics are also frequently associated with AI-generated text.
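
The two measures can be made concrete with a short sketch. The Python below is an illustration only and does not reproduce any commercial detector, whose internals are not public. It assumes a toy add-one-smoothed unigram model in place of the large neural language model a real system would presumably use, and it takes burstiness to be the standard deviation of sentence lengths, which is one of several possible definitions. The function names and sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def unigram_perplexity(text, reference):
    """Perplexity of `text` under an add-one-smoothed unigram model
    estimated from `reference` (a toy stand-in for a real language model).

    Lower values mean the text is more predictable to the model.
    """
    ref_tokens = tokenize(reference)
    counts = Counter(ref_tokens)
    vocab_size = len(counts) + 1  # one extra slot shared by all unseen words
    total = len(ref_tokens)

    tokens = tokenize(text)
    if not tokens:
        raise ValueError("text contains no tokens")

    log_prob = 0.0
    for tok in tokens:
        # Add-one smoothing keeps unseen words from having zero probability.
        p = (counts[tok] + 1) / (total + vocab_size)
        log_prob += math.log(p)

    # Perplexity is the exponential of the average negative log-probability.
    return math.exp(-log_prob / len(tokens))


def burstiness(text):
    """Population standard deviation of sentence lengths, in words.

    This is an assumed definition chosen for illustration.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(tokenize(s)) for s in sentences]
    if not lengths:
        raise ValueError("text contains no sentences")
    mean = sum(lengths) / len(lengths)
    return math.sqrt(sum((n - mean) ** 2 for n in lengths) / len(lengths))


# Hypothetical sample texts.
uniform = "The cat sat. The dog ran. The bird flew."
varied = "Stop. The committee deliberated for several hours before reaching a verdict."

print(burstiness(uniform), burstiness(varied))
print(unigram_perplexity("the cat", "the cat sat on the mat"))
```

In this sketch the evenly structured text has a burstiness of zero, while the text mixing a one-word sentence with a ten-word sentence scores much higher. A detector that treats low scores on both measures as a sign of machine generation would, on the argument above, be inclined to flag uniform and formal human prose as well.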

Third, highly skilled academic writers frequently produce work that is grammatically consistent, logically organised and stylistically restrained. Such characteristics may be interpreted by AI detectors as evidence of machine generation when they are, in reality, indicators of academic competence.

The implication is troubling. A system that treats qualities traditionally associated with excellent academic writing as signals of machine generation may incorrectly classify genuine scholarship as artificial intelligence output.

Current Academic Perspectives on AI Detection

The reliability of AI-detection technologies has become a growing area of scholarly debate.

One of the most influential studies in this area was conducted by Weber-Wulff and colleagues, who evaluated a range of AI-detection systems and concluded that available tools produced inconsistent results and were susceptible to both false positives and false negatives.⁷ Their findings challenged claims that AI-generated text could be reliably distinguished from human-authored content.

Similarly, OpenAI discontinued its own AI classifier in 2023 after acknowledging that the system demonstrated insufficient accuracy for reliable deployment.⁸ The decision is noteworthy because it suggests that even the developers of advanced language models were unable to create a consistently accurate detection mechanism.

More recent scholarship has echoed these concerns. Giray argues that AI-detection tools may unfairly accuse students and academics of misconduct, particularly where writing exhibits characteristics associated with formal academic discourse.⁹ Other studies have highlighted concerns regarding transparency, methodological limitations and the disproportionate impact of false positives upon students whose writing styles differ from expected norms.¹⁰

Particularly concerning is evidence suggesting that non-native English speakers may be more likely to receive elevated AI-detection scores due to linguistic patterns commonly associated with second-language writing.¹¹ If accurate, such findings raise significant questions regarding equality, discrimination and procedural fairness within educational institutions.

Are Students Being Unfairly Targeted?

The existence of AI-assisted cheating cannot reasonably be denied. Numerous studies have demonstrated that generative AI can produce sophisticated academic submissions capable of receiving passing grades.

However, acknowledging the existence of misconduct does not justify reliance upon unreliable evidence.

Fundamental principles of fairness require that allegations of academic dishonesty be supported by credible and verifiable evidence. Yet many institutions continue to rely heavily upon AI-detection scores despite growing recognition that such scores represent probabilistic assessments rather than factual determinations.

A report indicating that a paper is "85% AI-generated" may create the appearance of scientific certainty. In reality, the percentage reflects an algorithmic prediction based upon linguistic patterns rather than direct evidence of authorship.

This distinction mirrors a longstanding principle in legal reasoning. Statistical probability may justify further investigation, but it does not necessarily establish proof. AI-detection reports should therefore be treated as indicators warranting review rather than definitive evidence of misconduct.
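
The gap between a flag and proof can be illustrated with a base-rate calculation. The sketch below applies Bayes' theorem to purely hypothetical figures: the prevalence, sensitivity and false-positive rate are assumptions chosen for illustration, not measurements of any real detector, and the function name is likewise hypothetical. It estimates how often a flagged submission would actually be AI-generated, which is a different quantity from the percentage score printed on a detection report.

```python
def probability_flag_is_correct(prevalence, sensitivity, false_positive_rate):
    """Return P(AI-generated | flagged) using Bayes' theorem.

    prevalence: assumed share of submissions that are AI-generated
    sensitivity: assumed share of AI-generated submissions that are flagged
    false_positive_rate: assumed share of human-written submissions flagged
    """
    for name, value in (
        ("prevalence", prevalence),
        ("sensitivity", sensitivity),
        ("false_positive_rate", false_positive_rate),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie between 0 and 1")

    true_flags = prevalence * sensitivity
    false_flags = (1.0 - prevalence) * false_positive_rate
    if true_flags + false_flags == 0.0:
        raise ValueError("no submissions would be flagged under these inputs")
    return true_flags / (true_flags + false_flags)


# Hypothetical inputs: 10% of submissions are AI-generated, the detector
# catches 90% of them, and it wrongly flags 5% of human-written work.
ppv = probability_flag_is_correct(
    prevalence=0.10, sensitivity=0.90, false_positive_rate=0.05
)
print(f"{ppv:.2f}")  # prints 0.67
```

Under these assumed figures, roughly one flagged submission in three would be human-authored, even though the detector appears accurate on both headline measures. The result is sensitive to the inputs, which is precisely why a flag is better treated as a prompt for review than as a finding.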

Towards a More Equitable Framework

If AI-detection systems remain unreliable, institutions must consider alternative approaches to safeguarding academic integrity.

First, AI-detection scores should never constitute sole evidence of academic misconduct. This position has increasingly been adopted by universities and educational bodies throughout the United Kingdom, the United States and other jurisdictions.

Second, greater emphasis should be placed upon process-based assessment. Draft submissions, research logs, annotated bibliographies, oral presentations and viva-style examinations provide stronger evidence of authentic student engagement than a final written submission alone.

Third, institutions should develop transparent policies distinguishing between acceptable and unacceptable uses of AI. The use of AI for proofreading, brainstorming and organisational assistance differs fundamentally from submitting wholly AI-generated work as one's own.

Finally, educational institutions should prioritise AI literacy over AI prohibition. Artificial intelligence is becoming an integral component of professional practice across numerous disciplines. Students should therefore be taught how to use AI ethically and responsibly rather than being encouraged to conceal its use.

Conclusion

The emergence of generative AI has undoubtedly altered the educational landscape. Nevertheless, the assumption that AI-generated writing can be reliably detected is becoming increasingly difficult to sustain.

The experiment involving academic papers written between 1995 and 2010 demonstrates a significant conceptual weakness within current detection methodologies. If documents authored before the existence of modern AI systems can be classified as AI-generated, then AI-detection software cannot be regarded as identifying actual AI use. Instead, such systems appear to identify statistical patterns associated with academic writing itself.

While AI-assisted misconduct remains a legitimate concern, educational institutions must balance the pursuit of academic integrity against the equally important obligation to ensure fairness. Current evidence suggests that AI-detection systems should be treated as investigative tools rather than adjudicative instruments.

The future of academic integrity may therefore depend less upon increasingly sophisticated detection systems and more upon assessment practices that encourage transparency, critical thinking and ethical engagement with artificial intelligence.

Footnotes

¹ OpenAI, ‘Introducing ChatGPT’ (OpenAI, 30 November 2022).

² Turnitin, ‘AI Writing Detection Capabilities’ (Turnitin 2024).

³ Debora Weber-Wulff and others, ‘Testing of Detection Tools for AI-Generated Text’ (2023) arXiv:2306.15666.

⁴ PD Deep, ‘Evaluating the Effectiveness and Ethical Implications of AI Detection Tools in Academic Settings’ (2025) 16 Information 905.

⁵ Stephen Wolfram, What Is ChatGPT Doing ... and Why Does It Work? (Wolfram Media 2023).

⁶ Selin Dik, Osman Erdem and Mehmet Dik, ‘Assessing GPTZero's Accuracy in Identifying AI versus Human-Written Essays’ (2025) arXiv:2506.23517.

⁷ Weber-Wulff and others (n 3).

⁸ OpenAI, ‘AI Classifier: Overview’ (OpenAI, July 2023).

⁹ Louie Giray, ‘AI Detection Unfairly Accuses Scholars of AI Plagiarism’ (2024) 72 The Reference Librarian.

¹⁰ Deep (n 4).

¹¹ Ibid.