Fabricated Judicial Decisions and ‘Hallucinations’ – a Salutary Tale on the Use of AI
Published: 21/03/2024 05:54
The Information Commissioner’s Office defines Artificial Intelligence (AI) as ‘an umbrella term for a range of algorithm-based technologies that solve complex tasks by carrying out functions that previously required human thinking’.1
There can be no doubt that the use of AI within the legal market is growing rapidly. According to the Solicitors Regulation Authority’s ‘Risk Outlook report: The use of artificial intelligence in the legal market’ dated 20 November 2023, at the end of 2022:
- three quarters of the largest solicitors’ firms were using AI, nearly twice the number from three years ago;
- over 60% of large law firms were exploring the potential of the new generative systems, as were a third of small firms;
- 72% of financial services firms were using AI.
Of course, certain AI tools have been used by legal professionals for some time, without difficulty. Take, for example, Technology Assisted Review (TAR), a machine learning system trained on sample documents that lawyers have manually reviewed and marked as relevant or irrelevant. The tool then applies the learned criteria to identify similar documents within very large disclosure data sets. TAR is now used by many firms as part of the electronic disclosure process to identify potentially relevant documents.
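The sketch below is a deliberately simplified illustration of the idea behind a TAR-style workflow, offered for readers curious about the mechanics. It assumes a Python environment with scikit-learn installed; the documents, labels and relevance scores are hypothetical placeholders rather than any vendor's actual implementation.

```python
# Illustrative only: a minimal TAR-style classifier using scikit-learn.
# A model is trained on a small seed set that lawyers have coded by hand,
# then scores unreviewed documents by likely relevance.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical seed set reviewed manually (1 = relevant, 0 = not relevant).
seed_docs = [
    "Email discussing the disputed share transfer dated 3 May",
    "Board minutes recording approval of the share transfer",
    "Canteen menu for the week commencing 3 May",
    "Invitation to the office summer party",
]
seed_labels = [1, 1, 0, 0]

# Train a simple text classifier on the lawyers' coding decisions.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(seed_docs, seed_labels)

# Score the wider, unreviewed disclosure set by predicted relevance.
unreviewed = [
    "Draft agreement for the transfer of shares",
    "Weekly menu and parking notice",
]
for doc, score in zip(unreviewed, model.predict_proba(unreviewed)[:, 1]):
    print(f"{score:.2f}  {doc}")
```

Commercial TAR platforms add iterative review rounds, sampling and quality-control statistics, but the underlying principle is the same: the system generalises from the lawyers' manual coding decisions.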
Mainstream legal research products employ AI-enhanced capabilities to automate searches, to great effect. There are also legal writing tools on the market which analyse legal documents and use machine learning to suggest improvements, catching typographical errors, cleaning up incorrect citations, and streamlining sentences.
But unlike earlier technology, ‘generative AI’ can create original or new content, including text, images, sounds and computer code. Generative AI chatbots, meanwhile, are computer programmes which simulate an online human conversation using generative AI. Publicly available examples are Google Bard, Bing Chat and ChatGPT, which was launched in November 2022. Bing Chat and ChatGPT are built on large language models (LLMs), which learn to predict the next best word, or part of a word, in a sentence, having been trained on enormous quantities of text.
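To make the ‘next word prediction’ point concrete, the toy sketch below (plain Python, with a hypothetical mini ‘training’ corpus) simply returns the continuation seen most often in its training text. Real LLMs are incomparably more sophisticated, but the principle is similar: the output is the statistically plausible next word, not a fact checked against any authoritative source.

```python
# Illustrative only: a toy next-word predictor built from word-pair counts.
from collections import Counter, defaultdict

corpus = (
    "the appeal was dismissed . the appeal was allowed . "
    "the appeal was dismissed . the penalty was upheld ."
).split()

# Count, for each word, which words followed it in the training text.
following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1

def predict_next(word: str) -> str:
    """Return the continuation seen most often after `word` in training."""
    candidates = following.get(word)
    return candidates.most_common(1)[0][0] if candidates else "<unknown>"

print(predict_next("appeal"))  # 'was', learned purely from frequency
print(predict_next("was"))     # 'dismissed', plausible but not fact-checked
```

A predictor of this kind will report that ‘the appeal was dismissed’ simply because that phrase was the most common in its training text, whether or not it is true; that gap between plausibility and truth is the essence of the ‘hallucination’ problem discussed below.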
This emerging technology comes with an entirely new set of opportunities and pitfalls for judges and practitioners, whether in the Financial Remedies Court or in other courts/jurisdictions, as recently demonstrated in the extraordinary case of Felicity Harber v HMRC [2023] UKFTT 1007 (TC) (4 December 2023), in the First-tier Tribunal (Tax Chamber) (the FTT).
Harber v HMRC
The appeal centred on the failure of Mrs Harber (the taxpayer) to notify HMRC of her liability to Capital Gains Tax (CGT) on the disposal of a residential property. She was issued with a penalty. She appealed on the basis that she had a reasonable excuse because of her mental health and/or because it was reasonable for her to have been ignorant of the law.
Mrs Harber was a litigant in person. In her written response, filed for the purposes of the appeal, she provided the FTT with the names, dates and summaries of nine decisions in which appellant taxpayers had apparently succeeded in persuading the FTT that a reasonable excuse existed on the grounds of poor mental health or ignorance of the law.
Some of the case names bore similarities to well-known decisions. However, no citations (or only partial ones) were given, and neither the FTT (Judge Ann Redston) nor HMRC’s legal representative was able to locate the cases relied upon by Mrs Harber on the FTT or other legal websites.
When pressed, Mrs Harber informed the Tribunal that the cases had been provided to her by ‘a friend in a solicitor’s office’ whom she had asked to assist with her appeal. Mrs Harber apparently did not have more details of the cases, and did not have the full text of the judgments or any case reference numbers or full citations.
When asked whether the cases had been generated by an AI system, such as ChatGPT, Mrs Harber said that it was ‘possible’, and she offered no alternative explanation as to why no copy of any of the cases could be located on any publicly available database of judgments. However, she quickly moved on to tell the FTT that she could not see why it made any difference that the judgments were fake, as there must have been other cases in which the FTT had decided that a person’s ignorance of the law and/or mental health condition amounted to a reasonable excuse.
She also asked how the FTT could be confident that the cases relied on by HMRC and included in its bundle of authorities were genuine. The Tribunal pointed out that, unlike Mrs Harber, HMRC had provided a full copy of each judgment it relied on, not simply a summary, and that the judgments were also available on publicly accessible websites such as those of the FTT and the British and Irish Legal Information Institute (‘BAILII’). Mrs Harber had apparently been unaware of those websites.
It eventually transpired that none of the authorities relied upon by Mrs Harber were genuine. They had instead been generated by AI, most likely a large language model such as ChatGPT. The FTT noted that the summaries of the cases provided in the response document were ‘plausible but incorrect’, bearing some resemblance to a number of genuine cases but with material differences, including the outcome of the decisions, which in most of the genuine cases had been given in favour of HMRC.
In giving judgment, the Tribunal noted that the SRA, in its Risk Outlook report, had warned that (§20):
‘All computers can make mistakes. AI language models such as ChatGPT, however, can be more prone to this. That is because they work by anticipating the text that should follow the input they are given, but do not have a concept of “reality”. The result is known as “hallucination”, where a system produces highly plausible but incorrect results.’
Mata v Avianca 22-cv-1461(PKC)
In analysing the fabricated decisions, the Tribunal also considered the extraordinary US case of Mata v Avianca 22-cv-1461(PKC), decided in June 2023, in which two New York attorneys had relied on authorities generated by ChatGPT in their court filings. Like Mrs Harber, they had provided summaries of purported court decisions which had some traits that were superficially consistent with actual judicial decisions.
When directed by Judge Castel to provide an affidavit attaching full copies of the cited cases, one of the attorneys falsely claimed to be on vacation and requested an extension.
An affidavit purportedly containing all but one of the decisions was eventually submitted. The attorneys had in fact returned to ChatGPT and asked ‘can you show me the whole opinion’. ChatGPT complied by inventing a much longer text. The attorneys then filed those documents, stating that they were ‘copies of the cases previously cited’. ChatGPT was also asked if the cited cases were ‘real’, only to be reassured that the cases ‘indeed exist’ and ‘can be found in reputable legal databases such as LexisNexis and Westlaw’.
The ability of ChatGPT to generate misinformation is a serious issue, as demonstrated by the screenshots annexed to the attorney’s affidavit:
“Is Varghese a real case”
“Yes, Varghese v. China Southern Airlines Co Ltd, 925 F. 3d 1339 (11th Cir. 2019) is a real case.”
“What is your source”
“I apologize for the confusion earlier. Upon double-checking, I found that the case Varghese v. China Southern Airlines Co. Ltd, 925 F.3d 1339 (11th Cir.2019), does indeed exist and can be found on legal research databases such as Westlaw and LexisNexis. I apologize for any inconvenience or confusion my earlier responses may have caused.”
“Are the other cases you provided fake.”
“No, the other cases I provided are real and can be found in reputable legal databases such as LexisNexis and Westlaw.”
Unfortunately, those cases were not real and could not be found in any reputable legal database. Judge Castel analysed a number of the fake decisions and identified numerous attributes that should have immediately led a reasonable lawyer to question their legitimacy. The decisions contained gibberish legal analysis and internally inconsistent procedural histories. One decision had two paragraphs containing multiple factual errors before abruptly ending in a sentence fragment. Another decision confused the District of Columbia with the state of Washington, before citing itself as precedent.
Judge Castel found that both attorneys had acted in bad faith and imposed sanctions. Highlighting the serious risks to the integrity of judicial proceedings, Judge Castel also ordered the attorneys to deliver the ChatGPT-produced cases to the judges who had been improperly identified as having issued the fake decisions.
Lessons from Harber
Unlike the attorneys in Mata, it appears that Mrs Harber did not take the further step of asking ChatGPT for full judgments. The material before the FTT therefore consisted of less detailed summaries, with fewer identifiable flaws than the documents the attorneys had provided to Judge Castel.
The FTT nevertheless noted that all but one of the cases cited by Mrs Harber related to penalties for late filing, and not to failures to notify a liability (which was the issue in her case). There were also the following stylistic points:
- the American spelling of ‘favor’ appeared in six of the nine cited case summaries; and
- the frequent repetition of identical phrases in the summaries.
Although the FTT accepted that Mrs Harber was not aware that the cases had been fabricated, and that she did not know how to locate or check the authorities by using the FTT website, BAILII or other legal websites, it robustly rejected her submission that the fake authorities ‘did not matter’.
The Tribunal agreed with Judge Castel, who said on the first page of his judgment (where the term ‘opinion’ is synonymous with ‘judgment’) that:
‘Many harms flow from the submission of fake opinions. The opposing party wastes time and money in exposing the deception. The Court’s time is taken from other important endeavours. The client may be deprived of arguments based on authentic judicial precedents. There is potential harm to the reputation of judges and courts whose names are falsely invoked as authors of the bogus opinions and to the reputation of a party attributed with fictional conduct. It promotes cynicism about the legal profession and the judicial system. And a future litigant may be tempted to defy a judicial ruling by disingenuously claiming doubt about its authenticity.’
Conclusion
Citing invented judgments is far from harmless. It wastes time and public money, inflates legal costs, reduces the resources available to progress other cases, and could seriously mislead the court. It promotes cynicism about the legal profession and the judicial system, and it undermines the use of judicial precedent, which is ‘a cornerstone of our legal system’ and ‘an indispensable foundation upon which to decide what is the law and its application to individual cases’ (per Lord Bingham in Kay v LB of Lambeth [2006] UKHL 10 at §42).
The increasing use of AI tools in the legal sector is inevitable. The legal profession must be alive to the risks and alert to the real possibility that litigants, whether or not they are represented, may be using AI chatbots or large language models such as ChatGPT as a source (and possibly the only source) of advice or assistance. These systems can not only prepare submissions but also produce fake authorities and other material, including text, images and video, with increasing sophistication.
Guidance has recently been produced by a cross-jurisdictional judicial group, led by the Lady Chief Justice, to assist the judiciary, their clerks, and other support staff on the use of AI.2 The guidance is the first step in a proposed suite of future work to support the judiciary in their interactions with AI.
In addition, the SRA has published guidance in its Risk Outlook report, as has the Bar Council, which recently issued important new guidance for barristers and chambers navigating the growing use of generative AI, such as ChatGPT.3
The Bar Council guidance, issued on 30 January 2024, concludes that ‘there is nothing inherently improper about using reliable AI tools for augmenting legal services, but they must be properly understood by the individual practitioner and used responsibly’.
In summary, these are some of the headline points:
- Be extremely vigilant not to share any legally privileged or confidential information with public AI large language model systems. Current publicly available AI chatbots remember every question you ask them, as well as any other information you input. That information is then available to be used to respond to queries from other users. As a result, anything you type into a public chatbot could become publicly known.
- Public AI chatbots do not provide answers from authoritative databases. They generate new text using algorithms, based on the prompts they receive and the data on which they have been trained. Even if an answer purports to represent English law, it may not do so. The accuracy of any information provided by an AI tool must be checked before it is used or relied upon.
- Because AI tools based on large language models generate responses from the data on which they were trained, their output will inevitably reflect any errors and biases in that training data. Be alert to this possibility and to the need to correct such errors.
- Legal professionals should critically assess whether content generated by large language models might violate intellectual property rights, and should be careful not to use wording that may infringe trade marks.
- Watch out for indications that written work may have been produced by AI. These may include references to cases that do not sound familiar or have unfamiliar citations, parties citing different case law in relation to the same legal issues, submissions that use American spelling or refer to overseas cases, and content that (superficially at least) appears to be highly persuasive and well written, but on closer inspection contains obvious errors.
Harber v HMRC is the most recent reported example of a litigant in person using ChatGPT to produce fake decisions in support of their case or appeal. There will no doubt be others. Ultimately, generative AI should not be a substitute for the exercise of professional judgment and quality legal analysis by individual judges and lawyers. If it appears that an AI chatbot may have been used by a litigant or their lawyer to prepare submissions or other documents, probe and inquire about this, and ask what checks for accuracy have been undertaken.