Applied Legal Ethics: Analyzing The Effectiveness Of Utilizing Legal Technology To Protect Privileged Documents From Disclosure

By Peter Gronvall, Nathaniel Huber-Fliflet

March 4, 2019

The Ethical Obligation of Protecting Privileged Documents from Disclosure

There may be no stronger imperative in litigation and investigation matters than protecting privileged communications and data from disclosure to adverse parties. Corporations and their legal teams are singularly focused on this objective, because of the substantial and potentially irreversible risks that could result from a failure to achieve this objective. The obligation to protect privileged documents supersedes case litigation considerations; it stems from fundamental ethical obligations of legal practice. The protections of legal privilege are central to effective client advocacy: they foster and protect creative and thoughtful discourse between client and attorney, and they remain an essential part of the U.S. legal advocacy process.[1]

Protecting privileged information from disclosure is a long-established legal principle that ensures attorneys have open and presumptively non-discoverable communication channels through which to render legal advice to clients.[2]

It is important to note, however, that attorneys invoking the protections of legal privilege to withhold client communications or related work-product materials are subject to errors and mistakes in designating materials as privileged. Claiming attorney-client privilege (or its related work-product doctrine protection) today requires a nuanced and thoughtful approach, subject to scrutiny by requesting parties. Determining when the privilege applies in the context of a document review requires an element of sophistication in assessing claims of privileged relationships between senders and recipients of communications. It is not surprising that in modern legal practice, the application or claim of privilege can occur in a number of circumstances, and attorneys must account for all of those scenarios in making privilege determinations.

In today’s legal practice, there are at least twenty-four scenarios in which the production of otherwise privileged documents could result in the nullification or “waiver” of privilege.[3] It has been observed that “few issues arise with greater frequency in civil litigation than whether a document is privileged, to prevent those communications from compelled disclosure by virtue of the attorney-client or work-product doctrine privilege.”[4]

In matters that involve massive volumes of potentially relevant data, today’s legal teams risk nullifying or waiving privilege protections if privileged documents are somehow ‘missed’ during the document review process and thus disclosed. As a result, attorneys could easily put privileged documents at risk of being produced to the requesting or opposing party. As any reader of this article knows, inadvertent production of privileged communications or work-product documents can have substantial implications; disclosure of privileged information could be devastating to the legal claims or defenses at hand, because those documents could provide an opposing party with insights into a client’s proposed legal strategy, case-planning decision-making process, or internal – and confidential – investigation findings.

So an essential question is, in today’s data-intensive matters, what can lawyers do to help ensure that privileged documents are identified, flagged, and protected from disclosure? An important part of the answer to that question, in our view, lies in the use of technology. To wit, technology solutions can be deployed in a defensible and repeatable manner – beyond simple search terms, for example – to identify and withhold privileged documents.

This article will briefly explore some of the key concerns around using technology to help with that important, practical requirement.

Today’s Data Volumes Place Critical Stress on Lawyers’ Ethical Obligations to Protect Privileged Documents from Disclosure, and the Answer to that Challenge May Be the Use of Technology Solutions

At this point in corporate law, data volumes have grown so large that it has been difficult for lawyers to keep pace with their obligations to keep their clients’ privileged data confidential. As a result, attorneys are turning to new tools to identify privileged documents to withhold from disclosure. Data volumes at issue in modern litigation are truly staggering. It is estimated that, at current rates, the accumulated digital universe of data will grow from 4.4 zettabytes today to around 44 zettabytes, or 44 trillion gigabytes, by the year 2020.[5] This statistic alone is daunting.

The practical challenge is clear: with data volumes now irreversibly large, how can attorneys abide by their ethical obligations to protect privileged documents from disclosure? While the solution will not be simple, it will undoubtedly embrace the use of technology.

Using Advanced Analytics – Beyond Search Terms – to Protect Attorney-Client Privileged Material from Inadvertent Disclosure and Waiver

Legal teams today are increasingly relying upon text analytics to search for and identify privileged documents in order to withhold them from production to requesting parties. Using search terms is an important part of that. But to further improve their results, lawyers are now looking to predictive modeling techniques to enhance the results and to improve their productions.

Properly deployed, keyword searching and predictive modeling are used to identify privileged documents and then withhold them from disclosure. While keyword searching has been a core element of privilege review for decades, and while predictive modeling is becoming more popular as a technique for that type of filtering, the research indicating how well each of those technologies actually performs is still sparse.

To this end, Ankura performed a study to measure how effectively keyword searching and predictive modeling techniques performed when tasked with both targeting and segregating privileged content, as well as withholding the segregated content from disclosure.

Analysis of Common Market Methodologies Used to Identify and Protect Privileged Data from Disclosure

Ankura examined how effectively keyword searching and predictive modeling techniques protect against the disclosure of privileged information. The intent of our study was to evaluate the strengths and weaknesses of traditional and advanced approaches in litigation scenarios – to see how keyword searching performed as compared to predictive modeling – and not necessarily to determine whether one approach was more effective than the other. Further, our approach was to evaluate the efficacy of the two approaches measured against the ethical requirements to protect privileged data incumbent on legal practitioners. As such, the study examined the following important considerations:

  • Utilizing Keyword Searches to Identify Privileged Documents
    • How much of the privileged-document population do keywords successfully help identify?
    • In keyword searching scenarios, how many ‘not privileged’ documents are typically also reviewed to conduct a comprehensive privilege review?
  • Deploying Predictive Modeling Technology to Identify Privileged Documents
    • Can predictive models effectively target privileged content?
    • Can predictive models find privileged content that keyword searching cannot necessarily identify?

Ankura conducted this study by performing a ‘look-back’ on documents associated with three confidential, recently-active corporate legal matters.[6]

The data sets from these matters comprised an array of data types, including email, Microsoft Office documents, PDFs, and other text-based documents. Prior to the study, teams of attorneys had reviewed all documents in each of the data sets. Their coding designations of those documents as ‘privileged’ or ‘not privileged’ were used to measure the effectiveness of each privilege-targeting technology: keyword searching and predictive modeling.

Figure 1. Data Set Details

| Data Set Name | Total Documents | Privileged Documents Coded by Attorneys | Not Privileged Documents Coded by Attorneys | Privilege Richness Rate |
|---|---|---|---|---|
| Matter A | 360,531 | 46,756 | 313,775 | 12.97% |
| Matter B | 397,289 | 14,326 | 382,963 | 3.61% |
| Matter C | 8,715,165 | 536,788 | 8,178,377 | 6.16% |
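
The richness rate in Figure 1 is the share of each data set that attorneys coded as privileged. As a quick check, the short Python sketch below recomputes it from the counts in the figure; the function and variable names are illustrative and are not taken from the study.

```python
# Counts copied from Figure 1: (total documents, privileged documents).
FIGURE_1 = {
    "Matter A": (360_531, 46_756),
    "Matter B": (397_289, 14_326),
    "Matter C": (8_715_165, 536_788),
}


def richness_rate(total_docs: int, privileged_docs: int) -> float:
    """Return the share of privileged documents as a percentage."""
    return 100.0 * privileged_docs / total_docs


# Round to two decimals, the precision used in the figure.
richness = {
    name: round(richness_rate(total, privileged), 2)
    for name, (total, privileged) in FIGURE_1.items()
}

for name, rate in richness.items():
    print(f"{name}: {rate:.2f}%")
```

The recomputed values agree with the richness rates reported in the figure.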

Experiments and Results

Keyword Searching

The keyword searching experiments evaluated the performance of each matter’s comprehensive list of keyword search terms, which had been developed by attorneys. The lists for Matters A, B, and C contained 845, 6,771, and 7,140 search terms, respectively. The search term lists consisted of words including ‘privileged’, ‘legal’, and ‘attorney-client’, as well as search terms representing known email addresses, law firm names and email domains.
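
To make the mechanics concrete, here is a minimal Python sketch of applying a term list to a set of documents. The study does not describe its search engine or publish its term lists, so the terms, the sample documents, the `@examplelawfirm.com` domain, and the whole-word matching rule are all assumptions made for illustration.

```python
import re

# Hypothetical term list; the study's actual lists are not published.
TERMS = ["privileged", "legal", "attorney-client", "@examplelawfirm.com"]


def term_pattern(term):
    """Escape a term and require word boundaries where it starts or ends with a letter or digit."""
    left = r"(?<!\w)" if term[0].isalnum() else ""
    right = r"(?!\w)" if term[-1].isalnum() else ""
    return left + re.escape(term) + right


def keyword_hits(documents, terms):
    """Return the IDs of documents whose text contains at least one term."""
    matcher = re.compile("|".join(term_pattern(t) for t in terms), re.IGNORECASE)
    return {doc_id for doc_id, text in documents.items() if matcher.search(text)}


# Hypothetical documents keyed by ID.
documents = {
    "doc1": "Please keep this PRIVILEGED and confidential.",
    "doc2": "From: jane@examplelawfirm.com. Draft attached.",
    "doc3": "Lunch on Friday?",
    "doc4": "The paralegal team is out.",
}

hits = keyword_hits(documents, TERMS)
print(sorted(hits))
```

Under the whole-word assumption, ‘legal’ does not match inside ‘paralegal’, so only the first two documents are flagged. A production term list would make that choice deliberately, term by term.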

After applying each matter’s keyword search term list to its respective data set, we calculated the recall and precision of each search term list. These measurements were made possible because the attorney-review coding was transparently available for each matter. As industry participants know, recall and precision are two commonly-known metrics that are regularly used to evaluate the effectiveness of text analytics technologies within the legal industry. These metrics helped Ankura interpret the strengths and weaknesses of the privileged-document targeting technologies. The following are brief definitions:

  • Recall – This measurement quantifies the proportion of privileged documents in the data set that are identified by the privilege-targeting method. This metric helps to confirm the completeness of the privilege review. The higher the recall rate, the better for the producing party. High recall rates in a document set result in more privileged documents being segregated and targeted for review by the attorneys, and ultimately protected by being withheld from production.
    • For example, if there are 100 privileged documents in a hypothetical data set of 1,000 documents and 75 privileged documents are identified by the targeting method, the recall rate would be 75%.
  • Precision – This measurement quantifies the proportion of documents identified by the privilege-targeting method that are actually privileged. This metric helps confirm the efficiency of the review for privileged documents. As with the recall rate, the higher the precision rate, the better; however, no privilege-targeting method is perfect. This study found that, during privilege review, some nonprivileged documents will be flagged as keyword term hits or by the predictive model, resulting in attorney review of some documents that are not privileged.

A high precision rate minimizes the number of nonprivileged documents that legal teams must review in order to identify the privileged documents. With a maximized precision rate thus enhancing the privilege review, the ultimate production results in better quality and reduced costs.

  • By way of example, if 150 documents are identified as potentially privileged by the targeting method within the hypothetical 1,000 document data set and 75 of those are coded privileged by the review team, then the resulting precision rate would be 50%.
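
The two definitions reduce to simple ratios. The Python sketch below reproduces the hypothetical 1,000-document example from the definitions above; the function names are illustrative only.

```python
def recall(privileged_found: int, privileged_total: int) -> float:
    """Share of all privileged documents that the targeting method identified."""
    return privileged_found / privileged_total


def precision(privileged_found: int, flagged_total: int) -> float:
    """Share of the flagged documents that are actually privileged."""
    return privileged_found / flagged_total


# Hypothetical data set from the text: 100 privileged documents exist,
# 150 documents are flagged, and 75 of the flagged documents are privileged.
r = recall(privileged_found=75, privileged_total=100)
p = precision(privileged_found=75, flagged_total=150)

print(f"recall = {r:.0%}, precision = {p:.0%}")  # recall = 75%, precision = 50%
```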

Recall and precision are usually inversely related measures: as recall rates increase, precision rates usually decrease, and vice versa. Ankura observed that it was unlikely that either keyword searching or predictive modeling would maximize both recall and precision at the same time.

Figure 2. Keyword Searching Results

| Data Set Name | Recall | Recall Details | Precision | Precision Details |
|---|---|---|---|---|
| Matter A | 93.78% | The term list hit on 43,848 of 46,756 privileged documents | 22.72% | 22 out of every 100 documents that hit on the term list were privileged |
| Matter B | 94.73% | The term list hit on 13,571 of 14,326 privileged documents | 3.68% | 3 out of every 100 documents that hit on the term list were privileged |
| Matter C | 94.74% | The term list hit on 508,553 of 536,788 privileged documents | 20.39% | 20 out of every 100 documents that hit on the term list were privileged |

The results of these experiments demonstrated that keyword searching could be a very effective privilege-targeting technology in some circumstances. The search term lists from these matters were extensive, containing both broad and specific terms. Each matter’s search term list identified more than 93% of the privileged documents in the data set.

To achieve that level of recall, the precision rates resulting from these search term lists required extensive document review. For example, Matter C’s results required review of more than 6.5 million not privileged documents to achieve a 94.74% recall rate. These results also showed that roughly 6% of the privileged documents were missed by keyword searching alone. While no search technique is perfect, this became an important observation to consider, especially as data volumes increase.

Additional Finding: Less comprehensive and less thoughtfully-crafted search term lists may not provide recall results comparable to those in this study.

Predictive Modeling as a Technique to Identify Privileged Documents

Ankura’s experiments evaluated the performance of predictive modeling’s ability to effectively target privileged documents, as compared to keyword term searching. Specifically, we set out to answer an important question: could predictive modeling help find privileged documents that keyword searching techniques would have otherwise missed?

How Does Predictive Modeling Work?

Predictive modeling uses advanced machine learning techniques to automatically classify unreviewed documents into predefined categories of interest (e.g., in this instance, attorney-client privilege or work-product documents). Predictive modeling techniques employ text classification, a form of supervised learning, to make a binary decision – to designate a document as privileged or not privileged. Utilizing training documents (documents previously identified by lawyers to teach the machine what is and what is not a privileged document), a predictive model analyzes the textual content of each ‘privileged’ and ‘not privileged’ training document.

Once our model was trained and deployed, it was used to assign each document in the unreviewed data set a probability score from 0 to 100, indicating the likelihood that the document was privileged rather than not privileged. It follows from this approach that a higher ‘privileged’ score indicates a greater chance that a document contains privileged material, and thus could be withheld from disclosure.

During this study, Ankura developed a predictive model for each matter using a random sample of 5,000 training documents pulled from each data set. As stated previously, the data sets were reviewed by attorneys, and their coding was used to develop and test the predictive models.

Figure 3. Predictive Model Training Sets

| Data Set Name | Total Training Documents | Privileged Documents Coded by Attorneys | Not Privileged Documents Coded by Attorneys | Privilege Richness Rate |
|---|---|---|---|---|
| Matter A | 5,000 | 689 | 4,311 | 13.78% |
| Matter B | 5,000 | 170 | 4,830 | 3.40% |
| Matter C | 5,000 | 326 | 4,674 | 6.52% |

In this predictive modeling experiment, the following algorithm and parameters were selected to develop the predictive models:

  • Ankura deployed the Logistic Regression machine learning algorithm to create the predictive models. One of Ankura’s prior studies demonstrated that predictive models generated with the Logistic Regression algorithm perform quite well on legal matter documents.[7]
  • Ankura’s other modeling parameters included a ‘bag of words’ representation with 1-gram tokens and normalized frequency, with 20,000 tokens used as features.
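
For readers who want to see how the pieces described above fit together, the following is a minimal, self-contained Python sketch of a bag-of-words Logistic Regression text classifier. It is not Ankura’s implementation: the tokenizer, the plain gradient-descent training loop, the learning rate, the number of epochs, and the six training sentences are all assumptions chosen to keep the example small. Only the general recipe (1-gram tokens, normalized frequency, a cap of 20,000 features, and a 0-100 score) comes from the article.

```python
import math
import re
from collections import Counter

MAX_FEATURES = 20_000  # feature cap reported in the article


def tokenize(text):
    """Split text into lowercase 1-gram tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocabulary(texts, max_features=MAX_FEATURES):
    """Keep the most frequent tokens across the training texts."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    return {tok: i for i, (tok, _) in enumerate(counts.most_common(max_features))}


def vectorize(text, vocab):
    """Return a sparse {feature index: normalized frequency} mapping."""
    tokens = [t for t in tokenize(text) if t in vocab]
    if not tokens:
        return {}
    return {vocab[t]: c / len(tokens) for t, c in Counter(tokens).items()}


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(texts, labels, epochs=300, learning_rate=5.0):
    """Fit logistic regression by batch gradient descent on the log loss."""
    vocab = build_vocabulary(texts)
    rows = [vectorize(t, vocab) for t in texts]
    weights, bias = [0.0] * len(vocab), 0.0
    for _ in range(epochs):
        # Prediction error for every training document, before any update.
        errors = [
            sigmoid(bias + sum(weights[i] * v for i, v in row.items())) - y
            for row, y in zip(rows, labels)
        ]
        bias -= learning_rate * sum(errors) / len(rows)
        for row, err in zip(rows, errors):
            for i, v in row.items():
                weights[i] -= learning_rate * err * v / len(rows)
    return vocab, weights, bias


def privilege_score(text, model):
    """Score a document from 0 to 100; higher means more likely privileged."""
    vocab, weights, bias = model
    row = vectorize(text, vocab)
    return 100.0 * sigmoid(bias + sum(weights[i] * v for i, v in row.items()))


# Hypothetical attorney-coded training documents (1 = privileged, 0 = not).
training_texts = [
    "privileged and confidential legal advice from outside counsel",
    "attorney client communication regarding litigation strategy",
    "counsel advice on the draft settlement agreement",
    "quarterly sales figures attached for the marketing team",
    "lunch menu for the team offsite on friday",
    "please update the project schedule and budget figures",
]
training_labels = [1, 1, 1, 0, 0, 0]

model = train(training_texts, training_labels)
for text in ("legal advice from counsel on litigation",
             "sales figures for the marketing lunch"):
    print(f"{privilege_score(text, model):5.1f}  {text}")
```

On this toy data the first test sentence scores well above 50 and the second well below it. A real matter would use thousands of attorney-coded training documents, as in Figure 3, and typically an established machine learning library rather than a hand-written training loop.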

After generating the study’s predictive models, Ankura examined the precision of each model, measured at recall rates similar to those previously achieved using keyword searching techniques. This analysis gave Ankura the ability to accurately compare the results of the predictive modeling and keyword searching technologies.
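
One way to make this kind of comparison is to rank documents by model score, walk down the ranking until the target recall is reached, and measure precision at that cutoff. The Python sketch below illustrates the idea on eight hypothetical scored documents; the article does not describe the exact procedure Ankura used, so this is an assumption, and it ignores ties between documents that share the cutoff score.

```python
def precision_at_recall(scored_docs, target_recall):
    """Find the first score cutoff whose recall meets the target.

    scored_docs is a list of (score, is_privileged) pairs.
    Returns (cutoff score, recall, precision) at that cutoff.
    """
    total_privileged = sum(1 for _, is_priv in scored_docs if is_priv)
    if total_privileged == 0:
        raise ValueError("no privileged documents to recall")
    ranked = sorted(scored_docs, key=lambda pair: pair[0], reverse=True)
    found = 0
    for reviewed, (score, is_priv) in enumerate(ranked, start=1):
        found += is_priv  # True counts as 1, False as 0
        if found / total_privileged >= target_recall:
            return score, found / total_privileged, found / reviewed
    raise ValueError("target recall must be at most 1")


# Hypothetical model scores (0-100) paired with the attorneys' coding.
scored = [
    (95, True), (90, True), (80, False), (70, True),
    (60, False), (40, True), (30, False), (10, False),
]

at_75 = precision_at_recall(scored, 0.75)
at_100 = precision_at_recall(scored, 1.00)
print(at_75)   # (70, 0.75, 0.75)
print(at_100)  # cutoff 40: recall 1.0, precision 4/6
```

Pushing recall from 75% to 100% in this toy example lowers precision from 75% to about 67%, which is the trade-off between the two metrics described earlier.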

Figure 4. Predictive Modeling Results and Keyword Searching Comparison

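| Data Set Name | Predictive Modeling Recall | Predictive Modeling Precision | Keyword Searching Recall | Keyword Searching Precision | Precision Comparison |
|---|---|---|---|---|---|
| Matter A | 93.78% | 30.11% | 93.78% | 22.72% | 7.39% |
| Matter B | 94.74% | 4.44% | 94.73% | 3.68% | 0.76% |
| Matter C | 94.74% | 17.43% | 94.74% | 20.39% | -2.96% |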

Analyzing the results from the figure above, Ankura found that predictive modeling performed just as well as, if not sometimes better than, keyword term searching. In Matter A, at a recall rate of 93.78%, the resulting precision rate was more than 7 percentage points higher than that of the corresponding keyword searching experiment. For Matter B, the predictive modeling precision rate was slightly higher (by less than 1 percentage point) than that of its corresponding keyword searching experiment. For Matter C, the precision rate of predictive modeling was nearly 3 percentage points lower than that of its corresponding keyword searching experiment.

Another observation from these experiments concerned the significant impact that the precision rates had on the volume of document review. For instance, in Matter A, an increase in precision of roughly 7 percentage points resulted in reviewing 25,000 fewer documents. At an estimated cost of $1.00 a document for first pass review, this precision gain translated into an estimated saving of at least $25,000.
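
The link between precision and review cost can be sketched with simple arithmetic: at a fixed number of privileged documents found, the number of documents that must be reviewed is that count divided by the precision rate. The Python example below uses the small hypothetical data set from the recall and precision definitions (75 privileged documents found), not the study’s figures, and assumes an improved precision of 75% purely for illustration. Only the $1.00 per-document cost comes from the article.

```python
COST_PER_DOCUMENT = 1.00  # first pass review estimate quoted in the article


def documents_to_review(privileged_found, precision):
    """Flagged documents that must be reviewed to surface the privileged ones."""
    return privileged_found / precision


def review_savings(privileged_found, baseline_precision, improved_precision,
                   cost_per_document=COST_PER_DOCUMENT):
    """Return (documents avoided, cost avoided) when precision improves."""
    fewer_docs = (documents_to_review(privileged_found, baseline_precision)
                  - documents_to_review(privileged_found, improved_precision))
    return fewer_docs, fewer_docs * cost_per_document


# Hypothetical: 75 privileged documents found at 50% versus 75% precision.
fewer_docs, savings = review_savings(75, 0.50, 0.75)
print(f"{fewer_docs:.0f} fewer documents, ${savings:.2f} saved")
```

In this toy case the review drops from 150 documents to 100. The same relationship, applied to data sets of hundreds of thousands of documents, is what turns a few points of precision into a material cost difference.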

In sum, Ankura’s results demonstrated that predictive modeling, properly deployed, can be a very effective technology-based solution for identifying privileged documents, both from accuracy and cost-savings perspectives.

Figure 5. Documents Identified by Predictive Modeling That Did Not Hit on Keyword Search Terms

| Data Set Name | Total Documents at 50% Precision or Greater and Did Not Hit on a Keyword Search Term | Coded Privileged by an Attorney | Coded Not Privileged by an Attorney |
|---|---|---|---|
| Matter A | 6,075 | 1,062 | 5,013 |
| Matter B | 2 | 2 | 0 |
| Matter C | 72,295 | 6,924 | 65,371 |

Until recent years, keyword term searching was one of the few technologies available to target and withhold privileged documents from disclosure. The risk when deploying keyword searching is that a keyword term list could be too narrow, for example by omitting outside counsel domains or attorney names, and thus miss key documents.

To evaluate a predictive model’s ability to address this risk, Ankura performed additional experiments to determine if predictive modeling could identify privileged documents that did not hit on each matter’s keyword search terms.

These additional experiments targeted a population of potentially privileged documents that fell within a precision rate of 50% or higher for each model and also did not hit on a keyword search term. Ankura observed that predictive modeling identified privileged documents that did not hit on any keyword search term utilized by the attorneys. In each of Matters A and C, more than 1,000 privileged documents were identified using predictive modeling that would have otherwise been missed by keyword searching.
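
Conceptually, this experiment is a set difference: take the documents the model ranks highly, remove those that already hit on a keyword term, and then compare what remains against the attorneys’ coding. The Python sketch below shows that logic on five hypothetical documents. The score cutoff of 75 is a stand-in, since the article does not state which model scores corresponded to the 50% precision level in each matter.

```python
def model_only_candidates(scores, keyword_hit_ids, cutoff):
    """IDs scored at or above the cutoff that no keyword term flagged."""
    high_scoring = {doc_id for doc_id, score in scores.items() if score >= cutoff}
    return high_scoring - set(keyword_hit_ids)


def split_by_coding(candidate_ids, privileged_ids):
    """Split candidates into (coded privileged, coded not privileged)."""
    coded_privileged = candidate_ids & set(privileged_ids)
    return coded_privileged, candidate_ids - coded_privileged


# Hypothetical inputs: model scores, keyword hits, and attorney coding.
scores = {"d1": 92, "d2": 81, "d3": 77, "d4": 35, "d5": 12}
keyword_hit_ids = {"d1", "d4"}
attorney_privileged_ids = {"d1", "d2", "d4"}

candidates = model_only_candidates(scores, keyword_hit_ids, cutoff=75)
found, false_alarms = split_by_coding(candidates, attorney_privileged_ids)
print(sorted(found), sorted(false_alarms))  # ['d2'] ['d3']
```

Here document d2 is the kind of result counted in the ‘Coded Privileged by an Attorney’ column of Figure 5: privileged, scored highly by the model, and missed by every keyword term.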

These results were compelling. They evidenced how predictive modeling technologies enhance the ability to efficiently target privileged documents when compared to using keyword searching as a standalone privilege-targeting method.

Conclusion: Dynamic Use of Keyword Searching Combined with Predictive Modeling Can Enhance Protection Against Inadvertent Disclosure of Privileged Documents in Legal Matters

Protecting privileged communications and data from inadvertent disclosure is a fundamental ethical obligation for counsel. Today, the challenges of privilege review require the use of intelligent solutions, combining keyword search terms and predictive modeling techniques.

This study demonstrated that keyword searching is a powerful privileged-document targeting technology by itself. Indeed, comprehensive search term lists can find the vast majority of privileged communications. But perhaps most importantly, this study also validated that predictive modeling is also a very effective privilege-targeting tool, and can sometimes outperform keyword searching. Importantly, predictive modeling can help to identify privileged documents that do not hit on keyword search terms, making it a powerful addition to any privilege review that would otherwise rely on keyword search terms alone.

The legal community should continue to develop a robust and comprehensive review strategy to manage the protection of privileged content, especially as it relates to disclosure and production issues. Lawyers are understandably concerned with meeting their ethical obligations and thus should be willing to advance those ethical goals through the use of technology. Technology is here to help lawyers, and lawyers in turn should embrace it.


©2018. Published in ABA Section of Antitrust Law – Compliance and Ethics Spotlight, Winter 2019, by the American Bar Association. Reproduced with permission. All rights reserved. This information or any portion thereof may not be copied or disseminated in any form or by any means or stored in an electronic database or retrieval system without the express written consent of the American Bar Association or the copyright holder.


[1] J. Unger, “Maintaining the Privilege: A Refresher on Important Aspects of the Attorney-Client Privilege,” Business Law Today, Oct. 2013, Retrieved from https://www.americanbar.org/publications/blt/2013/10/01_unger.html.
[2] E. Epstein, “The Attorney-Client Privilege and the Work-Product Doctrine, Volume 1.” American Bar Association, 2007, pp. 4-5.
[3] Id. at 398.
[4] Id. at 2.
[5] B. Marr, “Big Data: 20 Mind-Boggling Facts Everyone Must Read,” Forbes, 2015, Retrieved from https://www.forbes.com/sites/bernardmarr/2015/09/30/big-data-20-mind-boggling-facts-everyone-must-read/#3d83169417b1.
[6] Those clients assented to our confidential use of those data sets, including the coding designations in those data sets.
[7] R. Chhatwal, N. Huber-Fliflet, R. Keeling, J. Zhang and H. Zhao, “Empirical Evaluations of Preprocessing Parameters’ Impact on Predictive Coding’s Effectiveness,” Proc. 2016 IEEE International Big Data Conference.