Boston Bar Journal

Predictive Coding: Process and Protocol

September 17, 2013
Fall 2013, Vol. 57, #4

by Andrew Gallo and Sarah Kim

Practice Tips


Electronic discovery is a fact of life in most business litigation.  As the use of email and text messaging has increased, so has the burden of reviewing electronically stored information (“ESI”).  The tried-and-true method for reviewing voluminous ESI is for attorneys to examine, document by document, the ESI identified through search terms.  However, in large cases a document-by-document manual review, even after search-term culling, may be extremely expensive and time-consuming.  Recently, courts have begun to embrace predictive coding as an acceptable method of technology-assisted review (“TAR”), which may alleviate these burdens.  See, e.g., Da Silva Moore v. Publicis Groupe & MSL Group, 287 F.R.D. 182, 192 (S.D.N.Y. 2012) (Peck, Mag. J.), aff’d, No. 11 Civ. 1279, 2012 WL 1446534 (S.D.N.Y. Apr. 26, 2012) (Carter, J.).  In Cambridge Place Investment Management Inc. v. Morgan Stanley & Co., Inc., SUCV2010-2741-BLS1 and SUCV0211-0555-BLS1 (Billings, J.), a Massachusetts court recently approved the use of predictive coding.

As courts accept predictive coding, most (including the Cambridge Place court) will require parties to negotiate protocols for its use.  This article provides an overview of predictive coding and highlights issues likely to arise when negotiating such a protocol.


Predictive coding is a process by which a computer is “trained” through the use of an algorithm to rank documents by the likelihood of their responsiveness to discovery requests.  The technology is similar to that used by internet search engines, like Google, to recommend relevant web pages based upon the search terms identified by the user.

The first step in the predictive coding process is to define a “Review Set” of ESI.  In larger cases, this means identifying custodians likely to have responsive documents and agreeing on a date range for the collected ESI.  Because of its efficiencies (discussed below), predictive coding allows for a broader Review Set than a traditional document-by-document review of ESI.  Parties may also agree that only documents that contain certain search terms will be included in the Review Set.
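As a rough illustration of how a Review Set might be assembled from those agreed parameters, the following Python sketch filters a collection by custodian, date range, and optional search terms.  The document fields, custodian names, and the simple substring matching are assumptions made for illustration; real review platforms apply their own collection and search logic.

```python
from datetime import date

def build_review_set(documents, custodians, start, end, search_terms=None):
    """Keep documents from agreed custodians, within the agreed date range,
    and (if search terms were agreed) containing at least one term."""
    terms = [t.lower() for t in (search_terms or [])]
    selected = []
    for doc in documents:
        if doc["custodian"] not in custodians:
            continue
        if not (start <= doc["sent"] <= end):
            continue
        # Assumption: plain case-insensitive substring matching stands in
        # for whatever search syntax the parties actually negotiate.
        if terms and not any(t in doc["text"].lower() for t in terms):
            continue
        selected.append(doc)
    return selected

# Hypothetical collection used only to exercise the function.
documents = [
    {"id": "DOC-1", "custodian": "A. Smith", "sent": date(2009, 3, 2),
     "text": "Draft supply contract attached."},
    {"id": "DOC-2", "custodian": "A. Smith", "sent": date(2012, 1, 5),
     "text": "Supply contract renewal."},
    {"id": "DOC-3", "custodian": "B. Jones", "sent": date(2009, 6, 1),
     "text": "Lunch on Friday?"},
    {"id": "DOC-4", "custodian": "C. Lee", "sent": date(2009, 6, 1),
     "text": "Contract dispute update."},
]

review_set = build_review_set(
    documents,
    custodians={"A. Smith", "B. Jones"},
    start=date(2008, 1, 1),
    end=date(2010, 12, 31),
    search_terms=["contract"],
)
```

Omitting `search_terms` yields the broader, custodian-and-date-only Review Set that predictive coding may make practical.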

Once the Review Set is collected and the documents aggregated, a subset of documents from the Review Set is randomly selected to form a “Seed Set.”  Attorneys familiar with the case will review each document in the Seed Set and code it as either responsive or non-responsive to the case’s document requests.
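The random selection of a Seed Set can be sketched in a few lines of Python.  The document identifiers, the Seed Set size of 500, and the fixed random seed are all hypothetical; the appropriate sample size is a question for the parties and their e-discovery experts.

```python
import random

def draw_seed_set(review_set_ids, seed_size, rng_seed=2013):
    """Return (seed_ids, remaining_ids) using a simple random sample."""
    if seed_size > len(review_set_ids):
        raise ValueError("seed_size exceeds the size of the Review Set")
    # A fixed seed (an assumption here) lets either side reproduce the draw.
    rng = random.Random(rng_seed)
    seed_ids = set(rng.sample(review_set_ids, seed_size))
    remaining_ids = [d for d in review_set_ids if d not in seed_ids]
    return sorted(seed_ids), remaining_ids

# Hypothetical Review Set of 10,000 document identifiers.
review_set_ids = [f"DOC-{n:05d}" for n in range(1, 10001)]
seed_ids, remaining_ids = draw_seed_set(review_set_ids, seed_size=500)
```

Recording the seed value in the protocol is one way to make the selection auditable without exchanging the documents themselves.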

Using an algorithm, the computer then analyzes the coded documents in the Seed Set.  The computer “learns” by identifying the characteristics that distinguish responsive documents from non-responsive ones, such as the frequency of particular words and phrases and their proximity to one another.
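To make the idea of “learning” from coded documents concrete, the sketch below trains a very small word-frequency model (a naive Bayes style classifier) on a hypothetical Seed Set.  It is an illustration only: it considers word frequency but not proximity, the sample texts are invented, and commercial predictive coding tools use their own, typically more sophisticated, algorithms.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9']+", text.lower())

def train(seed_set):
    """seed_set: list of (text, is_responsive) pairs coded by attorneys."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, is_responsive in seed_set:
        doc_counts[is_responsive] += 1
        word_counts[is_responsive].update(tokenize(text))
    if not doc_counts[True] or not doc_counts[False]:
        raise ValueError("Seed Set needs both responsive and non-responsive examples")
    vocab = set(word_counts[True]) | set(word_counts[False])
    return {"word_counts": word_counts, "doc_counts": doc_counts, "vocab": vocab}

def responsiveness_score(model, text):
    """Estimated probability (0 to 1) that a document is responsive."""
    counts, vocab = model["word_counts"], model["vocab"]
    totals = {label: sum(counts[label].values()) for label in (True, False)}
    log_odds = math.log(model["doc_counts"][True] / model["doc_counts"][False])
    for word in tokenize(text):
        if word not in vocab:
            continue  # words never seen in the Seed Set carry no signal
        # Add-one smoothing keeps unseen word/label pairs from zeroing out.
        p_resp = (counts[True][word] + 1) / (totals[True] + len(vocab))
        p_non = (counts[False][word] + 1) / (totals[False] + len(vocab))
        log_odds += math.log(p_resp / p_non)
    log_odds = max(-50.0, min(50.0, log_odds))  # avoid overflow in exp()
    return 1 / (1 + math.exp(-log_odds))

# Hypothetical attorney-coded Seed Set.
seed_set = [
    ("Please sign the supply contract amendment before closing", True),
    ("Contract pricing terms for the supply agreement attached", True),
    ("Lunch on Friday? The new place downtown", False),
    ("Fantasy football league standings attached", False),
]
model = train(seed_set)
likely = responsiveness_score(model, "Revised supply contract attached")
unlikely = responsiveness_score(model, "Friday lunch downtown")
```

In this toy example the contract-related message scores above 0.5 and the lunch message scores below it, because the words they share with the coded examples point in opposite directions.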

Once “trained,” the computer analyzes the remaining documents in the Review Set to rank and group them into “tiers” based upon the likelihood that they are responsive to the requests used to code the Seed Set.  The uppermost tiers contain the documents most likely to be responsive and the lowest tiers contain those least likely to be responsive.  By placing documents in tiers, predictive coding enables parties to prioritize the review of the documents most likely to be responsive.  A party may decide to produce all ESI in the uppermost tier with minimal additional review and to discard ESI in the lowest tier without additional review.  Thus, predictive coding can eliminate a significant amount of document-by-document review.
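The ranking and tiering step can be pictured as a sort followed by a split.  The sketch below assumes each document already has a responsiveness score between 0 and 1 and divides the ranked list into tiers of roughly equal size; the scores, the number of tiers, and the equal-size rule are assumptions, since vendors and protocols may define tiers by score cutoffs instead.

```python
def assign_tiers(scores, num_tiers=4):
    """scores: dict of doc_id -> score in [0, 1].
    Returns a list of tiers; index 0 is the uppermost (most likely responsive)."""
    if num_tiers < 1:
        raise ValueError("num_tiers must be at least 1")
    ranked = sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
    tier_size, extra = divmod(len(ranked), num_tiers)
    tiers, start = [], 0
    for i in range(num_tiers):
        # Spread any remainder across the uppermost tiers.
        end = start + tier_size + (1 if i < extra else 0)
        tiers.append(ranked[start:end])
        start = end
    return tiers

# Hypothetical scores for six documents.
scores = {"A": 0.95, "B": 0.10, "C": 0.70, "D": 0.40, "E": 0.85, "F": 0.05}
tiers = assign_tiers(scores, num_tiers=3)
```

Here the uppermost tier holds documents A and E and the lowest holds B and F, which is the grouping a party would use to decide where manual review effort should go first.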

Further, studies show that predictive coding can find responsive documents more accurately than a manual review.  Maura R. Grossman & Gordon V. Cormack, Technology-Assisted Review in E-Discovery Can Be More Effective and More Efficient Than Exhaustive Manual Review, XVII Rich. J.L. & Tech. 11, 4-5, 48 (2011); Herbert L. Roitblat, Anne Kershaw & Patrick Oot, Document Categorization in Legal Electronic Discovery: Computer Classification vs. Manual Review, 61 J. Am. Soc’y for Info. Sci. & Tech. 70, 79 (2010).


The courts that have sanctioned predictive coding, including the Massachusetts Superior Court’s Business Litigation Session (“BLS”) in Cambridge Place, have done so only after approving a (typically agreed-upon) protocol for its use.  Some or all of the following issues are likely to arise when negotiating a protocol with an opposing party:

1.         Search Terms.  Using search terms to select documents for the Review Set can aid the process by creating a Review Set that is “richer” with responsive documents.  The computer can then be trained more easily and accurately.  At least one recent case has sanctioned using search terms in this manner.  In re Biomet M2a Magnum Hip Implant Prod. Liab. Litig., No. 3:12-MD-2391, 2013 WL 1729682, *2 (N.D. Ind. Apr. 18, 2013).  Search terms can also be used when negotiating how to review the various document tiers.  One way to review the lower tiers is to agree to review only documents from those tiers that “hit” on agreed search terms.

2.         Non-Responsive Documents.  Some protocols allow the producing party’s responsiveness decisions to be “double checked” by permitting the opponent to review the documents marked as non-responsive in the Seed Set.  Litigants have legitimate concerns about exposing non-responsive documents to an opponent, and the right to such secondary review is far from settled (it has no counterpart in a standard manual review).  If review of non-responsive documents cannot be avoided, a compromise may be to agree to share a small, but statistically significant, subset of non-responsive documents.  If there are concerns about sensitive material, parties can use a log (similar to a privilege log) to identify rather than produce non-responsive documents.

3.         The Tiers.  How the various tiers of documents are treated is key to the predictive coding process.  Issues likely to arise are whether and to what extent documents in the uppermost tiers need to be manually reviewed prior to production and whether the lowest tiers can be disregarded without additional review.  If lower tiers are to be disregarded, then the requesting party will want to ensure that the predicted number of responsive documents in those tiers is below some statistical threshold.  Parties may agree upon a method to manually review a statistically significant sample from the lower tiers to test the computer’s predicted responsiveness rate.  Most e-discovery vendors employ statisticians, who should be consulted on the formation and negotiation of these technical aspects of a predictive coding protocol.

4.         Statistical Data.  The amount of statistical data provided to an opponent may be an area of negotiation.  Common metrics include the amount of ESI in each tier and a comparison of the computer-predicted responsiveness rate with the actual rate (after manual review) for the ESI in tiers that are manually reviewed.

5.         Key Unique Terms.  Parties can agree that any ESI in the Review Set that “hits” on terms that are unique to a case (e.g., a party’s name) will be automatically elevated to the highest tier.  These should be terms that are highly likely to be found only in responsive documents; otherwise, their use will undermine the predictive coding process.

6.         Enrichment Documents.  In addition to using the Seed Set, parties often pick “Enrichment Documents” central to the case (such as a contract) to train the computer.  Resolving disagreements about the identity and number of such documents may involve each side identifying an equal number of such documents.  Parties should consult with e-discovery experts as to whether the use of such documents may statistically impact the review results.
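The lower-tier sampling test described in item 3 lends itself to a short worked example.  The Python sketch below takes the results of a manual review of a random sample from the lower tiers, estimates the responsiveness rate, and compares the upper end of a Wilson confidence interval to an agreed threshold.  The sample size, the count of responsive documents, the 5% threshold, the 95% confidence level, and the choice of the Wilson interval are all assumptions for illustration; the parties’ statisticians may prefer a different method.

```python
import math

def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95% confidence)."""
    if n <= 0:
        raise ValueError("sample must contain at least one document")
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)

def lower_tier_check(sample_coding, threshold, z=1.96):
    """sample_coding: list of booleans, True where manual review found the
    sampled lower-tier document responsive."""
    n = len(sample_coding)
    hits = sum(sample_coding)
    low, high = wilson_interval(hits, n, z)
    return {
        "sample_size": n,
        "responsive_found": hits,
        "observed_rate": hits / n,
        "upper_bound": high,
        # Pass only if even the upper bound is at or below the agreed threshold.
        "passes": high <= threshold,
    }

# Hypothetical result: 400 lower-tier documents sampled, 6 found responsive.
sample_coding = [True] * 6 + [False] * 394
result = lower_tier_check(sample_coding, threshold=0.05)
```

With these invented numbers the observed rate is 1.5% and the upper bound falls a little above 3%, so the sample would satisfy a 5% threshold but not a 2% one.  The same observed and predicted rates are the kind of statistical data contemplated in item 4.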

The use of predictive coding is increasing as more courts accept the technology.  While it offers efficiencies for reviewing documents, the overall cost and time savings will depend upon how easily the technology is accepted in the case and the burdens associated with negotiating and implementing the protocol.  The costs of implementation should decrease as the technology becomes more widely accepted and protocols are standardized.


Andrew J. Gallo is a partner with the firm Bingham McCutchen LLP, where his practice focuses on complex commercial litigation and creditor representation in bankruptcy with an emphasis on the representation of financial institutions.

Sarah G. Kim is counsel with the firm Bingham McCutchen LLP, where her practice focuses on securities enforcement, securities litigation, and broker-dealer defense.