D4 eDiscovery Service Blog
Feb 24

A few weeks ago, we posted a blog discussing dirty data and conceptual engines. This week, we'll talk about how to deploy a conceptual engine for a review and what benefits you can expect if you choose to do so.

Document review is typically the single largest expenditure in the discovery process because it involves actual people reviewing and coding data. As the volumes of data subject to requests for production keep growing, in-house and outside counsel will inevitably look for alternative methods of complying with discovery requests. Counsel also discovered long ago that wherever humans are involved, "human error" is involved as well, and their protocols must compensate for it.

So, how can you use a conceptual engine to help? We have seen these tools successfully deployed in three different ways: Conceptual Clustering, Categorization, and Assisted Review.


Conceptual Clustering

Some conceptual engines allow you to automatically build concept-based clusters across the data before any work has been done in the system. The downside is that you have no control over which topics the system centers its clusters on. That said, we've seen this tool be useful in two ways.

First, it can be a useful method for organizing batches of documents. If reviewers check out a batch in which the documents all concern similar concepts, they have fewer mental gears to shift through, and their review speed should increase.

Second, you can use it to isolate useless data. We have worked with a dataset in which one cluster centered on correspondence to and from a particular airline, all relating to travel. Since travel was not the subject of the lawsuit, we were able to bulk-tag that cluster (10% of the dataset) as not relevant and eliminate it from the set in need of review.
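To make the bulk-tagging idea concrete, here is a minimal Python sketch. The clustering step is a hypothetical stand-in (it groups documents by their most frequent term, whereas a real conceptual engine clusters on latent semantics); the point is what happens once a cluster, like the airline correspondence above, is identified as useless.

```python
from collections import Counter, defaultdict

# Illustrative stopword list; a real engine maintains its own noise-word list.
STOPWORDS = frozenset({"the", "a", "an", "to", "and", "of", "is", "for"})

def dominant_term(text):
    """Crude proxy for a document's concept: its most frequent non-stopword."""
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    return Counter(words).most_common(1)[0][0] if words else None

def cluster_documents(docs):
    """Group doc_id -> text by dominant term (stand-in for conceptual clusters)."""
    clusters = defaultdict(list)
    for doc_id, text in docs.items():
        clusters[dominant_term(text)].append(doc_id)
    return dict(clusters)

def bulk_tag(clusters, label, tag):
    """Apply one coding decision to every document in a cluster at once."""
    return {doc_id: tag for doc_id in clusters.get(label, [])}
```

With a "flight" cluster identified, `bulk_tag(clusters, "flight", "Not Relevant")` codes every document in it with a single decision rather than one review per document.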


Categorization

There are two ways to use this tool. First, if you have a good idea of what you are looking for in the data, a subject matter expert can run targeted queries (in lieu of a statistical sample), code the results using either a relevancy field or an issues field, and use the conceptual engine to develop categories for the remaining data. This allows you to easily find the data in the system that is important to your case, and it is a particularly good methodology for exploring a matter where you already have some knowledge of what is relevant to the litigation (e.g., how much of this is there in my data?).

Second, the engine can be used to QC your reviewers' work. You can run categorization across the data they have already coded and turn it back on itself to check for inconsistencies. A good example is giving the conceptual engine examples of privileged data and asking it to find documents similar to those examples (usually you can set a percentage threshold, say 95%) that are not coded as privileged. The same approach works for other particularly sensitive data, such as business-confidential or responsive data.
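Both uses of categorization boil down to a similarity comparison against coded exemplars. The sketch below illustrates the privilege QC case, using a bag-of-words cosine score as a crude stand-in for the engine's conceptual similarity; the function names and the 95% threshold are illustrative assumptions, not any vendor's API.

```python
from collections import Counter
from math import sqrt

def _cosine(text_a, text_b):
    """Bag-of-words cosine similarity, a rough proxy for conceptual similarity."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = sqrt(sum(v * v for v in va.values())) * sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

def qc_privilege(coded_docs, privileged_examples, threshold=0.95):
    """Flag docs that score above the threshold against a privileged exemplar
    but were not coded as privileged (doc_id -> (text, coding))."""
    return [
        doc_id
        for doc_id, (text, coding) in coded_docs.items()
        if coding != "Privileged"
        and any(_cosine(text, ex) >= threshold for ex in privileged_examples)
    ]
```

The flagged documents become a second-look queue for a privilege review team rather than an automatic re-coding.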


Assisted Review

The process is not difficult, and tools like Relativity have a built-in application that makes it that much easier to use. Assisted Review uses categorization as described above; in short, the system walks you through several iterations of statistical samples until you reach a comfort level with how it is categorizing the remaining data. Because a statistical sample is a small fraction of the actual population, this lets you avoid the cost of reviewing the entire dataset.

There are some caveats:

• This can be a good method when you do not know, or are not clear about, what you are looking for in the dataset.
• You will want highly trained reviewers who are at least somewhat familiar with the types of products or data they will be looking through. You will also want the reviewers to confer several times per day to discuss what they are finding and how they are coding certain data.
• At a minimum, you can use this to separate the data that is likely relevant/responsive from the data that is not, much like a broad first-pass review.
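The iterative sampling loop can be sketched as follows. This is an illustrative outline, not Relativity's actual workflow: `predict` stands in for the engine's category call, `human_code` for the reviewer's decision, and the parameters are hypothetical defaults.

```python
import random

def assisted_review(predict, human_code, population, sample_size=10,
                    target_agreement=0.95, max_rounds=10, seed=1):
    """Run rounds of statistical sampling; stop once the engine's category
    calls agree with human coding at the target rate."""
    rng = random.Random(seed)
    rate = 0.0
    for round_no in range(1, max_rounds + 1):
        sample = rng.sample(population, min(sample_size, len(population)))
        rate = sum(predict(d) == human_code(d) for d in sample) / len(sample)
        if rate >= target_agreement:
            return round_no, rate  # comfort level reached: stop sampling
        # otherwise: the newly coded sample would retrain the engine here
    return max_rounds, rate
```

Each round only costs `sample_size` human decisions, which is where the savings over reviewing the full population come from.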


Index Construction

Almost all conceptual engines give the administrator control over how the index is constructed. The bottom-line rule here is to always clean up your training set.

Your training set is what you use to train the index on the different types of concepts contained in your data. Remember the old adage, “Garbage in, garbage out”? It definitely applies here. Typically, you will see the following:

• Upper and lower size limitations on the files.
• Restriction that all files in the training set have text.
• Elimination of files with useless text from the training set. A common example is non-OCR'd images: you do not want your index training itself on a bunch of files whose entire text is "Size = 1000 x 900". The rule applies to any file that does not present real, meaty, useful text. Failing to remove these files from your training set can ultimately lead to an index that returns skewed results because of junk concepts in the data.
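A minimal filter implementing the three rules above might look like this in Python. The specific size limits and junk markers are illustrative assumptions; a production engine applies its own restrictions.

```python
def clean_training_set(docs, min_chars=50, max_chars=1_000_000,
                       junk_markers=("size =",)):
    """Drop files that would skew the index: no extracted text, outside the
    size limits, or text that is just metadata (e.g. non-OCR'd images)."""
    keep = {}
    for doc_id, text in docs.items():
        if not text or not text.strip():
            continue  # no text at all
        if not (min_chars <= len(text) <= max_chars):
            continue  # below or above the size limits
        if any(marker in text.lower() for marker in junk_markers):
            continue  # junk text such as "Size = 1000 x 900"
        keep[doc_id] = text
    return keep
```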

The Bottom Line
There are a variety of ways to deploy conceptual indices on your project to help segregate data and remove the need to review every document in your dataset. Examine your case, look at your options, and figure out what will work best.

