How To Do It For Little-To-No Additional Cost
As an eDiscovery consultant, I’ve managed and optimized hundreds of client environments and have been fortunate to work with some truly talented technologists. After all of these engagements, here’s what I’ve learned: the pressure to perform in this industry is incredibly high, and I see organizations struggling with two core questions. First, how do we do more with less, given that budgets and resources are limited? Second, how do we handle a large new project that could overwhelm our existing capacities?
Both questions are about eDiscovery performance. Here’s the good news as I see it: in virtually every environment I’ve assessed, there have been opportunities to dramatically increase performance for nominal additional cost. In this article I’d like to share five ideas that could really help you do more with less. In fact, it’s entirely possible that these ideas could double the performance of your eDiscovery environment for zero additional dollars.
Who Is This Counsel For?
Law firms, eDiscovery service providers, major accounting and consulting firms, government regulators and even corporations often own and manage eDiscovery environments in-house. While there are substantial differences in their service offerings and business models, all of these types of organizations want speed, throughput and reliability from their environments. If what I’m about to describe sounds like your organization, these ideas could be exactly what you’re looking for:
- Overall, you believe your eDiscovery environment should be able to perform much better than what you’re seeing today.
- You’ve made a large capital investment in systems but have not yet realized the performance improvements you were hoping to see.
- Your technology’s unpredictability makes it difficult to accurately forecast the time required to complete work and deliver against production deadlines.
- Your organization historically struggles to hit hard deadlines and SLA thresholds.
- You operate eDiscovery in-house and manage a high volume of client and in-house data.
- You’ve experienced a system failure (outage) with detrimental financial, reputational, and human consequences.
- You’re concerned about your capacity to handle a large new case.
Before we look at my five ideas, let me quickly describe what the word performance means to me:
- Speed. This is about systems running at optimal speed throughout the eDiscovery lifecycle, but particularly during Processing, Culling, Analytics, Review and Production.
- Power. This is about the environment hitting periods of high-intensity workloads without being overwhelmed, slowing down, crashing and causing platform outages.
- Scaling Capacity. This is about all systems, not just storage, easily handling peaks and valleys of workloads without being pushed to red-line status.
- Reliability. This is about your confidence in the systems working day in and day out in a way that is consistent with your expectations.
If that sounds like your organization and what you’d like to achieve, I believe these five ideas could really help:
- Get clarity about your environment’s current state.
- Provision virtual machine resources based upon workload requirements.
- Understand agent activity on your machines.
- Avoid SQL resource starvation.
- Conduct capacity planning to enhance environment predictability.
Let’s take a closer look at each of these ideas.
Get Clarity About Your Environment’s Current State
The starting point for doubling performance is documenting your current state. This is a relatively straightforward exercise, yet one that most organizations skip. Why do you need to do this?
- You need baseline performance metrics to help you document your current capabilities so you can define and recognize “poor” performance.
- You need documentation about all components within the system so you can easily spot obvious problems. There are several components within eDiscovery environments that could produce slow-downs, weak performance and outages. If you can compare actual performance, by way of log files and the like, with the manufacturer’s projected performance, you just might be able to narrow down the cause of your problems. This can expedite resolutions.
- You want to be predictive about future-state performance so your efforts are focused. For example, if your current Processing performance yields 2.5 Gigabytes (GB) per hour per worker but your expectation is to be at 5 GB per hour per worker, you know where to focus—enhancing Processing throughput.
- If you can spot problem areas, you can engage in intelligent fixes that might not cost you anything. The default solution we see organizations leaning toward when there are performance issues is to throw more hardware at the problem. In some instances, that’s necessary. But in many, many other instances, that won’t really solve your problem.
I recommend that you document performance in two areas: systems and throughput (the actual amount of work you’re getting done). For systems, I recommend this type of analysis:
- Physical Servers. How many servers do you have? How old are they? What are their specifications for CPU and RAM?
- Storage. How many systems do you have? How old are they? How much unused capacity do you have today? What are their specifications? We often find that the types of storage organizations use can significantly impact cost and performance. We recommend a tiered architecture that would likely include high-speed (flash) systems and lower-speed systems, based on the tasks they need to handle.
- Virtualization specifications. How many virtual machines are you running in total? How many are you running per physical server? How many physical CPUs are available versus allocated (virtual CPUs) on each host? Have you, perhaps, under-invested in the physical requirements necessary to support your virtual infrastructure?
- SQL databases. How many SQL databases are in your environment today? How many SQL servers do you have to support them? Do you have the right balance between SQL Standard versus SQL Enterprise licenses? We often find that organizations do not engineer their SQL environment to support their actual work-flows.
- eDiscovery application. What application portfolio is in use and how is it used in your workflow? How many licenses do you have? Which platform did you choose to go with? What are your current utilization trends? How does your actual performance compare to the application vendor’s benchmarks?
- Network. Is your eDiscovery environment segregated from your general IT environment? I recommend that you document your network systems and design. There are many tools that will allow you to create network maps quickly and cost effectively. I recommend that you also document the technical specifications of key network devices, primarily switches and routers. Network configuration also can have a substantial impact on your security posture.
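To make this inventory concrete, here is a minimal Python sketch of what a current-state snapshot for a single host might look like. It uses the psutil library and illustrative field names; a full inventory would also pull server, storage, hypervisor and SQL details from sources such as your hypervisor’s management API and your storage vendor’s tooling, so treat this as a starting point rather than a complete tool.

```python
# Minimal current-state snapshot for one host (illustrative; assumes psutil is installed).
# A complete inventory would also capture hypervisor, storage and SQL Server details.
import json
import platform
from datetime import datetime, timezone

import psutil

def host_baseline() -> dict:
    return {
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "hostname": platform.node(),
        "os": platform.platform(),
        "physical_cores": psutil.cpu_count(logical=False),
        "logical_cores": psutil.cpu_count(logical=True),
        "ram_gb": round(psutil.virtual_memory().total / 1024**3, 1),
        "disk_used_pct": {
            p.mountpoint: psutil.disk_usage(p.mountpoint).percent
            for p in psutil.disk_partitions(all=False)
        },
    }

if __name__ == "__main__":
    # Store one JSON snapshot per host so you can track drift and compare against vendor benchmarks.
    print(json.dumps(host_baseline(), indent=2))
```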
To document throughput, I recommend these types of analyses:
- Average daily and weekly Reviewer productivity. What is the activity of your reviewers on a daily and weekly basis? How many concurrent reviewers will your environment support and do you see a degradation of performance when multiple reviewers are working simultaneously?
- Processing speeds. How long does it take to process 100 GB of data? How many people are involved in the process? What datasets are you leveraging to capture this benchmark, and do they align with the make-up of the datasets your team actually receives? We often see this benchmark set against a completely different dataset type than the data handled in real productions, rendering the benchmark irrelevant.
- SQL database performance. How fast are your SQL databases today? How long does an average query take to produce a response? What is the average time to load workspaces? Is your storage performing fast enough to allow the application to be highly responsive?
- CPU availability and utilization. How taxed are the CPUs in your servers and review platform computer systems? Are you currently over-committing your CPU resources (allocating more CPU to machines on the server than the server itself has available)? Does the hypervisor ever prevent tasks from being scheduled simply because system resources are unavailable?
- RAM availability and utilization. How much RAM does your environment have today? Do you have enough RAM to support your environment? We often find that is not the case. Is your memory properly allocated to memory-hungry tasks and processes? Are you currently over-committing your RAM resources (allocating more RAM to machines on the server than the server itself has available)?
- Size of largest matter. What is the size of the largest matter you’ve been able to handle, as measured by the document table size? Is it 1TB, 500GB, something else? How did it go handling this large matter and what lessons did you learn?
- Average matter count. How many matters can you reasonably handle today? Are the constraints people, process, or technology? I recommend that you review average matter count by week, month and year. This will help you understand your current capacities.
- Storage Utilization. What is your data footprint growth rate? When must you expand storage, or remove data from production to maintain healthy capacity? Are the most active matters sitting in the appropriate tier of storage?
- Storage Capacity Management. How long are inactive cases residing on your production storage? What are your data governance SLAs? Are your archive and restoration times for cases in “cold or nearline” storage in compliance with your contractual obligations?
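To show how these throughput questions turn into baselines, here is a minimal sketch of the underlying arithmetic. The figures are hypothetical; you would substitute numbers pulled from your own processing and review logs.

```python
# Hypothetical figures -- replace with numbers from your own processing and review logs.
def gb_per_hour_per_worker(total_gb: float, elapsed_hours: float, workers: int) -> float:
    """Processing throughput normalized per worker."""
    return total_gb / elapsed_hours / workers

# Example: 400 GB processed in 20 hours across 8 workers gives 2.5 GB/hour/worker.
actual = gb_per_hour_per_worker(400, 20, 8)
target = 5.0  # the benchmark you are aiming for
print(f"Processing: {actual:.1f} GB/hr/worker (target {target:.1f}, gap {target - actual:.1f})")
```

The same measure-normalize-compare pattern applies to reviewer documents per hour, SQL query response times and storage growth.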
This first step is essential to establish an accurate picture of your environment’s current state. This analysis will be helpful in targeting areas to address to improve performance. It will also help you take full advantage of the other points of counsel in this article.
Provision Virtual Machine Resources Based Upon Workload Requirements
If you want to double the performance of your current eDiscovery environment for little to no cost, I recommend that you pay close attention to your virtual machines (VMs). These are software-based servers that emulate the performance of an actual physical server. Why do I recommend this?
When a client tells me that their environment is experiencing sluggish performance and suboptimal speeds, I first examine the utilization rate and resource footprint of their VMs. The findings rarely surprise me. In my experience, 90% of eDiscovery virtual machines are configured using a standard IT practice called “over-committing the host.” This is the number one culprit for poor performance at the virtualization layer of the platform. In many VM configurations, over-commitment is the default setting out of the box. This is problematic for two very important reasons:
- VMs consume resources differently based upon their task, type, quantity and the evolving demands of practitioners.
- VMs, when using default IT configurations, often compete and cannibalize resources, degrading eDiscovery platform performance.
If you’re cannibalizing your resources, you essentially have two options today: scale out or scale up. In the scale up approach, you secure additional resources for your underlying hosts. This is about deploying fewer, yet more powerful, machines.
In the scale out approach, our recommended approach, you deploy more physical machines with fewer resources per machine. This allows each machine to handle a finite work-load. We find that this produces overall better performance because no one physical machine is being over-tasked. This allows for more effective load balancing across the eDiscovery environment. Clients who’ve done this have experienced up to a 100% uptick in compute efficiency. Moreover, the heightened performance is immediate and lasting. So how does this work?
VMs consume the resources of an actual physical server, particularly CPU and RAM, based on their allocation. But not all VM functions require the same allocation, so they do not all need to consume resources at the same level. For example, a VM involved in Processing will require a substantially higher allocation of the physical server’s resources than, say, a typical agent server (please note that some agents or processes require as many resources as Processing, or more). But if the other VMs on that same physical machine are configured to consume the same amount of resources as the Processing VM, the host can become over-committed and the whole environment just might under-perform.
The problem with the default settings in VMs is that they then consume resources as if they were involved in heavy-compute tasks—even if they are not. If you have 10 VMs on one physical server and all 10 VMs are configured as if they are doing heavy work, they will try to pull more compute power than the physical server has available. That will produce sluggish overall performance, and unpredictable behavior as random VMs win the battle for resources.
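The basic over-commitment check is simple arithmetic: add up the vCPUs and RAM allocated to every VM on a host and compare the totals against the host’s physical resources. The sketch below uses hypothetical host data and illustrative thresholds; in practice you would export these figures from your hypervisor’s management console or API, and acceptable ratios depend on your workloads.

```python
# Hypothetical host data and thresholds -- export real figures from your hypervisor.
HOSTS = {
    "esx-01": {
        "physical_cores": 24, "ram_gb": 384,
        "vms": [
            {"name": "processing-01", "vcpus": 16, "ram_gb": 128},
            {"name": "agent-01", "vcpus": 16, "ram_gb": 128},
            {"name": "agent-02", "vcpus": 16, "ram_gb": 128},
            {"name": "web-01", "vcpus": 8, "ram_gb": 64},
        ],
    },
}

CPU_RATIO_LIMIT = 2.0  # illustrative; heavy-compute eDiscovery VMs often warrant closer to 1:1
RAM_RATIO_LIMIT = 1.0  # over-committing RAM is rarely advisable for these workloads

for host, spec in HOSTS.items():
    vcpus = sum(vm["vcpus"] for vm in spec["vms"])
    vram = sum(vm["ram_gb"] for vm in spec["vms"])
    cpu_ratio = vcpus / spec["physical_cores"]
    ram_ratio = vram / spec["ram_gb"]
    status = "OVER-COMMITTED" if cpu_ratio > CPU_RATIO_LIMIT or ram_ratio > RAM_RATIO_LIMIT else "ok"
    print(f"{host}: vCPU ratio {cpu_ratio:.2f}, RAM ratio {ram_ratio:.2f} -> {status}")
```

In this example the host is carrying more than twice as many vCPUs as it has physical cores, which is exactly the kind of configuration that produces the random resource battles described above.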
Understand Agent Activity
If you want to substantially increase performance without buying a bunch of new technology, I recommend that you closely analyze your use of agents. These are features or functions that sit in the middleware of most eDiscovery applications. Agents deliver a lot of benefits and can handle thousands of different tasks within a work-flow.
In our experience, two agents in particular can really degrade performance if they are not properly configured: Search and Conversion (near-native rendering). They’re powerful, but they also can consume a lot of precious and limited resources. When they are on the same Virtual Machine and launch at the same time, the results can be disastrous.
My recommendation is that you DO NOT pair any of these agents on the same VM. In fact, I recommend that you balance all of the agents for your primary eDiscovery application across numerous VMs to ensure the most efficient operation. There are two reasons I say this:
- Similar to the way Virtual Machines compete for physical resources, these agents attempt to consume ALL available virtual resources and this undermines the performance of one or both agents. This means that, right in the middle of an important review, neither agent will function properly.
- The likelihood of machine or complete system failure increases exponentially when agents are not balanced appropriately across multiple machines. This seems to happen right in the middle of an active and heavy review.
Fortunately, there’s a simple solution to completely avoid these detrimental consequences. Move your Conversion and Search agents to their own dedicated VMs. This segregation means that agents no longer compete for the same resources and can be launched in tandem. Once you implement this configuration tweak, your organization will likely realize immediate financial and operational gains. Reviewers will increase the volume and speed of matter execution. Additionally, you curb the likelihood of VM performance degradation and costly outages.
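Once you have an inventory of which agents run on which VMs, the placement rule is easy to verify. The sketch below assumes a hypothetical agent-to-VM mapping (your platform’s agent management console is the authoritative source) and flags any VM that hosts both a Search agent and a Conversion agent.

```python
# Hypothetical agent inventory -- pull the real mapping from your platform's agent console.
AGENT_PLACEMENT = {
    "agent-vm-01": ["Search", "Conversion", "Branding"],
    "agent-vm-02": ["Conversion"],
    "agent-vm-03": ["Search"],
    "agent-vm-04": ["OCR", "Production"],
}

HEAVY_AGENTS = {"Search", "Conversion"}  # the two agents this article recommends separating

for vm, agents in AGENT_PLACEMENT.items():
    heavy = HEAVY_AGENTS.intersection(agents)
    if len(heavy) > 1:
        print(f"{vm}: hosts {sorted(heavy)} together -- move one to its own dedicated VM")
    else:
        print(f"{vm}: ok")
```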
Avoid SQL Resource Starvation
If you want to substantially increase performance without a lot of cost, I recommend that you analyze the memory usage of your SQL databases. Why do I recommend this? In nearly 99% of eDiscovery environment assessments I’ve conducted, Random Access Memory (RAM) is improperly sized to meet eDiscovery SQL database performance requirements. This phenomenon is colloquially called ‘SQL Memory Starvation,’ and its implications are huge, especially for reviewer speed and throughput capabilities.
In my experience, organizations typically have more CPU resources than required to properly provision their SQL servers. However, they typically do not have enough RAM. This leads to another problem as it relates to costs. Please allow me to explain this.
In eDiscovery, SQL databases fly through millions of rows of data to return the single data point you are seeking. For SQL databases to do this, they need a lot of IO throughput. Fast disk storage alone will not address this, and neither will adding CPUs. In fact, adding CPUs could actually slow down overall performance.
Here is how this works, from a technical perspective. Search queries are returned to reviewers based on the IO available to the system. When systems run out of RAM to analyze the data-sets, they turn to hard-disks to process the data. Hard disks are drastically slower than RAM. So if you want to really improve performance, add more RAM.
This actually solves a financial problem too. SQL Server licensing, often the largest cost component in these environments, is based on allocated CPUs. We often find that organizations are paying for SQL servers that are not being used to full capacity simply because there is not enough RAM on the systems to optimize their performance. So if you really want to improve your performance, acquire and provision more RAM (which is often quite a bit cheaper than CPUs) and right-size the number and type of SQL servers based on your actual needs. Once you’ve conducted this resource rebalancing exercise, your organization could quite possibly realize operational gains immediately. I’ve personally witnessed increases in reviewer speed of 50%, 70%, or even 100%.
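If you want a quick read on whether a SQL server is memory-starved, two commonly watched indicators are Page Life Expectancy and Memory Grants Pending, both exposed through the sys.dm_os_performance_counters view. The Python sketch below assumes the pyodbc package, a login with VIEW SERVER STATE permission and an illustrative connection string; the thresholds are rough rules of thumb, not vendor guidance.

```python
# Assumes: pip install pyodbc, plus a SQL login with VIEW SERVER STATE permission.
import pyodbc

QUERY = """
SELECT RTRIM([object_name]) AS obj, RTRIM(counter_name) AS counter, cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name IN ('Page life expectancy', 'Memory Grants Pending');
"""

def check_memory_pressure(conn_str: str) -> None:
    conn = pyodbc.connect(conn_str)
    try:
        rows = conn.cursor().execute(QUERY).fetchall()
    finally:
        conn.close()
    for obj, counter, value in rows:
        if counter == "Page life expectancy" and value < 300:  # 300 seconds is a rough rule of thumb
            print(f"{obj}: PLE = {value}s -- buffer pool is likely starved for RAM")
        elif counter == "Memory Grants Pending" and value > 0:
            print(f"{obj}: {value} queries waiting on memory grants -- a classic starvation symptom")
        else:
            print(f"{obj}: {counter} = {value} (no obvious memory pressure)")

# Illustrative usage:
# check_memory_pressure("DRIVER={ODBC Driver 18 for SQL Server};SERVER=sql01;"
#                       "Trusted_Connection=yes;TrustServerCertificate=yes")
```

Persistently low Page Life Expectancy or sustained Memory Grants Pending during a heavy review window is a strong hint that RAM, not CPU, is your constraint.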
Conduct Capacity Planning to Enhance Environment Predictability
If you want to be ready to perform well on your next big case, I recommend that you conduct a capacity planning exercise. This requires a data-driven analysis of both your sales pipeline and your environment’s current capabilities. There are three phases to capacity planning:
- Document your current-state capacities. In my first point in this article, I demonstrated how to do this.
- Document your historical matter performance. This is about understanding the ebbs and flows of matters being handled by your team. To do this, I recommend that you analyze, over the trailing 36 months, three primary factors:
- How many matters did we handle, per month and per year, for the trailing three years?
- What trends can we see about growth in matter count and matter size? In other words, were you handling 10 matters per month with an average of 25 GB per matter? Were you handling 100 matters per month with an average of 50 GB per matter? What’s important here is to recognize trends, because that can help you plan for the future so you’re not caught off-guard. And, most importantly, how did those matters grow, on average, in GB over the first 3, 6, 9 and 12 months of your team taking them on?
- How long did it take us, on average, to conduct a first pass to Production on each matter?
- Analyze the pipeline of what you can see today for new matters about to enter your environment. Add to this a reasonable projection of what you think might happen over the next 24-36 months.
All of this activity will help you understand what your actual capacities are today and, therefore, what you would need to provision quickly should you exceed those capacities, especially if a large new matter were suddenly introduced into your environment.
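As a small illustration of the historical-trend piece, the sketch below takes hypothetical quarterly figures (matter count and average matter size) and computes a compound growth rate plus a simple twelve-month projection of data under management. Real planning would draw these figures from your matter management or billing systems and then layer the sales pipeline on top.

```python
# Hypothetical trailing history: (period, matter_count, avg_gb_per_matter).
HISTORY = [
    ("2023-Q1", 40, 30.0),
    ("2023-Q2", 46, 34.0),
    ("2023-Q3", 52, 39.0),
    ("2023-Q4", 61, 44.0),
    ("2024-Q1", 70, 52.0),
]

totals_gb = [count * avg_gb for _, count, avg_gb in HISTORY]
periods = len(totals_gb) - 1
growth_per_quarter = (totals_gb[-1] / totals_gb[0]) ** (1 / periods) - 1  # compound quarterly growth

projected_gb = totals_gb[-1] * (1 + growth_per_quarter) ** 4  # four quarters out
print(f"Data under management: {totals_gb[0]:,.0f} GB -> {totals_gb[-1]:,.0f} GB "
      f"({growth_per_quarter:.1%} per quarter); ~{projected_gb:,.0f} GB projected in 12 months")
```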
One of the biggest mistakes I see organizations make is to take on a matter that they’re really not prepared to handle. This can lead to all sorts of negative outcomes such as:
- Overwhelming existing staff and technology resources.
- Missing deadlines and all of the associated fall-out.
- Negatively impacting existing matters that might be delayed so you can focus on the large matter.
- Reputational harm with the client who brought you the large matter—especially if they are a long-term client that you want to retain.
You can avoid all of these outcomes with a capacity planning exercise. The information gleaned from this analysis can provide an essential vantage point from which your organization can proactively:
- Construct departmental budgets based upon technology and business requirement forecasts.
- Procure and provision hardware and software based upon budget and sales parameters.
- Manage client expectations regarding SLAs, engagement scope and delivery timetables.
- Build an emergency action plan to rapidly beef up resources when the new matter comes in. This will ensure you are putting your money toward the equipment that will most likely give you the performance you need now and in the future.
I recognize that capacity planning for eDiscovery environments isn’t an exact science. However, in my opinion, any exercise that positions key stakeholders to make informed decisions based upon resource alignment is a worthwhile endeavor. I liken lesser alternatives to throwing darts at a dart board with a blindfold on. Please don’t make this mistake.
How To Make This Actionable
This thought piece is a direct response to two very important questions for stakeholders responsible for the eDiscovery function. First, how do we do more with less, given that budgets and resources are limited? Second, how do we handle a large new project that could overwhelm our existing capacities? My recommendations are that you:
- Get clarity about your environment’s current state.
- Provision virtual machine resources based upon workload requirements.
- Understand agent activity on your machines.
- Avoid SQL resource starvation.
- Conduct capacity planning to enhance environment predictability.
Organizations that have taken these steps have realized huge gains in performance and staff productivity for little-to-no additional costs. Most organizations that I’ve analyzed don’t need more equipment. They need a better approach to using the equipment they already own. If you have any questions about the points I’ve discussed here, please know my door is open.