Summary

The article presents a Java-based solution for processing and analyzing large log files that do not fit into memory, focusing on identifying the top 10 most frequently used services across multiple log files.

Abstract

The article addresses the challenge of processing large files in Java that exceed the available memory. It introduces a method for reading files partially and using additional structures to compile the necessary data. The scenario involves analyzing server log files to generate a report of the top 10 most frequently used applications, with the constraint that a service must appear in every log file to be considered. The initial version of the code loads too much data into memory, risking an OutOfMemoryError. The solution proposed involves processing files line-by-line and using a Map with service names as keys and a custom Counter object to track the number of calls and days of calls. The Counter class uses a BitSet to efficiently store daily call information. The processFiles method is refactored to handle each file sequentially, updating the Map and filtering services that meet the criteria. The article concludes that processing large files without loading the entire file into memory is feasible using Java's Files class and memory-efficient data structures.

Opinions

The author emphasizes the importance of processing large files in a memory-efficient manner to avoid OutOfMemoryError.
The use of Java's Files.lines method is highlighted for its lazy loading feature, which allows for line-by-line processing of files.
The Counter class is presented as a crucial component in the solution, leveraging Java's BitSet for memory-efficient tracking of daily service usage.
The article suggests that with careful design and the use of appropriate data structures, processing large files can be managed effectively in Java.
The author's approach to filtering and sorting the data stream indicates a preference for functional programming techniques, such as using streams and lambdas in Java.

How to Read Large Files in Java That Do Not Fit into Memory

Have you ever experienced the challenge of creating a program to process a file that exceeds your available memory?

If so, you’ll agree that life would be much easier if we only needed to deal with files that fit into memory. In such cases, methods from the Files class let us read the whole file content into memory and use streams to process it smoothly.

However, when dealing with large files that do not fit into memory, we need another approach: read the file partially, and use additional structures to compile only the required data.

In this article I will explore a solution for processing large log files in Java that do not fit into memory.

Scenario

Our task is to develop a program that analyzes log files from a server and generates a report listing the top 10 most frequently used services.

Each day, a new log file is generated, containing information such as timestamps, host details, duration times, service calls, and other data that may not be relevant to our specific scenario.

2024-02-25T00:00:00.000+GMT host7 492 products 0.0.3 PUT 73.182.150.152 eff0fac5-b997-40a3-87d8-02ff2f397b44
2024-02-25T00:00:00.016+GMT host6 123 logout 2.0.3 GET 34.235.76.94 8b97acae-dd36-4e83-b423-12905a4ab38d
2024-02-25T00:00:00.033+GMT host6 50 payments/:id 0.4.6 PUT 148.241.146.59 ac3c9064-4782-46d9-a0b6-69e4d55a5b38
2024-02-25T00:00:00.050+GMT host2 547 orders 1.5.0 PUT 6.232.116.248 2285a81e-c511-41b9-b0ea-a475a0a45805
2024-02-25T00:00:00.067+GMT host4 400 suggestions 0.8.6 DELETE 149.138.227.154 8031b639-700e-4a7c-b257-fcbed0d029ce
2024-02-25T00:00:00.084+GMT host2 644 login 6.90 GET 208.158.145.204 3906a28c-56e4-4e5f-b548-591eab737aa7
2024-02-25T00:00:00.101+GMT host5 339 suggestions 0.8.9 PUT 173.109.21.97 c7dfec8a-5ca8-4d0d-b903-aaf65629fdd0
2024-02-25T00:00:00.118+GMT host9 87 products 2.6.3 POST 220.252.90.140 e5ceef67-2f0f-4c2d-a6d2-c698598aaef2
2024-02-25T00:00:00.134+GMT host0 845 products 9.4.6 GET 136.79.178.188 f28578c1-c37c-47a3-a473-4e65371e0245
2024-02-25T00:00:00.151+GMT host4 675 login 0.89 DELETE 32.159.65.239 d27ff353-e501-43e6-bdce-680d79a07c36

Our code will receive a list of log files, and our objective is to compile a report listing the top 10 most frequently used services. However, to be included in the report, a service must have at least one entry in each of the log files provided. In simpler terms, a service must be utilized every day to qualify for inclusion in the report.
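
As a minimal sketch of how one line in this format could be parsed, assuming the fields are separated by single spaces and appear in the order timestamp, host, duration, service name, followed by fields we do not need. The LogLine name and its serviceName() accessor match the code shown later in the article; the other field names and the LogLineParser class are hypothetical.

```java
// Hypothetical sketch: the field order is inferred from the sample lines above.
record LogLine(String timestamp, String host, int duration, String serviceName) {}

final class LogLineParser {
  private LogLineParser() {}

  static LogLine toLogLine(final String line) {
    // Limit the split to 5 parts so the unused trailing fields stay in one chunk.
    final String[] parts = line.split(" ", 5);
    if (parts.length < 4) {
      throw new IllegalArgumentException("Unexpected log line: " + line);
    }
    return new LogLine(parts[0], parts[1], Integer.parseInt(parts[2]), parts[3]);
  }
}
```

Keeping only the fields the report needs means each parsed line stays small and can be discarded as soon as it has been counted.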

Initial version

My initial approach to solving this problem was to follow the business requirements step by step, which produced the following code:

public void processFiles(final List<File> fileList) {
  final Map<LocalDate, List<LogLine>> fileContent = getFileContent(fileList);
  final List<String> serviceList = getServiceList(fileContent);
  final List<Statistics> statisticsList = getStatistics(fileContent, serviceList);
  final List<Statistics> topCalls = getTop10(statisticsList);

  print(topCalls);
}

Let’s analyze the code. The processFiles method receives the list of files as a parameter and:

Creates a map with an entry for each file, where the key is a LocalDate and the value is a list of the file lines.
Creates a list of strings with the unique service names from all files.
Generates a list of statistics for all services, organizing the data from the files into a structured map.
Filters the statistics to obtain the top 10 service calls.
Prints the result.

Even before analyzing each of these methods, we can see that this approach loads the content of every file into memory at once. With files larger than the available memory, that leads to an OutOfMemoryError.
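
To make the memory concern concrete, here is a small sketch contrasting an eager read with a lazy one. The article does not show getFileContent, so this is only an assumption about the kind of call that causes the problem; the ReadingStyles class and its method names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Stream;

final class ReadingStyles {
  private ReadingStyles() {}

  // Eager: every line of the file is held in the returned list at the same time.
  static long countEagerly(final Path file) {
    try {
      final List<String> allLines = Files.readAllLines(file);
      return allLines.size();
    } catch (final IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Lazy: lines are read as the stream is consumed and are not retained afterwards.
  static long countLazily(final Path file) {
    try (Stream<String> lines = Files.lines(file)) {
      return lines.count();
    } catch (final IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Both methods return the same count, but only the first one needs enough heap to hold the whole file.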

Solution

The solution involves processing the files line-by-line and building a Map with the service name as the key and a Counter object as the value. The Counter holds attributes such as the number of calls and the days on which calls occurred.

The processFiles method looks as follows:

private void processFiles(final List<File> fileList) {
  final Map<String, Counter> compiledMap = new HashMap<>();

  for (int i = 0; i < fileList.size(); i++) {
    processFile(fileList, compiledMap, i);
  }

  final List<Counter> topCalls =
      compiledMap.values().stream()
          .filter(Counter::allDaysSet)
          .sorted(Comparator.comparing(Counter::getNumberOfCalls).reversed())
          .limit(10)
          .toList();

  print(topCalls);
}

Let’s analyze the code:

First, it declares a Map (compiledMap) with a String as key, representing the service name, and a Counter object (explained later), which will store the statistics.
Next, it processes the files one-by-one and updates the compiledMap accordingly.
Then it uses stream features to: keep only the counters that have data for all days; sort them by the number of calls in descending order; and finally, retrieve the top 10.
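
The filter, sort, and limit steps can be tried in isolation on a handful of in-memory values. This is only a sketch: the ServiceTotal record is a hypothetical stand-in for Counter, and the limit is 2 instead of 10 to keep the sample small.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for Counter, with the "all days" check precomputed.
record ServiceTotal(String serviceName, long numberOfCalls, boolean calledEveryDay) {}

final class TopCallsDemo {
  private TopCallsDemo() {}

  static List<String> top(final List<ServiceTotal> totals, final int limit) {
    return totals.stream()
        .filter(ServiceTotal::calledEveryDay)                                     // drop services missing a day
        .sorted(Comparator.comparingLong(ServiceTotal::numberOfCalls).reversed()) // most calls first
        .limit(limit)
        .map(ServiceTotal::serviceName)
        .toList();
  }

  public static void main(final String[] args) {
    final List<ServiceTotal> totals = List.of(
        new ServiceTotal("login", 120, true),
        new ServiceTotal("orders", 300, false),
        new ServiceTotal("products", 250, true),
        new ServiceTotal("logout", 90, true));
    System.out.println(top(totals, 2)); // prints [products, login]
  }
}
```

Note that "orders" has the most calls but is excluded because it was not called every day, which is exactly the rule from the scenario.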

Before taking a look at the processFile method, which is the core of the entire processing, let’s analyze the Counter class, which also plays a crucial role in this process:

import java.util.BitSet;
import lombok.Getter;

public class Counter {
  @Getter private final String serviceName;
  @Getter private long numberOfCalls;
  private final int numberOfDays;
  private final BitSet daysWithCalls;

  public Counter(final String serviceName, final int numberOfDays) {
    this.serviceName = serviceName;
    this.numberOfCalls = 0L;
    this.numberOfDays = numberOfDays;
    this.daysWithCalls = new BitSet(numberOfDays);
  }

  // Counts one more call to the service.
  public void add() {
    numberOfCalls++;
  }

  // Marks the given day (0-based file index) as having at least one call.
  public void setDay(final int dayNumber) {
    daysWithCalls.set(dayNumber);
  }

  // True when every day in [0, numberOfDays) has at least one call.
  public boolean allDaysSet() {
    return daysWithCalls.cardinality() == numberOfDays;
  }
}

It contains four attributes: serviceName, numberOfCalls, numberOfDays, and daysWithCalls. The @Getter annotations come from Lombok and generate the getServiceName and getNumberOfCalls methods.
The numberOfCalls attribute is incremented by the add method, which is called for each processed line of the service.
The daysWithCalls attribute is a Java BitSet, a memory-efficient structure for storing boolean flags. It is sized for the number of days to be processed, and every bit, one per day, starts as false.
The setDay method sets the bit corresponding to the given day position in the BitSet to true.

The allDaysSet method checks whether all days in the BitSet are set to true. It compares the cardinality of the BitSet, that is, the number of bits set to true, with numberOfDays. Note that BitSet.stream() yields only the indices of the bits that are set, so reducing that stream with a logical AND would always return true; counting the set bits avoids this pitfall.
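
Two BitSet details are worth seeing in isolation when checking whether every day has a call: stream() yields only the indices of the bits that are set, and cardinality() returns how many bits are set. A minimal sketch, with a hypothetical class and method name, assuming day numbers always fall in the range 0 to numberOfDays - 1:

```java
import java.util.BitSet;

final class BitSetDemo {
  private BitSetDemo() {}

  // True only when every day in [0, numberOfDays) has its bit set.
  static boolean allDaysSet(final BitSet days, final int numberOfDays) {
    return days.cardinality() == numberOfDays;
  }

  public static void main(final String[] args) {
    final BitSet days = new BitSet(3);
    days.set(0);
    days.set(2);

    // stream() skips day 1 because its bit is not set.
    System.out.println(days.stream().boxed().toList()); // prints [0, 2]
    System.out.println(allDaysSet(days, 3));            // prints false

    days.set(1);
    System.out.println(allDaysSet(days, 3));            // prints true
  }
}
```

Because the stream never contains an unset bit, any check built on it alone cannot detect a missing day; comparing the count of set bits against the expected number of days can.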

The processFile method handles a single file:

private void processFile(final List<File> fileList,
                         final Map<String, Counter> compiledMap,
                         final int dayNumber) {
  // Files.lines is lazy; try-with-resources closes the underlying file when done.
  try (Stream<String> lineStream = Files.lines(fileList.get(dayNumber).toPath())) {
    lineStream
        .map(this::toLogLine)
        .forEach(
            logLine -> {
              // Retrieve the Counter for this service, creating it on first sight.
              final Counter counter =
                  compiledMap.computeIfAbsent(
                      logLine.serviceName(), name -> new Counter(name, fileList.size()));
              counter.add();
              counter.setDay(dayNumber);
            });
  } catch (final IOException e) {
    throw new UncheckedIOException(e);
  }
}

The method uses the lines method of the Files class to read the file line-by-line as a stream. The key feature here is that the lines method is lazy: it doesn’t read the entire file at once, but reads it as the stream is consumed. The try-with-resources block closes the file when processing ends.
The toLogLine method converts each line of the file into an object with attributes for accessing the log line information.
Handling each line is simpler than expected: the code retrieves (or creates) the Counter associated with the serviceName in the compiledMap and then calls its add and setDay methods.
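
To try the approach end to end without real logs, the sketch below writes two tiny temporary files and compiles per-service totals line-by-line. It is a simplified, hypothetical variant rather than the article's code: it assumes the service name is the fourth space-separated field, keeps the totals in a small nested class instead of Counter, and uses class and method names that do not appear in the article.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Stream;

final class MiniReport {
  private MiniReport() {}

  // Per-service totals: call count plus one bit per day (file) with at least one call.
  static final class Totals {
    long calls;
    final BitSet days = new BitSet();
  }

  // Returns the names of the services seen in every file, most called first.
  static List<String> servicesCalledEveryDay(final List<Path> files) {
    final Map<String, Totals> compiled = new HashMap<>();
    for (int day = 0; day < files.size(); day++) {
      final int dayNumber = day; // effectively final copy for the lambda
      try (Stream<String> lines = Files.lines(files.get(day))) {
        lines.forEach(line -> {
          final String service = line.split(" ", 5)[3];
          final Totals totals = compiled.computeIfAbsent(service, key -> new Totals());
          totals.calls++;
          totals.days.set(dayNumber);
        });
      } catch (final IOException e) {
        throw new UncheckedIOException(e);
      }
    }
    return compiled.entrySet().stream()
        .filter(entry -> entry.getValue().days.cardinality() == files.size())
        .sorted((a, b) -> Long.compare(b.getValue().calls, a.getValue().calls))
        .map(Map.Entry::getKey)
        .toList();
  }

  public static void main(final String[] args) throws IOException {
    final Path day0 = Files.createTempFile("day0", ".log");
    final Path day1 = Files.createTempFile("day1", ".log");
    Files.write(day0, List.of("t host1 10 login 1.0", "t host1 12 orders 1.0", "t host2 9 login 1.0"));
    Files.write(day1, List.of("t host1 11 login 1.0", "t host3 15 products 1.0"));
    System.out.println(servicesCalledEveryDay(List.of(day0, day1))); // prints [login]
  }
}
```

Only "login" appears in both files, so it is the only service reported; the map never holds more than one small entry per distinct service, regardless of how many lines the files contain.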

As we can see, processing a large file in Java without loading the entire file into memory is not rocket science. The Files class provides methods to process files line-by-line, and a hash map holding only the compiled data, one entry per service, stores the results during processing, so memory use depends on the number of distinct services rather than on the size of the files.