How to Read Large Files in Java That Do Not Fit into Memory
Have you ever experienced the challenge of creating a program to process a file that exceeds your available memory?

If so, you’ll agree that life would be much easier if we only had to deal with files that fit into memory. In such cases, methods from the Files class let us read the whole file content into memory and process it smoothly with streams.
However, when dealing with large files that do not fit into memory, we need another approach: read the file piece by piece and keep only the additional structures needed to compile the required data.
In this article, I will explore a solution for processing large log files in Java that do not fit into memory.
Scenario
Our task is to develop a program that analyzes log files from a server and generates a report listing the top 10 most frequently used services.
Each day, a new log file is generated, containing information such as timestamps, host details, duration times, service calls, and other data that may not be relevant to our specific scenario.
2024-02-25T00:00:00.000+GMT host7 492 products 0.0.3 PUT 73.182.150.152 eff0fac5-b997-40a3-87d8-02ff2f397b44
2024-02-25T00:00:00.016+GMT host6 123 logout 2.0.3 GET 34.235.76.94 8b97acae-dd36-4e83-b423-12905a4ab38d
2024-02-25T00:00:00.033+GMT host6 50 payments/:id 0.4.6 PUT 148.241.146.59 ac3c9064-4782-46d9-a0b6-69e4d55a5b38
2024-02-25T00:00:00.050+GMT host2 547 orders 1.5.0 PUT 6.232.116.248 2285a81e-c511-41b9-b0ea-a475a0a45805
2024-02-25T00:00:00.067+GMT host4 400 suggestions 0.8.6 DELETE 149.138.227.154 8031b639-700e-4a7c-b257-fcbed0d029ce
2024-02-25T00:00:00.084+GMT host2 644 login 6.90 GET 208.158.145.204 3906a28c-56e4-4e5f-b548-591eab737aa7
2024-02-25T00:00:00.101+GMT host5 339 suggestions 0.8.9 PUT 173.109.21.97 c7dfec8a-5ca8-4d0d-b903-aaf65629fdd0
2024-02-25T00:00:00.118+GMT host9 87 products 2.6.3 POST 220.252.90.140 e5ceef67-2f0f-4c2d-a6d2-c698598aaef2
2024-02-25T00:00:00.134+GMT host0 845 products 9.4.6 GET 136.79.178.188 f28578c1-c37c-47a3-a473-4e65371e0245
2024-02-25T00:00:00.151+GMT host4 675 login 0.89 DELETE 32.159.65.239 d27ff353-e501-43e6-bdce-680d79a07c36
Our code will receive a list of log files, and our objective is to compile a report listing the top 10 most frequently used services. However, to be included in the report, a service must have at least one entry in each of the log files provided. In simpler terms, a service must be utilized every day to qualify for inclusion in the report.
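The code later in this article relies on a LogLine object with a serviceName() accessor, but its definition is not shown. The following is only a minimal sketch of what such a parser might look like. The class name LogLineParser, the record fields, and the field order (timestamp, host, duration, service name) are assumptions inferred from the sample lines above, not the original implementation.

```java
import java.time.LocalDate;

// Hypothetical parser for the sample log format; the field order is inferred
// from the sample lines and is an assumption, not a documented format.
class LogLineParser {

    // Only the fields relevant to the report are kept; names are illustrative.
    record LogLine(LocalDate date, String host, long duration, String serviceName) {}

    static LogLine toLogLine(String line) {
        // Fields are separated by whitespace: timestamp, host, duration, service, ...
        String[] fields = line.trim().split("\\s+");
        if (fields.length < 4) {
            throw new IllegalArgumentException("Unexpected log line: " + line);
        }
        // The timestamp starts with an ISO date (yyyy-MM-dd); the time part is ignored here.
        LocalDate date = LocalDate.parse(fields[0].substring(0, 10));
        return new LogLine(date, fields[1], Long.parseLong(fields[2]), fields[3]);
    }

    public static void main(String[] args) {
        LogLine logLine = toLogLine(
            "2024-02-25T00:00:00.000+GMT host7 492 products 0.0.3 PUT 73.182.150.152 "
                + "eff0fac5-b997-40a3-87d8-02ff2f397b44");
        System.out.println(logLine.serviceName() + " on " + logLine.date());
    }
}
```

Keeping only the fields the report needs means each parsed line stays small and can be discarded as soon as it has been counted.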
Initial version
My initial approach to solving this problem is to consider the business requirements and create the following code:
public void processFiles(final List<File> fileList) {
    final Map<LocalDate, List<LogLine>> fileContent = getFileContent(fileList);
    final List<String> serviceList = getServiceList(fileContent);
    final List<Statistics> statisticsList = getStatistics(fileContent, serviceList);
    final List<Statistics> topCalls = getTop10(statisticsList);
    print(topCalls);
}
Let’s analyze the code. The processFiles method receives the list of files as a parameter, and:
- Creates a map with an entry for each file, where the key is LocalDate and the value is a list of the file lines.
- Creates a list of strings with the unique service names from all files.
- Generates a list of statistics for all services, organizing the data from the files into a structured map.
- Filters the statistics to obtain the top 10 service calls.
- Prints the result.
Before we analyze each of these methods, we can already see that this approach loads too much data into memory: the content of every file is held at once, which leads to an OutOfMemoryError as soon as the files are larger than the available heap.
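To make the problem concrete, here is a minimal sketch of what eager loading looks like. It is not the getFileContent method from the initial version (which is not shown); the class and method names are hypothetical, and the tiny temporary files exist only so the sketch can run on its own.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the eager approach; not the article's getFileContent.
class EagerLoadingSketch {

    // Reads every line of every file into one list, so the heap must hold
    // the content of all files at the same time.
    static List<String> loadEverything(List<Path> files) {
        List<String> allLines = new ArrayList<>();
        for (Path file : files) {
            try {
                // readAllLines materializes the whole file before returning.
                allLines.addAll(Files.readAllLines(file));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return allLines;
    }

    // Creates two small temporary files and returns how many lines were retained.
    static int demo() {
        try {
            Path first = Files.createTempFile("day1", ".log");
            Path second = Files.createTempFile("day2", ".log");
            try {
                Files.write(first, List.of("line a", "line b"));
                Files.write(second, List.of("line c"));
                return loadEverything(List.of(first, second)).size();
            } finally {
                Files.deleteIfExists(first);
                Files.deleteIfExists(second);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Lines held in memory: " + demo());
    }
}
```

With small files this is harmless, but the retained list grows with the total size of all files, which is exactly what a multi-gigabyte log set cannot afford.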
Solution
The solution involves processing the files line-by-line and creating a Map with the service name as the key and a Counter object with attributes such as the number of calls and days of calls.
The processFiles method will be as follows:
private void processFiles(final List<File> fileList) {
    final Map<String, Counter> compiledMap = new HashMap<>();
    // Each file represents one day; its index is used as the day number.
    for (int i = 0; i < fileList.size(); i++) {
        processFile(fileList, compiledMap, i);
    }
    final List<Counter> topCalls =
        compiledMap.values().stream()
            .filter(Counter::allDaysSet)
            .sorted(Comparator.comparing(Counter::getNumberOfCalls).reversed())
            .limit(10)
            .toList();
    print(topCalls);
}
Let’s analyze the code:
- First, it declares a Map (compiledMap) with a String as key, representing the service name, and a Counter object (explained later), which will store the statistics.
- Next, it processes the files one-by-one and updates the compiledMap accordingly.
- Then it uses stream features to: keep only the counters that have data for all days; sort them by the number of calls in descending order; and finally, retrieve the top 10.
Before taking a look at the processFile method, which is the core of the entire processing, let’s analyze the Counter class, which also plays a crucial role in this process:
import java.util.BitSet;
import lombok.Getter;

public class Counter {
    @Getter private final String serviceName;
    @Getter private long numberOfCalls;
    // One bit per processed day (file); a set bit means the service was called that day.
    private final BitSet daysWithCalls;
    // Kept separately because a BitSet does not remember the size it was created with.
    private final int numberOfDays;

    public Counter(final String serviceName, final int numberOfDays) {
        this.serviceName = serviceName;
        this.numberOfCalls = 0L;
        this.numberOfDays = numberOfDays;
        this.daysWithCalls = new BitSet(numberOfDays);
    }

    public void add() {
        numberOfCalls++;
    }

    public void setDay(final int dayNumber) {
        daysWithCalls.set(dayNumber);
    }

    public boolean allDaysSet() {
        // BitSet.stream() yields only the indices of set bits, so compare the
        // number of set bits with the expected number of days instead.
        return daysWithCalls.cardinality() == numberOfDays;
    }
}
- It contains four attributes: serviceName, numberOfCalls, daysWithCalls, and numberOfDays.
- The numberOfCalls attribute is incremented by the add method, which is called for each processed line of the serviceName.
- The daysWithCalls attribute is a Java BitSet, a memory-efficient structure for storing boolean attributes. It is created with room for the number of days to be processed, with each bit representing a day and initialized to false.
- The setDay method sets the bit corresponding to the given day position in the BitSet to true.
The allDaysSet method checks whether the service has at least one call on every day. BitSet.stream() returns only the indices of the bits that are already set, so reducing that stream with a logical AND would always yield true. Comparing cardinality(), the number of set bits, with numberOfDays gives the intended result.
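One detail is worth verifying when using a BitSet for this kind of check: stream() yields the indices of the set bits, not one boolean per position. The small standalone sketch below shows that behavior and one way to test that every day is covered, by comparing cardinality() with the expected number of days. The class name BitSetDaysSketch and the helper method are illustrative assumptions.

```java
import java.util.BitSet;

// Standalone sketch of the BitSet behavior relevant to tracking days with calls.
class BitSetDaysSketch {

    // True when every day in [0, numberOfDays) has its bit set,
    // assuming no bit at or beyond numberOfDays is ever set.
    static boolean allDaysSet(BitSet days, int numberOfDays) {
        return days.cardinality() == numberOfDays;
    }

    public static void main(String[] args) {
        int numberOfDays = 3;
        BitSet days = new BitSet(numberOfDays); // all bits start as false
        days.set(0);
        days.set(2);

        // stream() yields only the indices of the bits that are set.
        System.out.println(days.stream().boxed().toList()); // [0, 2]
        System.out.println(allDaysSet(days, numberOfDays)); // false: day 1 is missing

        days.set(1);
        System.out.println(allDaysSet(days, numberOfDays)); // true
    }
}
```

An equivalent check is days.nextClearBit(0) >= numberOfDays, which also holds up if a bit beyond the expected range were set by mistake.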
The processFile method is as follows:
private void processFile(final List<File> fileList,
                         final Map<String, Counter> compiledMap,
                         final int dayNumber) {
    // Files.lines is lazy; try-with-resources closes the file when processing ends.
    try (Stream<String> lineStream = Files.lines(fileList.get(dayNumber).toPath())) {
        lineStream
            .map(this::toLogLine)
            .forEach(logLine -> {
                // Retrieve the Counter for this service, creating it the first time it is seen.
                final Counter counter = compiledMap.computeIfAbsent(
                    logLine.serviceName(),
                    name -> new Counter(name, fileList.size()));
                counter.add();
                counter.setDay(dayNumber);
            });
    } catch (final IOException e) {
        throw new UncheckedIOException(e);
    }
}
- The method uses the lines method of the Files class to read the file line by line as a stream. The key feature here is that lines is lazy: it does not read the entire file at once, but reads it as the stream is consumed.
- The toLogLine method converts each string file line into an object with attributes for accessing the log line information.
- The primary process of handling the file line is simpler than expected. It retrieves (or creates) the Counter from the compiledMap associated with the serviceName and then calls the add and setDay methods of the Counter.
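The points above can be tried out with a small, self-contained sketch that streams a file with Files.lines and keeps only a per-service count in memory. It is a simplified stand-in for the article's processFile, not a replacement for it: the class name LazyCountSketch is hypothetical, the service name is assumed to be the fourth whitespace-separated field, and the temporary file reuses three lines from the sample log.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Stream;

// Minimal sketch: count calls per service while streaming a file line by line.
class LazyCountSketch {

    // Assumes the service name is the fourth whitespace-separated field.
    static Map<String, Long> countServices(Path file) {
        Map<String, Long> counts = new HashMap<>();
        // Lines are read as the stream is consumed; only the map is retained.
        try (Stream<String> lines = Files.lines(file)) {
            lines.map(line -> line.trim().split("\\s+"))
                 .filter(fields -> fields.length >= 4)
                 .forEach(fields -> counts.merge(fields[3], 1L, Long::sum));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counts;
    }

    // Writes a tiny temporary file so the sketch can run on its own.
    static Map<String, Long> demo() {
        try {
            Path file = Files.createTempFile("sample", ".log");
            try {
                Files.write(file, List.of(
                    "2024-02-25T00:00:00.000+GMT host7 492 products 0.0.3 PUT 73.182.150.152 eff0fac5-b997-40a3-87d8-02ff2f397b44",
                    "2024-02-25T00:00:00.016+GMT host6 123 logout 2.0.3 GET 34.235.76.94 8b97acae-dd36-4e83-b423-12905a4ab38d",
                    "2024-02-25T00:00:00.118+GMT host9 87 products 2.6.3 POST 220.252.90.140 e5ceef67-2f0f-4c2d-a6d2-c698598aaef2"));
                return countServices(file);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Two calls to "products" and one to "logout" are counted.
        System.out.println(demo());
    }
}
```

Memory use here depends on the number of distinct services, not on the number of lines, which is why the same pattern scales to files far larger than the heap.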
As we can see, processing a large file in Java without loading the entire file into memory is not rocket science. The Files class provides methods to process files line by line, and a hash map that holds only the compiled data (one Counter per service) keeps memory usage low during file processing.






