Avoiding Incomplete/Corrupted Files During Processing Operations

Article summary

Applications that work with files on disk can encounter incomplete or corrupted files if a target file is actively being written to disk by another process. Typically, this happens when two different systems or processes are interacting with the same file independently.

For example, if a delivery system (e.g. SFTP) is writing a file, and an independent application (e.g. a file importer) expects to read and process that file, issues arise if the processing application reads the file before all data has been transferred and written to disk. Even if each individual write() call is atomic, the writing process typically issues many write() calls to transfer a large file, and a read() from another process can land between any two of them.
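The interleaving described above can be reproduced locally. The following is a minimal sketch, not part of the original setup: the function name demo_partial_read, the temporary file, and the sleep intervals are all assumptions chosen for illustration. A background writer issues two writes one second apart, and a reader counts lines in between.

```shell
#!/bin/bash
# Illustrative only: a writer that issues two writes with a pause between
# them, and a reader that looks at the file in the middle.
demo_partial_read() {
  local target early late writer_pid
  target="$(mktemp)"

  # Writer: two separate writes to the same file, one second apart.
  { printf 'first half\n'; sleep 1; printf 'second half\n'; } > "${target}" &
  writer_pid=$!

  sleep 0.3                                  # reader arrives mid-transfer
  early=$(( $(wc -l < "${target}") ))        # line count while the writer is paused

  wait "${writer_pid}"                       # writer has finished and closed the file
  late=$(( $(wc -l < "${target}") ))         # line count of the complete file

  echo "during write: ${early} line(s), after write: ${late} line(s)"
  rm -f "${target}"
}
```

Calling demo_partial_read should normally report fewer lines during the write than after it, which is exactly the partial read that an importer needs to avoid.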

To prevent incomplete or corrupted file issues, the application that reads the file from disk must make sure that no other process is still writing to the target file. There are a few different strategies to coordinate access, including lock files, sentinel files, monitoring filesystem events, and monitoring open file handles.
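As a rough illustration of the sentinel-file strategy, the sketch below assumes a hypothetical convention in which the sender uploads a data file and then an empty marker with the same name plus a ".done" suffix. Neither the suffix nor the function name list_ready_files comes from the setup described in this article, and the approach only works when the sender can be made to cooperate.

```shell
#!/bin/bash
# Hypothetical convention: the sender uploads "name.txt" and, once the
# transfer has finished, an empty "name.txt.done" marker next to it.
# Prints every data file in the directory that has a matching marker.
list_ready_files() {
  local dir="$1" marker data
  for marker in "${dir}"/*.done
  do
    # If the glob matched nothing, the literal pattern is left behind.
    if [ ! -e "${marker}" ]
    then continue
    fi
    data="${marker%.done}"          # strip the suffix to get the data file
    if [ -f "${data}" ]
    then printf '%s\n' "${data}"
    fi
  done
  return 0
}
```

A reader would then process only the paths printed by list_ready_files /opt/sftp and remove each marker afterwards.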

While each approach has strengths and weaknesses, circumstances may force the use of one particular method. In a recent case, I found that only subscribing to filesystem events or monitoring open file handles would be sufficient. Below are scripts from the two approaches that I used to coordinate access to a file that was being delivered via SFTP:

Scenario

An SFTP server is accepting large text files for processing. When files are received, a separate process needs to read the files and manipulate the data. The files may arrive periodically throughout the day, and they must be processed immediately after delivery.

Monitoring filesystem events with inotify

inotifywait is used to track files and exit when specific filesystem events are received (in this case, close_write). This indicates that the file is no longer being transferred (written), and it may be safely read by the import process.

Generally, if the event is not received, inotifywait will block indefinitely. In case inotifywait starts tracking a file after the close_write event (perhaps it was transferred very fast), a timeout is provided, after which time inotifywait exits. When inotifywait exits, it is assumed the file transferred successfully:

inotifywait -e close_write -t 120 /path/to/file.txt
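Because a timeout and a received event are treated the same way here, it can help to see how they could be told apart. The inotifywait man page documents exit status 0 for a watched event, 2 for a timeout reached with -t, and 1 for errors. The sketch below maps those statuses to labels; the function names describe_wait_status and wait_for_close are hypothetical helpers, not part of inotify-tools.

```shell
#!/bin/bash
# Map inotifywait's documented exit statuses to a short label.
describe_wait_status() {
  case "$1" in
    0) echo "closed" ;;    # a watched event (close_write) was received
    2) echo "timeout" ;;   # -t expired before any watched event occurred
    *) echo "error" ;;     # e.g. the file is missing or the watch failed
  esac
}

# Hypothetical wrapper: wait on one file and report which outcome occurred.
# The second argument is the timeout in seconds (120 if omitted).
wait_for_close() {
  local file="$1" timeout="${2:-120}" status=0
  inotifywait -q -e close_write -t "${timeout}" "${file}" > /dev/null 2>&1 || status=$?
  describe_wait_status "${status}"
}
```

For example, outcome="$(wait_for_close /path/to/file.txt)" would let a script log a warning when the result is "timeout" instead of silently assuming that the transfer completed.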

Context in script:


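#!/bin/bash

# Poll the SFTP drop directory once per second.
while true
do sleep 1s
  # Hidden files (dotfiles) are skipped. Note that filenames containing
  # whitespace would be split by this loop.
  for file in $(find /opt/sftp -type f -not -name ".*")
  do
    # Block until close_write is received or the 120-second timeout expires.
    # Either way, the file is assumed to be fully transferred.
    inotifywait -e close_write -t 120 "${file}" > /dev/null 2>&1
    echo "Processing: ${file}..."
    file_importer.rb "${file}"
    mv "${file}" /tmp
  done
  # Purge processed files older than seven days.
  find /tmp -type f -ctime +7 -exec rm {} +
done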

Monitoring file handles with lsof

lsof lists the files that processes currently have open, including regular files, directories, block special files, character special files, libraries, streams, and network files (e.g. sockets). When given a file path, lsof exits with status zero if at least one process has an open file handle to that file, and with a non-zero status if no process does (or if an error occurs).

If lsof does not find any processes that have file handles on the target file, then it indicates that the file is no longer being transferred (written), and it may be safely read by the import process:

lsof /path/to/file.txt
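One caveat is that lsof's non-zero status covers errors as well as "no open handles", so a missing file or a missing lsof binary could be mistaken for a finished transfer. The following sketch guards against both cases. The function name file_is_idle, its return codes, and the LSOF_CMD override (present only so that the check can be exercised without lsof) are assumptions for illustration.

```shell
#!/bin/bash
# Returns 0 when the file exists and no process has it open.
# Returns 1 when some process has it open, 2 when the file is missing,
# and 3 when the lsof command itself cannot be found.
# LSOF_CMD is a hypothetical override, defaulting to lsof.
file_is_idle() {
  local file="$1" cmd="${LSOF_CMD:-lsof}"
  # Rule out the error cases before trusting a non-zero status from lsof.
  if [ ! -f "${file}" ]
  then return 2
  fi
  if ! command -v "${cmd}" > /dev/null 2>&1
  then return 3
  fi
  if "${cmd}" "${file}" > /dev/null 2>&1
  then return 1     # lsof exited 0: at least one process has the file open
  fi
  return 0          # lsof exited non-zero: no open handles were found
}
```

A polling loop could then call file_is_idle "${file}" and process the file only when the status is 0. When run as an unprivileged user, lsof may not be able to see files opened by other users' processes, so the check should run with enough privileges to see the SFTP server's handles.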

Context in the script:


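#!/bin/bash

# Poll the SFTP drop directory once per second.
while true
do sleep 1s
  # Hidden files (dotfiles) are skipped. Note that filenames containing
  # whitespace would be split by this loop.
  for file in $(find /opt/sftp -type f -not -name ".*")
  do
    # lsof exits 0 while some process still has the file open.
    if lsof "${file}" > /dev/null 2>&1
    then echo "Waiting on ${file}..."
      continue
    fi
    echo "Processing: ${file}..."
    file_importer.rb "${file}"
    mv "${file}" /tmp
  done
  # Purge processed files older than seven days.
  find /tmp -type f -ctime +7 -exec rm {} +
done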

Conclusion

In the end, I chose to use lsof to monitor open file handles as it avoided the possibility of inotifywait hitting its timeout before a file was completely transferred via SFTP. While files should always have been transferred before the two-minute timeout, it seemed possible that an unexpectedly slow network connection could be an edge case. Furthermore, if inotifywait missed the close_write event because a file was transferred extremely quickly, processing the file would be delayed by at least two minutes (waiting for the timeout to be reached).

Hope these examples are helpful to you.