Working with files

Retrieving files

Use the file() function to obtain a reference to a file by name:

myFile = file('some/path/to/my_file.file')

The file() function can reference both files and directories.

Use the files() function to obtain a list of files. When using the wildcard characters *, ?, [] and {}, the file name is treated as a glob pattern, returning all files that match the given pattern, or an empty list if no matching files are found:

listOfFiles = files('some/path/*.fa')

Note

The file() function can also be called with a glob pattern, as long as the pattern is intended to match exactly one file.

A double asterisk (**) in a glob pattern works like * but also searches through subdirectories:

deeplyNestedFiles = files('some/path/**/*.fa')

By default, wildcard characters do not match directories or hidden files. Use the hidden option to include hidden files:

listWithHidden = files('some/path/*.fa', hidden: true)

Given a file reference, you can use the resolve() method or the / operator to obtain files relative to that path:

def dir = file('s3://bucket/some/data/path')

dir.resolve('sample.bam')         // correct
dir / 'sample.bam'                // correct
file("$dir/sample.bam")           // correct (but verbose)
"$dir/sample.bam"                 // incorrect
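The difference between these forms can be checked with a small sketch. It assumes a hypothetical local directory `/data/run1` on a Unix-like system instead of the S3 path above, and it uses `java.nio.file.Paths` directly so that it also runs outside a pipeline. The last form is presumably marked incorrect because string interpolation produces a plain string rather than a Path, so the result is no longer tied to a file system.

```groovy
// Hypothetical local directory, used only for illustration
def dir = java.nio.file.Paths.get('/data/run1')

// resolve() returns a Path built from the parent directory
def bam = dir.resolve('sample.bam')
assert bam instanceof java.nio.file.Path
assert bam.parent == dir
assert bam.fileName.toString() == 'sample.bam'

// String interpolation yields a string-like value, not a Path
def asString = "$dir/sample.bam"
assert !(asString instanceof java.nio.file.Path)
assert asString.toString() == '/data/run1/sample.bam'
```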

Getting file attributes

The file() function returns a Path, which has several methods for retrieving metadata about the file:

def path = file('/some/path/file.txt')

assert path.baseName == 'file'
assert path.extension == 'txt'
assert path.name == 'file.txt'
assert path.parent.toString() == '/some/path'

See the Path reference for the list of available methods.

Reading and writing

Reading and writing an entire file

Reading a file is as easy as using the file’s text property, which returns the file contents as a string:

print myFile.text

Similarly, you can write text to a file by assigning it to the file’s text property:

myFile.text = 'Hello world!'

This approach overwrites any existing file contents, and implicitly creates the file if it doesn’t exist.

Tip

The text property is shorthand for the getText() and setText() methods:

println myFile.getText()
myFile.setText('Hello world!')

Warning

The above methods read and write the entire file contents at once, requiring the entire file to be loaded into memory. Consider using a more memory-efficient approach for large files, such as reading/writing the file line by line.
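As a minimal sketch of the overwrite and implicit-create behavior described above, the following snippet works in a fresh temporary directory. The directory and file names are assumptions made for illustration; in a pipeline, the path would normally come from file().

```groovy
// Temporary directory and file names are hypothetical
def workDir = java.nio.file.Files.createTempDirectory('nf-demo')
def myFile = workDir.resolve('greeting.txt')

// The file does not exist yet; assigning to text creates it
assert !java.nio.file.Files.exists(myFile)
myFile.text = 'first version\n'
assert java.nio.file.Files.exists(myFile)
assert myFile.text == 'first version\n'

// A second assignment replaces the previous contents entirely
myFile.text = 'Hello world!'
assert myFile.text == 'Hello world!'
```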

Reading a file line by line

You can use the readLines() method to read a text file line by line:

file('some/my_file.txt')
    .readLines()
    .each { line ->
        println line
    }

The readLines() method loads the entire file into memory, so it is not ideal for large files.

You can use the eachLine() method to read line by line while only loading one line at a time into memory:

count = 0
myFile.eachLine { line ->
    println "line ${count++}: $line"
}

The withReader() method creates a Reader that you can use to read the file line by line, or even character by character. It is useful when you don’t need to read the entire file.

For example, to read only the first line of a file:

myFile.withReader { r ->
    def firstLine = r.readLine()
    println firstLine
}
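The two streaming approaches can be combined in a short sketch. It assumes a small FASTA-like file created in a temporary directory purely for illustration: eachLine() counts header lines one line at a time, and withReader() fetches only the first line.

```groovy
// Temporary directory, file name, and contents are hypothetical
def workDir = java.nio.file.Files.createTempDirectory('nf-demo')
def fasta = workDir.resolve('reads.fa')
fasta.text = '>seq1\nACGU\n>seq2\nUUGA\n'

// Count header lines without holding the whole file in memory
def headers = 0
fasta.eachLine { line ->
    if( line.startsWith('>') ) {
        headers++
    }
}
assert headers == 2

// withReader() returns the value of the closure, here the first line
def firstLine = fasta.withReader { r -> r.readLine() }
assert firstLine == '>seq1'
```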

Writing a file line by line

You can use the append() method or left shift (<<) operator to append text to a file without erasing the existing contents:

myFile.append('Add this line\n')
myFile << 'Add a line more\n'

For example, the following snippet copies the contents of a source file into a target file, replacing all U characters with X:

sourceFile.eachLine { line ->
    targetFile << (line.replaceAll('U', 'X') + '\n')
}
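The following sketch checks that appending preserves existing contents. The temporary directory and file names are assumptions for illustration. It also shows that eachLine() passes each line without its line terminator, so a newline has to be added back when copying line by line.

```groovy
// Temporary directory and file names are hypothetical
def workDir = java.nio.file.Files.createTempDirectory('nf-demo')
def logFile = workDir.resolve('log.txt')

logFile.text = 'header\n'       // overwrites (or creates) the file
logFile.append('first\n')       // appends, keeping 'header'
logFile << 'second\n'           // same effect as append()
assert logFile.readLines() == ['header', 'first', 'second']

// Copy line by line, restoring the newline that eachLine() strips
def copy = workDir.resolve('copy.txt')
logFile.eachLine { line ->
    copy << (line.toUpperCase() + '\n')
}
assert copy.text == 'HEADER\nFIRST\nSECOND\n'
```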

Filesystem operations

See the Path reference for the complete list of methods for performing filesystem operations.

Listing directories

You can use the listFiles() method to list the contents of a directory:

children = file('any/path').listFiles()
children.each { child ->
    println child
}

Changed in version 26.04.0: The listFiles() method is deprecated – use listDirectory() instead.

You can use the eachFile() method to iterate through the contents of a directory:

myDir.eachFile { item ->
    if( item.isFile() ) {
        println "${item.getName()} - size: ${item.size()}"
    }
    else if( item.isDirectory() ) {
        println "${item.getName()} - DIR"
    }
}
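A self-contained variant of this traversal is sketched below. It assumes a temporary directory containing one file and one subdirectory, both hypothetical, and it uses the static helpers in `java.nio.file.Files` for the type and size checks so that the snippet also runs outside Nextflow. The order in which eachFile() visits entries is not relied upon.

```groovy
// Temporary directory and entry names are hypothetical
def myDir = java.nio.file.Files.createTempDirectory('nf-demo')
def sample = myDir.resolve('a.txt')
sample.text = 'hello'
java.nio.file.Files.createDirectory(myDir.resolve('sub'))

// Sort each entry into files (with size in bytes) or directories
def fileSizes = [:]
def dirNames = []
myDir.eachFile { item ->
    def name = item.fileName.toString()
    if( java.nio.file.Files.isDirectory(item) ) {
        dirNames << name
    }
    else {
        fileSizes[name] = java.nio.file.Files.size(item)
    }
}

assert fileSizes['a.txt'] == 5
assert dirNames == ['sub']
```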

Copying files

In general, you should not need to copy files manually, because Nextflow automatically stages files in and out of the task environment based on the definition of process inputs and outputs. Ideally, any operation that transforms files should be encapsulated in a process, so that it benefits from Nextflow’s staging capabilities as much as possible.

Remote files

Nextflow works with many types of remote files and objects using the same interface as for local files. The following protocols are supported:

HTTP(S)/FTP (http://, https://, ftp://)
Amazon S3 (s3://)
Azure Blob Storage (az://)
Google Cloud Storage (gs://)

To reference a remote file, simply specify the URL when opening the file:

pdb = file('http://files.rcsb.org/header/5FID.pdb')

It can then be used in the same way as a local file:

println pdb.text

Note

Not all operations are supported for all protocols. For example, writing and directory listing are not supported for HTTP(S) and FTP paths.

Note

Additional configuration may be necessary for cloud object storage, such as authenticating with a private bucket. See the documentation for each cloud storage provider for further details.

Remote file staging

When a process input file resides on a different file system than the work directory, Nextflow copies the file into the work directory using an appropriate Java SDK.

Remote files are staged in a subdirectory of the work directory with the form stage-<session-id>/<hash>/<filename>, where <hash> is determined by the remote file path. If multiple tasks request the same remote file, the file will be downloaded once and reused by each task. These files can be reused by resumed runs with the same session ID.

Note

Remote file staging can be a bottleneck during large-scale runs, particularly when input files are stored in object storage but need to be staged in a shared filesystem work directory. This bottleneck occurs because Nextflow handles all of these file transfers.

To mitigate this, you can implement a custom process to download the required files, allowing you to stage multiple files efficiently through parallel jobs. Files should be given as a val input instead of a path input to bypass Nextflow’s built-in remote file staging.

Alternatively, use the Fusion file system with the work directory set to object storage. In this case, tasks can access remote files directly without any prior staging, eliminating the bottleneck.