Pre-Ingest File Storage API

Warning

The following tutorial takes a lower-level look at the Pre-Ingest File Storage API.

If you’re only uploading files, you can use the premade command-line tools detailed in Tutorial: Uploading files to Pre-ingest File Storage.

This document is a hands-on tutorial that demonstrates how to use the Pre-Ingest File Storage API. The API can be used to upload files to temporary storage. The files in temporary storage can then be used to create Submission Information Packages, which can be transferred to digital preservation. The basic workflow for uploading files to temporary storage is as follows:

  • Make a ZIP archive of the files: zip -r files.zip directory/

  • Send the ZIP archive to temporary storage: curl https://manage.fairdata.fi/filestorage/api/v1/archives/<project> -X POST -T files.zip

  • Make sure the checksums match: md5sum files.zip

These steps are explained in more detail in the following chapters.

Installation

Using the interface only requires you to be able to make HTTP requests, so no special software is needed. In this tutorial, the command-line tool curl, which is pre-installed on most Linux distributions, is used to make the HTTP requests. Optionally, jq can also be installed; it parses the JSON responses sent by the server and makes them more human-readable. Install curl and jq by running the following command in a terminal:

$ sudo yum install curl jq

If you choose to install jq, use it by appending | jq to the end of the curl commands.
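To see whether the tools are already available before installing anything, a small check like the following can be used. This is a minimal sketch; the function name require_tools is hypothetical and not part of the API or its tooling.

```shell
# Report which of the given command-line tools are missing from PATH.
# Returns 0 when every tool is found, 1 otherwise.
require_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example: curl is needed for the tutorial, jq is optional.
# require_tools curl jq || echo "install the missing tools first"
```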

Usage

The following chapters go over the whole upload process. Let’s start by testing the connection to the API. The root of the API is located at https://manage.fairdata.fi/filestorage/api/. Check your connection to the API by sending a GET request to the root:

# Token-based authentication
$ curl https://manage.fairdata.fi/filestorage/api/ -H "Authorization: Bearer <TOKEN>"
# Username and password authentication
$ curl https://manage.fairdata.fi/filestorage/api/ -u username:password

A successful request returns:

{
    "code": 404,
    "error": "404: Not Found"
}

since no functionality is defined for the root of the API. If the server returns 401: Unauthorized, the provided credentials (the token or username:password) were mistyped or the user does not exist.
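Because a 404 at the root counts as success here, it can help to look only at the HTTP status code. The sketch below uses curl's -w '%{http_code}' option to print the code; the function names and the wording of the verdicts are hypothetical, and only the token-based form is shown.

```shell
API_ROOT="https://manage.fairdata.fi/filestorage/api/"

# Translate the HTTP status code returned by the API root into a short verdict.
interpret_root_status() {
    case "$1" in
        404) echo "connection ok" ;;          # the root has no functionality, so 404 is expected
        401) echo "invalid credentials" ;;
        *)   echo "unexpected status: $1" ;;
    esac
}

# Send a GET request to the API root and print only the status code.
# -s silences progress output, -o /dev/null discards the body,
# -w '%{http_code}' prints the numeric status code.
root_status() {
    # $1 = token
    curl -s -o /dev/null -w '%{http_code}' "$API_ROOT" \
        -H "Authorization: Bearer $1"
}

# Usage (TOKEN is a placeholder for your own token):
# interpret_root_status "$(root_status "$TOKEN")"
```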

List projects

Your REST API user can have access to one or more projects, each with their own usage quotas. To list all projects accessible to the user, send a GET request to the following endpoint:

$ curl https://manage.fairdata.fi/filestorage/api/v1/users/projects -H "Authorization: Bearer <TOKEN>"

{
    "projects": {
        "test_project_a": {
            "used_quota": 1024,
            "quota": 1024000
        },
        "test_project_b": {
            "used_quota": 4096,
            "quota": 4096000
        }
    }
}

In this example, you could upload files to either project test_project_a or test_project_b.
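If jq is installed, the remaining quota of each project can be computed from this response. The following is a sketch that assumes the response has exactly the shape shown above; remaining_quota is a hypothetical helper name.

```shell
# Print "<project> <remaining quota>" for every project in a
# /v1/users/projects response read from standard input. Requires jq.
remaining_quota() {
    jq -r '.projects | to_entries[] | "\(.key) \(.value.quota - .value.used_quota)"'
}

# Usage (TOKEN is a placeholder for your own token):
# curl -s https://manage.fairdata.fi/filestorage/api/v1/users/projects \
#     -H "Authorization: Bearer $TOKEN" | remaining_quota
```

For the example response above this would print 1022976 for test_project_a and 4091904 for test_project_b.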

POST files

Next, let’s actually upload files to temporary storage. Let’s begin by creating test data with the following commands:

$ mkdir -p data/test1 data/test2
$ echo "This is test file 1" > data/test1/file_1.txt
$ echo "This is test file 2" > data/test1/file_2.txt
$ for i in {00..99}; do echo $i > data/test2/$i.txt; done

This creates directories data/test1/ and data/test2/, which contain 2 and 100 test files respectively. Let’s first look at how individual files can be uploaded by uploading the files in directory data/test1/, and then at how the whole directory data/test2/ can be uploaded.

Files can be uploaded to temporary storage by sending a POST request to /filestorage/api/v1/files/<project>/path/to/the/file, where <project> is the project identifier and /path/to/the/file is the path of the file on the server relative to your project directory. The files data/test1/file_?.txt can be uploaded with the following commands:

$ curl https://manage.fairdata.fi/filestorage/api/v1/files/<project>/data/test1/file_1.txt -X POST -T data/test1/file_1.txt -H "Authorization: Bearer <TOKEN>"
$ curl https://manage.fairdata.fi/filestorage/api/v1/files/<project>/data/test1/file_2.txt -X POST -T data/test1/file_2.txt -H "Authorization: Bearer <TOKEN>"

Here, the flags -X and -T define the request method and the file to send, respectively. Without any flags, curl sends a GET request by default. The commands above should return responses like:

{
    "file_path": "/data/test1/file_1.txt",
    "status": "created"
}
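For a handful of files, the two commands above can be generalized into a small loop. This is only a sketch: file_url and upload_dir_files are hypothetical helper names, the local relative path is reused as the path on the server, and paths containing spaces or other special characters are not URL-encoded.

```shell
API_BASE="https://manage.fairdata.fi/filestorage/api/v1"

# Build the upload URL for a file.
file_url() {
    # $1 = project identifier, $2 = file path relative to the project directory
    # ${2#/} strips a leading slash so the URL never contains "//".
    echo "$API_BASE/files/$1/${2#/}"
}

# Upload every regular file found under a directory, one request per file.
upload_dir_files() {
    # $1 = project identifier, $2 = local directory, $3 = token
    find "$2" -type f | while IFS= read -r path; do
        curl -s "$(file_url "$1" "$path")" -X POST -T "$path" \
            -H "Authorization: Bearer $3"
    done
}

# Usage (project name and TOKEN are placeholders):
# upload_dir_files test_project_a data/test1 "$TOKEN"
```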

Directory data/test2/ contains 100 files, so uploading them individually doesn’t make sense. Writing a shell script that uploads each of them separately would work, but the latency of each request would accumulate and make uploading many small files very slow. Thus, it’s best to make a ZIP archive and upload it. The archive is extracted by the server automatically. Pack data/test2/ into a ZIP archive with the command:

$ zip -r test2.zip data/test2/

Upload the ZIP archive to the server:

$ curl https://manage.fairdata.fi/filestorage/api/v1/archives/<project> -X POST -T test2.zip -H "Authorization: Bearer <TOKEN>"

This should return a response like:

{
    "file_path": "/",
    "message": "Uploading archive",
    "polling_url": "https://manage.fairdata.fi/filestorage/api/v1/tasks/5e7df16c2413de7e7b29f263",
    "status": "pending"
}

This response indicates that the upload phase has finished but the archive extraction continues on the server. The polling_url attribute of the response can be used to query the status of the extraction:

$ curl https://manage.fairdata.fi/filestorage/api/v1/tasks/5e7df16c2413de7e7b29f263 -H "Authorization: Bearer <TOKEN>"

As long as the extraction is ongoing on the server, the command above will return a response like:

{
    "message": "Extracting archive",
    "status": "pending"
}

Finally, when the extraction has finished on the server, the response looks like:

{
    "message": "archive uploaded to /",
    "status": "done"
}

When the task is finished, the archive has been extracted to the root of the project. If extracting the archive would overwrite files, the task will fail and the server returns 409: Conflict.
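Polling by hand gets tedious, so the status query can be wrapped in a loop. The sketch below repeats the request until the status is no longer "pending". The helper names and the POLL_INTERVAL variable are hypothetical, and treating every status other than "done" as a failure is an assumption, since only "pending" and "done" appear in this tutorial.

```shell
# Extract the value of the "status" field from a JSON response on stdin.
# A plain sed pattern is used so that jq is not required.
task_status() {
    sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Fetch the task document. Kept separate so it can be swapped out.
fetch_task() {
    # $1 = polling URL, $2 = token
    curl -s "$1" -H "Authorization: Bearer $2"
}

# Poll until the task leaves the "pending" state, then print the final status.
# Returns 0 if the final status is "done", 1 otherwise.
wait_for_task() {
    # $1 = polling URL, $2 = token
    while :; do
        status=$(fetch_task "$1" "$2" | task_status)
        [ "$status" = "pending" ] || break
        sleep "${POLL_INTERVAL:-5}"   # seconds between requests
    done
    echo "$status"
    [ "$status" = "done" ]
}

# Usage (TOKEN is a placeholder; the URL comes from the polling_url attribute):
# wait_for_task "https://manage.fairdata.fi/filestorage/api/v1/tasks/5e7df16c2413de7e7b29f263" "$TOKEN"
```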

GET files

Now that all the test files have been uploaded to the server, let’s check some of them. A list of all the directories and filenames in a project can be requested by sending a GET request to /filestorage/api/v1/files/<project>?all=true:

$ curl https://manage.fairdata.fi/filestorage/api/v1/files/<project>?all=true -H "Authorization: Bearer <TOKEN>"

More information about an individual file can be requested with, for example:

$ curl https://manage.fairdata.fi/filestorage/api/v1/files/<project>/data/test1/file_1.txt -H "Authorization: Bearer <TOKEN>"

This should return a response like:

{
    "file_path": "/data/test1/file_1.txt",
    "md5": "7dbdc7a8126dcbb55dd383fab5c2d6f8",
    "identifier": "urn:uuid:f7b4913c-7172-44ea-913b-9fa3a426c93d",
    "timestamp": "2019-03-20T14:23:30+00:00"
}

The file path is the path to which the file was uploaded; it is also used when creating the submission information package. The md5 value is the MD5 checksum of the file, and identifier is a UUID4 identifier that can be used to search for the file metadata in Metax. Checksums of the uploaded files should always be checked to make sure the files were not corrupted during the transfer. The checksums returned by the server should match the local checksums, which can be calculated with the md5sum command:

$ md5sum data/test1/file_?.txt
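The comparison itself can be scripted instead of being done by eye. This is a minimal sketch with hypothetical helper names; the expected checksum is the md5 value copied from the server response.

```shell
# Compute the MD5 checksum of a local file (hex digest only).
local_md5() {
    md5sum "$1" | cut -d ' ' -f 1
}

# Compare a local file against the checksum reported by the server.
verify_md5() {
    # $1 = local file, $2 = md5 value from the server response
    if [ "$(local_md5 "$1")" = "$2" ]; then
        echo "OK: $1"
    else
        echo "MISMATCH: $1" >&2
        return 1
    fi
}

# Usage, with the md5 value taken from the example response above:
# verify_md5 data/test1/file_1.txt 7dbdc7a8126dcbb55dd383fab5c2d6f8
```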

Information about the parent directory of the file can be requested with:

$ curl https://manage.fairdata.fi/filestorage/api/v1/files/<project>/data/test1/ -H "Authorization: Bearer <TOKEN>"

The response shows that the directory also has an identifier:

{
    "directories": [],
    "files": [
        "file_2.txt",
        "file_1.txt"
    ],
    "identifier": "452c6b0be99d3752bc177cd2e8efcb5a"
}

More information about the file metadata stored in Metax can be found in the Metax documentation.

DELETE files

Finally, let’s see how files can be deleted from temporary storage. This is done by sending a DELETE request to the Pre-Ingest File Storage API. A DELETE request removes the files from temporary storage and also removes the file metadata from Metax if the metadata is not associated with any dataset. Deletion can be requested for the whole project, a single directory or a single file, using the same paths as in the GET requests shown earlier. The following command deletes all the files in the project:

$ curl https://manage.fairdata.fi/filestorage/api/v1/files/<project> -X DELETE -H "Authorization: Bearer <TOKEN>"

The response looks like:

{
    "file_path": "/",
    "message": "Deleting files and metadata",
    "polling_url": "https://manage.fairdata.fi/filestorage/api/v1/tasks/5e7b60b52413de53896900fd",
    "status": "pending"
}

This indicates that the server is still deleting metadata and files. The polling_url attribute of the response can be used to query the status of the file and metadata deletion:

$ curl https://manage.fairdata.fi/filestorage/api/v1/tasks/5e7b60b52413de53896900fd -H "Authorization: Bearer <TOKEN>"

As long as the file and metadata deletion is ongoing on the server, the command above will return a response like:

{
    "message": "Deleting files and metadata: /",
    "status": "pending"
}

Finally, when the file and metadata deletion has finished on the server, the response looks like:

{
    "message": "Deleted files and metadata: /",
    "status": "done"
}

Files can be deleted from temporary storage after the dataset has been accepted for digital preservation. All files are removed automatically after 30 days, based on the timestamp returned by GET /filestorage/api/v1/files/<project>/path/to/the/file.
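To estimate when a file is due for automatic removal, 30 days can be added to its timestamp. The sketch below assumes that the removal time is exactly 30 × 24 hours after the timestamp, which this tutorial does not confirm, and it relies on GNU date; cleanup_date is a hypothetical helper name.

```shell
# Print the moment 30 days after an ISO 8601 timestamp, in UTC.
# Relies on GNU date (-d); other date implementations differ.
cleanup_date() {
    # $1 = timestamp as returned by the API, e.g. 2019-03-20T14:23:30+00:00
    epoch=$(date -u -d "$1" +%s) || return 1
    date -u -d "@$((epoch + 30 * 86400))" +%Y-%m-%dT%H:%M:%SZ
}

# Usage, with the timestamp from the example response shown earlier:
# cleanup_date 2019-03-20T14:23:30+00:00
```

For the example timestamp this gives 2019-04-19T14:23:30Z.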

upload-rest-api-client

upload-rest-api-client is a simple Python client for sending files to the Pre-Ingest File Storage. The client can be downloaded with the command:

$ git clone https://github.com/Digital-Preservation-Finland/upload-rest-api-client.git