mscharhag, Programming and Stuff;

A blog about programming and software development topics, mostly focused on Java technologies including Java EE, Spring and Grails.

  • Tuesday, 20 October, 2020

    REST: Updating resources

    When building RESTful APIs over HTTP the PUT method is typically used for updating, while POST is used for creating resources. However, create and update operations do not perfectly align with the HTTP verbs PUT and POST. In certain situations PUT can also be used for resource creation. See my post about the differences between POST, PUT and PATCH for more details.

    Within the next sections we will look at updating resources with PUT.

    Note that this post does not cover partial updates (e.g. updating only a single field) which can be done with HTTP PATCH. This topic will be covered in a separate future blog post.

    Updating resources with HTTP PUT

    HTTP PUT replaces the resource at the request URI with the given values. This means the request body has to contain all available values, even if we only want to update a single field.

    Assume we want to update the product with ID 345. An example request might look like this:

    PUT /products/345
    Content-Type: application/json
    
    {
        "name": "Cool Gadget",
        "description": "Looks cool",
        "price": "24.99 USD"
    }
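
    On the server side, such a request could be handled with a few lines of Spring MVC. The following is a minimal sketch, assuming hypothetical ProductService and Product classes (they are not part of the example API):

    @RestController
    public class ProductController {

        private final ProductService productService; // hypothetical service bean

        public ProductController(ProductService productService) {
            this.productService = productService;
        }

        // PUT replaces the resource, so all modifiable fields are taken from the request body
        @PutMapping("/products/{id}")
        public ResponseEntity<Void> updateProduct(@PathVariable long id, @RequestBody Product product) {
            productService.replace(id, product); // overwrite all modifiable fields
            return ResponseEntity.noContent().build(); // HTTP 204 (No content)
        }
    }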

    Responses to HTTP PUT update operations

    You can find various discussions about the question whether an update via HTTP PUT should return the updated resource.

    There is no single truth here. If you think it is useful to return the updated resource in your situation: do it. Just make sure to be consistent for all update operations in your API.

    The server responds to HTTP PUT requests usually with one of the following HTTP status codes:

    • HTTP 200 (Ok): The request has been processed successfully and the response contains the updated resource.
    • HTTP 204 (No content): The request has been processed successfully. The updated resource is not part of the response.
    • HTTP 400 (Bad request): The operation failed due to invalid request parameters (e.g. missing or invalid values in the request body).

    Note that responses to HTTP PUT are not cacheable (See the last paragraph of RFC 7231 4.3.4).

    Replacing resources in real life

    As mentioned earlier, HTTP PUT replaces the resource at a given URI. In real life this can lead to various discussions because resources are often not really replaced.

    Assume we send a GET request to the previously used product resource. The response payload might look like this:

    GET /products/345
    
    {
        "id": 345,
        "name": "Cool Gadget",
        "description": "Looks cool",
        "price": "24.99 USD",
        "lastUpdated": "2020-10-17T09:31:17",
        "creationDate": "2029-12-21T07:14:31",
        "_links": [
            { "rel": "self", "href": "/products/345"},
            ..
        ]
    }

    Besides name, description and price we get the product ID, creation and update dates and a hypermedia _links element.

    id and creationDate are set by the server when the resource is created. lastUpdated is set whenever the resource is updated. Resource links are built by the server based on the current resource state.

    In practice there is no reason why an update request needs to contain those fields. They are either ignored by the server or, at worst, lead to HTTP 400 responses if the client sends unexpected values.

    One point can be made here about lastUpdated. It would be possible to use this field to detect concurrent modification on the server. In this case, clients send the lastUpdated field they retrieved via a previous GET request back to the server. On an update request the server can now compare the lastUpdated value from the request with the one stored on the server. If the server state is newer, the server responds with HTTP 409 (Conflict) to notify the client that the resource has been changed since the last GET request.

    However, the same can be accomplished using the HTTP ETag header in a more standardized way.
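
    For example, the client could send the ETag it received with a previous GET response back in an If-Match header (the ETag value below is made up):

    GET /products/345

    HTTP/1.1 200 (OK)
    ETag: "a3f5d812"
    ...

    PUT /products/345
    If-Match: "a3f5d812"
    
    {
        "name": "Cool Gadget",
        ...
    }

    If the resource has been changed in the meantime, the ETag no longer matches and the server responds with HTTP 412 (Precondition failed).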

    Now it can be debated whether we really replace the resource if we do not send certain fields with the PUT request.

    I recommend being pragmatic and only require the fields that can be modified by the client. Other fields can be skipped. However, the server should not deny the request if other fields are sent. Those fields should just be ignored. This gives the client the option to retrieve the resource via a GET request, modify it and send it back to the server.

    HTTP PUT and idempotency

    The PUT method is idempotent. This means that multiple identical PUT requests must result in the same outcome. Typically no extra measures are required to achieve this as update behavior is usually idempotent.

    However, if we look at the previous example GET request, there is again something that can be discussed:

    Does the lastUpdated field break idempotency for update requests?

    There are (at least) two valid ways to implement a lastUpdated field on the server:

    • lastUpdated changes whenever the resource state changes. In this case we have no idempotency issue. If multiple identical PUT requests are sent, only the first one changes the lastUpdated field.
    • lastUpdated changes with every update request even if the resource state does not change. Here lastUpdated tells us how up-to-date the resource state is (and not when it changed the last time). Sending multiple identical update requests results in a changing lastUpdated field for every request.

    I would argue that even the second implementation is not a real problem for idempotency.    

    The HTTP RFC says:

    Like the definition of safe, the idempotent property only applies to what has been requested by the user; a server is free to log each request separately, retain a revision control history, or implement other non-idempotent side effects for each idempotent request.

    A changing lastUpdated field can be seen as a non-idempotent side effect. It has not been actively requested by the user and is completely managed by the server.

  • Thursday, 15 October, 2020

    Spring Security: Delegating authorization checks to bean methods

    In this post we will learn how authorization checks can be delegated to bean methods with Spring Security. We will also learn why this can be very useful in many situations and how it improves testability of our application. Before we start, we will quickly look over common Spring Security authorization methods.

    Spring Security and authorization

    Spring Security provides multiple ways to deal with authorization. Some of them are based on user roles, others are based on more flexible expressions or custom beans. I don't want to go into details here; many articles are already available on this topic. Just to give you a quick overview, here are a few commented examples of common ways to define access rules with Spring Security:

    Restricting URL access via a WebSecurityConfigurerAdapter:

    public class SecurityConfig extends WebSecurityConfigurerAdapter {
        
        @Override
        protected void configure(HttpSecurity http) throws Exception {
            http.authorizeRequests()
            
                // restrict url access based on roles
                .antMatchers("/internal/**").hasRole("ADMIN")
                .antMatchers("/projects/**").hasRole("USER")
                
                // restrict url access based on expression
                .antMatchers("/users/{username}/profile")
                    .access("principal.username == #username");            
        }
    }

    Using annotations to restrict access to methods:

    @Service
    public class SomeService {
    
        // Using Spring's @Secured annotation for role checks
        @Secured("ROLE_ADMIN")
        public void doAdminStuff() { }
    
        // Using the JSR-250 @RolesAllowed annotation for role checks
        @RolesAllowed("ROLE_ADMIN")
        public void doOtherAdminStuff() { }
    
        // Using Spring's @PreAuthorize annotation with an expression
        @PreAuthorize("hasRole('ADMIN') and hasIpAddress('192.168.1.0/24')")
        public void doMoreAdminStuff() { }
        
        // Using an expression to delegate to a PermissionEvaluator bean
        @PreAuthorize("hasPermission(#stuff, 'write')")
        public void doStuff(Stuff stuff) { }
    }

    What to use when?

    If roles are the only thing you need, it is easy. You just need to decide if you prefer defining the required roles based on URLs or based on methods in your Java code. If you prefer the latter, just pick one annotation and use it consistently.

    In case you need some ACL-like security (e.g. User x has permission y on object z) using @PreAuthorize with hasPermission(..) and a custom PermissionEvaluator is often a good choice. Also, have a look at the Spring Security ACL support.
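
    To give an idea what such an evaluator looks like, here is a minimal sketch supporting the hasPermission(#stuff, 'write') check from the snippet above (the Stuff type and its owner check are made up for this sketch):

    @Component
    public class StuffPermissionEvaluator implements PermissionEvaluator {

        @Override
        public boolean hasPermission(Authentication auth, Object target, Object permission) {
            if (target instanceof Stuff && "write".equals(permission)) {
                // only the owner is allowed to write (Stuff.getOwner() is assumed)
                return ((Stuff) target).getOwner().equals(auth.getName());
            }
            return false;
        }

        @Override
        public boolean hasPermission(Authentication auth, Serializable targetId, String targetType, Object permission) {
            return false; // id-based checks are not supported in this sketch
        }
    }

    Note that the evaluator still has to be registered with the method security expression handler; the Spring Security documentation describes the details.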

    However, there is a huge field between both approaches where roles are not enough but ACLs might be too fine-grained or just the wrong tool. Here are a few example authorization rules that do not fit well into either solution:

    Access to a resource should only be given ..

    • .. to the owner of the resource (e.g. a user can only change his own profile)
    • .. to users with role x from department y
    • .. during standard business times
    • .. to administrators who signed in using two-factor authentication
    • .. to users who connect from specific IP addresses

    All those examples can probably be solved by building a security expression and passing it to @PreAuthorize. However, in practice it is often not that simple.

    Let us look at the last example (the IP address check). The previously shown code snippet contains a @PreAuthorize example that does exactly this:

    @PreAuthorize("hasRole('ADMIN') and hasIpAddress('192.168.1.0/24')")
    

    This looks nice as an example and shows what you can do with security expressions. However, now consider:

    • You possibly need to define more than one IP range. So, you have to combine multiple hasIpAddress(..) checks.
    • You probably do not want to hard-code IP addresses in your code. Instead they should be resolved from configuration properties.
    • It is likely that you need the same access check in different parts of your code. You probably do not want to duplicate it over and over.

    In other cases you might need to do a database look-up or call another external system to decide if a user is allowed to access a resource.

    Simple expressions are fine. However, if they get larger and are scattered all over a code base they can become painful to maintain.
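
    To give an idea how this can be cleaned up, the IP check could be moved into its own bean that resolves the allowed ranges from a configuration property. A rough sketch (the bean name, property name and class are made up; IpAddressMatcher is Spring Security's CIDR matcher):

    @Component("ipAccess")
    public class IpAccessHandler {

        private final List<IpAddressMatcher> matchers;

        // e.g. security.allowed-ip-ranges=192.168.1.0/24,10.0.0.0/16 (assumed property)
        public IpAccessHandler(@Value("${security.allowed-ip-ranges}") String[] ranges) {
            this.matchers = Arrays.stream(ranges)
                    .map(IpAddressMatcher::new)
                    .collect(Collectors.toList());
        }

        public boolean isTrustedIp() {
            // Spring Security keeps the client address in the authentication details
            Authentication auth = SecurityContextHolder.getContext().getAuthentication();
            WebAuthenticationDetails details = (WebAuthenticationDetails) auth.getDetails();
            return matchers.stream().anyMatch(matcher -> matcher.matches(details.getRemoteAddress()));
        }
    }

    The rule from above then shrinks to @PreAuthorize("hasRole('ADMIN') and @ipAccess.isTrustedIp()") and can be reused wherever it is needed.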

    Side note: Spring Security implements method security by proxying the target bean. Security checks are then added via the proxy. If you don't know about proxies, you should probably read my post about the Proxy pattern.

    Delegating access decisions to beans

    Within security expressions we can reference beans using the @beanname syntax. This feature can help us implement the previously described authorization rules.

    Let's look at an example:

    @Service
    public class ProjectService {
    
        @PreAuthorize("@projectAccess.canUpdateProjectName(#id)")
        public void updateProjectName(int id, String newName) {
            ...
        }
        
        @PreAuthorize("@projectAccess.canDeleteProject(#id)")
        public void deleteProject(int id) {
            ...
        }
    }

    Here we define a ProjectService class with two methods, both annotated with @PreAuthorize. Within the security expression we delegate the access check to methods of a bean named projectAccess. Relevant method parameters (here id) are passed to projectAccess methods.

    projectAccess looks like this:

    @Component("projectAccess")
    public class ProjectAccessHandler {
    
        private final ProjectRepository projectRepository;
        private final AuthenticatedUserService authenticatedUserService;
    
        public ProjectAccessHandler(ProjectRepository repo, AuthenticatedUserService aus) {
            this.projectRepository = repo;
            this.authenticatedUserService = aus;
        }
    
        public boolean canUpdateProjectName(int id) {
            return isProjectOwner(id);
        }
    
        public boolean canDeleteProject(int id) {
            return isProjectOwner(id);
        }
    
        private boolean isProjectOwner(int id) {
            User user = authenticatedUserService.getAuthenticatedUser();
            Project project = projectRepository.findById(id);
            return (project.getOwner().equals(user.getUsername()));
        }
    }

    It is a simple bean with two public methods that are called via security expressions. In both cases only the owner of the project is allowed to perform the operation. To determine the project owner we first have to look up the related project using a ProjectRepository bean.

    The injected AuthenticatedUserService is a simple facade around Spring Security's SecurityContextHolder:

    @Service
    public class AuthenticatedUserService {
        public User getAuthenticatedUser() {
            Authentication authentication = SecurityContextHolder.getContext().getAuthentication();
            return (User) authentication.getPrincipal();
        }
    }

    This cleans up our code a little bit because it removes Spring Security internals (and the type cast) from our access control logic. It also helps when writing unit tests: we do not have to deal with static method calls during tests.

    Note that we use the standard Spring Security User class for simplicity in this example. Often it is a good idea to create your own custom class as principal. However, this is a topic for another blog post.

    Testing access rules

    Another important benefit of this approach is that we can test access rules in simple unit tests. Since we call the access-check methods directly, no Spring application context is required to evaluate @PreAuthorize expressions. This speeds up tests a lot.

    A simple test for canUpdateProjectName(..) might look like this:

    public class ProjectAccessHandlerTest {
    
        private ProjectRepository repository = mock(ProjectRepository.class);
        private AuthenticatedUserService service = mock(AuthenticatedUserService.class);
        private ProjectAccessHandler accessHandler = new ProjectAccessHandler(repository, service);
        private User john = new User("John", "password", Collections.emptyList());
    
        @Test
        public void canUpdateProjectName_isOwner() {
            Project project = new Project(1, "John", "John's project");
            when(repository.findById(1)).thenReturn(project);
            when(service.getAuthenticatedUser()).thenReturn(john);
            assertTrue(accessHandler.canUpdateProjectName(1));
        }
    
        @Test
        public void canUpdateProjectName_isNotOwner() {
            Project project = new Project(1, "Anna", "Anna's project");
            when(repository.findById(1)).thenReturn(project);
            when(service.getAuthenticatedUser()).thenReturn(john);
            assertFalse(accessHandler.canUpdateProjectName(1));
        }
    }

    Summary

    Many authorization requirements cannot be solved by using roles alone and ACLs often do not fit. In those situation it can be a viable solution to create separate beans for handling access checks. With @PreAuthorize we can delegate the authorization check to those beans. This also simplifies writing tests as we do not have to create a Spring application context to test authorization constraints.

    You can find the shown example code on GitHub.

  • Wednesday, 7 October, 2020

    REST: Working with asynchronous operations

    Sometimes a REST API operation might take a considerable amount of time to complete. Instead of letting the client wait until the operation completes we can return an immediate response and process the request asynchronously. In the following sections we will see how this can be achieved.

    Pointing to a status resource

    When processing a request asynchronously we should provide a way for the client to get the current processing status. This can be done by introducing a separate status resource. The client can then request this resource at regular intervals until the operation has been completed.

    As an example we will use the creation of a product which might take some time to complete.

    We start the creation process by issuing this request:

    POST /products
    {
        "name": "Cool Gadget",
        ...
    }

    Response:

    HTTP/1.1 202 (Accepted)
    Location: /products/queue/1234

    For standard creation requests the server would respond with HTTP 201 and a Location header pointing to the newly created product (See REST Resource creation).

    However, in this example, the server responds with HTTP 202 (Accepted). This status code indicates that the request has been accepted but not yet processed. The Location header points to a resource that describes the current processing status.

    Clients can obtain the status by sending a GET request to this URI:

    GET /products/queue/1234

    Response:

    HTTP/1.1 200 (OK)
    
    {
        "status": "waiting",
        "issuedAt": "2020-10-03T09:34:24",
        "request": {
            "name": "Cool Gadget",
            ...
        },
        "links": [{ 
            "rel": "cancel", 
            "method": "delete", 
            "href": "/products/queue/1234"
        }]
    }

    In this example the response contains a status field, an issue date and the request data. Clients can now poll this resource at regular intervals to see if the status changes.
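
    A simple polling loop on the client side might look like this (a shell sketch; the host name is made up and jq is only used to extract the status field):

    while true; do
        status=$(curl -s https://api.example.com/products/queue/1234 | jq -r '.status')
        echo "current status: $status"
        sleep 30    # wait between polls
    done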

    You might also want to provide a way to cancel the operation. This can be done by sending a DELETE request to the status resource:

    DELETE /products/queue/1234

    If you are using hypermedia controls in your REST responses, you should provide a link to this operation (as shown in the example above).

    In case you can estimate the time the operation needs, you can add a Retry-After HTTP header to the response. This tells the client how long it should wait before sending the follow-up request.
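
    The initial response could then look like this (120 seconds is just an example value):

    HTTP/1.1 202 (Accepted)
    Location: /products/queue/1234
    Retry-After: 120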

    What to do when the operation has been completed?

    When the requested operation has been completed we should communicate this through the status resource. If possible the status resource should provide links to processed or newly created resources.

    Assume the product from our previous example has been created. If the client now requests the status resource the response looks different:

    GET /products/queue/1234

    Response:

    HTTP/1.1 303 (See other)
    Location: /products/321

    After the product has been created, the queue element is no longer available. This is indicated by HTTP 303 (See other). The response also contains a Location header that points to the newly created product resource.

    However, sending HTTP 303 with a Location header might not always be possible. Assume we have an asynchronous operation that imports multiple products. The result of this operation is not a single resource we can point to.

    In this case we should keep the status resource for at least some time. A response might look like this:

    {
        "status": "completed",
        "took": "PT5M23S"    
        "imported": [
            {
                "name": "Cool Gadget",
                "links": [{
                    "rel": "self", "href": "/products/345"
                }]
            }, {
                "name": "Nice Gadget",
                "links": [{
                    "rel": "self", "href": "/products/346"
                }]
            },
            ...
            
        ]
    }

    The status field indicates that the operation has been completed. The took field contains the processing time as an ISO 8601 duration.

    If suitable we can provide links to related resources as part of the response.

    Callback URLs

    Sometimes it can be a viable option to support callbacks. This lets the client submit a callback URL with the request data. When the operation is finished, the server sends a request to this URL to inform the client. Usually this is a POST request with some status information.

    The initial request issued by the client might look like this:

    POST /products
    {    
        "name": "Cool Gadget",
        ...
        "_callbackUrl": "https://coolapi.myserver.com/product-callback"
    }

    Here we provide a callback URL to the server using the _callbackUrl field. In JSON, a leading underscore is sometimes used for additional meta properties. Of course you can adapt this to your own style if you don't like this syntax.

    As in the previous example the server responds with HTTP 202 and a status resource:

    HTTP/1.1 202 (Accepted)
    Location: /products/queue/1234

    When the operation is finished the server updates the status resource and sends a POST request to the provided URL:

    POST https://coolapi.myserver.com/product-callback
    {
        "status": "completed",
        "took": "PT14M22S",
        "links": [{ 
            "rel": "product", 
            "href": "/products/321"
        }]
    }

    This tells the client that the product has been imported successfully.

    The problems with callback URLs

    Callback handling can increase the complexity on the server side significantly. What happens if a client does not respond at the provided URL? How often should the request be retried? The server needs to handle all these cases.

    Authentication can be a problem. If the client and the server are not running in a trusted network the client needs a way to authenticate requests from the server. Otherwise, any untrusted third party would be able to POST data to the callback endpoint. This means the server needs to know its clients because they need to exchange some authentication information before the actual request. The server has to store this information securely for each client.

    Callback requests can also be hard to test and debug for developers. Often it is not possible to receive callback requests on a local development machine. This is typically blocked by network policies for security reasons.

    Summary

    Longer-running asynchronous processes can be modeled with REST APIs. The server tells the client that the request is handled asynchronously by sending an HTTP 202 (Accepted) status code. A Location header is used to point to a resource that gives information about the current processing status. The client can poll this status resource at regular intervals until the operation has been completed. Adding a Retry-After header can reduce unnecessary requests.

    Additionally the server can support callback URLs to inform the client after the request has been processed.


  • Tuesday, 29 September, 2020

    REST: Deleting resources

    In RESTful APIs resources are typically deleted using the HTTP DELETE method. The resource that should be deleted is identified by the request URI. DELETE is an idempotent HTTP operation. Sending the same DELETE request multiple times should only alter the server state once.

    Deleting single resources

    Single REST resources are usually identified by URIs containing a unique identifier. For example, we can delete the artist resource located at /artists/123 by sending a DELETE request to this URI.

    Request:

    DELETE /artists/123

    Response:

    HTTP/1.1 204 (No content)

    The server can respond to delete requests with different HTTP status codes:

    • HTTP 200 (Ok) indicates a successful deletion with additional information. In this case, the response body can contain the deleted resource or some details about the deletion.
    • HTTP 204 (No content) is used to indicate a successful deletion with no additional information (response body is empty).
    • HTTP 202 (Accepted) is returned if the server accepted the request, but the deletion has not been completed. For example, the server might have queued the request to process it sometime in the future.

    If no resource exists at the given URI, an HTTP 404 (Not found) status code should be returned.

    After a resource has been deleted, a GET request on the resource URI should return HTTP 404 (Not found) or HTTP 410 (Gone).

    Deleting resource collections

    The HTTP DELETE operation can also be used to remove all items from a resource collection. For example, we can delete all artist resources by sending a DELETE request to /artists.

    Request:

    DELETE /artists 

    Response:

    HTTP/1.1 200 (Ok)
    
    {
        "total": 321
    }

    In this example the server responds with HTTP 200 and a response body containing the total number of deleted resources.

    If you want, you can combine the delete operation with query parameters to filter the collection. For example, this request might delete all orders that have been fulfilled before 2015-01-01:

    DELETE /orders?fulfilled-before=2015-01-01
    

    While the deletion of all collection elements can be useful, it is not common to support this operation. Before you provide this feature in your REST API, you should think twice about whether a client should really be able to delete an entire collection with a single request.

    Request body and the DELETE method

    Delete requests usually do not need a request body. However, in rare situations a delete operation might need some additional instructions besides filter parameters, which would have to be transported in the payload body.

    The HTTP RFC 7231 describes the usage of the payload body for the DELETE method like this:

     A payload within a DELETE request message has no defined semantics; sending a payload body on a DELETE request might cause some existing implementations to reject the request

    On Stack Overflow you can find a lengthy discussion about whether the request body can and should be used for DELETE requests.

    In my opinion, using the request body for HTTP DELETE operations should be avoided. It is generally unexpected and might produce hard-to-track issues with certain technologies. As a workaround, a POST request to a separate resource can be used.
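
    Such a workaround might look like this (the resource name and request fields are made up for illustration):

    POST /orders/deletions
    Content-Type: application/json
    
    {
        "fulfilledBefore": "2015-01-01"
    }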

    Summary

    Using the HTTP DELETE method we can delete resources within a REST API. When necessary, the DELETE method can also be used to delete entire collections. Services usually should respond to delete operations with HTTP 200 (Ok), 202 (Accepted) or 204 (No content) status codes.


  • Monday, 21 September, 2020

    Command-line JSON processing with jq

    In this post we will learn how to parse, pretty-print and process JSON from the command-line with jq. At the end we will even use jq to do a simple JSON to CSV conversion. jq describes itself as a lightweight and flexible command-line JSON processor. You can think of Unix shell tools like sed, grep and awk, but for JSON.

    jq works on various platforms. Prebuilt binaries are available for Linux, Windows and macOS. See the jq download site for instructions.

    For many of the following examples we will use a file named artist.json with the following JSON content:

    {
        "name": "Leonardo da Vinci",
        "artworks": [{
                "name": "Mona Lisa",
                "type": "Painting"
            }, {
                "name": "The Last Supper",
                "type": "Fresco"
            }
        ]
    }

    Pretty-printing JSON and basic jq usage

    jq is typically invoked by piping a piece of JSON to its standard input. For example:

    echo '{ "foo" : "bar" }' | jq
    {
      "foo": "bar"
    }

    Without any arguments jq simply outputs the JSON input data. Note that the output data has been reformatted. jq outputs pretty-printed JSON by default. This lets us pipe minimized JSON to jq and get a nicely formatted output.
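
    The opposite direction works as well: with the -c option jq produces compact, single-line output:

    cat artist.json | jq -c '.artworks[0]'
    {"name":"Mona Lisa","type":"Painting"}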

    jq accepts one or more filters as parameters. The simplest filter is . which returns the whole JSON document. So this example produces the same output as the previous one:

    echo '{ "foo" : "bar" }' | jq '.'

    We can now add a simple object identifier to the filter. For this we will use the previously mentioned artist.json file. With .name we select the value of the name element:

    cat artist.json | jq '.name'
    "Leonardo da Vinci"
    

    Arrays can be navigated using the [] syntax:

    cat artist.json | jq '.artworks[0]'
    {
      "name": "Mona Lisa",
      "type": "Painting"
    }

    To get the name of the first painting we use:

    cat artist.json | jq '.artworks[0].name'
    "Mona Lisa"

    If we want to get the names of all artworks we simply skip the array index parameter:

    cat artist.json | jq '.artworks[].name'
    "Mona Lisa"
    "The Last Supper"

    Processing curl and wget responses

    Of course we can also pipe responses from remote systems to jq. This is not a specific feature of jq, but because it is a common use case we will look at two short examples. For these examples we will use the public GitHub API to get information about my blog-examples repository.

    With curl this is very simple. This extracts the name and full_name properties from the GitHub API response:

    curl https://api.github.com/repos/mscharhag/blog-examples | jq '.name,.full_name'
    "blog-examples"
    "mscharhag/blog-examples"

    Note that we used a comma here to separate two different filters.

    With wget we need to add a few parameters to get the output in the right format:

    wget -cq https://api.github.com/repos/mscharhag/blog-examples -O - | jq '.owner.html_url'
    "https://github.com/mscharhag"
    

    Pipes, functions and operators

    In this section we will look into more ways of filtering JSON data.

    With the | operator we can combine two filters. It works similarly to the standard Unix shell pipe: the output of the filter on the left is passed to the one on the right.

    Note that .foo.bar is the same as .foo | .bar (the JSON element .foo is passed to the second filter which then selects .bar).
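
    We can quickly verify this with our example file:

    cat artist.json | jq '.artworks[0] | .name'
    "Mona Lisa"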

    Pipes can be combined with functions. For example, we can use the keys function to get the keys of a JSON object:

    cat artist.json | jq '. | keys'
    [
      "artworks",
      "name"
    ]

    With the length function we can get the number of elements in an array:

    cat artist.json | jq '.artworks | length'
    2

    The output of the length function depends on the input element:

    • If a string is passed, then it returns the number of characters
    • For arrays the number of elements is returned
    • For objects the number of key-value pairs is returned
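
    For example:

    echo '"Mona Lisa"' | jq 'length'
    9
    
    echo '{"name": "Mona Lisa", "type": "Painting"}' | jq 'length'
    2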

    We can combine the length function with comparison operators:

    cat artist.json | jq '.artworks | length < 5'
    true

    Assume we want only the artworks whose type is Painting. We can accomplish this using the select function:

    cat artist.json | jq '.artworks[] | select(.type == "Painting")'
    {
      "name": "Mona Lisa",
      "type": "Painting"
    }

    select accepts an expression and returns only those inputs that match the expression.
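
    select also combines nicely with the filters we have seen so far. For example, we can collect the names of all paintings into an array:

    cat artist.json | jq '[.artworks[] | select(.type == "Painting") | .name]'
    [
      "Mona Lisa"
    ]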

    Transforming JSON documents

    In this section we will transform the input JSON document into a completely different format.

    We start with this:

    cat artist.json | jq '{(.name): "foo"}'
    {
      "Leonardo da Vinci": "foo"
    }

    Here we create a new JSON object which uses the .name element as key. To use an expression as an object key we need to add parentheses around the key (this does not apply to values, as we will see in the next example).

    Now let's add the list of artworks as value:   

    cat artist.json | jq '{(.name): .artworks}'
    {
      "Leonardo da Vinci": [
        {
          "name": "Mona Lisa",
          "type": "Painting"
        },
        {
          "name": "The Last Supper",
          "type": "Fresco"
        }
      ]
    }

    Next we apply the map function to the artworks array:

    cat artist.json | jq '{(.name): (.artworks | map(.name) )}'
    {
      "Leonardo da Vinci": [
        "Mona Lisa",
        "The Last Supper"
      ]
    }

    map allows us to modify each array element with an expression. Here, we simply select the name value of each array element.

    Using the join function we can join the array elements into a single string:

    cat artist.json | jq '{(.name): (.artworks | map(.name) | join(", "))}'
    {
      "Leonardo da Vinci": "Mona Lisa, The Last Supper"
    }

    The resulting JSON document now contains only the artist and a comma-separated list of his artworks.

    Converting JSON to CSV

    We can also use jq to perform a simple JSON to CSV transformation. As an example we will transform the artworks array of our artist.json file to CSV.

    We start with adding the .artworks[] filter:

    cat artist.json | jq '.artworks[]'
    {
      "name": "Mona Lisa",
      "type": "Painting"
    }
    {
      "name": "The Last Supper",
      "type": "Fresco"
    }

    This deconstructs the artworks array into separate JSON objects.

    Note: If we used .artworks (without []) we would get an array containing both elements. By adding [] we get two separate JSON objects that we can process individually.

    Next we convert these JSON objects to arrays. For this we pipe the JSON objects into a new filter:

    cat artist.json | jq '.artworks[] | [.name, .type]'
    [
      "Mona Lisa",
      "Painting"
    ]
    [
      "The Last Supper",
      "Fresco"
    ]

    The new filter returns a JSON array containing two elements (selected by .name and .type).

    Now we can apply the @csv operator which formats a JSON array as CSV row:

    cat artist.json | jq '.artworks[] | [.name, .type] | @csv'
    "\"Mona Lisa\",\"Painting\""
    "\"The Last Supper\",\"Fresco\""

    jq applies JSON encoding to its output by default. Therefore, we now see two CSV rows with JSON escaping, which is not very useful.

    To get the raw CSV output we need to add the -r parameter:

    cat artist.json | jq -r '.artworks[] | [.name, .type] | @csv'
    "Mona Lisa","Painting"
    "The Last Supper","Fresco"
    
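    If we also want a header row, we can prepend an array with the column names. Since the comma operator produces multiple outputs and the pipe binds less tightly, each array is piped to @csv in turn:

    cat artist.json | jq -r '["name", "type"], (.artworks[] | [.name, .type]) | @csv'
    "name","type"
    "Mona Lisa","Painting"
    "The Last Supper","Fresco"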

    Summary

    jq is a powerful tool for command-line JSON processing. Simple tasks like pretty-printing or extracting a specific value from a JSON document are quickly done in a shell with jq. Furthermore the powerful filter syntax combined with pipes, functions and operators allows us to do more complex operations. We can transform input documents to completely different output documents and even convert JSON to CSV.

    If you want to learn more about jq you should look at its excellent documentation.