Introduction
In our recent project, we had a requirement from one of our clients where we need to validate data in CSV files based on custom requirements. This validated CSV would need to be imported into Drupal 8 into various content types.
In this article, we will look at the requirement, the library, the architecture of the custom module, the different components of the module with some code samples and finally adding some ideas on how this module can be made more reusable and even contributed.
About Client
Our client is a well known international NGO with offices worldwide, each with different types of data management systems and frameworks. They wanted a centralized system to manage the data from each of these offices. Having concluded that Drupal 8 was the ideal solution to implement that centralized system, the challenge was to set up a migration pipeline to bring in data from all of the offices and their varying frameworks. Consequently, the files generated by these systems needed to be validated for specific constraints before being imported into our Drupal system.
Challenges and Goals
Following are the goals that the system should meet:
- The CSV files were in a custom format and there were multiple files with different structures and needed to be handled accordingly. Each column needed to have another validator.
- The files needed to be validated for errors before they could be imported and the errors needed to be logged with line numbers and relevant error messages.
- The validation had to be triggered automatically when the files were downloaded from a central location.
- Notification emails had to be sent on successful and failed validation to the IT admins.
- After successfully validating the files, the validator needed to trigger the next step of the process, which is importing the files.
The main challenges
- The validation had to cross-reference the incoming data with existing data and also with data in different files (referential integrity checks).
- We also had to check the uniqueness of certain columns in the CSV files. Doing this in a database is pretty easy and straightforward, but this had to be done before inserting it into the database.
Step 1: Choosing a CSV reader library
The first step was to figure out a PHP based CSV reader library. League CSV was found to be the best option due to the below reasons:
- It was managed by composer and was already being used by the Migrate module in Drupal core and hence no additional code needed to be added for the library to work.
- The library covered many common scenarios like iterating through rows of the CSV, getting the field values and headers, and streaming large CSV files.
- And finally, it was implemented in an object-oriented way.
Step 2: Architectural requirements
Below are the requirements we had concerning the architecture of the code:
- The code needs to work as an independent service to call it at different locations of code and thereby invoke validation wherever required.
- The validations need to be as generic as possible so that the same validation rule can be reused for different fields in the same CSV or in others.
- We need to have an extensible way to specify the validation to be done for each field. For example, whether a specific field can be allowed to be blank.
Step 3: Designing the components of the validator
To satisfy the above architectural requirements, we designed the validator module into the following sub-components:
The main service class
Below are the main responsibilities of this class:
- Load the CSV library and loop through each of the files in a particular folder.
- Use the methods supplied by the CSV league to read the file into our variables. For example, each row of the file will be stored in an array with an index containing each column data.
- During processing, the filename is taken in and checked to see if the validator method in the Validator class matching the filename exists.
- If the method exists, then validation is done for the file and errors are logged into the error log table.
- If there are no errors, the class triggers the next event, which is migration using a predefined custom event via the Event API of Drupal.
- This also passes the status of the import to the calling class so that emails can be triggered to the site admins.
The Validators class
Here, we basically assign constraints for each file type in a method. The input to the validator class would be a complete row.
The Constraints class
This class contains the individual constraints that check if a particular type column meets the required criteria. These constraints are methods that take in the column value as a parameter and return an error message if it does not meet the criteria for that column type. This class will be invoked from the validators class for each column in every row.
The Error log
As its name suggests, the validator needed to capture the errors and log them somewhere. We defined a custom table using the database hooks provided by Drupal. A custom view was defined in code to read the data from this table. The errors captured by the constraint class were logged into the database using this logger.
Eventsubscriber and mail notification
We needed the validation to be auto-triggered when the files were downloaded. To achieve this, we tapped into Drupal’s EventSubscriber and Response APIs.
Referential Integrity checks
Most of the columns did not have any relation with existing data and could be validated on the fly. However, some of the data had to be validated if it has corresponding references either in the database or in another CSV file. We did this as follows.
- For those values which act as a parent, dump them into a temporary table, which will be cleared after validation is completed.
- When we arrive at another CSV with a column that references values dumped above, then we query the above table to check if the value is present. If yes, return TRUE.
- If the value is not present in the temporary table, then we search the Drupal database as the value might have been imported as part of the previous import. If not, then we throw a referential error for that row in the CSV.
The code snippets are available here.
We used the migrated data as a source for a headless backend using REST. For more details on the specifications, refer to our blog on how to validate API response using OpenAPI3.
Future scope and ideas to extend this as a module by itself
We have written the module with an architecture where the validators can be reused but require some coding effort. Below are changes that can be done to make this module a contribution.
- Add configurations to have a list of files that need to be validated.
- Each file will have an option to add the fields that need to be validated and the type of data (similar to what you have when creating content type).
- Based on the above list of files and field types, we can validate any number of CSVs with any number of columns.
- We would need to modify the above classes to fetch the columns' data type and call respecting constraints for each CSV.
As a result of doing the above changes, anyone will be able to use this module to validate CSV files with their own columns.
Hope this blog helped you with this module and how it can be made more reusable and even contributed. Share your experience in the comments below!
Binny Thomas, PHP/Drupal Engineer - L2
Addicted to Quora, he is a geek but also gentle and nostalgic. His idea of an enjoyable holiday is a quiet day relaxing at home. And if you find him pulling his hair, he is probably deep-diving in his thoughts!
Leave us a comment