XML-INTO Meets Open Access With RPG’s New DATA-INTO
An introduction to the latest addition to the RPG language: DATA-INTO. Perhaps the best way to think of DATA-INTO is as a combination of Open Access and XML-INTO.
This article introduces the latest addition to the RPG language: DATA-INTO. As we hinted in the title, perhaps the best way to think of DATA-INTO is as a combination of Open Access and XML-INTO. XML-INTO takes data and unpacks it into a matching RPG data structure. Open Access utilizes a custom handler, written by you or supplied by a third party, to treat data originating from any source as if it came from a file. DATA-INTO places data into a data structure (or array), as XML-INTO does, but it uses a custom parser to figure out what data goes where, as Open Access does.
Why did IBM introduce this latest feature? At least in part because they have been receiving many requests for a JSON equivalent to XML-INTO. Of course there are many other forms of character data, from CSVs to keyword-text pairs, that could also benefit from native language support. Who knows what the next “hot” data interchange format will be? After all, how many of us had even heard of JSON 10 years ago? Yet it has come to dominate the world of data interchange. That was supposed to be XML’s role, but things change rapidly in this crazy business of ours.
So, bearing this in mind, IBM decided to follow the path that they had set with Open Access and design this new capability as a language extension point. This not only allows companies to write their own parsers, but also allows third parties and open-source groups to offer parsers for a variety of data interchange formats. To get the ball rolling, IBM is supplying a number of sample parsers. While many are primarily intended as teaching tools, they are also supplying a complete JSON parser, which should make those who lobbied for JSON-INTO happy.
This approach to adding functionality to the language is one that we expect to see RPG continue to follow in the future: using IBM’s resources to facilitate the extension of the language rather than simply adding new features with limited capability.
OK, enough with the introduction—let’s look at the implementation.
Basic Syntax
The basic syntax for DATA-INTO is very similar to that used for XML-INTO and this similarity is more than skin deep.
One difference that you’ll immediately notice however is the use of %DATA where %XML would have been used with XML-INTO. Apart from the difference in name, the basic purpose of the parameter (to identify the source of the data and options to be used) is much the same as with XML-INTO.
The real difference comes with the third parameter, %PARSER. This is used to identify the program (or subprocedure) that will be performing the parsing operation. This is where you see the first similarity with Open Access.
Basic Syntax:
DATA-INTO{(EH)} receiver
   %DATA( document {: options })
   %PARSER( parser {: parser options });
Example:
DATA-INTO accounts
%data('Sample1.csv': 'doc=file case=any')
%parser('*LIBL/PARSECSV1');
In the example shown, the Data Structure (DS) accounts is being loaded with data extracted from the file Sample1.csv. The work of identifying the fields within that file and their associated values is to be performed by the parser PARSECSV1. As you can see in the syntax diagram, the caller can also pass additional information to the parser via the parser options parameter.
In addition to the basic syntax shown here, DATA-INTO, like XML-INTO before it, also allows a %HANDLER variant for those instances where the amount of data to be processed exceeds RPG’s capacity limitations, or where you simply want to process each portion of the data as it becomes available. If you’re unfamiliar with the use of %HANDLER with XML-INTO, you can find out all about it here. We’ll look at an example of using it with DATA-INTO in a later article. View the PTFs and Op-code here.
How Does a Parser Work?
Think for a moment of how XML-INTO operates. Under the covers, an IBM supplied XML parser extracts the names and values from the document, and places the values into the RPG variables with a matching name and hierarchy. The only real difference with DATA-INTO is that instead of IBM’s runtime logic being responsible for parsing out the names and values, YOU must supply the logic to perform that task. Having identified the relevant values, your parser then notifies the RPG runtime, which in turn stores the values.
When you invoke DATA-INTO, RPG runtime logic first extracts the data from the variable or the file that you specified and places it in a buffer. It then passes the address of that buffer, along with its length and other identifying information, to your parser.
Once in control, the parser processes the buffer, notifying RPG of what it has found. To do this, it calls a number of procedures that allow it to inform RPG about the structure of the data, the names, and their values. We’ll talk a little more about these procedures when we look at a simple example later. For the moment though, let’s simply step through the major ones that just about every parser will use, in the sequence in which they will typically be used.
The parser must start by calling QrnDiStart() to notify RPG that processing has commenced.
The next step would normally be to call QrnDiReportName() to notify RPG of the name of the item that is being processed. Just like with XML-INTO, names matter and this name should match the name of the target structure.
If the name reported equates to a Data Structure (DS) then the next call is to QrnDiStartStruct(). On the other hand, if the name represents an array, then QrnDiStartArray() would be called instead.
Assuming for a moment that we are starting a DS, the next call would normally be to QrnDiReportName() to identify a field within the DS. This would immediately be followed by a call to QrnDiReportValue() to notify RPG of the value to be placed in that field. This pair of calls would be repeated for all fields in the DS.
Once all of the fields have been processed, the parser would then call QrnDiEndStruct() to notify RPG that the structure (or, in the case of a DS array, an array element) has been completed. If an array was started, QrnDiEndArray() is called in the same way once its last element has been reported.
Last but not least, the parser notifies RPG that it has completed its work by calling QrnDiFinish(). This tells RPG to return control to the original program once the parser exits.
At this point your main program regains control at the instruction following the DATA-INTO operation and your data should all be tucked away nice and neatly in its associated variables just waiting for you to process it.
Should your parser identify a field name that does not exist, then an error will be signalled indicating that the data does not match the target. As with XML-INTO this can be avoided by specifying the option allowextra=yes on the DATA-INTO. Similarly, allowmissing=yes can also be used to avoid errors such as missing field values. More on the various options available in future articles.
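To make the call sequence above easier to picture, here is a toy model of the conversation between a parser and the runtime. It is written in Python purely so that it can be run anywhere; it is not RPG and not IBM's implementation. The class MockRuntime, its method names, and the allow_extra flag are all hypothetical stand-ins for the QrnDi procedures and the allowextra option, and the lower-casing of names is only a rough imitation of case=any.

```python
class MockRuntime:
    """Toy stand-in for the RPG runtime side of DATA-INTO (an assumption)."""

    def __init__(self, target_fields, allow_extra=False):
        self.target_fields = {f.lower() for f in target_fields}
        self.allow_extra = allow_extra   # loosely mimics allowextra=yes
        self.elements = []               # completed structures
        self._current = None             # structure being filled
        self._pending = None             # most recently reported name
        self.finished = False

    def start(self):                     # models QrnDiStart
        self.elements.clear()
        self.finished = False

    def start_array(self):               # models QrnDiStartArray
        pass                             # unnamed: target comes from the caller

    def start_struct(self):              # models QrnDiStartStruct
        self._current = {}

    def report_name(self, name):         # models QrnDiReportName
        key = name.lower()
        if key not in self.target_fields and not self.allow_extra:
            raise ValueError(f"no subfield named {name!r} in target")
        self._pending = key

    def report_value(self, value):       # models QrnDiReportValue
        if self._pending in self.target_fields:
            self._current[self._pending] = value
        self._pending = None

    def end_struct(self):                # models QrnDiEndStruct
        self.elements.append(self._current)
        self._current = None

    def end_array(self):                 # models QrnDiEndArray
        pass

    def finish(self):                    # models QrnDiFinish
        self.finished = True


def demo_parse(rt):
    """Report one structure with two fields, in the order described above."""
    rt.start()
    rt.start_array()
    rt.start_struct()
    rt.report_name("account")
    rt.report_value("1234")
    rt.report_name("name")
    rt.report_value("Jones")
    rt.end_struct()
    rt.end_array()
    rt.finish()
```

Running demo_parse against MockRuntime(["account", "name"]) leaves one element in rt.elements. Against a target that has no name subfield, the same calls raise an error unless allow_extra is set, which is the behaviour the allowextra option is described as controlling.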
Example Parser
Rather than use one of IBM’s examples, we decided to write a program to parse a simple CSV file. The calling program looks like this:
dcl-ds accounts Qualified Dim(10) Inz;
   account char(4);
   name    char(20);
end-ds;

// Program status data structure: RPG stores the count of loaded elements here
dcl-ds pgmStat psds;
   numElements int(20) pos(372);
end-ds;

dcl-s b char(1);
dcl-s i int(10);

// Load the DS array from the CSV file using our custom parser
data-into accounts
   %data('Sample1.csv': 'doc=file case=any ccsid=job')
   %parser('*LIBL/PARSECSV1');

// Display each element that was loaded
for i = 1 to numElements;
   dsply ( 'Account: ' + accounts(i).account +
           ' Name: ' + accounts(i).name);
endfor;
This loads the data from the file Sample1.csv into the DS array accounts. Each element of the array has two subfields: account and name. The CSV file Sample1.csv contains data that looks like this:
1234,Jones
2345,Smith
3456,Gantner
4567,Paris
Each record consists of a four-character account code followed by a name of arbitrary length. The important thing to note here is that, unlike an XML or JSON document, there is no name associated with the individual pieces of data. Their position in the record determines which field they represent.
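Positional mapping of this kind can be sketched in a few lines. The snippet below uses Python only as a runnable illustration; FIELD_ORDER and split_record are hypothetical names, and splitting at the first comma is an assumption about how such a file would be read (it means any later commas stay in the name).

```python
# Hypothetical: the order in which the columns appear in each record
FIELD_ORDER = ("account", "name")


def split_record(record):
    """Split a record at its first comma and label the pieces by position."""
    separator = record.find(",")
    if separator < 0:
        raise ValueError("no comma separator in record")
    values = (record[:separator], record[separator + 1:])
    return dict(zip(FIELD_ORDER, values))
```

Because nothing in the data names the fields, the labels have to come from somewhere else: here from FIELD_ORDER, and in a real parser from names held in the parser itself.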
Notice that once DATA-INTO is completed we can use the value that RPG has placed in the numElements variable in the PSDS to control the display loop. Just as with XML-INTO, when the target for a DATA-INTO operation is an array, RPG maintains a count of the number of active elements in this variable.
Time to study the parser PARSECSV1.
Here are the basic data definitions that will be used by our parser:
(A) // Names of DS subfields
dcl-s subfieldName_Account varchar(15) inz('account');
dcl-s subfieldName_Name varchar(15) inz('name');
(B) // Values of subfields
dcl-s account varchar(10);
dcl-s name varchar(30);
(C) // Variables for record location and content
dcl-s record varchar(256);
dcl-s separator int(5);
dcl-s pcurrentPosn pointer;
At (A) we define the names of the DS subfields that we will be populating. Remember, as we noted earlier, there are no field names in the data, so the parser must supply them. In a future example, we’ll show you how we could use column names to get round this problem but for now we’ll keep things simple. Notice that the fields are defined as varying length (varchar). We did this because, later on, we’ll need both the name and its length which can easily be obtained by using %Len. By using this approach, we avoid having to remember to trim trailing spaces from the names all the time. Failure to do so would result in a field name mismatch within the RPG runtime.
The storage for the values of the account and name fields is defined at (B). These are also varying length fields. In addition to making it easy to obtain the lengths of the values, this has the added advantage that, should one of the target fields in the calling program’s DS array be varying in length, it will be handled correctly and will not end up with a lot of spurious trailing blanks.
The variables at (C) are used for the current record, the position of the comma separator, and the position within the buffer at which the first/next record starts.
Now that we’ve seen the basic variable definitions it is time to look at the program logic. As you’ll see it follows very closely along the pattern outlined in the previous section.
The very first step (D) is to copy the environment pointer so that our code has access to all of the QrnDi… procedures. Don’t worry about this—you simply need it at the start of every parser. Similarly, at (E) we copy the pointer to the data buffer that RPG has supplied into our current record position pointer.
Now the real work begins with a call to the QrnDiStart procedure (F). The parameter passed here is a “handle” that RPG has given us to uniquely identify this specific DATA-INTO operation. As you will see we have to include this value on each and every subsequent call to any of the QrnDixxx procedures.
At (G) we call the QrnDiStartArray procedure to notify RPG that we are starting an array. Since there’s no array name in the file, we are simply starting an unnamed array. RPG will derive the name to be used as the target from the DATA-INTO operation.
At (H) we fetch the first record from the buffer. Within the loop, at (I), we search for the position of the comma separator and split the record into its component fields. Once we have the data separated, it is time to tell RPG what we have “discovered.”
// Enable access to the callback function
(D) pQrnDiEnv = parm.env;
// Set current record position to start of buffer
(E) pCurrentPosn = parm.data;
// Start the parse
(F) QrnDiStart (parm.handle);
// Notify RPG that the data represents an array
(G) QrnDiStartArray(parm.handle);
// Get first set of data to parse
(H) record = getRecord(pcurrentPosn);
DoW record <> '';
// find position of comma in record and extract values
(I) separator = %scan(',': record);
account = %subst(record: 1: separator - 1);
name = %subst(record: separator + 1);
For each array element, we need to call QrnDiStartStruct (J) to notify RPG of the beginning of an array element.
Next, we call (K) QrnDiReportName and QrnDiReportValue to notify RPG of the names and values for the data that we have extracted. Note that because we are using variable length fields for the names and values, the address we need to pass back to RPG is to the beginning of the data portion of the field—hence the use of the qualifier *data on the %Addr function.
Once the values have been set, we call QrnDiEndStruct (L) to notify RPG that this element of the DS array has been completed. We then fetch the next record and repeat the process.
(J) // Notify RPG of the beginning of an iteration of the structure
QrnDiStartStruct (parm.handle);
// Report the name of the 1st subfield 'account'
(K) QrnDiReportName (parm.handle
: %addr(subfieldName_Account: *data)
: %len(subfieldName_Account) );
// and give RPG the associated value
QrnDiReportValue(parm.handle
: %addr(account: *data)
: %len(account) );
// Repeat the process with the subfield 'name'
QrnDiReportName (parm.handle
: %addr(subfieldName_Name: *data)
: %len(subfieldName_Name) );
QrnDiReportValue(parm.handle
: %addr(name: *data)
: %len(name) );
// End this iteration of structure
(L) QrnDiEndStruct (parm.handle);
// Get another set of data to parse
record = getRecord(pcurrentPosn);
EndDo;
Finally, when all records have been processed and the Do loop exits, we call QrnDiEndArray (M) to finish the array.
The final step is to call QrnDiFinish (N) to notify RPG that parsing has been completed. We then simply return from the parser; RPG “tidies up” and returns control to the statement following the DATA-INTO operation.
// Notify RPG of end of array
(M) QrnDiEndArray(parm.handle);
// Notify RPG that we are ending the parse (no more data)
(N) QrnDiFinish(parm.handle);
That’s all there is to it. The data is now all safely stowed away in the accounts DS array and when we return to the original program we can go ahead and process the elements in whatever way we choose. In the case of our example program we simply loop through the elements and display the values.
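The listings above call getRecord without showing it. As a rough idea of what such a helper has to do, here is a sketch in Python (not RPG) that pulls one record at a time out of a text buffer and tracks where the next one starts. Everything in it is an assumption made for illustration: the name get_record, the use of an integer offset in place of the pointer, the line-feed record delimiter, and the decision to skip blank lines. The real helper in the downloadable source may well differ.

```python
def get_record(buffer, pos):
    """Return (record, next_pos); record is '' once the buffer is exhausted.

    Assumes records end with a line feed, optionally preceded by a
    carriage return, and skips blank lines so that they do not end
    the caller's loop early.
    """
    while pos < len(buffer):
        end = buffer.find("\n", pos)
        if end < 0:
            end = len(buffer)            # last record has no line feed
        record = buffer[pos:end].rstrip("\r")
        pos = min(end + 1, len(buffer))  # step past the delimiter
        if record:
            return record, pos
    return "", pos


def all_records(buffer):
    """Mirror the parser's loop: keep reading until an empty record."""
    records = []
    record, pos = get_record(buffer, 0)
    while record != "":
        records.append(record)
        record, pos = get_record(buffer, pos)
    return records
```

The empty-string return plays the same role as the condition on the DoW loop in the parser: it is the signal that there is nothing left to process.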
Wrapping Up (For Now)
As you can imagine, we’ve only just touched the surface of what can be done with DATA-INTO. We already have a number of projects in mind; all we need now is the time to make them a reality! We’ve already mentioned that one of them will be to produce a more generic version of this parser by having the first row of the CSV file contain the column/subfield names. That way, no hard-coded names are needed in the parser code, which will obviously make the parser far more flexible than the one we have shown here.
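To give a feel for the header-row idea, here is a small Python sketch (again, not RPG and not the planned parser) that takes its field names from the first row and produces the sequence of notifications a parser would send. The function name parse_with_header and the tuple-based event list are hypothetical, as is the choice to lower-case the names and to let the last column absorb any extra commas.

```python
def parse_with_header(buffer):
    """Turn CSV text with a header row into a list of parser events."""
    lines = [ln.strip() for ln in buffer.splitlines() if ln.strip()]
    if not lines:
        return []
    # First row supplies the subfield names instead of hard-coded values
    names = [n.strip().lower() for n in lines[0].split(",")]
    events = [("start_array",)]
    for line in lines[1:]:
        # Split into at most len(names) pieces; extras stay in the last one
        values = line.split(",", len(names) - 1)
        events.append(("start_struct",))
        for name, value in zip(names, values):
            events.append(("report_name", name))
            events.append(("report_value", value))
        events.append(("end_struct",))
    events.append(("end_array",))
    return events
```

Each tuple corresponds to one of the calls described earlier, so the same file layout could be retargeted to a different DS simply by changing the header row.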
Another way we could add flexibility is to take advantage of the optional extra parameter that can be passed into the parser. For example, field names could be provided that way, rather than use information from the data.
Other details we haven’t covered in depth here include the option to supply RPG with parser-specific error messages, the ability to handle data in any CCSID (which our European and Asian friends will love), and much more. As you can probably appreciate, we haven’t had very long to play with this new toy yet, and we expect to discover a lot more over the next few months. We look forward to sharing that with you.
Note: If you would like to study the source code for this example in more detail you can download it from our website.