Skip to main content

John Westcott on Unlocking the Value of Data

Db2 for i consultant John Westcott on the business benefits of dimensional modeling

This transcript is lightly edited for clarity. Charlie Guarino: Hi everybody. This is Charlie Guarino. Welcome to another edition of TechTalk SMB. Today I’m sitting with Mr. John Westcott. John is a Db2 for i consultant, and the topic that we’re going to talk today about is dimensional modeling. John, thanks for coming. It’s so great to speak with you again. John Westcott: Hey Charlie, it’s great to be here, thanks. Charlie: John you know when we last spoke, the conversation that we got into was about dimensional modeling, and I thought it was so fascinating and a topic that people need to hear more about because there are so many different ways that we can derive value from our existing data that people may not be aware of or maybe not properly putting it in the right ways to get to discern the best data value from it, I should say. So before we even go down that road, for those who may not even know, what is dimensional modeling? John: So dimensional modeling is a way to reshape your data into something that’s more easily consumed by businesspeople. They’re the people who make decisions, so they’re not going to understand our relational database optimized for business transactions and all the rules that go along with that. But when it comes down to making a business decision, somebody needs the data accessible to them: available, accurate. It needs to be suitable to their means. It needs to be a single source of truth, because these are the folks that make business decisions, right? They’re not going to be worried about which tables do I need to go access. They’re going to want a model just to say, run something on this. Charlie: Well, these are end users. These are not people who are necessarily technical. These are not people in the IT department. John: This should not be the folks in the IT department, right? I mean they’re not the business users. When it comes to IT decisions, they’re obviously the actual business decision makers. But what product should I buy? Should I introduce this into the business? Should I open up a store?  Should I do this? Should I do that? These are all business decisions, and they should be made by businesspeople. Charlie: It sounds like the goal is to transform the data perhaps into a more usable format for data interrogation, for an end user. John: Absolutely. We need to shape that data differently, where a third normal form, our traditional type of database with files and children tables and joins and all those type of things with the rules and constraints built into it. It’s built for performance of a large number of transactions. All those things don’t necessarily always translate into something that’s easy for businesses to make decisions from, so I want to unravel all those things, put them back together into something that looks differently that the users can then access on their terms, not on our terms from a technology standpoint. Charlie: These are via data warehouse. John: Absolutely. Data warehouse, data marts, operational data source, all that good stuff. Charlie: Right. So we are talking about literally replicating our data, but it’s not just pure—replication to me is an exact copy, I think. This is taking another form of the data. Is that what the modeling refers to? John: Absolutely right. It’s not just a replicated copy, and chances are we have a replicated copy because of high availability. I have the data in System A, and now I have to replicate that exact thing in System B. It’s a little bit different because it’s like for like. The layout of the table, the schema hasn’t changed. The connection between those tables have not changed, and what we’re talking about doing is taking this information and massaging it into maybe pre-aggregated forms. Give me my monthly totals. Very big deal and well, what if I have monthly totals already aggregated so users didn’t have to sum up thousands or millions of records to get to what was the last 12 months by month snapshots, right? What was the business at the end of every day, okay? All of this information is available on a transactional database, but it’s not shaped appropriately for a business user. Charlie: One of the problems I’ve seen in my years of doing this—and I mean years of being in IT, not necessarily as a person doing dimensional modeling I should say—but is that sometimes the onus is put on the end users to have a more intimate knowledge of a database, which is totally out of scope for them. They don’t necessarily need to know what customer master actually means and the different columns in that table or historical tables, things like that, and it’s putting too much responsibility. I think what happens sometimes is that it becomes too complex for the end users, and sometimes these projects just don’t get moved forward because it becomes this illusion of complexity, perhaps. So will dimensional modeling help that? John: Absolutely, right. I mean because what we’re going to do is we’re going to take the data—and I hate to use the word replicating, but Charlie used it a couple of times so I’ll stick with it—because we’re not really replicating the data. We’re taking the data and we’re making something new. So if we look at this as I’m baking a cake: I have milk, I have eggs, I have flour, chocolate, whatever, and that’s our transactional database. But if I’m making a cake, I’ve got to mix those things together, and now I’m going to make a cake—or maybe I’m going to take those similar kind of ingredients and mix them differently, and I’m going to get a batch of brownies or muffins. That’s what we’re doing here, right? Nobody wants to eat raw flour but everybody would love to have a piece of chocolate cake, so we’re reshaping it so it’s consumable by business. Charlie: So we said—I think Sally in accounting was a term we used in our other conversation. So Sally in accounting has a real requirement to do financial analysis, for example. We don’t want to put the responsibility on her to understand the actual GL table—it should be in a much more usable form. So that’s a perfect use case right there in that she shouldn’t have to worry about normalized data. John: Correct. She shouldn’t even care what normalized data is, and a lot of times what do we do to help Sally in accounting? We give her a spreadsheet and then say go to town, do whatever you need with it. And she performs the analysis and she massages that data in that spreadsheet in charts or graphs or summaries or whatever she pivots it around. All those types of things are the things that should be available in the warehouse. So when you’re looking to say okay, what would a good use case for a dimensional modeling, a warehouse, or a data mart of some sort? Look for your spreadsheets because that’s where all that decision making is now offline. It’s in a spreadsheet, and Sally in accounting has her spreadsheet and Bob in accounting has his spreadsheet and guess what? They don’t get the same answers because maybe Bob ran his this morning and Sally ran hers 5 minutes ago. The data has changed, because business has moved on and now we no longer have a single source of truth. That’s a simple example. How many meetings have you been in where somebody said my numbers are right, yours are wrong? And so we have all these business users trying to really make a good business decision but instead try just to decide who has the right data before they even can make a decision. Charlie: And the truth is they were both right at some moment in time. John: At some moment in time, they were probably both right. Charlie: Interesting—which actually if I bring that term again, normalized data. Normalized data: as a developer, that’s something that we strive to do. We talk about third normal form. That’s ultimately—I mean I guess there’s fourth and fifth, but we always say third is a good place to land. But that’s to make our systems run most efficiently. It’s not the best design for what you’re describing here. John: No. No, right. Relational database model: That’s what we’re used to as developers, right? It’s online transaction processing, optimized for business transactions, right at volume, at scale. They’re normalized to reduce data redundancy, and usually there’s some type of transactional lock oriented, so there’s some type of commitment control that says hey, we all reached a state, do something. These are all great things and this is what we want from the churn of the business, the everyday activities, but that’s not what the users want. The users want information where they can quickly assess something and move on. They need to make decisions and maybe it’s a quick decision. Do I give somebody this refund because they’re calling me up and complaining and I need to pull that information off and make a decision to a larger decision that says should I start selling this product? Charlie: You know we use the term replicate, and that’s an interesting term because obviously in a HA environment, I’m going to replicate my data. So if the system crashes, I can immediately do a fail over and now I’m up and running without any hiccup in the system, in the operation. That’s not our goal but yet when I hear people talk about doing data mining and things like that, oftentimes I hear oh, hit the replication box. Hit the HA box because I don’t want to tax the system. But is that always the best approach? John: You know whether it’s best approach or not, it’s often the best approach. By moving those queries right off to an isolated environment on your replicated box, your backup box, usually there’s no other activity going on on that box. It sits idle, so why not take advantage of that environment? It’s got its own processor, its own cores, its own memory, its own disk. Why not? Okay but the question is is that good for business decisions, because it’s just your transactional data, right? We haven’t shaped it. Let’s go back to the whole making of a cake. It’s independent—eggs and flour and milk and things like that—and are we asking users to make the cake, to make the decision. Or do we just want to say here’s the cake, make your decision. And that’s really the difference where yeah, it’s great to run on a separate workload. Backup environments make a great option, but they’re not shaped properly for user consumption. Charlie: I hear sometimes that people want to take IBM i data—and sure, we can have the discussion of dimensional modeling—but then move it to another platform. My first question that comes to mind is well, why aren’t we just hitting the database on IBM i? So how do you speak to that? John: [Laughs] I take a deep breath, shake my head just the same as you and go okay, why? What are you trying to accomplish? What do you think you’re going to get by moving that workload to some other cloud database? Sometimes there’s a reason because the business wants it there, and that’s a whole separate discussion about how did we get into that boat? Sometimes we need to aggregate data from multiple sources, and a lot of times we think that IBM i/Db2 is closed and that we can’t participate in Microsoft’s world or Excel/Oracle or all these Snowflakes and things. Well sure, we can bring that information into this dimensional modeling that we built and enhance it with data from other sources, other databases, other external web services. Weather is a great example. We use weather an awful lot, and we actually have a client who is using weather to aggregate and enhance their data. They want to know what was the weather at the store, and I think that’s a great example of analysis. Why were the store sales down? Well, we had monsoon rains that day, right? We had snow in April and that was unexpected, and the stores shut down. We didn’t get our foot traffic in, right? Without having that piece of information enhanced into this dimensional modeling, into this warehouse, we wouldn’t have known that. We’d have been shaking our head. Why did the store not do well in April? Charlie: You know John, a well normalized database by design is to create it to eliminate redundancy, and again to make the programs work most efficiently. But what we’re talking about here is—I’m not going to say the exact opposite, but it’s a repopulation and an explosion of what I have today. So if I have a million historical records, for example, the end result after it gets transformed in dimensional modeling it may be far more than a million rows. John: That is correct, because sometimes what we fail to look at when we build the dimensional modeling—if you look on a row-by-row basis, that’s usually not a good way of looking at it, because I have 1000 customers and I have 5000 products and all these transactions. We can start seeing how much transactions occur on a regular month, so we know the volume of data, it’s pretty easy to predict. But in the data warehouse when we reshaped that data, would we end up with more records? Probably, okay, but we’d end up with more records in a more efficient way for user to consume that data, to understand what that data meant because maybe we’re going to not use third normal form. These tables will no longer be normalized. In fact, they will be denormalized, in many instances into something that’s more flat. Maybe I have a customer’s name and product and sales transaction date all on the same row because it provides better performance for whatever the user is making a decision. If it’s important for users to have customer, product, date of sale, quantity sold, revenue at their fingertips, then build the data warehouse that way. Model your data in that respect because it serves the purpose of the users making the proper decisions. If you don’t have that type of decision, then don’t build your model that way, right? There isn’t a one fit all, build your model this way. There’s no magic bullet. There’s no one way of doing it. You have to come from the actual business users who make decisions, and model the data to fit their needs. Charlie: So that translates to additional DASD. John: Absolutely, right. You will increase your DASD, but you know the saying today is “DASD is cheap,” and it is comparatively to where it was decades ago. But I think the thing that you need to worry about, or the thing you really want to consider is if I have one user who can make one business decision that leads to $100,000 in savings or a net gain of some way because they had instant access to data to make the proper decision. Did that DASD pay for itself? It sure did, and if I have ten people doing that, it really did. Charlie: So it really amplifies the ROI very quickly. John: Yeah, it does. Charlie: So the argument of not doing this becomes very weak, I guess, in a very short term. John: It is, and we’ve boxed ourselves in the IBM i community into some corners, and we’ve tried to make things better. I’ll often tell people do you have a history file? Well, yes. Okay, you’ve got a good foot forward into some kind of data warehouse. Maybe it’s not shaped properly, maybe we need to model it into a dimensional way, but at least you’re pointed in the right direction. And if you’re changing that data, if you’re enhancing it, I don’t want to character dates to go into my history tables, I actually want data types. So there’s a little bit of already transformation of data that’s occurring. So maybe now we’re building on operational data store okay. So we’re really tiptoeing into the light end of data warehousing just by doing these small, simple things. Charlie: So let’s go a little bit deeper on that point. You mentioned historical data and that’s a good starting point. Give me a good use case. I mean, if I’m somebody in accounting, I’m trying to make financial decision for the company for example, so walk me through that. I have a financial historical table and I want to empower Sally in accounting to make better decisions. I’m going to use some kind of dimensional modeling. I’m going to speak with Sally first and see, what information are you trying to receive from this? What’s your ultimate goal? Is that the road map for this? John: Absolutely. I always preach you have to go out to your user population to find out what their needs are, because a lot of times you don’t know what it is. You make an assumption or you think you know what they might need, but once you talk to them, you get a completely different type of thing. So the way I look at it is that on one side we have the IT brain. And then on the opposite side of the page we have the business brain. And over the years, over the decades, those brains have come closer together. And I know it’s not video, Charlie, so you can’t see my hands moving together here but those two brains have come closer together, but we need to bring them into that last little bit and merge them together, so they come into sync. It’s all about the data, and if we can get people data, the information that they need, business will receive ROI, whether it be Sally in accounting or Fred on the loading dock or the CFO or the owner of the business, right? It’s data. Data drives business—and I’ll just go off on a soapbox here for a second. A lot of times we get hung up as technologists on what’s the better application. Are we using RPG? Are we using SQL? Are we using Node.js or Java? That’s irrelevant, okay? It’s the data that drives the business decisions, not the application language. Charlie: So that was two points I want to make to what you just said, and the first one is that if we’re going to have this goal to build an efficient and useful data warehouse, we can’t, nor should we, do it in a vacuum. We need to collaborate with the business end users to come up with a better version of the data, I guess. Is that right? The right version of the data? Is that the right word? John: Yeah, right. I’ll use the word model instead of version, but yes, it’s a model that we want to build, and we can’t do it in isolation. We need to understand what the business really needs, and then it’s maybe Sally in accounting and maybe it’s a couple of other people, and we need to get all these folks together and understand really what they’re trying to accomplish. What would move the business forward? And if we can learn what moves the business forward, then we can model it and shape it in some way with data, and then maybe that model is successful. Then maybe we’ll have to create another model over time, right? We’ll have to relearn new problems and develop new models to solve those new problems. So again, it’s not a I do it once and go away kind of thing. It’s a continually evolving type of situation. Charlie: So it’s a living thing, so to speak. John: Absolutely a living thing. Absolutely. The data is a living entity, and it grows up over time, and sometimes little pieces die off because that part of the business—we solved that problem and there’s no more problems to solve. Maybe that little data model just kind of languishes and gets deprecated over time and new models take its place. Charlie: And you know it’s funny how my brain initially went to Sally in accounting. So me thinking of financial analysis—but really the breadth of what could take from this is virtually unlimited. There’s no reason why we couldn’t look at manufacturing floor or the distribution, the supply chain, or any other facet of a business. There’s no reason why we couldn’t do analysis on this and get valuable nuggets of information that may not have been so obvious in a normalized database, so to speak. John: Exactly, and I’ll give an example: So yesterday, as a matter of fact, it was kind of an odd day. I had two people, two different clients call and want to talk about how to get insights into their shipments—lots of missed shipments. They need to understand why are these shipments failing, right? What’s the root cause? They needed to understand how to solve the problem. Well, they had the data, and it was very convoluted. It was spread out as all these IBM i databases tend to be, right? They have legacy back into the System/36 in one of these clients, so a lot of bad database design that had been inherited through the years. But they still, at the end of the day, they needed to figure out and get a handle on where were these shipments? What was causing these shipment delays or missed shipments? We had to model the data, so now we understand ah, we have problems with shipments. Well let’s take the data out, let’s aggregate it, let’s start and figure out what type information, what facts do we need to understand and what perspectives, categories do we need to start needing to look at these facts? And once we understood what those lenses were, then we were able to start talking about okay, here’s probably what you’re going to need to get this type of information. Charlie: One thing I checked on, John—before we had this little conversation, I looked at other use cases, and what came back was absolutely fascinating to me. I mean healthcare—that’s a whole other huge industry—supply chain analysis, social media analytics, fraud detection, inventory management, HR. The list is endless, it seems. John: Well it is, and I’m taking that as a leading question because all of these things combine with it’s a living document. It’s changing, it’s evolving, it’s growing, and we need to get a handle on all this information. And there is this thing that’s out there in the world, and some people think it’s scary. We have this artificial intelligence, this machine learning thing, but we can tie into that a little bit and gain insights into analytics that give us a perspective, right? Maybe we get an answer to a question that we didn’t even know needed to be asked, and it’s all funneled through this data, and it has to be some place where it makes sense for us to make business choices. Charlie: It almost seems to me that one data set that’s running in production—my production data—one data set can yield so many different aspects of data. John: Yes, right. So typically when we dimension data, we will look at what are the specific needs, and we’ll build these things called facts around dimensions, and that’s the traditional star scheme, right? We’ve probably over the years seen the picture of the five-pointed star with a word in the middle, and that’s the fact, and then each point are the different categories that we can slice and dice and view this data, the dimensions around that star. So yes, typically we have these multiple facts that are tied to these dimensions to answer specific type of questions. Yes, we can tie these different facts together to create a larger answer or a larger perspective on the data. Charlie: And ultimately the data warehouse doesn’t necessarily have to only be reliant on data coming from IBM Db2. It can be coming from many different sources. John: Oh absolutely, right. I mean there’s no reason that we could not take—ingest would be the word. We can be part of a pipeline, right? There’s this thing called the data pipeline where data moves from place to place to place to place, wherever it needs to end up, it’s transformed along the way. So we can participate in that pipeline and be just a cog in that chain, or we can be the end place and we can take data from Oracle and data from Snowflake and data from Azure and Microsoft SQL Server and any place else, bring that into our warehouse, and then create this enterprise view of your data. Charlie: You know I think my big takeaway from our discussion John is I think there’s so much additional value that could be obtained from what we already have, and it’s just a matter of knowing how to properly shape into more meaningful—you know cubes is a term I hear, cube. Is that right word? Cubes. John: Cubes is, along with data modeling. So cubes is a big part of it. Cubes usually provides a pre-aggregated dimension perspective on a data right so it provides a specific answer to a specific need. Charlie: But I guess my point is the time is now for us to look at our database, and really communicate with our end users, and empower them, and really see what they need to really derive more value and become less dependent on the IT department to get the value out of the database. John: Absolutely right. I mean I talked to somebody about a year ago who’s like, well I don’t need that. I’m like, why don’t you need that? Well, I give them reports. I’m like, a printed copy is only going to go so far. They’re like oh well, they take that printed copy and then we have the software that turns it into an Excel spreadsheet and then they do whatever they want with it. I’m done, right? And that’s not the perspective that I believe that we need to take. We need to give people charts and graphs, visualization dashboards: information at their fingertips. It’s accessible. It’s there. Everybody has the same information, right? I don’t have 12 spreadsheets running around in data silos. Everybody comes to the same spot, everybody is getting the same information, and then hopefully everybody can make the same decision—faster decisions, better decisions, right? That’s all good for business. Charlie: With one version of the truth. John: One version. A single source of truth. Charlie: Single source of truth. You know, I have a feeling that we just touched the tip of the iceberg here. I suspect that you and I could probably speak for hours on this very topic. It’s so fascinating to me. John: I love dimensional modeling. Data warehousing is something I believe every IBM shop should be invested in. It is critical for business, and we’re seeing a lot of it. The rise of all these cloud databases like Tableau and Snowflake and Azure—all these databases are popping up because there is a need for business to have this type of information, and we can go to these external sources. If we want to go to Azure that’s great, but we don’t need to, right? We can do it on IBM i and participate solely in the IBM i environment or an environment that includes other databases as well. We can bring the information to us; we can send the information out to other folks. Charlie: It’s so interesting. John, I’m going to leave it there with the understanding that I’m going to have to bring you back at some point, because this is such a fascinating topic and I think people can really get more value out of the machines that they already have running on their premises or in the cloud whatever. John: Absolutely right. The data is the value, and there’s so much rich history of data on the platform. We have System/36, /38, AS/400, the iSeries—we’re i today, right? All that data. We have 35 years’ worth of data locked into our systems. We need to unlock that for true business value. Charlie: For anybody listening, I will kind of leave it here—because gosh, as I said we could talk for hours. But I think my takeaway is this is a topic worthy of your time to learn more about, and I think you’re going to be very pleased with how much information you can get from what you already have sitting on your disk. Is that a good summary, John? John: Absolutely. You don’t know the unlocked potential—and so you used the word data mining earlier. Until you start mining for that data, looking for it, you don’t know what you’re going to find. And if you talk to your users, you’re going to find the right questions to start asking. Charlie: Unlocking the value. John: Unlocking the value, yes. Charlie: That’s maybe that’s a good title for our podcast today: Unlocking the value. John: There you go. Charlie: Sounds like a plan. John, I want to thank you so, so much for your time today. This is again an absolutely fascinating topic, and one that I think anybody who listens to this podcast should really take to heart because it’s great for decision support for anybody, any end user. And again, we’re empowering our users, which is a great goal. John: Absolutely. Absolutely. Charlie: Terrific. Well John, thank you very much. It’s truly a pleasure. It’s always a pleasure to chat with you and for everyone else that’s listening, be sure to check out TechChannel website.  It’s chock full of other podcasts and white papers and such, and it’s really worth a look. Anyway, this is Charlie Guarino, and thank you everybody for listening. Talk to you next time. Bye now.