ORFIUM's BI Toolkit and Skillset

We’re back

As described in a previous blog post from ORFIUM’s Business Intelligence team, the set of tools and software used varied as time progressed. A shorter list of software was used when the team was two people strong, a much longer and more sophisticated one is being currently used.

2018-2020

As described previously, the two people in the BI team were handling multiple types of requests from various customers from within the company and from ORFIUM’s customers. For the most part, they were dealing with Data Visualization, as well as a small part of data engineering and some data analysis.

Since the team was just the two of them, tasks were more or less divided in engineering and analysis vs visualization. It is simple to guess that in order to combine data from Amazon Athena and Google Spreadsheets or ad-hoc csv’s, a lot of scripting was used in Python. Data were retrieved from these various sources and after some (complex more often than not) transformations and calculations the final deliverables were some csv’s to either send to customers, or load on a Google Sheet. In the latter case a simple pivot table was also bundled in the deliverable in order to jump start any further analysis from the, usually internal, customer.

In other cases where the customer was requesting graphs or a whole dashboard, the BI team was just using Amazon Athena’s SQL editor to run the exploratory analysis, and when the proper dataset for analysis was eventually discovered, we were saving the results to a separate SCHEMA_DATASET in Athena itself. The goal behind that approach was that we could make use of Amazon’s internal integration of their tools, so we provided our solutions into Amazon Quicksight. This seemed at that moment the decision that would provide deliverables in the fastest way, but not the most beautiful or the most scalable.

Quicksight offers a very good integration with Athena due to both being under the Amazon umbrella. To be completely honest, at that point in time the BI Analyst’s working experience was not optimal. From the consumer side, the visuals were efficient but not too beautiful to look at, and from ORFIUM’s perspective a number of full AWS accounts was needed to share our dashboards externally, which created additional cost.

This process slightly changed when we decided to evaluate Tableau as our go-to-solution for the Data Visualization process. One of the two BI members at that time leaned pretty favorably towards Tableau, so they decided to pitch it. Through an adoption proposal for Tableau, which was eventually approved by ORFIUM’s finance department, Tableau came into our quiver. Tableau soon became our main tool of choice for Data Visualization. It allows better and more educated decisions to be made from management, and is able to showcase the value that our company can offer to our current and potential future clients.

This part of BI’s evolution led to the deprecation of both Quicksight and python usage, as pure SQL queries and DML were developed in order to create tables within Athena, and some custom SQL queries were embedded on the Tableau connection with the Data Warehouse. We focused on uploading ad-hoc csv’s or data from GSheets on Athena, and from there the almighty SQL took over.

2021- 2022

The team eventually grew larger and more structured, and the company’s data vision shifted towards Data Mesh. Inevitably, we needed a new and extended set of software.

A huge initiative to migrate our whole data warehouse from Amazon Athena to Snowflake started, with BI’s main data sources playing the role of the early adopters. The YouTube reports were the first to be migrated, and shortly after the Billing reports were created in Snowflake. That was it, the road was open and well paved for the Business Intelligence team to start using the vast resources of Snowflake and start building the BI-layer.

A small project of code migration so that we use the proper source and create the same tables that Tableau was expecting from us, turned into a large project of restructuring fully the way we worked. In the past, the python code used for data manipulation and the SQL queries for the creation of the datasets to visualize were stored respectively in local Jupyter notebooks and either within View definitions in Athena or Tableau Data source connections. There was no version control; there was a Github repo but it was mainly used as a code storage for ad-hoc requests, with limited focus on keeping it up to date, or explaining the reasoning of updates. There were no feature branches, and almost all new commits on the main one were adding new ad-hoc files in the root folder and using the default commit message. This situation, despite being a clear pain point of the team’s efficiency, emerged as a huge opportunity to scrap everything, and start working properly.

We set up a working guide for our Analysts: training on usage of git and Github, working with branches, PullRequest templates, commit message guidelines, SQL formatting standards, all deriving from the concept of having an internal Staff Engineer. We started calling the role Staff BI Analyst, and we indeed currently have one person setting the team’s technical direction. We’ll discuss this role further in a future blog post.

At the same time we were exploring options on how to combine tools so that the BI Analysts are able to focus on writing proper and efficient SQL queries, without having to either be fully dependent on Data Engineers for building the infrastructure for data flows, or requiring python knowledge in order to create complex DAGs. dbt and Airflow surfaced from our research and, frankly, the overall hype, so we decided to go with the combination of the two.

Initially the idea was to just use Airflow, where an elegant loop would be written so that the dags folder would be scanned, and using folder structures and naming conventions on sql files, only a SnowflakeOperator would be needed to transform a subfolder in the dags folder to a DAG on the AirflowUI, with each file from the folder would be a SnowflakeOperator task, and the dependencies would be handled by the naming convention of the files. So, practically a folder structure as the one shown to the right would automatically create a Dynamic Dag as shown on the left.

No extra python knowledge needed, no Data Engineers needed, just store the proper files with proper names. A brief experimentation with DAGfactory was also implemented but we soon realized that the airflow should just be used as the orchestrator of the Analytics tasks, and the whole analytics logic should be handled by something else. All this was very soon abandoned when dbt was fully onboarded to our stack.

Anyone who works in the Data and Analytics field must have heard of dbt. And if they haven’t already, they should. This is why there is nothing too innovative to describe about our dbt usage. We started using dbt from early on in its development, having first installed v0.19.1, and with an initial setup period with our Data Engineers, we combined Airflow with dbt-cloud for our production data flows, and core dbt CLI for our local development. Soon after that, and in some of our repos, we started using github actions in order to schedule and automate runs of our Data products.

All of the BI Analysts in our team are now expected to attend all courses from the Learn Analytics Engineering with dbt program offered at dbt Learn regarding its usage. The dbt Analytics Engineering Certification Exam remains optional. However, we are all fluent with using the documentation and the slack community. Generic tests dynamically created through yml, alerts in our instant messaging app in case of any DAG fails, and snapshots are just some of the features we have developed to help the team. As mentioned also above, our Staff BI Analyst plays a leading role in creating this culture of excellence.

There it was. We endorsed the Analytics engineering mindset, we reversed the ETL and implemented ELT, thus finally decoupling the absolute dependency on Data Engineers. It was time to enjoy the fruits of Data Mesh: Data Discovery and Self Service Analytics.

2023-beyond

Having more or less implemented almost all of ORFIUM’s Data Products on Snowflake with proper documentation we just needed to proceed to the long awaited data democratization. Two key pillars of democratizing data is to make them discoverable and available for analysis by non-BI Analysts too.

Data Discovery

As DataMesh principles dictate, each data product should be discoverable by its potential consumers, so we also needed to find a technical way to make that possible.

We first needed to ensure that data were discoverable. For this, we started testing out some tools for data discovery. Among the ones tested was Select Star Select Star,, which turned out to be our final choice. During the period of trying to find the proper tool for our situation, Select Star was still early in its evolution and development so, after realizing our sincere interest, they invested in building a strong relationship with us, consulting us closely when building their roadmaps, while having a very frequent communication looking to get our feedback as soon as possible. The CEO herself Shinji Kim was attending our weekly call helping us make not just our data discoverable to our users, but the tool itself easily used by our users in order to increase adoption.

Select Star offered most of the features we knew we wanted at that time, and it offered a quite attractive pricing plan which went in line with our ROI expectations.

Now, more than a year after our first implementation, we have almost 100 users active on Select Star, which is a pretty part of the internal Data consumer base within ORFIUM, given that we have a quite large operations department of people who do not need to access data or metadata.

We are looking to make it the primary gateway of our users to our data. All analysis, even thoughts, should start by using Select Star to explore if data exist.

Now, data discovery is one thing, and documentation coverage is another thing. There’s no point in making it easy for everyone to search for table and column names. We need to add metadata on our tables and columns so that the search results of Select Star parse that content too, and provide all available info to seekers. Working in this direction we have established within the Definition of Done of a new table in the production environment a clause that there should be documentation on the table and the columns too. Documentation on the table should include not only technical stuff like primary and foreign keys, level of granularity, expected update frequency etc, but also business information like what is the source for the dataset, as this varies between internal and external producers. Column documentation is expected to include expected values, data types and formats, but also business logic and insight.

The Business Intelligence team uses pre-commit hooks in order to ensure that all produced tables contain descriptions for all the columns and the tables themselves, but we cannot always be sure of what is going on in other Data products. As Data culture ambassadors (more on that on a separate post too), BI has set up a data coverage monitoring dashboard, in order to quantify the Docs coverage of tables produced by other Products, raising alerts when the coverage percentage falls below the pre-agreed threshold.

Tags and Business and Technical owners are also implemented through Select Star, making it seamless for data seekers to ask questions and start discussions on the tables with the most relevant people available to help.

Self-Service Analytics

The whole Self-Service Analytics initiative in ORFIUM, as well as Data Governance, will be depicted in their very own blog posts. For now, let’s focus on the tools used.

Having all ORFIUM Data Products accessible on Snowflake and discoverable through Select Star, we were in position to launch the Self-Service Analytics project. A decentralization of data requests from BI was necessary in order to be able to scale, but we could not just tell our non-analysts “the data is there, knock yourself out”.

We had to decide if we wanted Self-Service Analysts to work on Tableau or if we could find a better solution for them. It is interesting to tell the story of how we evaluated the candidate BI tools, as there were quite a few on our list. We do not claim this is the only correct way to do this, but it’s our take, and we must admit that we’re proud of it.

We decided to create a BI Tool Evaluation tool. We had to outline the main pillars on which we would evaluate the candidate tools. We then anonymously voted on the importance of those pillars, averaging the weights and normalizing them. We finally reached a total of 9 pillars and 9 respective weights (summing up to 100%). The pillars list contain connectivity effectiveness, sharing effectiveness, graphing, exporting, among other factors.

These pillars were then analyzed in order to come up with small testing cases, using which we would assess the performance in each pillar, not forgetting to assign weights on these cases too, so that they sum up to 100% within each pillar. Long story short we came up with 80 points to assess each one of the BI tools.

We needed to be as impartial as possible on this, so we assigned two people from the BI team to evaluate all 5 tools involved. Each BI tool was also evaluated by 5 other people from within ORFIUM but outside BI, all of them potential Self-Service Analysts.

Coming up with 3 evaluations for each tool, averaging the scores, and then weighting them with the agreed weights, led us to an amazing Radar Graph.

Though there is a clear winner in almost all pillars, it performed very poorly in the last pillar, which contained Cost per user and Ease of Use/Learning Curve.

We decided to go for the blue line which was Metabase. We found out that it would serve >80% of current needs of Self-Service Analysts, with very low cost, and almost no code at all. In fact we decided (Data Governance had a say on this too) that we would not allow users to be able to write SQL queries on Metabase to create Graphs. We wanted people to go on the Snowflake UI to write SQL, as those people were few and SQL-experienced, as they usually were backend engineers.

We wanted Self Service Analysts to use the query editor, which simulates an adequate amount of SQL features, in order to avoid coding at all. If they got accustomed to using the query builder, then for the 80% of their needs they would have achieved this with no SQL, so the rest of the Self-Service Analysts (the even-less tech savvy) would be inspired to try it out too.

After ~10 months of usage (on the Self-Hosted Open Source version costing zero dollars per user per month, which translates to *calculator clicking * zero dollars total) we have almost 100 Monthly Active Users and over 80 Weekly Active users, and a vibrant community of Self-Service Analysts looking to get more value from the data. The greatest piece of news is that the Self-Service Analysts become more and more sophisticated in their questions. This is solid proof that, within the course of 10 months, they have greatly improved their own Data Analysis skills, and subsequently the effectiveness of their day-to-day tasks.

Within those (on average) 80WAUs, the majority is Product Owners, Business Analysts, Operations Analysts, etc., but there are also around five high level executives, including a member of the BoD.

Conclusion

The BI team and ORFIUM itself have evolved in the past few years. We started from Amazon Athena and Quicksight, and after a part of the journey with python by our side, we have established Snowflake, Airflow, dbt and Tableau as the BI stack, while adding in ORFIUM’s stack Select Star for Data Discovery and Metabase for Self-Sevice Analytics.

More info on these in upcoming posts, but we have more insights to share for the Self-Service Initiative, the Staff BI role, and the Data Culture at ORFIUM.

We are only eager to find out what the future holds for us, but at the moment we feel future-proof.