Tips on building a data analytics warehouse
By a data analytics warehouse we mean a solution that combines and cleans data from various internal and external sources and offers reporting and querying capabilities to business users. Sometimes also referred to as a big data platform.
Based on failures and hard-won lessons, here's 21 tips for teams and individuals who are responsible for making a company more data-driven.
Solve the right problem #
- Understand that building an analytical warehouse is not about the technical tooling.
- Get an in-depth understanding of the root problem to minimize the risk of delivering a solution that doesn't solve a tangible business problem.
Choose the right tool for the job #
- Consider the most straightforward solutions to resolve the root problem. Perhaps, an excel would suffice.
- Do not choose a solution according to personal preference.
- Be aware of maintenance costs. Consider off-the-shelf solutions before building your own. If a cloud platform is an option, consider no-ops and SaaS offerings. There is a plethora of solutions out there for ETL, storage and BI tools. Be pragmatic about Hadoop.
- Start with delivering a narrow vertical slice of functionality to discover possible problems early and to validate the chosen approach.
- Validate new deliverables as frequently as possible through an intimate collaboration with business.
Provide correct and consistent data #
- Check for data correctness: integrity violations (foreign keys, unique values), invalid value types, errorneus ranges (values out of bound, text too long or too short), missing data, etc.
- Check for suspicious data: values out of expected bounds, the number of new records too large/low, etc.
- Treat incorrect and suspicious data as exceptions and warnings, respectively. Report on them automatically and periodically.
- Use consistent prefixes and suffices per field type to limit guesswork. For example, use
is_
for booleans and_at
or_time
for time. - Ensure consistency in values. For example, use
null
for missing data and ISO codes for countries among all tables. - Convert values to native types as early as possible. For example, convert date strings to dates, and binary integers to booleans.
Bring data closer to business #
- Add as much semantics to field names as possible. Where applicable, include units, timeframes, and aggregation functions. For example, use
avg_kwh_consumed_per_year
instead ofenergy
. - Establish and use common terminology with business. Rely on ubiquitous language for table names, field names, and values. Reconsider using abbreviations.
- If possible, encode domain knowledge and undocumented business rules into data to increase value for data users.
- If relevant, teach SQL to data users to enable advanced self-service querying.
- Understand that data is an abstract model of reality. Identify and communicate data limitations.
Develop efficiently #
- Learn window functions and
with
queries to simplify complex SQL queries. - Create views for frequently reused SQL queries to ensure users don’t repeat themselves.
- Follow software engineering best practices, incl. version control, clean code, testing, peer reviews and boring and predictable deployments.
- ← Previous post: Taking things personally
- → Next post: Drawings for technical explanations: my first steps
This blog is written by Marcel Krcah, an independent consultant for product-oriented software engineering. If you like what you read, sign up for my newsletter