The Future Is Now: An Update on the CSU Data Lake

Presenter

  • Brendan Aldrich, Chief Data Officer, CSU Office of the Chancellor

Gartner on Analytics & BI Strategy Key Findings

  • Organizations use only a fraction of their data
  • Modern analytics technologies do little on their own to ensure deployment and use
  • Vis and interest have been transformed by AI, but QC has been under the radar for most organizations. Its eventual impact may be equally significant

What is Dx (Digital Transformation)? A series of deep and coordinated culture, workforce, and technology shifts that enable new educational and operating models and transform an institution’s operations, strategic directions, and value proposition.

Most folk are using the same tools, with small refinements over the last 30+ years.

Traditional Data Issues

  • Create a stable data history from source systems
  • We can answer questions that haven’t yet been asked
  • All our in-use data
  • It’s easy and fast to add new data
  • Focus on cleaning data in sources
  • Do the interesting easy stuff and curate the useful
  • Every team at every campus can iterate independently while maintaining order

Drop ALL data into the data lake > do transformations
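
The "load everything first, transform afterwards" pattern can be sketched roughly as follows. This is only an illustration, not CSU's pipeline: it uses an in-memory SQLite database as a stand-in for the data lake, and the table names (raw_enrollment, curated_units_by_term) and columns are hypothetical.

```python
import sqlite3

def load_raw(conn, rows):
    """Land source rows unchanged in a raw table (the 'load' step)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_enrollment "
        "(student_id TEXT, term TEXT, units TEXT)"  # hypothetical schema
    )
    conn.executemany("INSERT INTO raw_enrollment VALUES (?, ?, ?)", rows)

def build_curated(conn):
    """Transform inside the store, after loading (the 'transform' step)."""
    conn.execute("DROP TABLE IF EXISTS curated_units_by_term")
    conn.execute(
        """
        CREATE TABLE curated_units_by_term AS
        SELECT term, SUM(CAST(units AS REAL)) AS total_units
        FROM raw_enrollment
        WHERE TRIM(units) <> ''   -- skip rows with blank units
        GROUP BY term
        """
    )

conn = sqlite3.connect(":memory:")
# Raw rows are stored as-is, including the one with a blank units value.
load_raw(conn, [("1", "2019F", "3"), ("2", "2019F", "4"), ("3", "2020S", "")])
build_curated(conn)
result = conn.execute(
    "SELECT term, total_units FROM curated_units_by_term ORDER BY term"
).fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_enrollment").fetchone()[0]
print(result, raw_count)
```

Because the raw table is never modified, the curated table can be dropped and rebuilt with different rules at any time without going back to the source system.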

CSU Data Lake: A Retrospective

  • June 2017: data lake prototype (data provided to the CO was collected and housed in SQL Server tables)
  • January 2018: CSU is the first in California higher ed to appoint a Chief Data Officer

New BI/DW Sub-teams

  • Discovery Team: data lake architecture & functionality
  • Tomorrow Team: ETL & Modeling
  • FED Team: front end design
  • InfoSec: data privacy, protection and security

Data & Analytics Strategies Driving the Future: CSU Challenge: data is highly distributed across the system and not easily accessible/usable

Architectural Deep Dive

  • Shifting data from on-premise to cloud: Delphix. Data virtualization = secure, lightweight & portable data. Unique block mapping, block aware filtering, efficient compression, secure transfer.
  • Flashed through many complex slides with “enterprise-y looking” architectural diagrams, so I couldn’t effectively capture this info.
  • AWS DMS: migrate DBs to AWS quickly & securely: homogeneous & heterogeneous DB migrations, continuous replication with HA, streaming data to Amazon Redshift & S3, AWS Schema Conversion Tool, fast and easy to set up, supports widely used DBs.

Discovery Team: architectural issue: Oracle vs. Amazon DDL. Oracle DDL does NOT go straight into Redshift. We created a “teleporter” process that converts the DDLs and stores them in Redshift with the data.
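
The session did not show how the teleporter works internally. As a rough sketch of the general idea only, a DDL converter could rewrite Oracle column types into Redshift-compatible ones. The mapping below is an assumption chosen for illustration (it is not CSU's mapping and is far from exhaustive), and convert_ddl is a hypothetical helper name.

```python
import re

# Assumed Oracle -> Redshift type mapping; illustrative, not exhaustive.
# Order matters: the more specific NUMBER(p,s) rule runs before NUMBER(p).
TYPE_RULES = [
    # VARCHAR2(100), VARCHAR2(100 BYTE), VARCHAR2(100 CHAR) -> VARCHAR(100).
    # Redshift VARCHAR lengths are in bytes, so multi-byte data may need more.
    (re.compile(r"\bVARCHAR2\s*\(\s*(\d+)(?:\s+(?:BYTE|CHAR))?\s*\)", re.I),
     r"VARCHAR(\1)"),
    (re.compile(r"\bNUMBER\s*\(\s*(\d+)\s*,\s*(\d+)\s*\)", re.I),
     r"DECIMAL(\1,\2)"),
    (re.compile(r"\bNUMBER\s*\(\s*(\d+)\s*\)", re.I), r"DECIMAL(\1,0)"),
    # Redshift has no CLOB; 65535 is the maximum VARCHAR length.
    (re.compile(r"\bCLOB\b", re.I), "VARCHAR(65535)"),
    # Oracle DATE carries a time component, so TIMESTAMP is assumed here.
    # A column literally named DATE would also match; a real tool would parse.
    (re.compile(r"\bDATE\b", re.I), "TIMESTAMP"),
]

def convert_ddl(oracle_ddl: str) -> str:
    """Apply each type rule in turn to an Oracle CREATE TABLE statement."""
    converted = oracle_ddl
    for pattern, replacement in TYPE_RULES:
        converted = pattern.sub(replacement, converted)
    return converted

oracle_ddl = (
    "CREATE TABLE students (id NUMBER(10), name VARCHAR2(100 CHAR), "
    "gpa NUMBER(3,2), birth_date DATE, notes CLOB)"
)
redshift_ddl = convert_ddl(oracle_ddl)
print(redshift_ddl)
```

A regex pass like this only handles simple column types; constraints, partitioning clauses, and storage options in real Oracle DDL would need separate handling.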

Cost optimization: $405/day, moved down $100/day

AWS is providing us with custom patches to improve DMS acceleration results.

We’re talking about creating reserved instances of DMS from Amazon to save costs.

Curated Student Collections

  • Students: student info
  • Students by Term: by terms they attended
  • By Class: by enrolled classes
  • By Degree: by degree(s) attained
  • By Section: by class section offered
  • Apps by Applicant: by application submitted

Prototyping Tech

  • AWS: crawlers, data catalogs, Glue
  • Airflow + Python: hand-crafted ETL platform
  • Alteryx, Matillion, Others: Visual ETL (new prototypes)

What do YOU Get When you Start Using All This?

Curated Data Sets

  • In the next 30 days: work with CIOs and heads of IR to ID participants
  • Data validation: no statewide normalization; the question for each campus is, does this look like what’s in your SIS?
  • The Goal: access to a set of curated data sets refreshed on a daily basis; once validated, we’ll give you the ETL code; we will assist and advise in implementing a campus environment, if desired

Data validation with Pentaho (shared a view of this tool)
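
The "does this look like what's in your SIS?" check could look something like the sketch below. It is plain Python rather than Pentaho, and the row shape (dictionaries with a term key) and the function names are assumptions made for illustration.

```python
from collections import Counter

def count_by_term(rows):
    """Count records per term; each row is a dict with a 'term' key."""
    return Counter(row["term"] for row in rows)

def compare_counts(sis_rows, curated_rows):
    """Return {term: (sis_count, curated_count)} for terms that disagree."""
    sis = count_by_term(sis_rows)
    curated = count_by_term(curated_rows)
    return {
        term: (sis.get(term, 0), curated.get(term, 0))
        for term in sorted(set(sis) | set(curated))
        if sis.get(term, 0) != curated.get(term, 0)
    }

# Hypothetical sample data: the curated set is missing the 2020S record.
sis_rows = [{"term": "2019F"}, {"term": "2019F"}, {"term": "2020S"}]
curated_rows = [{"term": "2019F"}, {"term": "2019F"}]
mismatches = compare_counts(sis_rows, curated_rows)
print(mismatches)
```

An empty result means the per-term counts agree; anything else points a campus at the terms worth inspecting first.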

Direct Data Lake Access

  • In the next 60 days: work with CIOs to ID initial participants
  • Looking for pilot campuses: 3-5 pilot campuses with rollout to all other campuses to follow
  • The Goal: direct access to stored copies of all source tables via data lake; campus teleporter: to help campuses spin up Redshift tables from files; we will assist and advise in connection and best practices

We’re about ready to get all of you involved!

Data Governance Orchestration

  • Cross-functional data governance teams: 17 of our 23 campuses
  • Over the next 6 months we will start coordinating with those teams to help share data governance practices and data dictionary definitions across campuses
  • Introducing our new Student Analytics PM: Angela Williams