Honeypot Cult Article: Data Lake life: Going from Zero to Two Billion Rows of Data in Six Months
Published on 2020-10-13
This article was authored as a contribution to the Honeypot Cult community - you can read it there too!
Upon graduating from James Madison University in the States, I began my early career in GE Healthcare's Information Technology Leadership Program (ITLP). The program develops young professionals into next-generation leaders at GE through on-the-job training, leadership experiences, and mentorship from top professionals.
Every six months over the program’s two-year span, I rotated onto a new team and new project within the IT organization. I, along with my fellow program members, was expected to ramp up quickly and deliver on commitments by the end of each rotation. The program allowed us to experiment with different types of roles and technology to find potential long-term career interests.
This program wasn't unique to GE Healthcare; it was part of every business within GE's portfolio, such as GE Aviation or GE Power, with members representing their businesses in different locations. Typically, program members started and graduated within the same business, where they then took on their first post-program role.
A unique opportunity came up during my second rotation. If I accepted, my third rotation would be in Cincinnati, OH for six months as part of the GE Aviation business, allowing me to meet new people, learn a new business, and venture to a new city.
I was ecstatic to accept!
Originally, my role in GE Aviation was focused on project management to implement technology at various manufacturing shops – this quickly changed when I arrived on my first day.
Instead of one program member aligned to one project, a group of us would be part of a team of teams experiment.
Six members of the IT Leadership Program would work with members of the Operations Management Leadership Program (OMLP) as part of a larger project effort. With leadership support, we were tasked with revamping the data lake processes to ingest manufacturing system data – as quickly and efficiently as possible.
That was all that we knew and all that we were told that day.
We understood that there was a large amount of work and not much knowledge about how to do it yet. We didn’t know our operating mechanisms, our contacts, or priorities – only that they would come soon.
Immediately we had questions.
- What has been done before?
- Where does the data come from?
- How are we going to manage work?
- How can we assign tasks?
- How can we keep leadership informed and let them prioritize effectively?
The answers led us into our storming stage, where we decided to use the scrum methodology to manage our tasks and roadmap.
Our first objective was ramping up: we assigned tasks based on each member's expertise and experience, which gave us the opportunity to share knowledge later.
We learned that the data lake ran on Greenplum and that our data-loading (ETL) technology was Talend. Further, we established how we would interact with the supply chain data teams and how we could get access to the data systems.
Once our internal processes were finalized, we commenced development on our MVP Talend job by the end of the first month. Two weeks later, the team had collaborated through challenges to pass initial testing – we could load test data into the data lake!
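Talend jobs are built in a visual designer rather than written as code, but the extract-transform-load pattern behind our MVP job can be sketched in a few lines. The snippet below is a hypothetical illustration, not our actual job: the table and column names are made up, and in-memory SQLite stands in for both the manufacturing source system and Greenplum.

```python
import sqlite3

def extract(source_conn):
    """Pull raw rows from the (hypothetical) manufacturing source table."""
    return source_conn.execute(
        "SELECT part_id, qty, recorded_at FROM shop_output"
    ).fetchall()

def transform(rows):
    """Normalize records before loading: here, drop rows with missing quantities."""
    return [(pid, qty, ts) for pid, qty, ts in rows if qty is not None and qty >= 0]

def load(lake_conn, rows):
    """Bulk-insert the cleaned rows into the data lake staging table."""
    lake_conn.executemany(
        "INSERT INTO lake_shop_output (part_id, qty, recorded_at) VALUES (?, ?, ?)",
        rows,
    )
    lake_conn.commit()
    return len(rows)

# Demo: in-memory SQLite standing in for the source system and the lake.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE shop_output (part_id TEXT, qty INTEGER, recorded_at TEXT)")
source.executemany(
    "INSERT INTO shop_output VALUES (?, ?, ?)",
    [("P-100", 5, "2020-01-01"), ("P-101", None, "2020-01-01"), ("P-102", 3, "2020-01-02")],
)

lake = sqlite3.connect(":memory:")
lake.execute("CREATE TABLE lake_shop_output (part_id TEXT, qty INTEGER, recorded_at TEXT)")

loaded = load(lake, transform(extract(source)))
print(loaded)  # 2 – the row with a NULL qty is filtered out
```

The real job had far more going on (connection management, incremental loads, error handling), but the extract/transform/load split is the shape every iteration followed.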
Outside of our small team of six, the larger team (OMLPs) and leadership partnered with the functional teams to prioritize data sources. Thanks to this effort, we could then reach out and begin iterating on our process for onboarding new systems into the data lake.
Hear hear! Let the real data ingesting begin!
You can imagine what happened once we attempted to ingest data from a real system: the bugs attacked!
We immediately started iterating our Talend jobs: update and test, update and test, until finally…
We loaded our first rows of real data into the data lake! A major milestone!
But we weren’t done yet…the data party was just starting!
With our first real data system ingested and processes normalizing, we were ready to begin our norming phase. Don’t get me wrong, retros still enabled continuous improvements – just not huge shifts like changing from scrum to waterfall. Our fundamental processes stayed consistent throughout the last 3-4 months of our rotation.
Sure, we still made job updates to improve data load times and fix issues, but our main focus shifted from learning to executing!
Like the second hand of a clock ticking off each second, our onboarding process and communication mechanisms between all of the teams were in constant forward motion.
The functional team tackled which systems were most important, while the data teams began finding ways to use the data. Leadership support was constantly helping us move faster by removing roadblocks – enabling more and more data to be ingested at a faster pace.
We were running full steam ahead!
Day after day, more and more data was loaded via our jobs on various schedules. Some data was loaded daily, some weekly, etc.
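Those varying cadences amount to a simple schedule lookup: each source system is assigned a cadence, and each day the due jobs are selected and run. The sketch below is a hypothetical illustration of that idea; the source names and the "weekly means Mondays" rule are assumptions, not details from our actual setup.

```python
from datetime import date

# Hypothetical schedule config: source system -> ingestion cadence.
SCHEDULES = {
    "press_shop": "daily",
    "assembly": "weekly",  # assume weekly jobs run on Mondays
}

def jobs_due(today):
    """Return the ingestion jobs that should run on a given date."""
    due = []
    for source, cadence in SCHEDULES.items():
        if cadence == "daily" or (cadence == "weekly" and today.weekday() == 0):
            due.append(source)
    return due

print(jobs_due(date(2020, 10, 12)))  # a Monday: ['press_shop', 'assembly']
print(jobs_due(date(2020, 10, 13)))  # a Tuesday: ['press_shop']
```

In practice a scheduler (cron, or Talend's own job conductor) handles the triggering; the config-driven lookup is the part that kept adding new systems cheap.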
At the end of each sprint, we presented a data chart showing the number of total rows ingested into Greenplum. It went from zero to one hundred thousand, from one hundred thousand to one million, from one million to one hundred million, and kept going!
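The headline number for each sprint demo is conceptually just a sum of `COUNT(*)` over every ingested table. A minimal sketch of that roll-up, with made-up table names and in-memory SQLite standing in for Greenplum:

```python
import sqlite3

def total_ingested_rows(conn, tables):
    """Sum row counts across all ingested lake tables for the sprint demo."""
    total = 0
    for table in tables:
        # Table names come from a trusted internal list, not user input,
        # so interpolating them into the query is safe here.
        total += conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    return total

# Demo: two tiny stand-in lake tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lake_press_shop (id INTEGER)")
conn.execute("CREATE TABLE lake_assembly (id INTEGER)")
conn.executemany("INSERT INTO lake_press_shop VALUES (?)", [(i,) for i in range(4)])
conn.executemany("INSERT INTO lake_assembly VALUES (?)", [(i,) for i in range(6)])

print(total_ingested_rows(conn, ["lake_press_shop", "lake_assembly"]))  # 10
```

At billions of rows, exact `COUNT(*)` gets expensive; warehouses like Greenplum keep per-table statistics that can serve as a cheap estimate for a demo chart.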
The numbers accelerated as the prioritized systems were marked off the list. Before the rotation’s last sprint demo, we began to analyze our ingested data totals once more.
2 billion rows! We couldn’t believe it!
How does a team of teams project go from zero rows of data to two billion within six months?
I will tell you how – teamwork, prioritized work, innovation, and focus!
Plus, some fun along the way :).
Teamwork vs Individual Contributions
The ITLP team had experience ranging from project management and business expertise to web development and Java.
Our team leaders were fantastic at keeping us updated and communicating between all of the different stakeholders on schedules, timeline, and work involved.
Executives came to our stand-ups to offer help and speed us along.
We interfaced with contractors, data lake team members, and our colleagues in OMLP as we all rallied around the main goal: get data in the data lake and get it in there fast!
There is no doubt that hard work drove the strong results of this adventure, but it was empowered by the amazing partnerships and experiences of the teams working together!
Priority Defined Work
The data lake initiative was colossal – even when I checked in on the team a year later, it was still ongoing. There's always more data, more transformations, and more value to add.
The scale shows just how much potential work there was; thankfully, we had it prioritized.
Because of prioritization, we took on chunks of work each sprint that added the most value to the business.
Innovation and Focus
We had a simple problem statement, and every team involved was focused on delivering against it. Leadership helped remove the extra noise, and we collaboratively innovated to make it work on the technical side.
It’s always amazing how quickly a problem leads to results when you give people time to focus on discovering a solution.
The six months spent in Cincinnati were six months that won’t be forgotten. The lessons and memories of those days continue to shape the way I approach problems today.
There are moments in our careers when we work on large, impactful projects with fantastic coworkers and have so much fun doing it. Unfortunately, those moments also come to an end, so we must take a second to recognize them and enjoy them to the fullest!
This was one of those moments – we all definitely enjoyed it!