One of the most important things we can do is help our communities, and it is worth remembering that data science is a community too. Today’s big topic is ensuring equitability for all.
As a society, it is crucial to ensure that everyone has an equal opportunity to learn and explore; however, it is easy to forget that data is not always easily accessible to all, nor are the technologies and methods needed to analyze it. Companies are collecting enormous amounts of data and using it to find market trends, drive business decisions, and support many other applications.
Some of this data holds key information for companies; some of it does not. What would happen if we started making some of this data available to the public and openly discussing how it can best be used and analyzed?
What is an open-source package?
An open-source package is computer software released to the public under a license that grants any user the right to study, change, and distribute the software and its source code to anyone for any purpose. Open-source packages have empowered individuals to create advanced models and drive innovation. They have also allowed the community to give back by contributing:
- New ideas
- Improved functionality
- Innovative methods
Why should we make data and code public?
We have seen the impact of open-source packages and data on the data science community. By publicly sharing our packages and code, we encourage and empower others to explore new possibilities, pick up where we left off, and offer innovations we didn't know were possible.
Moreover, combining this practice with publicly available data sources shows how many people can work together to create highly accurate and innovative models and methods for data analytics. We've seen this in practice at hackathons that use AI and machine learning to solve real-world problems, and these events have yielded impressive models that have provided a great deal of help.
Will open-source packaging hurt my business?
Rest assured, I'm not advocating that companies openly give away key insights or trade secrets. But by opening up more data sets, packages, and code to the public, we can expect new and innovative ideas and a faster evolution of our data science community, because those who are interested can learn and grow with us. We will not only be helping our community but also driving exploratory research on real-world problems that can have a positive impact on the world.
Is posting my code and data enough?
While getting our data and resources out to the public is a commendable first step, it is not enough. What do I mean by that? Posting data and code alone doesn't truly embrace equitability.
For example, we all remember the first time we were doing a data project and had questions like:
- What can I do with the data?
- How do I clean the data?
- What model should I make?
When you are just beginning, these questions quickly become daunting. Simply posting our code does not explain its inner workings or show that it is not as daunting as it looks. Nor does merely posting code address the fact that running it may have taken a tremendous amount of computing resources that are not accessible to everyone.
Empowering the public can lead to data science breakthroughs
While I do not know how to address all of this thoroughly, I believe we can start tackling this monster by helping individuals understand these methods and giving them the ambition to implement them themselves. If everyone wrote small papers explaining their methodology and code and shared them publicly, we could encourage and empower others to try.
Often, individuals see topics such as machine learning as scary and daunting. By having more explanations out there, we can begin to demystify these topics, encouraging high school and college students (and any individual who wants to learn data science, young or old) to try their hand at data science. While this may be a small step for each individual, it can have a significant impact.
We often take for granted the ideas and concepts about data manipulation that we have learned from others. We can nurture and grow our data science community by paying this forward.
The takeaway - make data publicly available, share code, and explain your thinking
Making sure that data science is equitable for everyone is not easy, and I don't have all the answers to address it. However, one small step we can all make that will have a huge impact on the data science community and the world is to start making more data publicly available, sharing our code/packages, and explaining our thinking.
By making this available to the community, we can start truly harnessing the power of 1 + 1 = 3 or, at the scale of a whole community, 1,000,000 + 1,000,000 = 3,000,000. By fostering creative thinking and new ideas, we can see innovations that we couldn't have thought of ourselves and that will benefit both humankind and our own projects.
I genuinely believe that "one random act of kindness will spark another," so I encourage anyone in the data science field to pick up a keyboard and write a small paper explaining something you find interesting or exciting. This action may encourage the next generation of data scientists to try it themselves, thus giving back to our ever-growing data science community.