This is an introduction to Conda, a tool that uncomplicates the lives of data scientists and developers alike. Its development has been led by Continuum Analytics (now Anaconda, Inc.), whose co-founder Travis Oliphant is the original author of NumPy, the library that forms the backbone of the entire Python data science ecosystem.
Let’s go over the problems that it is designed to solve.
One of the biggest issues people face on Linux is managing the path variables required for various tools and for different versions of the same software; this is known as environment management. People need to be comfortable with the culture of “living in the shell” to navigate and manage the machine properly from the terminal, but this takes time.
So many programming languages exist because of constant innovation and the needs that arise in the process of writing programs. What separates software engineering from programming is the emphasis on being able to precisely repeat and explain what your program does. Most importantly, the element of design must not be overlooked.
This is all the more important for any scientific analysis that needs to be precisely reproducible.
For example, if you are using many different programs and languages in your own analysis, it takes a conscious effort on your part to note and make sure that their “environment variables” are set properly, which mostly means that the terminal can find the program you wish to run (through the PATH variable).
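To make “the terminal can see the program” a little more concrete, here is a minimal Python sketch that asks the same question the shell asks: is this program somewhere on PATH? The helper names (visible_on_path, path_directories) are my own, and the tool names are simply the ones mentioned in this post.

```python
import os
import shutil


def visible_on_path(programs):
    """Map each program name to the full path the shell would run, or None."""
    return {name: shutil.which(name) for name in programs}


def path_directories():
    """Return the directories listed in PATH, in the order they are searched."""
    return [d for d in os.environ.get("PATH", "").split(os.pathsep) if d]


if __name__ == "__main__":
    # Tool names come from the examples in this post; swap in your own.
    tools = ["java", "prokka", "trimmomatic"]
    for name, location in visible_on_path(tools).items():
        print(f"{name}: {location or 'not found on PATH'}")
```

If a tool prints as “not found on PATH”, it is either not installed or installed in a directory that PATH does not list, which is exactly the situation environment management is meant to prevent.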
Now why would we need different versions of the same program?
I noticed that Bio Linux is a really nice wrapper around Ubuntu that comes with various biological software packages pre-set and pre-installed, but it also comes with a bit of a downside. Since you’re using something that you don’t completely understand, you won’t easily be able to change the version of “prokka”, “trimmomatic” or “java”. If a bug is later discovered in that specific version and you need to redo the entire analysis on a different version, you’ll face a lot of difficulty. Because the whole of Bio Linux is built by someone else, anything going wrong with the pre-set variables makes the entire system really fragile, and we often find ourselves stuck.
Something we use every day but don’t understand is much like a double-edged sword.
Of course, with time you can learn more about Linux and tackle such problems easily, but the beauty of science and open source is that any problem you are facing might already have been solved by someone else. Thus, instead of “going down the rabbit hole”, you can focus on your own domain rather than taking on a problem you never wanted to solve in the first place, and so avoid yak-shaving.
The way Conda solves this problem is that, instead of installing software globally (as Bio Linux or sudo apt-get does), it installs software into separate “Conda environments” that do not interfere with each other. I’ve also attached an environment file which clearly specifies the software and versions in the CondaGenetics environment I’ve been working on for a particular project. When you feel that you no longer need a particular package, simply remove it using Conda, and it will work through the dependencies to make sure that everything still works after the removal.
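As a rough sketch of that workflow, the Python snippet below only builds the Conda command lines and prints them; it does not run anything. The function names are my own, and I am assuming that prokka and trimmomatic are installed from the bioconda channel. The environment name is the one from this post; the packages and channels are illustrative, not a copy of my actual setup.

```python
import shlex


def create_env_command(env_name, packages, channels=()):
    """Build `conda create` for a new, isolated environment.

    `packages` maps a package name to a version string, or to None
    when any version will do.
    """
    command = ["conda", "create", "--yes", "--name", env_name]
    for channel in channels:
        command += ["--channel", channel]
    for name, version in packages.items():
        command.append(f"{name}={version}" if version else name)
    return command


def remove_package_command(env_name, package):
    """Build `conda remove`; Conda re-solves the dependencies itself."""
    return ["conda", "remove", "--yes", "--name", env_name, package]


def export_env_command(env_name):
    """Build `conda env export`, which prints the environment as YAML."""
    return ["conda", "env", "export", "--name", env_name]


if __name__ == "__main__":
    # Assumed for illustration: both tools come from the bioconda channel.
    commands = [
        create_env_command(
            "CondaGenetics",
            {"prokka": None, "trimmomatic": None},
            channels=["bioconda", "conda-forge"],
        ),
        remove_package_command("CondaGenetics", "trimmomatic"),
        export_env_command("CondaGenetics"),
    ]
    for command in commands:
        print(shlex.join(command))
```

Keeping the commands as lists makes them easy to hand to subprocess.run later, once you have checked that they do what you expect.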
Every language eventually evolves a culture around itself, with its own community standards and best practices. This culture helps people write, distribute and, more importantly, read programs in that language. It also dictates where the important components of the language or software are located. But constantly changing software is a pain for everyone: Ruby addresses this with the gem package manager, Python with the pip package manager, Java often leaves it to build tools such as Maven or Gradle, usually driven from a sophisticated IDE (like Eclipse or IntelliJ), and R, Julia, Clojure and others have their own equivalents.
The strength of the Conda package manager is that it can manage packages for any language, so you are free to use any tool available in any language for your analysis without the headaches of setting up each language properly. You can quickly move from a fresh install (which some recommend doing every 3 months on personal Linux systems) to your perfect analysis environment, relying on Conda to do the heavy lifting.
This way, if in future you wish to deploy your work on a server for production, revisit your research, or recreate the perfect setting to continue your analysis on different hardware or another operating system, you can simply ask Conda to create the environment from this single file and then continue your work. That’s it!
You don’t need to check whether your system Python version is 2 or 3, or whether your Ruby version is 2.3.3 compiled with xyz options, and so on. Simply let Conda manage all these requirements so you can focus on your analysis and your results.
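For a feel of what such a file looks like, here is a small Python sketch that renders a minimal environment file and builds the command that would recreate the environment from it. The contents are illustrative only: this is not the file attached to this post, and the channels and packages are my assumptions. The function names are hypothetical too.

```python
def render_environment_file(env_name, channels, dependencies):
    """Render a minimal Conda environment file as YAML text."""
    lines = [f"name: {env_name}", "channels:"]
    lines += [f"  - {channel}" for channel in channels]
    lines.append("dependencies:")
    lines += [f"  - {dependency}" for dependency in dependencies]
    return "\n".join(lines) + "\n"


def recreate_command(path):
    """Build the `conda env create` call that rebuilds the environment."""
    return ["conda", "env", "create", "--file", str(path)]


if __name__ == "__main__":
    # Illustrative contents, not the actual file attached to this post.
    text = render_environment_file(
        "CondaGenetics",
        channels=["bioconda", "conda-forge"],
        dependencies=["prokka", "trimmomatic"],
    )
    print(text, end="")
    print(" ".join(recreate_command("environment.yml")))
```

In practice you would rarely write this file by hand: conda env export produces it for an existing environment, with versions pinned, and that is what makes the setup repeatable on another machine.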
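If you are ever curious which interpreter and which environment you actually ended up in, a few lines of Python will tell you. This is a minimal sketch: describe_runtime is a name I made up, and it relies on the CONDA_DEFAULT_ENV and CONDA_PREFIX variables that Conda sets when an environment is activated.

```python
import os
import sys


def describe_runtime(environ=None):
    """Report the running Python and the active Conda environment, if any."""
    environ = os.environ if environ is None else environ
    return {
        "python_version": "{}.{}.{}".format(*sys.version_info[:3]),
        "python_executable": sys.executable,
        # Both variables are unset when no Conda environment is active.
        "conda_env": environ.get("CONDA_DEFAULT_ENV"),
        "conda_prefix": environ.get("CONDA_PREFIX"),
    }


if __name__ == "__main__":
    for key, value in describe_runtime().items():
        print(f"{key}: {value}")
```

Run it once outside any environment and once after activating one, and the interpreter path should change accordingly.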
Be Curious and Happy Hacking!
P.S. You can read more about how it could be used in practice for the modern-day scenario of polyglot data science in this article [link]
This post was authored by Abhinav Sharma. If you want to sponsor or contribute an article please reach us at email@example.com
Machine Learning Engineer, Product Strategist
Fourtek IT Solutions