Can somebody give me advice on how to efficiently handle the data files, source code, documents, and manuscripts of research projects based on numerical simulations? My problem is that I use multiple computers, while the one I use most has limited storage space, so files and directories tend to scatter across multiple storage devices without being synchronized, unless I make a considerable effort to manually decide which ones should be stored where and to remember those decisions. If there are people in a situation similar to the one described below, I wonder if they could share their detailed know-how.
I work on simulation-based research using multiple computing environments (a laptop, desktop computers at work and at home, and computing clusters at work) to write code, run jobs, analyze the output data, make figures, and write manuscripts. The OS has been exclusively Linux so far. I don't have a permanent position, so I need to move from one place to the next every now and then.
Most of my human time is spent on my laptop, writing code and making plots from the output files of simulations. However, this laptop has limited storage space (about 500 GB). A single project sometimes generates 100 GB or more of output data before a manuscript is published, and I work on multiple projects in parallel. I often revisit old source code and analysis scripts, and examine the output data files of finished or frozen projects. It would therefore be most convenient to collect all the files of all current and past projects on the laptop, but this is not possible due to the limited storage space.
I have tried different schemes in the past but have always been frustrated by the difficulty of keeping files consistent across multiple storage devices (internal and external hard disk drives on the various computers). By consistent, I mean that files intended to be present on multiple drives should be exactly the same and up to date with the latest version, while other files should be present only on a spacious drive.
It is important for me to keep, for each project, a set of source code, input data files, raw output files of simulation runs, analysis and plotting scripts, and the output files of the analysis in an accessible way, both to keep myself accountable for my published results and to accelerate future research that builds on past projects.
For the last few years, I have used git to version-control my source code and keep it synchronized across multiple computers. I have also used unison, to some extent, to synchronize data files between my laptop, a USB hard disk drive, and a desktop computer. I like both of these tools, and I wonder whether I can build a more efficient workflow with more extensive use of them. I also use rsync to manually fetch and push files from one computer to another.
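For reference, the kind of unison profile I use for one project looks roughly like this (the host name and paths are made up for illustration; `path` restricts the sync to a subtree of the shared root):

```
# ~/.unison/projA.prf -- illustrative sketch only
root = /home/me/projects
root = ssh://desktop.example.org//home/me/projects
path = projA
ignore = Name *.tmp
```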
I have the following hypothetical directory tree as a collection of all of my projects:
    projects/ (about a few TB)
    - projA/ (about a few 100 GB)
      - deploy/
        - dvlp/
          - stable/
            + .git/
            + source/
            + doc/
            + tests/
            + bin/
          - sandbox/
            + debug000
            + ..
            + debug999
        + jobs_v000_machineA/
        + jobs_v000_machineB/
        - jobs_v001_machineA/ (about 100 GB)
          + source/
          + doc/
          + tests/
          + bin/
          - job000/ (about 10 GB)
            - input.dat
            - input_moo.dat
            - out_foo000.dat
            - out_bar000.dat
            - ..
            + graph000/ (about 10 MB)
            + graph001/
          + job001/
          + ..
          + job099/
    - projB/
    - ..
    - projQ/
but it is hypothetical at the moment because none of the storage drives completely mirrors this entire tree. On one drive (an older USB hard disk drive), the actual tree is
    projects/
    - projA/
    - ..
    - projN/
while on another drive (an internal disk of a computing machine) it looks like
    projects/
    - projL/
    - ..
    - projQ/
      - deploy/
        - jobs_v001_machineA/
          + source/
          + doc/
          + tests/
          + bin/
          - job000/
            - input.dat
            - input_moo.dat
            - out_foo000.dat
            - ..
            - out_foo320.dat
          + job001/
On yet another drive (the internal disk of the laptop), it looks like
    projects/
    - projL/
    - ..
    - projQ/
      - deploy/
        - dvlp/
          + stable/
          + sandbox/
        - jobs_v001_machineA/
          + source/
          + doc/
          + tests/
          + bin/
          - job000/
            - input.dat
            - input_moo.dat
            - out_foo000.dat
            - ..
            - out_foo320.dat
            + graph000/
            + graph001/
          - job001/
I can buy an external USB hard disk drive with a few TB of capacity to store the entire projects/ tree now. But I still want an efficient and reliable way to partially synchronize an appropriate subtree of it with each computer.
I have seen the following related questions, but they do not address my concerns exactly.
- Archiving papers, simulation and experimental data, etc?
  - That case does not involve data too large to fit on one of the computers.
- Organizing data and files
  - That post does not seem to be concerned with files being scattered across multiple computers.
I found the following tool, but it is overkill for me.

- Pegasus, a Workflow Management System for Science Automation
  - I would spend more time learning the tool than writing code.
I would appreciate your advice.