I am a second-year PhD student in physics. Tenure-track positions are highly competitive, and I do not love research enough to pursue it as a lifelong career. Since I like programming and playing with data, I want to work as a data scientist after finishing my degree. I have read some success stories of people who got degrees in physics but work as data scientists, but those people are from top universities like UC Berkeley, Stanford, etc.
So my question is: how feasible is it for someone with only a physics degree from a low-ranked university to find a job as a data scientist? What should my plan be for the remaining years of my PhD program? What should I learn? How can I get real projects and internships to work on? Will working unpaid on data analysis in a research lab at my current university help?

We have submitted our recent findings to PRL. Now that the second round of revisions has come through, we are fairly confident that the paper will be published in PRL.

We now want to provide our algorithm as a MATLAB GUI so that readers can reproduce our results with a single click. The idea is to give readers from other fields a quick feel for what we do.

Most journals, like Nature, have a Data availability section where links to available data can be provided. Is there a way to do this in PRL? We would like to add a link to our webpage, which hosts the software we created.

For example,

Data availability: Interested readers can reproduce our results by using our algorithm. The software can be found at [URL].

Is such a statement allowed in PRL? If so, where does it go: before the acknowledgments, after them, or in the supplemental material?

I’m very interested in machine learning and its application to the healthcare field. I am also interested in the Robotics program at Hopkins because of its wide array of research in the medical field. They research many things, such as medical imaging, machine learning for surgical robots, and analysis of EMR data. Does anyone know of other universities that have a similar medical focus? The ranking is not important. I am just looking for a little bit of curriculum overlap. Thank you!

I am an undergraduate student studying Geographic Information Systems. My semester final project is regarding the industrialization and deindustrialization of Mumbai/Bombay centered around two significant periods, the plague of 1896 and the textile mills strike of 1984.

I am looking for population data (or datasets) for Mumbai/Bombay, India going back as far as the 1850s. So far, the earliest readily available population data I can find is from 1991. I’ve datamined a bit and found population data within early 20th century publications, but they all provide different numbers. I understand that population estimates for Mumbai/Bombay will vary greatly for this time period depending on the source; however, it would help tremendously if I could find a trusted source that stretches back to at least the mid-19th century.

I would appreciate it if someone could create the following tags: population, datasets, Mumbai, and Bombay.

We conducted a couple of simulations, and the design parameters for optimum performance of the system seem to follow a pattern. Based on this observation, we established empirical formulae for a couple of design parameters required for the system. When we simulated the system using these parameters under different conditions, the system performance aligned with the most desirable results every time the empirical formulation was used. How do we substantiate the claim that these empirical formulae best describe the system? How much data is required? Is it best to refer to past papers in the field of study, or are there new methods to approach this?

I have a discrete data set showing the number of customers entering a shop per day, with the following descriptives:

Std: 193

As I want to run a Monte Carlo simulation with the number of customers per day as input, I want to find the best-fitting model to describe the data.

My problem is that the data fit a Weibull distribution very well (p = 0.9), but do not fit any discrete model at all. Is it possible to use the Weibull distribution to generate data (using bins, or rounding to the nearest integer), or would this be regarded as bad scientific practice?
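
To make the idea concrete, here is a minimal sketch of what I mean by drawing from the Weibull distribution and rounding to the nearest integer. The function name `simulate_daily_customers` and the scale and shape values are hypothetical placeholders, not the parameters fitted to my data.

```python
import random


def simulate_daily_customers(scale, shape, n_days, seed=None):
    """Draw n_days integer customer counts from a rounded Weibull distribution."""
    rng = random.Random(seed)
    # random.weibullvariate(alpha, beta): alpha is the scale, beta is the shape.
    # round() maps each continuous draw to the nearest integer count.
    return [round(rng.weibullvariate(scale, shape)) for _ in range(n_days)]


# Hypothetical parameters for illustration only (not fitted values).
counts = simulate_daily_customers(scale=500.0, shape=2.5, n_days=1000, seed=42)
mean_count = sum(counts) / len(counts)
```

Each entry of `counts` would then serve as one simulated day in the Monte Carlo run.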

The journal Immunity says on this webpage

“As a matter of publishing ethics, we cannot consider any paper that contains data that have been published or submitted for publication elsewhere.”

Why not? What if the two papers analyze rich sources of data that overlap only partially? What if they have completely different ambitions? Is the only meritorious activity in science the act of gathering data, and to hell with the analysis? And what does this have to do with ethics?

I am going to base my data analysis on secondary data. These data are drawn from a manufacturing process. Fortunately, I know exactly the circumstances under which the original data were collected. To check whether the secondary data are reliable, I am planning to carry out the following steps:

  1. Draw a sample from the process (under the same circumstances as the original data)
  2. I know the expected outcome of the process –> good product / bad product
  3. I compare the drawn sample to the expected output
    –> If this returns valid results, I can assume that the data on which my analysis will be based are also valid.
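
One way the comparison in step 3 could be sketched, assuming it is reduced to comparing the fraction of bad products in my fresh sample with the fraction in the secondary data, is a two-proportion z-test. This is only one possible reading of the check; the function name `two_proportion_z` and the counts used below are hypothetical.

```python
import math


def two_proportion_z(bad_a, n_a, bad_b, n_b):
    """Two-sided z-test for equality of two proportions (pooled variance)."""
    p_a = bad_a / n_a
    p_b = bad_b / n_b
    pooled = (bad_a + bad_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Happens when all items in both samples are good, or all are bad.
        raise ValueError("standard error is zero; the test is undefined")
    z = (p_a - p_b) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 12 bad out of 200 in the fresh sample,
# 130 bad out of 2000 in the secondary data.
z, p_value = two_proportion_z(12, 200, 130, 2000)
```

A large p-value here would only mean that the fresh sample gives no evidence of a different defect rate, not that the secondary data are proven valid.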

Is this approach appropriate, or are there other or better ways? I really don’t know how else I should check the quality of the secondary data…

I recently started my PhD on a rather fixed subject. In my application and interview I proposed something specific, and I think that is the reason why they chose me. Problem: when they gave me the initial data, I found out that its accuracy is very low, and thus my initial plan is not going to work. I know that these kinds of difficulties are part of any research, so I have two questions:

  1. Is it normal to face something like that at the beginning of my research?
  2. Now that I have to change my research questions, what should I look for?

Apologies if my question is very basic or broad.

I’ve just “published” some datasets in the Mendeley Data repository. After pressing the “Publish” button, I was surprised by a message saying that my datasets are “in moderation”, a process I wasn’t aware of in advance (my bad).

Now I’m wondering how long it usually takes for Mendeley Data staff to approve a dataset for publication online. Does anyone have any experience with the service?

I planned to submit an article (linking to these datasets) today, but it seems I’ll have to wait for the dataset moderation first.