Two books help with methods of doing this.
Rapid Development: Taming Wild Software Schedules: https://www.amazon.com/Rapid-Development-Taming-Software-Sch...
Software Estimation: Demystifying the Black Art https://www.amazon.com/Software-Estimation-Demystifying-Deve...
This likely does mean you'll need to be allowed to deploy automated test cases, develop documentation and the whole nine yards, because it's going to be far harder to hit any reasonable deadline if you don't have a known development pipeline.
You'll need to reduce your uncertainty as much as you can to even have a chance - and even then, things will still blindside you and make the actual time double (or more) the original estimate.
After a number of iterations of this, you converge on a baseline architecture that tries to support future change without being excessively focused on YAGNI. Over-focus on YAGNI often leads to architectures that are so inflexible that when you DO need something, you can't add it without a demolition crew.
People should also remember the context in which YAGNI came up: Kent Beck, Ward Cunningham, Smalltalk, and the C3 project. This particular combination of people, project, and process made YAGNI feasible. And the C3 project, which spawned XP, was not as successful as folklore would have it.
It's not always so. Imagine if YAGNI was the focus of Roy Fielding and the HTTP specification, which has lasted remarkably well because of allowance for future change in the architecture.
Scrum and other processes that claim adherence to the Agile Manifesto work best in certain contexts, where code turnaround can be fast, mistakes do not take the company down and fixes can be redeployed quickly and easily, the requirements are not well-understood and change rapidly, and the architecture is not overly complex.
Many other projects don't fit this model, and people seem to think that so-called "Agile" (which even the creators of the Manifesto complain is not a noun), and mostly Scrum, is the sole exemplar of productivity.
The fact is that there are many hybrid development processes, described by Steve McConnell long before the word "agile" became a synonym for "success", that may be more suitable to projects that have different dynamics.
An example of such a project could be one that does have well-defined requirements (e.g. implementing something based on an international standard) and will suffer from a pure "User Story" approach.
Let's be much more flexible about how we decide to develop, and accept that you need to tailor the approach to the nature of the project, based on risk factors, longevity, and other factors, and not on dogma.
And let's not underplay the extreme importance of a well-thought-out architecture in all but the most trivial of projects.
for the engineering and peopleware considerations, which is maybe 1/3 of "right".
Another 1/3 of "right" is the path from conceptual design to database modelling and realizing operations on the database from code. If this part is well planned the code almost writes itself and the customer can always be right because the answer to "can you do this small thing?" is always "yes!"
The other 1/3 of "right" is the content of the computer science curriculum. Some of this is practically math such as combinatorics and algorithm analysis. You would also have to take some classes in areas such as compilers, computer architecture, operating systems, etc. For the average person who wants transferable skills I say go for compiler construction because small simple compilers are useful and methods used in compiler construction are useful for other kinds of programs. Also compilers interact with the processor and operating system so you can learn some of that by learning compilers.
Another path that gets closer to the metal is to do some embedded development, for instance, program a microcontroller to talk to the computer in your car. Operating systems for tiny machines are all the rage these days and easy to learn because they are themselves tiny.
"Point the boat in the right direction and row" is the dominant paradigm in the industry.
The concept of "technical debt" is I think harmful because it is a two word phrase that stops thought. In the real world if you want to take on commercial debt, the bank and/or bondholders and originators are going to want to see a detailed financial analysis that will indicate they will get their money back.
Probably 80% of effort on software is maintenance, but it is rare for any project to start out thinking about the cost of maintenance.
Ok, it's a little more complex than that. In project management people talk about the "triple constraints" of budget, time, and scope.
The relationship between budget and time is not a simple tradeoff. To some extent you can accelerate a software project by increasing the budget, but that extent is limited; maybe you can accelerate the schedule by 30% relative to a "least cost" plan.
Fred Brooks learned these limitations the hard way in the 1960s and he wrote The Mythical Man-Month so you don't have to! One problem is that if you add more people to a project, it takes time and attention to onboard them that could otherwise be used to get the project done.
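A toy model makes the onboarding cost concrete. This is only a sketch: the function `days_to_finish` and its parameters (`ramp_days`, `mentor_share`) are made up for illustration, not taken from Brooks, and all the numbers are assumptions.

```python
def days_to_finish(work_days, team, hires=0, ramp_days=20, mentor_share=0.5):
    """Calendar days to finish `work_days` person-days of work.

    Toy model (every parameter is an assumption): each new hire produces
    nothing for `ramp_days` and ties up `mentor_share` of one existing
    person during that time; afterwards they count as a full team member.
    """
    if work_days <= 0:
        return 0.0
    if hires == 0:
        return work_days / team
    # Effective people-per-day while the new hires are ramping up.
    ramp_rate = max(team - hires * mentor_share, 0.0)
    done_in_ramp = ramp_rate * ramp_days
    if done_in_ramp >= work_days:
        # The work ends before anyone new is up to speed.
        return work_days / ramp_rate
    # After the ramp, everyone works at full speed on what is left.
    return ramp_days + (work_days - done_in_ramp) / (team + hires)


print(days_to_finish(100, 5))            # short job, no hires
print(days_to_finish(100, 5, hires=2))   # short job, two hires
print(days_to_finish(1000, 5))           # long job, no hires
print(days_to_finish(1000, 5, hires=2))  # long job, two hires
```

Under these assumed numbers, adding two people to a 100 person-day job makes it finish later (about 22.9 days instead of 20), while the same two hires do help on a 1000 person-day job.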
A counter to that is that sometimes buying hardware, software, or services, can greatly accelerate the project. For instance if you are training neural networks on a MacBook, it is probably worth every penny to get a real desktop PC and put a 1080Ti graphics card in it, or to spend some money on cloud computing.
Many managers look at the cost as a function of the deadline, that is, they see the cost of the project as a function of the time you are tied up doing it, so if they can compress the deadline, the cost goes down. (or so they think).
That leads to the "phony deadline" which has no real basis. One problem is that if you set out without a realistic plan, you are likely to make mistakes which will draw out the project, add to costs, and possibly make the project fail.
Most of what I say above is laid out in more detail here:
The flip side of that is the hard deadline, where the job might as well not be done if it is not done on time, for instance, you need to get a grant application in before a certain day, or you are putting together a demo you are going to show at a trade show on a particular date.
It is important to understand the actual nature of the deadlines that you are up against, what flexibility you have, what impact being late has on the business, etc.
That leaves "scope" as an area with wriggle room. Probably there are some features of the project that can be dropped or modified, and management may be able to hit the deadline by dropping features it can afford to drop. This is one big advantage of "agile" methods; if you have something that sorta-kinda works at the 25% mark of the project and then you hit the deadline with the most important 80% of the functionality you are doing better than most people.
Both the ways "data scientists" typically work and the agile methods used in many software development organizations are unsuitable for commercial use of machine learning and other data-rich methods.
In your question I am hearing two themes: (i) how to organize the actual work ("no data", "no features", "no users") and (ii) how to slot the work into the sprint system.
The typical sprint system often introduces risk and uncertainty to data rich projects. Here is an example. I was working on a project where the sprints were typically two weeks, but one part of building the knowledge base was running a batch job that took two days. Of course if you set the batch job up wrong you might have to do it more than once.
When I was doing the batch job I would account for the risk and spend maybe two days getting ready for the batch job and run the batch job at the very beginning of the sprint, then even if things went horribly wrong with the batch job and I had to do it two or three times I was certain the KB would be ready on time. Practically I had a PERT chart in my head that I was using to plan my own work.
Even though I told them what I just told you, the first time some other team members did the batch job, they started it on the last day of the sprint which meant it wasn't ready and the Sprint shipped with an old and inappropriate KB.
As a retrospective it would be good to turn the 2 day batch job into a 2 hour batch job (It started out as a 2 century batch job!) Also the reliability of the batch job is every bit as important as the speed in a situation like that. More fundamentally, I think some thinking about the ordering of work (PERT charts) should have been built into the process.
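The kind of planning described above can be sketched in a few lines of Python. The function names and the numbers (a sprint of 10 working days, a 2 day job, up to 3 attempts) are assumptions for illustration; this is a simple slack calculation, not a real PERT tool.

```python
def latest_safe_start(sprint_days, job_days, max_attempts):
    """Latest day (0 = first day of the sprint) to start a job so that it
    still finishes inside the sprint even if it has to run `max_attempts`
    times back to back."""
    return sprint_days - job_days * max_attempts


def finishes_in_sprint(start_day, sprint_days, job_days, attempts):
    """True if `attempts` consecutive runs started on `start_day` finish
    by the end of the sprint."""
    return start_day + job_days * attempts <= sprint_days


# Assumed numbers: 10 working days per sprint, 2-day job, up to 3 attempts.
print(latest_safe_start(10, 2, 3))        # 4
print(finishes_in_sprint(2, 10, 2, 3))    # True: start after 2 days of prep
print(finishes_in_sprint(9, 10, 2, 1))    # False: started on the last day
```

With these numbers, two days of preparation followed by up to three runs still fits, while a single run started on the last day does not.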
There are lots of cases like that, but note the risk-amplifying property of the sprint. If some input to the sprint is a day late, everything that depends on that input slips two weeks.
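The amplification is just rounding up to the next sprint boundary. A minimal sketch, assuming work only ships at the end of a sprint and sprints are 14 calendar days long (`ship_day` is a made-up helper name):

```python
def ship_day(ready_day, sprint_days=14):
    """Day on which work that is ready on `ready_day` actually ships,
    assuming releases happen only at sprint boundaries.
    Days are counted from the start of the first sprint."""
    # Integer ceiling division: round up to the next multiple of sprint_days.
    sprints_needed = -(-ready_day // sprint_days)
    return sprints_needed * sprint_days


print(ship_day(14))  # 14: ready on time, ships with this sprint
print(ship_day(15))  # 28: one day late, ships with the next sprint
```

A one-day slip in the input becomes a fourteen-day slip in what ships.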
For that project we also did two hour "planning poker" meetings and that was another problem because with two hours we didn't have enough time to make certain decisions. If we'd had two or three people think about things for a day we could have made consistently better decisions about certain things which would mean doing the right work in the next sprint, similarly saving two weeks of calendar time.
It is very easy for little failures of the type described above to cascade and produce a recurring pattern of failure that is awful for productivity, morale, etc.
It is very important to push back on management and address these kinds of problems.
Now this sounds very negative for agile in data-rich projects and that's not the only thing you should take away. In the long run, data rich projects benefit hugely from continuous improvement that is done on a regular cadence.
You meet "data scientists" or "junior programmers" who have started a number of projects and sent deliverables over to other people who get them ready for production. They think they have a great batting average, but when you look at it from a wider perspective you see that 4x the man-hours they put into the project got spent getting their work ready for production. Had the team "begun with the end in mind", the total cost of the project could have been cut in half or more and the risk greatly reduced.
Big and very capable co's like IBM and Nuance, as well as many smaller ones you have not heard of, have built data-rich systems that turned out to be like building a nuclear reactor. We are not talking something that cost $22,000 when it should have cost $21,000, but rather something that cost $20 billion when it should have cost $5. The people involved will tell you they don't know what they're going to do next but they do know they are never going to do that again.
So your process, technology, everything, has to be designed to control (1) risk and (2) cost to address those things and you've got to communicate that to the people you work with.
What most people don't know/accept/believe is that most teams would control cost best if they tried to control risk first, see:
As for your other issues, this is what I am going to say.
Short term there are two things that really matter: (1) getting data, and (2) developing the basic interfaces between the ML component and the rest of the system. If you have (2) you can really contribute to the sprints; if you don't, you cannot. Without (1) any data pipeline stuff, feature engineering, etc. is going to largely be a waste of time.
For data start out with the Enron emails or your own emails and label enough of it that you can start thinking about the other issues. Your early data set will be nowhere near large enough to get useful results, and that's another issue you'll need to bring up with management once you've reached it.
If you want to get deeper into project management I suggest that you become a member of the PMI and possibly get certification from them. The training and testing are rigorous and it is a certification that means something both from the knowledge you get and the benefit of having it on your resume.
Code Complete was (to me) revolutionary at the time. Though it may seem a bit dated now, it was a foundational work that much of what we take for granted was built on.
I don't hear enough talk about Steve McConnell's other book "Rapid Development". http://www.amazon.com/Rapid-Development-Taming-Software-Sche...
I just looked at its Amazon page and it has 5 stars with 115 reviews. That doesn't surprise me at all - it's well deserved.
It came out well before the agile movement, and I would contend that it laid the foundation for it.
I'm so glad I read that at the beginning of my career. As a junior developer, it helped me immensely and even more so over the years.