This is a common question that arises when learning Rx.
Your suggestion to put your exception handling logic in the subscriber is preferable over creating a generic observable wrapper.
Remember that Rx is about pushing events to subscribers.
From the observable interface, it's clear there's not really anything an observable can know about its subscribers other than how long they took to handle an event, or the information contained in any thrown exceptions.
A generic wrapper to handle subscriber exceptions and carry on sending events to that subscriber is a bad idea.
Why? Well the observable should only really know that the subscriber is now in an unknown failure state. To carry on sending events in this situation is unwise - perhaps, for example, the subscriber is in a condition where every event from this point forward is going to throw an exception and take a while to do it.
Once a subscriber has thrown an exception, there are only two viable courses of action for the observable:

- Stop sending events to that subscriber altogether (i.e. unsubscribe it); or
- Stop sending events entirely and terminate the stream.
Specific handling of subscriber exceptions would be a poor design choice; it would create inappropriate behavioural coupling between subscriber and observable. So if you want to be resilient to bad subscribers the two choices above are really the sensible limit of responsibility of the observable itself.
If you want your subscriber to be resilient and carry on, then you should absolutely wrap it in exception handling logic designed to handle the specific exceptions you know how to recover from (and perhaps to handle transient exceptions, logging, retry logic, circuit breaking etc.).
Only the subscriber itself will have the context to understand whether it is fit to receive further events in the face of failure.
If your situation warrants developing reusable error handling logic, put yourself in the mindset of wrapping the observer's event handlers rather than the observable - and do take care not to blindly carry on transmitting events in the face of failure. Release It!, whilst not written about Rx, is an entertaining software engineering classic that has plenty to say on this last point. If you've not read it, I highly recommend it.
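To make the wrapping idea concrete, here's a minimal sketch in plain Python, deliberately not using any real Rx library; `SafeObserver`, its parameters, and the recovery policy are all illustrative names, not an actual Rx API:

```python
class SafeObserver:
    """Wraps an observer's on_next handler with recovery logic the
    subscriber itself defines. Only the subscriber knows which
    exceptions it can recover from; anything else puts it in an
    unknown state, so it stops accepting events."""

    def __init__(self, on_next, recoverable=(), on_error=None):
        self._on_next = on_next
        self._recoverable = recoverable          # exception types we can survive
        self._on_error = on_error or (lambda e: None)
        self.alive = True                        # False => observable should unsubscribe us

    def on_next(self, value):
        if not self.alive:
            return
        try:
            self._on_next(value)
        except self._recoverable as e:
            self._on_error(e)                    # known failure: log and carry on
        except Exception as e:
            self.alive = False                   # unknown failure state: stop
            self._on_error(e)

# Demo: the handler blows up on 0, but that's declared recoverable.
seen, errors = [], []
obs = SafeObserver(
    on_next=lambda v: seen.append(10 // v),
    recoverable=(ZeroDivisionError,),
    on_error=errors.append,
)
for v in (5, 0, 2):
    obs.on_next(v)
# seen == [2, 5]; one recovered error was logged; the observer is still alive
```

The key design point: the recovery policy lives with the subscriber, not the observable, so the observable stays ignorant of why its subscribers fail.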
What's between your problematic app and the database? Since it's in a DMZ I suspect you've got a router. If the application keeps a connection open to the database, but that connection is quiet for a period (say, overnight?), then in the absence of traffic, the router may close that connection. I've seen behaviour like this before.
I vaguely remember such a scenario being detailed in Release It!
Are you using database pooling, with checks on the connections as they're handed out from the pool? If the above is the problem, you may want to look at Apache DBCP.
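For a rough illustration of what check-on-borrow means (DBCP configures this with `testOnBorrow` and a `validationQuery`), here's a minimal pure-Python sketch; `connect` and `validate` are hypothetical stand-ins for real driver calls:

```python
import queue

class ValidatingPool:
    """Sketch of DBCP-style test-on-borrow: before handing a connection
    out, run a cheap validation check; if it fails (e.g. a router
    silently dropped the idle TCP connection overnight), discard the
    connection and fall through to a fresh one."""

    def __init__(self, connect, validate, size=5):
        self._connect = connect
        self._validate = validate
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(connect())

    def borrow(self):
        while True:
            try:
                conn = self._idle.get_nowait()
            except queue.Empty:
                return self._connect()      # pool exhausted: open a new one
            if self._validate(conn):        # e.g. run "SELECT 1" against it
                return conn
            # stale connection: drop it silently and try the next

    def release(self, conn):
        self._idle.put(conn)

# Demo with fake connections: the first two "go stale" overnight.
made = []
def connect():
    made.append(object())
    return made[-1]

stale = set()
pool = ValidatingPool(connect, validate=lambda c: c not in stale, size=3)
stale.update(made[:2])                      # simulate the router killing idle links
conn = pool.borrow()                        # stale connections are skipped
```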
Okay, here's the basic answer: you can crack this in .NET, PHP, Java or Python. Within that, you'll find any number of frameworks. Most of these frameworks are production ready.
So, what you and your team know or like becomes a huge issue. Learning a radically different framework will take time. Often personal preference will prove to be the deciding factor.
However, there is one major factor that should be taken into account: cost. Whilst .NET probably has the best tooling of any environment, it's pretty tied to Windows. If you're planning on 100 machines in a server farm, the year-on-year cost of running your final solution needs to be computed.
Finally, read Release It. The question of how you monitor and instrument your code is probably even more important than any feature you've mentioned.
If the monolith is truly modularized, then a failed module should not bring the whole system down. This means that every module should have its own database, for example.
Another requirement for good modularization is that a module should not make a synchronous call to another module. If one module needs data from other modules, they should do this in the background, outside a user's request.
The aggregation and orchestration of multi-module requests should be done in the Application layer. For example, if a query needs data from modules A and B, the Application sends a sub-query to A, then a sub-query to B, and then combines the results and returns the response to the client. During this request, A may not query B or vice versa. In case of partial failure, the Application may return a partial response or an error.
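A minimal sketch of that orchestration, with hypothetical module calls, might look like this:

```python
def orchestrate(query_a, query_b):
    """Application-layer aggregation: fan out to modules A and B,
    combine what succeeded, and degrade gracefully on partial failure.
    The module functions are illustrative stand-ins."""
    result, errors = {}, {}
    for name, call in (("a", query_a), ("b", query_b)):
        try:
            result[name] = call()
        except Exception as e:
            errors[name] = str(e)
    if not result:
        raise RuntimeError("all modules failed: %s" % errors)
    return {"data": result, "partial": bool(errors)}

def module_a():
    return {"orders": 3}

def module_b():
    raise TimeoutError("module B unavailable")

resp = orchestrate(module_a, module_b)
# resp == {"data": {"a": {"orders": 3}}, "partial": True}
```

Note that A and B never call each other here; only the Application layer sees both, which is what keeps one module's failure from cascading into the other.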
Also, you should have a monitoring solution for each module. This is required especially because modules have background tasks that may fail and you need to know when and how they fail.
I recommend the book Release It! for this matter.
P.S. You don't need to go to microservices just for this; a well-designed monolith is better.
In high-level stuff, exceptions; in low-level stuff, error codes.
The default behaviour of an exception is to unwind the stack and stop the program. If I'm writing a script and I go for a key that's not in a dictionary, it's probably an error, and I want the program to halt and let me know all about it.
If, however, I'm writing a piece of code which I must know the behaviour of in every possible situation, then I want error codes. Otherwise I have to know every exception that can be thrown by every line in my function to know what it will do (Read The Exception That Grounded an Airline to get an idea of how tricky this is). It's tedious and hard to write code that reacts appropriately to every situation (including the unhappy ones), but that's because writing error-free code is tedious and hard, not because you're passing error codes.
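The contrast can be shown in a few lines; the names here are purely illustrative:

```python
# Script-style: let the exception unwind the stack and halt loudly.
config = {"host": "db1"}
# port = config["port"]   # would raise KeyError and stop the program

# Error-code style: the return value spells out what happened, and the
# caller is forced to handle every case explicitly at the call site.
def get_port(cfg):
    if "port" not in cfg:
        return None, "missing 'port' key"
    return cfg["port"], None

port, err = get_port(config)
if err:
    port = 5432  # fall back to a default instead of crashing
```

The error-code version is more tedious to write, but its behaviour in the unhappy path is visible right there in the function, rather than depending on whichever handler happens to be somewhere up the stack.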
Both Raymond Chen and Joel have made some eloquent arguments against using exceptions for everything.
I like a couple of Pragmatic Programmers' books on this subject, Ship it! and Release it!. Together, they teach a lot of real-world, pragmatic stuff about such things as build systems and how to design well-deployable programs.
My "step-by-step" process would be this:
Book tip: Release It!