1) The naive approach is to assign each write to a random chunk. This makes reads a lot more expensive, because a read for a particular key (e.g. a device) now has to touch every chunk.
2) If you know a particular key is hot, you can spread writes for that particular key across random chunks. You need some extra bookkeeping to keep track of which keys you are doing this for.
3) Splitting hot chunks into smaller chunks. You will wind up with varying sized chunks, but each chunk will now have a roughly equal write volume.
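A minimal sketch of option 2, assuming a fixed number of chunks and a hand-maintained set of hot keys. The names `NUM_CHUNKS`, `SALT_BUCKETS`, `hot_keys`, `write_chunk` and `read_chunks` are made up for illustration and do not come from any particular database:

```python
import random
import zlib
from typing import Set

NUM_CHUNKS = 8            # hypothetical number of chunks
SALT_BUCKETS = 4          # hypothetical number of sub-keys per hot key
hot_keys = {"device-42"}  # the "extra bookkeeping": keys known to be hot


def chunk_for(routing_key: str) -> int:
    """Deterministically map a routing key to a chunk."""
    return zlib.crc32(routing_key.encode()) % NUM_CHUNKS


def write_chunk(key: str) -> int:
    """Pick the chunk for a write; hot keys get a random salt appended."""
    if key in hot_keys:
        salt = random.randrange(SALT_BUCKETS)
        return chunk_for(f"{key}#{salt}")
    return chunk_for(key)


def read_chunks(key: str) -> Set[int]:
    """Return every chunk a read for this key has to consult."""
    if key in hot_keys:
        return {chunk_for(f"{key}#{s}") for s in range(SALT_BUCKETS)}
    return {chunk_for(key)}
```

Reads for a hot key fan out to at most `SALT_BUCKETS` chunks instead of all of them, while reads for every other key still touch a single chunk.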
One more approach I would like to add is rate-limiting. If the reads or writes for a particular key cross some threshold, you can drop any additional operations. Of course, this only works if you are OK with operations on hot keys often failing.
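Roughly what per-key rate-limiting could look like, as a fixed-window counter. This is only a sketch: the class name `KeyRateLimiter` and its parameters are hypothetical, and a real system might prefer a token bucket or a sliding window:

```python
import time


class KeyRateLimiter:
    """Fixed-window limiter: at most `limit` operations per key per window."""

    def __init__(self, limit: int, window_seconds: float, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock        # injectable so the limiter can be tested
        self.window_start = {}    # key -> start time of its current window
        self.counts = {}          # key -> operations seen in that window

    def allow(self, key: str) -> bool:
        """Return True if the operation may proceed, False if it is dropped."""
        now = self.clock()
        start = self.window_start.get(key)
        if start is None or now - start >= self.window:
            # First operation for this key, or the old window has expired.
            self.window_start[key] = now
            self.counts[key] = 0
        if self.counts[key] >= self.limit:
            return False
        self.counts[key] += 1
        return True
```

Callers would check `limiter.allow(key)` before each read or write and fail the operation when it returns `False`.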
I've read a handful of books on general design/architectural stuff involving large pots of data. Designing Data-Intensive Applications is my favorite.
Also 3 different management books. The Manager's Path is my favorite in that camp.
I am a fan of Designing Data-Intensive Applications.
Designing Data-Intensive Applications by Martin Kleppmann is a solid overview of the field and gives you plenty more references for further investigation. It starts with single-host databases and expands out to all kinds of distributed systems. Starting with single-host systems is important because it helps you appreciate the designs of the distributed systems that replaced them.
Edit: markdown is hard
On a side note, I am currently reading https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321. Loving it so far. The author clearly explains the difference between the relational and document models.
This book is very good for distributed systems at a high level.
I've been reading Designing Data-Intensive Applications by Martin Kleppmann and I would recommend it to all backend developers out there who want to step up their game.
(I also love that it's a language-agnostic book)
The reason you can't find data engineering materials online is that real data engineering really only happens at a handful of companies, and those companies maintain this knowledge base internally and do not share it.
I noticed that you listed tools / frameworks to learn, as well as languages. Another piece of advice would be to not focus on those because they come and go (for example, Hadoop is pretty much deprecated in any DE-heavy company). What lasts is an understanding of distributed systems, distributed query engines, storage technologies, and algorithms & data structures. If you have a firm grasp on those, you won't have to start from scratch every time a new framework is introduced. You'll immediately recognize what problems the tech is solving and how it's solving them, and based on your knowledge you can connect the dots and know if that solution is what you need.
Another thing to do is watch CS186 from Berkeley in its entirety. This course is about relational databases, but will give you the foundation you need to speak the DE language.
Source: I work as a data engineer at what some would call a big company :)
I read through this book last year when I saw it recommended on HN. I recommended it to engineers on my team at work.
I’m reading it for a second time now, and just finished chapter 2 today. It’s dense but an amazingly detailed and thorough text.
* The Go Programming Language
* Building Microservices
Plan to do next:
* Designing Data-Intensive Applications
* Designing Distributed Systems
* Unix and Linux System Administration 5th ed, but probably just gonna skip around and read chapters of interest, i.e. I wanna get a better understanding of systemd.
Read last month:
* Learning React
Good for a quick intro, but I probably wouldn't read it cover-to-cover again. Some sections are outdated, but overall an OK book.
* React Design Patterns and Best Practices
Really liked this one, picked up a ton of new ideas and approaches that are hard to find otherwise for a newbie in the JS scene. These two books, some time spent reading up on webpack and lots of GitHub/practice code made me not scared of JS anymore and not feeling the fatigue. I mean, I was one of the people who dismissed everything frontend related: big node_modules, Electron, complicated build systems etc. But now I sort of understand why and am on the other side of the fence.
* Flexbox in CSS
Wanted to understand what the new flexbox layout is about, since it's been a while since I've done any serious CSS work. Long story short, I made it about halfway through and dropped it - not any more useful than the MDN docs, and actually playing with someone's CodePen gave me a better understanding in 5 minutes than 3 hours spent with this book.
An overview of databases (what and why, but also a lot of how) plus distributed concepts and modern architectures.
Here is a quick excerpt; this book is filled to the brim with these gems.
> The final twist of the Twitter anecdote: now that approach 2 is robustly implemented, Twitter is moving to a hybrid of both approaches. Most users’ tweets continue to be fanned out to home timelines at the time when they are posted, but a small number of users with a very large number of followers (i.e., celebrities) are excepted from this fan-out. Tweets from any celebrities that a user may follow are fetched separately and merged with that user’s home timeline when it is read, like in approach 1. This hybrid approach is able to deliver consistently good performance.
Approach 1 keeps a global collection of tweets; when a user reads their home timeline, the tweets of everyone they follow are looked up and merged at that point.
Approach 2 keeps a cache for each user's home timeline, similar to a mailbox: when a user posts a tweet, it is inserted into the timeline of each of their followers.
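To make the hybrid concrete, here is a toy in-memory sketch. Everything in it is an assumption for illustration (the `celebrities` set, the dict-based stores, the function names); it is not how Twitter actually implements this:

```python
from collections import defaultdict

celebrities = {"star"}                # hypothetical: accounts excluded from fan-out
followers = defaultdict(set)          # author -> users who follow them
following = defaultdict(set)          # user -> authors they follow
home_timelines = defaultdict(list)    # user -> cached tweets (approach 2)
celebrity_tweets = defaultdict(list)  # author -> tweets fetched at read time


def follow(user, author):
    followers[author].add(user)
    following[user].add(author)


def post_tweet(author, ts, text):
    tweet = (ts, author, text)
    if author in celebrities:
        # Too many followers to fan out: store once, merge when timelines are read.
        celebrity_tweets[author].append(tweet)
    else:
        # Fan out on write: push into every follower's cached home timeline.
        for user in followers[author]:
            home_timelines[user].append(tweet)


def read_timeline(user):
    # Start from the cached timeline, then merge in celebrity tweets (approach 1).
    merged = list(home_timelines.get(user, []))
    for author in following.get(user, ()):
        if author in celebrities:
            merged.extend(celebrity_tweets.get(author, []))
    return sorted(merged, reverse=True)  # newest first
```

Posting is cheap for celebrities (one write) and reading stays cheap for everyone else (one cached list plus a handful of celebrity lookups).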
The Architecture of Open Source Applications series is a good one for learning how to build production applications and you can read it online. The chapter on Scalable Web Architecture is a must-read.
 https://www.amazon.com/Designing-Data-Intensive-Applications... https://news.ycombinator.com/item?id=15428526
Clean Code: A Handbook of Agile Software Craftsmanship is a great book on writing and reading code.
Similarly, Clean Architecture: A Craftsman's Guide to Software Structure and Design is, no surprise, a book on organizing and architecting software.
Designing Data-Intensive Applications may be overkill for your situation, but it's a good read to get an idea about how large scale applications function.
The Architecture of Open Source Applications is a fantastic free resource that walks through how many applications are built. As another comment mentioned, reading code and understanding how other programs are built are great ways to build your "how to do things" repertoire.
Finally, I'd also recommend taking some classes. I started as a self-taught developer, but I've since taken classes both in-person and online that have been a tremendous help. There are many available for free online, and if in-person classes work better for you (motivation, support, resources, etc), definitely go that route. They're a fantastic way to grow.