Auto-sync: 2026-04-22 04:02
openclaw/intent-ux-jakobnielsenphd-20260421.md
> **Summary**: AI is not just a better chat box. It changes the user’s role from operator to supervisor, which forces UX to move from command-based interaction toward intent-based delegation, new usability metrics, orchestration layers, calibrated friction, and ultimately exploration-based interaction to clarify the user’s needs.
The most important thing about AI as an interface is not that it chats in natural language. It is that it changes the user’s *role*. AI changes computing from command-based interaction to intent-based outcome specification: the user states the result to be achieved, and the system determines the procedure.
In **batch** systems, the user submitted the whole workflow at once. In **command**-based systems, the user and computer alternated turns. In **intent**-based systems, the AI will infer and execute the workflow itself: You no longer tell the computer *how*. You tell it *what* you want accomplished, and it figures out the rest.

*In command-based interaction, you strike every blow (click every icon) to gradually produce what you want, inspecting and correcting the intermediate work product at every step. (NotebookLM)*

*Intent-based outcome specification is similar to how a Viking jarl (chief) would order, “get me silver from an English monastery,” setting in motion a chain of events that starts with the weaponsmith making the shields and ends with the raid. He doesn’t have to specify these steps because the Vikings already know what to do. Using AI is the same. (NotebookLM)*
An **intent** is not merely a wish expressed in natural language. A usable intent has at least three parts: the desired outcome, the constraints that bound acceptable behavior, and the delegation boundary that defines what the system is allowed to do. “Plan my Chicago trip” is underspecified unless the AI also knows the budget, the immovable meetings, and whether it may purchase tickets or only prepare options. Much of AI UX will therefore consist of helping users express not only what they want, but what the system is allowed to assume, optimize, and execute.
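
To make the three parts concrete, here is a minimal sketch of how an interface might represent an intent and notice what is still missing. The class name `Intent`, its fields, and the `missing_parts` helper are illustrative assumptions, not an established schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Intent:
    """Hypothetical record of a usable intent (all names are illustrative)."""
    outcome: str                                               # the desired result
    constraints: List[str] = field(default_factory=list)       # bounds on acceptable behavior
    allowed_actions: List[str] = field(default_factory=list)   # the delegation boundary

    def missing_parts(self) -> List[str]:
        """Return the parts the interface still needs to elicit from the user."""
        missing = []
        if not self.outcome.strip():
            missing.append("outcome")
        if not self.constraints:
            missing.append("constraints")
        if not self.allowed_actions:
            missing.append("delegation boundary")
        return missing


vague = Intent(outcome="Plan my Chicago trip")
usable = Intent(
    outcome="Plan my Chicago trip",
    constraints=["stay within the stated budget", "immovable meetings stay fixed"],
    allowed_actions=["prepare options"],  # purchasing tickets is not delegated
)
print(vague.missing_parts())   # ['constraints', 'delegation boundary']
print(usable.missing_parts())  # []
```

A design along these lines lets the interface ask only for the parts that are absent instead of requiring a complete prompt up front.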
Intent-driven interaction shifts the locus of control rather than being a cosmetic change in input modality. While the [GUI was a massive leap](https://www.uxtigers.com/post/gui-history), the shift from typing commands to clicking them was much smaller than the AI-driven change in interaction design. As I pointed out when I [identified intent-based outcome specification](https://www.uxtigers.com/post/ai-new-ui-paradigm) as the AI interaction modality at the dawn of modern AI in May 2023, this is an **entirely new UI paradigm**, and the first major shift in 60 years since we changed from batch processing to commands.
With a paradigm change in the UI, it stands to reason that we also need a paradigm shift in design and usability. What users do is being flipped, and UX must change with our users. AI changes the *interaction grammar* more than it changes any one screen: intent-based interaction is not just a new input method. It changes **where decisions happen**, **who bears the cognitive load**, and **what “error” means**.
In command-based interfaces (including GUIs), the human forms a plan internally and then executes it through controls. We’ve had the design goal to make the computer “transparent” precisely because it stays inside the user’s plan. This is one reason direct manipulation felt so powerful: operating on visible objects with immediate feedback let users focus on tasks rather than on the system.
In intent-based interfaces, the user externalizes part of the plan: they are no longer navigating, but delegating. The system must now interpret the goal, choose subgoals, schedule actions, acquire permissions, and handle exceptions. That pushes the system into a classic automation role, which human factors research has studied for decades: once automation takes over planning and action selection, the user shifts from operator to supervisor. Supervisory control has different failure modes than direct manipulation, and it demands different design safeguards.

*Users are changing from doing the work (operating the UI) to supervising the work. (NotebookLM)*
The winning system of the next decade will not be the one with the most aesthetically pleasing buttons, nor will it be the one with the fewest screens. It will be the system that best understands the human’s “job to be done,” autonomously selects the right tools on their behalf, clearly shows the user what is about to happen, and gracefully recovers when the user’s context is incomplete or ambiguous.
## The Three Eras of UX Goals
UX design has never had one fixed goal. The goal has shifted twice already, and it’s shifting again.

*The three goals of UX design: productivity, influence, and augmentation. (NotebookLM)*
**Era 1, Business Computing (1960–1995).** The dominant applications were accounting software, word processors, payroll systems. The UX goal was **productivity**: help people learn the software faster, make fewer errors, get more done per hour. I used to tell clients that their training budget was a pork chop ready to be eaten by usability: a well-designed system could cut onboarding time in half.
**Era 2, The Internet (1995–2025).** The web shifted the UX goal to **influence**: get users to buy, subscribe, share, or scroll long enough to see another ad. This era leaned heavily on [Robert Cialdini’s influence principles](https://www.uxtigers.com/post/explore-discover), such as reciprocity, social proof, and scarcity. It also gave us [dark patterns](https://www.uxtigers.com/post/dark-design) and infinite scroll. If you don’t pay for the product, you *are* the product.
**Era 3, AI (2026 onward).** The goal shifts again, to something harder to name: **augmenting human existence**. When AI handles execution of routine tasks, human energy is freed for imagination, judgment, and meaning-making. Doug Engelbart’s original vision was to “augment the human intellect.” That framing is too narrow now. The goal of UX in the AI era is to expand what humans can do and be, not only what we can accomplish in software, but what we can decide, imagine, and coordinate. Usability, therefore, shifts from removing friction in predetermined paths to expanding the range of viable paths, opening up possibilities we haven’t yet imagined.

*AI can help us reach new heights and explore fabulous new vistas. Our design goal is no longer simply productivity or selling; it’s augmenting human existence. (NotebookLM)*
When I present this 3-stage process of changing UX goals, I often get pushback from naïve designers who resent the implication that the main goal of their existence has been to manipulate customers. However, while becoming master manipulators might not have been the reason they embarked on a design career as idealistic youngsters, it was what they needed to do to thrive in the Internet business environment. The reason companies pay for design is to get customers to buy more and users to look at more advertisements.
In fact, one of the reasons I’m a big AI fan is that I never liked the business goals of Internet design. Of course, we’ll still need to persuade customers to buy. That will never change. But persuasion changes from manipulating humans by exploiting our many cognitive biases and weaknesses to providing clean information to AI agents that will do the buying.
## The Short-Term Crisis: The Articulation Barrier
Current chat-based AI interfaces suffer from severe usability problems. The intent-based paradigm demands that users write out their problems as prose text. However, as repeatedly demonstrated by literacy research, about half the population in rich countries like the United States and Germany is classified as low-literacy, with results being even worse in poor countries.
Writing new descriptive prose is cognitively more challenging than reading existing text. This creates an immense [articulation barrier](https://www.uxtigers.com/post/ai-articulation-barrier). It gives a massive advantage to the small fraction of the population with extraordinarily strong literacy skills. The very existence of “prompt engineering” advice is empirical evidence of this deep-rooted usability failure. If users are forced to learn arcane methods to tickle an AI into coughing up the right result, the interface fails human-centered design standards.

*The articulation barrier is the problem of making your intent clear. It’s often hard to put something into words, especially if the goal is inherently nonverbal, like the shape of something, or if the user has low literacy skills. (NotebookLM)*
In the short term, UX professionals must design to overcome this articulation barrier. We cannot rely on users generating perfect text from a blank canvas. [Prompt augmentation](https://www.uxtigers.com/post/prompt-augmentation) and [aided prompt understanding](https://www.uxtigers.com/post/prompt-understanding) are two sets of design patterns to help users refine their intent for AI.

*Style galleries are one of the design patterns for prompt augmentation. It’s easier to select something you like from a range of styles than it is to describe the style in words. (NotebookLM)*
The articulation barrier is also a memory problem. If users must restate their preferences, recurring constraints, tone of voice, risk tolerance, and exceptions in every session, the interface remains unusable no matter how fluent the model sounds. A mature intent-based system, therefore, needs a visible, editable user model: a place where people can inspect what the AI believes about them, correct it, override it temporarily, or tell it to forget. In the AI era, memory becomes a first-class UX surface.
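
As a rough sketch of what a visible, editable user model could look like behind the interface, the class below supports the four operations named above: inspect, correct, override temporarily, and forget. The class and method names are hypothetical, and a real system would also need persistence and access control.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class UserModel:
    """Hypothetical store of what the AI believes about one user."""
    beliefs: Dict[str, str] = field(default_factory=dict)            # long-term memory
    session_overrides: Dict[str, str] = field(default_factory=dict)  # temporary changes

    def inspect(self) -> Dict[str, str]:
        """Show the effective beliefs, with temporary overrides applied on top."""
        return {**self.beliefs, **self.session_overrides}

    def correct(self, key: str, value: str) -> None:
        """Permanently fix a wrong belief."""
        self.beliefs[key] = value

    def override_for_session(self, key: str, value: str) -> None:
        """Change a belief for the current session only."""
        self.session_overrides[key] = value

    def end_session(self) -> None:
        """Drop all temporary overrides."""
        self.session_overrides.clear()

    def forget(self, key: str) -> None:
        """Remove a belief everywhere it is stored."""
        self.beliefs.pop(key, None)
        self.session_overrides.pop(key, None)


model = UserModel(beliefs={"tone": "formal", "risk_tolerance": "low"})
model.override_for_session("tone", "casual")   # applies to this session only
```

Keeping overrides separate from long-term beliefs is one way to make "just for today" changes safe to try.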
In the long run, we need a new approach to designing intent-based interactions.
## Redefining Usability Metrics
Because the locus of control has reversed, the [core usability metrics](https://www.uxtigers.com/post/what-is-ux) we have used for decades to evaluate UX must be completely rewritten. In the command-based paradigm, usability was measured by how efficiently a user could learn and execute the steps to accomplish a task. My [ten classic heuristics](https://www.uxtigers.com/post/10-heuristics-reimagined) assumed a human navigating a structured interface one step at a time.
In an intent-based ecosystem, the [system acts probabilistically](https://www.uxtigers.com/post/ai-uncertainty-ux) rather than deterministically. Usability is no longer judged by the elegance of the steps on screen, but by the quality of the machine’s understanding and the safety of its execution.
My classic usability heuristics will still hold, but must be reinterpreted. “Visibility of system status” used to mean: show progress through a sequence of steps the user chose. In an agentic workflow, it becomes: show *what the system believes the user intends*, what it is doing to satisfy that intention, and what it plans to do next, even when none of those steps were explicitly requested. “User control and freedom” used to mean: allow undo, cancel, and escape from a dialog or flow. In an intent-based environment, it becomes: allow interruption of an executing plan, allow correction of misunderstood intent, and allow safe rollback across multiple systems. Undo is harder when the system has already sent an email, booked a ticket, or modified a shared document. The old principle becomes more important, but also more expensive to implement.
The evaluation of a successful interface shifts:
- **From Discoverability to Intent Capture:** Can the system accurately map a vague natural-language request to a highly structured machine action? Did it infer the goal, constraints, and priorities correctly?
- **From Error Prevention to Clarification Quality:** Because we cannot [disable invalid buttons](https://www.uxtigers.com/post/inactive-buttons) to prevent hallucination, the metric shifts to how gracefully the system handles ambiguity. Does the system ask the right follow-up questions at the right time? The best clarifying question is the smallest intervention that prevents the largest mistake.
- **From “Time to Learn” to “Ease of Delegation”:** Traditional UI learnability becomes less relevant when there are no menu hierarchies to understand and navigate. The primary metric becomes how comfortably a user can delegate a multi-step objective without fearing catastrophic failure. Time-to-correct becomes far more important.
- **From Execution Efficiency to Verification Efficiency (Evaluability):** In command-based UIs, the user’s primary cognitive load was executing the task step-by-step. In intent-based systems, *execution* is cheap, but *evaluation* becomes the bottleneck. The usability metric shifts to how rapidly and accurately a user can verify that the AI’s output matches their actual goal. Interfaces must be optimized for “evaluability,” allowing users to judge quality and appropriateness (whether the AI’s work is fit for its external purpose) without painstakingly combing through every detail of the result.

*Changing the usability goal from making it easy to make something to making it easy to evaluate the quality and suitability of what was made. (NotebookLM)*
- **From Visibility of System Status to Execution Transparency:** The system must project an accurate mental model of its operational plan *before* and *during* execution. It must show what it believes the user intends and what it plans to do next.
- **From User Satisfaction to Trust Calibration:** Do users rely on the agent appropriately, neither over-trusting nor under-using it? Trust is no longer a soft emotional byproduct; it is the primary functional metric of an intent-based system. Trust calibration also depends on showing why the system preferred one plan over another. A good orchestration UI should be able to say, in effect, “I chose Plan A over Plan B because cost mattered more than speed,” or “This recommendation would change if your deadline moved by two days.” Counterfactual explanation is often more useful than a generic confidence score because it teaches users the model’s decision logic and shows where intervention would matter.

*How much do you trust your AI agent? Do you want to give it your entire sack of silver, or just a coin or two? (NotebookLM)*
These changes imply a different UX measurement toolkit. *Time-on-task* is less important when the human contribution is “say what you want” (and the AI then spends hours performing the task), but *time-to-correct* becomes a central metric. Traditional *error counts* must be split into user slips versus system misinterpretations. *Satisfaction* becomes increasingly bound to perceived agency: users can be pleased with outcomes but still feel uneasy if they cannot tell what happened or why.
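
One possible way to operationalize this toolkit is sketched below. It assumes a study log with one record per delegation episode; the `Episode` fields and the definitions of over-trust (relying on a wrong result) and under-use (not relying on a correct one) are my simplifying assumptions, not standardized measures.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, List, Optional


@dataclass
class Episode:
    """Hypothetical log record for one delegated task."""
    agent_correct: bool                  # did the output match the user's real goal?
    user_relied: bool                    # did the user accept it without redoing the work?
    error_source: Optional[str]          # "user_slip", "system_misinterpretation", or None
    seconds_to_correct: Optional[float]  # time spent fixing the result, if it was fixed


def summarize(episodes: List[Episode]) -> Dict[str, Optional[float]]:
    if not episodes:
        raise ValueError("need at least one episode")
    n = len(episodes)
    fixes = [e.seconds_to_correct for e in episodes if e.seconds_to_correct is not None]
    return {
        "median_time_to_correct": median(fixes) if fixes else None,
        # Error counts split by who caused the error
        "user_slips": sum(e.error_source == "user_slip" for e in episodes),
        "system_misinterpretations": sum(
            e.error_source == "system_misinterpretation" for e in episodes
        ),
        # Trust calibration: both rates should be low for well-calibrated users
        "over_trust_rate": sum(e.user_relied and not e.agent_correct for e in episodes) / n,
        "under_use_rate": sum(e.agent_correct and not e.user_relied for e in episodes) / n,
    }


log = [
    Episode(True, True, None, None),                          # appropriate reliance
    Episode(False, True, "system_misinterpretation", None),   # over-trust
    Episode(True, False, None, None),                         # under-use
    Episode(False, False, "system_misinterpretation", 90.0),  # caught and corrected
    Episode(False, False, "user_slip", 30.0),                 # caught and corrected
]
report = summarize(log)
```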
## The Triple-Layered Design Model
At first glance, [“UI is dead,”](https://www.uxtigers.com/post/ux-roundup-20250825) since users will interact with AI agents more than they’ll be clicking around apps or websites.
However, the GUI will not disappear; it will be demoted. The screen stops being the place where work *begins*, and instead becomes the place where work is inspected, negotiated, and corrected. As software shifts from isolated apps toward task orchestration, mature intent-based systems will settle into a triple-layered design model.

*The three layers of AI user experience architecture: intent, orchestration, and direct manipulation. (NotebookLM)*
**1\. The Intent Surface:** This is the first layer, where the user states an outcome. It must be highly context-aware, accepting multimodal inputs like voice, text, screen context, or camera data to overcome the articulation barrier. As this layer matures, it will increasingly rely on **implicit intent inference**. By synthesizing ambient context (e.g., calendar events, active screen content, cursor hesitations, and historical routines), the system can proactively offer high-probability intents for the user to simply confirm, overcoming the articulation barrier by drafting the prompt for them.
**2\. The Orchestration Surface:** This is the critical negotiation layer. Before an agent executes high-stakes actions, it must reveal its proposed plan, expose the provenance of its data, and seek consent. This UI functions as an audit layer. It visualizes steps, provides execution transparency, and manages “permission choreography.” Preview is not enough. Intent-based systems also need explicit post-action receipts. After an agent completes a task, the UI should summarize what it changed, which systems it touched, what assumptions it used, and what can still be undone. In traditional GUIs, the user often knew what happened because they executed each step themselves. In agentic systems, that implicit knowledge disappears. The system must manufacture legibility after the fact.
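
A post-action receipt could be assembled from a log of what the agent did. The following sketch covers the four items listed above (what changed, which systems were touched, which assumptions were used, and what can still be undone); `ActionRecord` and `render_receipt` are hypothetical names, and the example data is invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ActionRecord:
    """Hypothetical log entry for one completed agent action."""
    system: str       # which external system was touched
    change: str       # human-readable description of the change
    reversible: bool  # can the change still be undone?


def render_receipt(actions: List[ActionRecord], assumptions: List[str]) -> str:
    """Build a plain-text receipt for the user to review after the run."""
    lines = ["Changed:"]
    lines += [f"  - [{a.system}] {a.change}" for a in actions]
    lines.append("Systems touched: " + ", ".join(sorted({a.system for a in actions})))
    lines.append("Assumptions: " + "; ".join(assumptions))
    undoable = [a.change for a in actions if a.reversible]
    lines.append("Can still be undone: " + (", ".join(undoable) or "nothing"))
    return "\n".join(lines)


receipt = render_receipt(
    [
        ActionRecord("calendar", "moved standup to 10:00", reversible=True),
        ActionRecord("email", "sent new time to attendees", reversible=False),
    ],
    assumptions=["attendees prefer mornings"],
)
print(receipt)
```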
Most important work is not solitary. In organizations, the agent acts inside shared systems, shared budgets, and shared responsibilities. The orchestration layer must therefore show not only what it plans to do for *me*, but also *who else* will be affected, which policies constrain the action, and who inherits the consequences. Intent in enterprise UX is never just personal preference; it is personal preference filtered through institutional rules. The orchestration surface must therefore resolve **collaborative intent** by flagging conflicting directives from multiple human stakeholders or specialized AI sub-agents, and negotiating consensus before execution. Recognizing the need to support and coordinate multiple users, rather than just a single user, becomes more important in AI systems than in traditional GUI design.
**3\. The Direct-Manipulation Surface:** The traditional GUI remains intact as a fallback layer. This is the familiar world of tapping, dragging, and scrubbing, reserved for edge-case editing, granular corrections, and emergency overrides. In a mature intent UI, the screen becomes where work is **inspected**, **negotiated**, and **corrected**, because the work itself is done off-screen by AI.
Thus, [direct manipulation](https://www.uxtigers.com/post/direct-manipulation) does not die; it migrates one level higher in the abstraction stack. Instead of manipulating raw controls, users will manipulate *plans*. They will drag a task from “later” to “now,” scrub through a proposed sequence on a timeline, tap a source chip to check provenance, or reorder a travel itinerary. That is still direct manipulation, retaining the biological satisfaction of shaping causality, just applied at a higher level of abstraction.
## Supervisory Control and Intentional Cognitive Friction
Because of the phenomenological gap introduced by intent-based interfaces, in which actions occur offscreen without direct bodily involvement, the user’s role shifts profoundly. The correct analogy is no longer *driving* a car; it is *managing* a chauffeur.
This supervisory control requires a completely different set of design principles. The instinct of every UX designer trained in the command-based era is to ruthlessly eradicate friction. For routine, low-stakes tasks (sorting spam, scheduling a recurring meeting), the frictionless ideal remains correct. But for high-stakes tasks (e.g., financial transactions, medical decisions, sending sensitive emails), the interface must intentionally slow the user down.
Autonomy should be earned rather than granted all at once. An effective agent should begin in a conservative mode that drafts, prepares, and asks for confirmation, while accumulating a performance history inside a bounded domain. As reliability becomes evident, the interface can let the user widen the agent’s action budget: first draft, then prepare, then execute low-risk actions, and only later touch high-stakes or externally visible systems. The right model is not binary autonomy versus manual control. It is progressive delegation.
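
Progressive delegation can be sketched as a ladder of autonomy levels, where the interface may only offer the next level once a track record exists, and the user must approve each step. The four level names follow the sequence above; the thresholds (`runs_per_step`, `min_success_rate`) and the rule that history restarts at each level are assumptions chosen for illustration.

```python
from enum import IntEnum


class Autonomy(IntEnum):
    DRAFT = 1
    PREPARE = 2
    EXECUTE_LOW_RISK = 3
    EXECUTE_HIGH_STAKES = 4


class DelegationLadder:
    """Hypothetical tracker for one agent inside one bounded domain."""

    def __init__(self, runs_per_step: int = 20, min_success_rate: float = 0.95):
        self.level = Autonomy.DRAFT          # start in the most conservative mode
        self.runs_per_step = runs_per_step
        self.min_success_rate = min_success_rate
        self.successes = 0
        self.failures = 0

    def record(self, success: bool) -> None:
        """Add one completed run to the performance history."""
        if success:
            self.successes += 1
        else:
            self.failures += 1

    def can_offer_promotion(self) -> bool:
        """True when the history at the current level justifies offering more autonomy."""
        total = self.successes + self.failures
        return (
            self.level < Autonomy.EXECUTE_HIGH_STAKES
            and total >= max(1, self.runs_per_step)
            and self.successes / total >= self.min_success_rate
        )

    def promote(self, user_approved: bool) -> Autonomy:
        """Widen the action budget by one step, but only with the user's consent."""
        if user_approved and self.can_offer_promotion():
            self.level = Autonomy(self.level + 1)
            self.successes = self.failures = 0  # history restarts at the new level
        return self.level
```

Note that the agent never promotes itself: `promote` does nothing unless the user has approved.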
We must choreograph *intentional cognitive friction*. Generative AI often delivers synthesized answers that feel flawlessly authoritative, leading to the Plausibility Trap. Because the interface is clean and instant, authority bias takes over, tempting the user to skip critical analysis.
To combat this dangerous automation bias, we must force a moment of reflection. When an AI proposes moving $500, we should not offer a frictionless “Approve All” button. We must use granular authorization, artificial time delays (like a three-second countdown), and provenance highlighting to ensure the human remains cognitively responsible for the outcome.

*At appropriate points in the workflow, make the user pause to ensure everything is right. (NotebookLM)*
Friction shouldn’t just be a blanket delay; it should be applied surgically. The UX must visually communicate the AI’s confidence levels so the user knows exactly where to apply their cognitive effort. We need Epistemic UIs: interfaces that visually map the system’s uncertainty. Instead of presenting synthesized answers as monolithic, authoritative truths, the UI should highlight probabilistic leaps, flag data with weak provenance, and color-code confidence levels. By visualizing the AI’s own doubt, the interface directs human cognitive energy precisely to the areas requiring judgment, transforming friction from a blunt delay into a precision tool.

*Epistemic UI: when we don’t know what lies ahead (for example, what creature made this footprint), we should be explicit about our level of uncertainty to improve decision quality. (NotebookLM)*
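
An Epistemic UI needs some rule for turning uncertainty into visual treatment. The sketch below color-codes claims and orders them so the least certain come first. It assumes the system can supply a confidence score and a provenance flag for each claim, which is itself a strong assumption; the 0.6 cutoff and the names `Claim`, `review_flag`, and `review_queue` are illustrative.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Claim:
    """Hypothetical unit of AI output that can be judged on its own."""
    text: str
    confidence: float  # 0.0 to 1.0, assumed to be supplied by the system
    has_source: bool   # whether the claim traces back to a known source


def review_flag(claim: Claim) -> str:
    """Map a claim to a color code for display."""
    if not claim.has_source:
        return "red"    # weak provenance: always needs human judgment
    if claim.confidence < 0.6:
        return "amber"  # sourced, but the system is unsure
    return "green"


def review_queue(claims: List[Claim]) -> List[Claim]:
    """Order claims so the user's attention goes to the most doubtful first."""
    priority = {"red": 0, "amber": 1, "green": 2}
    return sorted(claims, key=lambda c: (priority[review_flag(c)], c.confidence))


claims = [
    Claim("Revenue grew last quarter", 0.9, True),
    Claim("Competitor will exit the market", 0.8, False),
    Claim("Churn is driven by pricing", 0.4, True),
]
queue = review_queue(claims)
```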
Naturally, the threshold for this friction must be deeply context-aware. A $500 transfer requires high friction in a personal banking app, but is a frictionless, automated rounding error for a corporate finance AI. Just as human organizations use escalating approval ladders for larger expenditures, AI UX must dynamically scale cognitive friction based on the user’s role, the organization’s risk tolerance, and the reversibility of the action.
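
A context-aware friction rule might look like the following sketch. The single `routine_limit` parameter stands in for the combination of user role and organizational risk tolerance, and the three friction levels and the example limits are assumptions made for illustration only.

```python
def friction_level(amount: float, routine_limit: float, reversible: bool) -> str:
    """Pick how much friction to apply before an agent moves money.

    routine_limit is the largest amount treated as routine for this user's
    role and this organization's risk tolerance (a hypothetical setting).
    """
    if amount <= routine_limit:
        return "none"                # routine: execute without interruption
    if reversible:
        return "confirm"             # above routine, but mistakes can be undone
    return "confirm_with_delay"      # above routine and irreversible: force a pause


# The same $500 transfer in two contexts (limits are invented examples):
personal = friction_level(500, routine_limit=50, reversible=False)
corporate = friction_level(500, routine_limit=10_000, reversible=False)
print(personal)   # confirm_with_delay
print(corporate)  # none
```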
User experience for AI agents will be similar to traditional management techniques in many cases. Similar, not identical, of course: many existing management methods are intended to deal with managing human underlings who suffer from human weaknesses. When managing AI agents, we’ll tweak our old management lessons to account for AI’s weaknesses.
## Slow AI: The Return of Zombie UX
As we entrust AI with increasingly complex workflows, we face a bizarre blast from the past: the Zombie UX of batch processing is being revived. While simple chat queries take seconds, powerful AI tools like Deep Research or video-generation models can take anywhere from 10 minutes to several hours to complete a run. We are rapidly approaching a reality where AI agents will run independently for 30 hours or even days to orchestrate massive tasks.
When turn-taking interaction is destroyed by extreme delays, we must design for “[Slow AI](https://www.uxtigers.com/post/slow-ai).” Waiting hours for results creates intense anxiety regarding whether the AI is heading in the right direction.

*Sometimes AI takes forever to deliver results. We need to design for this reality, because it will only get worse with increasing AI capabilities and task horizons. (NotebookLM)*
To maintain user control, Slow AI requires distinct UX interventions:
**1\. Clarification and Run Contracts:** A slow AI should never guess a user’s intent. It must ask clarifying questions upfront. It should then present an explicit run contract showing the estimated time window, a cost cap, the definition of “done,” and hard boundaries (e.g., “will not email external parties”). We will need new usability research to replace our old response time guidelines.
**2\. Conceptual Breadcrumbs:** Traditional percentage bars are useless for 10-hour tasks. Instead of just showing technical logs, the AI must provide “Conceptual Breadcrumbs” as short, synthesized summaries of intermediate conclusions. If the AI reports a flawed conclusion early on, the user can intervene immediately.
**3\. Context Reboarding:** When a task takes 30 hours, users will context switch and forget what they originally asked for. The UI must gracefully reboard the user with a Resumption Summary: reminding them of the original intent, key decisions made during the run, and the current status.
**4\. Tiered Notifications:** We must employ context-aware attention management. Notifications should be tiered: immediate push notifications only for critical blocks requiring user intervention, low-priority emails for decisions that simply affect quality, and batched digests for task completions.
**5\. Progressive Disclosure and Salvage Value:** Long-running tasks aggressively exacerbate the sunk cost fallacy. Users will accept substandard work simply because they waited 20 hours for it. The UI must progressively disclose partial results (rough outlines, wireframes) so users can course-correct early. Crucially, if a user stops a run, the UI must explicitly show the “salvage value” (which intermediate artifacts can be reused), making frictionless restarts less psychologically painful.

*Even when AI fails, you may be able to reuse part of what it did, reducing the pain of the sunk cost of an extended AI run. (NotebookLM)*
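
Two of the interventions above, the run contract and tiered notifications, lend themselves to a small sketch. The contract fields mirror the four items listed under point 1, and the channel mapping mirrors the three tiers under point 4; the class name, the event names, and the action identifiers such as `email_external` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RunContract:
    """Hypothetical agreement shown to the user before a long run starts."""
    max_hours: float              # estimated time window
    cost_cap: float               # spending limit for the run
    done_definition: str          # what counts as "done"
    forbidden_actions: List[str]  # hard boundaries the agent must not cross

    def violation(self, hours_elapsed: float, cost_so_far: float,
                  next_action: str) -> Optional[str]:
        """Return a reason to halt and ask the user, or None if the run may continue."""
        if next_action in self.forbidden_actions:
            return f"forbidden action: {next_action}"
        if cost_so_far > self.cost_cap:
            return "cost cap exceeded"
        if hours_elapsed > self.max_hours:
            return "time window exceeded"
        return None


def notification_channel(event: str) -> str:
    """Route an event to a channel according to how urgently the user is needed."""
    channels = {
        "blocked": "push",            # critical block requiring intervention
        "quality_decision": "email",  # decision that only affects quality
        "completed": "digest",        # task completion, batched
    }
    return channels.get(event, "digest")  # unknown events default to the quietest tier


contract = RunContract(
    max_hours=30,
    cost_cap=50.0,
    done_definition="report with cited sources",
    forbidden_actions=["email_external"],
)
```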
## The Long-Term Vision: Exploring Latent Space
Looking further ahead into the AI Era, [creativity shifts from making to discovery](https://www.uxtigers.com/post/explore-discover). We are moving away from **building** (pre-AI) and **describing** (current intent-based generation) toward **exploring** a latent solution space created by AI.

*Only as you are navigating through the latent space of AI options do you discover what is there and which turn you want to take next in the journey towards the as-yet unknown destination. (NotebookLM)*
Since AI generates a thousand competent solutions in a minute, the user’s primary need is no longer production, but discovery. Iteration stops being mainly about fixing mistakes and becomes a way of exploring a multidimensional solution space. However, current UIs are far too linear, relying on the old-school “Back” button. The future of UX requires UI support for navigating a multi-branched exploration. We will need tools like “Look Lock” to freeze certain semantic styles or visual invariants while we explore adjacent dimensions. Future interfaces will feel less like pathways and more like collaborative playgrounds.
**“Intent by discovery”** should become the future of human-AI interaction. Don’t assume that users know what they want. Help them recognize it progressively by reacting to alternatives, locking in what matters, and exploring adjacent possibilities.

*Once you discover a new land, you may recognize it as your desired destination. (NotebookLM)*
While highly effective, current design patterns for prompt augmentation are essentially putting training wheels on a text box. Prompt augmentation still forces the user through a linguistic bottleneck, assuming they *have* a specific intent but simply lack the vocabulary. To fully support intent by discovery, UX must abandon the chat box as the default AI interaction model and stretch into multi-modal, spatial, and behavioral paradigms.
Here are my predictions for how UX design might evolve to support intent by discovery beyond simple prompting.
## 1\. Spatial Navigation of Latent Space
Currently, AI interfaces operate a bit like a slot machine: you pull the lever (prompt) and get a discrete result. In the future, UX will allow users to navigate the AI’s latent space (the multidimensional map of all possible solutions) visually and spatially.
**Semantic Topographies:** Instead of typing “make the design more professional but slightly playful,” the user might be presented with an interactive 2D map of generated outputs. Dragging a cursor across this space morphs the output in real-time. The user discovers their intent by seamlessly exploring adjacent possibilities, stopping when the output simply “feels right.” Such visual exploration will require real-time AI generation of updated alternatives, and we’re luckily already seeing improved models that emphasize fast response time.
**Divergent Routing:** Because humans are better at recognizing a solution than describing it, UIs will heavily leverage divergent generation. The AI generates edge-case variations and asks, “Better 1 or better 2?” The user’s selections iteratively narrow down the infinite possibility space through pure recognition, bypassing recall entirely.
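
The “Better 1 or better 2?” loop can be illustrated with a toy version that narrows a single style dimension (say, playfulness from 0 to 1) using only pairwise choices. This is a ternary-search-style sketch of my own, not a description of any shipping product: it assumes the user's preference has a single peak along the dimension, and the simulated user with a hidden ideal value exists only to make the example runnable.

```python
from typing import Callable


def narrow_by_choice(prefers_first: Callable[[float, float], bool],
                     low: float = 0.0, high: float = 1.0,
                     rounds: int = 12) -> float:
    """Narrow a one-dimensional style range through pairwise recognition.

    prefers_first(a, b) should return True when the user likes the output
    generated at value a better than the output generated at value b.
    """
    for _ in range(rounds):
        a = low + (high - low) / 3   # candidate 1
        b = high - (high - low) / 3  # candidate 2
        if prefers_first(a, b):
            high = b  # the ideal cannot lie in the upper third
        else:
            low = a   # the ideal cannot lie in the lower third
    return (low + high) / 2


# Simulated user: never states the ideal, only recognizes the closer candidate.
hidden_ideal = 0.7
estimate = narrow_by_choice(lambda a, b: abs(a - hidden_ideal) < abs(b - hidden_ideal))
```

Each choice discards a third of the remaining range, so a dozen quick comparisons locate the preferred value to within about one percent of the full range, without the user ever having to describe it.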
## 2\. Direct Object Manipulation (Blending GUI and AI)
One of the major regressions of current chat-based AI is the loss of direct manipulation: the tactile tweaking we perfected in the GUI era. The future of intent by discovery will hybridize the two paradigms.
Users will refine their intent by physically altering the AI’s output. If an AI generates a website mockup or a floor plan, and the user drags a hero image or a wall to make it larger, the AI doesn’t just register a coordinate change. It reverse-engineers the underlying intent (“Ah, the user prioritizes visual impact and open space”) and automatically adjusts the typography, lighting, or secondary elements to maintain coherence. The tactile action becomes the prompt.
## 3\. Socratic Scaffolding
To support discovery, the system must stop being a passive order taker waiting for a master prompt, and become an active interviewer.
**Progressive Probing:** If a user’s initial intent is vague (“I need a strategy for a product launch”), the AI pauses instead of hallucinating a generic 10-page document. It responds with diagnostic questions or visual counterfactuals: “Are we optimizing for immediate revenue or long-term brand awareness?” By proactively presenting constraints, the AI helps the user chisel away at the marble until their exact intent is revealed.

*The Greek philosopher Socrates famously taught his students by asking them questions. Similarly, AI can help users achieve their goals by asking insightful, probing questions. (NotebookLM)*
## 4\. Ephemeral and Generative UIs
We are accustomed to static interfaces where the controls (dropdowns, menus) are always the same. In an era of intent by discovery, [Generative UI](https://www.uxtigers.com/post/generative-ui-google) will generate the interface itself on the fly based on the user’s emerging context.
|
||||
|
||||
If the AI detects that a user is exploring the mood of a generated piece of music or the logic of a database schema, it will dynamically spawn bespoke UI controls (custom sliders, visual node-graphs, or reference boards) just for that specific moment of discovery. Once the intent is locked in, those specific UI controls dissolve.
|
||||
|
||||
## 5\. Curation as Intent
|
||||
|
||||
Text is a low-bandwidth way to communicate complex ideas, vibes, or aesthetics. Intent by discovery will increasingly rely on multimodal curation, similar to Midjourney’s Mood Boards.
|
||||
|
||||
Instead of typing out a description, a user might dump a cluster of disorganized artifacts onto a digital canvas: a PDF of a competitor’s report, a color palette from a photograph, and a 10-second voice memo. The system organizes them, finds the conceptual overlaps, and synthesizes a starting point. The user discovers their intent by seeing how the AI conceptually connects their fragmented inspirations.
|
||||
|
||||

|
||||
|
||||
*As a Viking raider, you may discover that you like amber and arm rings by curating your preferred items from the loot. (NotebookLM)*
|
||||
|
||||
## 6\. Subtractive Sculpting
|
||||
|
||||
The current prompting paradigm is *additive*: the user builds an outcome by adding more words. But discovery is often much easier when it is *subtractive*.
|
||||
|
||||
Future AI UX will frequently rely on generating an overwhelming, maximalist version of an artifact (a hyper-detailed document, a complex piece of code, a busy design). The user’s interaction model is then based on deleting, striking through, and whittling away the parts they don’t want. It is infinitely easier for a human to edit and remove than to generate from a blank screen.
|
||||
|
||||

|
||||
|
||||
*Subtractive sculpting: start with something big and whittle away until only something much nicer remains. (NotebookLM)*
|
||||
|
||||
## The Future Role of the UX Designer
|
||||
|
||||
In this new paradigm, the role of the UX designer shifts dramatically. Instead of designing linear user flows (Screen A → Screen B → Screen C), designers will **architect possibility spaces**.
|
||||
|
||||
They will design the boundary constraints, the physics of the latent space, and the feedback loops of these generative environments. Prompt augmentation is a vital bridge for the present moment, but by fully embracing my vision of “intent by discovery,” the UX of the future will treat AI not as a command-line terminal disguised as a chat window, but as a fluid, co-navigational environment where the need to write a “prompt” eventually disappears entirely.
|
||||
|
||||
Yet, we must be cautious about the industry’s obsession with the zero-learning ideal. A utopian vision where users merely express a wish and the AI seamlessly executes it offscreen carries a hidden cost. If users never need to learn how a system works, navigate a hierarchy, or make decisions, they suffer cognitive offloading and deskilling. They become mere passengers in their digital lives, trapped in a “Cognitive Atrophy Loop” in which analytical engagement degrades.
|
||||
|
||||

|
||||
|
||||
*If users have nothing to do, they risk cognitive atrophy from checking out and ignoring what goes on around them. (NotebookLM)*
|
||||
|
||||
This is the ultimate imperative for UX professionals. Our designs must not act as cognitive wheelchairs that replace human agency; they must act as cognitive exoskeletons that support and enhance human flourishing, even as traditional work vanishes. Good AI UX will teach *just enough*, reveal plan structures, and leave a comprehensible trail of action so users can maintain digital judgment.
|
||||
|
||||
What disappears is the assumption that the human is executing the tedious steps. We are entering a complex era of managing autonomous chauffeurs. The winning designs of the next decade will be those that understand the job to be done, orchestrate the solution transparently across a triple-layered interface, demand friction where stakes are high, and preserve unmistakable moments of human authority.
|
||||
|
||||
Designing that delicate relationship of delegation without surrender is the great UX challenge of the next decade. Let’s get started.
|
||||
|
||||
## Action Items
|
||||
|
||||
If you’re designing AI interfaces now, here’s where to focus:
|
||||
|
||||
- **Measure intent capture, not click efficiency.** Build evaluation frameworks around how accurately the system infers user goals, not how quickly users navigate menus that will no longer exist.
|
||||
- **Design the orchestration layer.** The negotiation surface between intent and action is where trust is built or lost. Most teams are ignoring it.
|
||||
- **Choreograph friction deliberately.** Map your task inventory by stakes. For high-stakes irreversible actions, friction is not a design failure, it’s safety.
|
||||
- **Plan for slow tasks from day one.** Run contracts, conceptual breadcrumbs, and salvage-value disclosure are not edge cases. They’re core interaction patterns for anything that runs longer than a few minutes.
|
||||
- **Resist the zero-learning trap.** Design systems that keep users cognitively engaged with what the AI is doing and why. Delegation without understanding is not empowerment.
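
The "choreograph friction" item can be sketched as a small policy function. This is only one possible mapping, under the assumption that each task in the inventory has been labeled with a stakes level and a reversibility flag; the `Friction` levels and the `friction_for` name are invented for the example.

```python
from enum import Enum


class Friction(Enum):
    NONE = "run silently"
    CONFIRM = "ask for a quick confirmation"
    REVIEW = "show the plan and require explicit approval"


def friction_for(stakes: str, reversible: bool) -> Friction:
    """Map a task's stakes ('low' or 'high') and reversibility to a friction level."""
    if stakes == "high" and not reversible:
        return Friction.REVIEW      # high-stakes, irreversible: friction is safety
    if stakes == "high" or not reversible:
        return Friction.CONFIRM     # one risk factor: a lightweight checkpoint
    return Friction.NONE            # low-stakes and undoable: stay out of the way


print(friction_for("high", reversible=False).value)
print(friction_for("low", reversible=True).value)
```

Keeping the policy in one place makes it reviewable: a team can argue about a single table of stakes and friction levels instead of about individual dialogs.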
|
||||
|
||||
The command-based paradigm served us magnificently for sixty years. The heuristics and usability guidelines we developed for it represent genuine intellectual achievement. But the world is shifting under our feet, and the UX profession must shift with it: not by abandoning what we know, but by recognizing that the definition of usability itself is being rewritten.
|
||||
|
||||

|
||||
|
||||
*Summary infographic. (NotebookLM)*
|
||||
|
||||
## About the Author
|
||||
|
||||
Jakob Nielsen, Ph.D., is a usability pioneer with [43 years of experience in UX](https://www.uxtigers.com/post/41-years-in-ux) and the Founder of [UX Tigers](https://www.uxtigers.com/). He founded the discount usability movement for fast and cheap iterative design, including heuristic evaluation and the [10 usability heuristics](https://www.uxtigers.com/post/10-heuristics-reimagined). He formulated the eponymous [Jakob’s Law of the Internet User Experience](https://www.uxtigers.com/post/jakobs-law). He has been named “the king of usability” by *Internet Magazine*, “the guru of Web page usability” by *The New York Times*, and “the next best thing to a true time machine” by *USA Today*.
|
||||
|
||||
Previously, Dr. Nielsen was a Sun Microsystems Distinguished Engineer and a Member of Research Staff at Bell Communications Research, the branch of Bell Labs owned by the Regional Bell Operating Companies. He is the author of 8 books, including the best-selling *Designing Web Usability: The Practice of Simplicity* (published in 22 languages), the foundational *Usability Engineering* ([30,073 citations in Google Scholar](https://scholar.google.com/citations?hl=en&user=y5uL3wUAAAAJ)), and the pioneering *Hypertext and Hypermedia* (published two years before the Web launched).
|
||||
|
||||
Dr. Nielsen holds 79 United States patents, mainly on making the Internet easier to use. He received the Lifetime Achievement Award for Human–Computer Interaction Practice from ACM SIGCHI and was named a “Titan of Human Factors” by the Human Factors and Ergonomics Society.
|
||||
|
||||
· Subscribe to [Jakob’s newsletter](https://jakobnielsenphd.substack.com/) to get the full text of new articles emailed to you as soon as they are published.
|
||||
|
||||
· [Follow Jakob on LinkedIn](http://www.linkedin.com/comm/mynetwork/discovery-see-all?usecase=PEOPLE_FOLLOWS&followMember=jakobnielsenphd).
|
||||
|
||||
· Read: [article about Jakob Nielsen’s career in UX](https://www.uxtigers.com/post/41-years-in-ux)
|
||||
|
||||
· Watch: [Jakob Nielsen’s first 41 years in UX](https://www.youtube.com/watch?v=MPmVa_vKeF4) (8 min. video)
|
||||
@@ -14,6 +14,9 @@ tags: []
|
||||
|
||||
| Date | Time | Server | Backup file | Status |
| ---------- | ----- | -------- | ------------------------------------ | ---- |
| 2026-04-21 | 22:00 | Mac Mini | openclaw-macmini-20260421220026.tar | ✅ Success |
| 2026-04-21 | 22:00 | Ubuntu1 | openclaw-ubuntu1-20260421220257.tar | ✅ Success |
| 2026-04-21 | 22:00 | Ubuntu2 | openclaw-ubuntu2-20260421220530.tar | ✅ Success |
| 2026-04-20 | 22:00 | Mac Mini | openclaw-macmini-20260420220009.tar | ✅ Success |
| 2026-04-20 | 22:00 | Ubuntu1 | openclaw-ubuntu1-20260420220049.tar | ✅ Success |
| 2026-04-20 | 22:00 | Ubuntu2 | openclaw-ubuntu2-20260420220049.tar | ✅ Success |
|
||||
|
||||
136
wiki/Caddy.md
Normal file
@@ -0,0 +1,136 @@
|
||||
---
|
||||
title: "Caddy"
|
||||
type: entity
|
||||
aliases: [Caddy Web Server, Caddy反代]
|
||||
tags: [web-server, reverse-proxy, https, open-source]
|
||||
---
|
||||
|
||||
# Caddy
|
||||
|
||||
## Overview
|
||||
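**Caddy** is an open-source web server written in Go, known for automatic HTTPS and concise configuration. Unlike Nginx, Caddy enables HTTPS by default (certificates are obtained automatically from Let's Encrypt), and its configuration syntax is simpler and more readable.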
|
||||
|
||||
## Core Features
|
||||
|
||||
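| Feature | Description |
|------|------|
| **Automatic HTTPS** | Obtains and renews TLS certificates from Let's Encrypt automatically |
| **Automatic HTTP→HTTPS redirect** | No manual configuration needed |
| **TLS 1.3 support** | Modern encryption standard |
| **Configuration hot reload** | Configuration changes apply without restarting the service |
| **Reverse proxy** | Supports HTTP/2 and WebSocket |
| **Markdown rendering** | Built-in static file serving |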
|
||||
|
||||
## Installation (Ubuntu/Debian)
|
||||
```bash
|
||||
sudo apt install -y debian-keyring debian-archive-keyring apt-transport-https curl
|
||||
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/gpg.key' | sudo gpg --dearmor -o /usr/share/keyrings/caddy-stable-archive-keyring.gpg
|
||||
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/debian.deb.txt' | sudo tee /etc/apt/sources.list.d/caddy-stable.list
|
||||
sudo apt update
|
||||
sudo apt install caddy
|
||||
```
|
||||
|
||||
## Basic Configuration (Caddyfile)
|
||||
|
||||
### Simple Reverse Proxy

```
n8n.ishenwei.online {
    reverse_proxy 127.0.0.1:15678
}
```

### Multiple Domains

```
nas.ishenwei.online {
    reverse_proxy 127.0.0.1:15000
}

grafana.ishenwei.online {
    reverse_proxy 127.0.0.1:13000
}
```

### Reverse Proxy with Authentication

```
n8n.ishenwei.online {
    basicauth /* {
        admin JDJhJDE0JDN3ZXVhV2YyZG9SY2hvYzVmZ2h3QUlVblpOMU4vS1ptcENrSlhySElMb3l5dytOMkh0Tk93
    }
    reverse_proxy 127.0.0.1:15678
}
```
|
||||
|
||||
## Integration with frp
|
||||
|
||||
Typical architecture: frp establishes the tunnel into the private network → Caddy reverse-proxies to a local port → automatic HTTPS

```
User requests https://n8n.ishenwei.online
    ↓
Alibaba Cloud DNS → VPS public IP
    ↓
Caddy (port 443) receives the request
    ↓
Caddyfile site block matches n8n.ishenwei.online
    ↓
reverse_proxy 127.0.0.1:15678
    ↓
frps listens on VPS port 15678
    ↓
frp tunnel → port 5678 on the internal Ubuntu host
    ↓
n8n service
```
|
||||
|
||||
## Common Commands
|
||||
|
||||
```bash
# Validate the Caddyfile syntax
sudo caddy validate --config /etc/caddy/Caddyfile

# Reload the configuration (hot reload)
sudo systemctl reload caddy

# Restart the service
sudo systemctl restart caddy

# Check the service status
sudo systemctl status caddy

# Emergency recovery (when the service hangs)
sudo systemctl stop caddy
sudo pkill -9 caddy
sudo systemctl start caddy
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Caddyfile Syntax Check

```bash
sudo caddy validate --config /etc/caddy/Caddyfile
# "Valid configuration" in the output means the syntax is correct
```

### Port Already in Use

If Caddy fails to start, check whether ports 80 and 443 are already in use:

```bash
ss -ltnp | grep ':80\|:443'
```
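
Where `ss` is not available, a similar check can be approximated in Python. This is a minimal sketch, not a replacement for `ss`: the helper name `port_in_use` is made up for the example, it only detects listeners reachable on the given host (127.0.0.1 by default), and it cannot tell which process owns the port.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception
        return sock.connect_ex((host, port)) == 0


for port in (80, 443):
    state = "in use" if port_in_use(port) else "free"
    print(f"port {port}: {state}")
```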
|
||||
|
||||
### Caddy Unexpectedly Occupying a Port

Some one-click install scripts configure Caddy to listen on a non-standard port. Check the Caddyfile for a block like this:

```
:7000 {
    reverse_proxy ...
}
```
|
||||
|
||||
## Related Concepts
|
||||
- [[反向代理]] — reverse proxying, Caddy's core function
- [[Let's Encrypt]] — the certificate authority Caddy uses automatically
- [[frp]] — often used together with Caddy
- [[VPS]] — Caddy is usually deployed on a public VPS
|
||||
|
||||
## References
|
||||
- Official site: https://caddyserver.com/
- Documentation: https://caddyserver.com/docs/
|
||||
64
wiki/LinuxServer.io.md
Normal file
@@ -0,0 +1,64 @@
|
||||
# LinuxServer.io
|
||||
|
||||
## Type
|
||||
- Entity
|
||||
- Organization / Open Source Project
|
||||
|
||||
## Description
|
||||
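LinuxServer.io is a community-driven open-source organization that maintains high-quality Docker images for popular self-hosted applications. All images follow a standardized configuration pattern: PUID/PGID environment variables, a consistent directory layout, conventional default ports for the web UI, and full documentation.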
|
||||
|
||||
## Key Facts
|
||||
- **Focus**: Self-hosted applications on Docker
|
||||
- **Image Registry**: lscr.io (Docker Hub official partner)
|
||||
- **Standard Variables**: PUID, PGID, TZ, UMASK_SET
|
||||
- **Notable Images**: transmission, jellyfin, navidrome, plex, sonarr, radarr, jackett, SABnzbd, home-assistant, nginx, nextcloud, portainer, it-tools, etc.
|
||||
- **License**: Various (per image)
|
||||
|
||||
## Standard Configuration Pattern
|
||||
|
||||
All LinuxServer.io images follow the same configuration conventions:

### Environment Variables

```yaml
environment:
  - PUID=1000   # Process UID (host user ID)
  - PGID=1000   # Process GID (host group ID)
  - TZ=Etc/UTC  # Timezone
  # Image-specific variables below
```
|
||||
|
||||
### Volume Mounts
|
||||
```yaml
volumes:
  - /path/to/config:/config        # configuration directory
  - /path/to/downloads:/downloads  # optional: download directory
```

### Restart Policy

```yaml
restart: unless-stopped  # restart automatically after the container exits unexpectedly
```

### Network Mode

```yaml
network_mode: bridge  # bridge network, supports port mapping
```
|
||||
|
||||
## Notable Images in shenwei's Home Server
|
||||
|
||||
| Image | Purpose | Port |
|-------|---------|------|
| [[Transmission]] | BitTorrent download client | 9091 |
| [[Jellyfin]] | Video streaming server | 8096 |
| [[Navidrome]] | Music streaming server | 4533 |
| portainer | Docker container management | 9000 |
| it-tools | IT toolbox | 8080 |
| home-assistant | Smart home control | 8123 |
|
||||
|
||||
## Connections
|
||||
- [[Transmission]] — maintained image
|
||||
- [[Jellyfin]] — maintained image
|
||||
- [[Navidrome]] — maintained image
|
||||
- [[Docker]] — deployment platform
|
||||
- [[群晖 NAS]] — common deployment platform (Synology NAS)
|
||||
- [[Docker Compose]] — recommended deployment method
|
||||
43
wiki/PUID-PGID.md
Normal file
@@ -0,0 +1,43 @@
|
||||
# PUID/PGID
|
||||
|
||||
## Type
|
||||
- Concept
|
||||
|
||||
## Definition
|
||||
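PUID (Process User ID) and PGID (Process Group ID) are environment variables specific to LinuxServer.io Docker images. They make the process inside the container run as a specified host user and group, which resolves the ownership problems that otherwise arise with files the container creates.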
|
||||
|
||||
## Mechanism
|
||||
|
||||
### Core Principle

```
Host: user shenwei (UID=1000, GID=1000)
    ↓ set PUID=1000, PGID=1000
Container: the Transmission process runs as UID=1000, GID=1000
    ↓ result
Files created by the container → owned by shenwei:shenwei → directly readable and writable on the host
```

### Getting the Host UID/GID

```bash
id shenwei
# Output: uid=1000(shenwei) gid=1000(shenwei) groups=1000(shenwei),...
```
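
The same lookup can be scripted. The sketch below reads the current user's UID and GID and prints them as Compose `environment` entries; the helper name `lsio_environment` is an assumption made for this example, and `os.getuid()` / `os.getgid()` are only available on Unix-like systems.

```python
import os
from typing import List


def lsio_environment(uid: int, gid: int) -> List[str]:
    """Format a UID/GID pair as PUID/PGID environment entries."""
    return [f"PUID={uid}", f"PGID={gid}"]


# Use the IDs of the user running this script (Unix only).
for entry in lsio_environment(os.getuid(), os.getgid()):
    print(f"  - {entry}")
```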
|
||||
|
||||
### Docker Compose Example

```yaml
environment:
  - PUID=1000  # matches the host user ID
  - PGID=1000  # matches the host group ID
```
|
||||
|
||||
## Key Claims
|
||||
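- PUID/PGID are the standardized user configuration for LinuxServer.io images. They are not the same as running the container as a non-root user through the Dockerfile `USER` instruction.
- Plain Docker containers often run as root (UID=0), so files they create on bind mounts belong to root:root. LinuxServer.io images without PUID/PGID fall back to an internal default user (UID/GID 911). Either way, the host user cannot manage the files directly.
- PUID/PGID resolve this file-permission conflict between the user inside the container and the ordinary user on the host.
- The effect is similar to the service-level Docker Compose key `user: "1000:1000"`, but PUID/PGID are handled by the init scripts inside LinuxServer.io images.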
|
||||
|
||||
## Relationship to [[LinuxServer.io]]
|
||||
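PUID/PGID are the standardized configuration environment variables across all LinuxServer.io images and are part of its officially recommended best practices.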
|
||||
|
||||
## Sources
|
||||
- [[用docker安装transmission]]
|
||||
112
wiki/TCP隧道.md
Normal file
@@ -0,0 +1,112 @@
|
||||
---
|
||||
title: "TCP 隧道"
|
||||
type: concept
|
||||
aliases: [TCP Tunnel, TCP端口转发, TCP代理]
|
||||
tags: [network, tunneling, protocol]
|
||||
---
|
||||
|
||||
# TCP 隧道
|
||||
|
||||
## Definition
|
||||
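A **TCP tunnel** establishes a virtual connection between two endpoints and transparently carries the traffic of a local TCP port to the remote endpoint. TCP tunnels are one of the basic mechanisms behind VPNs, reverse proxies, and NAT traversal.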
|
||||
|
||||
## How It Works
|
||||
|
||||
```
┌──────────────┐                         ┌──────────────┐
│  Local host  │                         │     VPS      │
│              │                         │              │
│  app → :22   │ ──── TCP tunnel ────→   │  :60022 ←────│──── external SSH client
│              │    (frp/nc/socat)       │              │
└──────────────┘                         └──────────────┘
```
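
To make the forwarding idea concrete, here is a minimal sketch of a plain TCP relay using Python's `asyncio`. It is not how frp works internally (frp has the internal client dial out to the server, which this sketch does not do); it only shows the byte-for-byte copying that any TCP forwarder performs. The names `pipe` and `start_forwarder` are invented for the example, and closing one direction closes the whole connection, which is a simplification.

```python
import asyncio


async def pipe(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    """Copy bytes one way until EOF, then close the destination."""
    try:
        while data := await reader.read(65536):
            writer.write(data)
            await writer.drain()
    except ConnectionError:
        pass  # the peer went away; fall through to cleanup
    finally:
        writer.close()


async def start_forwarder(listen_port: int, target_host: str, target_port: int,
                          listen_host: str = "127.0.0.1") -> asyncio.AbstractServer:
    """Accept connections on listen_port and relay them to target_host:target_port."""

    async def handle(client_reader, client_writer):
        try:
            upstream_reader, upstream_writer = await asyncio.open_connection(
                target_host, target_port)
        except OSError:
            client_writer.close()  # target unreachable: drop the client
            return
        # Relay both directions concurrently; the payload is never parsed.
        await asyncio.gather(
            pipe(client_reader, upstream_writer),
            pipe(upstream_reader, client_writer),
        )

    return await asyncio.start_server(handle, listen_host, listen_port)
```

To mirror the diagram, a coroutine could await `start_forwarder(60022, "127.0.0.1", 22)` and then call `serve_forever()` on the returned server. Because the relay never inspects the payload, it works equally for SSH, databases, or any other TCP protocol.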
|
||||
|
||||
## TCP vs HTTP/HTTPS Tunnels

| Feature | TCP tunnel | HTTP/HTTPS tunnel |
|------|----------|----------------|
| **Protocol** | Any TCP | HTTP/HTTPS |
| **Application-layer parsing** | ❌ Not parsed | ✅ Can be parsed |
| **WebSocket** | ⚠️ Carried transparently as raw TCP, not inspected | ✅ Supported |
| **Use cases** | SSH, databases, any TCP service | Web services |
| **Caddy support** | ❌ Not supported | ✅ Supported |
|
||||
|
||||
## frp TCP Mapping

### Configuration Example

```ini
# frpc.ini
[ssh]
type = tcp
local_ip = 127.0.0.1
local_port = 22
remote_port = 60022
```

### Access Path

```
External SSH client
    ↓
ssh -p 60022 user@vps-ip
    ↓
VPS :60022 (frps listening)
    ↓
frp tunnel
    ↓
Internal host :22
    ↓
Local SSH service
```
|
||||
|
||||
## Common Tools for TCP Tunneling
|
||||
|
||||
| Tool | Characteristics | Use case |
|------|------|---------|
| **frp** | High performance, includes a dashboard | NAT traversal |
| **socat** | Simple one-off forwarding | Temporary debugging |
| **netcat (nc)** | Basic port forwarding | Quick tests |
| **ssh -L** | Tunneling built into SSH | Temporary access |
| **stunnel** | SSL/TLS tunnel encryption | Secure forwarding |
|
||||
|
||||
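## Native SSH Tunnels (Alternative)

### Temporary Tunnels

```bash
# Local port forwarding: reach a service on the remote private network
ssh -L 8080:internal:80 user@vps

# Remote port forwarding: expose a local service to the public network
ssh -R 60022:localhost:22 user@vps
```

### Persistence

An SSH tunnel does not reconnect automatically after a network interruption, so frp is recommended for production use.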
|
||||
|
||||
## Security Considerations
|
||||
|
||||
### TCP Tunnel Caveats

1. **Bypasses the HTTP proxy**: TCP traffic is not parsed by Caddy/Nginx
2. **Directly exposed ports**: avoid exposing SSH on port 22 directly; use a non-standard port
3. **IP allowlist**: restrict source IPs at the firewall
4. **Public-key authentication**: disable password login

### Recommended Configuration

```bash
# VPS firewall: allow only a specific source IP
sudo ufw allow from <home_ip> to any port 60022 proto tcp

# SSH: disable password login by setting these options in sshd_config
sudo nano /etc/ssh/sshd_config
#   PasswordAuthentication no
#   PubkeyAuthentication yes
```
|
||||
|
||||
## Related Concepts
|
||||
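- [[内网穿透]] — NAT traversal, the typical use case for TCP tunnels
- [[frp]] — a tool that implements TCP tunnels
- [[反向代理]] — reverse proxying at the HTTP/HTTPS layer (complements the TCP layer)
- [[VPS]] — the public endpoint of a TCP tunnel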
|
||||
|
||||
## References
|
||||
- frp TCP: https://gofrp.org/docs/features/common-address-types/#tcp
|
||||
- SSH Tunneling: https://www.ssh.com/academy/ssh/tunneling
|
||||
77
wiki/Transmission.md
Normal file
@@ -0,0 +1,77 @@
|
||||
# Transmission
|
||||
|
||||
## Type
|
||||
- Entity
|
||||
- Software
|
||||
|
||||
## Description
|
||||
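Transmission is an open-source BitTorrent client with a simple web UI that supports remote management and automated downloads. It is the core component of the home server media center, responsible for the BitTorrent download stage.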
|
||||
|
||||
## Aliases
|
||||
- Transmission BitTorrent
|
||||
- lscr.io/linuxserver/transmission
|
||||
|
||||
## Key Facts
|
||||
- **License**: GPL-2.0 or GPL-3.0 (with some MIT-licensed parts)
|
||||
- **Language**: C ( GTK+ / Qt GUI / Web UI )
|
||||
- **Official Image**: lscr.io/linuxserver/transmission:latest
|
||||
- **Deployment**: Docker Compose on Synology NAS / Home Server
|
||||
- **Web UI Port**: 9091
|
||||
- **Peer Port**: 51413 (TCP + UDP)
|
||||
- **Author**: shenwei
|
||||
|
||||
## Configuration
|
||||
|
||||
### Web UI Access
|
||||
The web management interface is available at http://host:9091 and supports:
- Torrent management (add/pause/remove/priority)
- Download/upload speed limits
- Peer connection management
- Automatic RSS downloads (requires a plugin)

### Authentication

Web UI authentication is configured through the USER/PASS environment variables:

```yaml
environment:
  - USER=shenwei
  - PASS=<password>
```
|
||||
|
||||
### Docker Deployment (shenwei)
|
||||
```yaml
services:
  transmission:
    image: lscr.io/linuxserver/transmission:latest
    container_name: transmission
    restart: unless-stopped
    network_mode: bridge
    ports:
      - "9091:9091"        # Web UI
      - "51413:51413"      # Peer TCP
      - "51413:51413/udp"  # Peer UDP
    environment:
      - PUID=1000
      - PGID=1000
      - TZ=Etc/UTC
      - USER=shenwei
      - PASS=<password>
    volumes:
      - /home/shenwei/Docker/transmission/data:/config
      - /home/shenwei/Downloads:/downloads
```
|
||||
|
||||
## Connections
|
||||
|
||||
### Upstream
|
||||
- [[LinuxServer.io]] — maintainer of the Docker image

### Downstream

- [[Jellyfin]] — video playback (Transmission downloads → Jellyfin plays)
- [[Navidrome]] — music playback (Transmission downloads → Navidrome plays)
- [[Docker Compose]] — deployment method

### Related

- [[群晖 NAS]] — deployment platform (Synology NAS Docker environment)
|
||||
|
||||
## Sources
|
||||
- [[用docker安装transmission]]
|
||||
87
wiki/concepts/AI-ChatOps.md
Normal file
@@ -0,0 +1,87 @@
|
||||
---
|
||||
title: "AI ChatOps"
|
||||
tags:
|
||||
- devops
|
||||
- chatops
|
||||
- ai
|
||||
- collaboration
|
||||
- observability
|
||||
created: 2026-04-25
|
||||
---
|
||||
|
||||
# AI ChatOps
|
||||
|
||||
## Definition
|
||||
|
||||
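AI ChatOps is an operations collaboration model in which engineers troubleshoot through a natural-language interface (Slack / Teams / CLI) while AI provides log analysis and suggested fixes. Agentic AI acts as a 24/7 operations assistant that engineers can consult in conversation at any time.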
|
||||
|
||||
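## Differences from Traditional ChatOps

| Dimension | Traditional ChatOps | AI ChatOps |
|------|--------------------|------------|
| Availability | Depends on people being online | Immediate response, 24/7 |
| Diagnosis | Manual log searches | Automatic AI analysis + suggestions |
| Knowledge | Depends on individual experience | Aggregated cross-team knowledge |
| Learning | Experience is hard to replicate | Continuous learning + accumulated knowledge |
| Average response time | Minutes to hours | Milliseconds |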
|
||||
|
||||
## Agentic AI ChatOps Capabilities

```python
ChatOps_Capabilities = {
    "Log Query": "Natural-language log queries: 'Show me errors from API service in last hour'",
    "Incident Summary": "AI-generated incident summary: 'This is caused by X, fix is Y'",
    "Runbook Suggestion": "AI-recommended runbook: 'Based on error pattern, try runbook #42'",
    "Metric Correlation": "AI-correlated metrics: 'CPU spike correlates with DB connection pool'",
    "Action Execution": "AI-executed action: '/runbook restart-service api-gateway'",
    "Post-mortem": "AI-generated post-mortem: automatically generated incident timeline"
}
```
|
||||
|
||||
## Example

> Engineer in Slack:
> `@ai-ops Our API is slow, users are complaining`
>
> AI Response:
> ```
> 🔍 Analysis complete:
>
> Root Cause: External payment API timeout (upstream)
> - Payment API p99 latency: 15,000ms (normally 200ms)
> - Correlated: API gateway retries causing backpressure
>
> Suggested Actions:
> 1. Enable circuit breaker (auto-deploy via /ops deploy)
> 2. Fallback to cache for payment status (auto via /ops deploy)
> 3. Monitor: https://grafana.link/d/abc123
>
> Shall I proceed with option 1? (yes/no)
> ```
|
||||
|
||||
## Relationship to [[AIOps]]

AI ChatOps is the user-interaction layer of the [[AIOps]] capability matrix:

```python
AIOps_Capabilities = {
    "Anomaly Detection": "Detect anomalous patterns",
    "Root Cause Analysis": "Automated diagnosis",
    "Predictive Maintenance": "Predictive maintenance",
    "Smart Alerting": "Reduce alert fatigue",
    "Automated Remediation": "Automated fixes",
    "Capacity Optimization": "Capacity optimization",
    "AI ChatOps ←": "Natural-language interaction layer"  # ← this page
}
```
|
||||
|
||||
## Related Concepts
|
||||
|
||||
- [[AIOps]] — ChatOps is the user-facing interface of AIOps
- [[Root Cause Analysis]] — ChatOps relies on RCA capabilities
- [[Observability]] — ChatOps relies on observability data
- [[Incident Management]] — ChatOps speeds up incident response
|
||||
|
||||
## Related Sources
|
||||
|
||||
- [[how-agentic-ai-can-help-for-cloud-devops]]
|
||||
44
wiki/concepts/APT-仓库配置.md
Normal file
@@ -0,0 +1,44 @@
|
||||
---
|
||||
title: "APT 仓库配置"
|
||||
tags: [apt, ubuntu, repository]
|
||||
date: 2026-04-22
|
||||
---
|
||||
|
||||
# APT 仓库配置
|
||||
|
||||
## Definition
|
||||
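APT (Advanced Package Tool) repositories are the package sources for Ubuntu/Debian systems. Configuration files specify where packages are downloaded from and how their signatures are verified.

## Docker Official Repository Setup

1. Install the HTTPS transport dependencies: `sudo apt-get install ca-certificates curl`
2. Create the keyring directory: `sudo install -m 0755 -d /etc/apt/keyrings`
3. Download and import the GPG key
4. Add the repository source file under `/etc/apt/sources.list.d/`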
|
||||
|
||||
## Configuration File Locations

| Path | Purpose |
|------|------|
| `/etc/apt/sources.list` | Main system repository configuration |
| `/etc/apt/sources.list.d/*.list` | Third-party repository configuration |
| `/etc/apt/keyrings/` | GPG key storage directory |
|
||||
|
||||
## Docker Repository Source File Example

```bash
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
```
|
||||
|
||||
## Key Concepts
|
||||
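- **Architecture variable** `$(dpkg --print-architecture)`: adapts automatically to amd64, arm64, and other architectures
- **Version codename** `$VERSION_CODENAME`: the Ubuntu release codename (for example jammy or noble)
- **Signature verification** `signed-by`: specifies the path to the GPG key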
|
||||
|
||||
## Related Sources
|
||||
- [[如何在ubuntu-server安装-docker-docker-compose]] — complete walkthrough of the Docker APT repository setup

## Related Concepts

- [[GPG 密钥验证]] — signature verification for APT packages
- [[Docker Engine]] — installed from the APT repository
- [[Ubuntu Server]] — the host system for the APT package manager
|
||||
79
wiki/concepts/Asset-Management.md
Normal file
@@ -0,0 +1,79 @@
|
||||
---
|
||||
title: "Asset Management"
|
||||
type: concept
|
||||
tags: [itsm, operations, finops]
|
||||
date: 2025-03-01
|
||||
---
|
||||
|
||||
## Definition
|
||||
|
||||
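Asset Management is one of the core [[ITSM]] processes. It **tracks and manages IT assets across their full lifecycle, from procurement to retirement**, in order to optimize utilization, control costs, and ensure compliance.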
|
||||
|
||||
## Asset Lifecycle
|
||||
|
||||
```
┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐
│ Procure │ → │ Deploy  │ → │ Operate │ → │Maintain │ → │ Retire  │
└─────────┘   └─────────┘   └─────────┘   └─────────┘   └─────────┘
     ↓             ↓             ↓             ↓             ↓
  Purchase     Provision      Monitor    Patch/Update    Archive/
  Planning     & Config     & Optimize    & Upgrade      Dispose
```
|
||||
|
||||
## Asset Management Focus Areas
|
||||
|
||||
| Area | Description |
|------|------|
| Hardware Lifecycle | Lifecycle of servers and network equipment |
| Software Licensing | Software license management (SAM) |
| Cloud Optimization | Cloud resource cost optimization |
| Shadow IT Prevention | Discovery and governance of shadow IT |
| Compliance | License compliance audits |
|
||||
|
||||
## Modern Asset Management (ITSM 2.0)
|
||||
|
||||
In [[ITSM 2.0]], asset management is AI-driven:

### Intelligent Features

| Capability | Description | Value |
|------|------|------|
| Automated Lifecycle Tracking | Tracks asset status automatically | Less manual work |
| License Optimization | License usage analysis | Cost savings |
| Usage Analytics | Utilization analysis | Rightsizing |
| Cost Forecasting | Cost prediction | Budget planning |
| Cloud Resource Optimization | Cloud resource optimization | FinOps |
|
||||
|
||||
### Cloud-Optimized SAM (Software Asset Management)
|
||||
|
||||
```
Software License Usage
├── Used Licenses: 80%
├── Unused Licenses: 15%
└── Over-licensed: 5%
        ↓
AI Analysis
├── Reclaim unused
├── Optimize purchases
└── Prevent overruns
```
|
||||
|
||||
## Key Metrics
|
||||
|
||||
| Metric | Description |
|------|------|
| Asset Utilization Rate | Rate at which assets are utilized |
| License Compliance | License compliance rate |
| Shadow IT Count | Number of shadow IT instances |
| Cost per Asset | Cost of a single asset |
|
||||
|
||||
## Related Concepts
|
||||
|
||||
- [[ITSM]] — parent framework
- [[FinOps]] — cloud financial management
- [[Rightsizing]] — resource optimization
- [[Cloud-Optimization]] — cloud optimization

## Sources

- [[understanding-complete-itsm]] — Intelligent Asset Lifecycle Tracking
|
||||
83
wiki/concepts/Automated-Security-Audit.md
Normal file
@@ -0,0 +1,83 @@
|
||||
---
|
||||
title: "Automated Security Audit"
|
||||
tags:
|
||||
- devops
|
||||
- security
|
||||
- automation
|
||||
- compliance
|
||||
- ai
|
||||
created: 2026-04-25
|
||||
---
|
||||
|
||||
# Automated Security Audit
|
||||
|
||||
## Definition
|
||||
|
||||
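Automated Security Audit is the capability to use AI to scan IAM policies, network rules, and container vulnerabilities automatically, **detecting security risks and remediating them automatically**. Agentic AI continuously monitors the security posture and applies compliance fixes in real time.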
|
||||
|
||||
## Scope
|
||||
|
||||
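| Scan target | What is detected | Remediation |
|---------|---------|---------|
| IAM Policies | Excessive permissions, public-access risk | Restrict permissions automatically |
| Network Rules | Open ports, misconfigured security groups | Tighten rules automatically |
| Container Images | Known vulnerabilities (CVE) | Trigger rebuild + update |
| S3 Buckets | Public access, data-leak risk | Block public access automatically |
| Firewalls | Misconfiguration, overly broad inbound rules | Correct automatically |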
|
||||
|
||||
## Agentic AI Security Audit Workflow

```
1. Continuous scanning → AWS Inspector / GCP Security Command Center / Azure Defender
2. Risk assessment → CVSS score + business impact analysis
3. Automated remediation → low risk fixed automatically, high risk requires human approval
4. Compliance validation → continuous SOC 2 / FedRAMP / PCI DSS checks
5. Reporting → security posture dashboard + compliance reports
```
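
Step 3 of the workflow, the split between automatic fixes and human approval, could look roughly like the function below. The threshold of 7.0 is an assumption for illustration (it is where CVSS v3 starts its "High" band), and the function name `triage` and its return labels are invented here; a real policy would also weigh business impact, as step 2 notes.

```python
def triage(cvss_score: float, approval_threshold: float = 7.0) -> str:
    """Decide whether a finding may be fixed automatically or needs a human."""
    if not 0.0 <= cvss_score <= 10.0:
        raise ValueError("CVSS scores range from 0.0 to 10.0")
    if cvss_score >= approval_threshold:
        return "needs-human-approval"   # high risk: a person signs off first
    return "auto-remediate"             # lower risk: fix without waiting


for score in (3.1, 7.5, 9.8):
    print(score, triage(score))
```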
|
||||
|
||||
## Relationship to [[DevSecOps]]

Automated Security Audit is a core component of [[DevSecOps]] practice:

```python
DevSecOps_Pipeline = {
    "Build": "SAST (Static Application Security Testing)",
    "Test": "DAST (Dynamic Application Security Testing)",
    "Deploy": "Container Scanning ←",  # vulnerability scanning
    "Monitor": "Automated Security Audit ←",  # ← this page
    "Respond": "Automated threat mitigation"
}
```
|
||||
|
||||
## Example

> Agentic AI detects an over-permissive IAM role:
> - Role: `production-app-read-all`
> - Allows: `s3:*` on `arn:aws:s3:::customer-data-*`
> - Risk: Public access enabled on bucket
> - **AI Action**:
>   - Immediately restricts bucket policy
>   - Notifies DevOps team via Slack
>   - Creates Jira ticket for IAM review
>   - Logs audit trail for compliance
|
||||
|
||||
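## Relationship to Compliance Frameworks

| Compliance framework | How agentic AI supports it |
|---------|-------------------|
| SOC 2 | Continuous access auditing + change records |
| FedRAMP | Security configuration baseline checks + reporting |
| PCI DSS | Data access control + encryption verification |
| ISO 27001 | Risk assessment + remediation verification |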
|
||||
|
||||
## Related Concepts
|
||||
|
||||
- [[DevSecOps]] — Automated Security Audit is the technical foundation of DevSecOps
- [[Cloud Security]] — auditing is a core cloud security practice
- [[IAM]] — one of the main audit targets
- [[Compliance]] — audits support proof of compliance
|
||||
|
||||
## Related Sources
|
||||
|
||||
- [[how-agentic-ai-can-help-for-cloud-devops]]
|
||||
- [[cloud-devop-maturity-guideline]]
|
||||
84
wiki/concepts/Blue-Green-Deployment.md
Normal file
@@ -0,0 +1,84 @@
|
||||
---
|
||||
title: "Blue-Green Deployment"
|
||||
type: concept
|
||||
tags: [devops, deployment, release-management, high-availability]
|
||||
date: 2025-03-01
|
||||
---
|
||||
|
||||
## Definition
|
||||
|
||||
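Blue-green deployment is a zero-downtime release strategy. It maintains two identical production environments (blue and green) and switches traffic between them at the load balancer, which allows seamless deployment and fast rollback.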
|
||||
|
||||
## Architecture
|
||||
|
||||
```
              Load Balancer
                    │
         ┌──────────┴──────────┐
         │                     │
      Blue Env             Green Env
    (Production)           (Staging)
         │                     │
       v1.0               v1.1 (New)
         │
     Traffic ON           Traffic OFF
```
|
||||
|
||||
## Deployment Flow
|
||||
|
||||
```
1. Blue (v1.0) serving all traffic
2. Deploy to Green (v1.1)
3. Test/Verify Green
4. Switch LB to Green
5. Blue becomes standby (or update to next version)
```
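
The switch and rollback in steps 3 to 5 can be sketched as follows. This is a toy model under stated assumptions: `Router` stands in for a real load balancer, and `cut_over`, `roll_back`, and the `health_check` callback are names invented for the example rather than any particular tool's API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Router:
    """Stand-in for a load balancer that sends all traffic to one environment."""
    live: str = "blue"


def idle_env(router: Router) -> str:
    """Return the environment that is currently not receiving traffic."""
    return "green" if router.live == "blue" else "blue"


def cut_over(router: Router, health_check: Callable[[str], bool]) -> str:
    """Switch traffic to the idle environment only if it passes health_check."""
    candidate = idle_env(router)
    if health_check(candidate):      # step 3: verify before switching
        router.live = candidate      # step 4: switch the load balancer
    return router.live


def roll_back(router: Router) -> str:
    """Point traffic back at the previous environment (the standby)."""
    router.live = idle_env(router)
    return router.live


router = Router()
print(cut_over(router, health_check=lambda env: True))
print(roll_back(router))
```

Rollback is the same operation as cut-over without the health check, which is why blue-green rollbacks are fast: the old environment is still running.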
|
||||
|
||||
## Key Benefits
|
||||
|
||||
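| Benefit | Description |
|------|------|
| Zero downtime | The traffic switch completes almost instantly |
| Fast rollback | Switch back to the blue environment |
| Complete testing | Tested in a full production-grade environment |
| Controlled risk | The new version is verified in the idle environment before it receives user traffic |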
|
||||
|
||||
## Comparison: Blue-Green vs Canary
|
||||
|
||||
| Dimension | Blue-Green | Canary |
|------|-----------|--------|
| Traffic switch | All at once | Gradual |
| Rollback speed | Seconds | Seconds |
| Resource cost | 2x | Incremental |
| Best suited for | Critical systems | Continuous iteration |
| Risk | Full exposure | Gradual exposure |
|
||||
|
||||
## In ITSM Context
|
||||
|
||||
Within [[Release-Management]] in [[ITSM 2.0]], blue-green deployment is a key practice:

```
Release Management 2.0
├── Blue-Green Deployment
│   ├── Zero-downtime release
│   ├── Rollback in seconds
│   └── Full validation
├── Canary Release
│   ├── Gradual rollout
│   └── Real-time monitoring
└── DevOps-integrated ITSM
    ├── CI/CD Pipeline
    └── Automated Governance
```
|
||||
|
||||
## Related Concepts
|
||||
|
||||
- [[Release-Management]]: overall release management framework
- [[Canary-Release]]: canary release
- [[High-Availability]]: high-availability architecture
- [[Failover]]: failover
- [[Deployment-Automation]]: deployment automation
|
||||
|
||||
## Sources
|
||||
|
||||
- [[understanding-complete-itsm]]: how blue-green deployment is applied in modern ITSM
|
||||
64
wiki/concepts/Break-the-Build.md
Normal file
@@ -0,0 +1,64 @@
|
||||
# Break-the-Build
|
||||
|
||||
## Definition
|
||||
"Break the Build" is a mechanism that stops the development process if security risks are too high until resolved.
|
||||
|
||||
## Concept
|
||||
When a security scan in the CI/CD pipeline finds a high-risk issue, the build is blocked automatically until the issue is fixed.
|
||||
|
||||
## How It Works
|
||||
|
||||
### Trigger Conditions
|
||||
- SAST finds a high-severity vulnerability
- SCA finds a dependency with known vulnerabilities
- Secret-leak detection fires
- License compliance violation
|
||||
|
||||
### Process Flow
|
||||
```
Code commit → Build starts → Security scan →
├─ Pass → Continue to deployment
└─ Fail → Stop build → Notify team → Fix → Resubmit
```
|
||||
|
||||
## Implementation
|
||||
|
||||
### CI/CD Integration
|
||||
```yaml
# GitLab CI example
security_scan:
  stage: test
  script:
    - sast-scan  # placeholder for the scanner command
  allow_failure: false  # a failed scan fails the job and blocks the pipeline
```
|
||||
|
||||
### Gatekeeping Strategy
|
||||
| Severity | Default policy |
|---------|---------|
| Critical | Always block |
| High | Block (configurable) |
| Medium | Warn |
| Low | Ignore |
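As a rough sketch, the policy table can be turned into a small gate function. The names `DEFAULT_POLICY`, `gate`, and `block_on_high` are illustrative assumptions, and a real scanner would report richer findings than bare severity strings.

```python
from typing import Iterable

# Default action per severity, mirroring the table above.
DEFAULT_POLICY = {"critical": "block", "high": "block", "medium": "warn", "low": "ignore"}


def gate(findings: Iterable[str], block_on_high: bool = True) -> str:
    """Return 'block', 'warn', or 'pass' for a list of finding severities."""
    policy = dict(DEFAULT_POLICY)
    if not block_on_high:
        policy["high"] = "warn"  # "High" is the configurable row
    actions = {policy[severity.lower()] for severity in findings}
    if "block" in actions:
        return "block"
    if "warn" in actions:
        return "warn"
    return "pass"
```

A CI job could exit non-zero whenever `gate(...)` returns `"block"`, which is what actually breaks the build.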
|
||||
|
||||
## Benefits
|
||||
- Keeps insecure code out of production
- Forces developers to fix security issues promptly
- Raises the overall security baseline
- Reduces security debt
|
||||
|
||||
## Best Practices
|
||||
1. Define the "blocking" threshold explicitly
2. Balance security against development speed
3. Provide clear error messages
4. Integrate notifications
|
||||
|
||||
## Related Concepts
|
||||
- [[DevSecOps]]: Break-the-Build is one of its automation components
- [[SAST]]: source of trigger conditions
- [[SCA]]: source of trigger conditions
- [[CI/CD Pipeline]]: where it is implemented
- [[Shift-Left-Security]]: the strategy of finding problems early
|
||||
|
||||
## Sources
|
||||
- [[what-is-devsecops-best-practices-benefits-and-tools]]
|
||||
63
wiki/concepts/Bug-Bounty.md
Normal file
@@ -0,0 +1,63 @@
|
||||
# Bug Bounty
|
||||
|
||||
## Definition
|
||||
Bug Bounty programs incentivize external security researchers to report vulnerabilities in an organization's systems, websites, or applications.
|
||||
|
||||
## Concept
|
||||
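A bug bounty program offers rewards to external security researchers to motivate them to report vulnerabilities in an organization's systems, websites, or applications.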
|
||||
|
||||
## How It Works
|
||||
|
||||
### Program Setup
|
||||
1. Define the scope
2. Set the rules and the reward table
3. Establish the submission and triage process
4. Launch a public platform or use a third-party service
|
||||
|
||||
### Researcher Workflow
|
||||
```
Find vulnerability → Submit report → Vendor validates → Confirm/classify → Fix → Pay reward
```
|
||||
|
||||
## Benefits
|
||||
|
||||
### For Organizations
|
||||
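- Broadens security testing coverage
- More cost-effective than hiring a dedicated security team
- Brings in diverse security researcher perspectives
- Improves security response capability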
|
||||
|
||||
### For Researchers
|
||||
- Earn financial rewards
- Build a reputation in security research
- Learn from vulnerabilities in real environments
|
||||
|
||||
## Platforms
|
||||
- HackerOne
|
||||
- Bugcrowd
|
||||
- Open Bug Bounty
|
||||
- Vendor-run platforms (Google VRP, Microsoft Bounty)
|
||||
|
||||
## Best Practices
|
||||
|
||||
### For Program Owners
|
||||
1. Clear rules and scope definition
2. A fair reward structure
3. Fast response to submissions
4. Transparent communication
5. Legal protection (Safe Harbor)
|
||||
|
||||
### Responsible Disclosure
|
||||
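- Give the vendor reasonable time to fix
- Do not publish vulnerability details until a fix is available
- Follow Coordinated Vulnerability Disclosure (CVD)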
|
||||
|
||||
## Related Concepts
|
||||
- [[DevSecOps]]: bug bounties are part of continuous security improvement
- [[Penetration-Testing]]: formal penetration testing
- [[Vulnerability-Scanning]]: automated vulnerability scanning
- [[Incident-Response]]: vulnerability response
- [[Responsible-Disclosure]]: responsible disclosure
|
||||
|
||||
## Sources
|
||||
- [[what-is-devsecops-best-practices-benefits-and-tools]]
|
||||
102
wiki/concepts/Business-Impact-Analysis.md
Normal file
@@ -0,0 +1,102 @@
|
||||
---
|
||||
title: "Business Impact Analysis (业务影响分析)"
|
||||
tags: [devops, disaster-recovery, risk-management, business-continuity, planning]
|
||||
aliases: [BIA]
|
||||
created: 2026-04-25
|
||||
---
|
||||
|
||||
# Business Impact Analysis (业务影响分析)
|
||||
|
||||
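**Business Impact Analysis (BIA)** is the process of determining how the failure of each application system affects the business. Its results are used to set [[RTO]] and [[RPO]] targets and to define a tiered protection strategy.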
|
||||
|
||||
## Aliases
|
||||
- BIA
|
||||
- 业务影响分析
|
||||
|
||||
## Definition
|
||||
|
||||
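The core questions of a BIA:
1. What happens if the system is down for 1 hour?
2. What happens if the last hour of data is lost?

Only by answering these questions can you set each system's RTO/RPO targets on evidence, rather than picking "aggressive but unattainable" targets by gut feel.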
|
||||
|
||||
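## Key Analysis Questions

### What if the system is down for 1 hour?

- **Revenue loss**? How much?
- **Customer dissatisfaction**? How many customers are affected?
- **Employees blocked**? Can they work around it?
- **Compliance risk**? Legal issues?

### What if the last hour of data is lost?

- **Can the data be rebuilt**?
- **Does it involve money or transactions**?
- **Will users notice**?
- Do **compliance requirements** mandate retaining this data?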
|
||||
|
||||
## Tiered Protection Model
|
||||
|
||||
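Based on the BIA findings, applications are assigned to protection tiers:

| Tier | Scenarios | RTO target | RPO target | Investment strategy |
|------|------|----------|----------|----------|
| **(1) Critical** | Payment processing, user authentication, core product features | < 5 minutes | < 1 minute | Feature flags + automated monitoring + 3 AM paging + hot standby |
| **(2) Important** | Admin console, reporting, customer support tools | < 1 hour | < 15 minutes | Feature flags (major releases) + business-hours monitoring + standard backups |
| **(3) Nice-to-have** | Internal tools, development environments, documentation sites | < 4 hours | < 1 hour | Basic monitoring + manual recovery procedures + daily backups |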
|
||||
|
||||
## Investment Priorities
|
||||
|
||||
> "Most teams try to give everything Tier 1 treatment, which can lead to burnout. Be ruthless about what actually matters to your business. You can't do *everything*."
|
||||
|
||||
### Tier 1 Investment
- [[Feature Flag]] + automated monitoring
- Automated alerting (including 3 AM paging)
- [[Kill Switch]] + fallback path ready
- Near-real-time data replication

### Tier 2 Investment
- [[Feature Flag]] (for major releases)
- Business-hours monitoring
- Documented rollback procedures
- Hourly backups

### Tier 3 Investment
- Basic monitoring
- Manual recovery procedures
- Daily backups
- Hope for the best
|
||||
|
||||
## BIA and Cost-Benefit
|
||||
|
||||
> "Do the math honestly. What does an hour of downtime actually cost your business? If it's $10K, don't spend $100K/year on infrastructure to prevent it. You're better off accepting some downtime and investing in faster recovery."
|
||||
|
||||
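| Downtime loss | Reasonable ceiling for disaster recovery spend (annualized) |
|-------------|------------------------|
| $10K/hour | $100K/year |
| $100K/hour | $1M/year |

**Key insight**: the cost of prevention and recovery must be proportionate to the actual business loss.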
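The ratio in the table ($10K/hour against $100K/year) works out as if roughly 10 hours of avoidable downtime were expected per year. That figure is an inference from the table, not something the source states, so it appears below as an explicit, adjustable assumption. The function names are illustrative.

```python
def annual_downtime_cost(hourly_loss: float, expected_downtime_hours: float) -> float:
    """Expected yearly loss from downtime, in the same currency as hourly_loss."""
    return hourly_loss * expected_downtime_hours


def dr_spend_is_justified(
    hourly_loss: float,
    annual_dr_spend: float,
    expected_downtime_hours: float = 10.0,  # assumption implied by the table's ratio
) -> bool:
    """True when yearly DR spend does not exceed the loss it is meant to prevent."""
    return annual_dr_spend <= annual_downtime_cost(hourly_loss, expected_downtime_hours)
```

Plugging in your own downtime estimate matters more than the default: a system that realistically loses one hour a year justifies a tenth of the spend.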
|
||||
|
||||
## How BIA Relates to [[Feature Flag]]

The BIA findings guide how Feature Flags are used:

- **Tier 1 systems**: a Kill Switch is required and Progressive Rollout is mandatory
- **Tier 2 systems**: Feature Flags are recommended for major releases
- **Tier 3 systems**: use Feature Flags only where the ROI justifies it
|
||||
|
||||
## Related Concepts
|
||||
|
||||
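- [[RTO]]: BIA findings determine the RTO target
- [[RPO]]: BIA findings determine the RPO target
- [[Feature Flag]]: a technical means of implementing BIA-based tiered protection
- [[Kill Switch]]: the emergency mechanism every Tier 1 system must have
- [[Disaster Recovery]]: BIA is the foundation of disaster recovery planning
- [[Risk-Mitigation]]: BIA is part of risk management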
|
||||
|
||||
## Sources
|
||||
|
||||
- [[sources/rto-vs-rpo-key-differences-for-modern-disaster-recovery.md]]
|
||||
68
wiki/concepts/CMDB.md
Normal file
@@ -0,0 +1,68 @@
|
||||
---
|
||||
title: "Configuration Management Database (CMDB)"
|
||||
type: concept
|
||||
tags: [cloud, devops, operations, itsm]
|
||||
date: 2025-03-01
|
||||
---
|
||||
|
||||
## Definition
|
||||
|
||||
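A Configuration Management Database (CMDB) stores all Configuration Items (CIs) in the IT infrastructure and the relationships between them. In modern ITSM, an AI-driven CMDB adds stronger dependency mapping, drift detection, and real-time impact analysis.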
|
||||
|
||||
## Traditional vs Modern CMDB
|
||||
|
||||
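| Dimension | Traditional CMDB | AI-Enhanced CMDB |
|------|---------|------------------|
| Data source | Manual entry | Automatic discovery |
| Relationship mapping | Static | Dynamic |
| Drift detection | Periodic audits | Real-time monitoring |
| Impact analysis | Manual inference | AI prediction |
| Multi-cloud support | Limited | Native |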
|
||||
|
||||
## Key Capabilities (Modern CMDB)
|
||||
|
||||
### 1. Dependency Mapping
|
||||
```
Service → Application → Middleware → Database → Infrastructure
   ↓          ↓             ↓            ↓            ↓
Configuration items and their dependencies are discovered automatically
```
|
||||
|
||||
### 2. Drift Detection
|
||||
- Automatic detection of configuration changes
- Comparison against the expected configuration
- Alerting and automatic remediation
|
||||
|
||||
### 3. Real-time Impact Analysis
|
||||
- Impact prediction before a change
- Blast-radius assessment for incidents
- Service dependency visualization
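Impact analysis over a dependency map is essentially a graph traversal. The sketch below assumes the CMDB can export a simple "depends on" mapping; the item names in `DEPENDS_ON` are made-up sample data, and `impacted_by` is an illustrative helper rather than any product's API.

```python
from collections import deque
from typing import Dict, List, Set

# depends_on[x] lists the configuration items that x relies on (hypothetical sample data).
DEPENDS_ON: Dict[str, List[str]] = {
    "checkout-service": ["checkout-app"],
    "checkout-app": ["message-queue", "orders-db"],
    "message-queue": ["vm-cluster-1"],
    "orders-db": ["vm-cluster-1"],
    "vm-cluster-1": [],
}


def impacted_by(ci: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return every item that directly or transitively depends on `ci`."""
    # Invert the edges so we can walk from a component up to the services above it.
    dependents: Dict[str, List[str]] = {}
    for item, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(item)

    seen: Set[str] = set()
    queue = deque([ci])
    while queue:  # breadth-first walk over the inverted graph
        current = queue.popleft()
        for parent in dependents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen
```

Calling `impacted_by("orders-db", DEPENDS_ON)` answers the pre-change question "what breaks if this database goes away?" for the sample graph.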
|
||||
|
||||
## In ITSM Context
|
||||
|
||||
In [[ITSM]], the CMDB is the core of [[Configuration-Management]]:
|
||||
|
||||
```
Configuration Management (ITSM 5.0)
├── AI-powered CMDB
│   ├── Dependency mapping
│   ├── Drift detection
│   └── Real-time impact analysis
├── Multi-cloud Orchestration
│   ├── Public cloud
│   ├── Private cloud
│   └── Hybrid cloud
└── Misconfiguration Prevention
    └── Security vulnerability prevention
```
|
||||
|
||||
## Related Concepts
|
||||
|
||||
- [[Configuration-Management]]: the configuration management process
- [[ITSM]]: parent framework of the CMDB
- [[AIOps]]: AI-driven CMDB capabilities
- [[Cloud-Native]]: multi-cloud CMDB support
|
||||
|
||||
## Sources
|
||||
|
||||
- [[understanding-complete-itsm]]: how the AI-enhanced CMDB is applied in modern ITSM
|
||||
63
wiki/concepts/Canary-Release.md
Normal file
@@ -0,0 +1,63 @@
|
||||
---
|
||||
title: "Canary Release"
|
||||
type: concept
|
||||
tags: [devops, deployment, release-management, automation]
|
||||
date: 2025-03-01
|
||||
---
|
||||
|
||||
## Definition
|
||||
|
||||
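A canary release is a progressive software release strategy: the new version is first deployed to a small share of users, and once it proves stable the share is widened step by step until it serves everyone. This lowers the risk of an all-at-once release and makes rollouts almost disruption-free.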
|
||||
|
||||
## How It Works
|
||||
|
||||
```
|
||||
Phase 1: 5% Traffic Phase 2: 20% Traffic Phase 3: 100% Traffic
|
||||
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
|
||||
│ New Version │ │ New Version │ │ Old Version │
|
||||
│ ██ │ │ ████ │ │ ░░░░ │
|
||||
│ 5% users │ │ 20% users │ │ Rollback if │
|
||||
└─────────────────┘ └─────────────────┘ │ issues │
|
||||
↓ ↓ └─────────────────┘
|
||||
Monitor metrics Monitor + Auto-rollback ↓
|
||||
Auto-promote if failure rate ↑ Full Deployment
|
||||
```
|
||||
|
||||
## Key Metrics to Monitor
|
||||
|
||||
| Metric | Description | Alert threshold |
|------|------|---------|
| Error Rate | Change in error rate | +2% |
| Latency | Response time | +50ms |
| Business KPIs | Conversion rate, order volume | -5% |
| Resource Usage | CPU/memory | +30% |
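One way to apply these thresholds is to compare canary metrics against the baseline version. The sketch below makes two assumptions the table leaves open: the error-rate threshold is read as percentage points, and the KPI and resource thresholds are read as relative changes. `Metrics` and `canary_violations` are illustrative names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Metrics:
    error_rate_pct: float   # errors as a percentage of requests
    latency_ms: float       # response time in milliseconds
    conversion_rate: float  # any business KPI where higher is better
    cpu_pct: float          # CPU utilization percentage


def canary_violations(baseline: Metrics, canary: Metrics) -> List[str]:
    """Return the names of the thresholds the canary exceeds."""
    violations = []
    if canary.error_rate_pct - baseline.error_rate_pct > 2.0:   # +2 percentage points
        violations.append("error_rate")
    if canary.latency_ms - baseline.latency_ms > 50.0:          # +50 ms
        violations.append("latency")
    if canary.conversion_rate < baseline.conversion_rate * 0.95:  # -5% relative
        violations.append("business_kpi")
    if canary.cpu_pct > baseline.cpu_pct * 1.30:                # +30% relative
        violations.append("resource_usage")
    return violations
```

An empty result would let the rollout proceed to the next traffic percentage; any entry would be a reason to pause or roll back.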
|
||||
|
||||
## In ITSM Context
|
||||
|
||||
Within [[Release-Management]] in [[ITSM 2.0]], canary release is a key practice:
|
||||
|
||||
```
Release Management 2.0
├── Canary Release
│   ├── Gradual traffic shifting
│   ├── Real-time monitoring
│   └── Automatic rollback
├── Blue-Green Deployment
│   ├── Zero-downtime switchover
│   └── Fast rollback
└── Feature Flags
    ├── Fine-grained control
    └── A/B Testing
```
|
||||
|
||||
## Related Concepts
|
||||
|
||||
- [[Release-Management]]: overall release management framework
- [[Blue-Green-Deployment]]: blue-green deployment
- [[Feature-Flag]]: feature flags
- [[Progressive-Rollout]]: progressive rollout
- [[Deployment-Automation]]: deployment automation
|
||||
|
||||
## Sources
|
||||
|
||||
- [[understanding-complete-itsm]]: how canary release is applied in modern ITSM
|
||||
38
wiki/concepts/Centralized-Logging.md
Normal file
@@ -0,0 +1,38 @@
|
||||
---
|
||||
title: Centralized Logging
|
||||
type: concept
|
||||
tags: [DevOps, Observability, CloudOps, AWS]
|
||||
date: 2025-10-24
|
||||
---
|
||||
|
||||
## Definition
|
||||
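Centralized logging is a pattern in which logs scattered across multiple systems, accounts, services, or geographic locations are collected into a single central location and managed together. Its core goal is to eliminate monitoring blind spots in distributed systems and provide global observability.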
|
||||
|
||||
## Core Properties
|
||||
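- **Aggregation**: logs from many sources are merged into a single store
- **Unified querying**: centralized search and analysis across sources
- **Centralized alerting**: one alerting policy built on the aggregated data
- **Compliance retention**: one data retention and compliance policy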
|
||||
|
||||
## Related Concepts
|
||||
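- [[Multi-Account Deployment]]: multi-account setups are the main driver for centralized logging
- [[Cross-Account Monitoring]]: cross-account monitoring relies on centralized logging infrastructure
- [[StackSets Deployment Visibility]]: observability of StackSets deployments relies on centralized logging
- [[Event Sourcing]]: centralized logging can be seen as one implementation of event sourcing
- [[APM]] (Application Performance Monitoring): APM tools usually rely on centralized log data
- [[CloudWatch Logs]]: the centralized log storage service in the AWS ecosystem
- [[Prometheus]]: time-series monitoring that complements centralized logging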
|
||||
|
||||
## Implementation Patterns
|
||||
1. **Collection layer**: Agent/Fluentd/Firelens collect logs from each source
2. **Transport layer**: EventBridge/Kinesis/Firehose carry log events
3. **Storage layer**: CloudWatch Logs/OpenSearch/S3 + Athena
4. **Analysis layer**: CloudWatch Logs Insights/OpenSearch Dashboards/Grafana Loki
5. **Alerting layer**: CloudWatch Alarms/Grafana Alerting/PagerDuty
|
||||
|
||||
## AWS Context
|
||||
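- AWS CloudWatch Logs: AWS-native log storage and analysis service
- AWS EventBridge: event-driven routing for log collection
- AWS CloudTrail: audit log of AWS API calls (a special form of centralized logging)
- AWS Systems Manager OpsCenter: operational issue management built on centralized logs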
|
||||
- [[Centralized Logging]] ← uses ← [[Amazon EventBridge]] ← routes ← [[Amazon CloudWatch Logs]]
|
||||
69
wiki/concepts/Change-Management.md
Normal file
@@ -0,0 +1,69 @@
|
||||
---
|
||||
title: "Change Management"
|
||||
type: concept
|
||||
tags: [itsm, devops, operations]
|
||||
date: 2025-03-01
|
||||
---
|
||||
|
||||
## Definition
|
||||
|
||||
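Change Management is one of the core [[ITSM]] processes. It uses **a structured method to assess, approve, and implement IT changes**, providing the flexibility the business needs while keeping services stable.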
|
||||
|
||||
## Change Management Process
|
||||
|
||||
```
|
||||
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
|
||||
│ Change │ → │ Impact │ → │ CAB │
|
||||
│ Request │ │ Assessment │ │ Approval │
|
||||
└──────────────┘ └──────────────┘ └──────────────┘
|
||||
↓
|
||||
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
|
||||
│ Build & │ ← │ Change │ ← │ Schedule │
|
||||
│ Test │ │ Build │ │ & Prepare │
|
||||
└──────────────┘ └──────────────┘ └──────────────┘
|
||||
```
|
||||
|
||||
## Change Types
|
||||
|
||||
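| Type | Description | Approval level |
|------|------|---------|
| Emergency | Urgent change (e.g. P1 incident response) | Expedited approval |
| Standard | Routine change (scheduled maintenance) | Automatic approval |
| Normal | Regular change (new feature deployment) | CAB approval |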
|
||||
|
||||
## Modern Change Management (ITSM 2.0)
|
||||
|
||||
In [[ITSM 2.0]], change management is driven by AI and [[IaC]]:
|
||||
|
||||
### AI-Driven Risk Assessment
|
||||
- **Automated Impact Analysis**: assesses the impact of a change automatically
- **Failure Probability Prediction**: AI predicts the probability that a change fails
- **Rollback Planning**: generates a rollback plan automatically
|
||||
|
||||
### CI/CD Pipeline Governance
|
||||
```
|
||||
Code Commit → Automated Testing → IaC Validation → Risk Score → Approval
|
||||
↓ ↓ ↓ ↓ ↓
|
||||
Git hooks Unit/Int Tests Terraform Plan ML Model Auto/CAB
|
||||
```
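The last two pipeline stages (risk score, then Auto/CAB approval) can be sketched as a routing function. The `auto_threshold` value and the idea that a low-risk Normal change may skip the CAB are assumptions made for illustration; the source only says that approval is either automatic or by CAB.

```python
def approval_route(change_type: str, risk_score: float, auto_threshold: float = 0.3) -> str:
    """Pick an approval path. risk_score is in [0, 1]; the threshold is illustrative."""
    if not 0.0 <= risk_score <= 1.0:
        raise ValueError("risk_score must be between 0 and 1")
    kind = change_type.lower()
    if kind == "emergency":
        return "expedited"   # urgent changes take the fast-track path
    if kind == "standard":
        return "auto"        # routine, pre-approved changes
    if kind == "normal":
        # Assumption: a model-produced risk score decides between auto-approval and CAB.
        return "auto" if risk_score < auto_threshold else "cab"
    raise ValueError(f"unknown change type: {change_type}")
```

In practice the score would come from whatever model the pipeline uses; here it is simply a number passed in by the caller.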
|
||||
|
||||
## Key Metrics
|
||||
|
||||
| Metric | Description |
|------|------|
| Change Success Rate | Share of changes that succeed |
| Failed Change Rate | Share of changes that fail |
| Rollback Rate | Share of changes that are rolled back |
| Emergency Change Ratio | Share of changes that are emergency changes |
|
||||
|
||||
## Related Concepts
|
||||
|
||||
- [[ITSM]]: parent framework
- [[Change-Failure-Rate]]: change failure rate
- [[IaC]]: Infrastructure as Code
- [[CI/CD-Pipeline]]: CI/CD pipeline
- [[Rollback]]: rollback mechanism
|
||||
|
||||
## Sources
|
||||
|
||||
- [[understanding-complete-itsm]] — AI-driven Change Management
|
||||
@@ -1,77 +1,144 @@
|
||||
---
|
||||
title: Cloud Adoption Strategy
|
||||
source: https://www.bacancytechnology.com/blog/cloud-maturity-model
|
||||
tags: [Cloud, Strategy, Transformation, Cloud-Maturity]
|
||||
---
|
||||
|
||||
# Cloud Adoption Strategy
|
||||
|
||||
## Overview
|
||||
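> **Cloud Adoption Strategy**: a systematic approach by which an organization migrates workloads, applications, and data from on-premises infrastructure to cloud environments and keeps optimizing its use of cloud services.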
|
||||
|
||||
A **Cloud Adoption Strategy** is a comprehensive plan that guides organizations through the process of transitioning their workloads, applications, and infrastructure to cloud environments. The Cloud Maturity Model (CMM) provides a structured framework for developing and executing this strategy.
|
||||
## Definition
|
||||
|
||||
## Key Elements
|
||||
A Cloud Adoption Strategy is a comprehensive action plan that defines how an organization will:
|
||||
|
||||
### 1. Setting Cloud Adoption Objectives
|
||||
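- Assess the current state (legacy systems, cloud readiness)
- Set cloud adoption objectives
- Choose suitable cloud platforms and deployment models
- Plan migration paths and priorities
- Manage change and workforce transition
- Continuously optimize cloud operations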
|
||||
|
||||
Before adopting cloud services, organizations should:
|
||||
## Core Framework
|
||||
|
||||
- **Clarify Motivations** — Focus on cloud economics and Total Cost of Ownership (TCO) to understand how cost savings and efficiency drive adoption
|
||||
- **Determine Business Goals** — Align technical strategies with business objectives to ensure cloud adoption meets organizational needs
|
||||
- **Develop a Business Case** — Create a strong business case to secure support from internal teams, including finance and management
|
||||
According to the Cloud Maturity Model from the Open Alliance for Cloud Adoption (OACA), a cloud adoption strategy should include:
|
||||
|
||||
### 2. Identifying Current Maturity Level
|
||||
### 1. Gap Analysis
|
||||
|
||||
Understanding your current cloud maturity level (0-5) allows for:
|
||||
- Tailored objectives based on current state
|
||||
- More effective cloud adoption strategy
|
||||
- Right balance between cloud-native services and hybrid architecture
|
||||
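Assess the gap between the current state and the target state:
- Technology gaps (infrastructure, applications, data)
- Process gaps (operations, security, compliance)
- People gaps (skills, culture, capability)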
|
||||
|
||||
### 3. Selecting the Right Maturity Model
|
||||
### 2. Cloud Maturity Assessment
|
||||
|
||||
Various models are available:
|
||||
- **OACA Cloud Maturity Model** — General framework, provider-neutral
|
||||
- **AWS Cloud Adoption Framework** — AWS-specific best practices
|
||||
- **Azure Cloud Adoption Framework** — Microsoft Azure guidance
|
||||
- **Google Cloud Adoption Framework** — Google Cloud transition guide
|
||||
- **Cloud Security Maturity Model (CSMM)** — Security-specific assessment
|
||||
| Dimension | What is assessed |
|----------|---------|
| **People** | Skill levels, training needs, cultural adaptability |
| **Processes** | Maturity of existing processes, improvement opportunities |
| **Technology** | Infrastructure, applications, data, integration |
|
||||
|
||||
### 4. Following Governance and Compliance
|
||||
### 3. Migration Strategy
|
||||
|
||||
Establish:
|
||||
- Framework defining roles, responsibilities, and decision-making
|
||||
- Comprehensive policies (security, access controls, data protection, cost management, incident response)
|
||||
- Alignment with industry regulations (HIPAA, PCI-DSS)
|
||||
|
||||
### 5. Security and Risk Management
|
||||
|
||||
- Encryption and access controls
|
||||
- Regular backups and monitoring
|
||||
- Frequent risk assessments
|
||||
- Security awareness training
|
||||
|
||||
## Relationship with Cloud Maturity Model
|
||||
|
||||
The CMM serves as both a diagnostic tool and a roadmap:
|
||||
- **Diagnostic** — Assess current state across people, processes, and technology
|
||||
- **Roadmap** — Guide progression through 5 maturity levels
|
||||
- **Benchmarking** — Compare progress against industry standards
|
||||
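| Strategy | When to use | Risk / benefit |
|----------|---------|-----------|
| **Rehost (Lift & Shift)** | Fast migration under time pressure | Low risk / low benefit |
| **Replatform** | Some optimization needed | Medium risk / medium benefit |
| **Repurchase** | Replacing with SaaS | Low risk / medium benefit |
| **Refactor** | Cloud-native requirements | High risk / high benefit |
| **Retire** | Decommissioning legacy systems | Reduces complexity |
| **Retain** | Not migrating for now | Strategic retention |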
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. Avoid skipping maturity levels — sustainable transformation requires incremental progress
|
||||
2. Focus on long-term sustainability over rapid technological change
|
||||
3. Selectively adopt Level 5 elements that bring clear business benefits
|
||||
4. Establish clear KPIs for cloud utilization
|
||||
5. Invest in education and training programs
|
||||
### 1. Set Up Cloud Adoption Objectives
|
||||
|
||||
## Related Concepts
|
||||
**Clarify Motivations**
|
||||
- Focus on cloud economics and Total Cost of Ownership (TCO)
- Quantify cost savings and efficiency gains
|
||||
|
||||
- [[Cloud-Maturity-Model]]
|
||||
- [[Multi-Cloud-Strategy]]
|
||||
- [[Cloud-Native]]
|
||||
- [[FinOps]]
|
||||
**Determine Business Goals**
|
||||
- Align technical strategy with business objectives
- Make sure cloud adoption meets organizational needs
|
||||
|
||||
## Sources
|
||||
**Develop a Business Case**
|
||||
- Build a persuasive business case
- Win support from finance and management
|
||||
|
||||
- [[sources/cloud-maturity-model-a-detailed-guide-for-cloud-adoption.md]]
|
||||
### 2. Identify Current Maturity Level
|
||||
|
||||
- Use the Cloud Maturity Model to assess the current state
- Set realistic improvement goals
- Balance fully cloud-native needs against hybrid architecture needs
|
||||
|
||||
### 3. Pick the Right Maturity Model
|
||||
|
||||
| Model | Use case | Characteristics |
|------|---------|------|
| **OACA CMM** | General cloud adoption | Vendor-neutral |
| **AWS CAF** | AWS environments | AWS-specific |
| **Azure CAF** | Azure environments | Azure-specific |
| **GCP CAF** | GCP environments | GCP-specific |
| **CSMM** | Cloud security maturity | Security assessment |
|
||||
|
||||
### 4. Follow Governance and Compliance
|
||||
|
||||
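- Establish a governance framework that defines roles and responsibilities
- Set policies for security, access control, and data protection
- Comply with industry regulations such as HIPAA and PCI-DSS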
|
||||
|
||||
### 5. Security and Risk Management
|
||||
|
||||
- Deploy encryption and access controls
- Run regular backups and threat monitoring
- Assess risk continuously
- Provide security awareness training
|
||||
|
||||
## Cloud Deployment Models
|
||||
|
||||
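| Model | Description | Use case |
|------|------|---------|
| **Public Cloud** | Shared infrastructure | Elastic scaling, cost optimization |
| **Private Cloud** | Dedicated infrastructure | High security and compliance needs |
| **Hybrid Cloud** | Mix of public and private | Balance of flexibility and control |
| **Multi-Cloud** | Multiple cloud platforms | Avoiding lock-in, optimizing performance |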
|
||||
|
||||
## Key Considerations
|
||||
|
||||
### CAPEX vs OPEX
|
||||
|
||||
| Dimension | CAPEX (on-premises) | OPEX (cloud) |
|------|-------------|-----------|
| Upfront cost | High | Low |
| Ongoing cost | Maintenance, depreciation | Pay-as-you-go |
| Flexibility | Low | High |
| Scalability | Limited | Elastic |
| Visibility | Fixed | Usage-based |
|
||||
|
||||
### TCO (Total Cost of Ownership)
|
||||
|
||||
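When estimating the total cost of cloud adoption, consider:
- Direct costs (compute, storage, network)
- Indirect costs (training, migration, integration)
- Hidden costs (governance, security, compliance)
- Opportunity costs (speed of innovation)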
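The four cost categories can be kept side by side in a small structure so that none is silently dropped from an estimate. `CloudTco` is an illustrative name, and the figures a team feeds into it are their own estimates, not values from the source.

```python
from dataclasses import dataclass


@dataclass
class CloudTco:
    direct: float       # compute, storage, network
    indirect: float     # training, migration, integration
    hidden: float       # governance, security, compliance
    opportunity: float  # estimated cost of slower innovation

    def total(self) -> float:
        """Sum of all four categories, in the same currency and period."""
        return self.direct + self.indirect + self.hidden + self.opportunity

    def direct_share(self) -> float:
        """Fraction of the total that is visible on the cloud bill."""
        return self.direct / self.total()
```

A low `direct_share()` is a hint that the cloud invoice alone understates what adoption really costs.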
|
||||
|
||||
## Phases of Cloud Adoption
|
||||
|
||||
```
|
||||
Phase 1: Discovery & Assessment
|
||||
↓
|
||||
Phase 2: Strategy & Planning
|
||||
↓
|
||||
Phase 3: Foundation & Readiness
|
||||
↓
|
||||
Phase 4: Migration
|
||||
↓
|
||||
Phase 5: Optimization
|
||||
↓
|
||||
Phase 6: Innovation & Transformation
|
||||
```
|
||||
|
||||
## See Also
|
||||
|
||||
- [[Cloud Maturity Model]]: maturity framework
- [[Cloud Maturity Levels]]: maturity levels
- [[Cloud Migration]]: cloud migration
- [[Cloud Governance]]: cloud governance
- [[Multi-Cloud Strategy]]: multi-cloud strategy
- [[FinOps]]: cloud financial management
- [[Cloud-Native]]: cloud native
|
||||
|
||||
73
wiki/concepts/Cloud-Cost-Optimization.md
Normal file
@@ -0,0 +1,73 @@
|
||||
---
|
||||
title: "Cloud Cost Optimization"
|
||||
type: concept
|
||||
tags: [Cloud, FinOps, Cost Management, Cloud Operations]
|
||||
date: 2026-04-26
|
||||
---
|
||||
|
||||
# Cloud Cost Optimization (云成本优化)
|
||||
|
||||
## Definition
|
||||
**Cloud Cost Optimization** is the practice of reducing unnecessary cloud spending while maintaining performance, security, and reliability. It involves monitoring, analyzing, and adjusting cloud resource usage to achieve maximum value.
|
||||
|
||||
## Key Tactics
|
||||
|
||||
### 1. Reserved Instances & Spot Instances
|
||||
- **Reserved Instances**: 40-70% cost reduction compared to on-demand
|
||||
- **Spot Instances**: Up to 90% discount for interruptible workloads
|
||||
- Strategic commitment for predictable workloads
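To see how these discounts combine, the sketch below estimates a bill when part of the workload moves to reserved and spot capacity. The default discounts are the low end of the reserved range (40%) and the spot ceiling (90%) quoted above; real discounts vary by provider, region, and term, and `estimated_cost` is an illustrative helper.

```python
def estimated_cost(
    on_demand_cost: float,
    reserved_share: float,
    spot_share: float,
    reserved_discount: float = 0.40,  # low end of the 40-70% range
    spot_discount: float = 0.90,      # "up to 90%", so treat as a best case
) -> float:
    """Estimate spend after shifting shares of an all-on-demand bill."""
    if reserved_share < 0 or spot_share < 0 or reserved_share + spot_share > 1:
        raise ValueError("shares must be non-negative and sum to at most 1")
    on_demand_share = 1 - reserved_share - spot_share
    return on_demand_cost * (
        on_demand_share
        + reserved_share * (1 - reserved_discount)
        + spot_share * (1 - spot_discount)
    )
```

For example, moving half of a $1,000 bill to reserved capacity and a fifth to spot gives roughly $620 under these assumed discounts.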
|
||||
|
||||
### 2. Auto-Scaling & Right-Sizing
|
||||
- Automatically adjust resources based on demand
|
||||
- Match instance types to actual workload needs
|
||||
- Terminate underutilized resources
|
||||
|
||||
### 3. Resource Tagging & Monitoring
|
||||
- Track spending by teams, projects, and workloads
|
||||
- Real-time cost visibility
|
||||
- Budget alerts and anomaly detection
|
||||
|
||||
### 4. Storage Optimization
|
||||
- Delete unused snapshots and volumes
|
||||
- Use lifecycle policies
|
||||
- Choose appropriate storage classes
|
||||
|
||||
### 5. Network Cost Optimization
|
||||
- Minimize data transfer costs
|
||||
- Use VPC endpoints
|
||||
- Optimize traffic routes
|
||||
|
||||
## Cloud Provider Cost Management Tools
|
||||
|
||||
| Provider | Tool | Key Features |
|
||||
|----------|------|-------------|
|
||||
| AWS | AWS Cost Explorer | Real-time cost monitoring, savings plans, budget alerts |
|
||||
| Azure | Azure Cost Management | Cost tracking, reserved instances, predictive analysis |
|
||||
| GCP | GCP Billing Reports | AI-driven cost insights, budget tracking |
|
||||
|
||||
## FinOps Integration
|
||||
- Cloud Cost Optimization is a core component of [[FinOps]]
|
||||
- Continuous optimization cycle: Inform → Optimize → Operate
|
||||
- Collaboration between finance, engineering, and operations
|
||||
|
||||
## Case Studies
|
||||
|
||||
### E-commerce Company
|
||||
- Leveraged Auto-Scaling and Reserved Instances across AWS and Azure
|
||||
- **Result**: $500,000 annual billing savings
|
||||
|
||||
### SaaS Company
|
||||
- Used Reserved Instances and Autoscaling Policies
|
||||
- **Result**: 35% reduction in cloud costs
|
||||
|
||||
## Related Concepts
|
||||
- [[FinOps]]
|
||||
- [[Cloud Operating Model]]
|
||||
- [[Cloud Governance]]
|
||||
- [[Rightsizing]]
|
||||
- [[Pay-as-you-go]]
|
||||
|
||||
## Related Entities
|
||||
- [[AWS]]
|
||||
- [[Azure]]
|
||||
- [[Google-Cloud]]
|
||||
68
wiki/concepts/Cloud-Governance.md
Normal file
@@ -0,0 +1,68 @@
|
||||
---
|
||||
title: "Cloud Governance"
|
||||
type: concept
|
||||
tags: [Cloud, Governance, Compliance, Security, Cloud Operations]
|
||||
date: 2026-04-26
|
||||
---
|
||||
|
||||
# Cloud Governance (云治理)
|
||||
|
||||
## Definition
|
||||
**Cloud Governance** is the set of policies, processes, and controls that ensure cloud resources are used securely, efficiently, and in compliance with regulatory requirements. It provides the framework for managing cloud chaos, security loopholes, and cost overruns.
|
||||
|
||||
## Key Components
|
||||
|
||||
### 1. Identity & Access Management (IAM)
|
||||
- Role-based access control (RBAC)
|
||||
- Principle of least privilege
|
||||
- Multi-factor authentication
|
||||
|
||||
### 2. Security & Compliance
|
||||
- Policy-as-Code for automated enforcement
|
||||
- Continuous compliance monitoring
|
||||
- Automated compliance checks
|
||||
|
||||
### 3. Cost Management & Governance
|
||||
- Real-time cost tracking
|
||||
- Budget alerts and allocation
|
||||
- Resource tagging strategies
|
||||
|
||||
### 4. Resource Governance
|
||||
- Guardrails for resource provisioning
|
||||
- Tagging standards
|
||||
- Resource lifecycle management
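A tagging standard only works as a guardrail if provisioning is refused when tags are missing. The sketch below assumes three required tag keys (`team`, `project`, `workload`); these names are an illustrative choice, and the functions are hypothetical helpers rather than any provider's policy API.

```python
from typing import Dict, Set

# Assumed tagging standard: every resource must carry these keys with non-empty values.
REQUIRED_TAGS: Set[str] = {"team", "project", "workload"}


def missing_tags(resource_tags: Dict[str, str]) -> Set[str]:
    """Return the required tag keys that are absent or blank."""
    present = {key.lower() for key, value in resource_tags.items() if str(value).strip()}
    return REQUIRED_TAGS - present


def may_provision(resource_tags: Dict[str, str]) -> bool:
    """Guardrail: allow provisioning only when the tagging standard is met."""
    return not missing_tags(resource_tags)
```

In a real setup the same rule would usually live in the provider's policy engine, with a check like this running earlier in CI to give faster feedback.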
|
||||
|
||||
## Cloud Governance by Provider
|
||||
|
||||
| Aspect | AWS | Azure | GCP |
|
||||
|--------|-----|-------|-----|
|
||||
| IAM | AWS IAM | Azure AD | Google IAM |
|
||||
| Security Tools | AWS Security Hub | Microsoft Defender | Security Command Center |
|
||||
| Cost Control | AWS Cost Explorer | Azure Cost Management | GCP Billing Reports |
|
||||
| Policy Enforcement | AWS Organizations & SCPs | Azure Policy | GCP Organization Policies |
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Define IAM roles and policies upfront** — avoid giving excessive permissions
|
||||
2. **Use automated compliance checks** — detect misconfigurations
|
||||
3. **Implement guardrails** — prevent unauthorized resource provisioning
|
||||
4. **Establish tagging standards** — track resources by teams, projects, workloads
|
||||
5. **Enable real-time monitoring** — detect anomalies and compliance violations
|
||||
|
||||
## Relationship to Cloud Operating Model
|
||||
- Cloud Governance is a **core pillar** of the Cloud Operating Model
|
||||
- Provides the guardrails that enable secure and efficient cloud operations
|
||||
- Works alongside Automation, Security, and FinOps
|
||||
|
||||
## Related Concepts
|
||||
- [[Cloud Operating Model]]
|
||||
- [[Policy-as-Code]]
|
||||
- [[Compliance-Automation]]
|
||||
- [[FinOps]]
|
||||
- [[Zero-Trust-Architecture]]
|
||||
- [[IAM]]
|
||||
|
||||
## Related Entities
|
||||
- [[AWS]]
|
||||
- [[Azure]]
|
||||
- [[Google-Cloud]]
|
||||
@@ -1,90 +1,181 @@
|
||||
---
|
||||
title: Cloud Maturity Levels (0-5)
|
||||
source: https://www.bacancytechnology.com/blog/cloud-maturity-model
|
||||
tags: [Cloud, Maturity, Levels, Assessment, Transformation]
|
||||
---
|
||||
# Cloud Maturity Levels
|
||||
|
||||
# Cloud Maturity Levels (0-5)
|
||||
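> **Cloud Maturity Levels**: detailed definitions of the levels in the Cloud Maturity Model, each corresponding to a specific capability level on an organization's cloud adoption journey.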
|
||||
|
||||
## Overview
|
||||
|
||||
The Cloud Maturity Model defines maturity levels **0 through 5** (six in total) that represent stages of organizational cloud adoption capability. These levels provide a structured assessment framework for evaluating current state and planning progression.
|
||||
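The Cloud Maturity Model grades an organization's cloud maturity from Level 0 to Level 5. Each level represents a different degree of technical capability, process maturity, and strategic integration.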
|
||||
|
||||
## The Six Maturity Levels
|
||||
## The 5 Maturity Levels
|
||||
|
||||
### Level 0: Legacy (No Cloud Readiness)
|
||||
- Company doesn't use the cloud at all
|
||||
- Relies solely on outdated systems
|
||||
- No plans to adopt cloud services
|
||||
- Starting new projects is slow and difficult
|
||||
- Often due to strict regulations (high security or data requirements) rather than lack of readiness
|
||||
### Level 0: No Cloud Readiness (Legacy)
|
||||
|
||||
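**Definition**: The organization uses no cloud services at all and relies solely on legacy systems.

**Characteristics**:
- No virtualization
- All workloads run on-premises
- Starting new projects is slow and difficult
- Usually constrained by strict regulation (high security or data requirements)

**Typical scenarios**:
- Heavily regulated industries (some finance and healthcare)
- Settings with very strict data sovereignty requirements
- Organizations whose legacy systems are hard to migrate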
|
||||
|
||||
---
|
||||
|
||||
### Level 1: Initial Readiness (Ad hoc)
|
||||
- Company has assessed software and services for cloud integration
|
||||
- Some initial experience with cloud services
|
||||
- Possibly migrating a few systems
|
||||
- Still operates primarily on legacy and non-virtualized systems
|
||||
- Cloud mainly used for SaaS or specific business unit needs
|
||||
- No clear overall strategy
|
||||
|
||||
**Key Challenges:** Limited cloud knowledge, minimal leadership support, absence of clear strategy, undefined processes
|
||||
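**Definition**: The organization has begun evaluating cloud services and adopts them sporadically, without an overall strategy.

**Characteristics**:
- Software and services have been assessed for cloud integration
- Some workloads are being trial-migrated
- Still runs mainly on legacy and non-virtualized systems
- Cloud is used mainly for SaaS or specific business unit needs
- No clear overall cloud strategy

**Common challenges**:

| Challenge | Solution |
|------|---------|
| Limited cloud knowledge | Secure executive support for cloud initiatives |
| Insufficient leadership support | Run several PoCs with non-critical applications |
| Lack of funding | Obtain funding for comprehensive access to cloud services |
| No clear strategy | Define a strategy for the current team to use cloud technology effectively |
| No defined processes or guidelines | Build cloud knowledge through education and training |
| No optimization of cloud usage | Set explicit KPIs (e.g. cut infrastructure costs by 25%) |
| Low cloud security awareness | Deepen understanding of cloud security risks through training |

**Path forward**: executive support → PoC validation → secured funding → strategy definition → KPI setup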
|
||||
|
||||
---
|
||||
|
||||
### Level 2: Repeatable, Opportunistic
|
||||
- Established IT and procurement procedures for cloud services
|
||||
- Decided who can subscribe and how
|
||||
- Processes are defined and repeatable
|
||||
- Cloud services used extensively
|
||||
- Approach isn't yet fully systematic and clearly defined
|
||||
|
||||
**Key Challenges:** Cost control concerns, lack of documented policies, over-reliance on manual tasks, limited cloud usage visibility
|
||||
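**Definition**: The organization has established IT and procurement processes for cloud services and uses the cloud widely, but its approach is not yet systematic.

**Characteristics**:
- Cloud subscription and access processes are in place
- Processes are defined and repeatable
- Cloud services are used widely
- No fully systematic approach yet

**Common challenges**:

| Challenge | Solution |
|------|---------|
| Cost control and management issues | Align cloud usage with business goals (market expansion, product launches) |
| Lack of documented policies | Establish a Cloud Center of Excellence (CCoE) |
| Over-reliance on manual tasks | Form a dedicated cloud governance team |
| Limited visibility into cloud usage | Prioritize optimizing the total cost of ownership (TCO) of cloud adoption |
| Concerns about cloud adoption ROI and timeline | Adopt standardization, repeatability, and automation |
| Reluctance to migrate off legacy systems | Deploy applications in containers rather than virtual machines |
| Security and compliance concerns | Consider a range of deployment models (private, hybrid, multi-cloud) |
| Complexity of managing cloud teams, processes, and migrations | Write detailed cloud operations guidelines and protocols |
| Encryption and authentication issues | Migrate critical production workloads to the cloud |
| Minimizing cloud system downtime | Ensure minimal downtime for all cloud services |

**Path forward**: CCoE → governance team → standardization → automation → containerization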
|
||||
|
||||
---
|
||||
|
||||
### Level 3: Systematic and Documented
|
||||
- Implemented process or outsourced service to manage cloud subscriptions
|
||||
- Monitor existing services systematically
|
||||
- Operations are more efficient and systematic
|
||||
- Documented practices and compliance in place
|
||||
- Includes documented cloud management processes and updated operational policies
|
||||
|
||||
**Key Challenges:** Ensuring consistency, staff training, effective environment management, workload optimization
|
||||
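**Definition**: The organization has implemented processes or outsourced services to manage cloud subscriptions and monitor existing services, so operations are more efficient and systematic.

**Characteristics**:
- Cloud management processes are documented
- Operational policies have been updated
- Processes and practices are systematic
- Cloud management may be outsourced

**Common challenges**:

| Challenge | Solution |
|------|---------|
| Keeping cloud processes consistent | Gain support for full IT decentralization |
| Training staff to build capability | Define a comprehensive strategy for migrating applications to the target environment |
| Managing cloud environments effectively | Strengthen release, key, and policy management |
| Analyzing workloads for optimization opportunities | Establish robust governance and management practices |
| Identifying tasks suited to automation | Migrate all relevant workloads and data to the cloud |
| Environment management concerns | Experiment with advanced cloud services (AI, machine learning, etc.) |
| Application and system migration | Embrace full automation and orchestration |

**⚠️ Caution**: Many businesses try to skip Levels 2 and 3 and jump straight from Level 0/1 to Level 4. Technology-driven cloud transformation frameworks make this look tempting, but ensuring long-term sustainability is crucial.

**Path forward**: IT decentralization support → migration strategy → stronger governance → automation → advanced cloud services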
|
||||
|
||||
---
|
||||
|
||||
### Level 4: Measured
|
||||
- Cloud-native applications used extensively in daily operations
|
||||
- Widely adopted across organization
|
||||
- Utilizes private, public, and hybrid cloud platforms
|
||||
- Often partially reached — some capabilities may still be at levels 2 or 3
|
||||
- Transparent governance model to manage and measure cloud operations
|
||||
- Measuring end-to-end process performance and data usage
|
||||
|
||||
**Key Challenge:** Need for governance model when deploying cloud services quickly
|
||||
**Definition**: The organization makes extensive use of cloud-native applications across private, public, and hybrid cloud platforms.
|
||||
|
||||
### Level 5: Optimized (Highest Level)
|
||||
- Open and interoperable cloud environment
|
||||
- Actively developed using metrics and data
|
||||
- Processes are optimized
|
||||
- Decisions are data-driven
|
||||
- Adeptly use various cloud platforms
|
||||
- Flexibly move workloads between platforms
|
||||
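**Characteristics**:
- Cloud-native applications used extensively in daily operations
- Spans multiple cloud platforms (private/public/hybrid)
- Transparent governance model for managing cloud operations
- End-to-end process performance is measurable
- Data usage and optimization improve continuously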
|
||||
|
||||
**Reality Check:** Often more aspirational than real. Companies usually lag in optimizing processes and fully leveraging data. Can be overinvestment if extensive hybrid cloud solutions are optional.
|
||||
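**Common challenges**:
- A governance model is needed when deploying cloud services quickly
- Making use of data requires specific skills and tool optimization
- Some organizations reach Level 4 only partially (some capabilities remain at Level 2/3)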
|
||||
|
||||
## Common Anti-Pattern: Skipping Levels
|
||||
**Key success factors**:
- Transparent governance model
- End-to-end performance measurement
- Data-driven decisions
- A culture of continuous optimization
|
||||
|
||||
> "Often, businesses try to skip levels 2 and 3, aiming directly from level 0 or 1 to level 4 using technology solutions. While rapid technological change may seem attractive, ensuring long-term sustainability is crucial."
|
||||
---
|
||||
|
||||
## Key Insights
|
||||
### Level 5: Optimized
|
||||
|
||||
1. **Incremental Progress** — Sustainable cloud maturity requires incremental advancement through each level
|
||||
2. **Partial Maturity is Normal** — Organizations often partially reach level 4, with some capabilities still at levels 2 or 3
|
||||
3. **Not All Levels Are Necessary** — Selectively adopting Level 5 elements that bring clear business benefits may be more practical than full Level 5 achievement
|
||||
4. **Governance is Critical** — A transparent governance model becomes essential from Level 4 onwards
|
||||
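**Definition**: The highest maturity level. The organization operates in an open, interoperable cloud environment and actively develops it using metrics and data.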
|
||||
|
||||
## Related Concepts
|
||||
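**Characteristics**:
- Optimized processes and data-driven decisions
- Proficient use of various cloud platforms
- Flexible movement of workloads across platforms
- Continuous innovation and optimization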
|
||||
|
||||
- [[Cloud-Maturity-Model]]
|
||||
- [[Cloud-Adoption-Strategy]]
|
||||
- [[Cloud-Native]]
|
||||
- [[DevOps-Maturity]]
|
||||
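**⚠️ Reality check**:
- Level 5 is often more aspirational than real
- Many companies may build an open, interoperable cloud environment
- They usually lag in optimizing processes and fully leveraging data
- If extensive hybrid cloud solutions are optional, Level 5 can be an overinvestment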
|
||||
|
||||
## Sources
|
||||
**Best advice**: Rather than pursuing full Level 5, selectively adopt the elements that deliver clear business value.
|
||||
|
||||
- [[sources/cloud-maturity-model-a-detailed-guide-for-cloud-adoption.md]]
|
||||
## Level Progression Insights
|
||||
|
||||
```
|
||||
Level 0 ──→ Level 1 ──→ Level 2 ──→ Level 3 ──→ Level 4 ──→ Level 5
|
||||
(无云) (评估) (可重复) (系统化) (可衡量) (优化)
|
||||
↑ ↑ ↑
|
||||
└──────────────┴──── ⚠️ 跳过级别可能导致
|
||||
后续挑战和不必要的成本
|
||||
```
|
||||
|
||||
## Key Metrics by Level
|
||||
|
||||
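| Dimension | Level 1 | Level 2 | Level 3 | Level 4 | Level 5 |
|-----------|---------|---------|---------|---------|---------|
| Cost optimization | Initial assessment | TCO analysis | Continuous optimization | Automated tuning | Predictive optimization |
| Automation | Manual | Partial automation | Process automation | Orchestration | Autonomous operations |
| Governance | None | Basic policies | Documented governance | Transparent governance | Dynamic governance |
| Security | Basic | Compliance checks | Proactive security | Continuous monitoring | Active defense |
| Data utilization | Limited | Collection | Analysis | Insights | Prediction/AI |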
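
The table rates each dimension separately, and this page warns against skipping levels. One hedged way to combine the two is to treat the weakest dimension as the effective level and to flag dimensions that run far ahead of it. The function name, the dimension keys, and the gap threshold of 2 are assumptions made for illustration, not part of the model.

```python
def effective_maturity(levels: dict[str, int], max_gap: int = 2) -> tuple[int, list[str]]:
    """Return (effective level, dimensions running ahead of it)."""
    if not levels:
        raise ValueError("at least one dimension is required")
    # The weakest dimension bounds the overall level.
    overall = min(levels.values())
    # Dimensions far ahead of the weakest one hint at skipped levels.
    ahead = sorted(d for d, lvl in levels.items() if lvl - overall >= max_gap)
    return overall, ahead


# Hypothetical self-assessment scores per dimension.
scores = {"cost": 2, "automation": 4, "governance": 2, "security": 3, "data": 2}
print(effective_maturity(scores))
```

Using the minimum is deliberately conservative: investment in one dimension cannot raise the reported level while another dimension lags behind.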
|
||||
|
||||
## See Also
|
||||
|
||||
- [[Cloud Maturity Model]] — 主体框架
|
||||
- [[Cloud Adoption Strategy]] — 云采用策略
|
||||
- [[DevOps Maturity]] — DevOps 成熟度
|
||||
- [[DORA Metrics]] — DORA 指标
|
||||
- [[Cloud Governance]] — 云治理
|
||||
- [[FinOps]] — 云财务管理
|
||||
|
||||
134
wiki/concepts/Cloud-Native-Maturity-Model.md
Normal file
@@ -0,0 +1,134 @@
|
||||
# Cloud Native Maturity Model

> **Cloud Native Maturity Model**: a framework for assessing how mature an organization is in adopting cloud native technologies and services, based on the CNCF cloud native landscape.
|
||||
|
||||
## Definition
|
||||
|
||||
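The cloud native maturity model assesses an organization's maturity in the following areas:

- **Containerization and packaging**
- **Orchestration and automation**
- **Microservices and APIs**
- **Observability**
- **Service mesh**
- **DevOps practices**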
|
||||
|
||||
## CNCF Cloud Native Maturity Levels
|
||||
|
||||
### Level 1: Containerized

**Characteristics**
- Applications are containerized (Docker)
- A container image registry is in use
- Basic CI/CD pipelines

**Maturity indicators**
- [ ] 80%+ of applications containerized
- [ ] Standardized base images
- [ ] Image security scanning integrated
|
||||
|
||||
---
|
||||
|
||||
### Level 2: Orchestrated

**Characteristics**
- Kubernetes cluster management
- Automated deployment and scaling
- Basic resource scheduling

**Maturity indicators**
- [ ] K8s used in production
- [ ] HPA (Horizontal Pod Autoscaler)
- [ ] Namespace isolation
- [ ] Basic scheduling policies
|
||||
|
||||
---
|
||||
|
||||
### Level 3: Microservices

**Characteristics**
- Applications are split into microservices
- Services communicate through APIs
- Asynchronous message queues are in use
- Service discovery

**Maturity indicators**
- [ ] More than 10 microservices
- [ ] 12-Factor App principles followed
- [ ] API gateway in use
- [ ] Message queue integrated
|
||||
|
||||
---
|
||||
|
||||
### Level 4: Meshed

**Characteristics**
- Service mesh deployed (Istio/Linkerd)
- Zero-trust network security
- Fine-grained traffic management
- Distributed tracing

**Maturity indicators**
- [ ] Service mesh used in production
- [ ] mTLS enforced
- [ ] Canary releases/A/B testing
- [ ] Full distributed tracing coverage
|
||||
|
||||
---
|
||||
|
||||
### Level 5: Auto-Pilot

**Characteristics**
- Policy as code
- Automated enforcement of security policies
- Self-healing
- Intelligent scaling

**Maturity indicators**
- [ ] Policies enforced with OPA/Kyverno
- [ ] Automatic failure recovery
- [ ] Predictive scaling
- [ ] FinOps automation
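
The per-level checklists can be turned into a simple score. The sketch below assumes a level counts only when its own checklist and every lower one are fully ticked; that rule and the function name are illustrative assumptions, not something the model prescribes.

```python
def cloud_native_level(checklists: dict[int, list[bool]]) -> int:
    """Highest level whose checklist, and every lower one, is fully ticked."""
    level = 0
    for lvl in sorted(checklists):
        # Stop at the first gap in numbering or the first incomplete checklist.
        if lvl != level + 1 or not all(checklists[lvl]):
            break
        level = lvl
    return level


# Hypothetical assessment: Level 3 is missing one indicator.
assessment = {
    1: [True, True, True],
    2: [True, True, True, True],
    3: [True, False, True, True],
    4: [True, True, True, True],
}
print(cloud_native_level(assessment))
```

A fully ticked Level 4 checklist does not count here because Level 3 is incomplete, which mirrors the advice against skipping levels.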
|
||||
|
||||
## CNCF Cloud Native Landscape
|
||||
|
||||
### Container Runtime
|
||||
- containerd, CRI-O
|
||||
|
||||
### Orchestration & Management
|
||||
- Kubernetes, Helm, Kustomize
|
||||
|
||||
### Coordination & Service Discovery
|
||||
- etcd, CoreDNS
|
||||
|
||||
### Networking & Service Mesh
|
||||
- CNI (Calico, Flannel)
|
||||
- Envoy, Istio, Linkerd
|
||||
|
||||
### Observability
|
||||
- Prometheus, Grafana, Jaeger, Loki, OpenTelemetry
|
||||
|
||||
### Serverless / FaaS
|
||||
- Knative, OpenFaaS, AWS Lambda, Azure Functions
|
||||
|
||||
### Application Definition
|
||||
- Helm, Kustomize, OAM, Dapr
|
||||
|
||||
## Assessment Criteria
|
||||
|
||||
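| Capability | L1 | L2 | L3 | L4 | L5 |
|------------|----|----|----|----|----|
| **Containerization** | Ad-hoc | Standard images | Unified pipeline | Automated scanning | Signature verification |
| **Orchestration** | Manual | Basic K8s | Automated deployment | Policy-driven | Self-healing |
| **Architecture** | Monolith | Modular | Microservices | Service mesh | Hybrid |
| **Networking** | Basic | VPC | API gateway | Service mesh | Zero trust |
| **Observability** | Basic logs | Metrics | Tracing | Full coverage | Predictive |
| **Security** | Manual | IAM | Scanning | Policy as code | Adaptive |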
|
||||
|
||||
## See Also
|
||||
|
||||
- [[Cloud-Native]] — 云原生核心概念
|
||||
- [[Cloud Maturity Model]] — 云成熟度模型
|
||||
- [[DevOps Maturity]] — DevOps 成熟度
|
||||
- [[Kubernetes]] — Kubernetes
|
||||
- [[Service Mesh]] — 服务网格
|
||||
@@ -1,35 +1,156 @@
|
||||
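# Cloud-Native

> **Cloud-Native**: an approach to building and running applications that takes full advantage of the cloud computing delivery model, treating compute, storage, and networking as elastically scalable services.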
|
||||
|
||||
## Definition
|
||||
Cloud-native is an approach to building and running applications that fully exploits the advantages of the cloud computing delivery model.
|
||||
|
||||
## Core Characteristics
|
||||
- **Microservices Architecture**: Applications built as small, independently deployable services
|
||||
- **Containers**: Lightweight, portable packaging for applications
|
||||
- **Dynamic Orchestration**: Automated management of containers (e.g., Kubernetes)
|
||||
- **API-Based Communication**: Services communicate via lightweight APIs
|
||||
- **DevOps Practices**: Continuous integration and delivery
|
||||
Cloud-native is a set of technical methods and best practices for:
|
||||
|
||||
## Key Technologies
|
||||
- **Containers**: Docker, containerd, Podman
|
||||
- **Orchestration**: Kubernetes, Amazon EKS, Azure AKS, Google GKE
|
||||
- **Service Mesh**: Istio, Linkerd, Consul Connect
|
||||
- **Serverless**: AWS Lambda, Azure Functions, Google Cloud Functions
|
||||
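- Building applications that run elastically in cloud environments
- Using the managed services and automation capabilities of cloud platforms
- Adopting DevOps and CI/CD practices
- Using containers, serverless, and microservice architectures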
|
||||
|
||||
## Benefits
|
||||
- Scalability and elasticity
|
||||
- Resilience and fault isolation
|
||||
- Faster deployment cycles
|
||||
- Resource efficiency
|
||||
- Portability across cloud providers
|
||||
## Cloud Native Computing Foundation (CNCF)
|
||||
|
||||
## Sources
|
||||
- [[sources/cloud-devop-maturity-guideline.md]]
|
||||
CNCF defines the core technology elements of cloud native:
|
||||
|
||||
## Related Concepts
|
||||
- [[concepts/DevOps-Maturity]]
|
||||
- [[concepts/CI-CD-Pipeline]]
|
||||
- [[concepts/Infrastructure-as-Code]]
|
||||
```
|
||||
Cloud-Native Stack:
|
||||
├── Container Runtime (containerd, CRI-O)
|
||||
├── Orchestration (Kubernetes)
|
||||
├── Service Mesh (Istio, Linkerd)
|
||||
├── Observability (Prometheus, Grafana)
|
||||
├── Networking (CNI, Envoy)
|
||||
└── Storage (CSI)
|
||||
```
|
||||
|
||||
## Ingested
|
||||
- Date: 2026-04-21
|
||||
## 12-Factor App Methodology
|
||||
|
||||
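Cloud-native applications follow the 12-Factor principles:

| # | Factor | Description |
|---|--------|-------------|
| 1 | Codebase | One codebase tracked in version control |
| 2 | Dependencies | Explicitly declare dependencies |
| 3 | Config | Store config in the environment, separate from code |
| 4 | Backing Services | Treat backing services as attached resources |
| 5 | Build, Release, Run | Strictly separate build and run stages |
| 6 | Processes | Stateless processes |
| 7 | Port Binding | Export services via port binding |
| 8 | Concurrency | Scale out via the process model |
| 9 | Disposability | Fast startup and graceful shutdown |
| 10 | Dev/Prod Parity | Keep development, staging, and production as similar as possible |
| 11 | Logs | Treat logs as event streams |
| 12 | Admin Processes | Run admin tasks as one-off processes |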
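
Factor 3 (Config) is the easiest one to show in code. The sketch below reads settings from environment variables so the same build can run in any environment. The variable names `DATABASE_URL` and `REQUEST_TIMEOUT_S` and the example URL are hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    database_url: str
    request_timeout_s: float


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read configuration from environment variables (factor 3)."""
    return Settings(
        # Required: a missing variable raises KeyError at startup.
        database_url=env["DATABASE_URL"],
        # Optional: fall back to a default when unset.
        request_timeout_s=float(env.get("REQUEST_TIMEOUT_S", "5")),
    )


# Passing a plain dict instead of os.environ keeps the example self-contained.
settings = load_settings({"DATABASE_URL": "postgres://db.example.internal/app"})
print(settings.request_timeout_s)
```

Accepting the environment as a parameter is a design choice that makes the loader easy to test without touching the real process environment.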
|
||||
|
||||
## Core Technologies
|
||||
|
||||
### Containers
|
||||
|
||||
**Advantages**
- Lightweight virtualization
- Consistent environments
- Fast startup
- Resource isolation

**Key tools**
- Docker, containerd, Podman
- OCI image specification
|
||||
|
||||
### Kubernetes
|
||||
|
||||
**Core concepts**
- Pod (smallest schedulable unit)
- Deployment (declarative updates)
- Service (network abstraction)
- Ingress (HTTP routing)
- ConfigMap/Secret (configuration management)

**Ecosystem components**
- Helm (package management)
- Kustomize (configuration management)
- Operators (self-healing automation)
|
||||
|
||||
### Service Mesh
|
||||
|
||||
**Capabilities**
- Zero-trust network security
- Observability (traces, metrics, logs)
- Traffic management (A/B testing, canary releases)
- Circuit breaking and rate limiting

**Tools**
- Istio, Linkerd, Consul Connect
|
||||
|
||||
### Serverless
|
||||
|
||||
**Advantages**
- No server management
- Elastic scaling
- Pay per use
- Rapid prototyping

**Services**
- AWS Lambda, Azure Functions, GCP Cloud Functions
- Knative, OpenFaaS
|
||||
|
||||
### Observability
|
||||
|
||||
**The three pillars**
- **Metrics**: Prometheus, Datadog
- **Logs**: ELK Stack, Loki
- **Traces**: Jaeger, Zipkin
|
||||
|
||||
## Cloud-Native Maturity Model
|
||||
|
||||
| Level | Characteristics |
|-------|-----------------|
| **L1: Containerized** | Applications are containerized |
| **L2: Orchestrated** | Container orchestration (K8s) |
| **L3: Microservices** | Microservice architecture |
| **L4: Meshed** | Service mesh |
| **L5: Auto-Pilot** | Autonomous operations, self-healing |
|
||||
|
||||
## Cloud-Native vs Traditional
|
||||
|
||||
| Dimension | Cloud-Native | Traditional |
|-----------|--------------|-------------|
| **Architecture** | Microservices | Monolith |
| **Deployment** | Containers | Physical machines/VMs |
| **Scaling** | Automatic, elastic | Manual, vertical |
| **Delivery** | CI/CD | Traditional releases |
| **Availability** | Multiple replicas | Active/standby |
| **Cost** | On demand | Fixed |
| **Recovery** | Automatic failover | Manual recovery |
|
||||
|
||||
## Best Practices
|
||||
|
||||
### Design for Failure
|
||||
|
||||
- Idempotent design
- Timeouts and retries
- Circuit breaker pattern
- Rate limiting
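
Of the patterns listed above, the circuit breaker is the least obvious, so here is a minimal sketch. It assumes a simple policy: open after a number of consecutive failures, reject calls during a cool-down, then allow one trial call. The class and parameter names are illustrative, and production code would more likely rely on a service mesh or an established library.

```python
import time


class CircuitBreaker:
    """Open after `max_failures` consecutive errors; retry after `reset_after` seconds."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: call rejected")
            # Cool-down elapsed: let one trial call through (half-open state).
            self.opened_at = None
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self.failures = 0
        return result
```

The clock is injectable so the cool-down can be exercised in tests without sleeping.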
|
||||
|
||||
### Observability
|
||||
|
||||
- Structured logging
- Distributed tracing
- Metrics and alerting
- Health checks
|
||||
|
||||
### Security
|
||||
|
||||
- Zero-trust architecture
- Least privilege
- Secrets management
- Image security scanning
|
||||
|
||||
## See Also
|
||||
|
||||
- [[Cloud Maturity Model]] — 云成熟度模型
|
||||
- [[DevOps Maturity]] — DevOps 成熟度
|
||||
- [[Kubernetes]] — K8s
|
||||
- [[Microservices]] — 微服务
|
||||
- [[Service Mesh]] — 服务网格
|
||||
|
||||
125
wiki/concepts/Cloud-Operating-Model.md
Normal file
@@ -0,0 +1,125 @@
|
||||
---
|
||||
title: "Cloud Operating Model"
|
||||
type: concept
|
||||
tags: [Cloud, Cloud Strategy, Cloud Governance, Cloud Operations]
|
||||
date: 2026-04-26
|
||||
---
|
||||
|
||||
# Cloud Operating Model (云运营模型)
|
||||
|
||||
## Definition
|
||||
A **Cloud Operating Model (COM)** is a framework that standardizes how organizations manage cloud resources, security, automation, and costs across cloud environments. It provides guardrails for building a secure framework for cloud operations and management from a cost and risk standpoint.
|
||||
|
||||
## Core Pillars
|
||||
|
||||
### 1. Governance & Compliance (治理与合规)
|
||||
- Standardized policies ensuring compliance across cloud environments
|
||||
- Security, access control, and compliance policies
|
||||
- Teams follow best practices while maintaining agility
|
||||
|
||||
### 2. Automation & Orchestration (自动化与编排)
|
||||
- Infrastructure as Code (IaC) for deployment automation
|
||||
- CI/CD pipelines for continuous software delivery
|
||||
- Event-driven automation (e.g., AWS Lambda, Azure Functions)
|
||||
|
||||
### 3. Security & Risk Management (安全与风险管理)
|
||||
- Zero Trust Security Model (no implicit trust, continuous verification)
|
||||
- Real-time threat detection
|
||||
- Automated security patching
|
||||
|
||||
### 4. Cloud Financial Management - FinOps (云财务管理)
|
||||
- Real-time cost tracking and allocation
|
||||
- Reserved Instances & Spot Instances for cost optimization
|
||||
- Budget alerts and predictive analysis
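
As a rough illustration of budget alerts with a predictive element, the sketch below projects month-end spend from the average daily run rate and compares it with the budget. The linear projection, the function names, and the figures are assumptions. Cloud providers' cost tools use their own forecasting methods.

```python
def forecast_month_end(spend_to_date: float, day_of_month: int, days_in_month: int) -> float:
    """Project month-end spend from the average daily run rate so far."""
    if not 1 <= day_of_month <= days_in_month:
        raise ValueError("day_of_month out of range")
    return spend_to_date / day_of_month * days_in_month


def budget_alert(spend_to_date: float, day_of_month: int, days_in_month: int,
                 budget: float, threshold: float = 1.0) -> bool:
    """Return True when projected spend exceeds threshold * budget."""
    projected = forecast_month_end(spend_to_date, day_of_month, days_in_month)
    return projected > threshold * budget


# Hypothetical month: 4,000 spent by day 10 of 30 against a 10,000 budget.
print(forecast_month_end(4000, 10, 30), budget_alert(4000, 10, 30, budget=10000))
```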
|
||||
|
||||
## Six-Step Design Process
|
||||
|
||||
1. **Assess Cloud Maturity & Business Objectives**
|
||||
- Ad-hoc Cloud Adoption → Cloud-First Strategy → Cloud-Native Enterprise
|
||||
|
||||
2. **Create Governance & Compliance Framework**
|
||||
- Define IAM roles and policies
|
||||
- Automated compliance checks
|
||||
- Guardrails for resource provisioning
|
||||
|
||||
3. **Automate Cloud Operations (IaC, DevOps)**
|
||||
- Terraform, CloudFormation, Azure Bicep
|
||||
- CI/CD with GitHub Actions, CodePipeline
|
||||
- Serverless automation
|
||||
|
||||
4. **Implement Cost Management & Optimization (FinOps)**
|
||||
- Reserved/Spot Instances (40-70% compute cost reduction)
|
||||
- Auto-scaling & Right-sizing
|
||||
- Resource tagging and monitoring
|
||||
|
||||
5. **Strengthen Security & Risk Mitigation**
|
||||
- Zero Trust Security Model
|
||||
- Real-time threat detection (GuardDuty, Sentinel)
|
||||
- Automated security patching
|
||||
|
||||
6. **Continuous Monitoring & AI-Driven Optimization**
|
||||
- Observability & AIOps
|
||||
- Real-time cloud monitoring (CloudWatch, Azure Monitor)
|
||||
- Self-healing systems
|
||||
|
||||
## Key Benefits
|
||||
|
||||
| Benefit | Description |
|
||||
|---------|-------------|
|
||||
| Standardized Governance | Ensures compliance across cloud environments |
|
||||
| Cost Optimization | Implements FinOps strategies to prevent overspending |
|
||||
| Improved Security | Automates security policies and access controls |
|
||||
| Operational Agility | Enables DevOps, CI/CD, and auto-scaling |
|
||||
| Multi-Cloud Flexibility | Reduces vendor lock-in and enhances resilience |
|
||||
|
||||
## Industry Use Cases
|
||||
|
||||
### Financial Services
|
||||
- Regulatory compliance automation (GDPR, PCI-DSS, SOC 2)
|
||||
- FinOps for cost tracking and optimization
|
||||
- Zero Trust security model for data protection
|
||||
|
||||
### Healthcare
|
||||
- HIPAA, HITRUST, GDPR compliance enforcement
|
||||
- Data encryption and multi-layer access control
|
||||
- AI/ML for diagnostics
|
||||
|
||||
### Retail & E-Commerce
|
||||
- Auto-scaling for peak demand
|
||||
- Multi-cloud strategy to avoid vendor lock-in
|
||||
- Personalized customer experiences via AI
|
||||
|
||||
### SaaS & Tech Companies
|
||||
- CI/CD pipelines for continuous updates
|
||||
- Serverless and containerized architectures
|
||||
- DevSecOps for security-first development
|
||||
|
||||
## Challenges & Solutions
|
||||
|
||||
| Challenge | Solution |
|
||||
|-----------|----------|
|
||||
| Vendor Lock-In | Multi-cloud strategy + Docker/Kubernetes + Terraform |
|
||||
| Cost Overruns | FinOps + Reserved/Spot instances + automated shutdown |
|
||||
| Compliance Risks | Policy-as-Code + AWS Config/Azure Policy + RBAC |
|
||||
| Skills Gap | Automation tools + workforce upskilling |
|
||||
|
||||
## Related Concepts
|
||||
- [[Cloud Governance]]
|
||||
- [[FinOps]]
|
||||
- [[Zero-Trust-Security]]
|
||||
- [[Multi-Cloud Strategy]]
|
||||
- [[Infrastructure as Code]]
|
||||
- [[AIOps]]
|
||||
- [[Cloud Cost Optimization]]
|
||||
- [[DevOps Maturity]]
|
||||
- [[Policy-as-Code]]
|
||||
|
||||
## Related Entities
|
||||
- [[AWS]]
|
||||
- [[Azure]]
|
||||
- [[Google-Cloud]]
|
||||
- [[Terraform]]
|
||||
- [[Kubernetes]]
|
||||
|
||||
## References
|
||||
- [Bacancy Technology: Cloud Operating Model](https://www.bacancytechnology.com/blog/cloud-operating-model)
|
||||
148
wiki/concepts/Cloud-Security-Maturity-Model.md
Normal file
@@ -0,0 +1,148 @@
|
||||
# Cloud Security Maturity Model (CSMM)

> **Cloud Security Maturity Model (CSMM)**: a framework for assessing the maturity of an organization's cloud security program, organized into 3 domains that span 12 security areas.
|
||||
|
||||
## Definition
|
||||
|
||||
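CSMM is a vendor-neutral security maturity assessment framework that helps organizations:

- Assess the current state of cloud security
- Identify security gaps
- Build a security improvement roadmap
- Quantify the return on security investment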
|
||||
|
||||
## CSMM Structure
|
||||
|
||||
### 3 Domains
|
||||
|
||||
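| Domain | Description |
|--------|-------------|
| **Governance & Strategy** | Security governance, strategy, and risk management |
| **Technical Controls** | Implemented security technologies and controls |
| **Operational Processes** | Security operations processes and practices |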
|
||||
|
||||
### 12 Security Domains
|
||||
|
||||
| Security area | Scope |
|---------------|-------|
| **Asset Management** | Asset discovery, classification, inventory |
| **Compliance Management** | Regulatory compliance, audits |
| **Data Security** | Data classification, encryption, lifecycle |
| **Governance & Risk** | Security policies, risk assessment |
| **Identity & Access** | IAM, privileged access, MFA |
| **Infrastructure Security** | Network, compute, and storage security |
| **Application Security** | Secure development, testing, deployment |
| **Endpoint Security** | Endpoint protection, EDR |
| **Logging & Monitoring** | SIEM, log management, alerting |
| **Incident Response** | Detection, response, recovery |
| **Supply Chain Security** | Third-party risk management |
| **Human Factors** | Security awareness, training, culture |
|
||||
|
||||
## 5 Maturity Levels
|
||||
|
||||
| Level | Name | Description |
|-------|------|-------------|
| **1** | Initial | No formal processes; reactive security |
| **2** | Developing | Basic controls, documented |
| **3** | Defined | Standard processes, comprehensive coverage |
| **4** | Managed | Continuous monitoring, quantitative management |
| **5** | Optimizing | Continuous improvement, proactive defense |
|
||||
|
||||
## Level Characteristics
|
||||
|
||||
### Level 1: Initial

- No formal security processes
- Reactive handling of issues
- Reliance on individual knowledge
- No security metrics

### Level 2: Developing

- Basic security policies
- Limited IAM controls
- Basic logging
- Security incidents are recorded

### Level 3: Defined

- Documented security policies
- Comprehensive IAM/MFA
- SIEM deployed
- Security training program
- Incident response process

### Level 4: Managed

- Continuous security monitoring
- Automated security controls
- KPI tracking
- Regular penetration testing
- Threat intelligence integration

### Level 5: Optimizing

- AI/ML-driven security
- Automated response
- Predictive threat analysis
- Zero-trust architecture
- Security as code
|
||||
|
||||
## Assessment Areas
|
||||
|
||||
### Governance & Strategy

| Assessment item | Maturity levels |
|-----------------|-----------------|
| Security policy documentation | L1-L5 |
| Risk assessment process | L1-L5 |
| Security metrics and reporting | L3-L5 |
| Board involvement | L4-L5 |

### Technical Controls

| Assessment item | Maturity levels |
|-----------------|-----------------|
| IAM/MFA | L2-L5 |
| Network segmentation | L2-L5 |
| Data encryption | L3-L5 |
| Container security | L3-L5 |
| Cloud security posture management | L4-L5 |

### Operational Processes

| Assessment item | Maturity levels |
|-----------------|-----------------|
| Vulnerability management | L2-L5 |
| Incident response | L3-L5 |
| Security operations | L4-L5 |
| Compliance monitoring | L3-L5 |
|
||||
|
||||
## Implementation
|
||||
|
||||
### Step 1: Assessment
- Self-assessment or third-party assessment
- Questionnaires
- Technical verification

### Step 2: Gap Analysis
- Benchmark against the CSMM maturity levels
- Identify priority improvements

### Step 3: Roadmap
- Draw up an improvement plan
- Assign resources and responsibilities
- Set a timeline

### Step 4: Execution
- Implement the security improvements
- Monitor progress continuously
- Review regularly
|
||||
|
||||
## See Also
|
||||
|
||||
- [[Cloud Security]] — 云安全
|
||||
- [[Cloud Governance]] — 云治理
|
||||
- [[Cloud Maturity Model]] — 云成熟度模型
|
||||
- [[Zero Trust]] — 零信任
|
||||
- [[CSPM]] — 云安全态势管理
|
||||
69
wiki/concepts/Compliance-Automation.md
Normal file
@@ -0,0 +1,69 @@
|
||||
# Compliance Automation
|
||||
|
||||
## Definition
|
||||
Compliance automation uses technology to automatically enforce, monitor, and validate security and regulatory compliance requirements.
|
||||
|
||||
## Aliases
|
||||
- Automated Compliance
|
||||
- Policy Automation
|
||||
- Regulatory Automation
|
||||
|
||||
## Concept
|
||||
合规自动化使用技术手段自动执行、监控和验证安全及监管合规要求。
|
||||
|
||||
## Key Frameworks
|
||||
|
||||
### SOC 2
|
||||
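System and Organization Controls 2: a compliance framework covering a service organization's controls for security, availability, processing integrity, confidentiality, and privacy.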
|
||||
|
||||
### ISO 27001
|
||||
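The international standard for information security management systems (ISMS). It specifies requirements for establishing, implementing, maintaining, and continually improving an ISMS.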
|
||||
|
||||
### GDPR
|
||||
The EU General Data Protection Regulation, which sets requirements for processing personal data and protecting privacy.
|
||||
|
||||
### HIPAA
|
||||
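The US Health Insurance Portability and Accountability Act, which protects the confidentiality, integrity, and availability of health information.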
|
||||
|
||||
## Automation Tools
|
||||
- Chef InSpec: compliance as code
- Ansible: configuration and compliance automation
- AWS Config: compliance of cloud resources
- Azure Policy: Azure compliance
- Terraform Sentinel: IaC compliance
|
||||
|
||||
## Implementation
|
||||
|
||||
### Policy as Code
|
||||
```ruby
# Chef InSpec example (relies on the inspec-aws resource pack)
control 'cis-aws-foundations-1.1' do
  impact 1.0
  title 'Ensure MFA is enabled for all IAM users'
  # Select users without MFA; the control passes only when none exist.
  # Filter name follows inspec-aws; older in-core resources used has_mfa_enabled?.
  describe aws_iam_users.where(has_mfa_enabled: false) do
    it { should_not exist }
  end
end
```
|
||||
|
||||
### Continuous Compliance
|
||||
- Monitor configuration state in real time
- Remediate violations automatically
- Generate compliance reports
|
||||
|
||||
## Benefits
|
||||
- Lower manual audit costs
- Continuous rather than periodic compliance
- Faster response to regulatory change
- Fewer human errors
|
||||
|
||||
## Related Concepts
|
||||
- [[DevSecOps]] — 合规自动化是 DevSecOps 的重要组成
|
||||
- [[Policy-as-Code]] — 以代码管理策略
|
||||
- [[ISO-27001]] — 信息安全管理标准
|
||||
- [[HIPAA]] — 医疗健康合规
|
||||
- [[GDPR]] — 数据保护法规
|
||||
- [[Continuous-Compliance]] — 持续合规
|
||||
|
||||
## Sources
|
||||
- [[what-is-devsecops-best-practices-benefits-and-tools]]
|
||||
72
wiki/concepts/Configuration-Management.md
Normal file
@@ -0,0 +1,72 @@
|
||||
---
|
||||
title: "Configuration Management"
|
||||
type: concept
|
||||
tags: [itsm, devops, operations]
|
||||
date: 2025-03-01
|
||||
---
|
||||
|
||||
## Definition
|
||||
|
||||
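Configuration Management is one of the core [[ITSM]] processes. It **identifies, records, controls, and tracks every Configuration Item (CI) in the IT environment and the relationships between them**, supplying the baseline data for change impact analysis and incident diagnosis.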
|
||||
|
||||
## Configuration Item Types
|
||||
|
||||
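| CI type | Examples |
|---------|----------|
| Hardware | Servers, network devices, storage |
| Software | Operating systems, applications, middleware |
| Documentation | Architecture diagrams, process documents |
| People | Operations staff, service owners |
| Services | Application services, API endpoints |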
|
||||
|
||||
## Configuration Management Process
|
||||
|
||||
```
|
||||
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
|
||||
│ CI │ → │ Relationship│ → │ Impact │
|
||||
│ Discovery │ │ Mapping │ │ Analysis │
|
||||
└──────────────┘ └──────────────┘ └──────────────┘
|
||||
↓ ↓ ↓
|
||||
Auto Scan Dependency Change Planning
|
||||
+ Manual Entry Graph + Incident RCA
|
||||
```
|
||||
|
||||
## Modern Configuration Management (ITSM 2.0)
|
||||
|
||||
In [[ITSM 2.0]], configuration management is backed by an AI-driven [[CMDB]]:
|
||||
|
||||
### AI-Enhanced Capabilities
|
||||
|
||||
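| Capability | Description | Value |
|------------|-------------|-------|
| Dependency Mapping | Automatic discovery of service dependencies | Change impact analysis |
| Drift Detection | Real-time detection of configuration drift | Security and compliance |
| Real-time Impact Analysis | Impact analysis in real time | Faster decisions |
| Multi-cloud Orchestration | Configuration management across clouds | Unified view |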
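
Drift detection boils down to comparing the recorded (desired) configuration of a CI with what is actually observed. The sketch below does that for flat key/value settings. The function name and the sample attributes are hypothetical, and a real CMDB would also handle nested structures and relationships.

```python
def detect_drift(desired: dict, actual: dict) -> dict:
    """Return {setting: (desired, actual)} for every setting that differs.

    A setting missing on one side is reported with None on that side.
    """
    drift = {}
    for key in desired.keys() | actual.keys():
        want, have = desired.get(key), actual.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical CI record versus the state found by a discovery scan.
recorded = {"instance_type": "t3.medium", "encrypted": True}
observed = {"instance_type": "t3.large", "encrypted": True, "public_ip": "203.0.113.10"}
print(detect_drift(recorded, observed))
```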
|
||||
|
||||
### [[CMDB]] in Action
|
||||
|
||||
```
Multi-cloud Environment
├── Public Cloud (AWS/Azure/GCP)
├── Private Cloud (VMware/OpenStack)
└── Hybrid Environment
        ↓
     AI-CMDB
├── Automatic CI discovery
├── Relationship mapping
├── Drift detection
└── Impact prediction
```
|
||||
|
||||
## Related Concepts
|
||||
|
||||
- [[ITSM]] — 父框架
|
||||
- [[CMDB]] — 配置管理数据库
|
||||
- [[Change-Management]] — 变更管理
|
||||
- [[Multi-Cloud]] — 多云环境
|
||||
- [[IaC]] — 基础设施即代码
|
||||
|
||||
## Sources
|
||||
|
||||
- [[understanding-complete-itsm]] — AI-powered CMDB for Configuration Management
|
||||
98
wiki/concepts/Cron定时任务.md
Normal file
@@ -0,0 +1,98 @@
|
||||
---
|
||||
title: "Cron定时任务"
|
||||
tags: [linux, automation, devops, ubuntu]
|
||||
date: 2026-04-26
|
||||
---
|
||||
|
||||
# Cron定时任务 (Cron Job)
|
||||
|
||||
## Definition
|
||||
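Cron is the job scheduler on Linux/Unix systems. It lets users configure commands or scripts to run automatically at specified times. Cron manages all scheduled jobs through crontab (cron table) configuration files.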
|
||||
|
||||
## Basic Format
|
||||
```
┌───────────── minute (0-59)
│ ┌───────────── hour (0-23)
│ │ ┌───────────── day of month (1-31)
│ │ │ ┌───────────── month (1-12)
│ │ │ │ ┌───────────── day of week (0-7; 0 and 7 both mean Sunday)
│ │ │ │ │
* * * * * command
```
|
||||
|
||||
## Examples
|
||||
|
||||
### Run a backup every day at 3:00 AM
|
||||
```
|
||||
0 3 * * * /usr/local/bin/rsync_backup.sh
|
||||
```
|
||||
|
||||
### Run every 6 hours
|
||||
```
|
||||
0 */6 * * * /path/to/script.sh
|
||||
```
|
||||
|
||||
### Run every Sunday at 2:00 AM
|
||||
```
|
||||
0 2 * * 0 /path/to/weekly_backup.sh
|
||||
```
|
||||
|
||||
### Run at 4:00 AM on the first day of every month
|
||||
```
|
||||
0 4 1 * * /path/to/monthly_backup.sh
|
||||
```
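
To make the field syntax in the examples above concrete, the following sketch expands a single cron field into the values it matches. It covers only `*`, `*/n`, `a-b`, `a-b/n`, and comma lists. Month and weekday names and the 0/7 Sunday alias are deliberately left out, and `expand_field` is a hypothetical helper, not part of cron.

```python
def expand_field(field: str, lo: int, hi: int) -> list[int]:
    """Expand one cron field into the sorted list of values it matches."""
    values = set()
    for part in field.split(","):
        rng, _, step_text = part.partition("/")
        step = int(step_text) if step_text else 1
        if rng == "*":
            start, end = lo, hi
        elif "-" in rng:
            a, b = rng.split("-")
            start, end = int(a), int(b)
        else:
            start = end = int(rng)
        if not (lo <= start <= end <= hi) or step < 1:
            raise ValueError(f"invalid cron field: {field!r}")
        values.update(range(start, end + 1, step))
    return sorted(values)


# Hours matched by the hour field of "0 */6 * * *".
print(expand_field("*/6", 0, 23))
```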
|
||||
|
||||
## Crontab Commands
|
||||
```bash
# Edit the current user's crontab
crontab -e

# List the current user's crontab
crontab -l

# Remove the current user's crontab
crontab -r

# Edit root's crontab
sudo crontab -e
```
|
||||
|
||||
## Best Practices for Backup Jobs
|
||||
|
||||
### 1. Use absolute paths
Cron jobs run in a different environment from an interactive shell, and the PATH environment variable may be incomplete:
```bash
0 3 * * * /usr/local/bin/rsync_backup.sh   # ✅ absolute path
```
|
||||
|
||||
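### 2. Use nohup to keep the job running in the background
```bash
0 3 * * * sudo nohup /usr/local/bin/rsync_backup.sh > /dev/null 2>&1 &
```

Cron already runs jobs detached from any terminal, so `nohup` and the trailing `&` are optional here. `sudo` works non-interactively only if passwordless sudo is configured, so adding the entry to root's crontab (`sudo crontab -e`) is usually simpler.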
|
||||
|
||||
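### 3. Use a lock file to prevent overlapping runs
```bash
LOCKFILE="/tmp/rsync_backup.lock"
# Skip this run if the PID recorded in the lock file is still alive.
if [ -e "${LOCKFILE}" ] && kill -0 "$(cat "${LOCKFILE}")" 2>/dev/null; then
    echo "Backup job is already running; skipping this run."
    exit 0
fi
echo $$ > "${LOCKFILE}"
# Remove the lock file when the script exits.
trap 'rm -f "${LOCKFILE}"' EXIT
```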
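
If the backup job is written in Python rather than shell, the same guard can be sketched with an advisory lock from the standard `fcntl` module (Unix only). This swaps the PID check for `flock`, which the kernel releases automatically when the process exits, so a stale lock file cannot block later runs. `try_lock` and the lock path are hypothetical names.

```python
import fcntl
import os


def try_lock(path: str):
    """Return an open file holding an exclusive lock, or None if already locked."""
    handle = open(path, "a+")
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    # Record our PID for anyone inspecting the file; the lock itself is the guard.
    handle.seek(0)
    handle.truncate()
    handle.write(str(os.getpid()))
    handle.flush()
    return handle
```

A job would call `try_lock("/tmp/rsync_backup.lock")` at startup, exit if it returns `None`, and keep the returned file object open for as long as the backup runs.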
|
||||
|
||||
### 4. Logging
```bash
LOG="/var/log/rsync_backup.log"
echo "--- Backup started: $(date) ---" >> "$LOG"
rsync -azR ... >> "$LOG" 2>&1
```
|
||||
|
||||
## Related Concepts
|
||||
- [[增量备份]] — Cron 定时任务自动执行增量备份
|
||||
- [[进程管理]] — 后台进程的控制与监控
|
||||
- [[挂载点检查]] — 备份前的安全检查
|
||||
- [[永久挂载]] — 确保 Cron 任务执行时存储可用
|
||||
|
||||
## See Also
|
||||
- [[Disaster-Recovery]] — Cron 任务实现自动化的灾备恢复点
|
||||
- [[RTO]] — Cron 任务频率影响恢复时间目标
|
||||
45
wiki/concepts/Cross-Account-Monitoring.md
Normal file
@@ -0,0 +1,45 @@
|
||||
---
|
||||
title: Cross-Account Monitoring
|
||||
type: concept
|
||||
tags: [AWS, Security, CloudOps, Multi-Account]
|
||||
date: 2025-10-24
|
||||
---
|
||||
|
||||
## Definition
|
||||
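Cross-Account Monitoring is the ability, in an AWS multi-account environment, to centrally monitor resources, logs, and metrics spread across many accounts through securely configured cross-account access. It is one of the core operational pillars of an AWS multi-account strategy.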
|
||||
|
||||
## Core Properties
|
||||
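- **Least privilege**: grant only the cross-account read permissions that are needed
- **Centralized visibility**: a single management view covers all accounts
- **Security boundary**: IAM role trust policies define clear trust boundaries
- **Audit trail**: every cross-account access leaves a CloudTrail record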
|
||||
|
||||
## AWS Implementation Mechanisms
|
||||
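- **AWS Organizations + SCPs**: Service Control Policies define the permission boundaries of each account
- **IAM Cross-Account Roles**: assuming a role in another account provides secure access
- **Amazon EventBridge**: event-driven cross-account event forwarding (the core mechanism of this solution)
- **Amazon CloudWatch Cross-Account Observability**: native cross-account observability in CloudWatch
- **AWS Security Hub**: centralized cross-account security posture management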
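
For the IAM cross-account role mechanism, the trust boundary is expressed as a trust policy on the role in each member account. The sketch below only builds that JSON document. The helper name is hypothetical, the account ID is a documentation placeholder, and using an external ID condition is an optional hardening choice rather than something this page specifies.

```python
import json


def monitoring_trust_policy(monitoring_account_id: str, external_id: str) -> str:
    """Trust policy letting a central monitoring account assume a member-account role."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Principals in the monitoring account may assume this role.
                "Principal": {"AWS": f"arn:aws:iam::{monitoring_account_id}:root"},
                "Action": "sts:AssumeRole",
                # Callers must also present the agreed external ID.
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }
    return json.dumps(policy, indent=2)


print(monitoring_trust_policy("111122223333", "example-external-id"))
```

The role's permissions policy (not shown) is where least privilege applies: it would grant read-only access to the logs and metrics being monitored.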
|
||||
|
||||
## Related Concepts
|
||||
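- [[AWS Organizations]]: provides the multi-account hierarchy that cross-account monitoring builds on
- [[Multi-Account Deployment]]: cross-account monitoring provides observability for multi-account deployments
- [[Centralized Logging]]: centralized logs are the data foundation of cross-account monitoring
- [[StackSets Deployment Visibility]]: monitoring StackSets deployments is a concrete use case of cross-account monitoring
- [[Landing Zone Architecture]]: the recommended AWS Landing Zone architecture includes a cross-account monitoring design
- [[DevSecOps]]: cross-account security monitoring is an important part of DevSecOps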
|
||||
|
||||
## Architecture Patterns
|
||||
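1. **Hub-and-Spoke**: the management account acts as the hub and member accounts as the spokes
2. **Event-Driven Fan-out**: EventBridge gathers events from each account into the management account
3. **Aggregated Dashboards**: Grafana/CloudWatch dashboards aggregate views across accounts
4. **Centralized Alerting**: alert rules are defined once in the management account and trigger across accounts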
|
||||
|
||||
## AWS Context
|
||||
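- AWS Organizations Management Account: the management account, which usually hosts central monitoring
- AWS Organizations Member Accounts: member accounts, where the monitored resources live
- Organizational Units (OUs): used to group and manage member accounts
- Trusted Access: AWS StackSets trusted access, which allows coordinated operations across accounts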
|
||||
- [[Cross-Account Monitoring]] ← enabled_by ← [[AWS Organizations]] Trusted Access
|
||||
- [[Cross-Account Monitoring]] ← uses ← [[Amazon EventBridge]] Custom Event Bus
|
||||
- [[Cross-Account Monitoring]] ← stores ← [[CloudWatch Logs (central-cloudformation-logs)]]
|
||||
64
wiki/concepts/DAST.md
Normal file
@@ -0,0 +1,64 @@
|
||||
# DAST (Dynamic Application Security Testing)
|
||||
|
||||
## Definition
|
||||
DAST tools simulate external attacks on applications to uncover vulnerabilities from an outsider's viewpoint. These tools are essential for identifying weaknesses that attackers could exploit.
|
||||
|
||||
## Aliases
|
||||
- Dynamic Application Security Testing
|
||||
- Black-box testing
|
||||
- Vulnerability scanning
|
||||
|
||||
## Characteristics
|
||||
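- **Runtime analysis**: tests the application while it is running
- **Black-box testing**: requires no knowledge of the internal code structure
- **Suited to the test/deployment stages**: applied once a running build is available in a test or staging environment
- **Simulates real attacks**: finds vulnerabilities from an attacker's perspective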
|
||||
|
||||
## What DAST Detects
|
||||
- Authentication and authorization issues
- API security vulnerabilities
- Misconfigurations
- Session management issues
- Business logic flaws
- Exposed API endpoints
|
||||
|
||||
## Tools
|
||||
- OWASP ZAP (Zed Attack Proxy)
|
||||
- Burp Suite
|
||||
- Acunetix
|
||||
- Netsparker
|
||||
- AppScan
|
||||
|
||||
## Integration
|
||||
DAST tools are typically used for:
- Integration testing in CI/CD pipelines
- Pre-release security scans
- Periodic penetration tests
- Production monitoring
|
||||
|
||||
## Comparison with Other Testing Methods
|
||||
|
||||
| Dimension | SAST | DAST | IAST |
|-----------|------|------|------|
| **Approach** | White-box (static) | Black-box (dynamic) | Gray-box (runtime) |
| **Needs source code** | Yes | No | Partially |
| **False-positive rate** | Medium | Low | Low |
| **Coverage** | Code level | Application level | Code and application level |
| **Stage** | Development | Test/deployment | Test |
|
||||
|
||||
## Limitations
|
||||
- Cannot pinpoint the offending line of code
- Cannot detect source-level vulnerabilities
- Scans are relatively slow
- May produce false positives
|
||||
|
||||
## Related Concepts
|
||||
- [[DevSecOps]] — DAST 是其重要组件
|
||||
- [[SAST]] — 静态应用安全测试(白盒)
|
||||
- [[IAST]] — 交互式应用安全测试
|
||||
- [[SCA]] — 软件组成分析
|
||||
- [[Penetration-Testing]] — 渗透测试
|
||||
- [[Vulnerability-Scanning]] — 漏洞扫描
|
||||
|
||||
## Sources
|
||||
- [[what-is-devsecops-best-practices-benefits-and-tools]]
|
||||
77
wiki/concepts/DRaaS.md
Normal file
@@ -0,0 +1,77 @@
|
||||
---
|
||||
title: "Disaster Recovery as a Service (DRaaS)"
|
||||
type: concept
|
||||
tags: [cloud, disaster-recovery, business-continuity, security]
|
||||
date: 2025-03-01
|
||||
---
|
||||
|
||||
## Definition
|
||||
|
||||
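Disaster Recovery as a Service (DRaaS) is a cloud-native disaster recovery solution that provides cloud-based failover, letting organizations restore critical business systems quickly while reducing the cost and complexity of traditional DR infrastructure.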
|
||||
|
||||
## Core Metrics
|
||||
|
||||
### RTO (Recovery Time Objective)
|
||||
- The maximum acceptable time to restore a system after a disaster
- A key business-continuity metric in [[ITSM]]
|
||||
|
||||
### RPO (Recovery Point Objective)
|
||||
- The maximum tolerable window of data loss, measured in time
- Determines backup frequency and strategy
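
The two metrics are measured from the same failure timestamp but in opposite directions: RPO looks back to the last good backup, RTO looks forward to restoration. A small sketch under that reading, with hypothetical timestamps and targets:

```python
from datetime import datetime, timedelta


def recovery_report(last_backup: datetime, failure: datetime, restored: datetime,
                    rto: timedelta, rpo: timedelta) -> dict:
    """Compare one incident against its RTO and RPO targets."""
    downtime = restored - failure       # measured against RTO
    data_loss = failure - last_backup   # measured against RPO
    return {
        "downtime": downtime,
        "data_loss_window": data_loss,
        "rto_met": downtime <= rto,
        "rpo_met": data_loss <= rpo,
    }


# Hypothetical incident: nightly backup at 03:00, failure at 08:30, restored at 09:15.
report = recovery_report(
    last_backup=datetime(2025, 3, 1, 3, 0),
    failure=datetime(2025, 3, 1, 8, 30),
    restored=datetime(2025, 3, 1, 9, 15),
    rto=timedelta(hours=1),
    rpo=timedelta(hours=4),
)
print(report["rto_met"], report["rpo_met"])
```

In this example the service is back within the one-hour RTO, but five and a half hours of data are lost against a four-hour RPO, which would argue for more frequent backups or continuous replication.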
|
||||
|
||||
## DRaaS vs Traditional DR
|
||||
|
||||
| Dimension | Traditional DR | DRaaS |
|-----------|----------------|-------|
| Cost | High CAPEX | Pay as you go |
| Recovery speed | Hours to days | Minutes |
| Complexity | High | Managed service |
| Testing | Difficult | Automated testing |
| Scalability | Limited | Cloud elasticity |
|
||||
|
||||
## Key Features (Modern DRaaS)
|
||||
|
||||
### AI-Driven Automated Failover
|
||||
```
Monitoring → Fault confirmation → Automatic trigger → Failover → Service recovery
    ↓               ↓                    ↓               ↓             ↓
  AIOps         ML model         Policy execution    DNS switch   Health check
```
|
||||
|
||||
### Multi-Cloud Support
|
||||
- Cross-cloud failover
- Hybrid-cloud disaster recovery
- Data sovereignty compliance
|
||||
|
||||
## In ITSM Context
|
||||
|
||||
Within [[ITSM]], DRaaS is the core of the [[Disaster-Recovery]] process:
|
||||
|
||||
```
Disaster Recovery & Business Continuity (ITSM 8.0)
├── AI-driven Automated Failover
│   ├── Intelligent fault detection
│   ├── Policy-driven failover
│   └── Automatic service recovery
├── RTO/RPO Optimization
│   ├── Continuous replication
│   ├── Incremental backup
│   └── Fast recovery
└── Cloud-native DRaaS
    ├── On-demand scaling
    ├── Managed service
    └── Cost optimization
```
|
||||
|
||||
## Related Concepts
|
||||
|
||||
- [[Disaster-Recovery]]: the overall disaster recovery framework
- [[RTO]]: recovery time objective
- [[RPO]]: recovery point objective
- [[Failover]]: the failover mechanism
- [[Business-Impact-Analysis]]: business impact analysis

## Sources

- [[understanding-complete-itsm]]: DRaaS in modern ITSM
- [[rto-vs-rpo-key-differences-for-modern-disaster-recovery]]: RTO and RPO in detail
|
||||
74
wiki/concepts/Data-Governance.md
Normal file
@@ -0,0 +1,74 @@
|
||||
---
|
||||
title: "Data Governance (Cloud)"
|
||||
type: concept
|
||||
tags: [cloud-computing, governance, data-management]
|
||||
date: 2025-03-02
|
||||
---
|
||||
|
||||
# Data Governance (Cloud)
|
||||
|
||||
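**Data Governance** refers to the tools and policies that cloud platforms provide for managing, monitoring, and protecting data, so that an enterprise retains full control over its data.

## Definition

Cloud data governance covers the whole data lifecycle: creation, storage, use, sharing, archiving, and destruction. Its core goal is to ensure the security, integrity, and compliance of data.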
|
||||
|
||||
## Key Components
|
||||
|
||||
### 1. Access Control
|
||||
- Role-based access control (RBAC)
- Attribute-based access control (ABAC)
- Principle of least privilege
- Periodic access reviews
|
||||
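A minimal sketch of how role-based access control and least privilege combine. The roles, the permission strings, and the `is_allowed` helper are all hypothetical and are not any cloud provider's IAM API; the point is only the deny-by-default lookup.

```python
# Hypothetical role-to-permission mapping: each role holds only what it needs.
ROLE_PERMISSIONS = {
    "analyst": {"dataset:read"},
    "engineer": {"dataset:read", "dataset:write"},
    "admin": {"dataset:read", "dataset:write", "dataset:delete"},
}


def is_allowed(role: str, permission: str) -> bool:
    """Deny by default: unknown roles and unlisted permissions are rejected."""
    return permission in ROLE_PERMISSIONS.get(role, set())


print(is_allowed("analyst", "dataset:read"))    # an analyst may read
print(is_allowed("analyst", "dataset:delete"))  # but may not delete
```

A periodic access review then amounts to auditing this mapping and removing permissions that are no longer needed.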
|
||||
### 2. Data Encryption
|
||||
- Encryption in transit (TLS 1.3)
- Encryption at rest (AES-256)
- Customer-managed keys (CMK)
- Key management service (KMS)

### 3. Audit & Monitoring
- Access logs (CloudTrail, Azure Monitor, Cloud Logging)
- Real-time alerting
- Compliance reporting
- Data lineage tracking

### 4. Data Classification
- Sensitive-data discovery
- Data labeling/tagging
- Automated classification
- Classification-based policy enforcement

### 5. Retention & Lifecycle
- Data retention policies
- Automated archiving
- Secure deletion
- Compliance-driven retention
|
||||
|
||||
## Cloud Myths Context
|
||||
|
||||
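Data governance capabilities are the main evidence against the myth that "you lose control of your data in the cloud":
- Cloud platforms offer fine-grained permission management, so the enterprise decides who can access which data
- Hybrid and multi-cloud options let the enterprise decide where its data is stored
- Audit logs provide a complete access trail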
|
||||
|
||||
## Cloud Provider Tools
|
||||
|
||||
| Provider | Key Tools |
|
||||
|----------|-----------|
|
||||
| **AWS** | IAM, KMS, CloudTrail, Macie, S3 Access Points |
|
||||
| **Azure** | Azure AD, Key Vault, Purview, Defender for Cloud |
|
||||
| **Google Cloud** | IAM, Cloud KMS, Cloud Audit Logs, DLP API |
|
||||
|
||||
## Related Concepts
|
||||
|
||||
- [[cloud-computing]]
- [[cloud-security]]
- [[Data-Sovereignty]]
- [[Compliance]]
- [[Identity-and-Access-Management]]
- [[Cloud-Governance]]
|
||||
|
||||
## Sources
|
||||
|
||||
- [[sources/the-myths-and-misconceptions-about-cloud-computing-linkedin|The Myths and Misconceptions About Cloud Computing (LinkedIn)]]
|
||||
75
wiki/concepts/Deployment-Automation.md
Normal file
@@ -0,0 +1,75 @@
|
||||
---
|
||||
title: "Deployment Automation"
|
||||
tags:
|
||||
- devops
|
||||
- cicd
|
||||
- automation
|
||||
- ai
|
||||
created: 2026-04-25
|
||||
---
|
||||
|
||||
# Deployment Automation
|
||||
|
||||
## Definition
|
||||
|
||||
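Deployment Automation uses automation tooling and AI agents to automate the **entire deployment flow** (build → test → release → rollback). Agentic AI acts as the release manager, carrying out deployment strategies (blue/green, canary) and rollback decisions automatically.

## Agentic AI as Release Manager

```
Traditional Release Manager:
  Manual decision → Release window → Manual monitoring → Manual rollback (possibly hours late)

Agentic AI Release Manager:
  Real-time analysis → Automated decision → Continuous release → Automatic rollback within milliseconds
```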
|
||||
|
||||
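### Core Responsibilities

| Responsibility | Traditional approach | AI Release Manager |
|----------------|----------------------|--------------------|
| Feature flag testing | Manual configuration | AI analyzes metrics automatically and adjusts dynamically |
| Deployment strategy selection | Manual decision | AI chooses based on traffic and error rate |
| Rollback trigger | Human judgment (high latency) | Triggered automatically by AI within milliseconds |
| Post-deployment verification | Manual checks | Continuous AI monitoring plus automated verification |

### Deployment Strategies

- **Blue/Green**: two environments; AI monitors and switches automatically
- **Canary**: gradual traffic shift; AI adjusts the traffic share dynamically
- **Rolling**: rolling update; AI monitors and pauses or rolls back automatically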
|
||||
|
||||
## Relationship to [[CI/CD Pipeline]]

Agentic AI enhances the key stages of the [[CI/CD Pipeline]]:

```python
# Which component drives each pipeline stage once agentic AI is added.
CI_CD_Stages = {
    "Build": "CI Server (Jenkins/GitHub Actions)",  # unchanged
    "Test": "AI: automated defect prediction + smart test selection",
    "Deploy": "AI Release Manager ←",  # ← this page
    "Monitor": "AI: real-time monitoring + automatic rollback",
    "Verify": "AI: automated regression verification",
}
```

## Example
|
||||
|
||||
> An AI agent detects that a new microservice deployment is causing latency issues:
|
||||
> - Error rate spikes from 0.1% to 2.3% within 5 minutes
|
||||
> - AI automatically rolls back the changes
|
||||
> - AI generates fix suggestion: "Connection pool misconfiguration detected"
|
||||
> - AI submits ticket with root cause analysis
|
||||
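The rollback decision in the scenario above can be sketched as a simple threshold check. This is only an illustration under stated assumptions: the function name, the 5x ratio, and the 2% ceiling are hypothetical tuning values, not figures from the source or from any particular tool.

```python
def should_roll_back(
    baseline_error_rate: float,
    current_error_rate: float,
    max_ratio: float = 5.0,          # hypothetical: tolerate up to 5x the baseline
    absolute_ceiling: float = 0.02,  # hypothetical: never tolerate 2% or more
) -> bool:
    """Decide whether a deployment should be rolled back.

    Rolls back when the current error rate reaches the absolute ceiling,
    or when it has grown by more than `max_ratio` relative to the baseline.
    Rates are fractions (0.001 means 0.1%).
    """
    if current_error_rate >= absolute_ceiling:
        return True
    if baseline_error_rate > 0 and current_error_rate / baseline_error_rate > max_ratio:
        return True
    return False


# The spike from 0.1% to 2.3% described above would trigger a rollback.
print(should_roll_back(0.001, 0.023))
```

A production system would evaluate this over a time window and across several metrics (latency, saturation) rather than on a single reading.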
|
||||
## Related Concepts
|
||||
|
||||
- [[CI/CD Pipeline]]: the foundation of Deployment Automation
- [[Self-Healing Systems]]: automatic rollback is one form of self-healing
- [[DORA Metrics]]: Deployment Automation directly improves Deployment Frequency and Change Failure Rate
- [[Blue-Green Deployment]]: one concrete deployment strategy
- [[Canary Deployment]]: one concrete deployment strategy
|
||||
|
||||
## Related Sources
|
||||
|
||||
- [[how-agentic-ai-can-help-for-cloud-devops]]
|
||||
- [[cloud-devop-maturity-guideline]]
|
||||
102
wiki/concepts/Deployment-vs-Release.md
Normal file
@@ -0,0 +1,102 @@
|
||||
---
|
||||
title: "Deployment vs Release (部署与发布分离)"
|
||||
tags: [devops, continuous-delivery, feature-management, release-management]
|
||||
aliases: [Decoupled Deployment and Release]
|
||||
created: 2026-04-25
|
||||
---
|
||||
|
||||
# Deployment vs Release (部署与发布分离)
|
||||
|
||||
**Deployment vs. Release** is the core idea behind [[Feature Flag]]: **deploying** code to production and making a feature **visible** to users (releasing it) are two independent events.
|
||||
|
||||
## Definition
|
||||
|
||||
> "Traditional deployments are all-or-nothing: you push code and everyone gets it immediately. This is why deployments are scary and why teams deploy at 2 AM 'just in case.'"
|
||||
|
||||
> "Deploy whenever you want, release when you're ready."
|
||||
|
||||
## Aliases
|
||||
- Decoupled Deployment and Release
|
||||
- 部署发布分离
|
||||
|
||||
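## Core Comparison

| Dimension | Traditional (deploy = release) | Feature flag (deploy ≠ release) |
|-----------|--------------------------------|---------------------------------|
| Code reaching production | Coincides with user visibility | Arrives early, invisible to users |
| Rollback | Requires a redeploy | Configuration change, within seconds |
| Release timing | Must coincide with deployment | Can be released at any time |
| Team pressure | 2 AM deployments "just in case" | Unhurried daytime releases |
| Experimentation | Low (all or nothing) | High (canary, controlled ramp-up) |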
|
||||
|
||||
## Lifecycle Comparison

### Traditional

```
Develop → Test → Merge → Deploy → Release (all users)
                           ↑
                    deploy = release
```

### With Feature Flags

```
Develop → Test → Merge → Deploy → Release control → Progressive rollout
                           ↑             ↑
                     code reaches   user-visible,
                      production    controlled by the flag
```
|
||||
|
||||
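## Key Value

### 1. Lower Deployment Risk
> "Feature flags change this. You can deploy code to production without releasing it to users."

Code can sit ready in production while the feature stays hidden from users until the team has confirmed its quality.

### 2. Faster Delivery
Teams no longer wait for the "perfect moment" to ship a feature. Code is deployed as soon as it is ready, and the business decides when the feature is released.

### 3. Empowered Teams
- Product: sets the release cadence at any time
- Engineering: deploys at any time, with no release window to wait for
- Operations: ramps up gradually and decides based on data

### 4. Redefining RTO
When deployment ≠ release, recovery (rollback) no longer means rolling back a deployment; it means turning off a feature flag.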
|
||||
|
||||
## Relationship to [[CI-CD-Pipeline]]

Separating deployment from release reshapes the CI/CD flow:

| Stage | Traditional CI/CD | Decoupled CI/CD |
|-------|-------------------|-----------------|
| Build | Build artifacts | Build artifacts |
| Test | Unit/integration tests | Unit/integration tests |
| Deploy | Deploy to prod | Deploy to prod (invisible to users) |
| Release | (none) | Controlled by feature flag |
| Monitor | Post-deployment monitoring | Monitoring during progressive rollout |
| Rollback | Redeploy the previous version | Turn off the feature flag |

## Risk Model Shift

| Dimension | Traditional model | Decoupled model |
|-----------|-------------------|-----------------|
| Where risk concentrates | Moment of deployment | Moment of feature release |
| Exposure | All users | Current rollout percentage |
| Emergency response | Rollback in hours | Switch in seconds |
| Team mindset | Defensive (afraid to deploy) | Offensive (willing to experiment) |
|
||||
|
||||
## Related Concepts
|
||||
|
||||
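- [[Feature Flag]]: the core mechanism for separating deployment from release
- [[CI-CD-Pipeline]]: decoupled deployment is a key idea in modern CI/CD
- [[Progressive Rollout]]: made possible by separating deployment from release
- [[Kill Switch]]: release control at its most extreme (emergency shut-off)
- [[RTO]]: separating deployment from release shifts RTO from deployment rollback to a configuration change
- [[Micro-Recovery]]: separating deployment from release makes feature-level recovery possible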
|
||||
|
||||
## Sources
|
||||
|
||||
- [[sources/rto-vs-rpo-key-differences-for-modern-disaster-recovery.md]]
|
||||
@@ -1,49 +1,61 @@
|
||||
# DevSecOps
|
||||
|
||||
## Definition
|
||||
DevSecOps integrates security practices into the DevOps process, embedding security throughout the entire software development lifecycle rather than treating it as a separate phase.
|
||||
DevSecOps(Development-Security-Operations)是将安全实践深度集成到软件开发全生命周期的方法论,使安全成为开发、运维、安全团队的共同责任,而非独立环节。
|
||||
|
||||
## Key Principles
|
||||
- **Shift Left**: Integrate security early in the development process
|
||||
- **Automation**: Security checks automated in CI/CD pipelines
|
||||
- **Continuous Compliance**: Ongoing security validation and compliance monitoring
|
||||
- **Proactive Vulnerability Management**: Early detection and remediation of security issues
|
||||
## Core Principles
|
||||
- **安全即代码**:安全策略、测试和合规检查均以代码形式实现
|
||||
- **共享责任**:安全是每个人的责任,而非仅安全团队的工作
|
||||
- **自动化优先**:通过自动化减少人为错误,提高安全检查效率
|
||||
- **持续安全**:安全贯穿开发、测试、部署、运营全阶段
|
||||
|
||||
## Core Practices
|
||||
- Static Application Security Testing (SAST)
|
||||
- Dynamic Application Security Testing (DAST)
|
||||
- Software Composition Analysis (SCA)
|
||||
- Container security scanning
|
||||
- Infrastructure as Code security validation
|
||||
- Secret management and rotation
|
||||
## Key Components
|
||||
|
||||
## Tools
|
||||
- SAST: SonarQube, Checkmarx, Semgrep
|
||||
- Container scanning: Trivy, Clair, Snyk
|
||||
- Secret management: HashiCorp Vault, AWS Secrets Manager
|
||||
### 1. Collaboration(协作)
|
||||
安全任务在开发和运维团队间共享,安全团队确保安全标准嵌入整个开发流程。
|
||||
|
||||
## Security Progression Across DevOps Maturity Levels
|
||||
### 2. Communication(沟通)
|
||||
安全专业人员需要用开发者理解的简单语言解释安全控制,建立共同的安全认知。
|
||||
|
||||
| Maturity | Security Integration Level |
|
||||
|----------|--------------------------|
|
||||
| Phase 1 | Security involvement only weeks before release, minimal compliance scans |
|
||||
| Phase 2 | Security operates separately from the rest of the team |
|
||||
| Phase 3 | Security involved in design, architecture, and operations discussions; scans integrated throughout development |
|
||||
| Phase 4 | Dependency vulnerability management; continuous security monitoring across the team |
|
||||
| Phase 5 | Prevent insecure/non-compliant code from reaching production; high-level security integration |
|
||||
### 3. Automation(自动化)
|
||||
- 将自动化安全测试添加到 CI/CD 管道
|
||||
- "Break the Build" 机制在安全风险过高时停止构建
|
||||
- 确保软件依赖保持最新
|
||||
|
||||
## Sources
|
||||
- [[sources/cloud-devop-maturity-guideline.md]]
|
||||
- [[sources/what-is-devsecops-best-practices-benefits-and-tools.md]]
|
||||
- [[sources/devops-maturity-model-from-traditional-it-to-advanced-devops.md]]
|
||||
### 4. Tool & Architecture Security(工具与架构安全)
|
||||
- 选择和审查安全工具
|
||||
- 谨慎管理用户访问(多因素认证、最小权限)
|
||||
- 定期监控漏洞和打补丁
|
||||
- 扫描代码中的敏感数据
|
||||
|
||||
### 5. Testing(测试)
|
||||
在每个开发阶段集成安全测试,使用 SAST/DAST/IAST/SCA 等工具。
|
||||
|
||||
## DevSecOps vs DevOps
|
||||
|
||||
| 维度 | DevOps | DevSecOps |
|
||||
|------|--------|-----------|
|
||||
| **定义** | 强调开发与运维协作加速交付 | 将安全实践集成到开发过程 |
|
||||
| **安全角色** | 安全单独处理或最后处理 | 从一开始就将安全嵌入每个步骤 |
|
||||
| **团队参与** | 开发与运维协作 | 开发、运维、安全三方协作 |
|
||||
| **合规方式** | 开发后进行合规检查 | 开发部署全程确保合规 |
|
||||
|
||||
## Benefits
|
||||
- 早期发现漏洞,修复成本降低可达 100 倍
|
||||
- 70% 的上线后发现的安全漏洞可在开发阶段预防
|
||||
- 安全与开发速度实现双赢
|
||||
- 持续合规,减少审计压力
|
||||
|
||||
## Related Concepts
|
||||
- [[concepts/DevOps-Maturity]]
|
||||
- [[concepts/CI-CD-Pipeline]]
|
||||
- [[concepts/Infrastructure-as-Code]]
|
||||
- [[concepts/DORA-Metrics]]
|
||||
- [[concepts/Change-Failure-Rate]]
|
||||
- [[Shift-Left-Security]] — 安全测试左移到开发早期
|
||||
- [[Shift-Right-Security]] — 生产环境持续安全监控
|
||||
- [[SAST]] — 静态应用安全测试
|
||||
- [[DAST]] — 动态应用安全测试
|
||||
- [[IAST]] — 交互式应用安全测试
|
||||
- [[SCA]] — 软件组成分析
|
||||
- [[CI/CD Pipeline]] — DevSecOps 的载体
|
||||
- [[Policy-as-Code]] — 以代码管理安全策略
|
||||
- [[Break-the-Build]] — 安全失败时停止构建
|
||||
|
||||
## Ingested
|
||||
- Date: 2026-04-21
|
||||
- Date: 2026-04-24 (updated with maturity level progression)
|
||||
## Sources
|
||||
- [[what-is-devsecops-best-practices-benefits-and-tools]]
|
||||
|
||||
56
wiki/concepts/Docker-Compose.md
Normal file
@@ -0,0 +1,56 @@
|
||||
# Docker Compose
|
||||
|
||||
## Description
|
||||
Docker Compose is Docker's official tool for defining and running multi-container applications. A `docker-compose.yml` (or `compose.yaml`) file declares each service's networks, volumes, port mappings, environment variables, and so on in YAML, so a complex application can be brought up with a single command.

## Version
- **V1 (standalone package)**: the `docker-compose` command (deprecated)
- **V2 (plugin)**: the `docker compose` command (current), installed through the `docker-compose-plugin` package and integrated into the Docker CLI

## V1 vs V2 Command Reference
| V1 (standalone) | V2 (plugin) |
|-----------------|-------------|
| `docker-compose up -d` | `docker compose up -d` |
| `docker-compose ps` | `docker compose ps` |
| `docker-compose down` | `docker compose down` |
| `docker-compose -f xxx.yml config` | `docker compose -f xxx.yml config` |

## Core Concepts
- **Services**: define each container service (image, build, ports, volumes, environment variables)
- **Volumes**: named volumes that persist container data
- **Networks**: container networking (bridge, host, overlay)
- **Version**: a top-level `version` key such as `version: '3.8'` is common in existing files, but the Compose Specification used by V2 treats it as obsolete and ignores it
|
||||
|
||||
## Example
|
||||
```yaml
|
||||
version: '3.8'
|
||||
services:
|
||||
it-tools:
|
||||
image: corentinth/it-tools:latest
|
||||
container_name: it-tools
|
||||
restart: unless-stopped
|
||||
stdin_open: true
|
||||
tty: true
|
||||
ports:
|
||||
- "8999:80"
|
||||
deploy:
|
||||
resources:
|
||||
limits:
|
||||
memory: 128M
|
||||
```
|
||||
|
||||
## Used By
|
||||
- [[用docker安装it-tools]]
|
||||
- [[用docker安装transmission]]
|
||||
- [[Navidrome]]
|
||||
- [[Jellyfin]]
|
||||
- [[RSSHub]]
|
||||
|
||||
## Related Concepts
|
||||
- [[Docker-Image]]
|
||||
- [[Docker-Save]]
|
||||
- [[Docker-Load]]
|
||||
- [[容器资源限制]]
|
||||
- [[容器重启策略]]
|
||||
- [[端口映射]]
|
||||
- [[桥接网络]]
|
||||
38
wiki/concepts/Docker-用户组.md
Normal file
38
wiki/concepts/Docker-用户组.md
Normal file
@@ -0,0 +1,38 @@
|
||||
---
|
||||
title: "Docker 用户组"
|
||||
tags: [docker, security, permissions]
|
||||
date: 2026-04-22
|
||||
---
|
||||
|
||||
# Docker 用户组
|
||||
|
||||
## Definition
|
||||
The docker group is an ordinary Linux user group whose members can run Docker commands directly, without the `sudo` prefix.

## Configuration
```bash
# Add the current user to the docker group
sudo usermod -aG docker $USER

# Apply the new membership in the current terminal (starts a new shell)
newgrp docker
# Alternatively, log out and back in so every session picks it up
```
|
||||
|
||||
## Security Implications
|
||||
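⚠️ **Security warning**: Members of the docker group effectively have root-equivalent privileges, because the Docker daemon runs as root and anyone who can talk to it can, for example, start a container that mounts the host filesystem. An attacker who gains access to an account in the docker group may therefore be able to obtain root on the host.

## Best Practices
1. Add only trusted users to the docker group
2. Prefer running containers as a non-root user (for example via the `PUID`/`PGID` environment variables on images that support them)
3. Review docker group membership regularly
4. Consider Podman as an alternative (it supports rootless mode)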
|
||||
|
||||
## Related Sources
|
||||
- [[如何在ubuntu-server安装-docker-docker-compose]]: steps for configuring the docker group

## Related Concepts
- [[Docker Engine]]: the component accessed without sudo
- [[用户权限]]: the Linux user-group mechanism
- [[容器资源限制]]: a security practice that complements non-root users
- [[PUID/PGID]]: non-root user mapping for Docker containers
|
||||
85
wiki/concepts/Event-Correlation.md
Normal file
@@ -0,0 +1,85 @@
|
||||
---
|
||||
title: "Event Correlation"
|
||||
type: concept
|
||||
tags: [aiops, monitoring, incident-management, operations]
|
||||
date: 2025-03-01
|
||||
---
|
||||
|
||||
## Definition
|
||||
|
||||
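Event correlation is one of the core techniques of [[AIOps]]. It uses algorithms to group large numbers of scattered monitoring alerts and system events into a small number of meaningful incidents, which reduces alert noise and speeds up [[Incident-Management]] and [[Root-Cause-Analysis]].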
|
||||
|
||||
## The Problem
|
||||
|
||||
```
|
||||
Without Event Correlation:
|
||||
─────────────────────────────
|
||||
Alert #1: CPU High on Server A
|
||||
Alert #2: Memory High on Server A
|
||||
Alert #3: Disk I/O High on Server A
|
||||
Alert #4: Network Latency on Server A
|
||||
Alert #5: App Response Slow
|
||||
Alert #6: Database Connection Pool Full
|
||||
Alert #7: API Timeout
|
||||
... (100+ alerts for ONE root cause)
|
||||
```
|
||||
|
||||
## Event Correlation Techniques
|
||||
|
||||
### 1. Rule-Based Correlation
|
||||
```
|
||||
IF alerts occur within time window T
|
||||
AND involve same source/host/service
|
||||
THEN group as single incident
|
||||
```
|
||||
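The rule above can be sketched in a few lines. This is a minimal illustration, assuming alerts are keyed by host only and that the time window is measured from the first alert of each incident; the `Alert` class, the `correlate` function, and the 300-second default are hypothetical choices, not part of any AIOps product.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Alert:
    host: str
    timestamp: float  # seconds since some reference point
    message: str


def correlate(alerts: List[Alert], window_seconds: float = 300.0) -> List[List[Alert]]:
    """Group alerts from the same host into incidents.

    An alert joins the host's open incident if it arrives within
    `window_seconds` of that incident's first alert; otherwise it
    opens a new incident.
    """
    incidents: List[List[Alert]] = []
    open_incident: Dict[str, int] = {}  # host -> index of its latest incident
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        idx = open_incident.get(alert.host)
        if idx is not None and alert.timestamp - incidents[idx][0].timestamp <= window_seconds:
            incidents[idx].append(alert)
        else:
            open_incident[alert.host] = len(incidents)
            incidents.append([alert])
    return incidents
```

Grouping by host alone is the simplest possible rule; real systems usually also correlate across services using topology or dependency data.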
|
||||
### 2. Statistical Correlation
|
||||
- Time series analysis
|
||||
- Pattern matching
|
||||
- Anomaly detection
|
||||
|
||||
### 3. AI/ML Correlation
|
||||
- Root cause inference
|
||||
- Causal graph models
|
||||
- Predictive correlation
|
||||
|
||||
## Benefits
|
||||
|
||||
| Benefit | Description |
|---------|-------------|
| Alert noise reduction | Cuts noise by 90%+ |
| Faster RCA | Locates the root cause quickly |
| Lower MTTR | Less time spent on manual analysis |
| SLA assurance | Faster response |

## In ITSM Context

In the [[Incident-Management]] practice of [[ITSM 2.0]], event correlation is a key capability:

```
Incident Management 2.0
├── Event Correlation (ML-enhanced)
│   ├── Alert deduplication
│   ├── Root cause inference
│   └── Correlation reasoning
├── AIOps-powered Analysis
│   ├── Anomaly detection
│   ├── Pattern recognition
│   └── Predictive analytics
└── Self-Healing Automation
    ├── Automated diagnosis
    └── Automated remediation
```
|
||||
|
||||
## Related Concepts
|
||||
|
||||
- [[AIOps]]: the AI engine behind event correlation
- [[Incident-Management]]: where event correlation is applied
- [[Root-Cause-Analysis]]: root cause analysis
- [[MTTR]]: mean time to recovery
- [[Self-Healing-Systems]]: self-healing systems

## Sources

- [[understanding-complete-itsm]]: ML-enhanced Event Correlation
- [[what-i-know-about-cloud-service-delivery-1]]: event correlation in AIOps
|
||||
64
wiki/concepts/Exporter.md
Normal file
@@ -0,0 +1,64 @@
|
||||
---
|
||||
title: "Exporter"
|
||||
type: concept
|
||||
aliases: [Prometheus Exporter, 指标导出器, Prometheus指标导出器]
|
||||
tags: [prometheus, monitoring, metrics, infrastructure]
|
||||
date: 2025-11-11
|
||||
---
|
||||
|
||||
# Exporter
|
||||
|
||||
## Overview
|
||||
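An exporter is a core component of the Prometheus ecosystem. It collects metrics from a target system and exposes them in the Prometheus format on an HTTP `/metrics` endpoint, which the Prometheus server scrapes periodically. An exporter does not handle storage or alerting; it only collects and exposes.

## Design Philosophy
- **Standalone process**: most exporters run as separate processes next to the system they monitor and need no changes to that system itself (host-level exporters such as node_exporter still have to run on the monitored host)
- **HTTP exposition**: metrics are served in a text format on the standard `/metrics` endpoint
- **Loose coupling**: exporters and Prometheus are decoupled by HTTP and can be deployed and upgraded independently
- **Pull model**: Prometheus scrapes the exporter; the exporter does not push

## Common Exporters
| Exporter | Metric source | Default port |
|----------|---------------|--------------|
| [[node_exporter]] | Linux/Unix hosts (CPU, memory, disk, network) | 9100 |
| [[cAdvisor]] | Docker containers | 8080 |
| [[blackbox_exporter]] | HTTP/TCP/DNS/TLS endpoints | 9115 |
| alertmanager | Alertmanager instance (serves its own `/metrics`; not an exporter as such) | 9093 |
| postgres_exporter | PostgreSQL databases | 9187 |
| mysqld_exporter | MySQL/MariaDB databases | 9104 |
| redis_exporter | Redis cache | 9121 |
| nginx-exporter | Nginx status page | 9113 |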
|
||||
|
||||
## /metrics Format
|
||||
```text
|
||||
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
|
||||
# TYPE node_cpu_seconds_total counter
|
||||
node_cpu_seconds_total{cpu="0",mode="idle"} 123456.789
|
||||
node_cpu_seconds_total{cpu="0",mode="user"} 98765.432
|
||||
```
|
||||
|
||||
## Key Fields
|
||||
- `# HELP`: a human-readable description of the metric
- `# TYPE`: the metric type (gauge / counter / histogram / summary)
- Sample line: `<metric_name>{<labels>} <value>`
|
||||
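To show how the three kinds of line fit together, here is a small parser sketch for the text format. It is deliberately simplified: it skips `# HELP`/`# TYPE` comments, ignores optional timestamps, and does not unescape label values. `parse_sample` is a hypothetical helper, not part of any Prometheus client library.

```python
import re

# <metric_name>, optional {<labels>}, then <value>
_SAMPLE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
# key="value" pairs inside the braces (escaped characters are kept as-is)
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Parse one exposition-format line into (name, labels, value).

    Returns None for blank lines and for comment lines such as
    `# HELP` and `# TYPE`.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = _SAMPLE.match(line)
    if match is None:
        raise ValueError(f"not a valid sample line: {line!r}")
    labels = dict(_LABEL.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


print(parse_sample('node_cpu_seconds_total{cpu="0",mode="idle"} 123456.789'))
```

For real workloads an existing client library's parser is the safer choice, since the full format has more cases than this sketch handles.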
|
||||
## Prometheus scrape_config
|
||||
```yaml
|
||||
scrape_configs:
|
||||
- job_name: 'node_exporter'
|
||||
static_configs:
|
||||
- targets: ['192.168.1.10:9100']
|
||||
```
|
||||
|
||||
## Related Entities
|
||||
- [[Prometheus]]: the consumer of the data
- [[node_exporter]]: host metrics exporter
- [[cAdvisor]]: container metrics exporter
- [[blackbox_exporter]]: network probing exporter
- [[Alertmanager]]: alert routing and notification (it exposes its own metrics but is not an exporter)

## Related Concepts
- [[PromQL]]: the metric query language
- [[Prometheus告警规则]]: alert conditions built on exporter metrics
- [[时序数据库]]: the storage model for the data exporters produce
- [[System Monitoring]]: the application domain
|
||||
57
wiki/concepts/Failover.md
Normal file
@@ -0,0 +1,57 @@
|
||||
---
|
||||
title: "Failover"
|
||||
type: concept
|
||||
tags: [cloud-computing, reliability, high-availability]
|
||||
date: 2025-03-02
|
||||
---
|
||||
|
||||
# Failover
|
||||
|
||||
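**Failover** is a core mechanism of high-availability systems: when the primary system fails, traffic switches automatically to a standby system so that the service stays available.

## Definition

Failover is an automated redundancy mechanism. Once monitoring detects that the primary node has failed, traffic or workloads are moved to a standby node automatically, usually without users noticing.

## Key Characteristics

- **Automated**: detection and switchover need no human intervention
- **Fast recovery**: switchover time can drop from minutes to seconds
- **Transparent switchover**: users notice nothing, or only a brief interruption
- **Health checks**: the primary node's health is monitored continuously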
|
||||
|
||||
## Failover Patterns in Cloud
|
||||
|
||||
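| Pattern | Description |
|---------|-------------|
| **Active-Passive** | The primary handles traffic while a standby waits; traffic switches on failure |
| **Active-Active** | Several nodes handle traffic at once; a failed node is removed automatically |
| **Geo-Failover** | Failover across geographic regions |
| **Multi-Region** | Deployed in several regions; one region failing does not affect the others |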
|
||||
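The Active-Passive row can be sketched as a priority-ordered selection driven by a health check. The `pick_active` function and the node names are hypothetical, and the health check is passed in as a plain callable so the sketch stays independent of any load balancer or DNS API.

```python
from typing import Callable, Optional, Sequence


def pick_active(nodes: Sequence[str], is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first healthy node in priority order, or None if all are down.

    `nodes` lists the primary first, followed by standbys.
    """
    for node in nodes:
        if is_healthy(node):
            return node
    return None


# With the primary down, traffic moves to the standby.
health = {"primary": False, "standby": True}
print(pick_active(["primary", "standby"], lambda n: health[n]))
```

Real failover usually requires several consecutive failed checks before switching, to avoid flapping on a single missed probe.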
|
||||
## Cloud Myths Context
|
||||
|
||||
Failover is a key mechanism for rebutting the myth that "the cloud is unreliable":
- Cloud providers achieve cross-region failover through globally distributed architectures
- Automated failover underpins availability SLAs of up to 99.99%
- Traditional on-premises deployments struggle to reach the same level of failover capability

## Implementation Components

- **Load Balancer**: health checks plus traffic distribution
- **Health Checks**: periodic checks of service availability
- **DNS Failover**: DNS-level switchover with Route 53 / Cloud DNS
- **Database Replication**: synchronous or asynchronous replication at the database level
- **Auto Scaling Groups**: automatic replacement at the instance level

## Related Concepts

- [[High-Availability]]
- [[cloud-computing]]
- [[Scalability]]
- [[Disaster-Recovery]]
- [[cloud-migration]]
|
||||
|
||||
## Sources
|
||||
|
||||
- [[sources/the-myths-and-misconceptions-about-cloud-computing-linkedin|The Myths and Misconceptions About Cloud Computing (LinkedIn)]]
|
||||
123
wiki/concepts/Feature-Flag.md
Normal file
@@ -0,0 +1,123 @@
|
||||
---
|
||||
title: "Feature Flag (功能开关)"
|
||||
tags: [devops, continuous-delivery, deployment, risk-mitigation, feature-management]
|
||||
aliases: [Feature Flagging, Feature Toggle, Feature Switch]
|
||||
created: 2026-04-25
|
||||
---
|
||||
|
||||
# Feature Flag (功能开关)
|
||||
|
||||
A **feature flag** is a technique for decoupling code deployment from feature release. By embedding conditional checks (switches) in the code, a team can control a feature's visibility and behavior without redeploying.

## Aliases
- Feature Flagging
- Feature Toggle
- Feature Switch
- Kill Switch (a special use in emergencies)
|
||||
|
||||
## Core Mechanism
|
||||
|
||||
```javascript
|
||||
if (featureFlag.enabled('new-checkout-flow')) {
|
||||
return newCheckoutProcess();
|
||||
} else {
|
||||
return oldCheckoutProcess();
|
||||
}
|
||||
```
|
||||
|
||||
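**Key insight**: deploying code ≠ releasing a feature. Code can go to production at any time, while the release itself is controlled by the flag.

## Key Capabilities

### 1. Decoupled Deployment & Release

| Traditional | With feature flags |
|-------------|--------------------|
| Deploy = release | Deploy ≠ release |
| 2 AM deployments "just in case" | Deploy any time, release when ready |
| All or nothing | Canary release, progressive rollout |

### 2. Kill Switch

In an emergency, a faulty feature can be turned off without redeploying:

- Payment gateway misbehaving? Switch to a backup provider (seconds)
- Search results broken? Fall back to the old algorithm (seconds)
- AI model hallucinating? Switch back to the previous version (seconds)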
|
||||
|
||||
> "Instead of debugging under pressure while users suffer, you flip a switch and fix the problem properly later."
|
||||
|
||||
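### 3. Progressive Rollout

Release the feature to users in stages to limit the blast radius:

```
1% of users   → monitor error rate and performance metrics
5% of users   → monitor conversion rate and user feedback
25% of users  → check load on downstream systems
100% of users → full release complete
```

If something goes wrong at the 5% stage, RTO is measured in seconds (just turn the flag off) rather than hours (an emergency deployment rollback).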
|
||||
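One common way to implement staged percentages like these is deterministic hash bucketing, sketched below. This is an assumption about how a rollout could be built, not a description of LaunchDarkly or any other platform; `in_rollout`, the flag key, and the user IDs are hypothetical.

```python
import hashlib


def in_rollout(flag_key: str, user_id: str, percentage: float) -> bool:
    """Return True if this user falls inside the rollout percentage (0-100).

    The user is hashed into one of 10,000 stable buckets, so the same
    user always gets the same answer, and raising the percentage only
    ever adds users (nobody who had the feature loses it).
    """
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10_000
    return bucket < percentage * 100


# Setting the percentage to 0 acts as a kill switch; 100 is a full release.
print(in_rollout("new-checkout-flow", "user-42", 0))
print(in_rollout("new-checkout-flow", "user-42", 100))
```

Including the flag key in the hash keeps bucket assignments independent across flags, so the same users are not always the first to receive every new feature.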
|
||||
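### 4. Micro-Recovery (Feature-Level Recovery)

Instead of treating the whole application as one system, recognize that different features carry different risks and business impact:

| Feature | RTO target | RPO target |
|---------|------------|------------|
| Core payment processing | Seconds | Zero loss |
| New recommendation engine | 5 minutes | 15 minutes |
| Beta dashboard feature | 30 minutes | 1 hour |

### 5. Targeted Rollback

If a problem with a feature only affects mobile users in Europe, the feature can be disabled for that segment alone while everyone else is unaffected.

## Feature Flag vs. Traditional Disaster Recovery

| Dimension | Traditional DR | Feature flag |
|-----------|----------------|--------------|
| Target failure type | Hardware failure | Code change |
| RTO | Hours (restore from backup) | Seconds (configuration change) |
| RPO | Depends on backup frequency | Near zero (the data layer is untouched) |
| Failure frequency | Low (a few times a year) | High (possibly every week) |
| Cost | High (redundant infrastructure) | Low (software tooling) |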
|
||||
|
||||
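## Business Case Data

| Company | Before | After |
|---------|--------|-------|
| HP | Rollback time: hours | Minutes |
| Christian Dior | Rollback time: 15 minutes | Instant switch |
| LaunchDarkly customers | n/a | 86% recover within a day |
| LaunchDarkly customers | n/a | 42% recover within hours (or even minutes) |

**Cost benefit**: 59% of LaunchDarkly customers report operations costs reduced by 11%-50%, and 8% report reductions of more than 50%.

## Core Value

> "Deploy with confidence, recover instantly, and focus on building features instead of fixing outages."

Feature flags shift incident response from **reactive firefighting** to **proactive prevention**:

- **Prevention over recovery**: catching a problem at 1% of users is worth more than damage control after a full release
- **Less anxiety**: deployment is no longer frightening because rollback is always available
- **Faster iteration**: teams are willing to experiment quickly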
|
||||
|
||||
## Related Concepts
|
||||
|
||||
- [[Kill Switch]]: how feature flags are used in emergencies
- [[Progressive Rollout]]: the staged rollout strategy that feature flags enable
- [[Micro-Recovery]]: fine-grained, feature-level recovery built on feature flags
- [[Deployment-vs-Release]]: the decoupling of deployment and release that feature flags enable
- [[RTO]]: feature flags cut RTO from hours to seconds
- [[RPO]]: feature flags protect RPO (no data is lost)
- [[LaunchDarkly]]: a feature flag management platform
- [[CI-CD-Pipeline]]: feature flags are core infrastructure for modern CI/CD
|
||||
|
||||
## Sources
|
||||
|
||||
- [[sources/rto-vs-rpo-key-differences-for-modern-disaster-recovery.md]]
|
||||
- [[sources/devops-culture-and-transformation-fostering-collaboration-agile-practices-and-innovation-linkedin.md]]
|
||||
- [[sources/Deployment-Automation.md]]
|
||||
105
wiki/concepts/FinOps.md
Normal file
@@ -0,0 +1,105 @@
|
||||
# FinOps
|
||||
|
||||
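> **FinOps** (Cloud Financial Management) is a cloud cost management practice that combines financial accountability, cross-team collaboration, and maximizing business value.

## Definition

FinOps (Financial Operations) is an operating framework for the cloud era:

- **Visibility**: understand where cloud spend goes
- **Optimization**: reduce waste continuously
- **Business value**: maximize the return on cloud investment

## FinOps Core Principles

| Principle | Description |
|-----------|-------------|
| **The cloud is a marketplace** | Real-time pricing, driven by competition |
| **Financial accountability is everyone's job** | A central team cannot optimize alone |
| **On-demand vs. commitment** | Mix the two to balance flexibility and cost |
| **Continuous optimization** | Assess and adjust regularly |
| **Business-value driven** | Cost transparency supports business decisions |

## FinOps Maturity Model

| Level | Focus | Characteristics |
|-------|-------|-----------------|
| **Crawl** | Visibility | Establish tagging, cost allocation, basic monitoring |
| **Walk** | Optimization | Right-sizing, reserved purchases, automation |
| **Run** | Automation | Real-time optimization, automated policy enforcement |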
|
||||
|
||||
## Key Practices
|
||||
|
||||
### 1. Chargeback / Showback
|
||||
|
||||
| Model | Description | When to use |
|-------|-------------|-------------|
| **Showback** | Show departments their costs | Building cost awareness |
| **Chargeback** | Departments bear their actual costs | Reinforcing accountability |

### 2. Resource Tagging

**Required tags**

| Tag | Example | Purpose |
|-----|---------|---------|
| `Environment` | prod, staging | Environment isolation |
| `Owner` | alice@example.com | Ownership tracking |
| `CostCenter` | CC-12345 | Financial attribution |
| `Application` | billing-api | Application mapping |

### 3. Cost Optimization

**Techniques**
- Right-sizing (matching resources to actual need)
- Reserved Instances / Savings Plans
- Spot/Preemptible instances
- Lifecycle policies (storage)
- Cleaning up idle resources

**Process**
- Regular cost reviews (weekly/monthly)
- Budget alerts
- Cost anomaly detection
- Reviewing optimization recommendations
|
||||
|
||||
### 4. Unit Economics
|
||||
|
||||
| Metric | Formula | Target |
|--------|---------|--------|
| **Cost per Transaction** | Total cost / number of transactions | Keep decreasing |
| **Cost per User** | Total cost / number of users | Keep decreasing |
| **Cost per Revenue** | Total cost / revenue | Stable or decreasing |
|
||||
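The formulas in the table all share one shape: total cost divided by a business quantity. A minimal sketch follows; the `unit_cost` helper and the monthly figures are made up for illustration.

```python
def unit_cost(total_cost: float, units: float) -> float:
    """Return cost per unit (per transaction, per user, or per unit of revenue)."""
    if units <= 0:
        raise ValueError("units must be positive")
    return total_cost / units


# Hypothetical month: 12,000 in cloud spend, 400,000 transactions, 8,000 users.
monthly_cost = 12_000.0
print(unit_cost(monthly_cost, 400_000))  # cost per transaction
print(unit_cost(monthly_cost, 8_000))    # cost per user
```

Tracking these ratios over time, rather than the raw bill, shows whether spend is growing faster or slower than the business it supports.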
|
||||
## Tools
|
||||
|
||||
| Category | Tools |
|----------|-------|
| **Native** | AWS Cost Explorer, Azure Cost Management, GCP Billing |
| **Third-party** | Spot.io, CloudHealth, Densify, Kubecost |
| **BI/visualization** | Tableau, Looker, Power BI |

## FinOps Team Roles

| Role | Responsibilities |
|------|------------------|
| **FinOps Practitioner** | Day-to-day cost monitoring and analysis |
| **FinOps Lead** | Strategy and cross-team coordination |
| **Cloud Economist** | Cloud economics and cost modeling |
| **Business Partner** | Liaison with business units |

## Integration with Other Practices

| Practice | Relationship |
|----------|--------------|
| **DevOps** | FinOps-aware development |
| **SRE** | Balancing reliability and cost (SLO vs. cost) |
| **Cloud Governance** | Cost policy is part of governance |
| **Cloud Security** | Quantifying the cost of security |

## See Also

- [[Cloud Cost Optimization]]
- [[Cloud Governance]]
- [[Cloud Adoption Strategy]]
- [[Multi-Cloud Strategy]]
- [[DORA Metrics]]
|
||||
42
wiki/concepts/GPG-密钥验证.md
Normal file
@@ -0,0 +1,42 @@
|
||||
---
|
||||
title: "GPG 密钥验证"
|
||||
tags: [gpg, apt, security]
|
||||
date: 2026-04-22
|
||||
---
|
||||
|
||||
# GPG 密钥验证
|
||||
|
||||
## Definition
|
||||
GPG (GNU Privacy Guard) key verification is the APT package manager's security mechanism: GPG signatures ensure that packages downloaded from a repository come from a trusted source and have not been tampered with.

## Docker GPG Key Setup
```bash
# Create the keyring directory
sudo install -m 0755 -d /etc/apt/keyrings

# Download Docker's official GPG key
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc

# Make the key readable by everyone
sudo chmod a+r /etc/apt/keyrings/docker.asc
```
|
||||
|
||||
## Verification Mechanism
|
||||
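1. When it updates a repository, apt uses the GPG key to verify the signature on the repository's signed metadata (the `Release`/`InRelease` file); downloaded packages are then checked against the hashes listed in that metadata
2. If the signature does not match or the key is missing, apt refuses to trust the repository and reports a GPG error
3. The `signed-by` option in a sources.list entry names the key file used for verification

## Common Issues
| Problem | Cause | Fix |
|---------|-------|-----|
| `NO_PUBKEY` | The GPG key has not been imported | Run the import command |
| `GPG error` | Incorrect key file permissions | `chmod a+r` |
| `The following signatures couldn't be verified` | The key is expired or corrupted | Download the key again |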
|
||||
|
||||
## Related Sources
|
||||
- [[如何在ubuntu-server安装-docker-docker-compose]]: steps for configuring Docker's GPG key

## Related Concepts
- [[APT 仓库配置]]: how keys relate to repository configuration
- [[Docker Engine]]: the package being verified with GPG
- [[Ubuntu Server]]: the host system where the GPG keys are managed
|
||||
69
wiki/concepts/Gatekeeper.md
Normal file
@@ -0,0 +1,69 @@
|
||||
# Gatekeeper
|
||||
|
||||
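> A macOS security mechanism that verifies whether an application comes from an identified developer or another trusted source.

## Overview
Gatekeeper is macOS's application verification system, designed to protect users from malware. It checks where an application came from and whether it is signed, and blocks software that fails those checks.

## How It Works
Gatekeeper runs its checks when the user tries to open an application downloaded from the internet:
1. Is the app from the App Store?
2. Does it carry a valid Developer ID signature?
3. Is it marked as quarantined?

## Quarantine Attribute
macOS uses extended attributes to mark files downloaded from the internet:
- `com.apple.quarantine`: marks the file as downloaded from the internet
- `com.apple.metadata:*` attributes (for example `kMDItemWhereFroms`): hold metadata such as the download source URL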
|
||||
|
||||
## Removing Quarantine
|
||||
```bash
# Recursively remove the quarantine attribute (for directories)
xattr -rd com.apple.quarantine /path/to/application/

# Verify (com.apple.quarantine should no longer be listed)
xattr /path/to/application/binary

# Show attributes with their values, including quarantine status
xattr -l /path/to/application/binary
```

## Gatekeeper Modes
```bash
# Show the current Gatekeeper status
spctl --status

# Allow apps from any source (not recommended; a security risk)
sudo spctl --master-disable

# Assess an application
spctl --assess --verbose /path/to/application
```
|
||||
|
||||
## Use Cases
|
||||
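- **Homebrew**: may need the quarantine attribute removed after installation before it runs
- **FRP**: binaries downloaded from GitHub need the restriction lifted
- **Third-party tools**: any unsigned executable

## Security Considerations
| Method | Security | When it applies |
|--------|----------|-----------------|
| Developer ID signature | Highest | Officially released software |
| App Store | High | App Store apps only |
| Removing quarantine | Low | Self-hosted tools / development environments |

## Best Practices
1. **Lift the restriction only for trusted sources**, such as official binaries from a GitHub Release
2. **Use the recursive `-r` flag** so that every file in the directory is cleared
3. **Verify file integrity**: check the SHA256 checksum after downloading
4. **Keep Gatekeeper enabled**: do not disable it unless you fully understand the risk

## Related Concepts
- [[launchd]]: the macOS service manager
- [[frp]]: needs Gatekeeper's restriction lifted before it will run
- [[Mac Mini M4]]: a host where Gatekeeper issues have to be handled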
|
||||
|
||||
## References
|
||||
- Apple Support: Safely open apps on your Mac
|
||||
- `man xattr`
|
||||
- `man spctl`
|
||||
74
wiki/concepts/Green-Computing.md
Normal file
@@ -0,0 +1,74 @@
|
||||
---
|
||||
title: "Green Computing"
|
||||
type: concept
|
||||
tags: [Cloud, Sustainability, Environmental, Cloud Operations]
|
||||
date: 2026-04-26
|
||||
---
|
||||
|
||||
# Green Computing
|
||||
|
||||
## Definition
|
||||
**Green Computing** refers to environmentally sustainable computing practices that reduce energy consumption, minimize carbon footprint, and promote eco-friendly technology usage in data centers and cloud environments.
|
||||
|
||||
## Why It Matters
|
||||
|
||||
### Environmental Impact
|
||||
- **Data centers consume roughly 1% of global electricity**, a share that is expected to rise (International Energy Agency)
- Growing cloud infrastructure increases energy demand
- Regulators are pressuring organizations to adopt sustainable solutions
|
||||
|
||||
### Business Benefits
|
||||
- Reduce operational costs through energy efficiency
|
||||
- Meet corporate sustainability goals
|
||||
- Comply with environmental regulations
|
||||
- Enhance brand reputation
|
||||
|
||||
## Key Strategies
|
||||
|
||||
### 1. Serverless Computing
|
||||
- Eliminates unnecessary resource consumption
|
||||
- Pay-only-for-execution model
|
||||
- Automatic resource optimization
|
||||
|
||||
### 2. Sustainable Data Centers
|
||||
- Major providers investing in carbon-neutral infrastructure
|
||||
- AWS, Azure, and Google Cloud have all committed to renewable energy
|
||||
- Efficient cooling and power systems
|
||||
|
||||
### 3. Workload Optimization
|
||||
- Shift workloads to energy-efficient regions
|
||||
- Right-size resources to actual needs
|
||||
- Schedule non-critical workloads for off-peak times
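
The first point above, shifting workloads to energy-efficient regions, can be sketched as a small region picker. This is a minimal illustration, not a provider API: the region names and the `gramsCO2PerKWh` and `latencyMs` figures are made-up placeholders, and `pickGreenestRegion` is a hypothetical helper.

```javascript
// Hypothetical carbon-intensity data; the numbers are illustrative placeholders only.
const regions = [
  { name: 'region-a', gramsCO2PerKWh: 420, latencyMs: 30 },
  { name: 'region-b', gramsCO2PerKWh: 90, latencyMs: 85 },
  { name: 'region-c', gramsCO2PerKWh: 150, latencyMs: 45 },
];

// Return the lowest-carbon region that still meets the latency budget,
// or null when no region qualifies.
function pickGreenestRegion(candidates, maxLatencyMs) {
  const eligible = candidates.filter((r) => r.latencyMs <= maxLatencyMs);
  if (eligible.length === 0) return null;
  return eligible.reduce((best, r) =>
    r.gramsCO2PerKWh < best.gramsCO2PerKWh ? r : best
  );
}
```

A latency-sensitive service would call it with a tight budget, while a batch job could pass a generous one and land in the cleanest region available.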
|
||||
|
||||
### 4. Cloud Sustainability Tools
|
||||
- Carbon footprint tracking
|
||||
- Energy efficiency dashboards
|
||||
- Resource usage optimization
|
||||
|
||||
## Industry Trends
|
||||
|
||||
### Cloud Provider Commitments
|
||||
- **AWS**: net-zero carbon by 2040 (Amazon's Climate Pledge)
- **Microsoft Azure**: carbon-negative by 2030
- **Google Cloud**: running on 24/7 carbon-free energy by 2030
|
||||
|
||||
### Regulatory Pressures
|
||||
- Corporate sustainability mandates
|
||||
- Environmental reporting requirements
|
||||
- Carbon tax implications
|
||||
|
||||
## Relationship to Cloud Operating Model
|
||||
- Green Computing is an emerging pillar of modern [[Cloud Operating Model]]
|
||||
- 8% of organizations now prioritize sustainability in cloud adoption
|
||||
- Part of future trends in cloud management
|
||||
|
||||
## Related Concepts
|
||||
- [[Cloud Operating Model]]
|
||||
- [[Serverless Computing]]
|
||||
- [[Cloud Cost Optimization]]
|
||||
- [[Multi-Cloud Strategy]]
|
||||
|
||||
## Related Entities
|
||||
- [[AWS]]
|
||||
- [[Azure]]
|
||||
- [[Google-Cloud]]
|
||||
198
wiki/concepts/Hybrid-Cloud.md
Normal file
@@ -0,0 +1,198 @@
|
||||
# Hybrid Cloud
|
||||
|
||||
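> **Hybrid Cloud**: a computing environment that uses public and private clouds together, placing workloads on each according to policy to combine their strengths: the elasticity and cost efficiency of public cloud with the security and control of private cloud.

## Definition

A hybrid cloud is a cloud computing architecture that integrates public and private clouds and allows data and applications to move between the two environments. A hybrid cloud strategy decides dynamically which workloads run in which environment, based on business and technical needs (security, performance, scalability, cost, efficiency).

## Core Principle

> "The uses of each are driven by business and technical needs around: Security, Performance, Scalability, Cost, Efficiency."

The core idea of hybrid cloud is that **each workload should run in the environment that suits it best**, rather than forcing everything into a single cloud deployment model.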
|
||||
|
||||
## Architecture
|
||||
|
||||
```
                     Hybrid cloud architecture

┌────────────────────────────┐       ┌────────────────────────────┐
│ Private cloud              │       │ Public cloud               │
│ - Sensitive-data workloads │       │ - Elastic compute workloads│
│ - Core business systems    │◄─────►│ - Peak-traffic handling    │
│ - Regulated workloads      │       │ - Dev/test environments    │
│                            │       │ - SaaS applications        │
│ Dedicated physical servers │       │                            │
│ On-premises data center    │       │ AWS / Azure / GCP          │
└─────────────┬──────────────┘       └──────────────┬─────────────┘
              └──────────────────┬──────────────────┘
                ┌────────────────┴────────────────┐
                │ Integration layer (network/data)│
                │ - VPN / dedicated line          │
                │ - Data synchronization          │
                │ - Unified identity management   │
                └─────────────────────────────────┘
```
|
||||
|
||||
## Common Use Cases
|
||||
|
||||
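### 1. Core + Elastic Model

```
Private cloud: core business applications, databases, sensitive data
Public cloud: elastically scaled resources, peak-traffic handling, CDN

Example: e-commerce platform
- Private cloud: order system, payment processing, customer data
- Public cloud: scale-out for the Double 11 (Singles' Day) sales peak
```

### 2. Security + Cost Model

```
Private cloud: sensitive data, workloads with strict compliance requirements
Public cloud: non-sensitive workloads, dev/test environments

Example: healthcare provider
- Private cloud: patient records, HIPAA-regulated data
- Public cloud: public website, health information apps
```

### 3. On-Premises + Cloud Bursting Model

```
Private cloud: steady day-to-day workloads
Public cloud: temporary high-compute demand (burst workloads)

Example: engineering simulation
- Private cloud: day-to-day design work
- Public cloud: simulation compute peaks for new projects
```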
|
||||
|
||||
## Advantages
|
||||
|
||||
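### 1. Policy-Driven Flexibility

- Place workloads flexibly according to security, performance, and cost needs
- Automate placement with Policy-as-Code
- Continuously optimize where workloads run

### 2. Scaling Securely

- Use the elastic scaling of public cloud without exposing sensitive workloads to security risk
- Keep sensitive workloads on the private cloud
- Consume public cloud resources on demand, with no long-term commitment

### 3. Maximized Reliability

- Distribute services across multiple data centers (public and private)
- Avoid tying availability to a single environment
- Strengthen disaster recovery

### 4. Cost Control and Efficiency

- Run sensitive workloads on dedicated private cloud resources
- Spread routine workloads across inexpensive public cloud infrastructure
- Optimize return on investment

### 5. Interoperability and Mobility

- Move workloads smoothly between public and private environments
- Access and use data and applications across environments
- Unify identity and access management

### 6. Optimized Load Distribution

- Process sensitive workloads on the private cloud
- Process all other workloads on the public cloud
- Run every workload in the environment that suits it best

### 7. Business Continuity

- The hybrid, distributed nature makes disaster recovery simpler and faster
- A distributed architecture lowers the risk of a single point of failure
- Improved RTO/RPO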
|
||||
|
||||
## Drawbacks
|
||||
|
||||
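### 1. Complex Cost Management

- Movement between public and private environments is hard to track
- Can lead to wasteful spending
- Requires fine-grained FinOps practices

### 2. Integration Challenges

- Cloud infrastructure in different locations and of different kinds must be highly compatible
- Cross-cloud network configuration is complex
- Data consistency and synchronization are difficult

### 3. Added Complexity

- Additional architectural complexity
- Public and private environments must be managed side by side
- Higher skill requirements for the operations team

### 4. Security Risks

- Data transfer between clouds introduces vulnerabilities
- Extra network security controls are needed
- Compliance boundaries become more complicated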
|
||||
|
||||
## Homogeneous vs Heterogeneous
|
||||
|
||||
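After choosing hybrid cloud, you still need to decide on a vendor strategy:

| Type | Definition | Advantages | Disadvantages |
|------|------------|------------|---------------|
| **Homogeneous hybrid cloud** | Public and private offerings from a single cloud vendor | Simpler management, consistent toolchain, unified support | Vendor lock-in |
| **Heterogeneous hybrid cloud** | A mix of offerings from multiple cloud vendors | Avoids lock-in, best-of-breed combination | High management complexity |

Examples:

- Homogeneous: AWS Outposts + AWS public cloud
- Heterogeneous: Azure Stack + AWS public cloud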
|
||||
|
||||
## When to Use Hybrid Cloud
|
||||
|
||||
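| Scenario | Description |
|----------|-------------|
| **Organizations spanning multiple verticals** | Different business lines have different IT security, regulatory, and performance requirements |
| **Optimizing cloud investment** | Wanting to optimize cost without sacrificing value |
| **Strengthening existing cloud security** | SaaS products need to be delivered over a secure private network |
| **Strategic cloud investment** | Continuously weighing the best available service delivery models |
| **Migration transition** | A gradual path from on-premises to cloud |
| **Compliance + elasticity needs** | Data must stay local while capacity still scales elastically |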
|
||||
|
||||
## Decision Framework
|
||||
|
||||
```
Start: assess workload characteristics
                ↓
    Sensitive or regulated data?
      /yes\                  \no/
        ↓                      ↓
Private cloud first      Stable baseline load?
        ↓                  /yes\            \no/
Needs elasticity and         ↓                ↓
peak capacity?           Private cloud    Public cloud first
  /yes\       \no/           ↓
    ↓           ↓        Needs elastic scaling?
Hybrid cloud   Done        /yes\          \no/
    ↑                        │              ↓
    └────────────────────────┘     Needs data localization?
                                     /yes\         \no/
                                       ↓             ↓
                                 Hybrid cloud    Public cloud first
```
|
||||
|
||||
## Related Concepts
|
||||
|
||||
- [[Public Cloud]]
- [[Private Cloud]]
- [[Multi-Cloud Strategy]]: for comparison
- [[Shared Responsibility Model]]
- [[Cloud Adoption Strategy]]
- [[FinOps]]: cloud financial management
- [[Disaster Recovery Planning]]
- [[Cloud Security]]
- [[Vendor-Lock-In]]

## See Also

- [[Cloud Computing]]: cloud computing fundamentals
- [[Cloud-Maturity-Model]]
- [[Cloud-Adoption-Strategy]]
- [[Cloud-Governance]]
- [[High Availability]]
- [[Scalability]]
|
||||
61
wiki/concepts/Hyperautomation.md
Normal file
@@ -0,0 +1,61 @@
|
||||
---
|
||||
title: "Hyperautomation"
|
||||
type: concept
|
||||
tags: [automation, devops, ai, itsm]
|
||||
date: 2025-03-01
|
||||
---
|
||||
|
||||
## Definition
|
||||
|
||||
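Hyperautomation is a technology trend named by Gartner: combining multiple automation technologies (RPA, workflow engines, ML, AI) to maximize **end-to-end process automation**. It goes beyond simple task automation and aims to identify and automate every process in the organization that can be automated.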
|
||||
|
||||
## Core Components
|
||||
|
||||
```
┌─────────────────────────────────────────────────────────────┐
│                    Hyperautomation Stack                    │
├─────────────────────────────────────────────────────────────┤
│  Process Discovery  →  RPA  →  Workflow  →  AI/ML  →  IoT   │
│          ↓              ↓         ↓           ↓        ↓    │
│      Identify         Bots   Orchestrate   Decide    Edge   │
└─────────────────────────────────────────────────────────────┘
```
|
||||
|
||||
### Technology Components
|
||||
|
||||
| Technology | Role | Example |
|------------|------|---------|
| RPA | Mimics human actions | UI automation |
| Workflow Engine | Process orchestration | n8n, Airflow |
| AI/ML | Intelligent decision-making | Predictive analytics |
| iPaaS | System integration | API gateway |
| Low-Code | Rapid development | Process building |
|
||||
|
||||
## In ITSM Context
|
||||
|
||||
In [[ITSM 2.0]], hyperautomation is the core technology engine:

1. **[[Problem-Management]]**: automatically identify recurring problem patterns
2. **[[Incident-Management]]**: automatically classify and route tickets
3. **[[Change-Management]]**: automate impact assessment and approval
4. **[[Security-and-Compliance]]**: enforce policy automatically with Policy-as-Code

## Hyperautomation vs Automation

| Dimension | Traditional automation | Hyperautomation |
|-----------|------------------------|-----------------|
| Scope | Single-point processes | End-to-end |
| Intelligence | Rule-driven | AI-augmented |
| Discovery | Manual identification | Automatic discovery |
| Adaptation | Static | Learns dynamically |

## Related Concepts

- [[AIOps]]: AI-driven operations intelligence
- [[Self-Healing-Systems]]: self-healing automation
- [[Policy-as-Code]]: policy automation
- [[ITSM 2.0]]: the main application area for hyperautomation

## Sources

- [[understanding-complete-itsm]]: hyperautomation as applied in ITSM 2.0
|
||||
76
wiki/concepts/IAST.md
Normal file
@@ -0,0 +1,76 @@
|
||||
# IAST (Interactive Application Security Testing)
|
||||
|
||||
## Definition
|
||||
IAST tools analyze an application while it runs, detecting security issues that SAST or SCA tools might miss. They are most useful in the testing and deployment phases, when it matters how the application's components interact.
|
||||
|
||||
## Aliases
|
||||
- Interactive Application Security Testing
|
||||
- Grey-box testing
|
||||
- Instrumentation-based testing
|
||||
|
||||
## Characteristics
|
||||
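- **Runtime analysis**: monitors the application while it runs
- **Grey-box testing**: combines white-box and black-box methods
- **Precise detection**: pinpoints where a vulnerability is located
- **Low false-positive rate**: findings are based on actual execution

## How IAST Works

### Instrumentation

IAST tools embed an agent in the application to:

- Monitor the application's execution paths
- Analyze data flow
- Detect unsafe operations

### Agent Deployment

- Web server plugins
- Application server plugins
- Container environment support
- Cloud function support

## What IAST Detects

- Runtime data-flow issues
- API security issues
- Authentication/authorization issues
- Misconfiguration
- Vulnerabilities that complement the findings of [[SAST]] and [[DAST]]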
|
||||
|
||||
## Comparison with Other Testing Methods
|
||||
|
||||
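| Dimension | SAST | DAST | IAST |
|-----------|------|------|------|
| **Approach** | White-box (static) | Black-box (dynamic) | Grey-box (runtime) |
| **Needs code access** | Yes | No | Yes (via agent) |
| **False-positive rate** | Medium | Low | Low |
| **Coverage** | Code level | Application level | Code + application level |
| **Lifecycle stage** | Development | Testing/deployment | Testing |
| **Performance impact** | None | Medium | Low to medium |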
|
||||
|
||||
## Tools
|
||||
- Contrast Assess
|
||||
- Hdiv
|
||||
- Quotium Q360
|
||||
- AppCheck
|
||||
|
||||
## Integration
|
||||
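IAST is typically integrated into:

- Automated test environments
- QA testing workflows
- CI/CD pipelines (test stage)
- Pre-production environments

## Advantages

- High accuracy (few false positives)
- Precise vulnerability location
- Does not interrupt the development workflow
- Can be used for production monitoring

## Related Concepts

- [[DevSecOps]]: IAST is one of its key components
- [[SAST]]: static application security testing
- [[DAST]]: dynamic application security testing
- [[SCA]]: software composition analysis
- [[RASP]]: runtime application self-protection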
|
||||
|
||||
## Sources
|
||||
- [[what-is-devsecops-best-practices-benefits-and-tools]]
|
||||
58
wiki/concepts/IP纯净度.md
Normal file
@@ -0,0 +1,58 @@
|
||||
---
|
||||
title: "IP纯净度"
|
||||
type: concept
|
||||
tags: [ip, risk-assessment, security]
|
||||
date: 2025-12-31
|
||||
---
|
||||
|
||||
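# IP纯净度 (IP Purity)

## Definition

IP purity is a risk rating of how safe and reliable an IP address is. A risk score is computed from factors such as the address's usage history, whether it has been flagged as a spam source, and whether it is used by VPN or proxy services.

## Detection Tools

- **Scamalytics**: https://scamalytics.com/ (widely used IP risk assessment site)
- **IP111**: https://ip111.cn/ (IP geolocation check)

## Risk Levels

### Low Risk (recommended)

- Low score; the lower the score, the safer the address
- Good IP reputation with no abnormal history
- Suitable for registering important accounts

### Medium Risk (use with caution)

- Some problems in the address's history
- May be flagged by some platforms
- Raises the risk of account bans

### High Risk (avoid)

- Poor IP reputation with clearly abnormal history
- Flagged by multiple platforms
- Very likely to lead to an account ban

## Key Principle

> **The lower the score, the safer**: only a low-risk IP effectively reduces the probability of an account ban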
|
||||
|
||||
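## IP Consistency Check

Check the address on several sites and make sure they all report the same IP location:

1. A China-based IP lookup site
2. An international IP lookup site
3. Google's IP detection

**All three must agree exactly**; otherwise the platform may treat the connection as anomalous.

## Factors That Affect Purity

- Whether it is a data-center IP (residential IPs score better)
- Whether it has been used to send spam
- Whether it is heavily used by VPN/proxy services
- Whether it appears on blocklists
- DNS leaks

## Related Concepts

- [[Socks5代理]]
- [[账号隔离]]
- [[指纹浏览器]]

## Sources

- [[如何用指纹浏览器安全注册并订阅claude-pro会员全攻略]]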
|
||||
35
wiki/concepts/ISOHybrid镜像.md
Normal file
@@ -0,0 +1,35 @@
|
||||
---
|
||||
title: "ISOHybrid镜像"
|
||||
type: concept
|
||||
tags: [iso, uefi, bios, boot, rufus]
|
||||
date: 2026-04-14
|
||||
aliases: [ISOHybrid, hybrid ISO, 混合镜像]
|
||||
---
|
||||
|
||||
# ISOHybrid镜像 (ISOHybrid Image)

## Definition

An ISO image format that supports both BIOS (MBR) and UEFI boot. The official Ubuntu ISOs are of this kind. When writing one to a USB drive with a tool such as Rufus, the write mode has to be chosen explicitly.

## Two Writing Modes

| Mode | When to use | Notes |
|------|-------------|-------|
| **ISO image mode** | Recommended first choice | Preserves the ISO structure; best compatibility |
| **DD image mode** | Fallback (when booting fails) | Byte-for-byte copy; suits some stubborn devices |

## Why It Matters

Rufus shows a mode-selection dialog when writing an ISOHybrid image. Choosing the wrong mode can leave the USB drive unbootable on the target device, especially on strict-UEFI machines such as the HP ZBook.

## ISOHybrid Settings for the HP ZBook

- **Partition scheme**: GPT (required, to pair with UEFI)
- **Target system**: UEFI (non CSM) (matched automatically)
- **File system**: FAT32 (the UEFI standard)

## Related

- [[Rufus]]: the writing tool
- [[HP ZBook]]: the target device
- [[GPT分区表]]: the partition scheme
- [[ISOHybrid镜像]] ← written by [[Rufus]] to [[HP ZBook]]

## Sources

- [[安装ubuntu-24-04-2在hp-zbook工作站笔记本上]]
|
||||
56
wiki/concepts/ITSM-2.0.md
Normal file
@@ -0,0 +1,56 @@
|
||||
---
|
||||
title: "ITSM 2.0"
|
||||
type: concept
|
||||
tags: [cloud, devops, ai, itsm]
|
||||
date: 2025-03-01
|
||||
---
|
||||
|
||||
## Definition
|
||||
|
||||
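ITSM 2.0 is the next-generation paradigm for IT service management. It combines [[AIOps]] and [[Hyperautomation]] to become **self-learning, predictive, and autonomous**, marking a fundamental shift from traditional reactive service management to proactive, preventive, intelligent operations.

## Core Characteristics

| Characteristic | Description | Enabling technology |
|----------------|-------------|---------------------|
| **Self-Learning** | The system learns from historical data and automatically optimizes operational decisions | ML models, feedback loops |
| **Predictive** | Predicts potential failures and acts before problems occur | Predictive analytics, root-cause prediction |
| **Autonomous** | Executes operational tasks automatically, reducing manual intervention | AIOps, self-healing systems |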
|
||||
|
||||
## Key Enablers
|
||||
|
||||
### 1. AIOps Integration
|
||||
```
|
||||
Event data → ML model → anomaly detection → root-cause analysis → automated remediation
|
||||
```
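
The anomaly-detection stage of that pipeline can be sketched with a plain z-score check. This is a deliberately simple stand-in for the ML model, not how any particular AIOps product works; `detectAnomalies` and its default threshold are assumptions made for illustration.

```javascript
// Flag values whose z-score against the whole window exceeds a threshold.
// Uses the population standard deviation of the supplied values.
function detectAnomalies(values, threshold = 2) {
  const n = values.length;
  if (n === 0) return [];
  const mean = values.reduce((sum, v) => sum + v, 0) / n;
  const variance = values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / n;
  const stdDev = Math.sqrt(variance);
  if (stdDev === 0) return []; // all values identical: nothing stands out
  return values
    .map((value, index) => ({ index, value, z: (value - mean) / stdDev }))
    .filter((point) => Math.abs(point.z) > threshold);
}

// Hypothetical response-time samples in milliseconds, with one spike at the end.
const latencies = [100, 102, 98, 101, 99, 100, 103, 97, 100, 500];
const anomalies = detectAnomalies(latencies);
```

With a single outlier in a window of n samples, the z-score can never exceed the square root of n - 1, so the threshold has to be chosen with the window size in mind.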
|
||||
|
||||
### 2. Hyperautomation
|
||||
- [[Policy-as-Code]]: automated compliance policy
- [[Self-Healing-Systems]]: automatic failure recovery
- End-to-end process bots

### 3. ITSM 2.0 Eight Processes

1. **[[Problem-Management]] 2.0**: AI-driven root-cause prediction
2. **[[Incident-Management]] 2.0**: self-healing with response in seconds
3. **[[Change-Management]] 2.0**: intelligent approval driven by risk prediction
4. **[[Release-Management]] 2.0**: progressive delivery and canary releases
5. **[[Configuration-Management]] 2.0**: AI-augmented CMDB
6. **[[Asset-Management]] 2.0**: intelligent lifecycle management
7. **[[Security-and-Compliance]] 2.0**: ZTA + Policy-as-Code
8. **[[Disaster-Recovery]] 2.0**: DRaaS + RTO/RPO optimization
|
||||
|
||||
## Industry Trend
|
||||
|
||||
> "The convergence of AIOps, hyperautomation, and ITSM 2.0 is defining a new paradigm: self-learning, predictive, and autonomous IT operations." — shenwei, LinkedIn
|
||||
|
||||
## Related Concepts
|
||||
|
||||
- [[ITSM]]: the traditional ITSM foundation
- [[AIOps]]: the core of operations intelligence
- [[Hyperautomation]]: the automation engine
- [[Self-Healing-Systems]]: autonomous recovery capability

## Sources

- [[understanding-complete-itsm]]: source of the core ITSM 2.0 concepts
|
||||
54
wiki/concepts/ITSM.md
Normal file
@@ -0,0 +1,54 @@
|
||||
---
|
||||
title: "IT Service Management (ITSM)"
|
||||
type: concept
|
||||
tags: [cloud, devops, operations, it-management]
|
||||
date: 2025-03-01
|
||||
---
|
||||
|
||||
## Definition
|
||||
|
||||
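IT Service Management (ITSM) is a set of methods and practices for designing, delivering, managing, and improving IT services. Traditional ITSM centers on ticket handling, whereas modern ITSM has evolved into **a strategic enabler of operational excellence, risk mitigation, and faster innovation**.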
|
||||
|
||||
## Core Framework
|
||||
|
||||
### Eight Core Processes (Modern ITSM)
|
||||
|
||||
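| Process | Core function | Key technologies |
|---------|---------------|------------------|
| [[Problem-Management]] | Root-cause analysis, preventing repeat failures | AI anomaly detection, predictive analytics |
| [[Incident-Management]] | Fast recovery, less business disruption | AIOps, self-healing systems |
| [[Change-Management]] | Controlled change, risk assessment | AI impact assessment, IaC compliance |
| [[Release-Management]] | Progressive delivery, zero-disruption releases | Blue-Green, Canary |
| [[Configuration-Management]] | Dependency mapping, drift detection | AI-CMDB, multi-cloud orchestration |
| [[Asset-Management]] | Lifecycle tracking, cost optimization | SAM, cloud cost management |
| [[Security-and-Compliance]] | Risk scoring, threat intelligence | ZTA, Policy-as-Code |
| [[Disaster-Recovery]] | Business continuity, disaster tolerance | DRaaS, RTO/RPO optimization |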
|
||||
|
||||
## Evolution
|
||||
|
||||
```
|
||||
Traditional ITSM Modern ITSM ITSM 2.0
|
||||
───────────────── → ───────────── → ──────────────
|
||||
- Ticketing-centric - Service-centric - AI-driven
|
||||
- Reactive - Proactive - Predictive
|
||||
- Manual - Automated - Autonomous
|
||||
- Siloed - Integrated - Self-healing
|
||||
```
|
||||
|
||||
## Key Technologies
|
||||
|
||||
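- **[[AIOps]]**: machine-learning-driven operations intelligence
- **[[CMDB]]**: AI-augmented configuration management database
- **[[Hyperautomation]]**: end-to-end process automation
- **[[Self-Healing-Systems]]**: automated fault detection and repair

## Related Concepts

- [[ITSM-2.0]]: next-generation ITSM that is self-learning, predictive, and autonomous
- [[DevOps]]: the cultural convergence of ITSM and DevOps
- [[SRE]]: site reliability engineering and service-level management
- [[ITIL]]: the IT Infrastructure Library, the methodology framework behind ITSM

## Sources

- [[understanding-complete-itsm]]: eight modern ITSM trends and AI-driven transformation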
|
||||
75
wiki/concepts/Immutable-Infrastructure.md
Normal file
@@ -0,0 +1,75 @@
|
||||
# Immutable Infrastructure
|
||||
|
||||
## Definition
|
||||
Immutable Infrastructure is an approach where components are never modified after deployment. Instead of updating existing components, new versions are created and replaced entirely.
|
||||
|
||||
## Concept
|
||||
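Immutable infrastructure is a deployment strategy in which servers and infrastructure components are never modified once deployed. Any change requires building a new version and replacing the whole component.

## Core Principles

### 1. Never Modify Running Systems

- Do not change configuration directly in production
- Make every change through redeployment
- Use versioned configuration and templates

### 2. Replace, Don't Modify

- New version = new environment
- The old version is destroyed outright
- Consistency is guaranteed

### 3. Infrastructure as Code

- Define all infrastructure as code
- Keep all configuration under version control
- Make the deployment process repeatable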
|
||||
|
||||
## Benefits for DevSecOps
|
||||
|
||||
### Security Benefits
|
||||
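- **Smaller attack surface**: no interactive access to production
- **Guaranteed consistency**: every environment is identical
- **Fast rollback**: switch back quickly when a problem is found
- **Simpler audits**: the code is the record

### Operational Benefits

- Consistent environments
- Predictable deployments
- Simpler troubleshooting
- Easier scaling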
|
||||
|
||||
## Implementation Patterns
|
||||
|
||||
### Container-Based Approach
|
||||
```
|
||||
Container image = application + dependencies + configuration
Every change → new image version → rolling update
|
||||
```
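
The "replace, don't modify" rule can be modeled in a few lines. This is only a toy sketch of the idea, not a deployment tool: `createDeployment` and `release` are hypothetical helpers, and the image name, tags, and `replicas` setting are made up.

```javascript
// A deployment record is frozen once created; any change yields a new record.
function createDeployment(image, tag, config) {
  return Object.freeze({ image, tag, config: Object.freeze({ ...config }) });
}

// Build the next version from the current one instead of editing it in place.
function release(current, nextTag, configChanges = {}) {
  return createDeployment(current.image, nextTag, {
    ...current.config,
    ...configChanges,
  });
}

const v1 = createDeployment('web', '1.0.0', { replicas: 2 });
const v2 = release(v1, '1.1.0', { replicas: 3 });
```

Because `v1` is never touched, rolling back is just a matter of pointing traffic at the earlier record again.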
|
||||
|
||||
### Cloud Infrastructure
|
||||
- AWS: AMIs + Auto Scaling
- Kubernetes: Pod re-creation
- Terraform: managing immutable configuration

## Best Practices

1. **Manage versions with tags**
2. **Automate the build process**
3. **Keep historical image versions**
4. **Use blue-green deployments or rolling updates**
5. **Monitor immutable resources for changes**
|
||||
|
||||
## Related Concepts
|
||||
- [[DevSecOps]]: immutable infrastructure is an important part of its security architecture
- [[Policy-as-Code]]: policy expressed as code
- [[Container-Lifecycle-Hardening]]: container security hardening
- [[Blue-Green-Deployment]]: the blue-green deployment pattern
- [[Infrastructure-as-Code]]

## Tools

- Packer: image build tool
- Terraform: IaC tool
- Kubernetes: container orchestration
- Docker: containerization
|
||||
|
||||
## Sources
|
||||
- [[what-is-devsecops-best-practices-benefits-and-tools]]
|
||||
74
wiki/concepts/Incident-Management.md
Normal file
@@ -0,0 +1,74 @@
|
||||
---
|
||||
title: "Incident Management"
|
||||
type: concept
|
||||
tags: [itsm, operations, reliability]
|
||||
date: 2025-03-01
|
||||
---
|
||||
|
||||
## Definition
|
||||
|
||||
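Incident Management is one of the core [[ITSM]] processes. It focuses on **restoring normal service operation quickly**, minimizing the business impact of service outages or degradation.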
|
||||
|
||||
## Incident Lifecycle
|
||||
|
||||
```
|
||||
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
|
||||
│ Event │ → │ Detect │ → │ Triage │ → │ Resolve │ → │ Review │
|
||||
│ Occurs │ │ & Alert │ │ & Prior │ │ & Recover│ │ & Learn │
|
||||
└─────────┘ └─────────┘ └─────────┘ └─────────┘ └─────────┘
|
||||
```
|
||||
|
||||
## Modern Incident Management (ITSM 2.0)
|
||||
|
||||
In [[ITSM 2.0]], incident management is driven by [[AIOps]] and [[Self-Healing-Systems]]:
|
||||
|
||||
### Key Capabilities
|
||||
|
||||
| Capability | Description | Technology |
|------------|-------------|------------|
| Real-time Observability | Observe system state in real time | Metrics, Logs, Traces |
| Automated Remediation | Fix known issues automatically | AIOps, Runbooks |
| Dynamic Prioritization | Adjust priority dynamically | ML Models |
| Auto-escalation | Escalate automatically | Alert Routing |
| Self-Healing | Recover without manual intervention | Automated Recovery |
|
||||
|
||||
### AIOps-Powered Incident Response
|
||||
|
||||
```
|
||||
Detection  →  Classification  →  Routing  →  Remediation  →  SLA monitoring
    ↓               ↓               ↓             ↓                ↓
  AIOps         ML models     Skill routing   Runbooks     Alert escalation
|
||||
```
|
||||
|
||||
## Key Metrics
|
||||
|
||||
| Metric | Description |
|--------|-------------|
| [[MTTR]] | Mean Time to Recovery: average time to restore service |
| [[MTTD]] | Mean Time to Detect: average time to detect an incident |
| MTTA | Mean Time to Acknowledge: average time to acknowledge an alert |
| Change Failure Rate | Share of changes that result in a failure |
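
As a rough illustration, the three time-based metrics can be computed from incident timestamps. The field names (`occurredAt`, `detectedAt`, `acknowledgedAt`, `resolvedAt`) are hypothetical, and the sketch assumes MTTR is measured from detection to resolution; some teams measure it from the moment the incident started instead.

```javascript
const MINUTE = 60 * 1000;

// Hypothetical incident records with timestamps in epoch milliseconds.
const incidents = [
  { occurredAt: 0, detectedAt: 2 * MINUTE, acknowledgedAt: 5 * MINUTE, resolvedAt: 32 * MINUTE },
  { occurredAt: 0, detectedAt: 4 * MINUTE, acknowledgedAt: 9 * MINUTE, resolvedAt: 64 * MINUTE },
];

// Average gap between two timestamp fields, in minutes.
function meanMinutes(items, fromField, toField) {
  const total = items.reduce((sum, item) => sum + (item[toField] - item[fromField]), 0);
  return total / items.length / MINUTE;
}

function incidentMetrics(items) {
  return {
    mttd: meanMinutes(items, 'occurredAt', 'detectedAt'),
    mtta: meanMinutes(items, 'detectedAt', 'acknowledgedAt'),
    mttr: meanMinutes(items, 'detectedAt', 'resolvedAt'),
  };
}

const metrics = incidentMetrics(incidents);
```

Whichever start point is chosen for each metric, it should be written down, since numbers from teams using different conventions are not comparable.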
|
||||
|
||||
## Priority Levels
|
||||
|
||||
| Priority | Description | SLA |
|----------|-------------|-----|
| P1/Critical | Core service unavailable | 15 minutes |
| P2/High | Major functionality unavailable | 1 hour |
| P3/Medium | Minor functionality affected | 4 hours |
| P4/Low | Slight impact | 24 hours |
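
The priority table translates directly into a deadline check. The sketch below uses the SLA values from the table; `slaDeadline` and `isBreached` are hypothetical helpers, and the table does not say whether the SLA covers response or resolution, so the code simply treats it as a time limit counted from when the incident was opened.

```javascript
// SLA targets from the priority table, in minutes.
const SLA_MINUTES = { P1: 15, P2: 60, P3: 240, P4: 1440 };

// Deadline for an incident opened at `openedAt` (a Date).
function slaDeadline(priority, openedAt) {
  const minutes = SLA_MINUTES[priority];
  if (minutes === undefined) throw new Error(`Unknown priority: ${priority}`);
  return new Date(openedAt.getTime() + minutes * 60 * 1000);
}

// True once `now` is past the deadline.
function isBreached(priority, openedAt, now) {
  return now.getTime() > slaDeadline(priority, openedAt).getTime();
}
```

A dynamic-prioritization step could call `slaDeadline` again whenever an incident is re-classified, keeping the original opening time as the baseline.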
|
||||
|
||||
## Related Concepts
|
||||
|
||||
- [[ITSM]]: parent framework
- [[Problem-Management]]
- [[AIOps]]: AI operations capability
- [[Self-Healing-Systems]]
- [[MTTR]]: mean time to recovery
- [[MTTD]]: mean time to detect
- [[Event-Correlation]]
- [[Root-Cause-Analysis]]
|
||||
|
||||
## Sources
|
||||
|
||||
- [[understanding-complete-itsm]] — AIOps-driven Incident Management
|
||||
@@ -13,6 +13,7 @@ Infrastructure as Code is the practice of managing and provisioning infrastructu
|
||||
- **Terraform**: Multi-cloud IaC tool using HCL
|
||||
- **Ansible**: Configuration management and orchestration
|
||||
- **CloudFormation**: AWS-native infrastructure provisioning
|
||||
- **CloudFormation StackSets**: AWS-native cross-account/cross-region deployment extension for CloudFormation
|
||||
- **Pulumi**: IaC using general-purpose programming languages
|
||||
- **Terragrunt**: Thin wrapper around Terraform that helps organize configurations and keep them DRY
|
||||
|
||||
|
||||
105
wiki/concepts/Intentional-Cloud-Strategy.md
Normal file
@@ -0,0 +1,105 @@
|
||||
---
|
||||
title: Intentional Cloud Strategy
|
||||
type: concept
|
||||
tags: [cloud-computing, strategy, architecture, decision-making]
|
||||
date: 2026-04-19
|
||||
---
|
||||
|
||||
# Intentional Cloud Strategy
|
||||
|
||||
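**Intentional Cloud Strategy** is a systematic decision framework for cloud deployment. It asks an organization to choose the right deployment model deliberately, based on each workload's specific needs (security, performance, cost, compliance), rather than following trends or a single-vendor preference.

## Definition

> "The key element in balancing your choices is to develop an intentional cloud strategy that optimizes your use of each cloud environment. Start with defining the needs of your various workloads, then prioritize them based on the pros and cons of each model." (BMC Blog)

Core idea: **balance is the central driver of cloud architecture**. What suits the organization today may not suit it tomorrow, so the cloud strategy needs continuous evaluation and iteration.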
|
||||
|
||||
## Decision Framework
|
||||
|
||||
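### Step 1: Workload Classification

Classify workloads by their requirements:

| Dimension | Questions |
|-----------|-----------|
| **Security** | Is the data sensitive? Is it subject to industry regulation or standards (HIPAA/GDPR/ISO 27001)? |
| **Performance** | Is low latency required? Are there strict SLA requirements? |
| **Scalability** | Are there unpredictable traffic peaks? |
| **Cost** | How do long-term operating costs weigh against up-front investment? |
| **Compliance** | Are there data-sovereignty requirements or restrictions on physical location? |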
|
||||
|
||||
### Step 2: Mapping Workloads to Deployment Models
|
||||
|
||||
```
|
||||
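Workload requirements                            → Recommended deployment model
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
High security + high compliance + fixed load     → [[Private Cloud]]
Low security + high elasticity + cost-sensitive  → [[Public Cloud]]
High security + high elasticity + cross-region   → [[Hybrid Cloud]]
Multi-vendor + high availability                 → [[Multi-Cloud-Strategy]]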
|
||||
```
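
The mapping above can be read as a small lookup function. This sketch reduces the four rows to two inputs plus a multi-vendor flag, which is a simplification: `recommendDeployment` and its field names are hypothetical, and a real assessment would weigh the other dimensions from Step 1 as well.

```javascript
// Map coarse workload requirements to a deployment model.
// `security` and `elasticity` take the values 'high' or 'low'.
function recommendDeployment({ security, elasticity, multiVendor = false }) {
  if (multiVendor) return 'Multi-Cloud';
  if (security === 'high' && elasticity === 'high') return 'Hybrid Cloud';
  if (security === 'high') return 'Private Cloud';
  return 'Public Cloud';
}

// Example: a regulated system with a steady, predictable load.
const choice = recommendDeployment({ security: 'high', elasticity: 'low' });
```

Keeping the rules in one function makes the periodic re-evaluation described in Step 3 easy to repeat as requirements change.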
|
||||
|
||||
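### Step 3: Continuous Evaluation and Balancing

A cloud strategy is not a one-time decision. It requires:

- Reviewing the current workload distribution regularly (quarterly or annually)
- Tracking changes in TCO and spotting over-provisioning
- Assessing how new technologies (Serverless, Edge Computing) affect the architecture
- Staying alert to vendor lock-in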
|
||||
|
||||
## Three Deployment Models Compared
|
||||
|
||||
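| Dimension | Public Cloud | Private Cloud | Hybrid Cloud |
|-----------|--------------|---------------|--------------|
| **Security** | Medium (multi-tenant isolation) | High (dedicated resources) | Controllable (sensitive data stays private) |
| **Cost efficiency** | High (small to mid scale) / medium (hyperscale) | Medium to high (large, stable loads) | Best (dynamic allocation) |
| **Elastic scaling** | Very high | Limited by private capacity | High (public capacity on demand) |
| **Compliance support** | Baseline compliance | Full control | Flexible compliance |
| **Management complexity** | Low | High | Medium to high |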
|
||||
|
||||
## Workload Allocation Example (Hybrid)
|
||||
|
||||
```
|
||||
Hybrid Cloud Workload Distribution:
┌──────────────────────────────────────────────────────────────┐
│                         Hybrid Cloud                         │
│                                                              │
│   Private Cloud                   Public Cloud               │
│  ┌────────────────────────┐      ┌────────────────────────┐  │
│  │ Core business systems  │◄────►│ Burst-traffic scale-out│  │
│  │ (compliance-sensitive) │policy│ (Dev/Test)             │  │
│  │ ERP/CRM                │driven│ CDN / static assets    │  │
│  │ Databases              │      │ Batch jobs             │  │
│  └────────────────────────┘      └────────────────────────┘  │
│                  ↕ data/application sharing                  │
└──────────────────────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
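## Common Anti-Patterns (working against an intentional strategy)

1. **All-in on one model**: everything on public cloud (ignoring compliance) or everything private (losing elasticity)
2. **Following trends**: chasing the "hybrid cloud" label without solving an actual business problem
3. **Vendor lock-in**: over-reliance on a single cloud vendor, with high migration cost
4. **Ignoring TCO**: looking only at initial cost and overlooking long-term operating expense
5. **No re-evaluation**: workloads are never reviewed after deployment, leaving resources under-utilized

## Benefits of Intentional Approach

- **Optimized cost**: each workload runs on the cloud model where it costs least
- **Stronger security**: sensitive assets are protected by dedicated resources
- **Business continuity**: a hybrid architecture provides better disaster recovery
- **Technical agility**: fast response to business change and new technology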
|
||||
|
||||
## Related Concepts
|
||||
|
||||
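- [[Cloud-Adoption-Strategy]]: the process of formulating a cloud strategy
- [[Hybrid Cloud]]: a common way to implement an intentional strategy
- [[Multi-Cloud-Strategy]]: a more advanced form of intentional strategy
- [[Cost-Optimization]]: an intentional strategy drives cost optimization
- [[Vendor-Lock-In]]: an intentional strategy has to consider how to avoid it
- [[CapEx vs OpEx]]: the financial basis for intentional decisions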
|
||||
|
||||
## Sources
|
||||
|
||||
- [[Public vs Private vs Hybrid Cloud Differences Explained|sources/public-vs-private-vs-hybrid-cloud-differences-explained]]
|
||||
33
wiki/concepts/JFFS双清.md
Normal file
@@ -0,0 +1,33 @@
|
||||
# JFFS双清 (JFFS Double Wipe)

## Definition

The process of clearing the JFFS (Journaling Flash File System) partition on ASUSWRT-based routers, removing cached file-system data and old configuration so that the firmware starts from a clean environment after flashing.
|
||||
|
||||
## When to Use
|
||||
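- After flashing (from stock firmware to Merlin firmware)
- When problems appear after a firmware upgrade
- To reset plugin configuration
- To clear leftover cache data

## What It Clears

- JFFS partition contents (plugin storage, configuration files)
- Cached data
- Old Software Center data

## How to Perform

1. Log in to the Merlin firmware admin interface
2. Find the "Restore/Reset" option
3. Choose "Restore factory settings"
4. Run the JFFS wipe (this may require a manual command)

## Importance

- Prevents old configuration from interfering with the new firmware
- Clears out stale plugin leftovers
- Keeps the firmware environment clean and stable

## Related

- [[固件刷入]]: the JFFS wipe is one step of the flashing procedure
- [[梅林固件]]: Merlin firmware stores its persistent data on the JFFS partition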
|
||||
101
wiki/concepts/Kill-Switch.md
Normal file
@@ -0,0 +1,101 @@
|
||||
---
|
||||
title: "Kill Switch (应急切断开关)"
|
||||
tags: [devops, disaster-recovery, reliability, feature-management, emergency-response]
|
||||
created: 2026-04-25
|
||||
---
|
||||
|
||||
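# Kill Switch

A **Kill Switch** is an emergency use of a [[Feature Flag]]: a mechanism that bypasses or shuts off a failing component in production without redeploying code. It is a software-level safeguard for [[High Availability]].

## Definition

A kill switch is a feature toggle placed on a critical system path. In an emergency it redirects traffic to a backup path, a backup component, or a degraded service, bringing RTO down to seconds.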
|
||||
|
||||
## Core Use Cases
|
||||
|
||||
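| Scenario | Kill Switch action | RTO |
|----------|--------------------|-----|
| Payment gateway failure | Switch to the backup payment provider | Seconds |
| Abnormal search results | Fall back to the old search algorithm | Seconds |
| AI model hallucinating | Switch back to the previous model version | Seconds |
| Database migration causing latency | Disable features tied to the new migration | Seconds |
| Third-party API timing out | Switch to cached data or a mock response | Seconds |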
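
The payment-gateway row can be sketched as follows. Everything here is hypothetical and kept in memory for illustration: the switch name `kill_payment_primary`, the two provider objects, and `setKillSwitch`. A real system would read the switch from a flag service or configuration store so it can change without a deploy.

```javascript
// In-memory switch store standing in for a flag service.
const killSwitches = new Map([['kill_payment_primary', false]]);

function setKillSwitch(name, on) {
  killSwitches.set(name, on);
}

// Hypothetical providers standing in for real payment integrations.
const primaryProvider = { charge: (amount) => ({ provider: 'primary', amount }) };
const backupProvider = { charge: (amount) => ({ provider: 'backup', amount }) };

// Route each charge according to the current switch state.
function charge(amount) {
  const provider = killSwitches.get('kill_payment_primary')
    ? backupProvider
    : primaryProvider;
  return provider.charge(amount);
}
```

Flipping the switch with `setKillSwitch('kill_payment_primary', true)` reroutes the very next call, which is where the seconds-level RTO comes from.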
|
||||
|
||||
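## Comparison with Traditional Rollback

| Dimension | Traditional deployment rollback | Kill Switch |
|-----------|---------------------------------|-------------|
| Time to take effect | Minutes to hours | Seconds |
| Data impact | New transaction data may be lost | Data layer untouched |
| Failure window | Users are affected for the whole rollback | Service recovers as soon as the switch is flipped |
| Operational complexity | Requires a redeploy through CI/CD | A configuration change: flip the switch |
| Scope | Global, affects all users | Can be targeted (region / user segment / device) |

## The Value of a Kill Switch

> "Instead of debugging under pressure while users suffer, you flip a switch and fix the problem properly later. Everybody wins."

**Traditional approach**: debug under pressure → users keep suffering → fix quality cannot be guaranteed within the time window

**Kill Switch approach**: stop the bleeding immediately → switch to a safe state → analyze the root cause without time pressure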
|
||||
|
||||
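## Design Principles

### 1. Identify Critical Paths

- Payment flow
- User authentication
- Core data writes
- Calls to third-party dependencies

### 2. Prepare Backup Paths in Advance

- Backup payment provider
- Previous algorithm version
- Cached data
- Degraded mode

### 3. Observability First

- Before flipping a kill switch, you need to know exactly what is happening
- Metrics, alerts, and logs must be in place
- After the switch, keep monitoring the health of the backup path

### 4. Documentation

- Define the trigger conditions for every kill switch ahead of time
- Team members need to know where the switches are and how to use them
- Rehearse regularly (Chaos Engineering)

## Kill Switch and [[RTO]]/[[RPO]]

A kill switch directly affects RTO and RPO:

- **RTO**: drops from hours to seconds (a configuration change, with no code deployment)
- **RPO**: stays near zero (no data rollback is triggered and no new transactions are lost)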
|
||||
|
||||
## Kill Switch vs. Circuit Breaker
|
||||
|
||||
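| Dimension | Kill Switch | Circuit Breaker |
|-----------|-------------|-----------------|
| Trigger | Manual (human decision) | Automatic (error-rate threshold) |
| Best suited to | Known failures with a prepared plan | Unknown, unanticipated failures |
| Flexibility | High (the backup path can be chosen) | Medium (jumps to the fallback automatically) |
| Prerequisites | A backup option must be ready | Senses system stress automatically |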
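
For contrast with the manual kill switch, here is a minimal circuit-breaker sketch. It is a simplified, synchronous illustration rather than a production implementation: it counts consecutive failures instead of tracking an error rate, and the class name, option names, and defaults are all assumptions.

```javascript
class CircuitBreaker {
  constructor({ failureThreshold = 3, resetAfterMs = 30000, now = () => Date.now() } = {}) {
    this.failureThreshold = failureThreshold;
    this.resetAfterMs = resetAfterMs;
    this.now = now; // injectable clock, which makes the breaker easy to test
    this.failures = 0;
    this.openedAt = null; // null means the circuit is closed
  }

  // Run `action`; use `fallback` when the circuit is open or the action throws.
  call(action, fallback) {
    if (this.openedAt !== null) {
      if (this.now() - this.openedAt < this.resetAfterMs) return fallback();
      this.openedAt = null; // cool-down elapsed: let one trial call through
    }
    try {
      const result = action();
      this.failures = 0; // success closes the circuit fully
      return result;
    } catch (error) {
      this.failures += 1;
      if (this.failures >= this.failureThreshold) this.openedAt = this.now();
      return fallback();
    }
  }
}
```

The breaker trips on its own once failures pile up, whereas the kill switch waits for a person to decide. The two are often combined, with the kill switch acting as a manual override.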
|
||||
|
||||
## Practical Recommendations

1. **Do not over-engineer kill switches**: one or two switches per critical path are enough
2. **Give switches meaningful names**: `kill_payment_v2` rather than `flag_42`
3. **Test kill switches**: regularly verify outside of emergencies that each switch works
4. **Monitor switch state**: always know which switches are on and which are off

## Related Concepts

- [[Feature Flag]]: a kill switch is an emergency use of a feature flag
- [[RTO]]: a kill switch cuts RTO from hours to seconds
- [[RPO]]: a kill switch protects RPO (no data is lost)
- [[High Availability]]: a kill switch is a software-level safeguard for HA
- [[Disaster Recovery]]: a kill switch is an important tool in modern disaster recovery
- [[Micro-Recovery]]: a kill switch enables precise, feature-level recovery
|
||||
|
||||
## Sources
|
||||
|
||||
- [[sources/rto-vs-rpo-key-differences-for-modern-disaster-recovery.md]]
|
||||
93
wiki/concepts/Micro-Recovery.md
Normal file
@@ -0,0 +1,93 @@
|
||||
---
|
||||
title: "Micro-Recovery (微恢复)"
|
||||
tags: [devops, disaster-recovery, reliability, feature-management]
|
||||
created: 2026-04-25
|
||||
---
|
||||
|
||||
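# Micro-Recovery

**Micro-Recovery** is the ability to recover one specific feature precisely instead of rolling back an entire deployment. It is the central shift in thinking that [[Feature Flag]]s bring: the application is no longer treated as a single recovery unit, and risk is managed at feature granularity.

## Definition

> "Don't treat your entire app like one big system. Different features have different risks and business impacts, so they should have different recovery targets."

Traditional disaster recovery treats the whole system as the recovery target, whereas Micro-Recovery narrows the recovery granularity to a single feature module.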
|
||||
|
||||
## Traditional Approach vs. Micro-Recovery

| Dimension | Traditional full rollback | Micro-Recovery |
|-----------|---------------------------|----------------|
| Recovery granularity | Entire deployment/system | Single feature |
| RTO | Hours | Seconds |
| RPO | Depends on backup frequency | Near zero |
| Blast radius | Global (all users) | Local (can be targeted) |
| User experience | Users may notice an outage | Users may notice nothing |

## Feature-Level Recovery Targets

Different features carry different risks and business impact:

| Feature type | RTO target | RPO target | Recovery strategy |
|--------------|------------|------------|-------------------|
| Core payment processing | Seconds | Zero loss | Kill Switch → backup provider |
| New recommendation engine | 5 minutes | 15 minutes | Feature Flag → old algorithm |
| Beta dashboard feature | 30 minutes | 1 hour | Feature Flag → disable the feature |
|
||||
|
||||
## Micro-Recovery 的优势
|
||||
|
||||
### 1. 精准止血
|
||||
发现某功能异常时,只关闭该功能,其他正常功能不受影响。
|
||||
|
||||
### 2. 用户无感知
|
||||
> "Your checkout flow has a bug? Disable the new version and fall back to the old one in seconds. Users might not even notice."
|
||||
|
||||
### 3. 数据保护
|
||||
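Flipping a [[Feature Flag]] only changes the code execution path and does not touch the data layer, so the switch itself loses no data and RPO is unaffected. Any bad data the faulty path already wrote still has to be repaired separately.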
|
||||
|
||||
### 4. 定向恢复
|
||||
如果某功能只影响特定地区或用户群,可以只针对该群体禁用,其他用户继续使用新功能。
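
Targeted recovery can be sketched as a flag that carries a list of segments to exclude. This is an illustrative model only: `TargetedFlag`, `choose_engine`, and the region names are assumptions for this example, not the API of any particular flag product.

```python
from dataclasses import dataclass, field


@dataclass
class TargetedFlag:
    """A feature flag that can be switched off for specific regions only (hypothetical model)."""
    name: str
    enabled: bool = True
    disabled_regions: set = field(default_factory=set)

    def is_enabled_for(self, region: str) -> bool:
        # A global "off" wins; otherwise only the listed regions fall back.
        return self.enabled and region not in self.disabled_regions


def choose_engine(flag: TargetedFlag, region: str) -> str:
    """Return which code path a request from `region` should take."""
    return "new" if flag.is_enabled_for(region) else "legacy"
```

Adding a region to `disabled_regions` moves only that region back to the legacy path; everyone else keeps the new feature.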
|
||||
|
||||
## 实现方式
|
||||
|
||||
Micro-Recovery 通过 [[Feature Flag]] 实现:
|
||||
|
||||
```javascript
// Checkout flow example
async function checkoutFlow(userId, cart) {
  // The feature flag decides whether this user gets the new checkout
  if (await flags.enabled('new-checkout-v2', userId)) {
    return newCheckoutProcess(cart); // on failure, turn the flag off to fall back to the legacy path
  }
  return legacyCheckoutProcess(cart);
}
```
|
||||
|
||||
## Micro-Recovery vs. 其他恢复模式
|
||||
|
||||
| 模式 | 恢复粒度 | RTO | RPO | 复杂度 |
|
||||
|------|----------|-----|-----|--------|
|
||||
| 传统灾备 | 系统/数据中心 | 小时级 | 取决于备份 | 高 |
|
||||
| CI/CD 回滚 | 部署版本 | 分钟级 | 可能丢失 | 中 |
|
||||
| Kill Switch | 组件/功能 | 秒级 | 近零 | 低 |
|
||||
| **Micro-Recovery** | **单个功能** | **秒级** | **近零** | **低** |
|
||||
|
||||
## 实践建议
|
||||
|
||||
1. **功能分级**:不是所有功能都需要 Micro-Recovery 能力,关键路径必须有
|
||||
2. **Fallback 路径**:每个 Feature Flag 需要有明确的降级路径(Fallback)
|
||||
3. **可观测性**:每次切换需要有清晰的日志和监控
|
||||
4. **文档化**:哪些功能支持 Micro-Recovery?团队需要知道
|
||||
|
||||
## Related Concepts
|
||||
|
||||
- [[Feature Flag]] — Micro-Recovery 的技术基础
|
||||
- [[Kill Switch]] — Micro-Recovery 的紧急实现方式
|
||||
- [[RTO]] — Micro-Recovery 将 RTO 从小时降至秒级
|
||||
- [[RPO]] — Micro-Recovery 保护 RPO(不触碰数据层)
|
||||
- [[Progressive Rollout]] — Micro-Recovery 与渐进式放量结合实现精细化风险控制
|
||||
- [[Disaster Recovery]] — Micro-Recovery 是现代灾备的重要组成部分
|
||||
|
||||
## Sources
|
||||
|
||||
- [[sources/rto-vs-rpo-key-differences-for-modern-disaster-recovery.md]]
|
||||
48
wiki/concepts/Multi-Account-Deployment.md
Normal file
@@ -0,0 +1,48 @@
|
||||
---
|
||||
title: Multi-Account Deployment
|
||||
type: concept
|
||||
tags: [AWS, CloudOps, Infrastructure-as-Code, DevOps]
|
||||
date: 2025-10-24
|
||||
---
|
||||
|
||||
## Definition
|
||||
Multi-Account Deployment(多账户部署)是指使用 AWS CloudFormation StackSets 或类似工具,跨多个 AWS 账户和区域自动化部署和管理基础设施的实践。AWS 推荐使用多账户策略来改善安全隔离、成本管理和运营治理。
|
||||
|
||||
## Core Properties
|
||||
- **自动化**:通过 StackSets 自动向目标账户推送配置
|
||||
- **一致性**:确保所有账户的配置保持一致
|
||||
- **可扩展性**:新增账户自动纳入部署范围(auto-deployment)
|
||||
- **治理**:通过 AWS Organizations OU 层次结构管理账户分组
|
||||
|
||||
## AWS Recommended Account Structure
|
||||
- **Management Account**:管理账户,承载中心监控、billing、 Organizations 管理
|
||||
- **Log Archive Account**:日志归档账户
|
||||
- **Security Tooling Account**:安全工具账户
|
||||
- **Workload Accounts**:工作负载账户,部署实际业务资源
|
||||
|
||||
## Key Mechanisms
|
||||
- **AWS CloudFormation StackSets**:原生跨账户/跨区域部署服务
|
||||
- **AWS Organizations**:账户组织和管理
|
||||
- **Service Control Policies (SCPs)**:定义 OU 级别的权限边界
|
||||
- **Trusted Access**:启用 StackSets 在成员账户中执行操作
|
||||
- **Auto-Deployment**:新增账户自动部署预设 StackSet
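
The mechanisms above can be combined roughly as follows. This is a sketch, assuming trusted access between StackSets and AWS Organizations is already enabled; `deploy_baseline` and its parameters are hypothetical names, and `cfn` is expected to be a boto3 CloudFormation client created in the management or delegated administrator account.

```python
def deploy_baseline(cfn, stack_set_name, template_body, ou_ids, regions):
    """Create a service-managed StackSet and roll it out to every account in the given OUs.

    `cfn` is a boto3 CloudFormation client, e.g. boto3.client("cloudformation").
    """
    cfn.create_stack_set(
        StackSetName=stack_set_name,
        TemplateBody=template_body,
        PermissionModel="SERVICE_MANAGED",  # relies on trusted access with AWS Organizations
        AutoDeployment={
            "Enabled": True,                        # new accounts in the OUs get the stack automatically
            "RetainStacksOnAccountRemoval": False,  # stacks are deleted when an account leaves the OU
        },
    )
    response = cfn.create_stack_instances(
        StackSetName=stack_set_name,
        DeploymentTargets={"OrganizationalUnitIds": ou_ids},
        Regions=regions,
    )
    # The operation ID can be polled to track progress across accounts.
    return response["OperationId"]
```

Passing the client in as an argument keeps the function easy to exercise with a stub before it is pointed at a real organization.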
|
||||
|
||||
## Related Concepts
|
||||
- [[AWS CloudFormation StackSets]]:多账户部署的核心工具
|
||||
- [[AWS Organizations]]:账户管理和分组
|
||||
- [[StackSets Deployment Visibility]]:多账户部署的可观测性挑战和解决方案
|
||||
- [[Cross-Account Monitoring]]:多账户部署需要跨账户监控支撑
|
||||
- [[Centralized Logging]]:多账户场景是集中日志的主要驱动因素
|
||||
- [[Landing Zone Architecture]]:AWS Landing Zone 架构定义了多账户最佳实践
|
||||
- [[Infrastructure as Code]]:多账户部署是 IaC 的高级应用场景
|
||||
|
||||
## Operational Challenges
|
||||
1. **监控盲区**:跨50+账户部署故障时,逐账户排查效率低下
|
||||
2. **配置漂移**:手动配置导致账户间配置不一致
|
||||
3. **权限管理**:跨账户 IAM 权限配置的复杂性
|
||||
4. **成本追踪**:多账户成本归因和预算控制
|
||||
|
||||
## Solution Patterns
|
||||
- [[Centralized Logging]]:集中存储所有账户的 CloudFormation 事件
|
||||
- [[Cross-Account Monitoring]]:统一监控界面覆盖所有账户
|
||||
- [[StackSets Deployment Visibility]]:CloudWatch Logs Insights 跨账户查询
|
||||
@@ -1,163 +1,131 @@
|
||||
---
|
||||
title: Multi-Cloud Strategy
|
||||
source: https://www.bacancytechnology.com/blog/cloud-maturity-model
|
||||
tags: [Cloud, Multi-Cloud, Strategy, Hybrid-Cloud, Cloud-Adoption]
|
||||
---
|
||||
|
||||
# Multi-Cloud Strategy
|
||||
|
||||
## Overview
|
||||
> **Multi-Cloud Strategy** — 使用多个云服务提供商(公有云、私有云或两者结合)来托管工作负载和服务,以避免供应商锁定、增强弹性和优化成本。
|
||||
|
||||
**Multi-Cloud Strategy** refers to an organization's use of multiple cloud computing services from different providers — combining public, private, and hybrid cloud environments to optimize flexibility, performance, and cost-efficiency.
|
||||
## Definition
|
||||
|
||||
## Relationship with Cloud Maturity Model
|
||||
多云策略(Multi-Cloud Strategy)是指组织同时使用两个或多个云服务提供商的服务。这种策略可以结合不同云服务商的优势,实现:
|
||||
|
||||
The Cloud Maturity Model addresses multi-cloud at multiple levels:
|
||||
- **避免供应商锁定** — 不依赖单一提供商
|
||||
- **增强弹性和可用性** — 跨云冗余
|
||||
- **优化性能和成本** — 选择最适合每个工作负载的云
|
||||
- **满足合规和数据主权要求** — 地理位置控制
|
||||
|
||||
### Level 2 (Repeatable, Opportunistic)
|
||||
Organizations at this level consider diverse deployment models (private, hybrid, multi-cloud) to address:
|
||||
- Security and compliance worries
|
||||
- Need for flexibility in workload placement
|
||||
## Multi-Cloud vs Hybrid Cloud
|
||||
|
||||
### Level 4 (Measured)
|
||||
Companies at Level 4 adeptly use various cloud platforms and flexibly move workloads between them — this represents the **optimized state** of multi-cloud capability.
|
||||
| 维度 | Multi-Cloud | Hybrid Cloud |
|
||||
|------|------------|-------------|
|
||||
| **定义** | 使用多个公有/私有云 | 公有云 + 私有/本地 |
|
||||
| **连接** | 可选互联 | 强互联 |
|
||||
| **用例** | 避免锁定、优化、成本 | 核心在本地、弹性在云 |
|
||||
| **复杂性** | 中-高 | 中 |
|
||||
| **成本** | 取决于设计 | 可能更优化 |
|
||||
|
||||
### Level 5 (Optimized)
|
||||
The highest maturity level describes an organization that operates with an open and interoperable cloud environment across multiple providers.
|
||||
## 8 Business Benefits
|
||||
|
||||
## Key Benefits of Multi-Cloud
|
||||
### 1. Avoiding Vendor Lock-In
|
||||
|
||||
1. **Avoid Vendor Lock-in** — Freedom to choose best-of-breed services from each provider
|
||||
2. **Optimize Costs** — Select most cost-effective provider for each workload
|
||||
3. **Improve Resilience** — Redundancy across providers reduces single-point-of-failure risk
|
||||
4. **Compliance Flexibility** — Match data residency requirements with appropriate provider/region
|
||||
5. **Leverage Best Services** — Use unique capabilities from each cloud provider
|
||||
- 保留谈判筹码
|
||||
- 避免单一供应商涨价风险
|
||||
- 灵活迁移工作负载
|
||||
|
||||
## Multi-Cloud vs Related Concepts
|
||||
### 2. Enhanced Resilience
|
||||
|
||||
| Concept | Description |
|
||||
|---------|-------------|
|
||||
| **Multi-Cloud** | Using multiple cloud services from different providers (can be all public, all private, or mix) |
|
||||
| **Hybrid Cloud** | Combining private/public clouds with orchestration between them |
|
||||
| **Poly-Cloud** | Strategic selection of best services from multiple providers |
|
||||
| **Cross-Cloud** | Moving workloads seamlessly across cloud providers |
|
||||
- 跨云冗余消除单点故障
|
||||
- 99.99%+ 可用性目标
|
||||
- 灾难恢复能力增强
|
||||
|
||||
## Types of Cloud Maturity Models for Multi-Cloud
|
||||
### 3. Improved Security
|
||||
|
||||
The Cloud Maturity Model document references:
|
||||
- 隔离敏感工作负载
|
||||
- 利用各云最佳安全功能
|
||||
- 满足合规要求
|
||||
|
||||
| Model | Focus |
|
||||
|-------|-------|
|
||||
| **Public Cloud Maturity Model** | Leveraging external cloud services for scalability and cost-efficiency |
|
||||
| **Private Cloud Maturity Model** | Internal infrastructure for control and compliance |
|
||||
| **Hybrid Cloud Maturity Model** | Integrating public and private clouds for flexibility |
|
||||
### 4. Unlimited Scalability
|
||||
|
||||
## Challenges in Multi-Cloud Adoption
|
||||
- 按需扩展计算资源
|
||||
- 全球分布式部署
|
||||
- 应对流量高峰
|
||||
|
||||
1. **Complexity Management** — Managing multiple platforms, tools, and interfaces
|
||||
2. **Data Consistency** — Ensuring data synchronization across providers
|
||||
3. **Security Coordination** — Unified security policies across diverse environments
|
||||
4. **Cost Visibility** — Tracking and optimizing spending across providers
|
||||
5. **Skills Requirements** — Teams need expertise across multiple cloud platforms
|
||||
6. **Interoperability** — Ensuring seamless integration between providers
|
||||
### 5. Cost Optimization
|
||||
|
||||
## Best Practices for Multi-Cloud
|
||||
- 选择最具成本效益的云服务
|
||||
- 持续成本监控和优化
|
||||
- 避免过度配置
|
||||
|
||||
1. **Establish Clear Governance** — Define roles, responsibilities, and decision-making across providers
|
||||
2. **Standardize where Possible** — Use common APIs, formats, and management tools
|
||||
3. **Implement FinOps** — Cloud financial management across all providers
|
||||
4. **Develop Cross-Cloud Skills** — Train teams on multiple platforms
|
||||
5. **Use Cloud-Agnostic Tools** — Employ tools that work across providers (Kubernetes, Terraform, etc.)
|
||||
### 6. Accelerated Innovation
|
||||
|
||||
## Related Concepts
|
||||
- 访问最新云服务
|
||||
- 快速试验和原型
|
||||
- 缩短上市时间
|
||||
|
||||
- [[Cloud-Maturity-Model]]
|
||||
- [[Cloud-Adoption-Strategy]]
|
||||
- [[Cloud-Native]]
|
||||
- [[FinOps]]
|
||||
- [[Hybrid-Cloud]]
|
||||
### 7. Compliance Fulfillment
|
||||
|
||||
- 数据主权控制
|
||||
- 满足地区合规要求
|
||||
- 审计和报告能力
|
||||
|
||||
### 8. Performance Optimization
|
||||
|
||||
- 选择最近/最快的云区域
|
||||
- 针对工作负载优化
|
||||
- 降低延迟
|
||||
|
||||
## ROI Maximization Framework
|
||||
|
||||
Based on [[sources/how-can-a-multi-cloud-strategy-transform-your-business-roi]]:
|
||||
### 4 Paths to ROI
|
||||
|
||||
### Quantified Benefits
|
||||
- **30%** reduction in operations costs after optimizing resources and negotiating favorable prices (Forrester)
|
||||
- **78%** of businesses have workloads deployed in more than three public clouds for better agility and cost savings
|
||||
- **86%** of companies intend to adopt multi-cloud approach by end of 2024
|
||||
1. **Cost Reduction** — 30% 成本降低
|
||||
2. **Resource Optimization** — 资源利用率提升
|
||||
3. **Efficiency Gains** — 运营效率提升
|
||||
4. **Elastic Scaling** — 弹性扩展节省
|
||||
|
||||
### ROI Maximization Paths
|
||||
### Industry Adoption
|
||||
|
||||
1. **Cost Reduction**
|
||||
- Avoid high single-cloud pricing structures with one-size-fits-all models
|
||||
- Drive hard bargains for better rates by leveraging multi-vendor competition
|
||||
- Prevent paying for unnecessary resources through cross-cloud optimization
|
||||
|
||||
2. **Resource Optimization**
|
||||
- Allocate workloads to best-suited provider per task (e.g., Google Cloud for ML, AWS/Azure for general infra)
|
||||
|
||||
3. **Efficiency Gains**
|
||||
- Create tailored cloud architecture for specific needs
|
||||
- Reduce downtime, improve performance
|
||||
- Faster deployment times, better availability
|
||||
|
||||
4. **Flexibility in Scaling**
|
||||
- Dynamically allocate resources based on demand
|
||||
- Expand on one provider during spikes without capacity limits on all providers
|
||||
- Avoid overpaying for unused capacity
|
||||
|
||||
5. **Better Risk Management**
|
||||
- Eliminate single-provider dependency
|
||||
- Other providers step in when one goes down
|
||||
- 78% 的组织已采用多云策略
|
||||
- 平均使用 2-5 个云服务商
|
||||
- 混合云和多云是主流趋势
|
||||
|
||||
## Implementation Roadmap
|
||||
|
||||
Based on [[sources/how-can-a-multi-cloud-strategy-transform-your-business-roi]], a 4-step implementation approach:
|
||||
```
|
||||
Phase 1: Assessment & Strategy
|
||||
├── 云就绪度评估
|
||||
├── 工作负载分析
|
||||
└── 供应商选择
|
||||
|
||||
### Step 1: Assess Your Needs
|
||||
- Identify goals: resiliency, cost optimization, or scale
|
||||
- Budget analysis: initial and ongoing costs
|
||||
- Resource requirements assessment
|
||||
Phase 2: Foundation
|
||||
├── 网络连接设计
|
||||
├── 安全架构
|
||||
└── 治理框架
|
||||
|
||||
### Step 2: Choose Right Providers
|
||||
- Align services with needs (AWS for infra, Google Cloud for analytics, Azure for AI)
|
||||
- Evaluate features, security, compliance, cost, performance
|
||||
Phase 3: Migration
|
||||
├── 低风险工作负载优先
|
||||
├── 渐进式迁移
|
||||
└── 验证和测试
|
||||
|
||||
### Step 3: Integrate and Manage
|
||||
- Adopt multi-cloud management tools (Kubernetes, Terraform)
|
||||
- Ensure data interoperability, avoid data silos
|
||||
Phase 4: Optimization
|
||||
├── 成本监控
|
||||
├── 性能调优
|
||||
└── 自动化运维
|
||||
```
|
||||
|
||||
### Step 4: Monitor and Optimize
|
||||
- Track resource usage (CloudHealth, Datadog)
|
||||
- Implement cost-saving measures through workload optimization
|
||||
## Key Risks and Mitigation
|
||||
|
||||
## Industry Use Cases
|
||||
| 风险 | 缓解措施 |
|
||||
|------|---------|
|
||||
| 复杂性增加 | 统一管理平台 |
|
||||
| 数据一致性 | 跨云数据同步策略 |
|
||||
| 安全挑战 | 统一安全策略 |
|
||||
| 成本超支 | FinOps 实践 |
|
||||
| 技能缺口 | 培训和认证 |
|
||||
|
||||
### E-Commerce
|
||||
- High availability during peak seasons (Black Friday, Cyber Monday)
|
||||
- Scale resources across providers for traffic spikes
|
||||
- Fast customer load times
|
||||
## See Also
|
||||
|
||||
### Healthcare
|
||||
- HIPAA-compliant patient data storage
|
||||
- Distribute data across compliant cloud platforms
|
||||
- Reduce costs from single-cloud dependency
|
||||
|
||||
### Finance
|
||||
- Stringent regulatory requirements compliance
|
||||
- Use best security features of each provider
|
||||
- Reduce risk and vendor lock-in for better SLAs and ROI
|
||||
|
||||
## Challenges and Proven Solutions
|
||||
|
||||
| Challenge | Solution |
|
||||
|-----------|---------|
|
||||
| Integration Complexity | Kubernetes, Terraform, cloud APIs |
|
||||
| Security Risks | Centralized IAM, end-to-end encryption |
|
||||
| Lack of Expertise | Upskilling, hiring experts, managed providers |
|
||||
|
||||
## Sources
|
||||
|
||||
- [[sources/cloud-maturity-model-a-detailed-guide-for-cloud-adoption.md]]
|
||||
- [[sources/public-vs-private-vs-hybrid-cloud-differences-explained.md]]
|
||||
- [[sources/how-can-a-multi-cloud-strategy-transform-your-business-roi.md]]
|
||||
- [[Cloud Adoption Strategy]] — 云采用策略
|
||||
- [[Cloud Maturity Model]] — 云成熟度模型
|
||||
- [[Cloud Governance]] — 云治理
|
||||
- [[Cloud Cost Optimization]] — 云成本优化
|
||||
- [[Cloud-Native]] — 云原生
|
||||
- [[Hybrid Cloud]] — 混合云
|
||||
- [[FinOps]] — 云财务管理
|
||||
|
||||
75
wiki/concepts/Multi-Tenancy.md
Normal file
@@ -0,0 +1,75 @@
|
||||
---
|
||||
title: Multi-Tenancy
|
||||
type: concept
|
||||
tags: [cloud-computing, architecture, efficiency]
|
||||
date: 2026-04-19
|
||||
---
|
||||
|
||||
# Multi-Tenancy
|
||||
|
||||
**Multi-Tenancy(多租户架构)** 是一种软件架构模式,多个租户(用户或组织)共享同一底层基础设施、应用实例或资源池,同时通过逻辑隔离确保各租户数据的私密性。
|
||||
|
||||
## Definition
|
||||
|
||||
多租户架构中,单个应用实例服务多个客户(租户),每个租户的数据和配置相互隔离,但共享计算、存储和网络资源。这是公有云实现规模经济效益的核心技术基础。
|
||||
|
||||
## How It Works
|
||||
|
||||
```
┌─────────────────────────────────────────────┐
│        Cloud Provider Infrastructure        │
│  ┌─────────────────────────────────────┐    │
│  │     Shared Application Instance     │    │
│  │  ┌──────┐ ┌──────┐ ┌──────┐ ┌────┐  │    │
│  │  │Tenant│ │Tenant│ │Tenant│ │Tn  │  │    │
│  │  │  A   │ │  B   │ │  C   │ │..  │  │    │
│  │  └──────┘ └──────┘ └──────┘ └────┘  │    │
│  └─────────────────────────────────────┘    │
└─────────────────────────────────────────────┘
```
|
||||
|
||||
## Isolation Levels
|
||||
|
||||
| 层级 | 说明 | 示例 |
|
||||
|------|------|------|
|
||||
| **数据隔离** | 租户数据逻辑/物理分离 | 独立数据库、独立 Schema、行级隔离 |
|
||||
| **计算隔离** | 租户工作负载资源分离 | 独立容器、VM、命名空间 |
|
||||
| **网络隔离** | 网络流量隔离 | VPC、虚拟网络、防火墙规则 |
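
Row-level data isolation, the lightest of the options in the table, can be illustrated with the standard-library `sqlite3` module. This is a minimal sketch: the `invoices` table and the helper names are made up for the example, and a production system would typically add database-enforced controls such as row-level security on top of the application-side filter.

```python
import sqlite3


def open_store():
    """Create an in-memory table shared by all tenants (shared-schema model)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE invoices (tenant_id TEXT NOT NULL, amount REAL NOT NULL)")
    return conn


def add_invoice(conn, tenant_id, amount):
    conn.execute("INSERT INTO invoices (tenant_id, amount) VALUES (?, ?)", (tenant_id, amount))


def invoices_for(conn, tenant_id):
    """Every read is scoped by tenant_id; callers never query the table directly."""
    rows = conn.execute(
        "SELECT amount FROM invoices WHERE tenant_id = ? ORDER BY rowid", (tenant_id,)
    )
    return [amount for (amount,) in rows]
```

All tenants share one table, so isolation depends entirely on every query carrying the tenant filter. That dependency is why separate schemas or databases are preferred when stronger guarantees are required.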
|
||||
|
||||
## Multi-Tenancy vs Single-Tenancy
|
||||
|
||||
| 维度 | Multi-Tenancy | Single-Tenancy |
|
||||
|------|--------------|----------------|
|
||||
| **成本效率** | 高(资源共享) | 低(独占资源) |
|
||||
| **隔离性** | 逻辑隔离(依赖虚拟化) | 物理隔离(独立基础设施) |
|
||||
| **运维复杂度** | 高(需精细权限管理) | 低(独立部署) |
|
||||
| **安全顾虑** | 更高(侧信道风险) | 更低(物理隔离) |
|
||||
| **适用场景** | 公有云、SaaS | 私有云、高安全合规场景 |
|
||||
|
||||
## In Cloud Context
|
||||
|
||||
- **公有云**:默认多租户模式——多个组织共享同一批服务器
|
||||
- **私有云**:通常单租户,但也支持多租户配置(企业内部)
|
||||
- **混合云**:跨租户数据流动带来额外的安全风险和管理复杂性
|
||||
|
||||
## Security Implications
|
||||
|
||||
多租户环境下的安全挑战:
|
||||
- **侧信道攻击**:恶意租户通过共享资源窃取信息
|
||||
- **数据泄露风险**:隔离机制失效导致跨租户数据访问
|
||||
- **资源竞争**:某一租户的高负载影响其他租户性能
|
||||
- **合规冲突**:不同租户的合规要求可能相互矛盾
|
||||
|
||||
缓解措施:微分段(Micro-segmentation)、零信任架构、加密隔离、Kubernetes 命名空间隔离。
|
||||
|
||||
## Related Concepts
|
||||
|
||||
- [[Public Cloud]] — 多租户的主要载体
|
||||
- [[Private Cloud]] — 可选择多租户或单租户
|
||||
- [[Hybrid Cloud]] — 跨租户数据流动带来复杂性
|
||||
- [[High-Availability]] — 多租户架构需考虑高可用设计
|
||||
- [[Cloud-Security]] — 多租户隔离是云安全的核心议题
|
||||
|
||||
## Sources
|
||||
|
||||
- [[Public vs Private vs Hybrid Cloud Differences Explained|sources/public-vs-private-vs-hybrid-cloud-differences-explained]]
|
||||
62
wiki/concepts/Multi-factor-Authentication.md
Normal file
@@ -0,0 +1,62 @@
|
||||
---
|
||||
title: "Multi-factor Authentication (MFA)"
|
||||
type: concept
|
||||
tags: [cloud-computing, security, identity]
|
||||
date: 2025-03-02
|
||||
---
|
||||
|
||||
# Multi-factor Authentication (MFA)
|
||||
|
||||
**MFA**(多因素认证)是云安全的基础机制,通过验证两个或多个独立身份凭证来确认用户身份,防止未经授权的访问。
|
||||
|
||||
## Definition
|
||||
|
||||
多因素认证要求用户提供两种或以上的身份验证因素:
|
||||
1. **知识因素**(Something you know):密码、PIN
|
||||
2. **持有因素**(Something you have):手机、硬件令牌
|
||||
3. **固有因素**(Something you are):指纹、面部识别
|
||||
|
||||
## MFA Methods
|
||||
|
||||
| Method | Type | Security Level |
|
||||
|--------|------|---------------|
|
||||
| **SMS OTP** | 持有因素 | 中 |
|
||||
| **TOTP** (Google Authenticator, Authy) | 持有因素 | 高 |
|
||||
| **Hardware Token** (YubiKey) | 持有因素 | 极高 |
|
||||
| **Biometrics** | 固有因素 | 高 |
|
||||
| **Push Notification** | 持有因素 | 高 |
|
||||
| **Adaptive/ Risk-based MFA** | 组合 | 极高 |
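
To make the TOTP row concrete, the sketch below implements the algorithm that authenticator apps use (RFC 6238, built on HOTP from RFC 4226) with only the standard library. It assumes the common defaults of HMAC-SHA-1, a 30-second step, and 6 digits. It is for illustration and is not a substitute for a vetted authentication library.

```python
import hashlib
import hmac
import struct
import time


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """HMAC-based one-time password (RFC 4226) using HMAC-SHA-1."""
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F  # dynamic truncation: the low nibble picks the offset
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def totp(secret: bytes, at: float = None, step: int = 30, digits: int = 6) -> str:
    """Time-based one-time password (RFC 6238): HOTP over the current time window."""
    now = time.time() if at is None else at
    return hotp(secret, int(now // step), digits)
```

The server and the device share `secret` and derive the same code from the clock, which is why TOTP counts as a possession factor: the code proves access to the device that holds the secret.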
|
||||
|
||||
## Cloud Provider Support
|
||||
|
||||
| Provider | MFA Support |
|----------|------------|
| **AWS** | MFA via IAM: passkeys and FIDO security keys, virtual authenticator apps (TOTP), hardware TOTP tokens; SMS MFA is no longer offered for IAM users |
| **Azure** | Azure AD (now Microsoft Entra ID) MFA, Conditional Access, passwordless (FIDO2) |
| **Google Cloud** | 2-Step Verification, security keys, Google prompt |
|
||||
|
||||
## Cloud Myths Context
|
||||
|
||||
MFA 是反驳"云不安全"误解的核心机制之一:
|
||||
- 云平台强制或推荐 MFA,显著降低账户被盗风险
|
||||
- 云 MFA 实现比大多数本地系统更先进(自适应、条件访问)
|
||||
- 云服务商的 MFA 通常免费或低成本提供
|
||||
|
||||
## Best Practices
|
||||
|
||||
- **强制 MFA**:对所有用户强制启用 MFA
|
||||
- **优先无密码**:FIDO2/WebAuthn 优于传统 OTP
|
||||
- **条件访问**:高风险操作触发额外验证
|
||||
- **保护特权账户**:Admin 账户必须使用硬件令牌
|
||||
- **账户恢复**:安全的 MFA 恢复机制
|
||||
|
||||
## Related Concepts
|
||||
|
||||
- [[cloud-security]] — 云安全
|
||||
- [[Identity-and-Access-Management]] — 身份与访问管理
|
||||
- [[Zero-Trust]] — 零信任
|
||||
- [[cloud-computing]] — 云计算
|
||||
|
||||
## Sources
|
||||
|
||||
- [[The Myths and Misconceptions About Cloud Computing (LinkedIn)|sources/the-myths-and-misconceptions-about-cloud-computing-linkedin]]
|
||||
46
wiki/concepts/NFS网络备份.md
Normal file
@@ -0,0 +1,46 @@
|
||||
---
|
||||
title: "NFS网络备份"
|
||||
tags: [backup, nfs, network-storage, nas]
|
||||
date: 2026-04-28
|
||||
---
|
||||
|
||||
# NFS网络备份
|
||||
|
||||
## Definition
|
||||
NFS(Network File System)网络备份是指通过 NFS 协议将备份数据存储到网络存储设备(如 NAS)的方案。Clonezilla 通过 NFS 将磁盘镜像文件传输到 Synology NAS 等网络存储设备,实现跨设备的集中式备份存储。
|
||||
|
||||
## Why NFS for Clonezilla?
|
||||
| 方案 | 优点 | 缺点 |
|
||||
|------|------|------|
|
||||
| NFS | Linux 原生支持,稳定可靠 | 需要配置 NFS 服务端 |
|
||||
| SMB/CIFS | Windows 兼容性好 | 速度稍慢 |
|
||||
| 外置硬盘 | 简单直接 | 需手动携带,不够自动化 |
|
||||
|
||||
> Clonezilla 官方推荐 NFS 作为首选备份存储方案。
|
||||
|
||||
## Workflow
|
||||
```
|
||||
源机器 (Clonezilla Live)
|
||||
→ 选择 nfs_server
|
||||
→ DHCP 自动获取 IP(或手动配置静态 IP)
|
||||
→ 输入 NAS IP(如 192.168.3.17)
|
||||
→ 输入 NFS 共享路径(如 /volume2/backups)
|
||||
→ 挂载成功 → 保存镜像到 NAS
|
||||
```
|
||||
|
||||
## Prerequisites
|
||||
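1. Enable the NFS service on the NAS
2. Configure the NFS export (`/etc/exports`)
3. Make sure the source machine can reach the NAS over the network
4. Allow NFS through the firewall (2049/TCP; NFSv3 additionally needs rpcbind on port 111 and the mountd port)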
|
||||
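
Before booting Clonezilla, the network and firewall prerequisites can be checked from any machine on the same network. The helper below is a small sketch with a hypothetical name (`tcp_port_open`). It only confirms that the TCP port accepts connections, not that the export is configured correctly.

```python
import socket


def tcp_port_open(host: str, port: int = 2049, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

With the example address from the workflow above, `tcp_port_open("192.168.3.17")` should return `True` once the NFS service is enabled and the firewall rule is in place.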
|
||||
## Related Concepts
|
||||
- [[全盘镜像备份]] — NFS 存储的对象类型
|
||||
- [[永久挂载]] — NFS 备份前需先完成永久挂载配置
|
||||
- [[增量备份]] — 互补方案(NFS 镜像 vs rsync 增量)
|
||||
- [[裸机恢复]] — 从 NFS 镜像还原系统
|
||||
|
||||
## Related Sources
|
||||
- [[clonezilla对ubuntu-server进行全盘镜像备份]]
|
||||
- [[ubuntu服务器通过rsync实现日常增量备份]]
|
||||
- [[如何在ubuntu-server上通过nfs挂载synology-nas上的共享文件夹]]
|
||||
37
wiki/concepts/NVMe硬盘分区.md
Normal file
@@ -0,0 +1,37 @@
|
||||
---
|
||||
title: "NVMe硬盘分区"
|
||||
type: concept
|
||||
tags: [nvme, linux, partition, ubuntu]
|
||||
date: 2026-04-14
|
||||
aliases: [NVMe, NVMe SSD, PCIe SSD]
|
||||
---
|
||||
|
||||
# NVMe硬盘分区
|
||||
|
||||
## Definition
|
||||
针对 NVMe PCIe 固态硬盘的分区策略与对齐优化。Ubuntu 24.04 自动识别 ZBook 等设备上的 NVMe 硬盘并进行对齐优化,但仍建议手动分区时遵循标准规范。
|
||||
|
||||
## HP ZBook NVMe 分区方案
|
||||
| 分区 | 挂载点 | 大小 | 文件系统 | 说明 |
|
||||
|------|--------|------|----------|------|
|
||||
| EFI System | /boot/efi | 512MB - 1GB | FAT32 | 存储 UEFI 引导程序(必须) |
|
||||
| Root | / | 100GB - 200GB | ext4 | 系统文件、Docker、应用程序 |
|
||||
| Home | /home | 剩余空间 | ext4 | 独立分区,重装可保留数据 |
|
||||
| Swap | swap | 8GB - 32GB | swap | 根据内存大小,建议为内存 1 倍 |
|
||||
|
||||
## Key Alignment Rules
|
||||
- 分区起始扇区:默认 2048(与 NVMe 块大小对齐)
|
||||
- Ubuntu 24.04 自动应用最佳对齐
|
||||
- 分区方案:**GPT**(NVMe 2TB+ 支持必须)
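
The alignment rule is simple arithmetic: a start sector of 2048 with 512-byte logical sectors puts the partition at exactly 1 MiB, which is a multiple of the 4 KiB block sizes commonly used by SSDs. The sketch below checks that for any start sector; the function names are illustrative.

```python
MIB = 1024 * 1024


def start_offset_bytes(start_sector: int, sector_size: int = 512) -> int:
    """Byte offset of a partition that starts at `start_sector`."""
    return start_sector * sector_size


def is_aligned(start_sector: int, sector_size: int = 512, boundary: int = MIB) -> bool:
    """True if the partition start falls on a multiple of `boundary` bytes."""
    return start_offset_bytes(start_sector, sector_size) % boundary == 0
```

The start sector of an existing partition can be read with `sudo fdisk -l` or `sudo parted /dev/nvme0n1 unit s print` and passed to `is_aligned`. The legacy MBR default of sector 63, for example, fails the check.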
|
||||
|
||||
## Why ext4 for NVMe
|
||||
虽然 Ubuntu 24.04 支持 ZFS/Btrfs 高级文件系统,但对 Docker 环境和 HP ZBook 来说,ext4 的兼容性、稳定性和性能预测性最优。
|
||||
|
||||
## Related
|
||||
- [[HP ZBook]] — 目标设备(NVMe 硬盘)
|
||||
- [[GPT分区表]] — 分区表方案
|
||||
- [[Ubuntu 24.04]] — 操作系统
|
||||
- [[NVMe硬盘分区]] ← 针对 [[HP ZBook]] ← [[GPT分区表]]
|
||||
|
||||
## Sources
|
||||
- [[安装ubuntu-24-04-2在hp-zbook工作站笔记本上]]
|
||||
106
wiki/concepts/OWASP-Top-Ten.md
Normal file
@@ -0,0 +1,106 @@
|
||||
# OWASP Top Ten
|
||||
|
||||
## Definition
|
||||
The OWASP Top Ten represents a broad consensus about the most critical security risks to web applications. It is a standard awareness document for developers and web application security.
|
||||
|
||||
## Aliases
|
||||
- OWASP Top 10
|
||||
- OWASP Top Ten Web Application Security Risks
|
||||
|
||||
## Purpose
|
||||
为开发者和安全团队提供最关键 web 应用安全风险的共识性列表,是安全编码和测试的基础标准。
|
||||
|
||||
## Current List (2021)
|
||||
|
||||
### A01:2021 – Broken Access Control
|
||||
访问控制失效,包括:
|
||||
- 越权访问
|
||||
- 绕过访问控制
|
||||
- 不安全的直接对象引用
|
||||
|
||||
### A02:2021 – Cryptographic Failures
|
||||
密码学失败,包括:
|
||||
- 敏感数据泄露
|
||||
- 弱加密算法
|
||||
- 不正确的密钥管理
|
||||
|
||||
### A03:2021 – Injection
|
||||
注入攻击,包括:
|
||||
- SQL 注入
|
||||
- NoSQL 注入
|
||||
- OS 命令注入
|
||||
- LDAP 注入
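
The difference between an injectable query and a safe one is easiest to see side by side. The sketch below uses the standard-library `sqlite3` module; the `users` table and function names are made up for the example. The same parameter-binding idea applies to other SQL drivers, although their placeholder syntax may differ.

```python
import sqlite3


def find_user_unsafe(conn, name):
    # Vulnerable: user input is concatenated into the SQL text.
    return conn.execute(f"SELECT id FROM users WHERE name = '{name}'").fetchall()


def find_user_safe(conn, name):
    # Parameterized: the driver passes `name` as data, never as SQL.
    return conn.execute("SELECT id FROM users WHERE name = ?", (name,)).fetchall()
```

With the input `x' OR '1'='1`, the unsafe version returns every row in the table, while the parameterized version looks for a user literally named that string and finds none.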
|
||||
|
||||
### A04:2021 – Insecure Design
|
||||
Insecure design, including:
- Missing or ineffective control design
- Business logic flaws
- Lack of threat modeling
|
||||
|
||||
### A05:2021 – Security Misconfiguration
|
||||
安全配置错误,包括:
|
||||
- 不安全的默认配置
|
||||
- 错误处理信息泄露
|
||||
- 云服务配置错误
|
||||
|
||||
### A06:2021 – Vulnerable and Outdated Components
|
||||
易受攻击和过时的组件,包括:
|
||||
- 使用有漏洞的库
|
||||
- 未更新依赖
|
||||
- 不支持组件
|
||||
|
||||
### A07:2021 – Identification and Authentication Failures
|
||||
身份识别和认证失败,包括:
|
||||
- 弱密码策略
|
||||
- 会话管理问题
|
||||
- 凭证泄露
|
||||
|
||||
### A08:2021 – Software and Data Integrity Failures
|
||||
软件和数据完整性失败,包括:
|
||||
- 不安全的 CI/CD
|
||||
- 依赖混淆攻击
|
||||
- 更新未签名
|
||||
|
||||
### A09:2021 – Security Logging and Monitoring Failures
|
||||
安全日志和监控失败,包括:
|
||||
- 未记录安全事件
|
||||
- 告警未处理
|
||||
- 响应延迟
|
||||
|
||||
### A10:2021 – Server-Side Request Forgery (SSRF)
|
||||
服务端请求伪造,包括:
|
||||
- 从应用获取内部资源
|
||||
- 绕过防火墙
|
||||
- 访问云元数据服务
|
||||
|
||||
## Integration with DevSecOps
|
||||
|
||||
### Development Phase
|
||||
- 安全编码培训以 OWASP Top Ten 为基础
|
||||
- SAST 工具检测相关漏洞
|
||||
- 代码审查关注常见问题
|
||||
|
||||
### Testing Phase
|
||||
- DAST 工具模拟 Top Ten 攻击
|
||||
- 渗透测试重点关注
|
||||
- 自动化测试集成
|
||||
|
||||
### Operations Phase
|
||||
- 监控 Top Ten 相关告警
|
||||
- 漏洞扫描覆盖
|
||||
- 补丁管理
|
||||
|
||||
## Related Concepts
|
||||
- [[DevSecOps]] — OWASP Top Ten 是安全编码和测试的基础
|
||||
- [[SAST]] — 检测代码中的 OWASP 问题
|
||||
- [[DAST]] — 动态检测 OWASP 漏洞
|
||||
- [[SCA]] — 检测易受攻击的组件
|
||||
- [[Shift-Left-Security]] — 早期发现 OWASP 问题
|
||||
|
||||
## Resources
|
||||
- OWASP 官网:https://owasp.org/www-project-top-ten/
|
||||
- OWASP Cheat Sheets
|
||||
- OWASP WebGoat(学习工具)
|
||||
|
||||
## Sources
|
||||
- [[what-is-devsecops-best-practices-benefits-and-tools]]
|
||||
55
wiki/concepts/Pay-as-you-go.md
Normal file
@@ -0,0 +1,55 @@
|
||||
---
|
||||
title: "Pay-as-you-go"
|
||||
type: concept
|
||||
tags: [cloud-computing, billing, economics]
|
||||
date: 2025-03-02
|
||||
---
|
||||
|
||||
# Pay-as-you-go
|
||||
|
||||
**Pay-as-you-go**(按使用量付费)是云计算的核心经济模型,用户仅为实际使用的资源付费,无需长期承诺或前期投入。
|
||||
|
||||
## Definition
|
||||
|
||||
按需计费模式,允许用户根据实际资源消耗(计算、存储、网络)进行付费,无需预留容量或签订长期合同。
|
||||
|
||||
## Key Characteristics
|
||||
|
||||
- **无前期成本**:无需购买硬件或签订长期合同
|
||||
- **弹性计费**:资源使用量决定费用,粒度可到秒/分钟
|
||||
- **按需扩缩**:流量高峰时扩容,低谷时缩减
|
||||
- **成本可见性**:实时监控和成本分摊
|
||||
|
||||
## Cost Optimization Strategies
|
||||
|
||||
| Strategy | Description | Savings |
|
||||
|----------|-------------|---------|
|
||||
| **Reserved Instances/Spot** | 预留或抢占式实例 | 30-70% vs On-demand |
|
||||
| **Auto Scaling** | 根据负载自动调整容量 | 避免过度配置 |
|
||||
| **Serverless** | 按函数执行计费 | 仅在函数运行时计费 |
|
||||
| **Savings Plans** | 承诺使用量换取折扣 | 20-40% 折扣 |
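
Whether a commitment discount actually saves money depends on utilization. The arithmetic below is a simplified sketch: the hourly rate is a placeholder, the discount is taken from the ranges in the table, and real pricing has more dimensions (upfront payment options, instance families, regions).

```python
HOURS_PER_MONTH = 730  # common billing approximation (8,760 hours / 12)


def monthly_cost_on_demand(hourly_rate: float, utilization: float) -> float:
    """Pay only for the hours actually used (utilization is a fraction, 0.0 to 1.0)."""
    return hourly_rate * HOURS_PER_MONTH * utilization


def monthly_cost_committed(hourly_rate: float, discount: float) -> float:
    """A commitment is billed for every hour, used or not, at a discounted rate."""
    return hourly_rate * (1 - discount) * HOURS_PER_MONTH


def break_even_utilization(discount: float) -> float:
    """Utilization above which the commitment is cheaper than on-demand."""
    return 1 - discount
```

With a 40% discount the break-even point is 60% utilization: a workload that runs half the time is cheaper on demand, while one that runs around the clock is cheaper under the commitment.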
|
||||
|
||||
## Cloud Myths Context
|
||||
|
||||
Pay-as-you-go 是反驳"云太贵"误解的核心证据:
|
||||
- **传统采购**:CapEx 模式,前期大量投入,利用率低时浪费
|
||||
- **云按需付费**:OpEx 模式,按需使用,成本与业务对齐
|
||||
- 消除本地硬件采购、维护和升级的隐性成本
|
||||
|
||||
## Challenges
|
||||
|
||||
- **Egress 费用**:数据流出云端时的高额流量费
|
||||
- **意外计费**:缺乏监控导致超预期费用
|
||||
- **长期成本**:稳定负载下预留实例的复杂性
|
||||
- **复杂性**:多种计费模式的优化需要专业知识
|
||||
|
||||
## Related Concepts
|
||||
|
||||
- [[Cost-Optimization]] — 云成本优化
|
||||
- [[cloud-computing]] — 云计算
|
||||
- [[Scalability]] — 可扩展性
|
||||
- [[FinOps]] — 云财务管理
|
||||
|
||||
## Sources
|
||||
|
||||
- [[The Myths and Misconceptions About Cloud Computing (LinkedIn)|sources/the-myths-and-misconceptions-about-cloud-computing-linkedin]]
|
||||
81
wiki/concepts/Penetration-Testing.md
Normal file
@@ -0,0 +1,81 @@
|
||||
# Penetration Testing
|
||||
|
||||
## Definition
|
||||
Penetration testing (pen testing) is an authorized simulated cyberattack on a computer system, performed to evaluate the security of the system.
|
||||
|
||||
## Aliases
|
||||
- Pen Testing
|
||||
- Ethical Hacking
|
||||
- Security Testing
|
||||
|
||||
## Concept
|
||||
渗透测试是授权的模拟网络攻击,用于评估系统的安全性。
|
||||
|
||||
## Types
|
||||
|
||||
### By Scope
|
||||
- **Black Box**:测试人员不了解目标内部结构
|
||||
- **White Box**:测试人员完全了解系统
|
||||
- **Grey Box**:部分了解系统信息
|
||||
|
||||
### By Target
|
||||
- Network Penetration Testing
|
||||
- Web Application Penetration Testing
|
||||
- Mobile Application Testing
|
||||
- Social Engineering
|
||||
- Physical Security Testing
|
||||
|
||||
## Methodology
|
||||
|
||||
### PTES (Penetration Testing Execution Standard)
|
||||
1. Pre-Engagement Interactions
|
||||
2. Intelligence Gathering
|
||||
3. Threat Modeling
|
||||
4. Vulnerability Analysis
|
||||
5. Exploitation
|
||||
6. Post-Exploitation
|
||||
7. Reporting
|
||||
|
||||
### OWASP Testing Guide
|
||||
- 信息收集
|
||||
- 配置和部署管理测试
|
||||
- 身份管理测试
|
||||
- 认证测试
|
||||
- 授权测试
|
||||
- 会话管理测试
|
||||
- 输入验证测试
|
||||
- 错误处理测试
|
||||
- 密码学测试
|
||||
- 业务逻辑测试
|
||||
- 客户端测试
|
||||
|
||||
## Tools
|
||||
- Metasploit — 渗透测试框架
|
||||
- Burp Suite — Web 应用测试
|
||||
- Nmap — 网络扫描
|
||||
- Wireshark — 网络协议分析
|
||||
- SQLmap — SQL 注入测试
|
||||
- Kali Linux — 渗透测试操作系统
|
||||
|
||||
## Integration with DevSecOps
|
||||
|
||||
### Continuous Pen Testing
|
||||
- 定期执行
|
||||
- 自动化工具集成
|
||||
- 关键时间点测试
|
||||
|
||||
### Red Team Operations
|
||||
- 模拟真实攻击
|
||||
- 全面评估防御能力
|
||||
- 团队对抗演练
|
||||
|
||||
## Related Concepts
|
||||
- [[DevSecOps]] — 渗透测试是安全评估的重要组成
|
||||
- [[Bug-Bounty]] — 持续外部安全测试
|
||||
- [[Vulnerability-Scanning]] — 自动化漏洞发现
|
||||
- [[DAST]] — 动态应用安全测试
|
||||
- [[Threat-Modeling]] — 威胁建模
|
||||
- [[Incident-Response]] — 事件响应
|
||||
|
||||
## Sources
|
||||
- [[what-is-devsecops-best-practices-benefits-and-tools]]
|
||||
75
wiki/concepts/Policy-as-Code.md
Normal file
@@ -0,0 +1,75 @@
|
||||
---
|
||||
title: "Policy as Code (PaC)"
|
||||
type: concept
|
||||
tags: [security, devops, compliance, automation]
|
||||
date: 2025-03-01
|
||||
---
|
||||
|
||||
## Definition
|
||||
|
||||
策略即代码(Policy as Code)是将安全、合规和运维策略编写为可执行代码的做法,通过自动化执行和持续验证替代人工审计和手动检查。
|
||||
|
||||
## Core Concept
|
||||
|
||||
```
|
||||
传统模式: PaC模式:
|
||||
───────── ─────────
|
||||
人工编写策略 → 文档化 → 人工检查 → 间歇性审计
|
||||
↓
|
||||
策略代码化 → 自动执行 → 持续验证 → 实时合规
|
||||
```
|
||||
|
||||
## Benefits
|
||||
|
||||
| 优势 | 描述 |
|
||||
|------|------|
|
||||
| **一致性** | 每次执行使用相同规则,消除人为错误 |
|
||||
| **可版本控制** | 策略变更通过Git跟踪和审查 |
|
||||
| **自动化** | CI/CD集成,持续验证 |
|
||||
| **可测试** | 策略可单元测试和集成测试 |
|
||||
| **审计友好** | 自动生成审计日志 |
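
The "testable" and "consistent" rows are the easiest to demonstrate: once a policy is a function, it can be unit-tested like any other code. The sketch below is plain Python rather than a dedicated policy language; `REQUIRED_TAGS` and `violations` are hypothetical names, and the tag set stands in for whatever an organization actually mandates.

```python
REQUIRED_TAGS = {"owner", "cost-center"}  # hypothetical organizational rule


def violations(resource: dict) -> list:
    """Return one message per missing required tag; an empty list means compliant."""
    tags = resource.get("tags") or {}
    name = resource.get("name", "<unnamed>")
    return [f"{name}: missing tag '{tag}'" for tag in sorted(REQUIRED_TAGS - set(tags))]
```

Run in CI against every planned resource, the same function gives the same verdict each time, and its change history lives in Git alongside the infrastructure code.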
|
||||
|
||||
## Implementation Patterns
|
||||
|
||||
### 1. OPA (Open Policy Agent)
|
||||
```rego
# Example OPA Rego policy: reject Pods that request the host IPC namespace
package kubernetes.admission

deny[msg] {
    input.request.kind.kind == "Pod"
    input.request.object.spec.hostIPC == true
    msg := "HostIPC is not allowed"
}
```
|
||||
|
||||
### 2. Terraform Sentinel
|
||||
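```hcl
# sentinel.hcl: registers the policy and sets its enforcement level.
# The rule itself (every resource must carry tags) is written in the
# Sentinel language in the referenced file; the file name is illustrative.
policy "require-tags" {
    source            = "./require-tags.sentinel"
    enforcement_level = "advisory"
}
```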
|
||||
|
||||
### 3. In ITSM Context
|
||||
|
||||
在[[ITSM]]中,PaC支撑[[Security-and-Compliance]]:
|
||||
|
||||
- **变更合规** — 自动验证变更符合安全策略
|
||||
- **配置基线** — 确保配置项符合基线
|
||||
- **访问控制** — 自动执行最小权限原则
|
||||
- **审计自动化** — 生成合规报告
|
||||
|
||||
## Related Concepts
|
||||
|
||||
- [[Zero-Trust-Architecture]] — ZTA依赖PaC实现自动化
|
||||
- [[Security-and-Compliance]] — PaC的核心应用场景
|
||||
- [[Infrastructure-as-Code]] — IaC与PaC的协同
|
||||
- [[DevSecOps]] — PaC在DevSecOps中的角色
|
||||
|
||||
## Sources
|
||||
|
||||
- [[understanding-complete-itsm]] — Policy-as-Code在ITSM中的应用
|
||||
69
wiki/concepts/Predictive-Maintenance.md
Normal file
@@ -0,0 +1,69 @@
|
||||
---
|
||||
title: "Predictive Maintenance"
|
||||
tags:
|
||||
- devops
|
||||
- reliability
|
||||
- ai
|
||||
- operations
|
||||
created: 2026-04-25
|
||||
---
|
||||
|
||||
# Predictive Maintenance
|
||||
|
||||
## Definition
|
||||
|
||||
Predictive Maintenance 是基于历史故障模式学习,**主动建议补丁或变更**以预防非计划停机的方法。Agentic AI 分析历史运维数据,预测潜在故障并提前采取预防措施。
|
||||
|
||||
## Mechanism
|
||||
|
||||
```
|
||||
Historical Data → Pattern Learning → Failure Prediction → Proactive Action
|
||||
↓
|
||||
运维日志、告警历史、变更记录、监控数据
|
||||
↓
|
||||
ML 模型识别故障前兆模式
|
||||
↓
|
||||
- 磁盘 I/O 逐渐下降 → 预测磁盘故障 → 建议迁移
|
||||
- 内存使用率周期性峰值 → 预测 OOM → 建议扩容
|
||||
- API 响应时间逐步增加 → 预测容量瓶颈 → 建议扩缩容
|
||||
```
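
The simplest version of the "pattern learning, then prediction" step is a trend line. The sketch below fits ordinary least squares to recent usage samples and estimates when the trend will cross a threshold. It is a deliberately naive stand-in for the ML models described above; the function names and the 90% threshold are assumptions for illustration.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def hours_until_threshold(hours, usage_pct, threshold=90.0):
    """Hours after the last sample until the fitted trend crosses `threshold`.

    Returns None when usage is flat or falling, and 0.0 when the trend is
    already at or above the threshold.
    """
    slope, intercept = linear_fit(hours, usage_pct)
    if slope <= 0:
        return None
    crossing = (threshold - intercept) / slope
    return max(0.0, crossing - hours[-1])
```

For samples that climb five points per hour from 50% to 70%, the trend reaches 90% four hours after the last sample, which is the window in which a proactive scale-up or migration would be proposed.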
|
||||
|
||||
## 与 Self-Healing Systems 的关系
|
||||
|
||||
| Dimension | Reactive (Self-Healing) | Predictive (Predictive Maintenance) |
|------|------------------------|-----------------------------------|
| Timing | Repair after a failure occurs | Prevent before a failure occurs |
| Goal | Reduce MTTR | Increase MTBF (Mean Time Between Failures) |
| Cost | Reactive spend | Proactive spend, high ROI |
| Maturity | Level 4 AIOps | Level 5 AIOps |
|
||||
|
||||
## 示例
|
||||
|
||||
> Agentic AI analyzes 6 months of Kubernetes pod restart logs and identifies:
|
||||
> - Pods restart every 48-72 hours
|
||||
> - Pattern correlates with memory leak in v2.3.1 of service
|
||||
> - **Predicts**: Next scheduled restart will cause cascade failure
|
||||
> - **Proposes**: Patch to v2.3.2 + preventive restart during low-traffic window
|
||||
|
||||
## Relationship to [[AIOps]]

Predictive Maintenance is a core capability of [[AIOps]] Level 5 (Optimizing):
|
||||
|
||||
```python
|
||||
DevOps_Maturity_AIOps = {
    "Level 3 - Defined": "Smart Alerting",
    "Level 4 - Advanced": "Self-Healing: Automated Remediation",
    "Level 5 - Optimizing": "Predictive Maintenance ←",  # ← this page
}
|
||||
```
|
||||
|
||||
## Related Concepts
|
||||
|
||||
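- [[Self-Healing Systems]]: predictive maintenance is the evolution of reactive repair
- [[AIOps]]: Predictive Maintenance is an advanced AIOps capability
- [[MTTR]]: predictive maintenance improves MTBF; MTTR is unchanged, but failures are fewer
- [[Availability]]: predictive maintenance directly improves availability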
|
||||
|
||||
## Related Sources
|
||||
|
||||
- [[how-agentic-ai-can-help-for-cloud-devops]]
|
||||
144
wiki/concepts/Private-Cloud.md
Normal file
@@ -0,0 +1,144 @@
|
||||
# Private Cloud
|
||||
|
||||
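> **Private Cloud**: a cloud environment dedicated to a single organization and accessed over a secure private network. It can be hosted on the organization's own premises or by a third-party provider, and it offers greater performance, security, and control than a public cloud.

## Definition

A private cloud is cloud infrastructure built and operated for a single organization. Unlike a public cloud, its resources are not shared with other organizations, which provides stronger security, customizability, and performance guarantees. A private cloud can be deployed in the organization's own data center (on-premises) or hosted by a third-party provider in its facilities (hosted private cloud).

## Key Characteristics

| Characteristic | Description |
|----------------|-------------|
| **Dedicated environment** | Resources serve a single organization; no multi-tenant sharing |
| **High security** | Custom security protocols and configuration to meet strict compliance requirements |
| **Private network access** | Reached over dedicated connections rather than the public internet |
| **Customizability** | Infrastructure can be deeply tailored to the organization's needs |
| **Strong SLA guarantees** | Service-level agreements for high performance and efficiency |
| **Deployment flexibility** | Can be hosted on-premises, hosted by a third party, or deployed in a hybrid model |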
|
||||
|
||||
## Deployment Architecture
|
||||
|
||||
```
|
||||
┌──────────────────────────────────────────────┐
│   Organization A private cloud environment   │
│                                              │
│  ┌────────────────────────────────────────┐  │
│  │ Private data center / hosted facility  │  │
│  │                                        │  │
│  │   ┌──────┐    ┌──────┐    ┌──────┐     │  │
│  │   │ Web  │    │ App  │    │  DB  │     │  │
│  │   │ tier │    │ tier │    │ tier │     │  │
│  │   └──────┘    └──────┘    └──────┘     │  │
│  │                                        │  │
│  │ Virtualization (VMware/KVM/OpenStack)  │  │
│  │ Dedicated physical server cluster      │  │
│  └────────────────────────────────────────┘  │
│                                              │
│      ↕ private network / dedicated link      │
│  ┌────────────────────────────────────────┐  │
│  │ Internal users / remote users          │  │
│  └────────────────────────────────────────┘  │
└──────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
## Advantages
|
||||
|
||||
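### 1. Dedicated, secure environment
- A dedicated secure environment that other organizations cannot access
- Custom security protocols to meet strict compliance requirements
- Unaffected by other tenants' security incidents, and free of the "noisy neighbor" problem

### 2. Custom security configuration
- Run custom protocols, configurations, and safeguards
- Tailor security policy to the needs of unique workloads
- Meet industry-specific compliance standards (HIPAA, PCI-DSS, ISO 27001)

### 3. Scalability without performance trade-offs
- High scalability and efficiency to meet unpredictable demand
- No sacrifice of security or performance
- Capacity can be expanded as the business grows

### 4. Efficient performance
- Reliable performance and efficiency under strong SLAs
- No resource contention or the latency it causes
- Dedicated resources ensure consistent performance

### 5. Flexibility
- Reshape infrastructure as the organization's needs change
- Support custom integrations and workload optimization
- Adapt to both business and technology change

### 6. Dedicated resources without contention
- No resources shared with other organizations
- No performance degradation caused by resource contention
- Predictable, controllable latency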
|
||||
|
||||
## Drawbacks
|
||||
|
||||
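### 1. Higher cost
- Relatively high TCO compared with public cloud alternatives
- Particularly uneconomical for short-term use
- Requires ongoing capital investment and maintenance spending

### 2. Limited remote use
- Strict security measures can restrict access for remote users
- Mobile working is supported less flexibly than on a public cloud
- Requires additional VPN or leased-line configuration

### 3. Limited scalability
- If the data center is confined to on-premises compute resources, the infrastructure may not scale far
- Handling traffic bursts requires capacity planning in advance
- Scaling takes longer than on a public cloud

### 4. Complex management
- Operating a private cloud requires substantial in-house technical expertise
- Maintenance, updates, and security need a dedicated team
- Operating costs stay high

### 5. Potential resource waste
- Resources may be underused, wasting expensive infrastructure
- Inaccurate capacity planning leads to over-provisioning
- Idle resources are costly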
|
||||
|
||||
## When to Use Private Cloud
|
||||
|
||||
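| Scenario | Notes |
|----------|-------|
| **Highly regulated industries** | Finance, healthcare, government, and other sectors with strict data-security requirements |
| **Sensitive data processing** | Data containing personal information, trade secrets, or intellectual property |
| **Strong control and security requirements** | Full control over IT workloads and the underlying infrastructure is needed |
| **Large enterprises** | Complex organizations that need an advanced data center to operate efficiently |
| **High-availability needs** | Mission-critical applications that need 99.99%+ SLA guarantees |
| **Compliance-driven** | Data-residency requirements of specific industry regulations must be met |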
|
||||
|
||||
## Cost Structure
|
||||
|
||||
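| Cost dimension | Private cloud | Public cloud |
|----------------|---------------|--------------|
| Up-front capital investment | High (hardware purchase) | Low |
| Scaling cost | Requires new hardware | Pay per use |
| Staffing cost | High (dedicated team) | Low (provider-managed) |
| Maintenance cost | Borne internally | Borne by the provider |
| Best suited to | Long-term, stable, high-value workloads | Short-term, elastic, variable workloads |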
|
||||
|
||||
## Related Concepts
|
||||
|
||||
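- [[Public Cloud]]: comparison with the public cloud deployment model
- [[Hybrid Cloud]]: hybrid cloud combines the strengths of public and private
- [[Cloud Security]]: cloud security
- [[Cloud Compliance]]: cloud compliance
- [[High Availability]]: high availability
- [[SLA]]: service-level agreements
- [[Disaster Recovery Planning]]: disaster recovery planning
- [[ISO-27001]]: information security management system standard
- [[HIPAA]]: US health information privacy regulation
- [[GDPR]]: EU General Data Protection Regulation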
|
||||
|
||||
## See Also
|
||||
|
||||
- [[Cloud Computing]]: cloud computing fundamentals
- [[Cloud-Adoption-Strategy]]: cloud adoption strategy
- [[Cloud-Maturity-Model]]: cloud maturity model
- [[Cloud-Migration]]: cloud migration
- [[Shared-Responsibility-Model]]: shared responsibility model
|
||||
63
wiki/concepts/Problem-Management.md
Normal file
@@ -0,0 +1,63 @@
|
||||
---
|
||||
title: "Problem Management"
|
||||
type: concept
|
||||
tags: [itsm, incident-management, operations]
|
||||
date: 2025-03-01
|
||||
---
|
||||
|
||||
## Definition
|
||||
|
||||
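Problem Management is one of the core [[ITSM]] processes. It focuses on **identifying and analyzing the root causes of IT service problems** so that the same kind of incident does not recur. Incident Management treats symptoms; Problem Management treats root causes.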
|
||||
|
||||
## Problem Management vs Incident Management
|
||||
|
||||
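| Dimension | Incident Management | Problem Management |
|-----------|---------------------|--------------------|
| Goal | Restore service quickly | Eliminate the root cause |
| Handles | Symptoms | Root causes |
| KPI | MTTR | Problem elimination rate |
| Time frame | Immediate | Medium to long term |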
|
||||
|
||||
## Problem Management Process
|
||||
|
||||
```
|
||||
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   Problem    │ →  │  Root Cause  │ →  │ Known Error  │
│  Detection   │    │   Analysis   │    │   Document   │
└──────────────┘    └──────────────┘    └──────────────┘
       ↓                   ↓                   ↓
   AI Anomaly         ML-enhanced         Known Error
   Detection          RCA Process       Database (KEDB)
|
||||
```
|
||||
|
||||
## Modern Problem Management (ITSM 2.0)
|
||||
|
||||
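In [[ITSM 2.0]], problem management is AI-driven:

### AI-Driven Features
- **Anomaly Detection**: automatically identifies anomalous patterns
- **Predictive Analytics**: predicts potential problems
- **ML-enhanced RCA**: uses machine learning to speed up root cause analysis
- **Automated KEDB Updates**: keeps the known-error database up to date automatically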
|
||||
|
||||
## Key Metrics
|
||||
|
||||
| Metric | Description |
|--------|-------------|
| Problem Resolution Rate | Share of problems resolved |
| Mean Time to Diagnose (MTTD) | Average time to diagnose a problem |
| Recurring Incidents | Number of incidents that recur |
| Known Error Accuracy | Accuracy of known-error records |
|
||||
|
||||
## Related Concepts
|
||||
|
||||
- [[ITSM]]: parent framework
- [[Incident-Management]]: incident management
- [[Root-Cause-Analysis]]: root cause analysis
- [[AIOps]]: AI-driven analysis capabilities
- [[MTTD]]: mean time to diagnose
- [[Event-Correlation]]: event correlation
|
||||
|
||||
## Sources
|
||||
|
||||
- [[understanding-complete-itsm]] — AI-driven Problem Management
|
||||
106
wiki/concepts/Progressive-Rollout.md
Normal file
@@ -0,0 +1,106 @@
|
||||
---
|
||||
title: "Progressive Rollout (渐进式放量)"
|
||||
tags: [devops, continuous-delivery, feature-management, risk-mitigation]
|
||||
aliases: [Canary Deployment, 灰度发布, Canary Release]
|
||||
created: 2026-04-25
|
||||
---
|
||||
|
||||
# Progressive Rollout (渐进式放量)
|
||||
|
||||
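**Progressive Rollout** (also called a staged or gray release) is a risk-management strategy that uses a [[Feature Flag]] to release a new feature to the user base gradually. Unlike a traditional all-or-nothing deployment, it keeps the blast radius as small as possible, which makes for a **quantifiable RTO**.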
|
||||
|
||||
## Aliases
|
||||
- Canary Deployment
|
||||
- Canary Release
|
||||
- 灰度发布
|
||||
- Staged Rollout
|
||||
|
||||
## Core Mechanism
|
||||
|
||||
> "Instead of flipping the switch for everyone simultaneously, roll out gradually."
|
||||
|
||||
```
|
||||
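1% of users   → watch error rate and performance metrics
5% of users   → monitor conversion rate and user feedback
25% of users  → check load on downstream systems
100% of users → full release complete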
|
||||
```
|
||||
|
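One common way to implement these percentage stages is deterministic hash bucketing, so that a user who was in the 1% group stays included at 5% and beyond. The sketch below is an illustration under that assumption and is not the API of any specific feature-flag product; `bucket`, `is_enabled`, and the flag key `new-checkout` are hypothetical names.

```python
import hashlib


def bucket(user_id: str, flag_key: str) -> int:
    """Map a user to a stable bucket in the range 0..9999 for a given flag."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 10_000


def is_enabled(user_id: str, flag_key: str, rollout_percent: float) -> bool:
    """Return True if the user falls inside the current rollout percentage."""
    return bucket(user_id, flag_key) < rollout_percent * 100


users = [f"user-{i}" for i in range(10_000)]
for percent in (1, 5, 25, 100):
    enabled = sum(is_enabled(u, "new-checkout", percent) for u in users)
    print(f"{percent:>3}% target -> {enabled} of {len(users)} users enabled")
```

Because the bucket depends only on the flag key and user ID, raising the percentage only ever adds users, and setting it to 0 turns the feature off for everyone at once.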
||||
## Progressive Rollout vs. Big Bang Release
|
||||
|
||||
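| Dimension | Big Bang (full release) | Progressive Rollout |
|-----------|-------------------------|---------------------|
| Blast radius | All users | A small, controlled group |
| Problem discovery | After the fact | During rollout (visible at the 1% stage) |
| RTO (if something breaks) | Hours (emergency rollback) | Seconds (turn the flag off) |
| Rollback risk | New transactions may be lost | No data loss |
| Team pressure | High (2 AM deploys) | Low (daytime rollout) |
| Feedback collection | Post-hoc analysis | Real-time monitoring |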
|
||||
|
||||
## RTO Redefined
|
||||
|
||||
> "If something breaks at the 5% mark, you've contained the damage. Your RTO is measured in seconds (flip the flag off) instead of hours (emergency rollback deployment)."
|
||||
|
||||
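| Aspect | RTO (Big Bang) | RTO (Progressive Rollout) |
|--------|----------------|---------------------------|
| Problem detected | After full release | Visible in monitoring at the 1% stage |
| Time to stop the bleeding | Hours (rollback deployment) | Seconds (turn the flag off) |
| Users affected | 100% | At most 5% (the current stage) |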
|
||||
|
||||
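## Rollout Strategies

### Targeted rollout by user segment

| Strategy | Description | When to use |
|----------|-------------|-------------|
| Random sampling | Select X% of users at random | General purpose |
| Region targeting | Roll out in specific regions first | Regulatory compliance, time-zone testing |
| User tiering | Roll out tier by tier, for example to free users before paying users | Reducing risk for high-value users |
| Device type | Desktop first, then mobile | Mobile compatibility risk |
| Beta users | Open to internal and beta users first | Early feedback is needed |

### Metric-based automatic gates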
|
||||
|
||||
```yaml
|
||||
rollout_stages:
  - percentage: 1
    gates:
      - error_rate < 0.1%
      - p95_latency < 500ms
  - percentage: 5
    gates:
      - conversion_rate > baseline - 5%
      - support_tickets < 10
  - percentage: 25
    gates:
      - downstream_api_latency < 200ms
      - no_critical_errors
|
||||
```
|
||||
|
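The gate configuration above is declarative; something still has to evaluate it. The following sketch shows one possible evaluator, assuming a simplified gate model in which every gate compares a single numeric metric with a fixed limit. The `Gate` class, the metric names, and `next_percentage` are hypothetical, and the baseline-relative conversion gate is left out to keep the example short.

```python
import operator
from dataclasses import dataclass
from typing import Callable


@dataclass
class Gate:
    metric: str
    compare: Callable[[float, float], bool]
    limit: float


# Hypothetical stages loosely mirroring the YAML above.
STAGES = [
    (1, [Gate("error_rate", operator.lt, 0.001), Gate("p95_latency_ms", operator.lt, 500)]),
    (5, [Gate("support_tickets", operator.lt, 10)]),
    (25, [Gate("downstream_api_latency_ms", operator.lt, 200)]),
]


def next_percentage(current: int, metrics: dict) -> int:
    """Return the next rollout percentage, or 0 (flag off) if a gate fails."""
    for index, (percentage, gates) in enumerate(STAGES):
        if percentage != current:
            continue
        for gate in gates:
            # A missing metric counts as a failure: never advance blind.
            value = metrics.get(gate.metric)
            if value is None or not gate.compare(value, gate.limit):
                return 0
        is_last = index == len(STAGES) - 1
        return 100 if is_last else STAGES[index + 1][0]
    raise ValueError(f"unknown stage: {current}")


print(next_percentage(1, {"error_rate": 0.0004, "p95_latency_ms": 320}))
print(next_percentage(1, {"error_rate": 0.02, "p95_latency_ms": 320}))
```

Treating a failed or missing gate as "turn the flag off" is a deliberately conservative choice; a team might prefer to hold at the current percentage instead.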
||||
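## Relationship to [[Kill Switch]]

Progressive Rollout and Kill Switch are two sides of the same mechanism:

- **Progressive Rollout**: controls how a feature reaches users (gradually)
- **Kill Switch**: cuts the feature off immediately when a problem is found (defensively)

Together they give you the control of a gradual rollout plus the emergency backstop of a kill switch.

## Practical Guidelines

1. **Monitoring first**: make sure dashboards are ready before each rollout step
2. **Define rollback criteria**: decide which metrics trigger a pause or a rollback
3. **Automate the rollout**: avoid errors introduced by manual steps
4. **Align across teams**: product, engineering, and operations should agree on the rollout cadence together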
|
||||
|
||||
## Related Concepts
|
||||
|
||||
- [[Feature Flag]]: the technical foundation of Progressive Rollout
- [[Kill Switch]]: the emergency backstop for Progressive Rollout
- [[RTO]]: Progressive Rollout cuts RTO from hours to seconds
- [[Deployment-vs-Release]]: Progressive Rollout decouples deployment from release
- [[Micro-Recovery]]: Progressive Rollout supports precise, feature-level recovery
|
||||
|
||||
## Sources
|
||||
|
||||
- [[sources/rto-vs-rpo-key-differences-for-modern-disaster-recovery.md]]
|
||||
83
wiki/concepts/PromQL.md
Normal file
@@ -0,0 +1,83 @@
|
||||
---
|
||||
title: "PromQL"
|
||||
type: concept
|
||||
aliases: [Prometheus Query Language, Prometheus查询语言]
|
||||
tags: [prometheus, query, monitoring, time-series]
|
||||
date: 2025-11-11
|
||||
---
|
||||
|
||||
# PromQL
|
||||
|
||||
## Overview
|
||||
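PromQL (Prometheus Query Language) is the query language built into Prometheus for retrieving and processing metric data from the Prometheus TSDB. It supports instant-vector queries (one sample per series at the evaluation time), range-vector queries (a sequence of samples over a time window), scalar conversion, aggregation operators, and a rich function library.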
|
||||
|
||||
## Query Types
|
||||
|
||||
### Instant vector
Returns a set of time series, each with a single sample at the evaluation time:
|
||||
```promql
|
||||
node_memory_MemAvailable_bytes
|
||||
```
|
||||
### Range vector
Returns a sequence of samples over a time window for each series; commonly used to compute rates:
|
||||
```promql
|
||||
rate(node_cpu_seconds_total{mode="user"}[2m])
|
||||
```
|
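To make the range-vector idea concrete, the sketch below computes a per-second rate from the raw samples in a window, including the counter-reset handling that makes `rate()` suitable for counters. It is a simplified model: the real `rate()` also extrapolates to the window boundaries, which is omitted here. `simple_rate` and the sample data are hypothetical.

```python
def simple_rate(samples):
    """Per-second average increase of a counter over the sampled window.

    `samples` is a list of (unix_seconds, counter_value) pairs in time order.
    A drop in value is treated as a counter reset.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # After a reset the counter restarts from zero, so the whole
        # current value counts as new increase.
        increase += curr - prev if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Hypothetical CPU-seconds counter scraped every 30 s over a 2-minute window,
# with a process restart (counter reset) between t=60 and t=90.
cpu_seconds = [(0, 100.0), (30, 112.0), (60, 124.0), (90, 3.0), (120, 15.0)]
print(simple_rate(cpu_seconds))
```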
||||
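### Scalar
A single numeric value with no labels. Note that an aggregation such as `count()` returns a one-element instant vector; `scalar()` converts such a vector to a true scalar: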
|
||||
```promql
|
||||
scalar(count(node_cpu_seconds_total))
|
||||
```
|
||||
|
||||
## Aggregation Operators
|
||||
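PromQL provides built-in aggregation operators. The last three rows are functions over range vectors rather than aggregation operators; they are listed here because they are commonly combined with them:

| Operator / function | Description |
|---------------------|-------------|
| `sum()` | Sum |
| `avg()` | Average |
| `min()` / `max()` | Minimum / maximum |
| `count()` | Count of series |
| `stddev()` / `stdvar()` | Standard deviation / variance |
| `topk()` / `bottomk()` | The k largest / smallest values |
| `rate()` | Per-second average rate (for counters) |
| `increase()` | Increase over the window (for counters) |
| `irate()` | Instantaneous rate from the last two samples (for fast-moving counters) |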
|
||||
|
||||
## Common Patterns for Home Server Monitoring
|
||||
```promql
|
||||
# User-mode CPU usage (2-minute average) above 85%
avg(rate(node_cpu_seconds_total{mode="user"}[2m])) * 100 > 85

# Free disk space below 10%
(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) < 0.10

# Available memory below 15%
(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.15

# HTTP probe failing (blackbox)
probe_success == 0

# TLS certificate expires in fewer than 14 days
(probe_ssl_earliest_cert_expiry - time()) / 86400 < 14

# Container restarts
increase(container_restart_count[5m]) > 0
|
||||
```
|
||||
|
||||
## Label Matchers
|
||||
| Matcher | Description |
|---------|-------------|
| `=` | Exact match |
| `!=` | Not equal |
| `=~` | Regex match |
| `!~` | Regex does not match |

Example: `{job=~"node.*", mode!="idle"}`
|
||||
|
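The four matchers can be modeled in a few lines. One detail worth seeing in code is that PromQL regex matchers are fully anchored, so `node.*` must match the entire label value. This sketch uses Python's `re` module, whose syntax differs in places from the RE2 syntax Prometheus uses; the `matches` helper is hypothetical.

```python
import re


def matches(labels: dict, name: str, op: str, pattern: str) -> bool:
    """Evaluate one PromQL-style label matcher against a label set."""
    value = labels.get(name, "")  # a missing label behaves like an empty string
    if op == "=":
        return value == pattern
    if op == "!=":
        return value != pattern
    # Regex matchers are fully anchored: the pattern must match the whole value.
    full = re.fullmatch(pattern, value) is not None
    if op == "=~":
        return full
    if op == "!~":
        return not full
    raise ValueError(f"unknown matcher: {op}")


series = {"job": "node-exporter", "mode": "user"}
selector = [("job", "=~", "node.*"), ("mode", "!=", "idle")]
print(all(matches(series, *m) for m in selector))
```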
||||
## Related Entities
|
||||
- [[Prometheus]]: the host of the query engine

## Related Concepts
- [[Prometheus告警规则]]: alert conditions built on PromQL
- [[时序数据库]]: the underlying data model
- [[Exporter]]: the source of metrics
|
||||
116
wiki/concepts/Prometheus告警规则.md
Normal file
@@ -0,0 +1,116 @@
|
||||
---
|
||||
title: "Prometheus告警规则"
|
||||
type: concept
|
||||
aliases: [Prometheus Alert Rules, Prometheus告警规则YAML, alert_rules]
|
||||
tags: [prometheus, alerting, monitoring, devops]
|
||||
date: 2025-11-11
|
||||
---
|
||||
|
||||
# Prometheus告警规则
|
||||
|
||||
## Overview
|
||||
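Prometheus alerting rules are alert conditions defined in YAML and evaluated as PromQL expressions. When an expression returns results, the alert enters the `pending` state; once the condition has held continuously for the duration given by `for`, the alert moves to `firing` and is sent to Alertmanager for routing and delivery.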
|
||||
|
||||
## Rule Format
|
||||
```yaml
|
||||
groups:
  - name: <group_name>          # Group name (must be unique within a file)
    interval: <duration>        # Evaluation interval (optional; defaults to evaluation_interval)
    rules:
      - alert: <alert_name>     # Alert name (becomes the alertname label)
        expr: <promql_expr>     # PromQL expression for the alert condition
        for: <duration>         # How long the condition must hold before the alert fires
        labels:                 # Labels (used by Alertmanager for routing and grouping)
          severity: <level>     # For example: critical / warning / info
        annotations:            # Annotations (human-readable alert details)
          summary: <text>       # Short summary
          description: <text>   # Detailed description; supports template variables
|
||||
```
|
||||
|
||||
## Template Variables (Annotation Templates)
Annotations such as `description` can use template variables like `$labels` and `$value`:
|
||||
```yaml
|
||||
annotations:
  description: "CPU usage on host {{ $labels.instance }} is above 85% (current value: {{ $value }}%)"
|
||||
```
|
||||
|
||||
## Home Server Alert Rules (complete alerts.yml example)
|
||||
```yaml
|
||||
groups:
  - name: system-alerts
    rules:

      - alert: HostHighCPU
        expr: avg(rate(node_cpu_seconds_total{mode="user"}[2m])) * 100 > 85
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "Host CPU usage has been above 85% for 2 minutes"

      - alert: HostLowDisk
        expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) < 0.10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space"
          description: "Less than 10% of disk space remains"

      - alert: HostLowMemory
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Available memory is below 15%"

      - alert: HTTPProbeFailed
        expr: probe_success == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Site unreachable"
          description: "HTTP probe failed: {{ $labels.instance }}"

      - alert: TLSCertExpiring
        expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 14
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "TLS certificate expiring soon"
          description: "Certificate for {{ $labels.instance }} expires in less than 14 days"
|
||||
```
|
||||
|
||||
## Alert Lifecycle
|
||||
```
|
||||
Inactive (normal) → Pending (condition true, for-timer running) → Firing (sent to Alertmanager)
|
||||
```
|
||||
|
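The lifecycle can be read as a small state machine driven by each rule evaluation. The sketch below is a simplified model of that behavior, not Prometheus source code; `AlertState` and `evaluate` are hypothetical names, and details such as resolved-alert notifications are ignored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    state: str = "inactive"
    pending_since: Optional[float] = None


def evaluate(alert: AlertState, condition_true: bool, now: float, for_seconds: float) -> str:
    """Advance the alert by one rule evaluation and return the new state."""
    if not condition_true:
        # Any evaluation where the condition is false resets the timer.
        alert.state, alert.pending_since = "inactive", None
    elif alert.pending_since is None:
        alert.pending_since = now
        alert.state = "firing" if for_seconds == 0 else "pending"
    elif now - alert.pending_since >= for_seconds:
        alert.state = "firing"
    return alert.state


alert = AlertState()
# Evaluate every 30 s with `for: 2m`; the condition is true from t=0 onward.
for t in range(0, 151, 30):
    print(t, evaluate(alert, True, now=t, for_seconds=120))
```

The reset branch is why a flapping condition may never fire: each false evaluation sends the alert back to inactive and restarts the `for` timer.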
||||
## Prometheus Configuration
|
||||
```yaml
|
||||
# prometheus.yml
rule_files:
  - "/etc/prometheus/alerts.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

global:
  evaluation_interval: 30s   # How often alerting rules are evaluated
|
||||
```
|
||||
|
||||
## Related Entities
|
||||
- [[Prometheus]]: hosts the alerting engine
- [[Alertmanager]]: receives and dispatches alerts

## Related Concepts
- [[PromQL]]: the query language for alert conditions
- [[Alertmanager]]: the alert dispatch mechanism
- [[System Monitoring]]: the upstream application area
|
||||
129
wiki/concepts/Public-Cloud.md
Normal file
@@ -0,0 +1,129 @@
|
||||
# Public Cloud
|
||||
|
||||
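> **Public Cloud**: a cloud service model in which a third-party provider delivers computing resources (servers, storage, applications) over the internet to many organizations at once (multi-tenancy). Users pay on demand, with no up-front hardware investment.

## Definition

A public cloud is shared cloud infrastructure and services offered by a third-party provider (such as AWS, Azure, or GCP) to any customer over the public internet. All resources run in the provider's data centers. Multiple tenants share the same physical infrastructure and are logically isolated from one another through virtualization.

## Key Characteristics

| Characteristic | Description |
|----------------|-------------|
| **Multi-tenant sharing** | Multiple organizations share the underlying physical resources, isolated through virtualization |
| **Pay on demand** | Pay-as-you-go pricing, billed by actual consumption |
| **Elastic scaling** | Resources scale up or down quickly, with no hardware procurement |
| **As-a-service delivery** | Three service models: IaaS / PaaS / SaaS |
| **Provider-managed** | The provider handles hardware maintenance, security updates, and capacity planning |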
|
||||
|
||||
## Deployment Model
|
||||
|
||||
```
|
||||
User devices (PC / tablet / phone)
        ↓
    Internet
        ↓
┌───────────────────────────────────────┐
│      Cloud provider data center       │
│  ┌────────┐ ┌────────┐ ┌────────┐     │
│  │Tenant A│ │Tenant B│ │Tenant C│ ... │  ← shared by multiple tenants
│  └────────┘ └────────┘ └────────┘     │
│                                       │
│  Virtualization layer (VMs/containers)│
│  Physical server cluster              │
└───────────────────────────────────────┘
|
||||
```
|
||||
|
||||
## Advantages
|
||||
|
||||
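### 1. No up-front capital investment (CapEx → OpEx)
- No IT infrastructure to purchase and maintain
- Converts capital expenditure into operating expenditure
- Lowers initial investment risk

### 2. Technical agility
- High elasticity and flexibility for unpredictable workloads
- Fast deployment of new services and applications
- Access to the latest technologies and versions

### 3. Global accessibility
- Accessible from anywhere with an internet connection
- Supports remote work and distributed team collaboration
- Fast expansion across regions

### 4. Professional management and operations
- The provider handles hardware maintenance and updates
- Workloads always run on current, correctly configured hardware
- Less pressure on the in-house IT team

### 5. Cost efficiency
- Pay only for the resources actually used
- Flexible pricing options for different SLA needs
- Supports a lean growth strategy

### 6. Fast disaster recovery
- Data and applications are backed up regularly and stored in multiple geographic locations
- Minimizes the risk of data loss
- Ensures business continuity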
|
||||
|
||||
## Drawbacks
|
||||
|
||||
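### 1. Weak cost control
- At large scale, total cost of ownership (TCO) can grow steeply
- Especially noticeable for mid-sized and large enterprises
- FinOps practices are needed to keep costs under control

### 2. Lower security
- A shared multi-tenant environment carries potential security risks
- May be unsuitable for sensitive and mission-critical IT workloads
- Compliance controls may be insufficient

### 3. Limited technical control
- Low visibility into, and control over, the infrastructure
- May not meet specific compliance requirements
- Limited customization

### 4. Vendor dependence
- Migrating data and services to another provider is complex and expensive
- Risk of being locked in to a single provider
- Limited bargaining power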
|
||||
|
||||
## When to Use Public Cloud
|
||||
|
||||
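| Scenario | Notes |
|----------|-------|
| **Predictable computing needs** | Communication services, fixed business-operations applications |
| **Applications essential to IT and business operations** | Core business systems |
| **Handling peak load** | Seasonal traffic swings, promotions |
| **Software development and testing** | Quickly create and tear down test environments |
| **Short-term, specific projects** | Avoids long-term hardware investment |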
|
||||
|
||||
## TCO Comparison
|
||||
|
||||
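| Cost dimension | Public cloud | On-premises private |
|----------------|--------------|---------------------|
| Up-front capital investment | Low | High |
| Scaling cost | Low marginal cost | Requires new hardware purchases |
| Maintenance cost | Borne by the provider | Borne internally |
| Staffing cost | Lower | Requires a dedicated team |
| Peak-capacity planning | Automatic elasticity | Requires over-provisioning |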
|
||||
|
||||
## Related Concepts
|
||||
|
||||
- [[Private Cloud]]: comparison with the private cloud deployment model
- [[Hybrid Cloud]]: hybrid cloud combines the strengths of public and private
- [[Multi-Cloud Strategy]]: multi-cloud to avoid lock-in
- [[FinOps]]: cloud cost management
- [[Cloud Security]]: cloud security
- [[CapEx-vs-OpEx]]: capital expenditure vs. operating expenditure
- [[Cloud Elasticity]]: cloud elasticity
- [[SLA]]: service-level agreements
- [[Cloud Migration]]: cloud migration
- [[Vendor-Lock-In]]: vendor lock-in risk
|
||||
|
||||
## See Also
|
||||
|
||||
- [[Cloud Computing]]: cloud computing fundamentals (entity page)
- [[Cloud-Adoption-Strategy]]: cloud adoption strategy
- [[Cloud-Maturity-Model]]: cloud maturity model
- [[Shared-Responsibility-Model]]: shared responsibility model
|
||||
91
wiki/concepts/RPO.md
Normal file
@@ -0,0 +1,91 @@
|
||||
---
|
||||
title: "RPO (Recovery Point Objective)"
|
||||
tags: [devops, disaster-recovery, sre, reliability, data-protection]
|
||||
created: 2026-04-25
|
||||
---
|
||||
|
||||
# RPO (Recovery Point Objective)
|
||||
|
||||
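**RPO (Recovery Point Objective)** is the maximum amount of data loss that is acceptable when a system fails. It measures the degree of data protection: looking backwards from the moment of failure, how much time's worth of data can be lost.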
|
||||
|
||||
## Definition
|
||||
|
||||
> "RPO is about protecting data. It's measured backwards from the moment of failure." — LaunchDarkly
|
||||
|
||||
RPO is one of the core metrics of disaster-recovery planning. Together with [[RTO]] (Recovery Time Objective), it defines the recovery targets.

## Key Characteristics

| Dimension | Description |
|-----------|-------------|
| **What it measures** | Amount of data lost |
| **Direction** | Backwards from the moment of failure |
| **Focus** | Data integrity (how much data can be lost) |
|
||||
|
||||
## Example
|
||||
|
||||
If the database crashes at 3 PM and the last backup was taken at 2 PM:
- **Achievable RPO = 1 hour**: data from the past hour may be lost
- Every transaction made between 2:00 and 3:00 is at risk
|
||||
|
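The arithmetic in this example is simple enough to check in code. The sketch below computes the data-loss window and compares it with a target RPO; the function names and the specific date are hypothetical, and only the 2 PM and 3 PM times come from the example.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup: datetime, failure: datetime) -> timedelta:
    """Time span of writes that exist only after the last backup."""
    if failure < last_backup:
        raise ValueError("failure time precedes the last backup")
    return failure - last_backup


def meets_rpo(last_backup: datetime, failure: datetime, rpo: timedelta) -> bool:
    """True when the potential data loss stays within the RPO target."""
    return data_loss_window(last_backup, failure) <= rpo


backup = datetime(2026, 4, 25, 14, 0)  # 2 PM, hypothetical date
crash = datetime(2026, 4, 25, 15, 0)   # 3 PM
print(data_loss_window(backup, crash))
print(meets_rpo(backup, crash, rpo=timedelta(minutes=15)))
```

With hourly backups the worst case is always close to a full hour, so a 15-minute RPO target cannot be met by this schedule no matter when the failure occurs.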
||||
## RTO vs. RPO
|
||||
|
||||
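RTO and RPO measure different dimensions and must be **optimized together**:

| Scenario | RTO target | RPO target | Notes |
|----------|------------|------------|-------|
| E-commerce checkout | 2 minutes | 0 seconds | Must recover quickly and cannot lose any transaction |
| User analytics dashboard | 30 minutes | 1 hour | Downtime is acceptable, and so is an hour of lost data |
| Internal CRM | 4 hours | 15 minutes | Downtime can be worked around, but recent customer updates matter |
| Blog / marketing site | 2 hours | 24 hours | Visitors can wait, and losing a day of comments is acceptable |

**Key point**: optimizing only one of the two is not enough.
- Backups every 30 seconds (excellent RPO) with a 6-hour restore (terrible RTO) = ineffective
- A new server up in 5 minutes (excellent RTO) with 4 hours of customer data lost (terrible RPO) = ineffective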
|
||||
|
||||
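## How [[Feature Flag]] Protects RPO

A traditional rollback (full deployment rollback) can lose new transaction data while the rollback is in progress. A [[Feature Flag]] rollback **does not lose data**:

- Toggling a feature flag changes only the code path that executes; it does not touch the data layer
- When a [[Kill Switch]] turns off a faulty feature, data that users are submitting is unaffected
- RPO stays near zero throughout the feature-flag operation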
|
||||
|
||||
## Tiered RPO Targets
|
||||
|
||||
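| Tier | Scenario | RPO target | Notes |
|------|----------|------------|-------|
| Critical | Payment processing, trading systems | < 1 minute | No money-related data may be lost |
| Important | CRM, customer support | < 15 minutes | Recent customer updates must not be lost |
| Nice-to-have | Documentation site, internal tools | < 1 hour | Data can be rebuilt, or the loss is acceptable |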
|
||||
|
||||
## Implementation Options

| Method | RPO | Notes |
|--------|-----|-------|
| No backups | ∞ | Completely unacceptable |
| Daily backups | 24 hours | Low cost, poor RPO |
| Hourly backups | 1 hour | Moderate cost and protection |
| CDP (continuous data protection) | Seconds | Continuous replication, RPO close to zero |
| Synchronous replication (active-active) | 0 | Zero data loss, very high cost |

## RPO and RTO Must Work Together

The best practice is to set RTO and RPO together and to include [[Feature Flag]] / [[Kill Switch]] in the disaster-recovery toolchain:

- Short RTO → the system recovers quickly
- Small RPO → little data is lost
- Feature Flag → a cost-effective way to get both
|
||||
|
||||
## Related Concepts
|
||||
|
||||
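- [[RTO]]: Recovery Time Objective, the downtime metric
- [[Disaster Recovery]]: the disaster-recovery strategy whose core targets are RTO and RPO
- [[Feature Flag]]: a configuration-level hot-fix mechanism that protects RPO
- [[Kill Switch]]: turns off a faulty feature so that data is not damaged further
- [[High Availability]]: the infrastructure that lowers RPO
- [[Data-Governance]]: data governance, which includes RPO policy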
|
||||
|
||||
## Sources
|
||||
|
||||
- [[sources/rto-vs-rpo-key-differences-for-modern-disaster-recovery.md]]
|
||||
81
wiki/concepts/RTO.md
Normal file
@@ -0,0 +1,81 @@
|
||||
---
|
||||
title: "RTO (Recovery Time Objective)"
|
||||
tags: [devops, disaster-recovery, sre, reliability]
|
||||
created: 2026-04-25
|
||||
---
|
||||
|
||||
# RTO (Recovery Time Objective)
|
||||
|
||||
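**RTO (Recovery Time Objective)** is the maximum downtime the business can tolerate after a system failure. It measures recovery speed: the window from the moment the system goes down until users can use it again.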
|
||||
|
||||
## Definition
|
||||
|
||||
> "RTO is about getting back online. It's the clock that starts ticking the moment your system goes down." — LaunchDarkly
|
||||
|
||||
RTO is one of the core metrics of disaster-recovery planning. Together with [[RPO]] (Recovery Point Objective), it defines the recovery targets.

## Key Characteristics

| Dimension | Description |
|-----------|-------------|
| **What it measures** | Downtime duration |
| **Start** | The moment the failure occurs |
| **End** | Users can use the system again |
| **Focus** | Speed (how fast) |
|
||||
|
||||
## RTO vs. RPO
|
||||
|
||||
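RTO and RPO are often confused, but they measure entirely different things:

- **RTO**: speed. How soon can the system be restored?
- **RPO**: data. How much data loss can be tolerated?

The two can be set independently: a fast recovery does not guarantee that no data is lost, and vice versa.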
|
||||
|
||||
## Tiered RTO Targets
|
||||
|
||||
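| Tier | Scenario | RTO target | Notes |
|------|----------|------------|-------|
| Critical | Payment processing, user authentication | < 5 minutes | Business stops immediately; warrants a 3 AM page |
| Important | Admin console, reporting | < 1 hour | Business slows but does not stop |
| Nice-to-have | Internal tools, documentation site | < 4 hours | Causes inconvenience only |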
|
||||
|
||||
## RTO in Modern CD Context
|
||||
|
||||
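Traditional disaster-recovery planning assumes that RTO is about hardware failure (a server crash, a data-center power outage), but in modern continuous delivery the biggest risk comes from **code changes**:

- A bug in the payment flow breaks checkout
- A database migration locks up the application
- An AI model update produces abnormal responses
- A new feature degrades sharply under load

A **[[Feature Flag]]** cuts RTO from hours to seconds: you only flip a configuration value, with no code redeployment.

## Implementation Options

| Method | RTO | Notes |
|--------|-----|-------|
| Traditional DR (restore from backup) | Hours | Infrastructure must be rebuilt |
| [[Kill Switch]] (Feature Flag) | Seconds | Configuration change, no deployment |
| Containers + autoscaling | Minutes | Requires a container-orchestration platform |
| Active-active multi-site | Near zero | Very high cost |

## RTO and Cost

- Near-zero RTO requires "redundant everything" (servers, databases, networking, multiple data centers)
- Most teams cannot afford that cost
- **Recommendation**: set RTO per application tier instead of applying one target to every system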
|
||||
|
||||
> "What does an hour of downtime actually cost your business? If it's $10K, don't spend $100K/year on infrastructure to prevent it."
|
||||
|
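The reasoning in the quote above is a break-even comparison, which can be written down directly. The sketch below is only an illustration: the function name is hypothetical, and the expected outage hours per year is an assumed input that the quote does not state.

```python
def worth_investing(cost_per_hour: float,
                    expected_outage_hours_per_year: float,
                    prevention_cost_per_year: float) -> bool:
    """True when the expected yearly downtime loss exceeds the prevention spend."""
    expected_loss = cost_per_hour * expected_outage_hours_per_year
    return expected_loss > prevention_cost_per_year


# Figures from the quote above, plus an assumed 2 hours of outage per year.
print(worth_investing(10_000, 2, 100_000))
# The same spend becomes justified at an assumed 20 outage hours per year.
print(worth_investing(10_000, 20, 100_000))
```

The comparison only covers direct downtime cost; reputational damage or contractual penalties would need to be added to `cost_per_hour` to be reflected.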
||||
## Related Concepts
|
||||
|
||||
- [[RPO]]: Recovery Point Objective, the data-loss metric
- [[Disaster Recovery]]: the disaster-recovery strategy whose core targets are RTO and RPO
- [[Kill Switch]]: achieves an RTO measured in seconds through a Feature Flag
- [[High Availability]]: the infrastructure that lowers RTO
- [[Feature Flag]]: the core mechanism for configuration-level hot fixes
|
||||
|
||||
## Sources
|
||||
|
||||
- [[sources/rto-vs-rpo-key-differences-for-modern-disaster-recovery.md]]
|
||||
68
wiki/concepts/Release-Management.md
Normal file
@@ -0,0 +1,68 @@
|
||||
---
|
||||
title: "Release Management"
|
||||
type: concept
|
||||
tags: [devops, deployment, itsm]
|
||||
date: 2025-03-01
|
||||
---
|
||||
|
||||
## Definition
|
||||
|
||||
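Release Management is one of the core [[ITSM]] processes. It **plans and coordinates the entire release process that takes software from development to production**, ensuring high-quality, low-risk delivery of each version.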
|
||||
|
||||
## Release Management Process
|
||||
|
||||
```
|
||||
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   Release    │ →  │   Build &    │ →  │  Testing &   │
│   Planning   │    │   Package    │    │  Validation  │
└──────────────┘    └──────────────┘    └──────────────┘
                                               ↓
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│  Monitoring  │ ←  │  Deployment  │ ←  │    Staged    │
│  & Rollback  │    │   to Prod    │    │   Release    │
└──────────────┘    └──────────────┘    └──────────────┘
|
||||
```
|
||||
|
||||
## Modern Release Management (ITSM 2.0)
|
||||
|
||||
In [[ITSM 2.0]], release management is deeply integrated with DevOps:

### DevOps-Integrated ITSM

| Practice | Description |
|----------|-------------|
| [[Canary-Release]] | Gradual traffic shifting |
| [[Blue-Green-Deployment]] | Zero-downtime deployment across two environments |
| Feature Flags | Control through feature toggles |
| Automated Rollback | Automatic rollback |
|
||||
|
||||
### Progressive Delivery Patterns
|
||||
|
||||
```
|
||||
Traditional:   v1.0 → v1.1 → v1.2 (Big Bang)
Canary:        v1.0 → [v1.1 → 5%] → [v1.1 → 20%] → v1.1
Blue-Green:    Blue[v1.0] ←→ Green[v1.1] (instant switch)
Feature Flags: v1.0 + [Flag:NewFeature=ON] (dynamic control)
|
||||
```
|
||||
|
||||
## Key Metrics
|
||||
|
||||
| Metric | Description |
|--------|-------------|
| Deployment Frequency | How often deployments occur |
| Lead Time for Changes | Time from commit to production |
| Time to Market | Time to deliver a release to users |
| Release Success Rate | Share of releases that succeed |
|
||||
|
||||
## Related Concepts
|
||||
|
||||
- [[ITSM]]: parent framework
- [[Canary-Release]]: canary release
- [[Blue-Green-Deployment]]: blue-green deployment
- [[CI/CD-Pipeline]]: CI/CD pipeline
- [[Feature-Flag]]: feature flags
- [[Deployment-Automation]]: deployment automation
|
||||
|
||||
## Sources
|
||||
|
||||
- [[understanding-complete-itsm]] — DevOps-integrated Release Management
|
||||
79
wiki/concepts/Rightsizing.md
Normal file
@@ -0,0 +1,79 @@
|
||||
---
|
||||
title: "Rightsizing"
|
||||
tags:
|
||||
- devops
|
||||
- finops
|
||||
- cost-optimization
|
||||
- cloud
|
||||
created: 2026-04-25
|
||||
---
|
||||
|
||||
# Rightsizing
|
||||
|
||||
## Definition
|
||||
|
||||
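Rightsizing continuously analyzes resource-usage trends and **dynamically adjusts cloud resource configuration** to eliminate both over-provisioning and under-provisioning. An agentic AI continuously monitors CPU, memory, and storage utilization and automatically recommends or applies configuration changes.

## Relationship to [[FinOps]]

Rightsizing is one of the core cost-optimization practices in [[FinOps]]: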
|
||||
|
||||
```
|
||||
FinOps Framework:
├── Understand (cloud cost attribution)
├── Optimize (Rightsizing ←) # ← this page
└── Operate (continuous cost management)
|
||||
```
|
||||
|
||||
## Manual vs AI-Driven Rightsizing

| Dimension | Manual Rightsizing | AI-Driven Rightsizing |
|-----------|--------------------|-----------------------|
| Analysis frequency | Quarterly / yearly | Real time / daily |
| Data scope | A limited set of metrics | All metrics plus historical trends |
| Response time | Weeks | Hours |
| Accuracy | Estimated from experience | Based on actual usage data |

## Agentic AI Rightsizing Capabilities
|
||||
|
||||
```python
|
||||
Rightsizing_Dimensions = {
    "Compute": "Autoscaling for EKS/RDS/VMs",
    "Storage": "S3 lifecycle policies + storage-class optimization",
    "Network": "NAT Gateway peak-traffic optimization",
    "Database": "RDS instance-type tuning + connection-pool optimization",
}
|
||||
```
|
||||
|
||||
## Example
|
||||
|
||||
> An AI agent analyzes 30 days of AWS EKS cluster metrics:
|
||||
> - CPU utilization: 15% average, peaks at 40% during business hours
|
||||
> - Memory utilization: 22% average, 60% during batch jobs
|
||||
> - **Suggests**:
|
||||
> - Downsize from m5.xlarge to m5.large (saves 40% compute cost)
|
||||
> - Implement auto-scaling: 2-8 instances based on CPU > 60%
|
||||
> - **Result**: 40% cost reduction, zero performance impact
|
||||
|
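A recommendation like the one in the example above can be approximated with a simple headroom rule. This is a sketch under a strong assumption, namely that the next instance size down has half the CPU and memory, so peak utilization roughly doubles after a downsize. The `recommend` function and the 80% headroom limit are hypothetical.

```python
def recommend(cpu_peak: float, mem_peak: float, headroom: float = 80.0) -> str:
    """Suggest a sizing step from peak utilization percentages.

    Assumes the next size down has half the CPU and memory, so peak
    utilization roughly doubles after a downsize.
    """
    peak = max(cpu_peak, mem_peak)
    if peak > headroom:
        return "upsize one step"
    if 2 * peak <= headroom:
        return "downsize one step"
    return "keep current size"


# CPU peak from the example above, with a hypothetical 35% memory peak.
print(recommend(cpu_peak=40, mem_peak=35))
# The example's own peaks: memory at 60% would not fit after halving.
print(recommend(cpu_peak=40, mem_peak=60))
```

Under this rule the 60% memory peak during batch jobs blocks a plain downsize, which is one reason to pair a smaller instance type with autoscaling, as the example does.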
||||
## Relationship to [[Multi-Cloud Cost Optimization]]

Rightsizing is one of the foundational capabilities of [[Multi-Cloud Cost Optimization]]:
|
||||
|
||||
```
|
||||
Multi-Cloud Cost Optimization:
|
||||
├── Rightsizing ← (single-cloud resource optimization)
|
||||
├── Spot/Reserved Instance Optimization
|
||||
├── Multi-Cloud Resource Consolidation
|
||||
└── Pricing Model Selection
|
||||
```
|
||||
|
||||
## Related Concepts
|
||||
|
||||
- [[FinOps]]: Rightsizing is part of the FinOps framework
- [[Multi-Cloud Cost Optimization]]: the extended scenario for Rightsizing
- [[Cloud Cost Optimization]]: the broader concept that includes Rightsizing
- [[Scalability]]: the technical foundation for Rightsizing
|
||||
|
||||
## Related Sources
|
||||
|
||||
- [[how-agentic-ai-can-help-for-cloud-devops]]
|
||||
71
wiki/concepts/Root-Cause-Analysis.md
Normal file
@@ -0,0 +1,71 @@
|
||||
---
|
||||
title: "Root Cause Analysis"
|
||||
tags:
|
||||
- devops
|
||||
- troubleshooting
|
||||
- ai
|
||||
- observability
|
||||
created: 2026-04-25
|
||||
---
|
||||
|
||||
# Root Cause Analysis (RCA)
|
||||
|
||||
## Definition
|
||||
|
||||
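Root Cause Analysis (RCA) is the process of systematically tracing a problem back to its underlying cause rather than treating only the surface symptoms. By correlating logs across layers (compute, network, application), an agentic AI can locate the root cause faster than manual investigation and substantially shorten incident resolution.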
|
||||
|
||||
## Traditional vs AI-Driven RCA
|
||||
|
||||
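| Dimension | Traditional RCA | AI-Driven RCA |
|-----------|-----------------|---------------|
| Analysis speed | Hours to days | Minutes |
| Data scope | A limited sample of logs | All logs plus cross-source correlation |
| Correlation | Depends on human experience | Automated cross-layer correlation analysis |
| Accuracy | Varies with experience | Consistent, based on pattern matching |
| Knowledge retention | Mostly individual experience | Learnable organizational knowledge |

## Agentic AI RCA Workflow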
|
||||
|
||||
```
|
||||
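1. Anomaly detection → a CloudWatch/Stackdriver/Azure Monitor alert fires
2. Data collection → automatically gather all logs from the relevant time window
3. Cross-layer correlation → correlate compute/networking/application logs
4. Pattern matching → match against historical failure patterns
5. Root-cause output → emit a structured root-cause report plus remediation suggestions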
|
||||
```
|
||||
|
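Steps 2 and 3 of the workflow (collect logs from the relevant window, then correlate across layers) can be sketched with plain data structures. This is a minimal illustration rather than how any specific AIOps tool works; the `correlate` function, the five-minute window, and the log events are all hypothetical.

```python
from collections import defaultdict


def correlate(events, anomaly_time, window_seconds=300):
    """Group events from all layers that fall within a window before the anomaly.

    `events` is a list of (unix_seconds, layer, message) tuples. Returns a
    dict mapping each layer to its messages, ordered by time.
    """
    related = defaultdict(list)
    for timestamp, layer, message in sorted(events):
        if anomaly_time - window_seconds <= timestamp <= anomaly_time:
            related[layer].append(message)
    return dict(related)


# Hypothetical log events from three layers.
events = [
    (1000, "application", "checkout: upstream call timed out"),
    (1010, "network", "flow log: retransmits to 203.0.113.7"),
    (1020, "database", "connection pool exhausted"),
    (200, "application", "deploy finished"),
]
for layer, messages in correlate(events, anomaly_time=1030).items():
    print(layer, messages)
```

Time-window grouping only narrows the candidate set; matching that set against historical failure patterns (step 4) is where a learned model would come in.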
||||
## AI-Driven RCA Example
|
||||
|
||||
> AI agent monitoring AWS EKS detects a spike in error rates. It correlates:
|
||||
> - Kubernetes pod logs (application layer)
|
||||
> - VPC flow logs (network layer)
|
||||
> - RDS metrics (database layer)
|
||||
> - → Identifies: External API timeout causing connection pool exhaustion
|
||||
> - → Suggests: Implement retry strategy with exponential backoff
|
||||
|
||||
## Relationship to [[AIOps]]

RCA is a core component of the [[AIOps]] capability matrix:
|
||||
|
||||
```python
|
||||
AIOps_Capabilities = {
    "Anomaly Detection": "Detect anomalous patterns",
    "Root Cause Analysis": "Automated diagnosis ←",  # ← this page
    "Predictive Maintenance": "Predict and prevent failures",
    "Smart Alerting": "Reduce alert fatigue",
    "Automated Remediation": "Automatic repair",
    "Capacity Optimization": "Optimize capacity",
}
|
||||
```
|
||||
|
||||
## Related Concepts
|
||||
|
||||
- [[Self-Healing Systems]]: once RCA finds the root cause, it triggers automated remediation
- [[AIOps]]: RCA is a core AIOps capability
- [[MTTR]]: RCA speed directly affects MTTR
- [[Observability]]: RCA depends on observability data
|
||||
|
||||
## Related Sources
|
||||
|
||||
- [[how-agentic-ai-can-help-for-cloud-devops]]
|
||||
- [[what-i-know-about-cloud-service-delivery-1]]
|
||||
52
wiki/concepts/SAST.md
Normal file
@@ -0,0 +1,52 @@
|
||||
# SAST (Static Application Security Testing)
|
||||
|
||||
## Definition
|
||||
SAST tools analyze an application's source code to identify security vulnerabilities without executing the code. They excel at spotting common issues such as SQL injection, cross-site scripting, and buffer overflows.
|
||||
|
||||
## Aliases
|
||||
- Static Application Security Testing
|
||||
- White-box testing
|
||||
- Static analysis
|
||||
|
||||
## Characteristics
|
||||
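- **No code execution required**: analyzes source code statically
- **White-box testing**: has visibility into the code's internal structure
- **Suited to the development phase**: used during coding and code review
- **Fast**: can scan large codebases quickly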
|
||||
|
||||
## Common Vulnerabilities Detected
|
||||
- SQL 注入(SQL Injection)
|
||||
- 跨站脚本(XSS, Cross-Site Scripting)
|
||||
- 缓冲区溢出(Buffer Overflow)
|
||||
- 硬编码凭证(Hardcoded Credentials)
|
||||
- 不安全的加密使用
|
||||
- 路径遍历(Path Traversal)
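
To make the first item concrete, here is a small sketch of the kind of pattern a SAST tool typically flags, next to the parameterized form it would accept. It uses Python's standard `sqlite3` module; the `users` table and the function names are made up for illustration.

```python
import sqlite3


def find_user_unsafe(conn, name):
    # Vulnerable: user input is concatenated into the SQL text,
    # which is the pattern static analyzers report as SQL injection.
    query = "SELECT id, name FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def find_user_safe(conn, name):
    # Safe: the value is passed as a bound parameter, never parsed as SQL.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchall()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])
    payload = "x' OR '1'='1"  # classic injection string
    return find_user_unsafe(conn, payload), find_user_safe(conn, payload)
```

With the injection payload, the unsafe query returns every row while the parameterized query returns none, which is why the first form is worth catching before the code ever runs.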
|
||||
|
||||
## Tools
|
||||
- [[SonarQube]] — 代码质量和安全分析
|
||||
- Checkmarx
|
||||
- Veracode
|
||||
- Fortify
|
||||
- Semgrep
|
||||
|
||||
## Integration
|
||||
SAST 工具通常集成到:
|
||||
- IDE 开发环境
|
||||
- CI/CD 构建管道
|
||||
- 代码审查流程
|
||||
|
||||
## Limitations
|
||||
- May produce false positives
- Cannot detect runtime issues
- Requires access to the source code
- Does not detect configuration issues
|
||||
|
||||
## Related Concepts
|
||||
- [[DevSecOps]] — SAST 是其重要组件
|
||||
- [[DAST]] — 动态应用安全测试(黑盒测试)
|
||||
- [[IAST]] — 交互式应用安全测试
|
||||
- [[SCA]] — 软件组成分析
|
||||
- [[Shift-Left-Security]] — SAST 是左移策略的重要工具
|
||||
|
||||
## Sources
|
||||
- [[what-is-devsecops-best-practices-benefits-and-tools]]
|
||||
71
wiki/concepts/SCA.md
Normal file
@@ -0,0 +1,71 @@
|
||||
# SCA (Software Composition Analysis)
|
||||
|
||||
## Definition
|
||||
SCA tools focus on the various software components of an application, including libraries and frameworks, to find known security flaws. They help reveal vulnerabilities that may occur when using third-party components.
|
||||
|
||||
## Aliases
|
||||
- Software Composition Analysis
|
||||
- Dependency Analysis
|
||||
- Open Source Security
|
||||
|
||||
## Characteristics
|
||||
- **依赖分析**:扫描应用的所有第三方组件
|
||||
- **已知漏洞匹配**:与 CVE/NVD 数据库匹配
|
||||
- **许可证合规**:检查开源许可证合规性
|
||||
- **供应链安全**:关注依赖链中的安全问题
|
||||
|
||||
## What SCA Detects
|
||||
- **Known vulnerabilities**
  - CVEs in dependencies
  - Security advisories
- **Outdated dependencies**
  - Known vulnerabilities in old versions
  - Missing security patches
- **License issues**
  - GPL/AGPL restrictions
  - Incompatible licenses
- **Risky dependencies**
  - Unmaintained packages
  - Malicious packages
|
||||
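The known-vulnerability check at the heart of SCA can be sketched as a lookup of pinned dependency versions against an advisory list. Everything named here is hypothetical: the `examplelib` package, the `EXAMPLE-0001` identifier, and the simple dotted-integer version scheme are assumptions for illustration, and real tools query databases such as those listed in the next section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Advisory:
    advisory_id: str
    package: str
    fixed_in: tuple  # first version that contains the fix


# Hypothetical advisory data; a real scanner would load this from a database.
ADVISORIES = [Advisory("EXAMPLE-0001", "examplelib", (2, 4, 1))]


def parse_version(text):
    """Parse a plain dotted version such as '2.3.9' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def scan(dependencies, advisories=ADVISORIES):
    """Return (package, version, advisory_id) for every vulnerable dependency."""
    findings = []
    for name, version in dependencies.items():
        for advisory in advisories:
            if advisory.package == name and parse_version(version) < advisory.fixed_in:
                findings.append((name, version, advisory.advisory_id))
    return findings
```

Calling `scan({"examplelib": "2.3.9"})` reports the dependency, while version `2.4.1` or later passes. The sketch only covers known advisories, which matches the limitation noted later on this page.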
|
||||
## Common CVE Databases
|
||||
- National Vulnerability Database (NVD)
|
||||
- GitHub Advisory Database
|
||||
- Snyk Vulnerability Database
|
||||
- OSV (Open Source Vulnerabilities)
|
||||
|
||||
## Tools
|
||||
- [[Snyk]] — 专注开源安全的 SCA 工具
|
||||
- OWASP Dependency-Check
|
||||
- WhiteSource (Mend)
|
||||
- FOSSA
|
||||
- Dependabot (GitHub)
|
||||
|
||||
## Integration Points
|
||||
- **CI/CD Pipeline**:在构建时自动扫描依赖
|
||||
- **IDE**:开发者本地实时检查
|
||||
- **Registry Scanning**:容器镜像仓库扫描
|
||||
- **SBOM Generation**:软件物料清单生成
|
||||
|
||||
## SBOM (Software Bill of Materials)
|
||||
SCA 工具常用于生成 SBOM:
|
||||
- 完整的依赖列表
|
||||
- 版本信息
|
||||
- 许可证信息
|
||||
- 漏洞状态
|
||||
|
||||
## Limitations
|
||||
- 仅检测已知漏洞(零日漏洞无法检测)
|
||||
- 需要保持漏洞数据库更新
|
||||
- 可能产生误报
|
||||
|
||||
## Related Concepts
|
||||
- [[DevSecOps]] — SCA 是其重要组件
|
||||
- [[SAST]] — 静态应用安全测试
|
||||
- [[DAST]] — 动态应用安全测试
|
||||
- [[Supply-Chain-Security]] — 供应链安全
|
||||
- [[SBOM]] — 软件物料清单
|
||||
- [[Zero-Day-Vulnerability]] — 零日漏洞
|
||||
|
||||
## Sources
|
||||
- [[what-is-devsecops-best-practices-benefits-and-tools]]
|
||||
72
wiki/concepts/Security-and-Compliance.md
Normal file
@@ -0,0 +1,72 @@
|
||||
---
|
||||
title: "Security and Compliance"
|
||||
type: concept
|
||||
tags: [security, compliance, itsm]
|
||||
date: 2025-03-01
|
||||
---
|
||||
|
||||
## Definition
|
||||
|
||||
安全与合规管理(Security and Compliance)是[[ITSM]]的核心流程之一,通过[[Zero-Trust-Architecture]]、自动化风险评估和[[Policy-as-Code]]等手段,确保IT服务满足安全和监管要求。
|
||||
|
||||
## Security & Compliance Framework
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────────────────────┐
|
||||
│ Security & Compliance Management │
|
||||
├─────────────────────────────────────────────────────────────┤
|
||||
│ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ │
|
||||
│ │ Zero Trust │ │ Risk Scoring │ │ Compliance │ │
|
||||
│ │ Architecture │ │ (Automated) │ │ Automation │ │
|
||||
│ └───────────────┘ └───────────────┘ └───────────────┘ │
|
||||
│ ↓ ↓ ↓ │
|
||||
│ ┌─────────────────────────────────────────────────────┐ │
|
||||
│ │ AI-based Threat Intelligence │ │
|
||||
│ │ Behavior Analysis │ Anomaly Detection │ Response │ │
|
||||
│ └─────────────────────────────────────────────────────┘ │
|
||||
└─────────────────────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
## Modern Security & Compliance (ITSM 2.0)
|
||||
|
||||
在[[ITSM 2.0]]中,安全与合规由AI和自动化驱动:
|
||||
|
||||
### Key Components
|
||||
|
||||
| 组件 | 描述 | 技术 |
|
||||
|------|------|------|
|
||||
| [[Zero-Trust-Architecture]] | 永不信任,始终验证 | IAM, MFA, 微分段 |
|
||||
| Automated Risk Scoring | 自动化风险评估 | ML Models |
|
||||
| AI Threat Intelligence | AI威胁情报 | Behavioral Analysis |
|
||||
| [[Policy-as-Code]] | 合规自动化 | OPA, Sentinel |
|
||||
| Compliance Automation | 审计自动化 | Continuous Monitoring |
|
||||
|
||||
### Automated Compliance Pipeline
|
||||
|
||||
```
Code      → Policy Check → Security Scan → Compliance Report → Audit
  ↓              ↓               ↓                 ↓             ↓
Git hooks   OPA            SAST/DAST       Auto-generate       Evidence
            PaC            Security        Report              Pack
```
|
||||
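The policy check stage of this pipeline can be illustrated with a tiny rule evaluator. This is a sketch of the idea behind Policy-as-Code, written in plain Python under assumed names; it is not OPA or Sentinel syntax, and the resource fields (`encrypted`, `public`) are hypothetical.

```python
def require_encryption(resource):
    """Policy: storage must be encrypted."""
    return resource.get("encrypted") is True


def forbid_public_access(resource):
    """Policy: resources must not be publicly reachable."""
    return resource.get("public") is not True


# Policies live in version control alongside the code they govern.
POLICIES = {
    "require-encryption": require_encryption,
    "forbid-public-access": forbid_public_access,
}


def evaluate(resources, policies=POLICIES):
    """Return one violation record per failed (resource, policy) pair."""
    violations = []
    for resource in resources:
        for policy_name, check in policies.items():
            if not check(resource):
                violations.append({"resource": resource["name"], "policy": policy_name})
    return violations
```

An empty result lets the pipeline continue to the security scan; a non-empty one can be written into the compliance report as evidence.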
|
||||
## Key Frameworks & Standards
|
||||
|
||||
| 框架 | 描述 |
|
||||
|------|------|
|
||||
| [[ISO-27001]] | 信息安全管理体系 |
|
||||
| [[GDPR]] | 欧盟数据保护 |
|
||||
| [[HIPAA]] | 医疗健康数据保护 |
|
||||
| SOC 2 | 服务组织控制 |
|
||||
|
||||
## Related Concepts
|
||||
|
||||
- [[ITSM]] — 父框架
|
||||
- [[Zero-Trust-Architecture]] — 零信任架构
|
||||
- [[Policy-as-Code]] — 策略即代码
|
||||
- [[Cloud-Security]] — 云安全
|
||||
- [[Data-Governance]] — 数据治理
|
||||
|
||||
## Sources
|
||||
|
||||
- [[understanding-complete-itsm]] — Security & Compliance in Modern ITSM
|
||||
73
wiki/concepts/Self-Healing-Systems.md
Normal file
@@ -0,0 +1,73 @@
|
||||
---
|
||||
title: "Self-Healing Systems"
|
||||
type: concept
|
||||
tags: [aiops, automation, reliability, agentic-ai]
|
||||
date: 2026-04-14
|
||||
aliases:
|
||||
- Self-Healing
|
||||
---
|
||||
|
||||
## Definition
|
||||
|
||||
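Self-healing systems are intelligent systems that can **automatically detect anomalies, diagnose problems, and carry out repair actions**, returning to normal operation without human intervention. This is one of the core capabilities of [[Agentic AI]] and [[AIOps]].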
|
||||
|
||||
## How It Works
|
||||
|
||||
```
|
||||
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
|
||||
│ Anomaly │ → │ Diagnosis │ → │ Repair │
|
||||
│ Detection │ │ & Root │ │ Action │
|
||||
│ │ │ Cause │ │ │
|
||||
└──────────────┘ └──────────────┘ └──────────────┘
|
||||
↓ ↓ ↓
|
||||
AI/ML Model Decision Tree Automated Script
|
||||
+ Metrics + Knowledge Base + Runbooks
|
||||
↓
|
||||
┌──────────────┐ ┌──────────────┐
|
||||
│ Monitoring │ ← │ Verification │
|
||||
│ Close │ │ & Report │
|
||||
└──────────────┘ └──────────────┘
|
||||
```
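
The loop in the diagram (detect, diagnose, repair, verify) can be sketched as a single function. The thresholds, metric names, and the `RUNBOOK` mapping below are assumptions chosen for illustration; `read_metrics` and `apply_action` are placeholders for whatever monitoring and automation hooks a real platform provides.

```python
# Hypothetical mapping from diagnosed cause to repair action.
RUNBOOK = {
    "memory_leak": "restart",
    "traffic_spike": "scale",
    "bad_release": "rollback",
}


def detect(metrics):
    """Anomaly detection: flag an error rate above an assumed 5% threshold."""
    return metrics["error_rate"] > 0.05


def diagnose(metrics):
    """Diagnosis: a toy decision tree over the current metrics."""
    if metrics["memory_used"] > 0.9:
        return "memory_leak"
    if metrics["requests_per_second"] > metrics["capacity_rps"]:
        return "traffic_spike"
    return "bad_release"


def self_heal(read_metrics, apply_action):
    """Run one pass of detect, diagnose, repair, and verify."""
    metrics = read_metrics()
    if not detect(metrics):
        return {"status": "healthy", "cause": None, "action": None}
    cause = diagnose(metrics)
    action = RUNBOOK[cause]
    apply_action(action)
    # Verification: re-read the metrics and escalate if the anomaly persists.
    recovered = not detect(read_metrics())
    return {
        "status": "recovered" if recovered else "escalate",
        "cause": cause,
        "action": action,
    }
```

Keeping verification inside the loop means a failed repair is escalated to a human rather than silently retried.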
|
||||
|
||||
## Self-Healing Actions
|
||||
|
||||
| 动作类型 | 描述 | 示例 |
|
||||
|----------|------|------|
|
||||
| Restart | 服务重启 | Pod重启、进程重启 |
|
||||
| Scale | 扩缩容 | 增加Pod数量、扩容资源 |
|
||||
| Evict | 驱逐问题节点 | Kubernetes节点驱逐 |
|
||||
| Cleanup | 资源清理 | 清理磁盘、释放连接池 |
|
||||
| Rollback | 版本回滚 | 回到上一个稳定版本 |
|
||||
| Reroute | 流量切换 | DNS切换、负载均衡调整 |
|
||||
|
||||
## In ITSM Context
|
||||
|
||||
在[[ITSM 2.0]]的[[Incident-Management]]中,自愈是关键能力:
|
||||
|
||||
### AIOps-Powered Self-Healing
|
||||
- Real-time observability drives rapid detection
|
||||
- ML models predict failure before it happens
|
||||
- Automated runbooks execute recovery
|
||||
- Continuous learning improves future responses
|
||||
|
||||
### Kubernetes Self-Healing
|
||||
[[Kubernetes]]提供原生自愈机制:
|
||||
- **Liveness Probes** — 自动重启不健康容器
|
||||
- **Readiness Probes** — 停止流量到不健康Pod
|
||||
- **Node Failure Detection** — 自动重新调度Pod
|
||||
|
||||
## Related Concepts
|
||||
|
||||
- [[Agentic AI]] — 自愈的驱动者
|
||||
- [[AIOps]] — 自愈的分析引擎
|
||||
- [[Incident-Management]] — 自愈的应用场景
|
||||
- [[Kubernetes]] — 自愈的主要载体
|
||||
- [[Root-Cause-Analysis]] — 自愈前的诊断过程
|
||||
- [[MTTR]] — 自愈改善的关键指标
|
||||
|
||||
## Sources
|
||||
|
||||
- [[how-agentic-ai-can-help-for-cloud-devops]] — Agentic AI自愈场景
|
||||
- [[understanding-complete-itsm]] — ITSM 2.0自愈能力
|
||||
- [[Agentic-AI]] — 实体页面中的自愈描述
|
||||
- [[Kubernetes]] — Kubernetes自愈机制
|
||||
76
wiki/concepts/Serverless-Computing.md
Normal file
@@ -0,0 +1,76 @@
|
||||
---
|
||||
title: "Serverless Computing"
|
||||
type: concept
|
||||
tags: [Cloud, Serverless, Cloud Native, Edge Computing]
|
||||
date: 2026-04-26
|
||||
---
|
||||
|
||||
# Serverless Computing (无服务器计算)
|
||||
|
||||
## Definition
|
||||
**Serverless Computing** is a cloud execution model where the cloud provider dynamically manages the allocation and provisioning of servers. Developers can build and deploy applications without worrying about infrastructure management.
|
||||
|
||||
## Key Characteristics
|
||||
- **No server management**: Cloud provider handles infrastructure
|
||||
- **Automatic scaling**: Resources scale based on demand
|
||||
- **Pay-per-use**: Pay only for execution time
|
||||
- **Event-driven**: Functions respond to events/triggers
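
As a small illustration of the event-driven model, the sketch below follows the handler shape used by AWS Lambda's Python runtime, where the platform calls a function with an `event` and a `context`. The `name` field in the event and the response body are assumptions made up for this example.

```python
import json


def handler(event, context):
    """Respond to one invocation; the platform supplies event and context."""
    name = (event or {}).get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}"}),
    }
```

There is no server loop in the code: the function runs only when an event arrives, which is what makes pay-per-use billing possible.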
|
||||
|
||||
## Key Platforms
|
||||
|
||||
| Provider | Service |
|
||||
|----------|---------|
|
||||
| AWS | Lambda |
|
||||
| Azure | Azure Functions |
|
||||
| GCP | Cloud Functions |
|
||||
|
||||
## Benefits
|
||||
|
||||
### 1. Cost Efficiency
|
||||
- Eliminates unnecessary resource consumption
|
||||
- No idle capacity costs
|
||||
- Pay only for actual execution time
|
||||
|
||||
### 2. Scalability
|
||||
- Automatic scaling from zero to thousands of instances
|
||||
- Handles traffic spikes without provisioning
|
||||
- Global distribution ready
|
||||
|
||||
### 3. Developer Productivity
|
||||
- Focus on business logic, not infrastructure
|
||||
- Faster deployment cycles
|
||||
- Reduced operational overhead
|
||||
|
||||
## Use Cases
|
||||
|
||||
### Event-Driven Automation
|
||||
- Real-time file processing
|
||||
- Automated backups
|
||||
- Scheduled tasks and cron jobs
|
||||
|
||||
### API Backends
|
||||
- Microservices architecture
|
||||
- Real-time data processing
|
||||
- IoT data ingestion
|
||||
|
||||
### AI/ML Inference
|
||||
- On-demand model inference
|
||||
- Image and video processing
|
||||
- Natural language processing
|
||||
|
||||
## Relationship to Green Computing
|
||||
Serverless computing contributes to [[Green Computing]] by:

- Eliminating idle resource consumption
- Optimizing energy efficiency through shared infrastructure
- Reducing data center carbon footprint
|
||||
|
||||
## Related Concepts
|
||||
- [[Cloud-Native]]
|
||||
- [[Green Computing]]
|
||||
- [[Event-Driven-Architecture]]
|
||||
- [[Edge-Computing]]
|
||||
|
||||
## Related Entities
|
||||
- [[AWS Lambda]]
|
||||
- [[Azure Functions]]
|
||||
- [[Google Cloud Functions]]
|
||||
174
wiki/concepts/Shared-Responsibility-Model.md
Normal file
@@ -0,0 +1,174 @@
|
||||
# Shared Responsibility Model
|
||||
|
||||
> **Shared Responsibility Model** — 共享责任模型是云安全的基本原则,定义了云服务提供商和客户组织之间在安全、运维、合规等方面的责任划分。无论使用公有云、私有云还是混合云,安全问题始终由双方共同承担。
|
||||
|
||||
## Definition
|
||||
|
||||
共享责任模型(Shared Responsibility Model)阐明了在云环境中,云服务提供商和客户组织各自承担的安全和管理职责。客户组织购买云服务并不意味着将所有责任转移给提供商——数据安全、访问控制、灾难恢复规划等关键领域仍然是客户的核心职责。
|
||||
|
||||
> "No matter which cloud environment you work in, your problems don't go away. Though you're purchasing services from third-party vendors, you still have to do your due diligence to reduce risk."
|
||||
|
||||
## Responsibility Matrix by Service Model
|
||||
|
||||
### IaaS (Infrastructure as a Service)
|
||||
|
||||
| 责任领域 | 提供商 | 客户 |
|
||||
|----------|--------|------|
|
||||
| 物理数据中心 | ✅ | — |
|
||||
| 服务器/存储/网络硬件 | ✅ | — |
|
||||
| 虚拟化层 | ✅ | — |
|
||||
| 操作系统 | — | ✅ |
|
||||
| 中间件/运行时 | — | ✅ |
|
||||
| 应用程序 | — | ✅ |
|
||||
| 数据 | — | ✅ |
|
||||
| 身份和访问管理 | — | ✅ |
|
||||
| 网络安全(配置) | — | ✅ |
|
||||
|
||||
### PaaS (Platform as a Service)
|
||||
|
||||
| 责任领域 | 提供商 | 客户 |
|
||||
|----------|--------|------|
|
||||
| 物理基础设施 | ✅ | — |
|
||||
| 运行时/中间件 | ✅ | — |
|
||||
| 操作系统补丁 | ✅ | — |
|
||||
| 开发框架 | ✅ | — |
|
||||
| 应用程序 | — | ✅ |
|
||||
| 数据 | — | ✅ |
|
||||
| 身份和访问管理 | — | ✅ |
|
||||
| 网络安全配置 | — | ✅ |
|
||||
|
||||
### SaaS (Software as a Service)
|
||||
|
||||
| 责任领域 | 提供商 | 客户 |
|
||||
|----------|--------|------|
|
||||
| 所有底层基础设施 | ✅ | — |
|
||||
| 应用程序 | ✅ | — |
|
||||
| 数据 | — | ✅ |
|
||||
| 身份和访问管理 | — | ✅ |
|
||||
| 用户设备安全 | — | ✅ |
|
||||
| 数据备份 | — | ✅ |
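
The three matrices above can be encoded as data and queried. The sketch below transcribes only the customer-side rows, with English labels chosen for this example, and treats anything not listed as the provider's, which is a simplification of real contracts and SLAs.

```python
# Customer-owned areas per service model, transcribed from the tables above.
CUSTOMER_OWNED = {
    "IaaS": {
        "operating system",
        "middleware/runtime",
        "application",
        "data",
        "identity and access management",
        "network security configuration",
    },
    "PaaS": {
        "application",
        "data",
        "identity and access management",
        "network security configuration",
    },
    "SaaS": {
        "data",
        "identity and access management",
        "user device security",
        "data backup",
    },
}


def owner(service_model, area):
    """Return 'customer' or 'provider' for an area under a service model."""
    return "customer" if area in CUSTOMER_OWNED[service_model] else "provider"


def always_customer():
    """Areas the customer owns under every service model."""
    return set.intersection(*CUSTOMER_OWNED.values())
```

The intersection across all three models comes out as data plus identity and access management, the two rows marked as the customer's in every table.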
|
||||
|
||||
## Always the Customer's Responsibility
|
||||
|
||||
无论选择哪种云服务模型,以下领域**始终由客户组织负责**:
|
||||
|
||||
### 1. Identity and Access Management (身份和访问管理)
|
||||
- 定义谁可以访问什么资源(最小权限原则)
|
||||
- 配置多因素认证(MFA)
|
||||
- 定期审计访问权限
|
||||
- 管理服务账户和API密钥
|
||||
|
||||
### 2. Data Security and Encryption (数据安全和加密)
|
||||
- 确定哪些数据需要加密
|
||||
- 管理加密密钥(BYOK/KMS)
|
||||
- 配置传输加密(TLS/SSL)
|
||||
- 数据分类和标签策略
|
||||
|
||||
### 3. Disaster Recovery Planning (灾难恢复规划)
|
||||
- 制定业务连续性计划
|
||||
- 定义 RTO(恢复时间目标)和 RPO(恢复点目标)
|
||||
- 定期测试灾难恢复流程
|
||||
- 维护离线/异地备份
|
||||
|
||||
### 4. Compliance and Governance (合规和治理)
|
||||
- 确保符合行业法规(HIPAA、PCI-DSS、GDPR等)
|
||||
- 定期合规审计
|
||||
- 数据主权和驻留要求
|
||||
- 审计日志收集和保留
|
||||
|
||||
### 5. User Devices and Endpoints (用户设备和端点)
|
||||
- 端点安全(防病毒、EDR)
|
||||
- 设备合规策略
|
||||
- 远程工作安全标准
|
||||
|
||||
## Always the Provider's Responsibility
|
||||
|
||||
以下领域由云服务提供商负责:
|
||||
|
||||
### 1. Physical Security (物理安全)
|
||||
- 数据中心物理访问控制
|
||||
- 环境控制(温度、湿度)
|
||||
- 物理冗余和容错
|
||||
|
||||
### 2. Infrastructure Availability (基础设施可用性)
|
||||
- 底层网络可用性
|
||||
- 硬件故障恢复
|
||||
- 数据中心冗余
|
||||
|
||||
### 3. Hypervisor/Container Security (虚拟化安全)
|
||||
- 虚拟机/容器隔离
|
||||
- 虚拟化层漏洞修复
|
||||
|
||||
## Hybrid/Multi-Cloud Responsibility Boundaries
|
||||
|
||||
在混合云和多云环境中,责任划分更为复杂:
|
||||
|
||||
```
|
||||
┌──────────────────────────────────────────────────────┐
|
||||
│ 客户组织责任 │
|
||||
│ ┌──────────────────────────────────────────────┐ │
|
||||
│ │ 数据 │ IAM │ DR │ 合规 │ 应用程序 │ 端点 │ │
|
||||
│ └──────────────────────────────────────────────┘ │
|
||||
└──────────────────────────────────────────────────────┘
|
||||
↑ ↑ ↑
|
||||
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
|
||||
│ 公有云 │ │ 私有云 │ │ 本地环境 │
|
||||
│ (AWS/Azure/ │ │ (自托管/托管) │ │ (数据中心) │
|
||||
│ GCP) │ │ │ │ │
|
||||
│ 提供商负责 │ │ 提供商/自管 │ │ 全部自管 │
|
||||
│ 物理+虚拟化 │ │ │ │ │
|
||||
└──────────────┘ └──────────────┘ └──────────────┘
|
||||
```
|
||||
|
||||
## Key Risks Without Shared Responsibility Awareness
|
||||
|
||||
| 风险场景 | 后果 |
|
||||
|----------|------|
|
||||
| 假设提供商"全包"安全 | 数据泄露、访问失控 |
|
||||
| 未配置MFA | 账户被入侵 |
|
||||
| 未加密敏感数据 | 合规违规、数据泄露 |
|
||||
| 无灾难恢复计划 | 业务中断、数据永久丢失 |
|
||||
| 不了解合规要求 | 巨额罚款、品牌损害 |
|
||||
|
||||
## Best Practices
|
||||
|
||||
### 1. Know Your Responsibilities
|
||||
- 阅读云提供商的SLA和安全文档
|
||||
- 理解服务模型对应的责任边界
|
||||
- 与提供商沟通不明确的领域
|
||||
|
||||
### 2. Implement Defense in Depth
|
||||
- 不依赖单一安全层
|
||||
- 多层次安全控制(网络、应用、数据、身份)
|
||||
- 假设任何层次都可能失败
|
||||
|
||||
### 3. Automate Security Controls
|
||||
- IaC(基础设施即代码)确保一致性
|
||||
- 自动合规检查
|
||||
- 持续监控和告警
|
||||
|
||||
### 4. Regular Security Training
|
||||
- 确保团队理解云安全责任
|
||||
- 关注社会工程和钓鱼攻击
|
||||
- 建立安全文化
|
||||
|
||||
## Related Concepts
|
||||
|
||||
- [[Public Cloud]] — 公有云部署模式
|
||||
- [[Private Cloud]] — 私有云部署模式
|
||||
- [[Hybrid Cloud]] — 混合云部署模式
|
||||
- [[Cloud Security]] — 云安全
|
||||
- [[SLA]] — 服务级别协议
|
||||
- [[Disaster Recovery Planning]] — 灾难恢复规划
|
||||
- [[Multi-Factor-Authentication]] — 多因素认证
|
||||
- [[Data-Governance]] — 数据治理
|
||||
- [[FinOps]] — 云财务管理
|
||||
- [[Cloud Compliance]] — 云合规性
|
||||
|
||||
## See Also
|
||||
|
||||
- [[Cloud Computing]] — 云计算基础
|
||||
- [[Cloud-Maturity-Model]] — 云成熟度模型
|
||||
- [[ISO-27001]] — 信息安全管理体系
|
||||
- [[HIPAA]] — 医疗健康信息隐私
|
||||
- [[GDPR]] — 欧盟数据保护条例
|
||||
56
wiki/concepts/Shift-Left-Security.md
Normal file
@@ -0,0 +1,56 @@
|
||||
# Shift-Left Security
|
||||
|
||||
## Definition
|
||||
"Shift left" means identifying security flaws early in the software development lifecycle. By focusing on these issues initially, teams can tackle and fix them before they become bigger problems.
|
||||
|
||||
## Core Principle
|
||||
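Move security testing to the early stages of the software development lifecycle instead of waiting until development is finished to run security checks.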
|
||||
|
||||
## Cost Efficiency
|
||||
| 发现阶段 | 相对修复成本 |
|
||||
|---------|------------|
|
||||
| 设计阶段 | 1x |
|
||||
| 开发/代码审查 | 5-10x |
|
||||
| 测试阶段 | 10-30x |
|
||||
| 生产环境 | 30-100x |
|
||||
|
||||
## Implementation
|
||||
|
||||
### Design Phase
|
||||
- 威胁建模(Threat Modeling)
|
||||
- 安全需求定义
|
||||
- 安全架构评审
|
||||
|
||||
### Development Phase
|
||||
- [[SAST]] 静态代码分析
|
||||
- [[SCA]] 依赖扫描
|
||||
- 安全编码规范检查
|
||||
- IDE 安全插件集成
|
||||
|
||||
### CI/CD Integration
|
||||
- 在构建阶段自动运行安全扫描
|
||||
- [[Break-the-Build]] 机制阻止高风险构建
|
||||
- 自动依赖更新和漏洞告警
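
A break-the-build gate can be sketched as a severity threshold applied to scanner findings. The finding format and the four-level severity scale are assumptions for illustration; real scanners each have their own report schema.

```python
import sys

# Assumed severity scale, lowest to highest.
SEVERITY_ORDER = ["low", "medium", "high", "critical"]


def should_break_build(findings, threshold="high"):
    """Return True if any finding is at or above the threshold severity."""
    limit = SEVERITY_ORDER.index(threshold)
    return any(SEVERITY_ORDER.index(f["severity"]) >= limit for f in findings)


def gate(findings, threshold="high"):
    """Exit non-zero so the CI job fails when the gate trips."""
    if should_break_build(findings, threshold):
        sys.exit(1)
```

A CI step would call `gate(...)` after the scan; the non-zero exit code is what stops the pipeline.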
|
||||
|
||||
## Best Practices
|
||||
1. 开发者编写安全代码,从一开始就重视安全
|
||||
2. 安全专家与开发团队紧密协作
|
||||
3. 使用自动化工具减少人工审查负担
|
||||
4. 建立安全编码标准并持续培训
|
||||
|
||||
## Relationship with Shift-Right
|
||||
- [[Shift-Left-Security]] ← complements → [[Shift-Right-Security]]
|
||||
- 左移处理开发阶段的安全问题
|
||||
- 右移处理生产环境特有的安全问题
|
||||
- 两者结合形成完整的安全覆盖
|
||||
|
||||
## Related Concepts
|
||||
- [[DevSecOps]] — 包含 Shift Left 策略的方法论
|
||||
- [[SAST]] — 静态应用安全测试
|
||||
- [[SCA]] — 软件组成分析
|
||||
- [[OWASP-Top-Ten]] — 常见安全漏洞标准
|
||||
- [[Threat Modeling]] — 威胁建模
|
||||
- [[Break-the-Build]] — 安全失败时停止构建
|
||||
|
||||
## Sources
|
||||
- [[what-is-devsecops-best-practices-benefits-and-tools]]
|
||||
50
wiki/concepts/Shift-Right-Security.md
Normal file
@@ -0,0 +1,50 @@
|
||||
# Shift-Right Security
|
||||
|
||||
## Definition
|
||||
"Shift right" highlights the need for ongoing security measures even after launching the application. Some security vulnerabilities may go unnoticed until customers start using the software. Monitoring and addressing these issues post-deployment is crucial.
|
||||
|
||||
## Core Principle
|
||||
Security is not only a development-phase task: after deployment to production, security monitoring and response must continue.
|
||||
|
||||
## Why Shift-Right?
|
||||
|
||||
### Limitations of Pre-Production Testing
|
||||
- 测试环境无法完全模拟真实用户行为
|
||||
- 某些漏洞仅在特定使用场景下暴露
|
||||
- 第三方组件漏洞可能在运行时被发现
|
||||
- 依赖库的零日漏洞需要实时响应
|
||||
|
||||
## Implementation
|
||||
|
||||
### Production Monitoring
|
||||
- 安全信息和事件管理(SIEM)
|
||||
- 运行时应用自我保护(RASP)
|
||||
- 异常行为检测
|
||||
- 日志安全分析
|
||||
|
||||
### Post-Deployment Practices
|
||||
- 持续漏洞扫描
|
||||
- 威胁情报整合
|
||||
- 安全补丁管理
|
||||
- 事件响应计划
|
||||
|
||||
### Feedback Loop
|
||||
- 从生产环境收集安全数据
|
||||
- 反馈给开发团队改进安全实践
|
||||
- 更新威胁模型和安全测试用例
|
||||
|
||||
## Relationship with Shift-Left
|
||||
- [[Shift-Left-Security]] ← complements → [[Shift-Right-Security]]
|
||||
- 左移处理开发阶段的安全问题
|
||||
- 右移处理生产环境特有的安全问题
|
||||
- 两者结合形成完整的安全覆盖
|
||||
|
||||
## Related Concepts
|
||||
- [[DevSecOps]] — 包含 Shift Right 策略的方法论
|
||||
- [[RASP]] — 运行时应用自我保护
|
||||
- [[SIEM]] — 安全信息和事件管理
|
||||
- [[Vulnerability-Scanning]] — 持续漏洞扫描
|
||||
- [[Incident-Response]] — 安全事件响应
|
||||
|
||||
## Sources
|
||||
- [[what-is-devsecops-best-practices-benefits-and-tools]]
|
||||
36
wiki/concepts/Socket-登录.md
Normal file
@@ -0,0 +1,36 @@
|
||||
# Socket 登录
|
||||
|
||||
## Concept Information
|
||||
- **Type**: Concept
|
||||
- **Status**: Active
|
||||
- **Source**: [[mysql-mariadb-数据库详细信息]]
|
||||
|
||||
## Definition
|
||||
Socket login connects to a local database through a Unix socket file. It needs no network connection and suits administrators working directly on the server.
|
||||
|
||||
## How It Works
|
||||
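Passing `-S /path/to/socket` makes the client connect to MariaDB/MySQL through a Unix domain socket instead of TCP/IP, so credentials never cross the network. If the account uses the `unix_socket` authentication plugin, the server identifies the user from the operating-system credentials of the connecting process rather than from a password. Otherwise password authentication still applies, which is why the example command passes `-p`.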
|
||||
|
||||
## Example Command
|
||||
```bash
# -S selects the Unix socket file; -p prompts for the account password
sudo mysql -u root -p -S /run/mysqld/mysqld10.sock
```
|
||||
|
||||
## Key Characteristics
|
||||
- **无需网络**:不经过 TCP/IP,直接通过文件系统通信
|
||||
- **更安全**:不暴露密码到网络,避免中间人攻击
|
||||
- **仅限本地**:只能从数据库服务器本机执行
|
||||
- **系统用户映射**:依赖操作系统用户身份
|
||||
|
||||
## Use Cases
|
||||
1. 数据库初始配置
|
||||
2. 密码重置
|
||||
3. 创建远程访问用户
|
||||
4. 紧急修复
|
||||
|
||||
## Related Concepts
|
||||
- [[用户权限]] — Host+User 组合权限模型
|
||||
- [[MariaDB]] — 使用 socket 登录进行本地管理
|
||||
|
||||
## Related Entities
|
||||
- [[群晖 NAS]] — MariaDB socket 登录的目标服务器
|
||||
134
wiki/concepts/Software-Assurance-Maturity-Model.md
Normal file
@@ -0,0 +1,134 @@
|
||||
# Software Assurance Maturity Model (SAMM)
|
||||
|
||||
> **SAMM (Software Assurance Maturity Model)** — 一个开源框架,用于评估、制定和改进软件安全保障策略,覆盖软件开发生命周期全流程。
|
||||
|
||||
## Definition
|
||||
|
||||
SAMM 由 OWASP 维护,是一个供应商中立、可定制的安全框架:
|
||||
|
||||
- 评估组织软件安全保障成熟度
|
||||
- 定义安全改进路线图
|
||||
- 展示安全活动的价值
|
||||
|
||||
## SAMM Core Structure
|
||||
|
||||
### 4 Business Functions
|
||||
|
||||
| 函数 | 描述 | 涵盖活动 |
|
||||
|------|------|---------|
|
||||
| **Governance** | 安全治理和管理 | 策略、标准、指南、培训 |
|
||||
| **Construction** | 软件构建 | 安全需求、设计、编码 |
|
||||
| **Verification** | 软件验证 | 评审、测试、漏洞管理 |
|
||||
| **Deployment** | 软件部署 | 运营、托管、发布管理 |
|
||||
|
||||
### 3 Security Practices per Function
|
||||
|
||||
```
|
||||
Governance:
|
||||
├── Strategy & Metrics
|
||||
├── Policy & Compliance
|
||||
└── Education & Guidance
|
||||
|
||||
Construction:
|
||||
├── Secure Requirements
|
||||
├── Secure Architecture
|
||||
└── Secure Coding
|
||||
|
||||
Verification:
|
||||
├── Threat Assessment
|
||||
├── Security Testing
|
||||
└── Security Review
|
||||
|
||||
Deployment:
|
||||
├── Secure Environment
|
||||
├── Secure Release
|
||||
└── Operational Enablement
|
||||
```
|
||||
|
||||
## SAMM Maturity Levels
|
||||
|
||||
每个实践评估为 0-3 级:
|
||||
|
||||
| Level | 名称 | 描述 |
|
||||
|-------|------|------|
|
||||
| **0** | Implicit | 无正式实践 |
|
||||
| **1** | Initial | 意识,但 ad-hoc |
|
||||
| **2** | Practiced | 正式流程,部分覆盖 |
|
||||
| **3** | Comprehensive | 量化管理,完全覆盖 |
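
One way to turn per-practice levels into a summary is to average them per business function and list the practices below a target level. This is a sketch under a simplifying assumption (a plain arithmetic mean); the official SAMM scoring worksheets may weight answers differently, and the sample levels in the usage note are invented.

```python
from statistics import mean


def function_scores(assessment):
    """Average the 0-3 practice levels within each business function."""
    return {
        function: round(mean(levels.values()), 2)
        for function, levels in assessment.items()
    }


def weakest_practices(assessment, target=2):
    """List (function, practice, gap) for practices below the target level."""
    gaps = []
    for function, levels in assessment.items():
        for practice, level in levels.items():
            if level < target:
                gaps.append((function, practice, target - level))
    # Largest gap first, so the roadmap starts with the weakest practice.
    return sorted(gaps, key=lambda gap: gap[2], reverse=True)
```

For example, a Governance function with practice levels 1, 2, and 0 averages to 1, and the practice at level 0 is listed first as the largest gap.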
|
||||
|
||||
## Security Activities
|
||||
|
||||
### Governance
|
||||
|
||||
| Practice | L1 | L2 | L3 |
|
||||
|----------|----|----|----|
|
||||
| Strategy & Metrics | 安全指标定义 | 趋势分析 | 目标优化 |
|
||||
| Policy & Compliance | 基线政策 | 合规评估 | 持续合规 |
|
||||
| Education | 安全意识 | 开发者培训 | 安全冠军计划 |
|
||||
|
||||
### Construction
|
||||
|
||||
| Practice | L1 | L2 | L3 |
|
||||
|----------|----|----|----|
|
||||
| Secure Requirements | 安全需求清单 | 风险驱动需求 | 威胁建模集成 |
|
||||
| Secure Architecture | 安全设计原则 | 架构评审 | 安全参考架构 |
|
||||
| Secure Coding | 编码标准 | 代码扫描 | 实时分析 |
|
||||
|
||||
### Verification
|
||||
|
||||
| Practice | L1 | L2 | L3 |
|
||||
|----------|----|----|----|
|
||||
| Threat Assessment | 基础威胁识别 | 结构化威胁建模 | 持续威胁分析 |
|
||||
| Security Testing | 基本渗透测试 | 自动安全测试 | 交互式测试 |
|
||||
| Security Review | 代码审查 | 安全审查清单 | 专家审查 |
|
||||
|
||||
### Deployment
|
||||
|
||||
| Practice | L1 | L2 | L3 |
|
||||
|----------|----|----|----|
|
||||
| Secure Environment | 基础配置 | 配置基线 | 自动化配置管理 |
|
||||
| Secure Release | 发布检查 | 签名验证 | 安全发布流程 |
|
||||
| Operational Enablement | 运营安全意识 | 安全运维手册 | DevSecOps 集成 |
|
||||
|
||||
## Assessment Process
|
||||
|
||||
```
|
||||
1. 准备
|
||||
├── 利益相关者识别
|
||||
└── 信息收集
|
||||
|
||||
2. 评估
|
||||
├── 问卷调查
|
||||
├── 文档审查
|
||||
└── 访谈
|
||||
|
||||
3. 分析
|
||||
├── 评分计算
|
||||
├── 差距分析
|
||||
└── 风险评估
|
||||
|
||||
4. 路线图
|
||||
├── 改进计划
|
||||
├── 优先级排序
|
||||
└── 资源分配
|
||||
|
||||
5. 实施
|
||||
├── 迭代改进
|
||||
└── 效果验证
|
||||
```
|
||||
|
||||
## Comparison with Other Models
|
||||
|
||||
| 模型 | 焦点 | 适用场景 |
|
||||
|------|------|---------|
|
||||
| **SAMM** | 软件开发生命周期 | SDLC 安全改进 |
|
||||
| **BSIMM** | 实际安全活动 | 基准比较 |
|
||||
| **OWASP ASVS** | 应用安全验证 | 安全测试标准 |
|
||||
| **CSMM** | 云安全 | 云环境安全 |
|
||||
|
||||
## See Also
|
||||
|
||||
- [[Cloud Security]] — 云安全
|
||||
- [[DevSecOps]] — DevSecOps
|
||||
- [[Cloud Maturity Model]] — 云成熟度模型
|
||||
- [[CSPM]] — 云安全态势管理
|
||||
57
wiki/concepts/StackSets-Deployment-Visibility.md
Normal file
@@ -0,0 +1,57 @@
|
||||
---
|
||||
title: StackSets Deployment Visibility
|
||||
type: concept
|
||||
tags: [AWS, CloudFormation, StackSets, Observability, CloudOps]
|
||||
date: 2025-10-24
|
||||
---
|
||||
|
||||
## Definition
|
||||
StackSets Deployment Visibility(StackSets 部署可观测性)是指在 AWS 多账户/多区域场景下,通过 EventBridge + CloudWatch Logs 实现对 CloudFormation StackSets 部署状态的集中监控和故障排查能力。核心目标是消除多账户部署中的监控盲区,提供跨账户的统一可观测性视图。
|
||||
|
||||
## Core Properties
|
||||
- **事件捕获**:EventBridge Rules 捕获所有 CloudFormation 操作事件(CREATE/UPDATE/DELETE)
|
||||
- **跨账户转发**:EventBridge Custom Event Bus 将事件从成员账户转发到管理账户
|
||||
- **集中存储**:CloudWatch Log Group 聚合所有账户的 CloudFormation 日志
|
||||
- **统一查询**:CloudWatch Logs Insights 支持跨账户、跨区域的结构化日志分析
|
||||
|
||||
## Event Flow
|
||||
```
Member Account CloudFormation (CREATE/UPDATE/DELETE)
  → EventBridge Rule (pattern: CloudFormation events)
  → Event Bus (Custom, in Management Account)
  → CloudWatch Log Group (central-cloudformation-logs)
  → CloudWatch Logs Insights (aggregated queries)
```
|
||||
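The rule at the start of this flow selects CloudFormation events by their `source` field. The sketch below shows such a pattern as a Python dictionary together with a deliberately simplified matcher. The matcher is an assumption for illustration and covers only exact top-level matches, not the full EventBridge pattern language; `to_log_line` is a hypothetical helper showing which fields end up queryable.

```python
import json

# Event pattern selecting events emitted by CloudFormation.
EVENT_PATTERN = {"source": ["aws.cloudformation"]}


def matches(pattern, event):
    """Simplified matcher: each pattern key must list the event's value."""
    return all(event.get(key) in allowed for key, allowed in pattern.items())


def to_log_line(event):
    """Serialize the fields a central log group would later be queried on."""
    return json.dumps(
        {
            "account": event.get("account"),
            "region": event.get("region"),
            "detail": event.get("detail", {}),
        },
        sort_keys=True,
    )
```

Keeping `account` and `region` at the top level of each log line is what lets a single query aggregate across member accounts.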
|
||||
## Related Concepts
|
||||
- [[Multi-Account Deployment]]:StackSets 部署可观测性是跨账户部署运营的核心支撑
|
||||
- [[AWS CloudFormation StackSets]]:被监控的目标部署工具
|
||||
- [[Amazon EventBridge]]:事件捕获和跨账户路由的核心组件
|
||||
- [[Amazon CloudWatch Logs]]:集中日志存储
|
||||
- [[Centralized Logging]]:部署可观测性是集中日志的具体应用
|
||||
- [[Cross-Account Monitoring]]:共享同一套跨账户监控基础设施
|
||||
- [[Cloud Service Delivery]]:StackSets 部署可观测性是云服务交付运营的重要组成
|
||||
|
||||
## Monitorable Events
|
||||
- Stack CREATE operation started/completed/failed
|
||||
- Stack UPDATE operation started/completed/failed
|
||||
- Stack DELETE operation started/completed/failed
|
||||
- Resource creation/update/deletion events
|
||||
- Stack set operation preferences (parallelism, fault tolerance)
|
||||
|
||||
## Query Patterns (CloudWatch Logs Insights)
|
||||
```sql
fields @timestamp, account, region
| parse @message /"resource-type":"(?<resource_type>[^"]+)"/
| parse @message /"status":"(?<status>[^"]+)"/
| parse @message /"logical-resource-id":"(?<logical_resource_id>[^"]+)"/
| filter status = "FAILED"
| sort @timestamp desc
```
|
||||
|
||||
## Key Metrics to Track
|
||||
- Deployment success/failure rate by account
|
||||
- Time-to-deploy by resource type
|
||||
- Regional distribution of deployments
|
||||
- Failed operations and affected accounts
|
||||
- Deployment timeline and operation duration
|
||||
72
wiki/concepts/Threat-Modeling.md
Normal file
@@ -0,0 +1,72 @@
|
||||
# Threat Modeling
|
||||
|
||||
## Definition
|
||||
Threat Modeling is a structured approach for identifying and prioritizing potential threats to a system, and determining the value that potential mitigations would have in reducing or neutralizing those threats.
|
||||
|
||||
## Concept
|
||||
威胁建模是一种系统化的方法,用于识别和优先处理系统的潜在威胁,并确定潜在缓解措施在减少或消除这些威胁方面的价值。
|
||||
|
||||
## When to Perform
|
||||
|
||||
### Design Phase (Shift-Left)
|
||||
- 新系统架构设计时
|
||||
- 重大功能变更时
|
||||
- 系统集成前
|
||||
|
||||
### Development Phase
|
||||
- 安全编码时
|
||||
- 安全评审时
|
||||
|
||||
### Operations Phase (Shift-Right)
|
||||
- 定期复审
|
||||
- 重大安全事件后
|
||||
- 系统退役评估
|
||||
|
||||
## Process (STRIDE Framework)
|
||||
|
||||
### S - Spoofing(欺骗)
|
||||
伪造身份,如会话劫持
|
||||
|
||||
### T - Tampering(篡改)
|
||||
修改数据或代码
|
||||
|
||||
### R - Repudiation(抵赖)
|
||||
否认执行的操作
|
||||
|
||||
### I - Information Disclosure(信息泄露)
|
||||
未授权访问敏感数据
|
||||
|
||||
### D - Denial of Service(拒绝服务)
|
||||
使系统不可用
|
||||
|
||||
### E - Elevation of Privilege(权限提升)
|
||||
获得超出预期的权限
|
||||
|
||||
## Tools
|
||||
- Microsoft Threat Modeling Tool
|
||||
- OWASP Threat Dragon
|
||||
- IriusRisk
|
||||
- draw.io + 威胁建模模板
|
||||
|
||||
## Output
|
||||
- 威胁文档
|
||||
- 风险矩阵(概率 × 影响)
|
||||
- 缓解措施清单
|
||||
- 安全需求
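
The risk matrix output (probability × impact) can be sketched as a simple ranking. The 1-5 scales and the sample threats in the usage note are assumptions for illustration; teams often use their own scales or a qualitative grid.

```python
def risk_score(probability, impact):
    """Score a threat on assumed 1-5 scales for probability and impact."""
    return probability * impact


def rank_threats(threats):
    """Return threats sorted from highest to lowest risk score."""
    return sorted(
        threats,
        key=lambda t: risk_score(t["probability"], t["impact"]),
        reverse=True,
    )
```

Given a tampering threat rated 2 × 5 and an information disclosure threat rated 4 × 4, the ranking puts information disclosure (16) ahead of tampering (10), which suggests where mitigation effort should go first.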
|
||||
|
||||
## Best Practices
|
||||
1. 从攻击者角度思考
|
||||
2. 覆盖所有信任边界
|
||||
3. 考虑依赖组件的安全
|
||||
4. 定期更新威胁模型
|
||||
5. 与安全专家协作
|
||||
|
||||
## Related Concepts
|
||||
- [[DevSecOps]] — 威胁建模是安全开发的重要实践
|
||||
- [[Shift-Left-Security]] — 早期安全分析
|
||||
- [[Zero-Trust-Architecture]] — 零信任架构
|
||||
- [[Risk-Management]] — 风险管理
|
||||
- [[Security-Design]] — 安全设计
|
||||
|
||||
## Sources
|
||||
- [[what-is-devsecops-best-practices-benefits-and-tools]]
|
||||
34
wiki/concepts/UEFI-Only.md
Normal file
@@ -0,0 +1,34 @@
|
||||
---
|
||||
title: "UEFI Only"
|
||||
type: concept
|
||||
tags: [uefi, bios, boot, hp]
|
||||
date: 2026-04-14
|
||||
aliases: [UEFI Only 模式, UEFI Only Mode]
|
||||
---
|
||||
|
||||
# UEFI Only
|
||||
|
||||
## Definition
|
||||
BIOS/UEFI 固件中的一种启动模式设置,仅允许 UEFI 引导,跳过所有 Legacy/BIOS (BBS) 启动项。是 HP ZBook 等工作站解决"启动项混乱"问题的终极方案。
|
||||
|
||||
## The Problem: Hybrid Mode Pollution
|
||||
HP ZBook 从 Legacy 或 Hybrid 模式切换到 UEFI 安装 Ubuntu 后,BIOS 中会残留大量 `Boot0000-Boot0004` 类型的 BBS (BIOS Boot Specification) 遗留项,这些 Legacy 项会干扰 UEFI 启动项识别。
|
||||
|
||||
## The Solution
|
||||
在 BIOS 设置中 (`Boot Options` → `Boot Mode`) 将 `Legacy` 或 `Hybrid` 切换为 `UEFI Only`:
|
||||
- 无效的 0000-0004 BBS 项自动消失
|
||||
- BIOS 被迫只识别 UEFI 启动项
|
||||
- BootOrder 中仅剩 0005 (Ubuntu)
|
||||
|
||||
## Side Effects
|
||||
- 关闭 Legacy Support 后,机器无法从传统 MBR 磁盘启动
|
||||
- 适用于纯 UEFI 环境(如现代工作站、服务器)
|
||||
|
||||
## Related
|
||||
- [[HP ZBook]] — 受影响的硬件平台
|
||||
- [[efibootmgr]] — 配合工具
|
||||
- [[UEFI启动修复]] — 完整修复策略
|
||||
- [[UEFI Only]] ← 解决 [[HP ZBook]] ← [[UEFI启动修复]]
|
||||
|
||||
## Sources
|
||||
- [[安装ubuntu-24-04-2在hp-zbook工作站笔记本上]]
|
||||
42
wiki/concepts/UEFI启动.md
Normal file
@@ -0,0 +1,42 @@
|
||||
---
|
||||
title: "UEFI启动"
|
||||
tags: [boot, uefi, bios, firmware]
|
||||
date: 2026-04-28
|
||||
---
|
||||
|
||||
# UEFI启动
|
||||
|
||||
## Definition
|
||||
UEFI(Unified Extensible Firmware Interface)是一种取代传统 BIOS 的现代固件接口标准,用于计算机启动过程。与 BIOS 相比,UEFI 提供图形界面、安全启动(Secure Boot)、支持超过 2TB 磁盘(GPT 分区)等先进特性。
|
||||
|
||||
## UEFI vs BIOS
|
||||
| Feature | UEFI | BIOS |
|------|------|------|
| Partition table | GPT | MBR |
| Maximum disk size | Beyond 2 TB | 2 TB (with 512-byte sectors) |
| Primary partition limit | Up to 128 by default | 4 |
| Boot speed | Faster | Slower |
| Secure boot | Supports Secure Boot | None |
| Interface | Graphical, with mouse support | Text-based |
| Boot device selection | Usually F8/F9/F12 | Similar |
|
||||
|
||||
## UEFI Boot Process
|
||||
```
Power On → UEFI Firmware (POST) → EFI System Partition (ESP)
         → Boot Manager → UEFI OS Loader (.efi) → OS Kernel
```
|
||||
|
||||
## Clonezilla 中的 UEFI 注意事项
|
||||
- 启动盘分区方案必须选择 **GPT**
|
||||
- 目标系统类型选择 **UEFI (非 CSM)**
|
||||
- U 盘文件系统必须是 **FAT32**(ESP 标准格式)
|
||||
- 若目标机器不支持 UEFI(老旧设备),降级使用 MBR + BIOS
|
||||
|
||||
## Related Entities
|
||||
- [[HP ZBook]] — 支持 UEFI 启动的笔记本
|
||||
- [[Rufus]] — 制作 UEFI 启动盘的工具
|
||||
|
||||
## Related Concepts
|
||||
- [[GPT分区表]] — UEFI 使用的分区标准
|
||||
- [[MBR分区表]] — BIOS 使用的传统分区标准
|
||||
- [[ISOHybrid镜像]] — Rufus 写入 U 盘时的镜像格式
|
||||
69
wiki/concepts/Vulnerability-Scanning.md
Normal file
@@ -0,0 +1,69 @@
|
||||
# Vulnerability Scanning
|
||||
|
||||
## Definition
|
||||
Vulnerability scanning is the automated process of identifying and cataloging security weaknesses in systems, networks, or applications.
|
||||
|
||||
## Concept
|
||||
漏洞扫描是自动识别和分类系统、网络或应用程序安全弱点的过程。
|
||||
|
||||
## Types
|
||||
|
||||
### Network Vulnerability Scanning
|
||||
- 扫描网络设备和配置
|
||||
- 识别开放端口和服务
|
||||
- 检测配置弱点
|
||||
|
||||
### Web Application Scanning
|
||||
- 检测 Web 应用漏洞
|
||||
- 爬取和测试所有页面
|
||||
- 测试 API 端点
|
||||
|
||||
### Container Image Scanning
|
||||
- 检查镜像中的漏洞
|
||||
- 分析操作系统包
|
||||
- 检测应用依赖
|
||||
|
||||
### Database Scanning
|
||||
- 配置审计
|
||||
- 弱密码检测
|
||||
- 权限检查
|
||||
|
||||
## Tools
|
||||
- Nessus — 综合漏洞扫描器
|
||||
- OpenVAS — 开源漏洞扫描
|
||||
- Qualys — 云端漏洞管理
|
||||
- Trivy — 容器镜像扫描
|
||||
- Clair — 容器漏洞分析
|
||||
|
||||
## Integration with DevSecOps

### CI/CD Pipeline
```yaml
# Example: Trivy container image scan (GitLab CI-style job)
security_scan:
  stage: security        # assumes a "security" stage is declared under `stages:`
  script:
    - trivy image myapp:latest
  allow_failure: true    # the pipeline continues even if this job fails
```

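With `allow_failure: true` the scan job reports findings but never blocks a release. One way to add a gate is to have Trivy write a JSON report (`trivy image --format json`) and evaluate it in a small script. The sketch below assumes the report layout I understand Trivy to use (a top-level `Results` list whose entries may carry a `Vulnerabilities` list with `VulnerabilityID`, `PkgName`, and `Severity` fields); verify this against your Trivy version. The blocking policy and function names are hypothetical.

```python
import json
import sys

# Assumed policy: only these severities should fail the pipeline.
BLOCKING_SEVERITIES = frozenset({"HIGH", "CRITICAL"})


def blocking_findings(report: dict, blocking=BLOCKING_SEVERITIES) -> list[tuple[str, str, str]]:
    """Collect (vulnerability id, package, severity) for findings that should block."""
    findings = []
    for result in report.get("Results") or []:
        # Targets without findings may omit the key or set it to null.
        for vuln in result.get("Vulnerabilities") or []:
            severity = vuln.get("Severity")
            if severity in blocking:
                findings.append(
                    (vuln.get("VulnerabilityID", "?"), vuln.get("PkgName", "?"), severity)
                )
    return findings


def main(report_path: str) -> int:
    """Print blocking findings and return a process exit code."""
    with open(report_path, encoding="utf-8") as handle:
        report = json.load(handle)
    findings = blocking_findings(report)
    for vuln_id, package, severity in findings:
        print(f"{severity}: {vuln_id} in {package}")
    return 1 if findings else 0


if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1]))
```

Keeping the policy in a script rather than in the job definition makes it easy to tighten the threshold gradually, for example by blocking only on critical findings at first.
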
### Shift-Left Application
- Find vulnerabilities early
- Integrate scanning into the IDE
- Run real-time checks during development

### Shift-Right Application
- Continuously monitor the production environment
- Scan on a regular schedule
- Automate patch management

## Related Concepts
- [[DevSecOps]]: vulnerability scanning is a key part of continuous security
- [[SAST]]: code-level vulnerability detection
- [[DAST]]: dynamic vulnerability detection
- [[SCA]]: dependency vulnerability detection
- [[Shift-Left-Security]]: early detection
- [[Shift-Right-Security]]: continuous monitoring

## Sources
- [[what-is-devsecops-best-practices-benefits-and-tools]]
78
wiki/concepts/What-If-Simulation.md
Normal file
@@ -0,0 +1,78 @@
---
title: "What-If Simulation"
tags:
- devops
- architecture
- decision-support
- ai
- cloud-migration
created: 2026-04-25
---

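# What-If Simulation

## Definition

What-If Simulation is the use of agentic AI to simulate how an architecture change (such as a cloud migration or an instance type change) would affect **performance, cost, and compliance**, so that decisions can be driven by data.
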
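## Use Cases

| Scenario | What is simulated | Decision support |
|------|---------|---------|
| Cloud migration | Impact of migrating AWS → GCP | Performance/cost/compliance trade-offs |
| Instance change | m5.xlarge → m5.large | Performance impact assessment |
| Architecture refactoring | Monolith → microservices | Complexity/benefit analysis |
| Multi-cloud strategy | Workload allocation | Cost optimization recommendations |
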
## Agentic AI What-If Simulation Workflow

```
1. Define Scenario → "Moving from AWS EKS to GCP GKE"
2. Gather Baseline → current performance/cost/compliance metrics
3. Model Impact → AI draws on historical data and predictive models
4. Generate Report → performance delta / cost delta / risk assessment
5. Recommend → best option + implementation path
```

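The five workflow steps can be sketched as a small pipeline. Everything here is a hypothetical stand-in: the `Metrics` and `Scenario` types, the field names, and the fixed deltas that replace the AI model in step 3 are illustrative assumptions, not a description of any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    """Step 2: baseline (or projected) measurements for a workload."""
    latency_ms: float
    monthly_cost: float
    compliant: bool


@dataclass(frozen=True)
class Scenario:
    """Step 1: a change to evaluate, with hypothetical modelled effects."""
    name: str
    latency_delta_ms: float  # assumed output of a prediction model
    cost_factor: float       # e.g. 0.9 means 10% cheaper
    compliant: bool


def model_impact(baseline: Metrics, scenario: Scenario) -> Metrics:
    """Step 3: project the baseline under the scenario (stand-in for the AI model)."""
    return Metrics(
        latency_ms=baseline.latency_ms + scenario.latency_delta_ms,
        monthly_cost=baseline.monthly_cost * scenario.cost_factor,
        compliant=scenario.compliant,
    )


def generate_report(baseline: Metrics, projected: Metrics) -> dict:
    """Step 4: express the projection as deltas against the baseline."""
    return {
        "latency_delta_ms": projected.latency_ms - baseline.latency_ms,
        "cost_delta": projected.monthly_cost - baseline.monthly_cost,
        "compliant": projected.compliant,
    }


def recommend(report: dict, max_latency_increase_ms: float) -> str:
    """Step 5: a deliberately simple decision rule."""
    if not report["compliant"]:
        return "reject"
    if report["latency_delta_ms"] > max_latency_increase_ms:
        return "reject"
    return "proceed" if report["cost_delta"] < 0 else "hold"


if __name__ == "__main__":
    baseline = Metrics(latency_ms=100.0, monthly_cost=1000.0, compliant=True)
    scenario = Scenario("Moving from AWS EKS to GCP GKE", 20.0, 0.5, True)
    report = generate_report(baseline, model_impact(baseline, scenario))
    print(report, recommend(report, max_latency_increase_ms=50.0))
```

Treating compliance as a hard constraint and cost as the tie-breaker is one possible design; a real system would more likely weigh several criteria and report its uncertainty.
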
## Example

> An AI agent simulates moving an AWS-based SaaS application to GCP's Private Cloud in KSA (Saudi Arabia):
>
> **Performance Impact**:
> - Latency: +45ms (KSA → GCP EU)
> - Availability: 99.9% → 99.95% (enhanced SLA)
>
> **Cost Impact**:
> - Compute: -20% (GCP preemptible pricing)
> - Egress: +15% (cross-region data transfer)
> - Net: -12% annual savings
>
> **Compliance Impact**:
> - Data Sovereignty: ✅ KSA data residency satisfied
> - SLA: ✅ GCP private cloud meets enterprise agreement
>
> **Recommendation**: Proceed with migration, implement CDN for latency optimization

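A net cost figure like the one in the example is a spend-weighted combination of the per-component changes. The example does not say how spend is split, so the sketch below makes one explicit assumption: compute and egress are the only two cost components. Under that assumption it solves for the compute share that a -20% / +15% / -12% combination implies. Both function names are illustrative.

```python
def net_change(shares: dict[str, float], changes: dict[str, float]) -> float:
    """Spend-weighted net change; `shares` are fractions of total spend summing to 1."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("spend shares must sum to 1")
    return sum(shares[name] * changes[name] for name in shares)


def implied_share(change_a: float, change_b: float, net: float) -> float:
    """Share of component A implied by a two-component net change.

    Solves net = s * change_a + (1 - s) * change_b for s.
    """
    if change_a == change_b:
        raise ValueError("the two changes must differ to solve for a share")
    return (net - change_b) / (change_a - change_b)


if __name__ == "__main__":
    compute_share = implied_share(change_a=-0.20, change_b=0.15, net=-0.12)
    print(f"implied compute share of spend: {compute_share:.1%}")
    check = net_change(
        {"compute": compute_share, "egress": 1 - compute_share},
        {"compute": -0.20, "egress": 0.15},
    )
    print(f"net change recomputed from that share: {check:.1%}")
```

Under the two-component assumption, compute would have to account for roughly 77% of spend for the three figures to agree; with other cost components in the mix the split would differ.
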
## Relationship to [[Multi-Cloud Strategy]]

What-If Simulation is the core decision-support tool within [[Multi-Cloud Strategy]]:

```
Multi-Cloud Strategy Process:
├── Assess → evaluate multi-cloud needs
├── Plan → What-If Simulation   # ← this page
├── Execute → carry out the migration
└── Optimize → ongoing cost/performance optimization
```

## Related Concepts

- [[Multi-Cloud Strategy]]: What-If Simulation supports multi-cloud decisions
- [[Cloud Migration]]: the main use case
- [[Cost-Benefit Analysis]]: a similar analysis method
- [[Decision Framework]]: a general framework for decision support

## Related Sources

- [[how-agentic-ai-can-help-for-cloud-devops]]
- [[how-can-a-multi-cloud-strategy-transform-your-business-roi]]
67
wiki/concepts/Zero-Trust-Architecture.md
Normal file
@@ -0,0 +1,67 @@
---
title: "Zero Trust Architecture (ZTA)"
type: concept
tags: [security, cloud, compliance]
date: 2025-03-01
---

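## Definition

Zero Trust Architecture (ZTA) is a security framework whose core principle is **"Never Trust, Always Verify"**. Unlike the traditional perimeter security model, ZTA assumes that neither the internal nor the external network can be trusted, so every access request must be verified.
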
## Core Principles

### 1. Never Trust, Always Verify
```
Traditional model: inside the perimeter = trusted
ZTA model: verify every request, regardless of location
```

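### 2. Least Privilege Access
- Grant only the minimum permissions needed to complete the task
- Fine-grained access control
- Just-in-Time (JIT) access

### 3. Assume Breach
- Assume the system has already been compromised
- Continuous monitoring and detection
- Microsegmentation to isolate workloads
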
## Implementation Pillars

| Pillar | Description | Example technologies |
|------|------|---------|
| Identity | Strong authentication | MFA, SSO |
| Device health | Endpoint security posture | MDM, EDR |
| Network segmentation | Microsegmentation | VPC, Service Mesh |
| Application control | Least privilege | RBAC, ABAC |
| Data encryption | Encryption in transit and at rest | TLS, KMS |

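The identity, device, and application-control pillars come together in the access decision itself. The sketch below shows a deny-by-default check in which network location is deliberately not an input to the decision. The role table, permission strings, and `AccessRequest` fields are hypothetical; a real deployment would delegate these checks to an identity provider, an MDM/EDR posture service, and a policy engine.

```python
from dataclasses import dataclass

# Hypothetical RBAC table: anything not listed here is denied.
ROLE_PERMISSIONS = {
    "viewer": {"report:read"},
    "operator": {"report:read", "service:restart"},
}


@dataclass(frozen=True)
class AccessRequest:
    user_role: str
    mfa_verified: bool           # identity pillar
    device_compliant: bool       # device health pillar
    permission: str              # application control pillar
    from_internal_network: bool  # recorded, but never grants trust


def authorize(request: AccessRequest) -> bool:
    """Verify every request; being inside the perimeter changes nothing."""
    if not request.mfa_verified:
        return False
    if not request.device_compliant:
        return False
    # Least privilege: allow only permissions explicitly granted to the role.
    return request.permission in ROLE_PERMISSIONS.get(request.user_role, set())


if __name__ == "__main__":
    inside_but_unverified = AccessRequest("operator", False, True, "service:restart", True)
    outside_and_verified = AccessRequest("operator", True, True, "service:restart", False)
    print(authorize(inside_but_unverified), authorize(outside_and_verified))
```

The `from_internal_network` field is kept on the request only to make the contrast with the perimeter model visible: `authorize` never reads it.
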
## In ITSM Context

In [[ITSM]], ZTA sits at the core of [[Security-and-Compliance]]:

```
Security & Compliance Management (ITSM 8.0)
├── Zero Trust Architecture (ZTA)
│   ├── Continuous identity verification
│   ├── Microsegmentation
│   └── Least privilege principle
├── AI-based Threat Intelligence
│   ├── Behavior analysis
│   └── Anomaly detection
└── Policy-as-Code
    ├── Compliance automation
    └── Audit trail
```

## Related Concepts

- [[Policy-as-Code]]: policy expressed as code, enabling compliance automation
- [[Security-and-Compliance]]: security and compliance management
- [[Multi-factor-Authentication]]: multi-factor authentication
- [[Cloud Security]]: cloud security

## Sources

- [[understanding-complete-itsm]]: how ZTA is applied in modern ITSM
Some files were not shown because too many files have changed in this diff.