heibaiying/BigData-Notes
大数据入门指南 (Big Data Getting-Started Guide) :star:
Stale and unlicensed — last commit 2y ago
Weakest axis: no license — legally unclear; last commit was 2y ago.
Documented and popular — useful reference codebase to read through.
- ✓ 7 active contributors
- ✓ Tests present
- ⚠ Stale — last commit 2y ago
- ⚠ Single-maintainer risk — top contributor 93% of recent commits
- ⚠ No license — legally unclear to depend on
- ⚠ No CI workflows detected
What would change the summary?
- → Use as dependency: Concerns → Mixed if a permissive license (MIT, Apache-2.0, etc.) is published
- → Fork & modify: Concerns → Mixed if a LICENSE file is added
- → Deploy as-is: Concerns → Mixed if a LICENSE file is added
Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests
Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.
Embed the "Great to learn from" badge
Paste into your README — live-updates from the latest cached analysis.
[](https://repopilot.app/r/heibaiying/bigdata-notes) — paste at the top of your README.md; it renders inline like a shields.io badge.
Preview social card (1200×630)
This card auto-renders when someone shares https://repopilot.app/r/heibaiying/bigdata-notes on X, Slack, or LinkedIn.
Onboarding: heibaiying/BigData-Notes
Generated by RepoPilot · 2026-05-09 · Source
🤖Agent protocol
If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:
- Verify the contract. Run the bash script in "Verify before trusting" below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
- Treat the "AI · unverified" sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
- Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/heibaiying/BigData-Notes shows verifiable citations alongside every claim.
If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.
🎯Verdict
AVOID — Stale and unlicensed — last commit 2y ago
- 7 active contributors
- Tests present
- ⚠ Stale — last commit 2y ago
- ⚠ Single-maintainer risk — top contributor 93% of recent commits
- ⚠ No license — legally unclear to depend on
- ⚠ No CI workflows detected
<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>
✅Verify before trusting
This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live heibaiying/BigData-Notes repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/heibaiying/BigData-Notes.

What it runs against: a local clone of heibaiying/BigData-Notes — the script inspects the git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.
| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in heibaiying/BigData-Notes | Confirms the artifact applies here, not a fork |
| 2 | Default branch master exists | Catches branch renames |
| 3 | Last commit ≤ 884 days ago | Catches sudden abandonment since generation |
```bash
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of heibaiying/BigData-Notes. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/heibaiying/BigData-Notes.git
#   cd BigData-Notes
#
# Then paste this script. Every check is read-only — no mutations.
set +e
fail=0
ok()   { echo "ok: $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of heibaiying/BigData-Notes and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "heibaiying/BigData-Notes(\.git)?\b" \
  && ok "origin remote is heibaiying/BigData-Notes" \
  || miss "origin remote is not heibaiying/BigData-Notes (artifact may be from a fork)"

# 2. Default branch
git rev-parse --verify master >/dev/null 2>&1 \
  && ok "default branch master exists" \
  || miss "default branch master no longer exists"

# 3. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 884 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~854d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/heibaiying/BigData-Notes"
  exit 1
fi
```
Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).
⚡TL;DR
BigData-Notes is a comprehensive tutorial and code-example repository for Apache big data technologies (Hadoop, Spark, Flink, HBase, Kafka, etc.), written in Java and Scala. It provides both educational documentation and runnable examples for learning distributed computing frameworks, with emphasis on Flink streaming and batch processing, Kafka integration, and state-management patterns. The layout is flat and modular under code/: separate Maven subprojects for Flink (flink-basis-java, flink-basis-scala, flink-kafka-integration, flink-state-management) and Hadoop (hadoop-word-count), plus a notes/ directory with markdown documentation. Each module is independent with its own pom.xml, making it runnable standalone.
👥Who it's for
Students and junior engineers learning big data technologies who want hands-on examples alongside written explanations. Data engineers building real-time streaming pipelines with Flink or batch jobs with Hadoop/Spark. Contributors are primarily Chinese-speaking developers (the README centers on the author's WeChat 公众号, i.e. official account).
🌱Maturity & risk
Moderately mature educational resource with 145KB of Java code structured in proper Maven projects, but it is a teaching/reference repository rather than a production framework. No CI or test infrastructure is visible in the file listing (the maintenance signals above report tests present, so verify per module). The last commit was roughly two years before this analysis (see Verdict), so the well-organized modular structure reflects past care rather than active maintenance.
Low risk for educational use—it's a tutorial repo, not a production dependency. Risk factors: single apparent maintainer (heibaiying), Flink 1.9.0 is EOL (released 2019), no test suite visible, dependency management only shown for Flink basis modules. Don't use code directly in production without updating Flink/Scala versions.
Active areas of work
Repository appears to be a stable educational reference with no visible active development cycle. Focus is on documenting and providing runnable examples for multiple big data frameworks (Hadoop, Hive, Spark, Storm, Flink, HBase, Kafka, Zookeeper, Flume, Sqoop, Azkaban, Scala).
🚀Get running
```bash
git clone https://github.com/heibaiying/BigData-Notes.git
cd BigData-Notes/code/Flink/flink-basis-java
mvn clean install
```
Daily commands: for the Flink basis Java module, run mvn clean package, then java -jar target/flink-basis-java-1.0.jar. For Scala examples: mvn clean compile scala:run. Requires Java 1.8, Maven 3.6+, and Flink 1.9.0 on the classpath or installed locally.
🗺️Map of the codebase
- code/Flink/flink-kafka-integration/src/main/java/com/heibaiying/KafkaStreamingJob.java: Core example showing Kafka source → Flink processing → Kafka sink pattern
- code/Flink/flink-state-management/src/main/java/com/heibaiying/keyedstate/ThresholdWarningWithTTL.java: Demonstrates Flink keyed state with TTL (time-to-live), essential for production streaming apps
- code/Flink/flink-basis-scala/src/main/scala/com/heibaiying/WordCountStreaming.scala: Scala implementation of streaming word count, shows the Flink DataStream API with Scala idioms (a Java rendering is sketched after this list)
- code/Flink/flink-kafka-integration/src/main/java/com/heibaiying/sink/FlinkToMySQLSink.java: Custom sink example for writing Flink results to MySQL, critical pattern for real pipelines
- code/Hadoop/hadoop-word-count/pom.xml: Shows MapReduce project setup, baseline for comparing Flink vs Hadoop approaches
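For readers who don't read Scala, here is a minimal Java rendering of the same streaming word count, written against the Flink 1.9 DataStream API this repo targets. The socket source, host, and port are illustrative assumptions (feed it locally with nc -lk 9999):

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class WordCountStreamingSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical input: lines of text from a local socket.
        DataStream<String> text = env.socketTextStream("localhost", 9999);

        text.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                @Override
                public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                    for (String word : line.toLowerCase().split("\\W+")) {
                        if (!word.isEmpty()) {
                            out.collect(new Tuple2<>(word, 1));
                        }
                    }
                }
            })
            .keyBy(0)  // key by the word; positional keys are valid (if later deprecated) API in 1.9
            .sum(1)    // running count per word
            .print();

        env.execute("Streaming word count (Java sketch)");
    }
}
```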
🛠️How to make changes
Add new Flink examples under code/Flink/flink-basis-java/src/main/java/com/heibaiying/. State management examples go in code/Flink/flink-state-management/src/main/java/com/heibaiying/{keyedstate,operatorstate}/. Documentation in notes/ with markdown format. Update corresponding pom.xml if adding new dependencies.
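As a concrete sketch of that workflow, a hypothetical new example class could look like this; the class name and pipeline are made up, and only the package and directory follow the repo's convention:

```java
package com.heibaiying;

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Hypothetical new example; it would live under
// code/Flink/flink-basis-java/src/main/java/com/heibaiying/ as described above.
public class MyNewExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromElements(1, 2, 3)
           .map(x -> x * 2)  // trivial stand-in for whatever the example demonstrates
           .print();
        env.execute("MyNewExample");
    }
}
```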
🪤Traps & gotchas
Flink 1.9.0 is end-of-life (last security patch 2021) — the code won't build or run on modern Java 17+. Scala 2.11 is EOL. Kafka integration examples hardcode broker addresses (watch for localhost vs. network config; a sketch of externalizing that setting follows). log4j.properties files are present but no log4j2 upgrade path is documented. StreamingJob.java and bean/Employee.java suggest the examples assume a local test environment without production configs (no serialization hints for Kafka).
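To sidestep the hardcoded-broker trap, here is a minimal sketch of externalizing the broker address with Flink's ParameterTool, which is available in 1.9. The flag names, topic, and defaults are illustrative assumptions, not values from the repo:

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

// Hypothetical job: same shape as KafkaStreamingJob, but the broker address
// comes from the command line instead of a hardcoded literal.
public class ConfigurableKafkaJob {
    public static void main(String[] args) throws Exception {
        // e.g. --bootstrap.servers kafka-1:9092 --topic events
        ParameterTool params = ParameterTool.fromArgs(args);

        Properties props = new Properties();
        props.setProperty("bootstrap.servers",
                params.get("bootstrap.servers", "localhost:9092"));
        props.setProperty("group.id", params.get("group.id", "bigdata-notes-demo"));

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.addSource(new FlinkKafkaConsumer<>(
                        params.get("topic", "test-topic"), new SimpleStringSchema(), props))
           .print();
        env.execute("Configurable Kafka job");
    }
}
```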
💡Concepts to learn
- Keyed State with TTL (Time-to-Live) — Flink's keyed state needs explicit TTL configuration to prevent unbounded state growth in long-running streaming jobs; ThresholdWarningWithTTL.java demonstrates this critical production pattern (a minimal sketch follows this list)
- Operator State vs Keyed State — Flink differentiates between operator state (shared across all records) and keyed state (per-key partitioning); this repo has separate examples (operatorstate/ and keyedstate/) showing when to use each
- Custom Sink Implementation (SinkFunction) — FlinkToMySQLSink.java shows how to implement a custom sink for a database without a built-in connector; an essential pattern when Flink's default connectors don't cover your target system (a hedged JDBC sketch follows this list)
- Kafka Source-Sink Integration in Streaming — KafkaStreamingJob.java demonstrates Flink's tight Kafka integration as both source and sink, the dominant pattern in modern data pipelines for streaming data ingestion and publishing
- Batch vs Streaming Unified API — This repo shows both flink-basis-scala (WordCountBatch.scala vs WordCountStreaming.scala) where batch is a bounded stream; understanding their unified DataSet/DataStream APIs is key to Flink's design
- Word Count as Teaching Baseline — Both Hadoop (hadoop-word-count) and Flink examples use word count; this repo enables direct comparison of MapReduce vs Flink processing paradigms on the same problem
- Serialization for Distributed Data Transfer — the Employee bean class (bean/Employee.java) in the Kafka integration module must be serializable for network transfer; the absence of custom serialization hints suggests the repo relies on Java default serialization rather than Kryo tuning
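A minimal sketch of the keyed-state-with-TTL pattern, written against the Flink 1.9 API this repo targets. The state name, TTL duration, and threshold rule are illustrative, not read from ThresholdWarningWithTTL.java:

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Apply after keyBy(...) — keyed state only exists on a KeyedStream.
public class ThresholdWarningSketch extends RichFlatMapFunction<Long, String> {

    private transient ValueState<Long> lastValue;

    @Override
    public void open(Configuration parameters) {
        ValueStateDescriptor<Long> descriptor =
                new ValueStateDescriptor<>("lastValue", Long.class);
        // TTL keeps long-running jobs from accumulating state for keys
        // that are never seen again.
        StateTtlConfig ttl = StateTtlConfig.newBuilder(Time.hours(1))
                .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
                .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
                .build();
        descriptor.enableTimeToLive(ttl);
        lastValue = getRuntimeContext().getState(descriptor);
    }

    @Override
    public void flatMap(Long value, Collector<String> out) throws Exception {
        Long previous = lastValue.value();
        // Illustrative rule: warn when a value more than doubles its predecessor.
        if (previous != null && value > previous * 2) {
            out.collect("threshold warning: " + previous + " -> " + value);
        }
        lastValue.update(value);
    }
}
```

And a hedged sketch of a custom JDBC sink in the style FlinkToMySQLSink.java likely follows, using a RichSinkFunction with a prepared statement (the parameterized form the security section below recommends). The table, column, and environment-variable names are made up, and the MySQL JDBC driver is assumed to be on the classpath:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

public class MySQLSinkSketch extends RichSinkFunction<String> {

    private transient Connection connection;
    private transient PreparedStatement statement;

    @Override
    public void open(Configuration parameters) throws Exception {
        // Connection details come from the environment, not hardcoded literals.
        connection = DriverManager.getConnection(
                System.getenv("MYSQL_URL"),  // e.g. jdbc:mysql://host:3306/db
                System.getenv("MYSQL_USER"),
                System.getenv("MYSQL_PASSWORD"));
        // A prepared statement avoids string-concatenated SQL.
        statement = connection.prepareStatement(
                "INSERT INTO events (payload) VALUES (?)");
    }

    @Override
    public void invoke(String value, Context context) throws Exception {
        statement.setString(1, value);
        statement.executeUpdate();
    }

    @Override
    public void close() throws Exception {
        if (statement != null) statement.close();
        if (connection != null) connection.close();
    }
}
```

For exactly-once delivery a real pipeline would batch writes and tie them to checkpoints; this sketch keeps only the structural pattern.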
🔗Related repos
- apache/flink — official Flink repository with the latest examples; this repo uses Flink 1.9.0, which is heavily outdated
- heibaiying/Full-Stack-Notes — companion repository by the same author covering full-stack development, mentioned in the README's WeChat promo
- ClickHouse/ClickHouse — complementary OLAP database often paired with Kafka/Flink pipelines for real-time analytics
- confluentinc/kafka-streams-examples — alternative streaming framework for Kafka with more up-to-date examples; shares the same use cases as the Flink-Kafka integration here
- elastic/elasticsearch-hadoop — similar sink pattern to FlinkToMySQLSink but for Elasticsearch; a common pairing with Flink streaming
🪄PR ideas
To work on one of these in Claude Code or Cursor, paste:
Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.
Add integration tests for Flink Kafka integration module
The flink-kafka-integration module has KafkaStreamingJob.java and CustomSinkJob.java but lacks corresponding unit tests. This is critical for a Big Data educational repo since Kafka integration is commonly used in production pipelines. Adding tests will help contributors understand testing patterns for Flink streaming jobs.
- [ ] Create src/test/java/com/heibaiying directory structure in code/Flink/flink-kafka-integration/
- [ ] Add a test class for KafkaStreamingJob.java using Flink's test harnesses (e.g. OneInputStreamOperatorTestHarness from flink-streaming-java's test utilities)
- [ ] Add test class for CustomSinkJob.java and FlinkToMySQLSink.java with mock database connections
- [ ] Document test setup in a TESTING.md file under code/Flink/flink-kafka-integration/
Add comprehensive unit tests for Hadoop word count implementations
The code/Hadoop/hadoop-word-count module has multiple implementations (WordCountApp, WordCountCombinerApp, WordCountCombinerPartitionerApp) with custom components (CustomPartitioner, WordCountMapper/Reducer) but no test suite. This is fundamental for a Big Data learning resource and would demonstrate Hadoop testing best practices.
- [ ] Create src/test/java/com/heibaiying directory in code/Hadoop/hadoop-word-count/
- [ ] Add MiniDfsCluster-based integration tests for each WordCount*App class
- [ ] Add unit tests for WordCountMapper and WordCountReducer using MRUnit or Hadoop's testing utilities (a hedged MRUnit sketch follows this list)
- [ ] Add unit test for CustomPartitioner to verify key distribution logic
- [ ] Add test data files in src/test/resources/
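A hedged sketch of that mapper test with MRUnit, assuming the conventional (LongWritable, Text) → (Text, IntWritable) word-count mapper signature; check WordCountMapper's actual generics in the repo before copying:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class WordCountMapperTest {

    @Test
    public void mapperEmitsOnePerToken() throws Exception {
        // MapDriver runs a single (key, value) record through the mapper and
        // asserts the exact output records, in order.
        MapDriver<LongWritable, Text, Text, IntWritable> driver =
                MapDriver.newMapDriver(new WordCountMapper()); // repo class; signature assumed

        driver.withInput(new LongWritable(0), new Text("hello hello world"))
              .withOutput(new Text("hello"), new IntWritable(1))
              .withOutput(new Text("hello"), new IntWritable(1))
              .withOutput(new Text("world"), new IntWritable(1))
              .runTest();
    }
}
```

Note that MRUnit is retired upstream; if it proves awkward, the same shape works with plain JUnit plus a mocked Mapper.Context.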
Add CI/CD workflow with Maven build validation for all modules
The repo contains multiple Maven projects (Flink modules, Hadoop modules) but lacks a GitHub Actions workflow to validate builds. This is critical for an open-source learning resource to catch breaking changes and compilation errors early, ensuring examples remain working.
- [ ] Create .github/workflows/maven-build.yml workflow file
- [ ] Configure workflow to run 'mvn clean verify' on all pom.xml modules (code/Flink/, code/Hadoop/)
- [ ] Set the Java version to 1.8 (matching maven.compiler.source/target in pom.xml)
- [ ] Add build status badge to README.md
- [ ] Configure workflow to trigger on push to main branches and pull requests
🌿Good first issues
- Add unit tests for ThresholdWarning.java state management logic in flink-state-management—currently no test directory visible, tests would validate TTL behavior
- Update Flink version from 1.9.0 to 1.18.0 (current stable) in all pom.xml files and verify code still compiles—would improve long-term maintenance
- Document the exact Kafka and Hadoop version setup required for flink-kafka-integration examples in a QUICKSTART.md, as notes/ lacks runnable setup steps for this module
⭐Top contributors
- @heibaiying — 93 commits
- @套陆 — 2 commits
- @sunrui849 — 1 commit
- @YolandaRay — 1 commit
- @pengchen211 — 1 commit
📝Recent commits
- 3898939 — Merge pull request #76 from sunrui849/sunrui849-patch-1 (heibaiying)
- 6625699 — Merge pull request #50 from YolandaRay/patch-1 (heibaiying)
- 13fa7d5 — Update 基于Zookeeper搭建Hadoop高可用集群.md (sunrui849)
- 34ad3be — Update 基于Zookeeper搭建Kafka高可用集群.md (YolandaRay)
- 1855203 — Update Zookeeper简介及核心概念.md (heibaiying)
- 1f6a811 — Update Zookeeper简介及核心概念.md (heibaiying)
- d590f37 — Update 基于Zookeeper搭建Kafka高可用集群.md (heibaiying)
- 6795c2b — Update Scala列表和集.md (heibaiying)
- 2b5bd14 — Merge pull request #41 from pengchen211/patch-1 (heibaiying)
- 16465d7 — Fix hive installation script template name (修正hive安装脚本模板名) (pengchen211)
🔒Security observations
The codebase has the following observations:

- High · Outdated Flink Dependency — code/Flink/flink-basis-java/pom.xml and other Flink pom.xml files. The project uses Flink 1.9.0 (released in 2019), which is significantly outdated and likely contains known security vulnerabilities; this version is no longer maintained and receives no security patches. Fix: upgrade to the latest stable Apache Flink (1.17.x or later) and review the Flink security advisories at https://flink.apache.org/what-we-do/security/ for CVEs affecting 1.9.0.
- High · Potential SQL Injection in FlinkToMySQLSink — code/Flink/flink-kafka-integration/src/main/java/com/heibaiying/sink/FlinkToMySQLSink.java. The name suggests database operations; without viewing the source, the pattern indicates potential raw SQL query construction that could be vulnerable to SQL injection if input is not parameterized. Fix: use prepared statements with parameterized queries (as in the sink sketch under "Concepts to learn"), avoid string concatenation for SQL, and validate inputs.
- Medium · Outdated Java Version Target — multiple pom.xml files (maven.compiler.source and maven.compiler.target set to 1.8). The project targets Java 8, which is approaching obsolescence (Oracle's extended support for Java 8 ends in December 2030); outdated Java versions increase exposure to known vulnerabilities. Fix: upgrade to Java 11 LTS or Java 17 LTS, update the pom.xml properties, and test thoroughly on the new version.
- Medium · Missing Dependency Version Management — code/Flink/flink-basis-java/pom.xml and other pom.xml files. Without a complete <dependencyManagement> section and explicit versions for transitive dependencies, the build is open to version inconsistencies and supply-chain attacks through dependency resolution. Fix: pin explicit versions for all dependencies, run Maven's dependency-check plugin to flag vulnerable ones, and consider SBOM (Software Bill of Materials) tooling.
- Medium · Potential Hardcoded Configuration or Credentials — code/Flink/flink-kafka-integration/src/main/java/com/heibaiying/sink/FlinkToMySQLSink.java and resource configuration files. Big data projects commonly hardcode database credentials, API keys, or connection strings in configuration files or source code. Fix: externalize all credentials via environment variables, secure vaults (HashiCorp Vault, AWS Secrets Manager), or configuration management, and scan the codebase with tools like git-secrets or TruffleHog.
- Low · Incomplete Log4j Configuration — code/Flink/flink-basis-java/src/main/resources/log4j.properties and other log4j.properties files. Misconfigured logging can expose sensitive information or be insufficient for security auditing. Fix: ensure sensitive data is not logged, log levels are appropriate, log files have proper access controls, and centralized logging is in place for audit trails.
- Low · Missing Security Policy and Code Review Process — repository root (README.md and configuration files). This appears to be an educational repository with no security policy, code-review process, or vulnerability-disclosure procedure. Fix: add SECURITY.md with a disclosure policy, require code review, add security testing to the CI/CD pipeline, and update dependencies regularly.
LLM-derived; treat as a starting point, not a security audit.
👉Where to read next
- Open issues — current backlog
- Recent PRs — what's actively shipping
- Source on GitHub
Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.