heibaiying/BigData-Notes
大数据入门指南 (Big Data Getting-Started Guide) :star:
Stale and unlicensed — last commit 2y ago
Weakest axis: no license — legally unclear; last commit was 2y ago.
Documented and popular — useful reference codebase to read through.
- ✓ 7 active contributors
- ✓ Tests present
- ⚠ Stale — last commit 2y ago
- ⚠ Single-maintainer risk — top contributor 93% of recent commits
- ⚠ No license — legally unclear to depend on
- ⚠ No CI workflows detected
What would change the summary?
- → Use as dependency: Concerns → Mixed if a permissive license (MIT, Apache-2.0, etc.) is published
- → Fork & modify: Concerns → Mixed if a LICENSE file is added
- → Deploy as-is: Concerns → Mixed if a LICENSE file is added
Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests
Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.
Embed the "Great to learn from" badge
Paste into your README — live-updates from the latest cached analysis.
[](https://repopilot.app/r/heibaiying/bigdata-notes) — paste at the top of your README.md; it renders inline like a shields.io badge.
Preview social card (1200×630)
This card auto-renders when someone shares https://repopilot.app/r/heibaiying/bigdata-notes on X, Slack, or LinkedIn.
Onboarding: heibaiying/BigData-Notes
Generated by RepoPilot · 2026-05-09 · Source
🤖Agent protocol
If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:
- Verify the contract. Run the bash script in "Verify before trusting" below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
- Treat the "AI · unverified" sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
- Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/heibaiying/BigData-Notes shows verifiable citations alongside every claim.
If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.
🎯Verdict
AVOID — Stale and unlicensed — last commit 2y ago
- 7 active contributors
- Tests present
- ⚠ Stale — last commit 2y ago
- ⚠ Single-maintainer risk — top contributor 93% of recent commits
- ⚠ No license — legally unclear to depend on
- ⚠ No CI workflows detected
<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>
✅Verify before trusting
This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live heibaiying/BigData-Notes repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/heibaiying/BigData-Notes.

What it runs against: a local clone of heibaiying/BigData-Notes — the script inspects the git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.
| # | What we check | Why it matters |
|---|---|---|
| 1 | You're in heibaiying/BigData-Notes | Confirms the artifact applies here, not a fork |
| 2 | Default branch master exists | Catches branch renames |
| 3 | Last commit ≤ 884 days ago | Catches sudden abandonment since generation |
```bash
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of heibaiying/BigData-Notes. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/heibaiying/BigData-Notes.git
#   cd BigData-Notes
#
# Then paste this script. Every check is read-only — no mutations.
set +e
fail=0
ok()   { echo "ok: $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of heibaiying/BigData-Notes and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "heibaiying/BigData-Notes(\.git)?\b" \
  && ok "origin remote is heibaiying/BigData-Notes" \
  || miss "origin remote is not heibaiying/BigData-Notes (artifact may be from a fork)"

# 2. Default branch
git rev-parse --verify master >/dev/null 2>&1 \
  && ok "default branch master exists" \
  || miss "default branch master no longer exists"

# 3. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 884 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~854d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/heibaiying/BigData-Notes"
  exit 1
fi
```
Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).
⚡TL;DR
BigData-Notes is a comprehensive tutorial and code-example repository for Apache big data technologies (Hadoop, Spark, Flink, HBase, Kafka, etc.), written in Java and Scala. It provides both educational documentation and runnable examples for learning distributed computing frameworks, with emphasis on Flink streaming and batch processing, Kafka integration, and state-management patterns. The layout is flat and modular under code/: separate Maven subprojects for Flink (flink-basis-java, flink-basis-scala, flink-kafka-integration, flink-state-management) and Hadoop (hadoop-word-count), plus a notes/ directory with markdown documentation. Each module is independent with its own pom.xml, making it runnable standalone.
👥Who it's for
Students and junior engineers learning big data technologies who want hands-on examples alongside written explanations. Data engineers building real-time streaming pipelines with Flink or batch jobs with Hadoop/Spark. Contributors are primarily Chinese-speaking developers (the README centers on the author's WeChat 公众号, i.e. official account).
🌱Maturity & risk
Moderately mature educational resource with 145KB of Java code structured in proper Maven projects, but it is a teaching/reference repository rather than a production framework. No CI or test infrastructure is visible in the file listing (the maintenance signals above report tests present, so verify per module). The last commit was roughly two years before this analysis (see Verdict), so the well-organized modular structure reflects past care rather than active maintenance.
Low risk for educational use—it's a tutorial repo, not a production dependency. Risk factors: single apparent maintainer (heibaiying), Flink 1.9.0 is EOL (released 2019), no test suite visible, dependency management only shown for Flink basis modules. Don't use code directly in production without updating Flink/Scala versions.
Active areas of work
Repository appears to be a stable educational reference with no visible active development cycle. Focus is on documenting and providing runnable examples for multiple big data frameworks (Hadoop, Hive, Spark, Storm, Flink, HBase, Kafka, Zookeeper, Flume, Sqoop, Azkaban, Scala).
🚀Get running
```bash
git clone https://github.com/heibaiying/BigData-Notes.git
cd BigData-Notes/code/Flink/flink-basis-java
mvn clean install
```
Daily commands: for the Flink basis Java module, run mvn clean package, then java -jar target/flink-basis-java-1.0.jar. For Scala examples: mvn clean compile scala:run. Requires Java 1.8, Maven 3.6+, and Flink 1.9.0 on the classpath or installed locally.
🗺️Map of the codebase
- code/Flink/flink-kafka-integration/src/main/java/com/heibaiying/KafkaStreamingJob.java: Core example showing Kafka source → Flink processing → Kafka sink pattern
- code/Flink/flink-state-management/src/main/java/com/heibaiying/keyedstate/ThresholdWarningWithTTL.java: Demonstrates Flink keyed state with TTL (time-to-live), essential for production streaming apps
- code/Flink/flink-basis-scala/src/main/scala/com/heibaiying/WordCountStreaming.scala: Scala implementation of streaming word count, shows the Flink DataStream API with Scala idioms (a Java rendering is sketched after this list)
- code/Flink/flink-kafka-integration/src/main/java/com/heibaiying/sink/FlinkToMySQLSink.java: Custom sink example for writing Flink results to MySQL, critical pattern for real pipelines
- code/Hadoop/hadoop-word-count/pom.xml: Shows MapReduce project setup, baseline for comparing Flink vs Hadoop approaches
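For readers who don't read Scala, here is a minimal Java rendering of the same streaming word count, written against the Flink 1.9 DataStream API this repo targets. The socket source, host, and port are illustrative assumptions (feed it locally with nc -lk 9999):

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class WordCountStreamingSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical input: lines of text from a local socket.
        DataStream<String> text = env.socketTextStream("localhost", 9999);

        text.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                @Override
                public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                    for (String word : line.toLowerCase().split("\\W+")) {
                        if (!word.isEmpty()) {
                            out.collect(new Tuple2<>(word, 1));
                        }
                    }
                }
            })
            .keyBy(0)  // key by the word; positional keys are valid (if later deprecated) API in 1.9
            .sum(1)    // running count per word
            .print();

        env.execute("Streaming word count (Java sketch)");
    }
}
```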
🛠️How to make changes
Add new Flink examples under code/Flink/flink-basis-java/src/main/java/com/heibaiying/. State management examples go in code/Flink/flink-state-management/src/main/java/com/heibaiying/{keyedstate,operatorstate}/. Documentation in notes/ with markdown format. Update corresponding pom.xml if adding new dependencies.
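As a concrete sketch of that workflow, a hypothetical new example class could look like this; the class name and pipeline are made up, and only the package and directory follow the repo's convention:

```java
package com.heibaiying;

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Hypothetical new example; it would live under
// code/Flink/flink-basis-java/src/main/java/com/heibaiying/ as described above.
public class MyNewExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromElements(1, 2, 3)
           .map(x -> x * 2)  // trivial stand-in for whatever the example demonstrates
           .print();
        env.execute("MyNewExample");
    }
}
```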
🪤Traps & gotchas
Flink 1.9.0 is end-of-life (last security patch 2021) — the code won't build or run on modern Java 17+. Scala 2.11 is EOL. Kafka integration examples hardcode broker addresses (watch for localhost vs. network config; a sketch of externalizing that setting follows). log4j.properties files are present but no log4j2 upgrade path is documented. StreamingJob.java and bean/Employee.java suggest the examples assume a local test environment without production configs (no serialization hints for Kafka).
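To sidestep the hardcoded-broker trap, here is a minimal sketch of externalizing the broker address with Flink's ParameterTool, which is available in 1.9. The flag names, topic, and defaults are illustrative assumptions, not values from the repo:

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

// Hypothetical job: same shape as KafkaStreamingJob, but the broker address
// comes from the command line instead of a hardcoded literal.
public class ConfigurableKafkaJob {
    public static void main(String[] args) throws Exception {
        // e.g. --bootstrap.servers kafka-1:9092 --topic events
        ParameterTool params = ParameterTool.fromArgs(args);

        Properties props = new Properties();
        props.setProperty("bootstrap.servers",
                params.get("bootstrap.servers", "localhost:9092"));
        props.setProperty("group.id", params.get("group.id", "bigdata-notes-demo"));

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.addSource(new FlinkKafkaConsumer<>(
                        params.get("topic", "test-topic"), new SimpleStringSchema(), props))
           .print();
        env.execute("Configurable Kafka job");
    }
}
```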
💡Concepts to learn
- Keyed State with TTL (Time-to-Live) — Flink's keyed state needs explicit TTL configuration to prevent unbounded state growth in long-running streaming jobs; ThresholdWarningWithTTL.java demonstrates this critical production pattern (a minimal sketch follows this list)
- Operator State vs Keyed State — Flink differentiates between operator state (shared across all records) and keyed state (per-key partitioning); this repo has separate examples (operatorstate/ and keyedstate/) showing when to use each
- Custom Sink Implementation (SinkFunction) — FlinkToMySQLSink.java shows how to implement a custom sink for a database without a built-in connector; an essential pattern when Flink's default connectors don't cover your target system (a hedged JDBC sketch follows this list)
- Kafka Source-Sink Integration in Streaming — KafkaStreamingJob.java demonstrates Flink's tight Kafka integration as both source and sink, the dominant pattern in modern data pipelines for streaming data ingestion and publishing
- Batch vs Streaming Unified API — This repo shows both flink-basis-scala (WordCountBatch.scala vs WordCountStreaming.scala) where batch is a bounded stream; understanding their unified DataSet/DataStream APIs is key to Flink's design
- Word Count as Teaching Baseline — Both Hadoop (hadoop-word-count) and Flink examples use word count; this repo enables direct comparison of MapReduce vs Flink processing paradigms on the same problem
- Serialization for Distributed Data Transfer — the Employee bean class (bean/Employee.java) in the Kafka integration module must be serializable for network transfer; the absence of custom serialization hints suggests the repo relies on Java default serialization rather than Kryo tuning
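A minimal sketch of the keyed-state-with-TTL pattern, written against the Flink 1.9 API this repo targets. The state name, TTL duration, and threshold rule are illustrative, not read from ThresholdWarningWithTTL.java:

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Apply after keyBy(...) — keyed state only exists on a KeyedStream.
public class ThresholdWarningSketch extends RichFlatMapFunction<Long, String> {

    private transient ValueState<Long> lastValue;

    @Override
    public void open(Configuration parameters) {
        ValueStateDescriptor<Long> descriptor =
                new ValueStateDescriptor<>("lastValue", Long.class);
        // TTL keeps long-running jobs from accumulating state for keys
        // that are never seen again.
        StateTtlConfig ttl = StateTtlConfig.newBuilder(Time.hours(1))
                .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
                .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
                .build();
        descriptor.enableTimeToLive(ttl);
        lastValue = getRuntimeContext().getState(descriptor);
    }

    @Override
    public void flatMap(Long value, Collector<String> out) throws Exception {
        Long previous = lastValue.value();
        // Illustrative rule: warn when a value more than doubles its predecessor.
        if (previous != null && value > previous * 2) {
            out.collect("threshold warning: " + previous + " -> " + value);
        }
        lastValue.update(value);
    }
}
```

And a hedged sketch of a custom JDBC sink in the style FlinkToMySQLSink.java likely follows, using a RichSinkFunction with a prepared statement (the parameterized form the security section below recommends). The table, column, and environment-variable names are made up, and the MySQL JDBC driver is assumed to be on the classpath:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

public class MySQLSinkSketch extends RichSinkFunction<String> {

    private transient Connection connection;
    private transient PreparedStatement statement;

    @Override
    public void open(Configuration parameters) throws Exception {
        // Connection details come from the environment, not hardcoded literals.
        connection = DriverManager.getConnection(
                System.getenv("MYSQL_URL"),  // e.g. jdbc:mysql://host:3306/db
                System.getenv("MYSQL_USER"),
                System.getenv("MYSQL_PASSWORD"));
        // A prepared statement avoids string-concatenated SQL.
        statement = connection.prepareStatement(
                "INSERT INTO events (payload) VALUES (?)");
    }

    @Override
    public void invoke(String value, Context context) throws Exception {
        statement.setString(1, value);
        statement.executeUpdate();
    }

    @Override
    public void close() throws Exception {
        if (statement != null) statement.close();
        if (connection != null) connection.close();
    }
}
```

For exactly-once delivery a real pipeline would batch writes and tie them to checkpoints; this sketch keeps only the structural pattern.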
🔗Related repos
- apache/flink — official Flink repository with the latest examples; this repo uses Flink 1.9.0, which is heavily outdated
- heibaiying/Full-Stack-Notes — companion repository by the same author covering full-stack development, mentioned in the README's WeChat promo
- ClickHouse/ClickHouse — complementary OLAP database often paired with Kafka/Flink pipelines for real-time analytics
- confluentinc/kafka-streams-examples — alternative streaming framework for Kafka with more up-to-date examples; shares the same use cases as the Flink-Kafka integration here
- elastic/elasticsearch-hadoop — similar sink pattern to FlinkToMySQLSink but for Elasticsearch; a common pairing with Flink streaming
🪄PR ideas
To work on one of these in Claude Code or Cursor, paste:
Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.
Add integration tests for Flink Kafka integration module
The flink-kafka-integration module has KafkaStreamingJob.java and CustomSinkJob.java but lacks corresponding unit tests. This is critical for a Big Data educational repo since Kafka integration is commonly used in production pipelines. Adding tests will help contributors understand testing patterns for Flink streaming jobs.
- [ ] Create src/test/java/com/heibaiying directory structure in code/Flink/flink-kafka-integration/
- [ ] Add a test class for KafkaStreamingJob.java using Flink's test harnesses (e.g. OneInputStreamOperatorTestHarness from flink-streaming-java's test utilities)
- [ ] Add test class for CustomSinkJob.java and FlinkToMySQLSink.java with mock database connections
- [ ] Document test setup in a TESTING.md file under code/Flink/flink-kafka-integration/
Add comprehensive unit tests for Hadoop word count implementations
The code/Hadoop/hadoop-word-count module has multiple implementations (WordCountApp, WordCountCombinerApp, WordCountCombinerPartitionerApp) with custom components (CustomPartitioner, WordCountMapper/Reducer) but no test suite. This is fundamental for a Big Data learning resource and would demonstrate Hadoop testing best practices.
- [ ] Create src/test/java/com/heibaiying directory in code/Hadoop/hadoop-word-count/
- [ ] Add MiniDfsCluster-based integration tests for each WordCount*App class
- [ ] Add unit tests for WordCountMapper and WordCountReducer using MRUnit or Hadoop's testing utilities (a hedged MRUnit sketch follows this list)
- [ ] Add unit test for CustomPartitioner to verify key distribution logic
- [ ] Add test data files in src/test/resources/
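A hedged sketch of that mapper test with MRUnit, assuming the conventional (LongWritable, Text) → (Text, IntWritable) word-count mapper signature; check WordCountMapper's actual generics in the repo before copying:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class WordCountMapperTest {

    @Test
    public void mapperEmitsOnePerToken() throws Exception {
        // MapDriver runs a single (key, value) record through the mapper and
        // asserts the exact output records, in order.
        MapDriver<LongWritable, Text, Text, IntWritable> driver =
                MapDriver.newMapDriver(new WordCountMapper()); // repo class; signature assumed

        driver.withInput(new LongWritable(0), new Text("hello hello world"))
              .withOutput(new Text("hello"), new IntWritable(1))
              .withOutput(new Text("hello"), new IntWritable(1))
              .withOutput(new Text("world"), new IntWritable(1))
              .runTest();
    }
}
```

Note that MRUnit is retired upstream; if it proves awkward, the same shape works with plain JUnit plus a mocked Mapper.Context.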
Add CI/CD workflow with Maven build validation for all modules
The repo contains multiple Maven projects (Flink modules, Hadoop modules) but lacks a GitHub Actions workflow to validate builds. This is critical for an open-source learning resource to catch breaking changes and compilation errors early, ensuring examples remain working.
- [ ] Create .github/workflows/maven-build.yml workflow file
- [ ] Configure workflow to run 'mvn clean verify' on all pom.xml modules (code/Flink/, code/Hadoop/)
- [ ] Set the Java version to 1.8 (matching maven.compiler.source/target in pom.xml)
- [ ] Add build status badge to README.md
- [ ] Configure workflow to trigger on push to main branches and pull requests
🌿Good first issues
- Add unit tests for ThresholdWarning.java state management logic in flink-state-management—currently no test directory visible, tests would validate TTL behavior
- Update Flink version from 1.9.0 to 1.18.0 (current stable) in all pom.xml files and verify code still compiles—would improve long-term maintenance
- Document the exact Kafka and Hadoop version setup required for flink-kafka-integration examples in a QUICKSTART.md, as notes/ lacks runnable setup steps for this module
⭐Top contributors
- @heibaiying — 93 commits
- @套陆 — 2 commits
- @sunrui849 — 1 commit
- @YolandaRay — 1 commit
- @pengchen211 — 1 commit
📝Recent commits
- 3898939 — Merge pull request #76 from sunrui849/sunrui849-patch-1 (heibaiying)
- 6625699 — Merge pull request #50 from YolandaRay/patch-1 (heibaiying)
- 13fa7d5 — Update 基于Zookeeper搭建Hadoop高可用集群.md (sunrui849)
- 34ad3be — Update 基于Zookeeper搭建Kafka高可用集群.md (YolandaRay)
- 1855203 — Update Zookeeper简介及核心概念.md (heibaiying)
- 1f6a811 — Update Zookeeper简介及核心概念.md (heibaiying)
- d590f37 — Update 基于Zookeeper搭建Kafka高可用集群.md (heibaiying)
- 6795c2b — Update Scala列表和集.md (heibaiying)
- 2b5bd14 — Merge pull request #41 from pengchen211/patch-1 (heibaiying)
- 16465d7 — Fix hive installation script template name (修正hive安装脚本模板名) (pengchen211)
🔒Security observations
The codebase has the following observations:

- High · Outdated Flink Dependency — code/Flink/flink-basis-java/pom.xml and other Flink pom.xml files. The project uses Flink 1.9.0 (released in 2019), which is significantly outdated and likely contains known security vulnerabilities; this version is no longer maintained and receives no security patches. Fix: upgrade to the latest stable Apache Flink (1.17.x or later) and review the Flink security advisories at https://flink.apache.org/what-we-do/security/ for CVEs affecting 1.9.0.
- High · Potential SQL Injection in FlinkToMySQLSink — code/Flink/flink-kafka-integration/src/main/java/com/heibaiying/sink/FlinkToMySQLSink.java. The name suggests database operations; without viewing the source, the pattern indicates potential raw SQL query construction that could be vulnerable to SQL injection if input is not parameterized. Fix: use prepared statements with parameterized queries (as in the sink sketch under "Concepts to learn"), avoid string concatenation for SQL, and validate inputs.
- Medium · Outdated Java Version Target — multiple pom.xml files (maven.compiler.source and maven.compiler.target set to 1.8). The project targets Java 8, which is approaching obsolescence (Oracle's extended support for Java 8 ends in December 2030); outdated Java versions increase exposure to known vulnerabilities. Fix: upgrade to Java 11 LTS or Java 17 LTS, update the pom.xml properties, and test thoroughly on the new version.
- Medium · Missing Dependency Version Management — code/Flink/flink-basis-java/pom.xml and other pom.xml files. Without a complete <dependencyManagement> section and explicit versions for transitive dependencies, the build is open to version inconsistencies and supply-chain attacks through dependency resolution. Fix: pin explicit versions for all dependencies, run Maven's dependency-check plugin to flag vulnerable ones, and consider SBOM (Software Bill of Materials) tooling.
- Medium · Potential Hardcoded Configuration or Credentials — code/Flink/flink-kafka-integration/src/main/java/com/heibaiying/sink/FlinkToMySQLSink.java and resource configuration files. Big data projects commonly hardcode database credentials, API keys, or connection strings in configuration files or source code. Fix: externalize all credentials via environment variables, secure vaults (HashiCorp Vault, AWS Secrets Manager), or configuration management, and scan the codebase with tools like git-secrets or TruffleHog.
- Low · Incomplete Log4j Configuration — code/Flink/flink-basis-java/src/main/resources/log4j.properties and other log4j.properties files. Misconfigured logging can expose sensitive information or be insufficient for security auditing. Fix: ensure sensitive data is not logged, log levels are appropriate, log files have proper access controls, and centralized logging is in place for audit trails.
- Low · Missing Security Policy and Code Review Process — repository root (README.md and configuration files). This appears to be an educational repository with no security policy, code-review process, or vulnerability-disclosure procedure. Fix: add SECURITY.md with a disclosure policy, require code review, add security testing to the CI/CD pipeline, and update dependencies regularly.
LLM-derived; treat as a starting point, not a security audit.
👉Where to read next
- Open issues — current backlog
- Recent PRs — what's actively shipping
- Source on GitHub
Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.